Article

MKBQA: Question Answering over Knowledge Graph Based on Semantic Analysis and Priority Marking Method

School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(10), 6104; https://doi.org/10.3390/app13106104
Submission received: 14 April 2023 / Revised: 4 May 2023 / Accepted: 12 May 2023 / Published: 16 May 2023

Abstract

In knowledge graph-based question answering, the complexity of knowledge graph construction means that a domain-specific knowledge graph often lacks common-sense knowledge, which makes it impossible to answer questions that involve common-sense and domain knowledge at the same time. Therefore, this study proposes a knowledge graph-based question answering method for the computer science domain, which facilitates obtaining complete answers in this domain. To address the difficulty of matching natural language questions with structured knowledge, a series of logic rules is first designed to convert natural language into question triples. Then, a semantic query expansion strategy based on WordNet is proposed and a priority marking algorithm is proposed to mark the order of the question triples. Finally, when a question triple corresponds to multiple triples in the knowledge graph, the ambiguity is resolved by the proposed SimCSE-based similarity method. The designed logic rules handle each type of question in a targeted manner according to the different question words and can effectively transform the question text into question triples. In addition, the proposed priority marking algorithm can effectively mark the order of the question triples. MKBQA can answer not only computer science-related questions but also extended open-domain questions. In practical applications, answering a domain question often cannot rely solely on one knowledge graph; domain knowledge and common-sense knowledge must be combined. The MKBQA method provides a new idea and can be easily migrated from the field of computer science to other fields. Experimental results on real-world data sets show that, compared to baselines, our method achieves significant improvements in question answering and can combine common-sense and domain-specific knowledge graphs to give a more complete answer.

1. Introduction

Knowledge graphs (KGs) can be divided into generic knowledge graphs and domain-specific knowledge graphs. Constructing KGs from domain corpora is important for solving domain-specific problems, because domain-specific KGs contain semantically interlinked entities and relations that are directly relevant to those problems. The authors of [1] give a definition of domain knowledge graphs: the domain knowledge graph is an explicit conceptualization of a high-level subject-matter domain and its specific subdomains represented in terms of semantically interrelated entities and relations. In recent years, numerous knowledge graphs about open domains have been proposed, such as DBpedia [2], Freebase [3], and YAGO [4]. In addition, some small-scale domain-specific knowledge graphs have appeared, e.g., SCIKG [5], which contains information about computer science. The scope of a domain-specific knowledge graph is usually insufficient to answer questions involving common-sense knowledge, such as the question in Figure 1. When a user enters a question that involves both domain knowledge and common-sense knowledge, DBpedia as the only background knowledge cannot provide an answer, and SCIKG as the only background knowledge can only provide a partial answer. However, combining SCIKG and DBpedia as background knowledge yields a full answer. This means that two knowledge graphs are needed to obtain a complete answer, so it is necessary to combine the two knowledge graphs as background knowledge.
To obtain a complete answer, the current mainstream approach is to use additional knowledge. For example, video information [6], text descriptions [7], or image information [8] are usually used as additional background knowledge. Using external information in this way is an important means of producing complete answers: on the one hand, it allows questions in a specific field to be answered; on the other hand, it also covers common-sense knowledge related to that field. Therefore, in the question answering process, we try to combine domain-specific knowledge graphs and open-domain knowledge graphs to obtain complete answers to questions that require both domain-specific knowledge and open-domain knowledge as background knowledge.
Therefore, this paper follows the above approach, adds additional structured information, and uses two structured knowledge bases together as the background knowledge base to obtain complete answers.
We design a question answering method to answer cross-domain questions. To address the mismatch between natural language questions and structured knowledge, a set of logic rules is first designed to transform natural language into question triples (QTs). Then, a semantic query expansion strategy based on WordNet is proposed, and a priority marking method is designed to mark the order of the question triples according to the dependencies in the question text. Next, to solve the multi-hop problem in the knowledge graph, a similarity-based predicate matching strategy is proposed. Finally, we solve the cross-domain question answering problem through the above methods and verify the feasibility of our method through experiments.
The details can be summarized as follows:
(1)
A series of logic rules is proposed to obtain a structured representation of questions. We analyze questions using the dependency relationships of the syntactic structure among sentence elements. We use two types of logic rules to transform questions into the structured representation (QTs): one extracts effective dependency relationships and the other converts effective relationships into QTs based on different question words. Meanwhile, a proper noun correction strategy is designed to correct QTs with incorrect entity recognition.
(2)
A priority marking method is proposed. The compound nouns in the question are extracted according to the POS tag and the dependency relationship analyzed by Stanford CoreNLP. Stanford CoreNLP is used to analyze the dependency relationship between these compound nouns, which is used to determine the order of QTs.
(3)
A predicate matching strategy based on similarity is proposed. To address complex multi-hop questions, this paper uses SimCSE to calculate the similarity between the one-hop relations of the subject entity and the predicate in the question triples and selects the path by setting a similarity threshold.

2. Related Work

With the development of knowledge graphs, their applications have received extensive attention. Question answering based on knowledge graph (KBQA) uses structured data as background knowledge to obtain answers to questions. The key to the KBQA is how to provide users with complete answers.
With the aim of providing users with a complete answer, there are currently two types of methods. As for the first one, the answer is obtained by reasoning about the potential knowledge in the knowledge graph. For example, the question is parsed and converted into a query language to obtain the answer [9,10,11,12,13,14]. The knowledge in the question is extracted and mapped to the knowledge graph and candidate answers are selected and pruned to determine the final answer [15,16,17]. In addition, machine learning methods are used for reasoning. The learning ability of the machine learning model allows one to dig out connections between knowledge in a large amount of training data to obtain answers [18,19,20,21,22].
The other type is to use the method of entity alignment to expand the scope of knowledge graphs to answer questions. For example, Sun et al. [23,24] used text knowledge for the KBQA as additional background knowledge. Riquelme et al. [25] designed a model that used knowledge graphs to explain VQA to verify the answer. To improve VQA, Garcia et al. [6] put forward a video understanding model, which combined the visual information, text information of the videos, and the knowledge in the knowledge graph. Besides, in order to use a knowledge graph to solve problems in the intersection of two specific domains, some researchers proposed methods to extend a knowledge graph. After the expansion, a large-scale knowledge graph was obtained and then cross-domain questions can be answered using the knowledge graph-based question answering method, such as in the literature [26,27]. With the development of knowledge graphs and question answering methods, there have also been some methods discussing how to make the linked knowledge graph expand dynamically. For example, in the literature [28], with the dynamic increase of knowledge graph knowledge, the entire background knowledge base is dynamically updated. Although the first method can infer some potential knowledge in the knowledge graph, it still cannot answer questions that involve both domain knowledge and common-sense knowledge. Although the second method makes up for the shortcomings of the first method, it adds the steps of entity alignment and increases the difficulty of questioning and answering.
The first approach obtains latent knowledge in the knowledge graph through reasoning, while the second approach physically extends the background knowledge. This paper follows the second approach: it adds structured knowledge to expand the background knowledge base and uses two knowledge graphs to answer questions, but avoids complex operations such as entity alignment. Considering the complexity of entity alignment, a priority marking method is designed so that these two different knowledge graphs can be used together to obtain a complete answer.

2.1. Problem Description and Concept Definition

2.1.1. Concept Definition

In order to conveniently express the given rules, algorithms, etc., the following conceptual definitions are given.
Definition 1. 
Dependency_set. This is a set of dependencies obtained by parsing the given question using Stanford CoreNLP, represented by the relationship(position 1, position 2), which represents the relationship between the word at position 2 and the word at position 1. The meaning of the relationship is shown in Table 1 [29].
Definition 2. 
Compound_set. This is a set of nouns, including words with compound relations which are extracted from all the dependency relations.
Definition 3. 
Del_set. This is a set of dependencies that need to be deleted. This set stores dependencies that can be removed, which contain less information for the QA task, such as root(position 1, position 2).
Definition 4. 
Effective_set. This is a set of effective dependencies, which are used to generate QTs according to logical rules proposed by us.

2.1.2. Problem Description

This study aims to answer questions in the computer science (CS) domain by combining the knowledge in specific domains and open domains. To simplify the description, we first define the questions that are answered by the MKBQA method as complex questions. Complex questions are composed of two types of entities: domain-specific and open domain, which means that these two types of knowledge graphs are needed as background knowledge to provide a complete answer. As shown in Figure 2, after a question is given and structured, we can obtain multiple question triples. We need to sort all QTs to obtain sequential triples (the order in the figure is: T2, T1). After that, the QTs need to be mapped to the knowledge graph in turn. When T2 is looking for an answer in SCIKG, we are faced with a multi-hop problem. A multi-hop intermediate answer needs to be obtained (e.g., NL in Figure 2) and an unprocessed triple (T1) needs to be rewritten. Specifically, in Figure 2, NL is used to replace the interest in T1. Then, mapping the rewritten T1 and DBpedia can obtain the final answer.
In this process, we need to pay attention to the following issues of complex questions. 1. One question will be converted into at least two question triples, so how do we determine the order of question triples? 2. How do we answer a question involving multiple hops?

3. MKBQA Method

We propose a method to jointly answer conceptual questions from two types of knowledge bases, namely domain-specific knowledge bases and open domain knowledge bases. In order to combine the knowledge in the two knowledge graphs and use them to complement each other, the method shown in Figure 3 is designed.
Step 1: A structured representation of the question. We use Stanford CoreNLP to parse the question, obtain dependencies of the question, and filter dependencies by extracting logical rules to obtain effective dependencies. Valid dependencies are transformed into QTs of question-structured representations using transformation logic rules.
Step 2: The order of the QTs is determined using the priority marking algorithm. The question triples are sorted by the priority marking method, which orders the compound nouns based on Stanford CoreNLP dependencies and POS tags and thereby indirectly obtains ordered question triples.
Step 3: Similarity-based predicate matching and the complete answer template. The question triples are processed in turn, and the similarity between the predicate and each one-hop relation centered on the subject entity is calculated. When a relation exceeds the similarity threshold, the tail entity it points to is taken as the intermediate answer; otherwise, the entity pointed to by the most similar relation becomes the new center and the step is repeated until the intermediate answer is determined. After the intermediate answer is obtained, the remaining unprocessed triple is rewritten using the intermediate answer and the final answer is retrieved from DBpedia using a SPARQL query.
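To make the pipeline concrete, the following Python sketch strings the three steps together; all function names (parse_dependencies, build_question_triples, priority_marking, match_predicates, rewrite_triple, query_dbpedia) are illustrative placeholders for the components described in Sections 3.1–3.4, not names used by the original implementation.

def mkbqa_answer(question):
    # Step 1: parse the question and convert the effective dependencies into QTs
    deps = parse_dependencies(question)           # Stanford CoreNLP parse
    qts = build_question_triples(deps)            # logic rules (Section 3.1)
    # Step 2: order the QTs with the priority marking algorithm
    ordered_qts = priority_marking(qts, deps)     # Section 3.3
    # Step 3: find the intermediate answer, rewrite the remaining QT, query DBpedia
    middle = match_predicates(ordered_qts[0])     # SimCSE predicate matching (Section 3.4)
    rewritten = rewrite_triple(ordered_qts[1], middle)
    return query_dbpedia(rewritten)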

3.1. Representation of the Question

We obtain answers by transforming questions into QTs followed by resource matching in a multimodal knowledge graph. In order to obtain the QTs, the following three rules, Rule 1–Rule 3 are designed to obtain the effective dependencies of the question.
First of all, we use POS tagging [30] to label compound nouns and use Stanford CoreNLP [31] for syntactic analysis. We use the compound noun algorithm to identify all the compound nouns in a question: all training-set questions of QALD2–4 were selected and POS tagged, and 237 nouns were identified in QALD2, including 108 compound nouns. In total, 195 nouns were identified in QALD3, including 85 compound nouns, and 213 nouns were recognized in QALD4, including 104 compound nouns. In order to delete all invalid dependencies, Rule 1 is designed, and the dependencies with less information are stored in Del_set. For example, Root represents the root node of the dependency tree, which has no practical significance in this article and should be deleted. Secondly, in order to obtain all nouns (including compound nouns), Rule 2 is designed: the vocabulary linked by compound-noun relations is merged, and all compound nouns and individual nouns are added to Compound_set. Finally, in order to adjust the effective dependencies, Rule 3 is designed to indicate how the remaining dependencies need to be adjusted. For example, if there is compound (5, 4) and nsubj (5, 2), then position 5 in nsubj (5, 2) should refer to the compound noun composed of the fourth and fifth words. The specific rules are described below.
Rule 1. Pruning rule
∀(x, r, y) ( (x, r, y) ∈ Dependency_set ∧ ( IsEqual(r, root) ∨ IsEqual(r, case) ∨ IsEqual(r, punct) ) → (x, r, y) ∈ Del_set )
Rule 2. Rule for extracting compound nouns
∀(x, r, y) ( x ∈ Noun ∧ y ∈ Noun ∧ IsEqual(r, compound) → (x, y) ∈ Compound_set )
Rule 3. Re-adjust the constraint rules
∀(x, r, y) ∀(x*, y*) ( (x, r, y) ∈ Dependency_set ∧ (x*, y*) ∈ Compound_set ∧ ( IsEqual(x, x*) ∨ IsEqual(x, y*) ) ∧ r ∉ Del_set → ((x*, y*), r, y) ∈ Effective_set ) ∧
∀(x, r, y) ∀(x*, y*) ( (x, r, y) ∈ Dependency_set ∧ (x*, y*) ∈ Compound_set ∧ ( IsEqual(y, x*) ∨ IsEqual(y, y*) ) ∧ r ∉ Del_set → (x, r, (x*, y*)) ∈ Effective_set )
where (x, r, y) indicates that the words at positions x and y have the dependency relationship r. Noun means the POS tag is a noun attribute. IsEqual (x, y) means judging whether x and y are equivalent.
Rule 1 indicates that when r is Root, case, or punct, it needs to be added to the set Del_set. Rule 2 indicates that when x and y are nouns, and the relation r between them is a compound noun relation (compound), x and y need to be added to the set Compound_set. When x and y are nouns and the relation r between them is not a compound noun relation (compound), the nouns x and y are added to the set Compound_set, respectively. In Rule 3, (x*, y*) represents any noun element in Compound_set. When r is not an element in Del_set and x is equivalent to x* (or y*), x needs to be replaced with (x*, y*) and so does y.
In order to clarify the steps of using the above rules, the following example is given for description. The Position in Figure 4 represents the position of the word. Rule 1 filters the set of dependencies. After removing all elements in Del_set, the rest dependencies are preserved. All compound nouns are obtained by Rule 2. Finally, according to Rule 3, the compound nouns are brought into the preserved dependence relationship and an effective set of dependence relationships is obtained.
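As an informal illustration, the following Python sketch applies Rules 1–3 to dependencies given as (relation, head, dependent) tuples; it is a simplified reading of the rules (for instance, single nouns are not added to Compound_set here), and the helper structure is our own, not the authors' code.

DEL_RELATIONS = {"root", "case", "punct"}                 # Rule 1: low-information relations

def build_effective_set(dependency_set, pos_tags):
    # Rule 1: collect the dependencies to delete
    del_set = [d for d in dependency_set if d[0] in DEL_RELATIONS]
    # Rule 2: merge noun pairs linked by a compound relation into compound nouns
    compound_set = set()
    for rel, head, dep in dependency_set:
        if rel == "compound" and pos_tags.get(head) == "NOUN" and pos_tags.get(dep) == "NOUN":
            compound_set.add((dep, head))                 # e.g. ("research", "interest")
    # Rule 3: rewrite the kept dependencies so that they mention the full compound noun
    effective_set = []
    for rel, head, dep in dependency_set:
        if (rel, head, dep) in del_set or rel == "compound":
            continue
        for pair in compound_set:
            if head in pair:
                head = " ".join(pair)
            if dep in pair:
                dep = " ".join(pair)
        effective_set.append((rel, head, dep))
    return effective_set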
After obtaining the effective dependencies, they need to be converted to QTs. We give the following conversion rules.
Rule 4. Modifying with compound nouns
∀(x1, r1, y1) ∀(x2, r2, y2) ∀(x3, r3, y3) ( (x1, r1, y1) ∈ Effective_set ∧ (x2, r2, y2) ∈ Effective_set ∧ (x3, r3, y3) ∈ Effective_set ∧ IsEqual(r1, nsubj) ∧ IsEqual(r2, nmod) ∧ IsEqual(r3, nmod) ∧ IsEqual(x2, y1) ∧ IsEqual(x3, y2) ∧ IsEqual(x1, what) → ( (x1, y1, y2) ∈ T ∧ (x1, x3, y3) ∈ T ) )
Rule 5. Using object
∀(x1, r1, y1) ∀(x2, r2, y2) ∀(x3, r3, y3) ( (x1, r1, y1) ∈ Effective_set ∧ (x2, r2, y2) ∈ Effective_set ∧ (x3, r3, y3) ∈ Effective_set ∧ IsEqual(r1, nsubj) ∧ IsEqual(r2, nmod:poss) ∧ IsEqual(r3, obj) ∧ IsEqual(x1, x3) ∧ IsEqual(x2, y1) ∧ IsEqual(y3, what) → ( (y3, x1, y1) ∈ T ∧ (y3, y1, y2) ∈ T ) )
Rule 6. There is a pronoun
∀(x1, r1, y1) ∀(x2, r2, y2) ∀(x3, r3, y3) ∀(x4, r4, y4) ( (x1, r1, y1) ∈ Effective_set ∧ (x2, r2, y2) ∈ Effective_set ∧ (x3, r3, y3) ∈ Effective_set ∧ (x4, r4, y4) ∈ Effective_set ∧ IsEqual(r1, nmod:poss) ∧ IsEqual(r2, nsubj) ∧ IsEqual(r3, nsubj) ∧ IsEqual(r4, conj) ∧ IsEqual(y3, it) ∧ IsEqual(x3, y4) ∧ IsEqual(x1, y2) ∧ IsEqual(y2, x4) → ( (x2, x1, y1) ∈ T ∧ (x2, x4, y4) ∈ T ) )
In these formulas, T represents the set of QTs to be added. Rule 4 and Rule 5 are used to deal with What-type questions (without pronouns) and Rule 6 is for questions with pronouns.
Rule 4 indicates that when the effective dependencies are nsubj (x1, y1), nmod (x2, y2), and nmod (x3, y3) and x1 is ‘what’, the triples should be (x1, y1, y2) and (x1, x3, y3). For example, consider the question What is the concept of research interest of Yanghua Tang? The effective dependencies are nsubj (what, concept), nmod (concept, research interest), and nmod (research interest, Yanghua Tang). These relations satisfy the conditions that x2 and y1 are equal, y2 and x3 are equal, and x1 is what. Thus, we can obtain the triples <what, concept, research interest> and <what, research interest, Yanghua Tang>.
Rule 5 indicates that when the effective dependencies are nsubj (x1, y1), nmod:poss (x2, y2), and obj (x3, y3), and y3 is ‘what’, then the triples should be <y3, x1, y1> and <y3, y1, y2>. For example, for the question in Figure 5, the effective dependencies are nsubj (mean, research interest), nmod: poss (research interest, Ehud Reiter), and obj (mean, what). They satisfy that x1 and x3 are equal, x2 and y1 are equal, and y3 is what. Then, the triples <what, mean, research interest> and <what, research interest, Ehud Reiter> can be obtained.
Rule 6 addresses questions that contain pronouns. It applies when the effective dependencies are nmod:poss (x1, y1), nsubj (x2, y2), nsubj (x3, y3), and conj (x4, y4). When y3 is a pronoun and x1, y2, and x4 are equal, we can obtain the triples <x2, x1, y1> and <x2, x4, y4>. For example, for the question What is Schaub’s interest and what does it mean? the effective dependencies are nmod:poss (research interest, Schaub), nsubj (what, research interest), nsubj (mean, it), and conj (research interest, mean). The generated triples are <what, research interest, Schaub> and <what, research interest, mean>.
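As a concrete illustration, the following Python sketch implements the pattern behind Rule 5 over the effective dependencies (again given as (relation, head, dependent) tuples); it is a simplified rendering of the rule, not the authors' code.

def apply_rule5(effective_set):
    # Rule 5: nsubj(x1, y1), nmod:poss(x2, y2), obj(x3, y3) with x1 = x3, x2 = y1, y3 = 'what'
    triples = []
    nsubj = [(h, d) for r, h, d in effective_set if r == "nsubj"]
    nmod_poss = [(h, d) for r, h, d in effective_set if r == "nmod:poss"]
    obj = [(h, d) for r, h, d in effective_set if r == "obj"]
    for x1, y1 in nsubj:
        for x2, y2 in nmod_poss:
            for x3, y3 in obj:
                if x1 == x3 and x2 == y1 and y3.lower() == "what":
                    triples.append((y3, x1, y1))
                    triples.append((y3, y1, y2))
    return triples

deps = [("nsubj", "mean", "research interest"),
        ("nmod:poss", "research interest", "Ehud Reiter"),
        ("obj", "mean", "what")]
print(apply_rule5(deps))
# [('what', 'mean', 'research interest'), ('what', 'research interest', 'Ehud Reiter')]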

3.2. Expansions Based on WordNet

Since there is a vocabulary gap between the entities and relationships in the triple and the knowledge graph, this paper bridges the gap by expanding triples. Through experimental comparison, this paper finds that ConceptNet [32] contains more relations than WordNet, but the expansion effect is not as good as WordNet, so WordNet is used for vocabulary expansion.
We use WordNet to expand all the entities and relationships in QTs along four dimensions: synonyms, other parts of speech, upper meaning (hypernyms), and lower meaning (hyponyms). Other parts of speech refers to expanding a word according to parts of speech other than its own; for example, the verb direct can be expanded to obtain the noun director and the adverb directly. However, many of these expanded results are unnecessary, so we also provide two methods to filter the expanded words.
(1)
Sense expansion
Each word in WordNet has several senses, denoted Sense_i, and each sense is composed of multiple words. As i increases, the frequency of the words under Sense_i gradually decreases. We take the Top1 result (the most relevant word) of each Sense_i (i from 1 to n) and add it to the expanded result, as shown in Figure 5.
We use Formula (7) to filter the expanded results.
Sense_set = ⋃_{j=1}^{n} ⋃_{i=1}^{k} getSense(i, a_j),  1 ≤ k ≤ n   (7)
where a_j represents the j-th word (an entity, relationship, or concept vocabulary) that needs to be expanded, Sense_set is the set of expanded vocabulary, n represents the total number of senses, k represents the selected sense, and getSense(i, a_j) represents the result of expanding a_j under its i-th sense. This formula means that, for a selected sense, all of the entity and relation words are semantically expanded, and the expansion results of that sense and of all senses before it are added to the final expansion result. For example, if the word mean is expanded, there are 10 senses (n = 10). If Sense_5 is selected as the expansion dimension (k = 5), the final expansion result is the union of the expansion results of Sense_1–Sense_5.
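A minimal sketch of this Top1-per-sense expansion using NLTK's WordNet interface is shown below; wn.synsets returns a word's senses ordered by estimated frequency, so slicing the first k synsets plays the role of Sense_1–Sense_k, and taking the first lemma of each synset approximates getSense in Formula (7).

from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

def sense_expand(word, k):
    # collect the Top1 lemma of each of the first k senses of `word` (cf. Formula (7))
    sense_set = set()
    for synset in wn.synsets(word)[:k]:
        top1 = synset.lemmas()[0].name().replace("_", " ")
        if top1.lower() != word.lower():
            sense_set.add(top1)
    return sense_set

print(sense_expand("mean", k=5))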
(2)
Similarity-based filtering
After obtaining the expanded vocabulary, it is necessary to filter all expanded words to remove relatively irrelevant ones. SimCSE [28] uses simple dropout to learn vector representations of text, which greatly improves the uniformity of the feature-space distribution compared with directly using pre-trained BERT for representation learning. It does not need labeled data and directly treats dropout as data augmentation, pulling similar samples closer and pushing dissimilar samples farther apart. It fits the data set of our experiments. Therefore, this paper uses SimCSE to filter the expanded vocabulary with Formulas (8) and (9).
FinalWords = ⋃_{j=1}^{n} ⋃_{i=1}^{m} getSimCSE(a_j, b_i)   (8)
getSimCSE(a_j, b_i) = { ∅, if SimCSE(a_j, b_i) < K*;  b_i, if SimCSE(a_j, b_i) ≥ K* },  0 ≤ K* ≤ 1   (9)
where a_j represents the j-th word among the n words to be expanded, b_i is the i-th word expanded from a_j, and i ranges from 1 to m. SimCSE(a_j, b_i) is the similarity between the expanded word and the expansion-result word, and the threshold is K*. The value of getSimCSE(a_j, b_i) is given in Equation (9). FinalWords is the collection of expanded vocabulary. These formulas mean that an expansion result is added to the expanded vocabulary set only when its similarity to the original word reaches the threshold K*.
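The filter itself only needs a sentence-similarity function and a threshold. The sketch below uses the public princeton-nlp SimCSE checkpoint through Hugging Face transformers to score each expansion against the original word; the checkpoint name and the pooling choice are our assumptions, and the concrete thresholds are discussed in Section 4.3.

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "princeton-nlp/sup-simcse-bert-base-uncased"   # assumed public SimCSE checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def simcse_similarity(a, b):
    inputs = tokenizer([a, b], padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = model(**inputs).pooler_output                 # one embedding per input text
    return torch.cosine_similarity(emb[0], emb[1], dim=0).item()

def filter_expansions(word, candidates, k_star):
    # keep expansions whose SimCSE similarity to `word` reaches the threshold K* (Formulas (8)-(9))
    return [b for b in candidates if simcse_similarity(word, b) >= k_star]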

3.3. Triple Sorting Based on Priority Marking Algorithm

The question triples (QTs) obtained in this paper using dependencies can be classified into two types: QTs and extended QTs. Only the order of the QTs needs to be determined; by default, the order of the extended QTs is consistent with the order of the corresponding QTs. QTs also have two types: open-domain triples and specific-domain triples. To sort these two types of QTs, two problems need to be solved: how to find the relationship between the two QTs, and how to use this relationship to determine their order. The triple sorting procedure based on priority marking is described in Algorithm 1.
(1)
Common vocabulary
Since the question triples obtained by the logic rules are all generated from the same question text, there must be shared vocabulary that relates the two question triples. This common associated vocabulary (common vocabulary) determines the order of the question triples. In order to find the common vocabulary, this paper designs the method shown in Figure 6.
Step 1 Question triple elements are fetched. All subjects, predicates, and objects in the question triple are preserved without removing duplicate elements, which means there are duplicate elements inside.
Step 2 Common vocabulary is acquired. When the repeating element belongs to the entity set, the word is a common vocabulary and the program ends; otherwise, take the next repeating word and execute Step 2.
In order to have a clear understanding of the above method steps, the following examples are given. For example, the question is What is the concept of research interest of Yanghua Tang? The QTs are <what, concept, research interest> and <what, research interest, Yanghua Tang>. There are the following steps.
Step 1 The sequence of elements in the QTs are <what, concept, research interest, what, research interest, Yanghua Tang>.
Step 2 The repeated words in the question triples are <what, research interest>. Since what does not belong to the entity set, research interest is taken next and Step 2 is executed again; research interest belongs to the entity set, so research interest is the common vocabulary.
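The two steps above reduce to a short routine; the sketch below is our own simplified rendering, where entity_set holds the nouns and compound nouns recognised in the question.

def find_common_vocabulary(qts, entity_set):
    # Step 1: flatten the QTs without removing duplicates
    elements = [e for triple in qts for e in triple]
    seen, repeated = set(), []
    for e in elements:
        if e in seen and e not in repeated:
            repeated.append(e)
        seen.add(e)
    # Step 2: the first repeated word that is also an entity is the common vocabulary
    for word in repeated:
        if word in entity_set:
            return word
    return None

qts = [("what", "concept", "research interest"),
       ("what", "research interest", "Yanghua Tang")]
print(find_common_vocabulary(qts, {"concept", "research interest", "Yanghua Tang"}))
# research interest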
(2)
Sequential triples
After finding the common vocabulary, it is necessary to determine the order of these two question triples. Since QT is obtained by transforming logical rules from effective dependencies, the order can be determined by dependencies. We devised the method shown in Figure 7 to determine the final triplet order.
Step 1. Dependencies corresponding to domain entities are extracted. After removing the common vocabulary from the noun set, what remains are entities from different background knowledge bases, and the valid dependencies that these entities are located at are found. Since the number of valid dependencies extracted is greater than or equal to 2, a set of two dependencies containing both entities is selected.
Step 2. The order of QTs is determined. Common vocabulary is marked in the resulting valid dependencies. When the condition that there is a common vocabulary in both valid relations is satisfied, the two triples are sorted according to the relationship with the common vocabulary by taking advantage of the transitive property of the dependency. If the condition for the presence of a common vocabulary in both valid relations is not met, another set of dependencies containing both entities is selected, and Step 2 is performed.
In order to have a more intuitive understanding of the above method, an example as shown below is given. For example, the question is What is the concept of research interest of Yanghua Tang? The question triples are <what, concept, research interest> and <what, research interest, Yanghua Tang>. Valid dependencies are nsubj (what, concept), nmod (concept, research interest), and nmod (research interest, Tang Yanghua). The common vocabulary is research interest, and the steps are as follows.
Step 1. After removing the common vocabulary from the noun set, the entities are Yanghua Tang and concept. The valid dependencies where these two entities reside are nsubj (what, concept), nmod (concept, research interest), and nmod(research interest, Tang Yanghua). Since there are three valid dependencies, either of the two valid dependencies that contain the same entity, such as nsubj (what, concept) and nmod (research interest, Tang Yanghua) are taken.
Step 2. The two dependencies do not contain common vocabulary at the same time, so the dependencies are re-selected as nmod (concept, research interest) and nmod (research interest, Tang Yanghua). Both of these two valid dependencies contain common words, and the dependencies with the common words are research interest->concept and Tang Yanghua->research interest, respectively. According to the transitivity of the dependencies, we obtain Tang Yanghua->research interest->concept. Finally, according to the triplet in which the entity is located, it can be determined that the triplet sequence is <what, research interest, Yanghua Tang>, <what, concept, research interest>.
(3)
Algorithm description (Algorithm 1)
Algorithm 1: Triple Sorting Based on Priority Marking Algorithm.
Input: Compound_set, Triple_set, Triple, T
Output: Sequential_QTs
1  Set<List> R_dependence, Question_dependence, tag, Word, Sequential_one_QTs;
2  for each proper_noun, triple in zip(Compound_set, Triple):
3      Question_dependence.append(Get_compound(proper_noun, triple))
4  for each question_dependence in Question_dependence:
5      for each R_element in question_dependence:
6          for r_element in R_element:
7              if type(r_element) == list:
8                  for r_position in r_element:
9                      if r_position not in tag:
10                         tag.insert(0, r_position)
11     for word_position in tag:
12         Word.append(FindWord(word_position))
13     for word in Word:
14         for triple in T:
15             for triple_element in triple:
16                 if word == triple_element:
17                     Sequential_one_QTs.append(triple)
18     Sequential_QTs.append(Sequential_one_QTs)
Lines 1–3 obtain the dependencies associated with the two entity types (open domain and domain-specific). Lines 4–10 determine the order of the entity dependencies. Lines 11–12 look up the word corresponding to each position, because dependencies record the position information of words while question triples record the words themselves. Lines 13–18 record the order of the triples. The algorithm complexity is O(n³), where n is the number of words or compound nouns in the question.

3.4. Answer

To obtain a complete answer, we need to perform intermediate answer extraction on the triplet with the highest priority after obtaining the ordered question triples. After finding the intermediate answer, the remaining unprocessed question triples are rewritten to finally find the question answer.
(1)
Middle answer
To find the intermediate answer, we propose a SimCSE similarity-based predicate matching method to find all candidate relations in SCIKG and prune the candidate relations by adjusting the similarity threshold. In the matching process, the multi-hop problem in the knowledge graph needs to be solved, as shown in Figure 8.
Step 1: According to QT generation rules, the object of the question triplet is taken as the subject entity, and the predicate is taken as the predicate P that matches the relation.
Step 2: Obtain all the one-hop edges related to the subject entity. When the edge is empty, end the program, otherwise use SimCSE to calculate the similarity between all one-hop edges and the predicate.
Step 3: When there is no similarity larger than the given threshold, the tail entity of the edge with the greatest similarity with the predicate is used as a new subject entity to perform Step 2. Otherwise, all edges larger than the threshold are taken as candidate answers, and the candidate answers are pruned using Formula (10).
∀<t, r, e> ( IsEqual(type(e), type(TopicEntity)) → e ∈ Answer )   (10)
where <t, r, e> represents a hop edge, t is the head entity, r is the relationship, e is the tail entity, and Answer is the final answer set. This formula indicates that when types of the subject entity and the candidate answer are consistent, the candidate answer is taken as the final answer.
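A minimal sketch of this matching loop is shown below; one_hop_edges and simcse_similarity are assumed helpers (the former would query the domain knowledge graph for (relation, tail entity) pairs, the latter is the SimCSE score from Section 3.2), and the hop limit is our own safeguard rather than part of the original method.

def find_intermediate_answer(subject, predicate, one_hop_edges, simcse_similarity,
                             threshold=0.5, max_hops=3):
    for _ in range(max_hops):
        edges = one_hop_edges(subject)                       # Step 2: (relation, tail) pairs
        if not edges:
            return []
        scored = [(simcse_similarity(rel, predicate), rel, tail) for rel, tail in edges]
        candidates = [tail for score, rel, tail in scored if score >= threshold]
        if candidates:
            # Step 3 / Formula (10): candidates would further be pruned by entity type here
            return candidates
        # no edge reaches the threshold: hop to the tail of the most similar edge and retry
        subject = max(scored)[2]
    return []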
(2)
Complete answer
After the ordered QT and the intermediate answer are obtained, another QT needs to be rewritten, and then the knowledge graph is used to obtain the final answer. Since both question triples come from the same question text, the generated question triples can be sorted according to the noun dependency. Therefore, the two question triples can be rewritten using the common vocabulary, and the common vocabulary in remaining question triples can be replaced by the intermediate answer.
After rewriting, the query template can be used to obtain the final answer to the question. The overall process is shown in Figure 9. After determining the common vocabulary of the domain-specific triple and the open-domain triple (entity2 in the figure) and obtaining the intermediate answer, we replace the common vocabulary entity2 in the unprocessed (open-domain) triple with the intermediate answer, thereby obtaining a triple with only one unknown entity (the rewritten triple). Finally, the template shown in the figure is used to query the triple.
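For the final lookup, the rewritten triple with a single unknown entity maps directly onto a one-pattern SPARQL query. The sketch below uses SPARQLWrapper against the public DBpedia endpoint; the example URIs are illustrative, not taken from the paper.

from SPARQLWrapper import SPARQLWrapper, JSON

def query_dbpedia(subject_uri, predicate_uri):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"SELECT ?answer WHERE {{ <{subject_uri}> <{predicate_uri}> ?answer }}")
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["answer"]["value"] for row in results["results"]["bindings"]]

# e.g. look up the description of the intermediate answer's DBpedia resource
print(query_dbpedia("http://dbpedia.org/resource/Natural_language_processing",
                    "http://dbpedia.org/ontology/abstract"))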

4. Results

In order to evaluate the effect of using the priority labeling algorithm proposed in this paper for question answering, a comparative experiment is carried out on the public data set QALD3. The experimental results show that the F1 indicator has been improved. The specific experiments are as follows.

4.1. Data Sets

The data sets used in this study are SCIKG [5], DBpedia [33], QALD-3 [34], and CSQA [35]. SCIKG is a knowledge graph about the domain of computer science, which contains information such as the title of the paper, the author of the paper, the name of the expert, the position, and the research interest.

4.2. Structured Representation of Questions

CSQA is a set of questions constructed by the authors of this article. The templates are shown in Table 2, where $e represents an entity name in SCIKG. The question set consists of conceptual questions raised by ten users in the computer field based on the background knowledge bases DBpedia and SCIKG; these questions were abstracted into question templates, and finally 200 computer-field questions were generated.
In order to evaluate the given logic rules, all the data of the training set and part of the test set of QALD2-4 are selected to evaluate the structured representation of the problem. The overall evaluation effect is shown in Table 3.
In Table 3, ALL_Right indicates the total number of correct QTs, Done indicates the number of question triples given by the method in this paper, and Right indicates the correct question triple number. P = Right/Done, R = Right/ALL_Right, F1 = 2 ∗ Recall ∗ Precision/(Recall + Precision). The experimental results show that the structured representation of the rule-based question is evaluated on F1 and the result is above 89%, which is considered valid in this paper.
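As a quick sanity check of these definitions, the QALD2 column of Table 3 can be reproduced with a few lines of Python:

right, done, all_right = 85, 89, 100           # QALD2 values from Table 3
precision = right / done                        # 0.9551
recall = right / all_right                      # 0.85
f1 = 2 * precision * recall / (precision + recall)
print(round(100 * precision, 2), round(100 * recall, 2), round(100 * f1, 2))
# 95.51 85.0 89.95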
In order to evaluate the structuring rules proposed in this paper, an evaluation experiment is also carried out on the QALD3 data set, and the experimental results are shown in Table 4.
Since the evaluation metric of the baseline method is P, this paper also uses P to evaluate the logic rules. As can be seen from Table 4, the question structuring rules proposed in this paper perform better on the QALD3 data set than FactQA [36] in terms of the P evaluation index, which proves that the rules proposed in this paper are effective.

4.3. Extension Based on WordNet

(1)
Semantic expansion based on WordNet
On the CSQA question set, 50 questions are randomly selected three times, and ConceptNet and WordNet are used to expand these 50 questions, respectively. The final result is the mean of the three runs, as shown in Table 5. The Synonym relation in ConceptNet and the synonym relation in WordNet are used for synonym expansion.
In the table, P = Right/Expansion, R = Right/All, where Right is the correct number of expansions, Expansion is the number of expanded words, and All is the number of answers (the value is 50). The results show that for the data in this paper, the expansion effect of ConceptNet is not as good as that of WordNet, so this paper chooses WordNet as the expansion tool.
This paper uses the QALD3 data set for comparative experiments and uses WordNet for semantic expansion of 526 entities and relations and 56 classes in QALD3. These 526 entities and relations come from keywords in the QALD data set, and the 56 classes are counted from SPARQL files in the QALD data set. Each Sense_i's Top1 is used as the expanded vocabulary. The calculation formulas used in Figure 10 and Figure 11 are: Precision = Right/expansion_result, Recall = Right/All, F1 = 2 ∗ Precision ∗ Recall/(Precision + Recall), where Right represents the number of correctly expanded words, All represents the number of entities and relationships (526) or the number of classes (56), and expansion_result is the number of words expanded by WordNet.
Figure 10 shows the extended results for entities and relationships, where S stands for Synonyms and O stands for other-parts-of-speech. In the figure, Sense_ALL represents the sum of multiple semantics. As i increases, the value of F1 gradually decreases, so Sense1 is selected as the extension set. When using synonyms to expand entities and relationships, the F1 value is the largest, so synonyms are selected to expand entities and relationships.
In Figure 11, U stands for upper meaning, and L stands for lower meaning. As can be seen from the figure, in the case of U + L, the class extension effect is the best. Therefore, this paper uses hyponyms to extend the class. Table 6 shows the results of scaling up on the CSQA data set.
Finally, for entities and relationships, the dimension with the largest F1 value is selected for expansion, that is, synonyms. In the same way, we select upper meaning and lower meaning to expand the class.
(2)
SimCSE-based filtering strategy
The purpose of filtering the extended results is to maintain the accuracy of the results, so it needs to be filtered without reducing the accuracy. All Sense1 results are filtered using SimCSE, and the results obtained are shown in Table 7.
In the table, Threshold represents the similarity threshold, Result represents the number of expanded words, Right represents the number of correct expansions, and Total represents the total number of words to be expanded. P = Right/Result, R = Right/Total. When the threshold is 0.6, the F1 value is the largest, which shows that the filtering effect is best at this point, so 0.6 is set as the threshold for filtering the expansions of entities and relationships. Similarly, the threshold 0.7, which corresponds to the maximum F1 value in Table 8, is selected as the filtering threshold for classes.

4.4. Triple Sorting Based on Priority Marking Algorithm

The QTs generated by structured rules are sorted using a priority labeling algorithm. Three batches of data of 50, 100, and 150 questions are randomly selected from 200 questions, each batch is taken three times, and the average of the three times is used as the final result. The results obtained are shown in Table 9.
In the table, P = Right/Done, R = Right/Total, where Right is the number of questions whose triples are correctly ordered by the priority marking algorithm, Total is the number of question triples manually sorted by five users (50, 100, or 150), and Done is the number of questions processed by the priority marking algorithm. The results in the table are the mean of three experiments. As can be seen from the table, the order of most triples in CSQA can be effectively determined by the priority marking algorithm.

4.5. Relation Querying and Matching Based on SimCSE Similarity

Based on the relationship matching method of SimCSE similarity, the similarity threshold is set to filter the relationship, and the results are shown in Table 10. Because the final answer path needs to be determined, both P and R must be guaranteed, so F1 is selected as the evaluation index.
The results in Table 10 are obtained by randomly selecting 50 questions from the CSQA data set and averaging three runs. Threshold represents different thresholds, P = Right/GiveAnswer, and R = Right/AllAnswer, where Right is the number of correct answers selected by SimCSE, AllAnswer is the total number of answers for the 50 questions (larger than or equal to 50, because a question may have multiple answers), and GiveAnswer is the number of answers given using SimCSE. Finally, we select the threshold with the maximum F1, which is 0.5.

4.6. Question Answering Effect

The MKBQA method is used to answer questions on the CSQA data set, with SCIKG and DBpedia as the background knowledge bases. The component results are shown in Table 4, Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10, and the overall F1 value is 77.68%. In order to put this result in context, a comparative experiment is conducted on the public data set QALD3, with DBpedia as the background knowledge base; the experimental results are shown in Table 11.
The table shows the comparison between our method and baseline models on the QALD3 data set. Total is the number of questions, Processed is the number of questions for which the method gives an answer, and Right is the number of correctly answered questions. P = Right/Processed, R = Right/Total.
The results show that MKBQA outperforms CASIA and SWIP on F1, which demonstrates that the method proposed in this paper is effective. Through the analysis of all error cases, it is found that 30% of the question errors in QALD are caused by question understanding, as shown in Table 12.
The key in Table 12 represents the corresponding relation in the knowledge graph. Because the key cannot be obtained directly from the question, the verbs in the question cannot be expanded to an accurate correspondence. For example, answering the question How many monarchical countries are there in Europe? requires counting countries by their type of government, but the relevant information about the government type cannot be extracted from the question, so the corresponding governmentType cannot be found in the knowledge graph and answer retrieval fails. CASIA is superior to MKBQA in question processing. Another 5% of errors are caused by inaccurate structured representation of the question. MKBQA slightly outperforms CASIA in terms of accuracy, proving that MKBQA is effective.

5. Conclusions

This paper proposes a question answering method based on priority marking, which combines a domain knowledge graph and a common-sense knowledge graph as background knowledge and successfully answers cross-domain questions in the computer field. In the answering process, logic rules are designed to convert questions into question triples, separating domain knowledge from common-sense knowledge. To solve the problem that different QTs require different knowledge graphs as background knowledge bases, a priority marking algorithm is proposed. The outstanding attribute of this method is that it can utilize domain knowledge graphs and open-domain knowledge graphs together to obtain complete answers. Experimental results show that MKBQA is an effective method.
The MKBQA method provides new ideas for application fields such as knowledge graph-based intelligent question answering robots and knowledge graph-based search.
Although the question templates constructed in this paper have reference value for other fields and can even be used directly, the question set is limited, because only a small number of computer-field users provided questions. Therefore, the collected questions can only be used for testing with a small amount of data; if a large amount of data is required, more questions need to be collected. Subsequent work will therefore construct a large question set with good transferability.

Author Contributions

Conceptualization, X.W.; methodology, X.W. and Y.L.; software, Y.L.; validation, X.W., Y.L. and H.W.; investigation, Y.L.; resources, Y.L.; data curation, H.W.; writing—original draft preparation, Y.L.; writing—review and editing, H.W. and M.L.; visualization, Y.L.; supervision, X.W. and M.L.; project administration, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Hebei Natural Science Foundation (Grant number F2022208002) and the Science and Technology Project of the Hebei Education Department (Grant number ZD2021048).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data sets generated during and analyzed during the current study are available from the corresponding author upon reasonable request.

Acknowledgments

We would like to thank the anonymous reviewers for their valuable and helpful comments, which substantially improved this paper. We would also like to thank all of the editors for their professional advice and help.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Abu-Salih, B. Domain-specific knowledge graphs: A survey. J. Netw. Comput. Appl. 2021, 185, 103076. [Google Scholar] [CrossRef]
  2. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; Van Kleef, P.; Auer, S. DBpedia-A large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef]
  3. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 1247–1250. [Google Scholar]
  4. Suchanek, F.M.; Kasneci, G.; Weikum, G. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, Banff, AB, Canada, 8–12 May 2007; pp. 697–706. [Google Scholar]
  5. Wang, T.; Wang, Y.; Tan, C. Construction and application of knowledge graph system in computer science. In Proceedings of the 2018 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), Jinan, China, 14–17 December 2018; pp. 169–172. [Google Scholar]
  6. Garcia, N.; Otani, M.; Chu, C.; Nakashima, Y. KnowIT VQA: Answering knowledge-based questions about videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10826–10834. [Google Scholar]
  7. Han, J.; Cheng, B.; Wang, X. Open domain question answering based on text enhanced knowledge graph with hyperedge infusion. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Seattle, WA, USA, 16–20 November 2020; pp. 1475–1481. [Google Scholar]
  8. Yu, J.; Zhu, Z.; Wang, Y.; Zhang, W.; Hu, Y.; Tan, J. Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recognit. 2020, 108, 107563. [Google Scholar] [CrossRef]
  9. Cui, W.; Xiao, Y.; Wang, H.; Song, Y.; Hwang, S.-W.; Wang, W. KBQA: Learning question answering over QA corpora and knowledge bases. arXiv 2019, arXiv:1903.02419. [Google Scholar] [CrossRef]
  10. Bakhshi, M.; Nematbakhsh, M.; Mohsenzadeh, M.; Rahmani, A.M. Data-driven construction of SPARQL queries by approximate question graph alignment in question answering over knowledge graphs. Expert Syst. Appl. 2020, 146, 113205. [Google Scholar] [CrossRef]
  11. Shin, S.; Lee, K.-H. Processing knowledge graph-based complex questions through question decomposition and recomposition. Inf. Sci. 2020, 523, 234–244. [Google Scholar] [CrossRef]
  12. Wang, Y.; Xu, X.; Hong, Q.; Jin, J.; Wu, T. Top-k star queries on knowledge graphs through semantic-aware bounding match scores. Knowl.-Based Syst. 2021, 213, 106655. [Google Scholar] [CrossRef]
  13. Shin, S.; Jin, X.; Jung, J.; Lee, K.-H. Predicate constraints based question answering over knowledge graph. Inf. Process. Manag. 2019, 56, 445–462. [Google Scholar] [CrossRef]
  14. Zheng, W.; Cheng, H.; Yu, J.X.; Zou, L.; Zhao, K. Interactive natural language question answering over knowledge graphs. Inf. Sci. 2019, 481, 141–159. [Google Scholar] [CrossRef]
  15. Shen, C.; Huang, T.; Liang, X.; Li, F.; Fu, K. Chinese knowledge base question answering by attention-based multi-granularity model. Information 2018, 9, 98. [Google Scholar] [CrossRef]
  16. Zhang, H.; Xu, G.; Liang, X.; Zhang, W.; Sun, X.; Huang, T. Multi-view multitask learning for knowledge base relation detection. Knowl.-Based Syst. 2019, 183, 104870. [Google Scholar] [CrossRef]
  17. Ghosh, S.; Razniewski, S.; Weikum, G. Uncovering hidden semantics of set information in knowledge bases. J. Web Semant. 2020, 64, 100588. [Google Scholar] [CrossRef]
  18. Zhang, L.; Lin, C.; Zhou, D.; He, Y.; Zhang, M. A bayesian end-to-end model with estimated uncertainties for simple question answering over knowledge bases. Comput. Speech Lang. 2021, 66, 101167. [Google Scholar] [CrossRef]
  19. Hao, Z.; Wu, B.; Wen, W.; Cai, R. A subgraph-representation-based method for answering complex questions over knowledge bases. Neural Netw. 2019, 119, 57–65. [Google Scholar] [CrossRef] [PubMed]
  20. Saxena, A.; Tripathi, A.; Talukdar, P. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020; pp. 4498–4507. [Google Scholar]
  21. Wang, X.; Zhao, S.; Han, J.; Cheng, B.; Yang, H.; Ao, J.; Li, Z. Modelling long-distance node relations for KBQA with global dynamic graph. In Proceedings of the Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 2572–2582. [Google Scholar]
  22. Liu, A.; Huang, Z.; Lu, H.; Wang, X.; Yuan, C. BB-KBQA: BERT-based knowledge base question answering. In Proceedings of the China National Conference on Chinese Computational Linguistics, Kunming, China, 18–20 October 2019; pp. 81–92. [Google Scholar]
  23. Sun, H.; Bedrax-Weiss, T.; Cohen, W.W. Pullnet: Open domain question answering with iterative retrieval on knowledge bases and text. arXiv 2019, arXiv:1904.09537. [Google Scholar]
  24. Sun, H.; Dhingra, B.; Zaheer, M.; Mazaitis, K.; Salakhutdinov, R.; Cohen, W.W. Open domain question answering using early fusion of knowledge bases and text. arXiv 2018, arXiv:1809.00782. [Google Scholar]
  25. Riquelme, F.; De Goyeneche, A.; Zhang, Y.; Niebles, J.C.; Soto, A. Explaining VQA predictions using visual grounding and a knowledge base. Image Vis. Comput. 2020, 101, 103968. [Google Scholar] [CrossRef]
  26. Mosbach, S.; Menon, A.; Farazi, F.; Krdzavac, N.; Zhou, X.; Akroyd, J.; Kraft, M. Multiscale cross-domain thermochemical knowledge-graph. J. Chem. Inf. Model. 2020, 60, 6155–6166. [Google Scholar] [CrossRef]
  27. Eibeck, A.; Lim, M.Q.; Kraft, M. J-Park Simulator: An ontology-based platform for cross-domain scenarios in process industry. Comput. Chem. Eng. 2019, 131, 106586. [Google Scholar] [CrossRef]
  28. Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821. [Google Scholar]
  29. De Marneffe, M.-C.; Manning, C.D. Stanford Typed Dependencies Manual; Technical Report; Stanford University: Stanford, CA, USA, 2008. [Google Scholar]
  30. Kumawat, D.; Jain, V. POS tagging approaches: A comparison. Int. J. Comput. Appl. 2015, 118, 32–38. [Google Scholar] [CrossRef]
  31. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.; Mcclosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 22–27 June 2014; pp. 55–60. [Google Scholar]
  32. Speer, R.; Chin, J.; Havasi, C. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-first AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  33. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. Dbpedia: A nucleus for a web of open data. In Proceedings of the Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Republic of Korea, 11–15 November 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 722–735. [Google Scholar]
  34. He, S.; Liu, S.; Chen, Y.; Zhou, G.; Liu, K.; Zhao, J. CASIA@ QALD-3: A Question Answering System over Linked Data. In Proceedings of the Working Notes for CLEF 2013 Conference, Valencia, Spain, 23–26 September 2013. [Google Scholar]
  35. Saha, A.; Pahuja, V.; Khapra, M.; Sankaranarayanan, K.; Chandar, S. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  36. Zhang, X.; Meng, M.; Sun, X.; Bai, Y. FactQA: Question answering over domain knowledge graph based on two-level query expansion. Data Technol. Appl. 2020, 54, 34–63. [Google Scholar] [CrossRef]
  37. Cimiano, P.; Lopez, V.; Unger, C.; Cabrio, E.; Ngonga Ngomo, A.C.; Walter, S. Multilingual Question Answering over Linked Data (QALD-3): Lab Overview. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, 4th International Conference of the CLEF Initiative, CLEF 2013, Valencia, Spain, 23–26 September 2013; pp. 321–332. [Google Scholar]
Figure 1. Comparison of results based on different background knowledge.
Figure 2. Description of complex questions.
Figure 3. Method overview.
Figure 4. Examples of steps to use Rules 1–3.
Figure 5. Example based on WordNet extension.
Figure 6. Steps of the method for finding common vocabulary.
Figure 7. Determining the QTs’ order.
Figure 8. Predicate matching based on SimCSE similarity.
Figure 9. DBpedia mapping triples example.
Figure 10. Result of entity and relationship expansion.
Figure 11. Class expansion result.
Table 1. Description of partial dependencies.
Relationship | Description
Root | Root of the dependency tree structure
advmov | Adverb_Noun relation/Adverb_Adjective relation
nsubj/nsubjpass | Subject_predicate relationship
aux/auxpass | Auxiliary verb_verb relationship
nmod/nmod:poss | Compound noun modification
amod | Adjective modification
advmod | Adverb modification
case | Case
dobj | Direct object
compound | Compound noun relation
pass | Passive modifier
obj | Object
conj | Conjunction relation
punct | Punctuation
Table 2. CSQA question set.
Relationship | Question Template
Complex-question | What is the concept of research interest of $e?
Complex-question | What does $e’s research interest mean?
Complex-question | What is the mean of $e’s research interest?
Complex-question | What is $e’s interest and what does it mean?
Complex-question | What are $e’s interests and what are their definitions?
Table 3. Question structured results.
 | QALD2 | QALD3 | QALD4
ALL_Right (piece) | 100 | 124 | 113
Done (piece) | 89 | 119 | 108
Right (piece) | 85 | 110 | 102
Precision (%) | 95.51 | 92.44 | 94.44
Recall (%) | 85 | 88.71 | 90.27
F1 (%) | 89.95 | 90.54 | 92.31
Table 4. Comparison of the results of rule-based structured representation.
Model | Data Set | Precision (%)
Ours | QALD3 | 66
FactQA [36] | QALD3 | 62.24
Ours | CSQA | 84
Table 5. Comparison of ConceptNet and WordNet extended results.
Data Set | Model | Right | Expansion | P | R | F1
CSQA | WordNet | 24 | 26 | 92.31 | 48.00 | 63.16
CSQA | ConceptNet | 18 | 25 | 72.00 | 36.00 | 48.00
Table 6. Extended results of the CSQA data set.
Extended Object | Dimension | P | R | F1
Entity + Relation | S | 93.94 | 46.5 | 62.21
Entity + Relation | O | 56.85 | 41.5 | 47.98
Entity + Relation | S + O | 51.63 | 55.5 | 53.50
Class | U | 56.85 | 41.5 | 47.98
Class | L | 93.94 | 46.5 | 62.21
Class | U + L | 90.20 | 76.67 | 82.89
Table 7. Threshold for entities and relationships (the max in the table means selecting the most similar vocabulary).
Total | Threshold | Result (piece) | Right (piece) | P (%) | R (%) | F1 (%)
526 | 0.9 | 423 | 232 | 54.85 | 44.11 | 48.90
526 | 0.8 | 432 | 244 | 56.48 | 46.39 | 50.94
526 | 0.7 | 445 | 248 | 55.73 | 47.15 | 51.08
526 | 0.6 | 452 | 253 | 55.97 | 48.10 | 51.74
526 | 0.5 | 517 | 253 | 48.94 | 48.10 | 48.52
526 | 0.4 | 535 | 254 | 47.48 | 48.29 | 47.88
526 | 0.3 | 541 | 255 | 47.13 | 48.48 | 47.80
526 | 0.2 | 563 | 259 | 46.00 | 49.24 | 47.56
526 | 0.1 | 578 | 260 | 44.98 | 49.43 | 47.10
Table 8. Threshold of concept vocabulary.
Total | Threshold | Result (piece) | Right (piece) | P (%) | R (%) | F1 (%)
56 | 0.9 | 50 | 31 | 62.00 | 55.36 | 58.49
56 | 0.8 | 63 | 37 | 58.73 | 66.07 | 62.18
56 | 0.7 | 88 | 45 | 51.14 | 80.36 | 62.50
56 | 0.6 | 109 | 49 | 46.23 | 87.50 | 60.50
56 | 0.5 | 234 | 50 | 21.37 | 89.29 | 34.49
56 | 0.4 | 268 | 52 | 19.40 | 92.86 | 32.09
56 | 0.3 | 321 | 52 | 16.20 | 92.86 | 27.59
56 | 0.2 | 356 | 53 | 14.89 | 94.64 | 25.73
56 | 0.1 | 378 | 53 | 14.02 | 94.64 | 24.42
Table 9. QTs sorting results.
Data Set | Method | P | R | F1
CSQA | Randomly select 50 data (3 times) | 80.00% | 85.00% | 82.42%
CSQA | Randomly select 100 data (3 times) | 79.00% | 78.67% | 78.83%
CSQA | Randomly select 150 data (3 times) | 86.67% | 88.00% | 87.33%
Table 10. Different threshold results (mean result).
Threshold | Right | AllAnswer | GiveAnswer | P (%) | R (%) | F1 (%)
0.1 | 39 | 50 | 102 | 38.24 | 78.00 | 51.32
0.2 | 39 | 50 | 97 | 40.21 | 78.00 | 53.06
0.3 | 32 | 50 | 85 | 37.65 | 64.00 | 47.41
0.4 | 31 | 50 | 81 | 38.27 | 62.00 | 47.33
0.5 | 30 | 50 | 56 | 53.57 | 60.00 | 56.60
0.6 | 23 | 50 | 45 | 51.11 | 46.00 | 48.42
0.7 | 13 | 50 | 30 | 43.33 | 26.00 | 32.50
0.8 | 3 | 50 | 29 | 10.00 | 6.00 | 7.50
0.9 | 0 | / | / | 0 | 0 | 0
Table 11. Comparison results.
Model | Data Set | Total | Processed | Right | Precision | Recall
MKBQA | CSQA | 200 | 200 | 127 | 63.5% | 100%
MKBQA | QALD3 | 99 | 40 | 26 | 26.26% | 65.00%
Intui2 [37] | QALD3 | 99 | 99 | 28 | 28.28% | 28.28%
SWIP [37] | QALD3 | 99 | 21 | 15 | 71.43% | 15.15%
Table 12. Question understanding error.
Question | Key
Is Frank Herbert still alive? | deathDate
Give me the birthdays of all actors of the television show Charmed. | starring
How many monarchical countries are there in Europe? | governmentType

