1 Introduction
-
We introduce Bio-SODA—a novel natural language processing engine over knowledge graphs that does not require prior training data (question-answer pairs) for translating natural language questions into SPARQL.
-
We define a novel ranking algorithm for selecting the best automatically generated SPARQL statements in response to a given natural language question. The ranking algorithm combines syntactic and semantic similarity, as well as node centrality in the knowledge graph. Many existing question answering systems either rely on simple metrics for ranking, such as the length of the answer query graph [6], or require extensive training data in order to learn a ranking function [21]. To the best of our knowledge, our approach is the first to take into account all three factors (syntactic and semantic similarity, as well as node centrality) for ranking queries.
-
Our experiments on various real-world datasets show that Bio-SODA outperforms state-of-the-art KGQA systems by 20% on the F1-score on the official QALD4 biomedical benchmark, and by an even larger margin on the more complex bioinformatics dataset.
-
Finally, in addition to the work presented in the conference version of this paper (see [20]), here we introduce Bio-SODA UX, a prototype graphical user interface enabling users to interact with knowledge graph data and to dynamically disambiguate natural language questions over the data. We demonstrate through selected use cases how the interface can assist users in exploring the available data and in finding the information of interest in the underlying knowledge graph.
2 Related work
3 Challenges of natural language processing over knowledge graphs
-
Lack of training data. For many scientific knowledge graphs, there is no sufficiently long and diverse log of questions and their corresponding queries from which to derive a representative training set for a machine learning-based solution. So far, existing training corpora have proven costly to construct [14], with the added drawback that any semi-automatically generated dataset risks compiling a set of question-answer pairs that are not representative of the information needs of real users of the KGQA system, e.g. domain experts.
-
Rule-based approaches perform well, but are costly to build and maintain. So far, state-of-the-art solutions for question answering over generic RDF-based knowledge graphs have mostly been rule-based systems relying on manually handcrafted rules. For example, GFMed [24] and Pomelo [25], the two top-ranked systems in the QALD4 biomedical challenge, achieved very good results, but at the cost of very little generality. In essence, these systems suffer from significant overfitting: to be applicable to a new domain, their rule sets would need extensive or even complete rewriting. Moreover, even for a new dataset within the same domain whose schema differs, new rules need to be added to accommodate the differences. In some cases it is beneficial to incorporate a minimal set of rules in KGQA systems, particularly for deriving complex concepts. However, this should be a last resort and not the main translation mechanism, given that a large rule set is hard to maintain and scale.
-
Schema-less, incomplete data. One of the strengths of relational databases is the database schema, which enables strict data modelling and guarantees certain data integrity and data quality aspects. However, since RDF does not strictly enforce a (database) schema, real-world RDF knowledge graphs often exhibit poor structure [35, 36]. Typical examples are properties with missing or generic domains and ranges. In other words, a question answering system over RDF knowledge graphs typically does not have complete schema information. Hence, an important step when working with such incomplete knowledge graphs is to enrich the existing (incomplete) schema, for example by inferring property ranges and domains from instance-level data.
-
Disambiguation. In many cases, different users have different expectations (query intents) when asking the same question. An example is the question What are all the Big Data projects?, asked over the European Projects dataset. Possible interpretations of this request are to retrieve all projects in a Big Data call, all projects by institutions that have the term Big Data in their name, or all projects whose title or abstract includes the terms Big Data, etc. The system should ultimately let users decide which interpretation was intended, while informing them of the range of possible options according to the available underlying data.
4 Bio-SODA: a high-level perspective
5 Bio-SODA: system architecture
-
Preprocessing Phase: This phase includes building indexes for efficient lookup as well as automatically generating a schema graph, which will serve as the basis for constructing candidate SPARQL queries in response to user questions. This phase is only executed once, when initialising the system.
-
SPARQL Query Generation Phase: This phase represents the natural language query translation process and includes (1) looking up query tokens in the database, (2) ranking the candidate tokens, (3) constructing the candidate query graphs, (4) ranking the query graphs in order of relevance to the user question, and finally (5) constructing a valid SPARQL query and presenting the results.
-
User Dialog and Disambiguation: More and more RDF datasets are available in diverse scientific fields, yet for practitioners they are often difficult to explore. Indeed, the increasing size and complexity of the data necessitate not only faster indexes but also smarter user interfaces providing dynamic querying and filtering possibilities. Our experience in prototyping Bio-SODA showed that, in order to enable data exploration, the system must also guide the user in exploring the data models, assist in disambiguating questions, and dynamically choose the most relevant answers for specific use cases. To this purpose we designed the Bio-SODA UX interface, which can be used online to explore knowledge graphs and to assist users without technical knowledge of the underlying data models or query languages in disambiguating questions over the data stored in the knowledge graphs.
5.1 Preprocessing phase
-
Inverted Index: This index stores the vocabulary of the system. More precisely, all the properties that should be searchable from the RDF data store are indexed, according to a configuration file that specifies the list of properties of interest (by default, all string literals will be indexed). A further configuration option is whether URI fragments should also be parsed and indexed. In this case, the fragments are split according to a predefined punctuation list and a camel-case regex (e.g., “possibleDiseaseTarget” will be indexed as the corresponding keywords “possible disease target”). The inverted index is stored in a relational database for fast searches and is used to match tokens (sequences of keywords in a user query) against the RDF data. More precisely, the index stores: keywords (N-grams of the literals indexed), the indexed instance URI, the class of this instance, the property from which the keywords were indexed (e.g. label), as well as the PageRank score of the instance (see Table 1). PageRank scores are computed using the approach presented in [32]. We note that the size of the inverted index depends on a few characteristics of the knowledge graph, including the verbosity of literals (i.e., attributes that are strings) and the total number of attributes to be indexed by Bio-SODA (an explicit list of these attributes can be provided to the system). For example, very verbose fields should not be indexed in their entirety, but perhaps in a summary form. This variability is also reflected in the size of the inverted index relative to the size of the original dataset. Table 2 provides an overview across the three datasets considered in this study. For example, the QALD4 and Bioinformatics datasets also contain numerical data, which is not indexed, leading to a smaller index size than in the case of CORDIS.
Generally, in terms of time, building the inverted index can take a few hours for large datasets, although this highly depends on the performance and availability of the SPARQL endpoint through which the dataset is accessible, given that the index is built by querying this endpoint. We have not focused on optimizing the inverted index construction and leave this as future work.
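The URI-fragment splitting step described above can be sketched as follows (the function name and the punctuation list are illustrative, not Bio-SODA's actual implementation):

```python
import re

# Split a URI fragment on punctuation and camel-case boundaries,
# producing the lower-cased keywords that get indexed.
PUNCTUATION = ["_", "-", ".", "#", "/"]

def fragment_to_keywords(fragment):
    for p in PUNCTUATION:
        fragment = fragment.replace(p, " ")
    # insert a space at each lower-to-upper camel-case boundary
    fragment = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", fragment)
    return fragment.lower()

print(fragment_to_keywords("possibleDiseaseTarget"))  # possible disease target
print(fragment_to_keywords("side_effects"))           # side effects
```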
-
Schema Graph Extractor: This module is used to enrich the (incomplete) schema of the knowledge graph(s) using instance-level data from the RDF store. The Schema Graph is essentially the accurate schema of the integrated RDF data, which Bio-SODA automatically extracts from data instances. Moreover, the Schema Graph serves as the basis for constructing candidate query graphs from selected entry points (i.e., matches for tokens in a user question). Computing a Schema Graph allows the system to compensate for incomplete schema information, for example in cases where domains and ranges for properties are either missing or ill-defined. A second benefit of the Schema Graph is that it enables integrating multiple data models from different knowledge graphs. More precisely, since the search algorithm works at the level of the Schema Graph, it is agnostic to the actual physical representation of the data, meaning it can easily be extended to support multiple, complementary knowledge graphs in the future. The minimal requirement for achieving this is that the KGs overlap, i.e., they have classes in common, such that they can be joined in an integrated Schema Graph. Extracting the Schema Graph is achieved via SPARQL queries that compute, for example, the domains and ranges of all properties, based on the classes of the instances they connect. As a simplified example, a triple asserting “Migraine \(\rightarrow\) possibleDrug \(\rightarrow\) Ibuprofen” will result in Disease \(\rightarrow\) possibleDrug \(\rightarrow\) Drug being added to the Schema Graph. Currently, as a minimum requirement, we assume that each instance in the RDF data has a well-defined class, i.e. an explicit rdf:type. If this is not the case, additional preprocessing with external tools (for example, using RDF schema discovery techniques [36]) would be required in order to properly define types for all RDF instances.
Table 1 Example entries in the inverted index

| Lookup Key | URI | Class | Property | PageRank |
|---|---|---|---|---|
| Stroke | side_effects:C0038454 | sider:side_effects | sider:sideEffectName | 0.34 |
| Drug | drugbank:drugs | owl:Class | rdfs:label | 91 |
| Drug | sider:drugs | owl:Class | rdfs:label | 2.3 |
| Possible disease target | diseasome:possibleDiseaseTarget | rdf:Property | uri_match | 80 |
Table 2 Dataset size vs. inverted index size

| Dataset | Sources | Dataset Size | Index Size |
|---|---|---|---|
| QALD4-biomedical | Drugbank, Diseasome, Sider | 200 MB | 150 MB |
| Bioinformatics | Bgee, OMA | 30 GB | 8.5 GB |
| CORDIS | EU projects | 1 GB | 1 GB |
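The domain and range inference performed by the Schema Graph Extractor can be sketched in a few lines, assuming instance triples and rdf:type assertions are already available as plain tuples (in Bio-SODA this is done via SPARQL queries against the endpoint; the function name is illustrative):

```python
# Infer a property's domain and range from the rdf:type of the
# instances it connects, adding one schema-level triple per pattern.

def extract_schema_graph(instance_triples, types):
    """instance_triples: (subject, property, object) tuples;
    types: mapping from instance to its rdf:type class."""
    schema = set()
    for s, p, o in instance_triples:
        if s in types and o in types:
            schema.add((types[s], p, types[o]))
    return schema

# The simplified example from the text:
triples = [("Migraine", "possibleDrug", "Ibuprofen")]
types = {"Migraine": "Disease", "Ibuprofen": "Drug"}
print(extract_schema_graph(triples, types))
# {('Disease', 'possibleDrug', 'Drug')}
```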
5.2 SPARQL query generation phase
-
Lookup Module: The lookup module retrieves the best candidate matches for tokens identified in a user query. A token is defined as the longest sequence of keywords that matches an entry in the Inverted Index (implemented in a relational database for fast searches). For example, in the question “What are the possible disease targets of Ibuprofen?”, the two tokens extracted will be “possible disease target” (corresponding to an RDF property name) and “Ibuprofen” (corresponding to one or more Drug instances).
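A minimal sketch of this longest-match token extraction, assuming the inverted index can be probed for exact N-gram matches (stemming is omitted here, so the toy index holds the plural form):

```python
# Greedy longest-match tokenization: scan the question and, at each
# position, take the longest keyword sequence found in the index.
# (The index is a plain set here; Bio-SODA uses a relational database.)

def extract_tokens(question, index):
    words = [w.strip("?,.").lower() for w in question.split()]
    tokens, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # longest n-gram first
            candidate = " ".join(words[i:j])
            if candidate in index:
                tokens.append(candidate)
                i = j
                break
        else:
            i += 1  # no match starting at this word; skip it
    return tokens

index = {"possible disease targets", "disease", "ibuprofen"}
print(extract_tokens("What are the possible disease targets of Ibuprofen?", index))
# ['possible disease targets', 'ibuprofen']
```

Note that the single-word entry "disease" is shadowed by the longer 3-gram, which is exactly the longest-match behaviour described above.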
-
Candidate Ranking Module: The lookup module can return a large number of candidate matches per token. In order to find the best candidate matches, the ranking module groups equivalent matches together and ranks them in order of relevance to the initial query. For example, instances of the class Drug with a matching rdfs:label are grouped together. In our running example illustrated in Fig. 3, the genes BRCA1 and BRCA2 are both matches for the keyword BRCA. Furthermore, both string similarity and node importance are taken into account when ranking. Including the PageRank score as a measure of importance in the knowledge graph reduces the influence of the quality of the assigned labels (labels which can be imprecise, see the discussion in Sect. 3). The intuition is that domain knowledge graphs usually cluster around a few important concepts, which is reflected in the PageRank scores of the corresponding nodes. For example, UniProt [37], a protein knowledge base containing more than 60 billion triples, currently includes only 177 classes. Of these, only a few, such as Protein and Annotation, have a central role and will usually be the target of domain expert questions. Likewise, in the case of the CORDIS EU projects dataset (see Sect. 7 for details), two different classes of projects are available, EC-Project and ERC-Project; however, there is significantly more information in the dataset for the first class. In the absence of query logs or handcrafted rules for mapping query tokens to the correct candidates, the PageRank score can serve as a good proxy for ranking candidates according to node centrality, similarly to the initial approach used by web search engines [38]. As an added benefit, scoring with PageRank also ensures that metadata matches are prioritized. For example, Drug as a class name will rank higher than an instance match. Finally, to ensure that candidate matches not only have good string similarity but are also semantically similar, word embeddings are also used in candidate ranking. The similarity comparison ensures that spurious matches, such as gene compared to oogenesis, are discarded based on a predefined similarity threshold in the system configuration. Any word embeddings can in principle be used with Bio-SODA. For the two main bioinformatics use cases considered in this paper, we use word vectors extracted from PubMed, as described in [39]. The candidate ranking module presents the top N matches per query token to the user, where N is configurable in the system. We note that it is important to limit the number of matches per token for performance reasons: the total number of candidate queries generated for a question with T query tokens (i.e. concepts searched by the user) is in the worst case \(N^{T}\) (up to N matches for each of the T tokens).
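The combination of string similarity, PageRank, and an embedding-similarity threshold could look roughly as follows; the additive weighting, the toy vectors, and the SequenceMatcher stand-in for string similarity are assumptions, not the system's exact formula:

```python
from difflib import SequenceMatcher

# Score candidates by string similarity plus PageRank, after discarding
# candidates whose embedding similarity to the token is below a threshold.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def rank_candidates(token, candidates, embeddings, threshold=0.3, top_n=5):
    scored = []
    for label, pagerank in candidates:
        if cosine(embeddings[token], embeddings[label]) < threshold:
            continue  # semantically spurious match (e.g. gene vs. oogenesis)
        string_sim = SequenceMatcher(None, token, label).ratio()
        scored.append((string_sim + pagerank, label))
    return [label for _, label in sorted(scored, reverse=True)][:top_n]

# Toy 2-d "embeddings": oogenesis is far from gene, so it is filtered out
# despite its higher PageRank score.
embeddings = {"gene": [1.0, 0.1], "genes": [0.9, 0.2], "oogenesis": [0.0, 1.0]}
cands = [("genes", 0.5), ("oogenesis", 0.9)]
print(rank_candidates("gene", cands, embeddings))  # ['genes']
```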
-
Query Graph Construction Module: The goal of this module is to use the matches from the previous step to generate a list of candidate query graphs. We extend the approach presented in [40] to translate matches into query graph patterns. More precisely, we apply the iterative algorithm shown in Algorithm 1: for each set of candidate matches (one match per query token), we augment the Schema Graph by attaching the candidate matches to their corresponding class. Next, we find the minimal subgraph that covers all matches. For this purpose, we solve the approximate Steiner tree problem by computing the minimal spanning tree that covers one match per token. Note that there might be multiple such subgraphs, given that two classes can be connected via multiple properties. However, unless the user can be involved in disambiguating, it is important to generate all the variants, given that two equal-length subgraphs might actually have opposite semantics. Recall the example shown in Fig. 2, where the properties isAbsentIn and isExpressedIn both connect the same two classes, but represent disjoint result sets. Finally, in some cases handcrafted rules for inferring new concepts or relationships are required, due to the complexity of the corresponding query graphs. In such cases, translating user questions into SPARQL cannot be done via simple entity linking methods. Therefore, if needed, our approach also supports adding rules that derive implicit information from the original knowledge graph as part of the question answering pipeline. These rules are implemented as sub-queries similar to the SELECT SPARQL query form: the rule head is the SPARQL query projection, and the rule body is the WHERE clause content.
-
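A simplified sketch of the subgraph construction: instead of the minimal-spanning-tree Steiner approximation of Algorithm 1, this toy version takes the union of shortest paths from the first match to each other match, which yields a small connecting subgraph of the schema graph:

```python
from collections import deque

# BFS shortest path over an (unweighted) schema graph given as an
# adjacency list; returns the node sequence from src to dst.
def shortest_path(adj, src, dst):
    prev, queue, seen = {src: None}, deque([src]), {src}
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                prev[nxt] = node
                queue.append(nxt)
    return None

# Connect all matched classes: union of shortest paths from the first
# match to every other match (an approximation of the Steiner tree).
def query_graph(adj, matches):
    edges = set()
    for m in matches[1:]:
        path = shortest_path(adj, matches[0], m)
        if path:
            edges |= {tuple(sorted(p)) for p in zip(path, path[1:])}
    return edges

# Toy schema graph: Drug -- Disease -- Gene
adj = {"Drug": ["Disease"], "Disease": ["Drug", "Gene"], "Gene": ["Disease"]}
print(query_graph(adj, ["Drug", "Gene"]))
# {('Disease', 'Drug'), ('Disease', 'Gene')}
```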
Query Graph Ranking Module: The query graph ranking module plays an important role in presenting the user with a meaningful, ordered list of results. In contrast to existing work, we do not return the overall minimal subgraph as the top result, but rather the graph that maximizes the sum of the match scores of the candidates covered. To understand why, consider the following question: “What are the drugs for asthma?”. This question translates to a 2-hop query graph joining Drug and Disease via the possibleDiseaseTarget path (see Fig. 2). However, one likely scenario is that the description of a Drug instance includes the keyword asthma. In this case, the minimal query graph would be 1-hop only, retrieving only the Drug instances that explicitly contain the keyword in their description, probably a small subset of all the instances that have the corresponding Disease as a possible target. The minimal result would thus have good precision, but very low recall.
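The ranking criterion described above, preferring the query graph with the highest total match score over the smallest one, can be sketched as follows (the graph representation and names are illustrative):

```python
# Rank candidate query graphs by the sum of the match scores of the
# candidates they cover (descending), breaking ties by smaller size.

def rank_query_graphs(graphs):
    return sorted(graphs, key=lambda g: (-sum(g["scores"]), g["size"]))

# The asthma example: the 1-hop graph covers one match, the 2-hop graph
# covers both "drug" and "asthma" matches and therefore ranks first.
graphs = [
    {"name": "drug-description-contains-asthma", "scores": [0.9], "size": 1},
    {"name": "drug-possibleDiseaseTarget-disease", "scores": [0.9, 0.8], "size": 2},
]
print(rank_query_graphs(graphs)[0]["name"])  # drug-possibleDiseaseTarget-disease
```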
-
Query Executor Module: Finally, the query executor translates the ranked query graphs into SPARQL queries, assigning meaningful variable names and adding human-readable fields to the result set wherever possible. Importantly, we do not return only the best result, but rather a ranked list of possible interpretations (top N, where N is configurable in the system). This gives the user the opportunity to inspect the results and choose only the interpretation (i.e. SPARQL query) that matches the question intent.
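A minimal sketch of this final translation step, emitting a SPARQL string with readable variable names from a set of triple patterns (prefixes, property names, and the FILTER construction are illustrative):

```python
# Build a SPARQL SELECT query from schema-level triple patterns,
# projecting every variable and appending regex FILTERs for keywords.

def to_sparql(triples, filters=()):
    vars_used = sorted({t[0] for t in triples} |
                       {t[2] for t in triples if t[2].startswith("?")})
    lines = [f"{s} {p} {o} ." for s, p, o in triples]
    lines += [f'FILTER(regex({v}, "{kw}", "i"))' for v, kw in filters]
    body = "\n  ".join(lines)
    return f"SELECT {' '.join(vars_used)} WHERE {{\n  {body}\n}}"

# The "drugs for asthma" 2-hop query graph from the running example:
triples = [("?drug", "diseasome:possibleDiseaseTarget", "?disease"),
           ("?disease", "rdfs:label", "?diseaseLabel")]
query = to_sparql(triples, filters=[("?diseaseLabel", "asthma")])
print(query)
```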
6 Bio-SODA UX: user dialog interface
6.1 Design
6.2 Implementation
-
The back-end first returns ranked candidate matches for each concept in the original question. For each candidate match, the back-end also indicates the class of the match (e.g. a Gene, Protein or Anatomical Entity), which makes it possible to attach the candidate match to the graph displaying all candidates. The user then has the option to select the best candidate match from the corresponding drop-down list, while also inspecting the data model graph and seeing how each candidate connects. Furthermore, the response from the back-end includes additional information for each candidate, such as a label or a description of the RDF entry it matched. This information can be seen by clicking on the corresponding node displayed in the graph. Since for any given term there can be many possible entries, results are grouped together and only one example candidate is provided for each class.
-
Once the user has confirmed the candidate matches, a SPARQL query corresponding to the best answer according to the Bio-SODA ranking algorithm is returned and executed. The results are displayed in the table at the bottom of the page, and the SPARQL query itself can be inspected by clicking “Show SPARQL query”.
6.3 Use case
6.3.1 Example 1
6.3.2 Example 2
7 Experiments
7.1 Datasets
| Dataset | Sources | #Classes | #Triples | Size on Disk |
|---|---|---|---|---|
| QALD4-biomedical | Drugbank, Diseasome, Sider | 12 | 0.69 M | 200 MB |
| Bioinformatics | Bgee, OMA | 37 | 430 M | 30 GB |
| CORDIS | EU projects dataset | 26 | 6.5 M | 1 GB |
7.2 Queries
7.3 Results
| Datasets and Systems | Precision | Recall | F1 |
|---|---|---|---|
| Dataset 1: QALD4 | | | |
| GFMed | 1 | 0.99 | 0.99 |
| SQG | 0.42 | 0.42 | 0.42 |
| Sparklis (5.5 steps/query) | 0.88 | 0.88 | 0.88 |
| Bio-SODA | 0.61 | 0.60 | 0.60 |
| Dataset 2: Bioinformatics | | | |
| GFMed | 0 | 0 | 0 |
| SQG | 0.16 | 0.16 | 0.16 |
| Sparklis | - | - | - |
| Bio-SODA | 0.6 | 0.6 | 0.6 |
| Dataset 3: CORDIS | | | |
| GFMed | 0 | 0 | 0 |
| SQG | 0.33 | 0.33 | 0.33 |
| Sparklis (6.2 steps/query) | 1 | 1 | 1 |
| Bio-SODA | 0.66 | 0.66 | 0.66 |
7.4 Impact of ranking algorithm
| Dataset | (a) Correct with Bio-SODA Ranking | (b) Correct with String Similarity Ranking |
|---|---|---|
| QALD4 | 30/50 | 23/50 |
| Bioinformatics | 18/30 | 12/30 |
| CORDIS | 20/30 | 3/30 |
7.5 Error analysis and remaining problems
-
Aggregations. Our system currently does not support questions that require aggregations, such as Count, Sum, etc. An example of such a question would be Count the projects in the life sciences domain. A possible solution would be to include pre-defined patterns or to train a question classifier for this purpose.
-
Superlatives/Comparatives. Another unsupported feature in the current prototype is the use of quantifiers (superlatives or comparatives). An example would be Which drug has the highest number of side-effects?
-
Conjunctions. Conjunctive questions which involve multiple instances of the same class are not supported in the current prototype. An example of such a case is List drugs that lead to strokes and arthrosis. This limitation derives from our methodology in computing the minimal subgraph covering candidate matches, which would require special handling for cases when multiple candidates of the same class are present in a question.
-
Properties with same domain and range. Stemming from the same limitation mentioned above, these properties are currently not supported. In QALD4, the only instance of this is the diseaseSubtypeOf property, which has the Disease class as both domain and range. In the bioinformatics dataset we handle symmetric properties describing ortholog and paralog genes through custom rewrite rules.
-
Ranking. One of the major sources of failure in our prototype remains ranking. In the QALD4 dataset, ranking problems affect 4 out of 50 queries. An example is: What are the diseases caused by Valdecoxib?. Here, the system cannot correctly choose Drug - sideEffect - Side_Effect over the alternative Disease - possibleDrug - Drug. The reason is that the Disease class matches the term in the question exactly, while the Drug class in Diseasome has a higher PageRank score than the one in Sider. A more complex corner case is part of the bioinformatics dataset: What are the genes with lung in the description? The term lung is commonly used to refer to an Anatomical Entity, which is also reflected in the node importance of this match in the dataset. Therefore, the system cannot correctly determine that, in the context of this question, the term should instead be considered part of the description property of a Gene. The correct candidate match scores very low, and the correct answer is therefore also ranked too low. However, through user disambiguation in the Bio-SODA UX, this question can be correctly answered; the process is shown in Fig. 7 and discussed in Sect. 6.3. A similar example from QALD4 is Which drugs have bipolar disorder as indication?, where bipolar disorder is matched against a Disease instead of a drug indication. In such cases, user disambiguation at the level of candidate matches is an important component in solving the problem.
-
Incomplete information. This problem mainly affects the results in the QALD4 dataset, more precisely 4 out of 50 queries. We have already covered the example of the question targeting the EDNRB gene, which lacks the correct label in the official dataset. We currently do not enrich the inverted index with synonyms or external information, which means questions must be formulated in terms of the available vocabulary of the dataset. However, this limitation could be addressed by indexing synonyms from external data sources. Three additional questions cannot be answered because they refer to URIs that do not have any class defined in the data; therefore, the system cannot attach the candidate matches anywhere in the Schema Graph. An example is the drugType property, which can take two values, either http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugtype/experimental or http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugtype/approved. A better modelling of the data would have, for example, either declared these values with the xsd:anyURI datatype, given that they are not used for any other purpose, or defined a class for both.
-
Query complexity (difficult queries). The bioinformatics dataset covers queries of high complexity, which are difficult to solve especially because they include symmetric properties with multiple instances of the same class, each filtered according to different conditions. An example of such a question is: Retrieve Oryctolagus cuniculus’ proteins encoded by genes that are orthologous to Mus musculus’ HBB-Y gene. Here, the task is to retrieve Gene instances in a particular Taxon (species), namely the rabbit (Oryctolagus cuniculus), which are orthologs (a symmetric property) of a second instance of Gene, labeled HBB-Y, in a different species, namely the mouse (Mus musculus). The resulting query has over 15 triple patterns, with 3 filters (the 2 species names plus the gene name).
-
Others. Two questions in the QALD4 dataset pose particular challenges, the first being a stemming error: in the question Give me drugs in the gaseous state, the term gaseous cannot be correctly stemmed to gas. The second type of error is due to unsupported ASK queries, e.g. Are there drugs that target the Protein kinase C beta type?. Here, Bio-SODA retrieves examples of such drugs instead of the boolean True. However, we do not consider this a fundamental limitation, as a question type classifier could be added in future work.
8 Lessons learned
-
Generality: The system should be easily adaptable to new datasets. In particular, the system should be able to answer questions in a new domain with minimal manual intervention and without relying on extensive training data, which is hard to obtain in many domains. Along these lines, a desirable property is also the ability to cope with “real-world” datasets, dealing with incompleteness in the data, for example in the form of:
-
missing schema information (should be inferred from instance-level data);
-
missing labels (should be incorporated from URIs whenever meaningful);
-
Extensibility: The system should easily work with multiple datasets (provided they are already semantically aligned—i.e., data integration is a prior requirement). Many studies introduce possible approaches for data integration, including a recent approach for ontology-based data integration, covering one of the bioinformatics use cases presented in this paper [44].
-
Configurability: The database owner must be able to specify which properties (e.g. labels, descriptions) should be searchable using the system. Our experience with real-world datasets showed that it is generally not desirable for all properties to be indexed and thus made searchable. In many cases, fields in the queried data sources can be either redundant or too verbose. In bioinformatics, for example, these are abstracts of papers assigned as values to an RDF property, which can be up to 300 words long. Similarly, in the CORDIS dataset, these are the abstracts of the EU projects. Such cases should be handled through a dedicated approach, for example based on classical information retrieval methods as discussed in [45].
-
Explainability: The system should clearly guide the user through how a question was processed and interpreted. This starts from explaining which concepts were matched in relation to the original question, continuing with how these candidate matches are composed together in a query graph in order to provide the final SPARQL query. Finally, the query results should be understandable as well. Therefore, the projected variable names should also be meaningful.