Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data

Authors:
Ana Claudia Sima (SIB Swiss Institute of Bioinformatics)
Tarcisio Mendes de Farias (SIB Swiss Institute of Bioinformatics)
Maria Anisimova (Zurich University of Applied Sciences)
Christophe Dessimoz (University of Lausanne)
Marc Robinson-Rechavi (University of Lausanne)
Erich Zbinden (Zurich University of Applied Sciences)
Kurt Stockinger (Zurich University of Applied Sciences)

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management, July 2021, Pages 61–72. https://doi.org/10.1145/3468791.3469119

Published: 11 August 2021

ABSTRACT

Natural language processing over structured data has become a growing research field within both the relational database and the Semantic Web communities, with significant effort devoted to question answering over knowledge graphs (KGQA). However, many of these approaches either specifically target open-domain question answering using DBpedia or require large training datasets to translate a natural language question into SPARQL in order to query the knowledge graph. Hence, they often cannot be applied directly to complex scientific datasets for which no prior training data is available.

In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by at least 20% in F1-score, and by an even larger margin on more complex bioinformatics datasets.
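
To make the idea of centrality-aware ranking concrete, the following is a minimal sketch, not Bio-SODA's actual implementation. It assumes PageRank as the centrality measure, a toy schema graph with hypothetical node names, hypothetical candidate labels and match scores, and an equal-weight linear combination of string-match score and normalized centrality. None of these specifics are stated in the abstract.

```python
from typing import Dict, List, Tuple


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a directed graph given as adjacency lists."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for source, targets in graph.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def rank_candidates(candidates: List[Tuple[str, float, List[str]]],
                    centrality: Dict[str, float],
                    weight: float = 0.5) -> List[str]:
    """Order candidate queries by a weighted mix of match score and centrality.

    Each candidate is (label, match_score in [0, 1], graph nodes it touches).
    The linear combination and the default weight are illustrative assumptions.
    """
    max_centrality = max(centrality.values())
    scored = []
    for label, match_score, nodes in candidates:
        avg = sum(centrality[v] for v in nodes) / len(nodes) / max_centrality
        scored.append((weight * match_score + (1.0 - weight) * avg, label))
    return [label for _, label in sorted(scored, reverse=True)]


# Hypothetical schema graph: "Gene" is referenced by most other classes.
graph = {
    "Gene": ["Protein", "Organism"],
    "Protein": ["Gene"],
    "Organism": ["Gene"],
    "Annotation": ["Gene"],
    "Comment": ["Annotation"],
}
centrality = pagerank(graph)

# Two hypothetical candidates whose keywords match equally well.
candidates = [
    ("query_via_comment", 0.8, ["Comment"]),
    ("query_via_gene", 0.8, ["Gene"]),
]
ranked = rank_candidates(candidates, centrality)
print(ranked)
```

In this toy setup both candidates have the same match score, so the one touching the more central node is ranked first. A real system would derive the graph from the knowledge graph itself and would need to tune how the two signals are combined.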

Published in
SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management
July 2021, 275 pages
ISBN: 9781450384131
DOI: 10.1145/3468791
Editors: Qiang Zhu (University of Michigan - Dearborn, USA), Xingquan (Hill) Zhu (Florida Atlantic University, USA), Yicheng Tu (University of South Florida, USA), Zichen (Frank) Xu (Nanchang University, China), Anand Kumar (Amazon Inc., USA)
Copyright © 2021 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: Knowledge Graphs, Question Answering, Ranking
Qualifiers: research-article, Research, Refereed limited
Overall Acceptance Rate: 56 of 146 submissions, 38%