ABSTRACT
Natural language processing over structured data has become a growing research field in both the relational database and the Semantic Web communities, with significant effort devoted to question answering over knowledge graphs (KGQA). However, many of these approaches either specifically target open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available.
In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by at least 20% in F1-score, and by an even larger margin on more complex bioinformatics datasets.
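To make the idea of centrality-aware ranking concrete, the following is a minimal sketch, not Bio-SODA's actual algorithm. It assumes PageRank as the centrality measure, a toy four-node schema graph, and a hypothetical scoring rule (match score multiplied by the average centrality of the nodes a candidate query touches). The function names `pagerank` and `rank_candidates`, the graph, and the candidate queries are all illustrative assumptions.

```python
from typing import Dict, List, Tuple


def pagerank(edges: List[Tuple[str, str]], damping: float = 0.85,
             iters: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank over a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out: Dict[str, List[str]] = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


def rank_candidates(candidates: List[Tuple[str, List[str], float]],
                    centrality: Dict[str, float]) -> List[str]:
    """Order candidate queries by match score weighted by node centrality.

    Each candidate is (query text, non-empty list of graph nodes it
    touches, match score). The scoring rule is a hypothetical illustration.
    """
    scored = []
    for query, nodes, match_score in candidates:
        avg_centrality = sum(centrality[n] for n in nodes) / len(nodes)
        scored.append((match_score * avg_centrality, query))
    return [query for _, query in sorted(scored, reverse=True)]


# Toy schema graph (assumed): three classes point to Protein.
edges = [("Gene", "Protein"), ("Disease", "Protein"),
         ("Taxon", "Protein"), ("Protein", "Gene")]
centrality = pagerank(edges)

# Two candidates with equal match scores; centrality breaks the tie.
candidates = [
    ("SELECT ?x WHERE { ?x a :Taxon }", ["Taxon"], 1.0),
    ("SELECT ?x WHERE { ?x a :Protein }", ["Protein"], 1.0),
]
ranked = rank_candidates(candidates, centrality)
```

In this toy setup both candidates match the question equally well, so the candidate touching the more central `Protein` node is ranked first. A real system would combine centrality with richer matching signals.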