DOI: 10.1145/3468791.3469119
research-article

Bio-SODA: Enabling Natural Language Question Answering over Knowledge Graphs without Training Data

Published: 11 August 2021

ABSTRACT

The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available.

In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets.
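To make the idea of centrality-aware ranking more concrete, the following is a minimal Python sketch, not the Bio-SODA implementation. It assumes that each candidate query comes with a text-match score and the set of graph nodes it touches, computes a plain PageRank over a toy graph, and blends the two scores with a fixed weight. The graph, the node names, the candidate labels, the `weight` parameter and the linear blend are all hypothetical choices made for illustration; the abstract does not specify how centrality enters the actual ranking.

```python
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]  # node -> list of nodes it links to


def pagerank(graph: Graph, damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank; dangling nodes spread their rank evenly."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def rank_candidates(
    candidates: List[Tuple[str, float, List[str]]],
    centrality: Dict[str, float],
    weight: float = 0.5,  # hypothetical blend factor, not taken from the paper
) -> List[str]:
    """Order candidate queries by a blend of match score and mean node centrality."""
    top = max(centrality.values())
    scored = []
    for query, match_score, nodes in candidates:
        # Normalise centrality to [0, 1] so it is comparable to the match score.
        mean_c = sum(centrality.get(v, 0.0) / top for v in nodes) / len(nodes) if nodes else 0.0
        scored.append(((1.0 - weight) * match_score + weight * mean_c, query))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [query for _, query in scored]


# Hypothetical toy graph and candidates, for illustration only.
SCHEMA: Graph = {
    "Protein": ["Gene", "Organism"],
    "Gene": ["Organism"],
    "Annotation": ["Protein"],
    "Organism": [],
}
CANDIDATES = [
    ("q_annotation", 0.8, ["Annotation"]),
    ("q_organism", 0.8, ["Organism"]),
]

if __name__ == "__main__":
    print(rank_candidates(CANDIDATES, pagerank(SCHEMA)))
```

In this sketch the two candidates have the same match score, so the one that touches the more central node is listed first. A real system would derive the candidates and their matched nodes from the knowledge graph itself.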


  • Published in

    SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management
    July 2021
    275 pages
    ISBN: 9781450384131
    DOI: 10.1145/3468791

    Copyright © 2021 Owner/Author

    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 56 of 146 submissions, 38%
