research-article

DBTagger: multi-task learning for keyword mapping in NLIDBs using Bi-directional recurrent neural networks

Published: 01 January 2021

Abstract

Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed on relational databases is a challenging task that has recently been widely studied in the database community. Conventional rule-based systems chain a series of solutions into a pipeline to handle each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step, handling the earlier steps with ad hoc solutions. In the pipeline, one of the most critical and challenging problems is keyword mapping: constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define the keyword mapping problem as a sequence tagging problem and propose a novel deep-learning-based supervised approach that utilizes POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema-independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets and report new state-of-the-art accuracy results, 92.4% on average. Our results also indicate that DBTagger is up to 10,000 times faster than its counterparts and scales to larger databases.
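To make the sequence tagging framing concrete, the sketch below shows one way a query could be represented as a sequence of tokens, each carrying a POS tag and a database-element label, and how token-level accuracy could be computed against gold labels. It is illustrative only: the example query, the schema (`movie`, `director.name`), the label format (`TABLE:`, `ATTR:`, `VALUE:`, `O`), and the names `TaggedToken` and `token_accuracy` are assumptions made here, not taken from the paper, and the sketch does not reproduce the DBTagger model itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaggedToken:
    token: str       # word from the natural language query
    pos: str         # part-of-speech tag (Penn Treebank style)
    schema_tag: str  # hypothetical label: mapped database element, or "O" for none


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical query over an invented movie schema (not from the paper).
example = [
    TaggedToken("show", "VB", "O"),
    TaggedToken("movies", "NNS", "TABLE:movie"),
    TaggedToken("directed", "VBN", "O"),
    TaggedToken("by", "IN", "O"),
    TaggedToken("Nolan", "NNP", "VALUE:director.name"),
]

gold_tags = [t.schema_tag for t in example]

# A made-up prediction that confuses a value with an attribute on the last token.
predicted_tags = ["O", "TABLE:movie", "O", "O", "ATTR:director.name"]

accuracy = token_accuracy(gold_tags, predicted_tags)
```

In this framing every token receives exactly one label, so keyword mapping quality can be scored per token, in the same way as other sequence tagging tasks such as named entity recognition.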

Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 5
January 2021
142 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
