research-article

DBTagger: multi-task learning for keyword mapping in NLIDBs using Bi-directional recurrent neural networks

Authors:
Arif Usta

Bilkent University, Ankara, Turkey

Bilkent University, Ankara, Turkey
View Profile

,
Akifhan Karakayali

Bilkent University, Ankara, Turkey

Bilkent University, Ankara, Turkey
View Profile

,
Özgür Ulusoy

Bilkent University, Ankara, Turkey

Bilkent University, Ankara, Turkey
View Profile

Proceedings of the VLDB Endowment Volume 14 Issue 5pp 813–821https://doi.org/10.14778/3446095.3446103

Published:01 January 2021Publication History

Proceedings of the VLDB Endowment

Abstract

Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in database community recently. Conventional rule based systems utilize series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step overlooking the earlier steps by using adhoc solutions. In the pipeline, one of the most critical and challenging problems is keyword mapping; constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define the keyword mapping problem as a sequence tagging problem, and propose a novel deep learning based supervised approach that utilizes POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets, and report new state-of-the-art accuracy results, 92.4% on the average. Our results also indicate that DBTagger is faster than its counterparts up to 10000 times and scalable for bigger databases.

References

Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28, 5 (2019), 793--819.Google ScholarDigital Library
Christopher Baik, H. V. Jagadish, and Yunyao Li. 2019. Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases. In 2019 IEEE 35th International Conference on Data Engineering (ICDE '19).Google ScholarCross Ref
Fuat Basik, Benjamin Hättasch, Amir Ilkhechi, Arif Usta, Shekar Ramaswamy, Prasetya Utama, Nathaniel Weir, Carsten Binnig, and Ugur Cetintemel. 2018. DB-Pal: A Learned NL-Interface for Databases. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD '18). 1765--1768. Google ScholarDigital Library
Lukas Blunschi, Claudio Jossen, Donald Kossmann, Magdalini Mori, and Kurt Stockinger. 2012. SODA: Generating SQL for Business Users. Proc. VLDB Endow. 5, 10 (2012). Google ScholarDigital Library
Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL '19). 4560--4565.Google ScholarCross Ref
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5 (2017), 135--146.Google ScholarCross Ref
Rich Caruana. 1997. Multitask learning. Machine learning 28, 1 (1997), 41--75. Google ScholarDigital Library
Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2015. Gated Feedback Recurrent Neural Networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML '15). 2067--2075. Google ScholarDigital Library
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of machine learning research 12, ARTICLE (2011), 2493--2537. Google ScholarDigital Library
Timothy Dozat. 2016. Incorporating Nesterov Momentum into Adam. In International Conference on Learning Representations Workshop.Google Scholar
A. Graves, A. Mohamed, and G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. 6645--6649.Google Scholar
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL '19). 4524--4535.Google ScholarCross Ref
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '16). 770--778.Google ScholarCross Ref
Gary G. Hendrix, Earl D. Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. 1978. Developing a Natural Language Interface to Complex Data. ACM Trans. Database Syst. 3, 2 (1978), 105--147. Google ScholarDigital Library
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv abs/1207.0580 (2012).Google Scholar
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735--1780. Google ScholarDigital Library
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '17). 4700--4708.Google ScholarCross Ref
Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural Language to Structured Query Generation via Meta-Learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (NAACL '18). 732--738.Google ScholarCross Ref
Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. ArXiv abs/1508.01991 (2015).Google Scholar
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer. 2017. Learning a Neural Semantic Parser from User Feedback. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL '17). 963--973.Google ScholarCross Ref
Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment 13, 10 (2020), 1737--1750. Google ScholarDigital Library
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01). 282--289. Google ScholarDigital Library
Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL '16). 260--270.Google ScholarCross Ref
Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. Proc. VLDB Endow. 8, 1 (2014), 73--84. Google ScholarDigital Library
Yunyao Li and Davood Rafiei. 2017. Natural Language Data Management and Interfaces: Recent Development and Open Challenges. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). 1765--1770. Google ScholarDigital Library
Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bidirectional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL '16). 1064--1074.Google Scholar
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations. 55--60.Google Scholar
Tomas Mikolov, G.s Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR '13). 1--12.Google Scholar
George A. Miller. 1995. WordNet: A Lexical Database for English. Commun. ACM 38, 11 (1995), 39--41. Google ScholarDigital Library
Fatma Özcan, Abdul Quamar, Jaydeep Sen, Chuan Lei, and Vasilis Efthymiou. 2020. State of the Art and Open Challenges in Natural Language Interfaces to Data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 2629--2636. Google ScholarDigital Library
Ana-Maria Popescu, Oren Etzioni, and Henry Kautz. 2003. Towards a Theory of Natural Language Interfaces to Databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces (IUI '03). 149--157. Google ScholarDigital Library
Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R. Mittal, and Fatma Özcan. 2016. ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores. Proc. VLDB Endow. 9, 12 (2016). Google ScholarDigital Library
Nathaniel Weir, Prasetya Utama, Alex Galakatos, Andrew Crotty, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Nadja Geisler, Benjamin Hättasch, Steffen Eger, Ugur Cetintemel, and Carsten Binnig. 2020. DBPal: A Fully Pluggable NL2SQL Training Pipeline. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20). 2347--2361. Google ScholarDigital Library
Xiaojun Xu, Chang Liu, and Dawn Song. 2017. Sqlnet: Generating structured queries from natural language without reinforcement learning. arXiv preprint arXiv:1711.04436 (2017).Google Scholar
Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: Query Synthesis from Natural Language. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1--26. Google ScholarDigital Library
Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT). 189--194.Google ScholarCross Ref
Semih Yavuz, Izzeddin Gur, Yu Su, and Xifeng Yan. 2018. What It Takes to Achieve 100% Condition Accuracy on WikiSQL. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP '18). 1702--1711.Google ScholarCross Ref
Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL '20). 8413--8426.Google ScholarCross Ref
Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13, 3 (2018), 55--75.Google ScholarCross Ref
Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018. TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (NAACL '18). 588--594.Google ScholarCross Ref
Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018. SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP '18). 1653--1663.Google ScholarCross Ref
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP '18). 3911--3921.Google ScholarCross Ref
Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. ArXiv abs/1212.5701 (2012).Google Scholar
Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017).Google Scholar

Index Terms

DBTagger: multi-task learning for keyword mapping in NLIDBs using Bi-directional recurrent neural networks
1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks

Index terms have been assigned to the content through auto-classification.

Recommendations

Unsupervised translated word sense disambiguation in constructing bilingual lexical database
SAC '18: Proceedings of the 33rd Annual ACM Symposium on Applied Computing

The performance of a machine translation system depends on the availability of bilingual lexical dictionary and completion of its word sense disambiguation performance. Word sense disambiguation plays a vital role in several applications such as machine ...
Read More
SemEval-2010 task 3: cross-lingual word sense disambiguation
SEW '09: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions

We propose a multilingual unsupervised Word Sense Disambiguation (WSD) task for a sample of English nouns. Instead of providing manually sensetagged examples for each sense of a polysemous noun, our sense inventory is built up on the basis of the ...
Read More
Sql: Learn Basics of Queries and Implement Easily (sql programming, SQL 2016, sql database programming, sql for beginners, sql beginners guide, sql ... sql workbook, sql guide, MSSQL) (Volume 1)
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 14, Issue 5
January 2021
142 pages
ISSN:2150-8097
Editors:
Xin Luna Dong
Amazon
,
Felix Naumann
HPI, University of Potsdam
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 January 2021
Published in pvldb Volume 14, Issue 5
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 2
  Total Citations
  View Citations
- 107
  Total Downloads
- Downloads (Last 12 months)28
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

DBTagger: multi-task learning for keyword mapping in NLIDBs using Bi-directional recurrent neural networks

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Unsupervised translated word sense disambiguation in constructing bilingual lexical database

SemEval-2010 task 3: cross-lingual word sense disambiguation

Sql: Learn Basics of Queries and Implement Easily (sql programming, SQL 2016, sql database programming, sql for beginners, sql beginners guide, sql ... sql workbook, sql guide, MSSQL) (Volume 1)

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

DBTagger: multi-task learning for keyword mapping in NLIDBs using Bi-directional recurrent neural networks

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Unsupervised translated word sense disambiguation in constructing bilingual lexical database

SemEval-2010 task 3: cross-lingual word sense disambiguation

Sql: Learn Basics of Queries and Implement Easily (sql programming, SQL 2016, sql database programming, sql for beginners, sql beginners guide, sql ... sql workbook, sql guide, MSSQL) (Volume 1)

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media