ABSTRACT
Natural language is a promising alternative interface to DBMSs because it enables non-technical users to formulate complex questions in a more concise manner than SQL. Recently, deep learning has gained traction for translating natural language to SQL, since similar ideas have been successful in the related domain of machine translation. However, the core problem with existing deep learning approaches is that they require an enormous amount of training data in order to provide accurate translations. This training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language examples with the corresponding SQL queries (or vice versa). Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to (1) improve overall translation accuracy, (2) increase robustness to linguistic variation, and (3) specialize the model for the target database. As we show, our DBPal training pipeline is able to improve both the accuracy and linguistic robustness of state-of-the-art natural language to SQL translation models.
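The abstract does not say how the synthetic training data is produced. As a rough illustration only, the sketch below assumes a template-based generator: paired natural language and SQL patterns are filled with table and column names taken from the target schema, which is one way a pipeline could specialize a model for a given database. The schema, the templates, the `@VALUE` placeholder, and the function name `generate_pairs` are all hypothetical and are not taken from DBPal.

```python
import itertools

# Hypothetical schema for illustration: table name -> column names.
SCHEMA = {"patients": ["name", "age", "diagnosis"]}

# Hypothetical paired templates: a natural language pattern and the SQL
# pattern it corresponds to. "@VALUE" stands in for a constant.
TEMPLATES = [
    ("show the {col} of all {table}",
     "SELECT {col} FROM {table}"),
    ("what is the {col} of {table} where {filter_col} is @VALUE",
     "SELECT {col} FROM {table} WHERE {filter_col} = @VALUE"),
]


def generate_pairs(schema, templates):
    """Fill every template with every (table, column, filter column) choice."""
    pairs = []
    for table, cols in schema.items():
        for nl_template, sql_template in templates:
            for col, filter_col in itertools.permutations(cols, 2):
                slots = {"table": table, "col": col, "filter_col": filter_col}
                # str.format ignores slots that a template does not use.
                pairs.append(
                    (nl_template.format(**slots), sql_template.format(**slots))
                )
    # Templates without a filter slot produce repeats; drop them, keep order.
    return list(dict.fromkeys(pairs))


pairs = generate_pairs(SCHEMA, TEMPLATES)
for nl, sql in pairs[:3]:
    print(f"{nl}  ->  {sql}")
```

Because the pairs are derived from the schema alone, no manual annotation is needed in this sketch; a real pipeline would presumably also vary the wording of the natural language side to cover linguistic variation.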