DBPal: A Fully Pluggable NL2SQL Training Pipeline

Authors:
Nathaniel Weir (Johns Hopkins University, Baltimore, MD, USA)
Prasetya Utama (Technische Universität Darmstadt, Darmstadt, Germany)
Alex Galakatos (Brown University, Providence, RI, USA)
Andrew Crotty (Brown University, Providence, RI, USA)
Amir Ilkhechi (Brown University, Providence, RI, USA)
Shekar Ramaswamy (Brown University, Providence, RI, USA)
Rohin Bhushan (Brown University, Providence, RI, USA)
Nadja Geisler (Technische Universität Darmstadt, Darmstadt, Germany)
Benjamin Hättasch (Technische Universität Darmstadt, Darmstadt, Germany)
Steffen Eger (Technische Universität Darmstadt, Darmstadt, Germany)
Ugur Cetintemel (Brown University, Providence, RI, USA)
Carsten Binnig (Technische Universität Darmstadt, Darmstadt, Germany)

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, June 2020, Pages 2347–2361
https://doi.org/10.1145/3318464.3380589

Published: 31 May 2020

ABSTRACT

Natural language is a promising alternative interface to DBMSs because it enables non-technical users to formulate complex questions in a more concise manner than SQL. Recently, deep learning has gained traction for translating natural language to SQL, since similar ideas have been successful in the related domain of machine translation. However, the core problem with existing deep learning approaches is that they require an enormous amount of training data in order to provide accurate translations. This training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language examples with the corresponding SQL queries (or vice versa). Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to (1) improve overall translation accuracy, (2) increase robustness to linguistic variation, and (3) specialize the model for the target database. As we show, our DBPal training pipeline is able to improve both the accuracy and linguistic robustness of state-of-the-art natural language to SQL translation models.
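
The abstract does not spell out how the synthetic training data is produced. As a purely illustrative sketch (not the authors' implementation), one simple way to generate natural language/SQL pairs specialized to a target database is to instantiate seed templates with names taken from that database's schema. The schema, the templates, and the function name `generate_pairs` below are all hypothetical.

```python
from itertools import product

# Hypothetical toy schema: table name -> column names (not from the paper).
SCHEMA = {
    "patients": ["name", "age"],
    "cities": ["name", "population"],
}

# Hypothetical seed templates: a natural language pattern paired with a SQL pattern.
TEMPLATES = [
    ("show the {col} of all {table}", "SELECT {col} FROM {table}"),
    ("how many distinct {col} values do {table} have",
     "SELECT COUNT(DISTINCT {col}) FROM {table}"),
]


def generate_pairs(schema, templates):
    """Instantiate every template with every (table, column) pair in the schema."""
    pairs = []
    for table, cols in schema.items():
        for col, (nl, sql) in product(cols, templates):
            pairs.append((nl.format(col=col, table=table),
                          sql.format(col=col, table=table)))
    return pairs


if __name__ == "__main__":
    for question, query in generate_pairs(SCHEMA, TEMPLATES):
        print(f"{question!r} -> {query!r}")
```

Because the slots are filled from the target schema, every generated pair mentions only tables and columns that actually exist in that database. Covering linguistic variation would additionally require varying the natural language side, which this sketch does not attempt.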

Index Terms

DBPal: A Fully Pluggable NL2SQL Training Pipeline
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Data management systems

Published in
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464
General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)
Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)
Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
NL2SQL
NLIDB
natural language interface to database
natural language to SQL
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 16
- Total Downloads: 648
- Downloads (Last 12 months): 93
- Downloads (Last 6 weeks): 19