DOI:10.1145/3318464.3380589
research-article

DBPal: A Fully Pluggable NL2SQL Training Pipeline

Published: 31 May 2020

ABSTRACT

Natural language is a promising alternative interface to DBMSs because it enables non-technical users to formulate complex questions in a more concise manner than SQL. Recently, deep learning has gained traction for translating natural language to SQL, since similar ideas have been successful in the related domain of machine translation. However, the core problem with existing deep learning approaches is that they require an enormous amount of training data in order to provide accurate translations. This training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language examples with the corresponding SQL queries (or vice versa). Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of models for natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to (1) improve overall translation accuracy, (2) increase robustness to linguistic variation, and (3) specialize the model for the target database. As we show, our DBPal training pipeline is able to improve both the accuracy and linguistic robustness of state-of-the-art natural language to SQL translation models.
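
To make the idea of automatically generated training data concrete, here is a minimal sketch of one possible approach: slot-filling templates over a database schema, followed by simple paraphrasing for linguistic variation. The abstract does not describe DBPal's actual mechanism, so the schema, the templates, the paraphrase table, and the function names `generate_pairs` and `augment` are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: not DBPal's implementation.
# Hypothetical target schema: table name -> column names.
SCHEMA = {"patients": ["name", "age", "diagnosis"]}

# Hypothetical seed templates pairing a natural language pattern with SQL.
# "@VALUE" is an assumed placeholder standing in for a constant.
TEMPLATES = [
    ("show the {col} of all {table}", "SELECT {col} FROM {table}"),
    ("what is the {col} of {table} where {filter_col} is @VALUE",
     "SELECT {col} FROM {table} WHERE {filter_col} = @VALUE"),
]

# Hypothetical paraphrases for the leading phrase of a question.
PARAPHRASES = {"show": ["list", "display"], "what is": ["give me"]}


def generate_pairs(schema, templates):
    """Fill every template with every table/column combination."""
    pairs = []
    for table, cols in schema.items():
        for nl, sql in templates:
            for col in cols:
                for filter_col in cols:
                    if filter_col == col:
                        continue
                    slots = {"table": table, "col": col,
                             "filter_col": filter_col}
                    # str.format ignores slots a template does not use.
                    pairs.append((nl.format(**slots), sql.format(**slots)))
    # Templates without a filter slot yield duplicates; drop them,
    # keeping first-seen order.
    return list(dict.fromkeys(pairs))


def augment(pairs, paraphrases):
    """Add variants whose leading phrase is swapped; the SQL is unchanged."""
    out = list(pairs)
    for nl, sql in pairs:
        for phrase, alternatives in paraphrases.items():
            if nl.startswith(phrase + " "):
                for alt in alternatives:
                    out.append((alt + nl[len(phrase):], sql))
    return out


pairs = generate_pairs(SCHEMA, TEMPLATES)
augmented = augment(pairs, PARAPHRASES)
```

Because the templates are instantiated from the schema of the target database, the generated pairs are specific to that database, and the paraphrased variants expose a model to several phrasings of the same query. A real pipeline would need far richer templates and paraphrasing than this toy example.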

      Published in

        SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
        June 2020
        2925 pages
        ISBN:9781450367356
        DOI:10.1145/3318464

        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
