skip to main content
research-article
Open Access

SQLizer: query synthesis from natural language

Published:12 October 2017Publication History
Skip Abstract Section

Abstract

This paper presents a new technique for automatically synthesizing SQL queries from natural language (NL). At the core of our technique is a new NL-based program synthesis methodology that combines semantic parsing techniques from the NLP community with type-directed program synthesis and automated program repair. Starting with a program sketch obtained using standard parsing techniques, our approach involves an iterative refinement loop that alternates between probabilistic type inhabitation and automated sketch repair. We use the proposed idea to build an end-to-end system called SQLIZER that can synthesize SQL queries from natural language. Our method is fully automated, works for any database without requiring additional customization, and does not require users to know the underlying database schema. We evaluate our approach on over 450 natural language queries concerning three different databases, namely MAS, IMDB, and YELP. Our experiments show that the desired query is ranked within the top 5 candidates in close to 90% of the cases and that SQLIZER outperforms NALIR, a state-of-the-art tool that won a best paper award at VLDB'14.

References

  1. I Androutsopoulos, G Ritchie, and P Thanisch. 1993. Masque/sql: An Efficient and Portable Natural Language Query Interface for Relational Databases. Tech report, University of Edinburgh (1993).Google ScholarGoogle Scholar
  2. Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases - An Introduction. Natural Language Engineering (1995).Google ScholarGoogle Scholar
  3. Thomas Ball, Mayur Naik, and Sriram K. Rajamani. 2003. From symptom to cause: localizing errors in counterexample traces. In POPL. 97–105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Daniel W. Barowy, Sumit Gulwani, Ted Hart, and Benjamin G. Zorn. 2015. FlashRelate: extracting relational data from semi-structured spreadsheets using examples. ACM, 218–228.Google ScholarGoogle Scholar
  5. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533–1544.Google ScholarGoogle Scholar
  6. Bob Carpenter. 1997. Type-logical semantics. MIT press.Google ScholarGoogle Scholar
  7. Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing database-backed applications with query synthesis. In PLDI. 3–14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. E. F. Codd. 1974. Seven Steps to Rendezvous with the Casual User. In IFIP Working Conference Data Base Management. 179–200.Google ScholarGoogle Scholar
  9. Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Sailesh R, and Subhajit Roy. 2016. Program synthesis using natural language. In ICSE.Google ScholarGoogle Scholar
  10. Ramez Elmasri and Shamkant B. Navathe. 2011. Fundamentals of Database Systems. Addison-Wesley.Google ScholarGoogle Scholar
  11. Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017a. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Programming Language Design and Implementation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017b. Component-based synthesis for complex APIs. In POPL. 599–612. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. John K Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing data structure transformations from input-output examples. In PLDI. 229–239.Google ScholarGoogle Scholar
  14. Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair. In ICSE. 3–13.Google ScholarGoogle Scholar
  15. Alex Groce and Willem Visser. 2003. What Went Wrong: Explaining Counterexamples. In SPIN.Google ScholarGoogle Scholar
  16. Barbara J. Grosz, Douglas E. Appelt, Paul A. Martin, and Fernando C. N. Pereira. 1987. TEAM: An Experiment in the Design of Transportable Natural-Language Interfaces. Artificial Intelligence 32, 2 (1987), 173–243. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, Vol. 46. ACM, 317–330. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Sumit Gulwani and Mark Marron. 2014. NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation. In SIGMOD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Tihomir Gvero, Viktor Kuncak, Ivan Kuraj, and Ruzica Piskac. 2013. Complete completion using types and weights. In ACM SIGPLAN Notices , Vol. 48. ACM, 27–38. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Gary G. Hendrix, Earl D. Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. 1978. Developing a Natural Language Interface to Complex Data. TODS 3, 2 (1978), 105–147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. James A. Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE. 273–282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. James A. Jones, Mary Jean Harrold, and John T. Stasko. 2002. Visualization of test information to assist fault localization. In ICSE . 467–477. Google ScholarGoogle ScholarCross RefCross Ref
  23. Manu Jose and Rupak Majumdar. 2011a. Bug-Assist: assisting fault localization in ANSI-C programs. In CAV. Google ScholarGoogle ScholarCross RefCross Ref
  24. Manu Jose and Rupak Majumdar. 2011b. Cause clue clauses: error localization using maximum satisfiability. PLDI (2011).Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Rohit J. Kate and Raymond J. Mooney. 2006. Using String-Kernels for Learning Semantic Parsers. In ACL. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to Transform Natural to Formal Languages. In AAAI. 1062–1068.Google ScholarGoogle Scholar
  27. Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. ACM, 542–553.Google ScholarGoogle Scholar
  28. Vu Le, Sumit Gulwani, and Zhendong Su. 2013. Smartsynth: Synthesizing smartphone automation scripts from natural language. In MobiSys. 193–206.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. PVLDB 8, 1 (2014), 73–84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Yunyao Li, Huahai Yang, and H. V. Jagadish. 2006. Constructing a Generic Natural Language Interface for an XML Database. In EDBT. 737–754. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Percy Liang and Christopher Potts. 2015. Bringing machine learning and compositional semantics together. Annual Review of Linguistics 1, 1 (2015), 355–376. Google ScholarGoogle ScholarCross RefCross Ref
  32. Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In ESEC/FSE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. In POPL. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Bill MacCartney and Christopher D Manning. 2009. An extended model of natural logic. In IWCS. 140–156.Google ScholarGoogle Scholar
  35. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL System Demonstrations. 55–60. Google ScholarGoogle ScholarCross RefCross Ref
  36. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.Google ScholarGoogle Scholar
  37. Scott Miller, David Stallard, Robert J. Bobrow, and Richard M. Schwartz. 1996. A Fully Statistical Approach to Natural Language Interfaces. In ACL. 55–61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: program repair via semantic analysis. In ICSE. 772–781.Google ScholarGoogle Scholar
  39. Peter-Michael Osera and Steve Zdancewic. 2015. Type- and example-directed program synthesis. In PLDI.Google ScholarGoogle Scholar
  40. Nadia Polikarpova, Ivan Kuraj, and Armando Solar-Lezama. 2016. Program synthesis from polymorphic refinement types. In PLDI. 522–538. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A framework for inductive program synthesis. ACM, 107–126.Google ScholarGoogle Scholar
  42. Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability. In COLING.Google ScholarGoogle Scholar
  43. Ana-Maria Popescu, Oren Etzioni, and Henry A. Kautz. 2003. Towards a theory of natural language interfaces to databases. In IUI. 149–157.Google ScholarGoogle Scholar
  44. Chris Quirk, Raymond Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In ACL. 878–888.Google ScholarGoogle Scholar
  45. Mohammad Raza, Sumit Gulwani, and Natasa Milic-Frayling. 2015. Compositional Program Synthesis from Natural Language and Examples. In IJCAI.Google ScholarGoogle Scholar
  46. Armando Solar Lezama. 2008. Program Synthesis By Sketching. Ph.D. Dissertation. EECS Department, University of California, Berkeley.Google ScholarGoogle Scholar
  47. Armando Solar-Lezama, Rodric M. Rabbah, Rastislav Bodík, and Kemal Ebcioglu. 2005. Programming by sketching for bit-streaming programs. In PLDI. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In ASPLOS. 404–415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Lappoon R Tang and Raymond J Mooney. 2000. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In EMNLP. 133–141.Google ScholarGoogle Scholar
  50. Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. 2009. Query by output. In SIGMOD. 535–548. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Chenglong Wang, Alvin Cheung, and Ras Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-Output Examples. In Programming Language Design and Implementation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. David H. D. Warren and Fernando C. N. Pereira. 1982. An Efficient Easily Adaptable System for Interpreting Natural Language Queries. American Journal of Computational Linguistics 8, 3-4 (1982), 110–122.Google ScholarGoogle Scholar
  53. Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In ICSE. 364–374.Google ScholarGoogle Scholar
  54. Ben Wiedermann, Ali Ibrahim, and William R. Cook. 2008. Interprocedural query extraction for transparent persistence. In OOPSLA . 19–36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. William A Woods, Ronald M Kaplan, and Bonnie Nash-Webber. 1972. The lunar sciences natural language information system. Bolt, Beranek and Newman.Google ScholarGoogle Scholar
  56. Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing transformations on hierarchically structured data. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation . ACM, 508–521. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. John M. Zelle and Raymond J. Mooney. 1993. Learning Semantic Grammars with Constructive Inductive Logic Programming. In AAAI. 817–822.Google ScholarGoogle Scholar
  58. John M. Zelle and Raymond J. Mooney. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In AAAI . 1050–1055.Google ScholarGoogle Scholar
  59. Sai Zhang and Yuyin Sun. 2013. Automatically synthesizing SQL queries from input-output examples. In ASE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Moshé M. Zloof. 1975. Query-by-Example: the Invocation and Definition of Tables and Forms. In VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. SQLizer: query synthesis from natural language

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image Proceedings of the ACM on Programming Languages
              Proceedings of the ACM on Programming Languages  Volume 1, Issue OOPSLA
              October 2017
              1786 pages
              EISSN:2475-1421
              DOI:10.1145/3152284
              Issue’s Table of Contents

              Copyright © 2017 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 12 October 2017
              Published in pacmpl Volume 1, Issue OOPSLA

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader