Abstract
This paper presents a new technique for automatically synthesizing SQL queries from natural language (NL). At the core of our technique is a new NL-based program synthesis methodology that combines semantic parsing techniques from the NLP community with type-directed program synthesis and automated program repair. Starting with a program sketch obtained using standard parsing techniques, our approach involves an iterative refinement loop that alternates between probabilistic type inhabitation and automated sketch repair. We use the proposed idea to build an end-to-end system called SQLIZER that can synthesize SQL queries from natural language. Our method is fully automated, works for any database without requiring additional customization, and does not require users to know the underlying database schema. We evaluate our approach on over 450 natural language queries concerning three different databases, namely MAS, IMDB, and YELP. Our experiments show that the desired query is ranked within the top 5 candidates in close to 90% of the cases and that SQLIZER outperforms NALIR, a state-of-the-art tool that won a best paper award at VLDB'14.
- I Androutsopoulos, G Ritchie, and P Thanisch. 1993. Masque/sql: An Efficient and Portable Natural Language Query Interface for Relational Databases. Tech report, University of Edinburgh (1993).Google Scholar
- Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases - An Introduction. Natural Language Engineering (1995).Google Scholar
- Thomas Ball, Mayur Naik, and Sriram K. Rajamani. 2003. From symptom to cause: localizing errors in counterexample traces. In POPL. 97–105. Google ScholarDigital Library
- Daniel W. Barowy, Sumit Gulwani, Ted Hart, and Benjamin G. Zorn. 2015. FlashRelate: extracting relational data from semi-structured spreadsheets using examples. ACM, 218–228.Google Scholar
- Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533–1544.Google Scholar
- Bob Carpenter. 1997. Type-logical semantics. MIT press.Google Scholar
- Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing database-backed applications with query synthesis. In PLDI. 3–14. Google ScholarDigital Library
- E. F. Codd. 1974. Seven Steps to Rendezvous with the Casual User. In IFIP Working Conference Data Base Management. 179–200.Google Scholar
- Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Sailesh R, and Subhajit Roy. 2016. Program synthesis using natural language. In ICSE.Google Scholar
- Ramez Elmasri and Shamkant B. Navathe. 2011. Fundamentals of Database Systems. Addison-Wesley.Google Scholar
- Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017a. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Programming Language Design and Implementation. Google ScholarDigital Library
- Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017b. Component-based synthesis for complex APIs. In POPL. 599–612. Google ScholarDigital Library
- John K Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing data structure transformations from input-output examples. In PLDI. 229–239.Google Scholar
- Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair. In ICSE. 3–13.Google Scholar
- Alex Groce and Willem Visser. 2003. What Went Wrong: Explaining Counterexamples. In SPIN.Google Scholar
- Barbara J. Grosz, Douglas E. Appelt, Paul A. Martin, and Fernando C. N. Pereira. 1987. TEAM: An Experiment in the Design of Transportable Natural-Language Interfaces. Artificial Intelligence 32, 2 (1987), 173–243. Google ScholarDigital Library
- Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, Vol. 46. ACM, 317–330. Google ScholarDigital Library
- Sumit Gulwani and Mark Marron. 2014. NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation. In SIGMOD. Google ScholarDigital Library
- Tihomir Gvero, Viktor Kuncak, Ivan Kuraj, and Ruzica Piskac. 2013. Complete completion using types and weights. In ACM SIGPLAN Notices , Vol. 48. ACM, 27–38. Google ScholarDigital Library
- Gary G. Hendrix, Earl D. Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. 1978. Developing a Natural Language Interface to Complex Data. TODS 3, 2 (1978), 105–147. Google ScholarDigital Library
- James A. Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE. 273–282. Google ScholarDigital Library
- James A. Jones, Mary Jean Harrold, and John T. Stasko. 2002. Visualization of test information to assist fault localization. In ICSE . 467–477. Google ScholarCross Ref
- Manu Jose and Rupak Majumdar. 2011a. Bug-Assist: assisting fault localization in ANSI-C programs. In CAV. Google ScholarCross Ref
- Manu Jose and Rupak Majumdar. 2011b. Cause clue clauses: error localization using maximum satisfiability. PLDI (2011).Google ScholarDigital Library
- Rohit J. Kate and Raymond J. Mooney. 2006. Using String-Kernels for Learning Semantic Parsers. In ACL. Google ScholarDigital Library
- Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to Transform Natural to Formal Languages. In AAAI. 1062–1068.Google Scholar
- Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. ACM, 542–553.Google Scholar
- Vu Le, Sumit Gulwani, and Zhendong Su. 2013. Smartsynth: Synthesizing smartphone automation scripts from natural language. In MobiSys. 193–206.Google ScholarDigital Library
- Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. PVLDB 8, 1 (2014), 73–84. Google ScholarDigital Library
- Yunyao Li, Huahai Yang, and H. V. Jagadish. 2006. Constructing a Generic Natural Language Interface for an XML Database. In EDBT. 737–754. Google ScholarDigital Library
- Percy Liang and Christopher Potts. 2015. Bringing machine learning and compositional semantics together. Annual Review of Linguistics 1, 1 (2015), 355–376. Google ScholarCross Ref
- Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In ESEC/FSE. Google ScholarDigital Library
- Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. In POPL. Google ScholarDigital Library
- Bill MacCartney and Christopher D Manning. 2009. An extended model of natural logic. In IWCS. 140–156.Google Scholar
- Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL System Demonstrations. 55–60. Google ScholarCross Ref
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.Google Scholar
- Scott Miller, David Stallard, Robert J. Bobrow, and Richard M. Schwartz. 1996. A Fully Statistical Approach to Natural Language Interfaces. In ACL. 55–61. Google ScholarDigital Library
- Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: program repair via semantic analysis. In ICSE. 772–781.Google Scholar
- Peter-Michael Osera and Steve Zdancewic. 2015. Type- and example-directed program synthesis. In PLDI.Google Scholar
- Nadia Polikarpova, Ivan Kuraj, and Armando Solar-Lezama. 2016. Program synthesis from polymorphic refinement types. In PLDI. 522–538. Google ScholarDigital Library
- Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A framework for inductive program synthesis. ACM, 107–126.Google Scholar
- Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability. In COLING.Google Scholar
- Ana-Maria Popescu, Oren Etzioni, and Henry A. Kautz. 2003. Towards a theory of natural language interfaces to databases. In IUI. 149–157.Google Scholar
- Chris Quirk, Raymond Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In ACL. 878–888.Google Scholar
- Mohammad Raza, Sumit Gulwani, and Natasa Milic-Frayling. 2015. Compositional Program Synthesis from Natural Language and Examples. In IJCAI.Google Scholar
- Armando Solar Lezama. 2008. Program Synthesis By Sketching. Ph.D. Dissertation. EECS Department, University of California, Berkeley.Google Scholar
- Armando Solar-Lezama, Rodric M. Rabbah, Rastislav Bodík, and Kemal Ebcioglu. 2005. Programming by sketching for bit-streaming programs. In PLDI. Google ScholarDigital Library
- Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In ASPLOS. 404–415. Google ScholarDigital Library
- Lappoon R Tang and Raymond J Mooney. 2000. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In EMNLP. 133–141.Google Scholar
- Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. 2009. Query by output. In SIGMOD. 535–548. Google ScholarDigital Library
- Chenglong Wang, Alvin Cheung, and Ras Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-Output Examples. In Programming Language Design and Implementation. Google ScholarDigital Library
- David H. D. Warren and Fernando C. N. Pereira. 1982. An Efficient Easily Adaptable System for Interpreting Natural Language Queries. American Journal of Computational Linguistics 8, 3-4 (1982), 110–122.Google Scholar
- Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In ICSE. 364–374.Google Scholar
- Ben Wiedermann, Ali Ibrahim, and William R. Cook. 2008. Interprocedural query extraction for transparent persistence. In OOPSLA . 19–36. Google ScholarDigital Library
- William A Woods, Ronald M Kaplan, and Bonnie Nash-Webber. 1972. The lunar sciences natural language information system. Bolt, Beranek and Newman.Google Scholar
- Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing transformations on hierarchically structured data. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation . ACM, 508–521. Google ScholarDigital Library
- John M. Zelle and Raymond J. Mooney. 1993. Learning Semantic Grammars with Constructive Inductive Logic Programming. In AAAI. 817–822.Google Scholar
- John M. Zelle and Raymond J. Mooney. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In AAAI . 1050–1055.Google Scholar
- Sai Zhang and Yuyin Sun. 2013. Automatically synthesizing SQL queries from input-output examples. In ASE. Google ScholarDigital Library
- Moshé M. Zloof. 1975. Query-by-Example: the Invocation and Definition of Tables and Forms. In VLDB. Google ScholarDigital Library
Index Terms
- SQLizer: query synthesis from natural language
Recommendations
Access to Indexed Hierarchical Databases Using a Relational Query Language
An efficient means of accessing indexed hierarchical databases using a relational query language is presented. The purpose is to achieve an effective sharing of heterogeneous distributed databases. Translation of hierarchical data to an equivalent ...
Query processing over object views of relational data
This paper presents an approach to object view management for relational databases. Such a view mechanism makes it possible for users to transparently work with data in a relational database as if it was stored in an object-oriented (OO) database. A ...
Providing Quality Responses with Natural Language Interfaces: The Null Value Problem
An underlying relational database model and the database query language SQL are assumed, and methods are presented for responding with appropriate answers to null value responses. This is done by using a knowledge base based on RM/T, an extended ...
Comments