Abstract
SQL is one of the most popular tools for data analysis, and it is now used by an increasing number of users without having expertise in databases. Several studies have proposed programming-by-example approaches to help such non-experts to write correct SQL queries. While existing methods support a variety of SQL features such as aggregation and nested query, they suffer a significant increase in computational cost as the scale of example tables increases. In this paper, we propose an efficient algorithm utilizing properties known in relational algebra to synthesize SQL queries from input and output tables. Our key insight is that a projection operator in a program sketch can be lifted above other operators by applying transformation rules in relational algebra, while preserving the semantics of the program. This enables a quick inference of appropriate columns in the projection operator, which is an essential component in synthesis but causes combinatorial explosions in prior work. We also introduce a novel form of constraints and its top-down propagation mechanism for efficient sketch completion. We implemented this algorithm in our tool PATSQL and evaluated it on 226 queries from prior benchmarks and Kaggle's tutorials. As a result, PATSQL solved 68% of the benchmarks and found 89% of the solutions within a second. Our tool is available at https://naist-se.github.io/patsql/.
- Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28 (2019), 793--819.Google ScholarDigital Library
- Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys 51, 4, Article 81 (2018). Google ScholarDigital Library
- Christopher Baik, Zhongjun Jin, Michael Cafarella, and H. V. Jagadish. 2020. Duoquest: A Dual-Specification System for Expressive SQL Queries. CoRR abs/2003.07438 (2020). Google ScholarDigital Library
- Matej Balog, Alexander Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2017. DeepCoder: Learning to Write Programs. In Proceedings of the International Conference on Learning Representations.Google Scholar
- Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, and Ion Stoica. 2019. AutoPandas: Neural-Backed Generators for Program Synthesis. Proceedings of the ACM on Programming Languages (OOPSLA) 3, 1--27. Google ScholarDigital Library
- Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4560--4565.Google ScholarCross Ref
- Lin Cheng. 2019. SqlSol: An accurate SQL Query Synthesizer. In International Conference on Formal Engineering Methods. 104--120.Google ScholarCross Ref
- Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2012. Inferring SQL Queries Using Program Synthesis. CoRR abs/1208.2013 (2012).Google Scholar
- Jan Van den Bussche and Stijn Vansummeren. 2009. Translating SQL into the relational algebra. Course notes, Hasselt University and Université Libre de Bruxelles (2009).Google Scholar
- Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. In Proceedings of the 34th International Conference on Machine Learning. 990--998. Google ScholarDigital Library
- Anna Fariha and Alexandra Meliou. 2019. Example-Driven Query Intent Discovery: Abductive Reasoning Using Semantic Similarity. Proceedings of the VLDB Endowment 12, 11 (2019), 1262--1275. Google ScholarDigital Library
- Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 422--436. Google ScholarDigital Library
- Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book, Second Edition. Prentice Hall Press, Upper Saddle River, NJ, USA. Google ScholarDigital Library
- Sumit Gulwani. 2011. Automating String Processing in Spreadsheets using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. Google ScholarDigital Library
- Sumit Gulwani and Prateek Jain. 2019. Programming by Examples: PL meets ML. Dependable Software Systems Engineering (2019).Google Scholar
- Sumit Gulwani, Alex Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1--2 (2017), 1--119.Google ScholarCross Ref
- Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4524--4535.Google ScholarCross Ref
- Shivam Handa and Martin C. Rinard. 2020. Inductive program synthesis over noisy data. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 87--98. Google ScholarDigital Library
- Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen tau Yih, and Xiaodong He. 2018. Natural Language to Structured Query Generation via Meta-Learning. In The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.Google ScholarCross Ref
- Dmitri V. Kalashnikov, Laks V.S. Lakshmanan, and Divesh Srivastava. 2018. FastQRE: Fast Query Reverse Engineering. In Proceedings of the 2018 International Conference on Management of Data. 337--350. Google ScholarDigital Library
- Neel Kant. 2018. Recent Advances in Neural Program Synthesis. CoRR abs/1802.02353 (2018).Google Scholar
- Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2018. Data Scientists in Software Teams: State of the Art and Challenges. IEEE Transactions on Software Engineering 44, 11 (2018), 1024--1038. Google ScholarDigital Library
- Juraj Kubelka, Romain Robbes, and Alexandre Bergel. 2018. The road to live programming: insights from the practice. In Proceedings of the 40th International Conference on Software Engineering. 1090--1101. Google ScholarDigital Library
- Vu Le and Sumit Gulwani. 2014. FlashExtract: A Framework for Data Extraction by Examples. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, NY, USA, 542--553. Google ScholarDigital Library
- Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. 2018. Accelerating Search-based Program Synthesis Using Learned Probabilistic Models. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 436--449. Google ScholarDigital Library
- Denis Mayr Lima Martins. 2019. Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities. Information Systems 83 (2019), 89--100.Google ScholarCross Ref
- Ruben Martins, Jia Chen, Yanju Chen, Yu Feng, and Isil Dillig. 2019. Trinity: An Extensible Synthesis Framework for Data Science. Proceedings of the VLDB Endowment 12, 12 (2019), 1914--1917. Google ScholarDigital Library
- Aditya Krishna Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Tauman Kalai. 2013. A Machine Learning Framework for Programming by Example. In Proceedings of the 30th International Conference on Machine Learning. 187--195. Google ScholarDigital Library
- Fiona Nah. 2003. A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait?. In The Americas Conference on Information Systems. 2212--2222.Google Scholar
- Maxwell I. Nye, Luke B. Hewitt, Joshua B. Tenenbaum, and Armando Solar-Lezama. 2019. Learning to Infer Program Sketches. In Proceedings of the 36th International Conference on Machine Learning. 4861--4870.Google Scholar
- Pedro Orvalho, Miguel Terra-Neves, Miguel Ventura, Ruben Martins, and Vasco Manquinho. 2020. SQUARES : A SQL Synthesizer Using Query Reverse Engineering. Proceedings of the VLDB Endowment 13, 12 (2020), 2853--2856. Google ScholarDigital Library
- Kensen Shi, David W. Bieber, and Rishabh Singh. 2020. TF-Coder: Program Synthesis for Tensor Manipulations. CoRR abs/2003.09040 (2020).Google Scholar
- Alyssa C. Smith, Brandon C. W. Ralph, Jeremy Marty-Dugas, and Daniel Smilek. 2019. Loading... loading... The influence of download time on information search. PLOS ONE 14 (2019), 1--24.Google ScholarCross Ref
- Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 326--340. Google ScholarDigital Library
- Calvin Smith and Aws Albarghouthi. 2019. Program Synthesis with Equivalence Reduction. In International Conference on Verification, Model Checking, and Abstract Interpretation. 24--47.Google Scholar
- Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. 2017. Reverse Engineering Aggregation Queries. Proceedings of the VLDB Endowment 10, 11, 1394--1405. Google ScholarDigital Library
- Chenglong Wang, Alvin Cheung, and Rastislav Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-output Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 452--466. Google ScholarDigital Library
- Chenglong Wang, Yu Feng, Rastislav Bodik, Alvin Cheung, and Isil Dillig. 2020. Visualization by Example. In Proceedings of the 47th ACM SIGPLAN Symposium on Principles of Programming Languages.Google ScholarDigital Library
- Yuepeng Wang, Rushi Shah, Abby Criswell, Rong Pan, and Isil Dillig. 2020. Data Migration Using Datalog Program Synthesis. Proceedings of the VLDB Endowment 13, 7 (2020), 1006--1019. Google ScholarDigital Library
- Navid Yaghmazadeh, Xinyu Wang, and Isil Dillig. 2018. Automated Migration of Hierarchical Data to Relational Tables Using Programming-by-Example. Proceedings of the VLDB Endowment 11, 5 (2018), 580--593. Google ScholarDigital Library
- Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018. SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1653--1663.Google ScholarCross Ref
- Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.Google ScholarCross Ref
- Sai Zhang and Yuyin Sun. 2013. Automatically Synthesizing SQL Queries from Input-output Examples. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. 224--234. Google ScholarDigital Library
Comments