research-article

PATSQL: efficient synthesis of SQL queries from example tables with quick inference of projected columns

Authors:
Keita Takenouchi

NTT DATA, Tokyo, Japan

NTT DATA, Tokyo, Japan
View Profile

,
Takashi Ishio

Nara Institute of Science and Technology, Nara, Japan

Nara Institute of Science and Technology, Nara, Japan
View Profile

,
Joji Okada

NTT DATA, Tokyo, Japan

NTT DATA, Tokyo, Japan
View Profile

,
Yuji Sakata

NTT DATA, Tokyo, Japan

NTT DATA, Tokyo, Japan
View Profile

Proceedings of the VLDB Endowment Volume 14 Issue 11pp 1937–1949https://doi.org/10.14778/3476249.3476253

Published:01 July 2021Publication History

Proceedings of the VLDB Endowment

Abstract

SQL is one of the most popular tools for data analysis, and it is now used by an increasing number of users without having expertise in databases. Several studies have proposed programming-by-example approaches to help such non-experts to write correct SQL queries. While existing methods support a variety of SQL features such as aggregation and nested query, they suffer a significant increase in computational cost as the scale of example tables increases. In this paper, we propose an efficient algorithm utilizing properties known in relational algebra to synthesize SQL queries from input and output tables. Our key insight is that a projection operator in a program sketch can be lifted above other operators by applying transformation rules in relational algebra, while preserving the semantics of the program. This enables a quick inference of appropriate columns in the projection operator, which is an essential component in synthesis but causes combinatorial explosions in prior work. We also introduce a novel form of constraints and its top-down propagation mechanism for efficient sketch completion. We implemented this algorithm in our tool PATSQL and evaluated it on 226 queries from prior benchmarks and Kaggle's tutorials. As a result, PATSQL solved 68% of the benchmarks and found 89% of the solutions within a second. Our tool is available at https://naist-se.github.io/patsql/.

References

Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28 (2019), 793--819.Google ScholarDigital Library
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys 51, 4, Article 81 (2018). Google ScholarDigital Library
Christopher Baik, Zhongjun Jin, Michael Cafarella, and H. V. Jagadish. 2020. Duoquest: A Dual-Specification System for Expressive SQL Queries. CoRR abs/2003.07438 (2020). Google ScholarDigital Library
Matej Balog, Alexander Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2017. DeepCoder: Learning to Write Programs. In Proceedings of the International Conference on Learning Representations.Google Scholar
Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, and Ion Stoica. 2019. AutoPandas: Neural-Backed Generators for Program Synthesis. Proceedings of the ACM on Programming Languages (OOPSLA) 3, 1--27. Google ScholarDigital Library
Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4560--4565.Google ScholarCross Ref
Lin Cheng. 2019. SqlSol: An accurate SQL Query Synthesizer. In International Conference on Formal Engineering Methods. 104--120.Google ScholarCross Ref
Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2012. Inferring SQL Queries Using Program Synthesis. CoRR abs/1208.2013 (2012).Google Scholar
Jan Van den Bussche and Stijn Vansummeren. 2009. Translating SQL into the relational algebra. Course notes, Hasselt University and Université Libre de Bruxelles (2009).Google Scholar
Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. In Proceedings of the 34th International Conference on Machine Learning. 990--998. Google ScholarDigital Library
Anna Fariha and Alexandra Meliou. 2019. Example-Driven Query Intent Discovery: Abductive Reasoning Using Semantic Similarity. Proceedings of the VLDB Endowment 12, 11 (2019), 1262--1275. Google ScholarDigital Library
Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 422--436. Google ScholarDigital Library
Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book, Second Edition. Prentice Hall Press, Upper Saddle River, NJ, USA. Google ScholarDigital Library
Sumit Gulwani. 2011. Automating String Processing in Spreadsheets using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. Google ScholarDigital Library
Sumit Gulwani and Prateek Jain. 2019. Programming by Examples: PL meets ML. Dependable Software Systems Engineering (2019).Google Scholar
Sumit Gulwani, Alex Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1--2 (2017), 1--119.Google ScholarCross Ref
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4524--4535.Google ScholarCross Ref
Shivam Handa and Martin C. Rinard. 2020. Inductive program synthesis over noisy data. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 87--98. Google ScholarDigital Library
Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen tau Yih, and Xiaodong He. 2018. Natural Language to Structured Query Generation via Meta-Learning. In The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.Google ScholarCross Ref
Dmitri V. Kalashnikov, Laks V.S. Lakshmanan, and Divesh Srivastava. 2018. FastQRE: Fast Query Reverse Engineering. In Proceedings of the 2018 International Conference on Management of Data. 337--350. Google ScholarDigital Library
Neel Kant. 2018. Recent Advances in Neural Program Synthesis. CoRR abs/1802.02353 (2018).Google Scholar
Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2018. Data Scientists in Software Teams: State of the Art and Challenges. IEEE Transactions on Software Engineering 44, 11 (2018), 1024--1038. Google ScholarDigital Library
Juraj Kubelka, Romain Robbes, and Alexandre Bergel. 2018. The road to live programming: insights from the practice. In Proceedings of the 40th International Conference on Software Engineering. 1090--1101. Google ScholarDigital Library
Vu Le and Sumit Gulwani. 2014. FlashExtract: A Framework for Data Extraction by Examples. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, NY, USA, 542--553. Google ScholarDigital Library
Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. 2018. Accelerating Search-based Program Synthesis Using Learned Probabilistic Models. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 436--449. Google ScholarDigital Library
Denis Mayr Lima Martins. 2019. Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities. Information Systems 83 (2019), 89--100.Google ScholarCross Ref
Ruben Martins, Jia Chen, Yanju Chen, Yu Feng, and Isil Dillig. 2019. Trinity: An Extensible Synthesis Framework for Data Science. Proceedings of the VLDB Endowment 12, 12 (2019), 1914--1917. Google ScholarDigital Library
Aditya Krishna Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Tauman Kalai. 2013. A Machine Learning Framework for Programming by Example. In Proceedings of the 30th International Conference on Machine Learning. 187--195. Google ScholarDigital Library
Fiona Nah. 2003. A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait?. In The Americas Conference on Information Systems. 2212--2222.Google Scholar
Maxwell I. Nye, Luke B. Hewitt, Joshua B. Tenenbaum, and Armando Solar-Lezama. 2019. Learning to Infer Program Sketches. In Proceedings of the 36th International Conference on Machine Learning. 4861--4870.Google Scholar
Pedro Orvalho, Miguel Terra-Neves, Miguel Ventura, Ruben Martins, and Vasco Manquinho. 2020. SQUARES : A SQL Synthesizer Using Query Reverse Engineering. Proceedings of the VLDB Endowment 13, 12 (2020), 2853--2856. Google ScholarDigital Library
Kensen Shi, David W. Bieber, and Rishabh Singh. 2020. TF-Coder: Program Synthesis for Tensor Manipulations. CoRR abs/2003.09040 (2020).Google Scholar
Alyssa C. Smith, Brandon C. W. Ralph, Jeremy Marty-Dugas, and Daniel Smilek. 2019. Loading... loading... The influence of download time on information search. PLOS ONE 14 (2019), 1--24.Google ScholarCross Ref
Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 326--340. Google ScholarDigital Library
Calvin Smith and Aws Albarghouthi. 2019. Program Synthesis with Equivalence Reduction. In International Conference on Verification, Model Checking, and Abstract Interpretation. 24--47.Google Scholar
Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. 2017. Reverse Engineering Aggregation Queries. Proceedings of the VLDB Endowment 10, 11, 1394--1405. Google ScholarDigital Library
Chenglong Wang, Alvin Cheung, and Rastislav Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-output Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 452--466. Google ScholarDigital Library
Chenglong Wang, Yu Feng, Rastislav Bodik, Alvin Cheung, and Isil Dillig. 2020. Visualization by Example. In Proceedings of the 47th ACM SIGPLAN Symposium on Principles of Programming Languages.Google ScholarDigital Library
Yuepeng Wang, Rushi Shah, Abby Criswell, Rong Pan, and Isil Dillig. 2020. Data Migration Using Datalog Program Synthesis. Proceedings of the VLDB Endowment 13, 7 (2020), 1006--1019. Google ScholarDigital Library
Navid Yaghmazadeh, Xinyu Wang, and Isil Dillig. 2018. Automated Migration of Hierarchical Data to Relational Tables Using Programming-by-Example. Proceedings of the VLDB Endowment 11, 5 (2018), 580--593. Google ScholarDigital Library
Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018. SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1653--1663.Google ScholarCross Ref
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.Google ScholarCross Ref
Sai Zhang and Yuyin Sun. 2013. Automatically Synthesizing SQL Queries from Input-output Examples. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. 224--234. Google ScholarDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 14, Issue 11
July 2021
732 pages
ISSN:2150-8097
Editors:
Xin Luna Dong
Amazon
,
Felix Naumann
HPI, University of Potsdam
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2021
Published in pvldb Volume 14, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 87
  Total Downloads
- Downloads (Last 12 months)12
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

PATSQL: efficient synthesis of SQL queries from example tables with quick inference of projected columns

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Sql: Learn Basics of Queries and Implement Easily (sql programming, SQL 2016, sql database programming, sql for beginners, sql beginners guide, sql ... sql workbook, sql guide, MSSQL) (Volume 1)

Oracle High-Performance SQL Tuning

SQL: The Complete Reference, Second Edition

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

PATSQL: efficient synthesis of SQL queries from example tables with quick inference of projected columns

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Sql: Learn Basics of Queries and Implement Easily (sql programming, SQL 2016, sql database programming, sql for beginners, sql beginners guide, sql ... sql workbook, sql guide, MSSQL) (Volume 1)

Oracle High-Performance SQL Tuning

SQL: The Complete Reference, Second Edition

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media