skip to main content
research-article

PATSQL: efficient synthesis of SQL queries from example tables with quick inference of projected columns

Published:01 July 2021Publication History
Skip Abstract Section

Abstract

SQL is one of the most popular tools for data analysis, and it is now used by an increasing number of users without having expertise in databases. Several studies have proposed programming-by-example approaches to help such non-experts to write correct SQL queries. While existing methods support a variety of SQL features such as aggregation and nested query, they suffer a significant increase in computational cost as the scale of example tables increases. In this paper, we propose an efficient algorithm utilizing properties known in relational algebra to synthesize SQL queries from input and output tables. Our key insight is that a projection operator in a program sketch can be lifted above other operators by applying transformation rules in relational algebra, while preserving the semantics of the program. This enables a quick inference of appropriate columns in the projection operator, which is an essential component in synthesis but causes combinatorial explosions in prior work. We also introduce a novel form of constraints and its top-down propagation mechanism for efficient sketch completion. We implemented this algorithm in our tool PATSQL and evaluated it on 226 queries from prior benchmarks and Kaggle's tutorials. As a result, PATSQL solved 68% of the benchmarks and found 89% of the solutions within a second. Our tool is available at https://naist-se.github.io/patsql/.

References

  1. Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28 (2019), 793--819.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. Comput. Surveys 51, 4, Article 81 (2018). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Christopher Baik, Zhongjun Jin, Michael Cafarella, and H. V. Jagadish. 2020. Duoquest: A Dual-Specification System for Expressive SQL Queries. CoRR abs/2003.07438 (2020). Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Matej Balog, Alexander Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2017. DeepCoder: Learning to Write Programs. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  5. Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, and Ion Stoica. 2019. AutoPandas: Neural-Backed Generators for Program Synthesis. Proceedings of the ACM on Programming Languages (OOPSLA) 3, 1--27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Ben Bogin, Jonathan Berant, and Matt Gardner. 2019. Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4560--4565.Google ScholarGoogle ScholarCross RefCross Ref
  7. Lin Cheng. 2019. SqlSol: An accurate SQL Query Synthesizer. In International Conference on Formal Engineering Methods. 104--120.Google ScholarGoogle ScholarCross RefCross Ref
  8. Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2012. Inferring SQL Queries Using Program Synthesis. CoRR abs/1208.2013 (2012).Google ScholarGoogle Scholar
  9. Jan Van den Bussche and Stijn Vansummeren. 2009. Translating SQL into the relational algebra. Course notes, Hasselt University and Université Libre de Bruxelles (2009).Google ScholarGoogle Scholar
  10. Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. RobustFill: Neural Program Learning under Noisy I/O. In Proceedings of the 34th International Conference on Machine Learning. 990--998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Anna Fariha and Alexandra Meliou. 2019. Example-Driven Query Intent Discovery: Abductive Reasoning Using Semantic Similarity. Proceedings of the VLDB Endowment 12, 11 (2019), 1262--1275. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 422--436. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book, Second Edition. Prentice Hall Press, Upper Saddle River, NJ, USA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Sumit Gulwani. 2011. Automating String Processing in Spreadsheets using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Sumit Gulwani and Prateek Jain. 2019. Programming by Examples: PL meets ML. Dependable Software Systems Engineering (2019).Google ScholarGoogle Scholar
  16. Sumit Gulwani, Alex Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1--2 (2017), 1--119.Google ScholarGoogle ScholarCross RefCross Ref
  17. Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4524--4535.Google ScholarGoogle ScholarCross RefCross Ref
  18. Shivam Handa and Martin C. Rinard. 2020. Inductive program synthesis over noisy data. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 87--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen tau Yih, and Xiaodong He. 2018. Natural Language to Structured Query Generation via Meta-Learning. In The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.Google ScholarGoogle ScholarCross RefCross Ref
  20. Dmitri V. Kalashnikov, Laks V.S. Lakshmanan, and Divesh Srivastava. 2018. FastQRE: Fast Query Reverse Engineering. In Proceedings of the 2018 International Conference on Management of Data. 337--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Neel Kant. 2018. Recent Advances in Neural Program Synthesis. CoRR abs/1802.02353 (2018).Google ScholarGoogle Scholar
  22. Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel. 2018. Data Scientists in Software Teams: State of the Art and Challenges. IEEE Transactions on Software Engineering 44, 11 (2018), 1024--1038. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Juraj Kubelka, Romain Robbes, and Alexandre Bergel. 2018. The road to live programming: insights from the practice. In Proceedings of the 40th International Conference on Software Engineering. 1090--1101. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Vu Le and Sumit Gulwani. 2014. FlashExtract: A Framework for Data Extraction by Examples. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, NY, USA, 542--553. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Woosuk Lee, Kihong Heo, Rajeev Alur, and Mayur Naik. 2018. Accelerating Search-based Program Synthesis Using Learned Probabilistic Models. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. 436--449. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Denis Mayr Lima Martins. 2019. Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities. Information Systems 83 (2019), 89--100.Google ScholarGoogle ScholarCross RefCross Ref
  27. Ruben Martins, Jia Chen, Yanju Chen, Yu Feng, and Isil Dillig. 2019. Trinity: An Extensible Synthesis Framework for Data Science. Proceedings of the VLDB Endowment 12, 12 (2019), 1914--1917. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Aditya Krishna Menon, Omer Tamuz, Sumit Gulwani, Butler Lampson, and Adam Tauman Kalai. 2013. A Machine Learning Framework for Programming by Example. In Proceedings of the 30th International Conference on Machine Learning. 187--195. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Fiona Nah. 2003. A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait?. In The Americas Conference on Information Systems. 2212--2222.Google ScholarGoogle Scholar
  30. Maxwell I. Nye, Luke B. Hewitt, Joshua B. Tenenbaum, and Armando Solar-Lezama. 2019. Learning to Infer Program Sketches. In Proceedings of the 36th International Conference on Machine Learning. 4861--4870.Google ScholarGoogle Scholar
  31. Pedro Orvalho, Miguel Terra-Neves, Miguel Ventura, Ruben Martins, and Vasco Manquinho. 2020. SQUARES : A SQL Synthesizer Using Query Reverse Engineering. Proceedings of the VLDB Endowment 13, 12 (2020), 2853--2856. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Kensen Shi, David W. Bieber, and Rishabh Singh. 2020. TF-Coder: Program Synthesis for Tensor Manipulations. CoRR abs/2003.09040 (2020).Google ScholarGoogle Scholar
  33. Alyssa C. Smith, Brandon C. W. Ralph, Jeremy Marty-Dugas, and Daniel Smilek. 2019. Loading... loading... The influence of download time on information search. PLOS ONE 14 (2019), 1--24.Google ScholarGoogle ScholarCross RefCross Ref
  34. Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 326--340. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Calvin Smith and Aws Albarghouthi. 2019. Program Synthesis with Equivalence Reduction. In International Conference on Verification, Model Checking, and Abstract Interpretation. 24--47.Google ScholarGoogle Scholar
  36. Wei Chit Tan, Meihui Zhang, Hazem Elmeleegy, and Divesh Srivastava. 2017. Reverse Engineering Aggregation Queries. Proceedings of the VLDB Endowment 10, 11, 1394--1405. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Chenglong Wang, Alvin Cheung, and Rastislav Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-output Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 452--466. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Chenglong Wang, Yu Feng, Rastislav Bodik, Alvin Cheung, and Isil Dillig. 2020. Visualization by Example. In Proceedings of the 47th ACM SIGPLAN Symposium on Principles of Programming Languages.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Yuepeng Wang, Rushi Shah, Abby Criswell, Rong Pan, and Isil Dillig. 2020. Data Migration Using Datalog Program Synthesis. Proceedings of the VLDB Endowment 13, 7 (2020), 1006--1019. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Navid Yaghmazadeh, Xinyu Wang, and Isil Dillig. 2018. Automated Migration of Hierarchical Data to Relational Tables Using Programming-by-Example. Proceedings of the VLDB Endowment 11, 5 (2018), 580--593. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, and Dragomir Radev. 2018. SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1653--1663.Google ScholarGoogle ScholarCross RefCross Ref
  42. Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.Google ScholarGoogle ScholarCross RefCross Ref
  43. Sai Zhang and Yuyin Sun. 2013. Automatically Synthesizing SQL Queries from Input-output Examples. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering. 224--234. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 14, Issue 11
    July 2021
    732 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2021
    Published in pvldb Volume 14, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader