skip to main content
10.1145/3132747.3132773acmconferencesArticle/Chapter ViewAbstractPublication PagessospConference Proceedingsconference-collections
research-article

Optimizing Big-Data Queries Using Program Synthesis

Published:14 October 2017Publication History

ABSTRACT

Classical query optimization relies on a predefined set of rewrite rules to re-order and substitute SQL operators at a logical level. This paper proposes Blitz, a system that can synthesize efficient query-specific operators using automated program reasoning. Blitz uses static analysis to identify sub-queries as potential targets for optimization. For each sub-query, it constructs a template that defines a large space of possible operator implementations, all restricted to have linear time and space complexity. Blitz then employs program synthesis to instantiate the template and obtain a data-parallel operator implementation that is functionally equivalent to the original sub-query up to a bound on the input size.

Program synthesis is an undecidable problem in general and often difficult to scale, even for bounded inputs. Blitz therefore uses a series of analyses to judiciously use program synthesis and incrementally construct complex operators.

We integrated Blitz with existing big-data query languages by embedding the synthesized operators back into the query as User Defined Operators. We evaluated Blitz on several production queries from Microsoft running on two state-of-the-art query engines: SparkSQL as well as Scope, the big-data engine of Microsoft. Blitz produces correct optimizations despite the synthesis being bounded. The resulting queries have much more succinct query plans and demonstrate significant performance improvements on both big-data systems (1.3x --- 4.7x).

Skip Supplemental Material Section

Supplemental Material

opt_query.mp4

mp4

2.1 GB

References

  1. Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383--1394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. 2014. Apollo: Scalable and Coordinated Scheduling for Cloud-scale Computing. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 285--300. http://dl.acm.org/citation.cfm?id=2685048.2685071 Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Damianos Chatziantoniou and Kenneth A. Ross. 1997. Groupwise Processing of Relational Queries. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 476--485. http://dl.acm.org/citation.cfm?id=645923.671003 Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Damianos Chatziantoniou and Kenneth A. Ross. 2007. Partitioned Optimization of Complex Queries. Inf. Syst. 32, 2 (April 2007), 248--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing Database-backed Applications with Query Synthesis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 3--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Shumo Chu, Konstantin Weitz, Alvin Cheung, and Dan Suciu. 2017. HoTTSQL: Proving Query Rewrites with Univalent SQL Semantics. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 510--524. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Przemysław Daca, Thomas A. Henzinger, and Andrey Kupriyanov. 2016. Array Folds Logic. 230--248.Google ScholarGoogle Scholar
  8. César Galindo-Legaria and Milind Joshi. 2001. Orthogonal Optimization ofSubqueries and Aggregation. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD '01). ACM, New York, NY, USA, 571--581. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Diego Garbervetsky, Zvonimir Pavlinovic, Michael Barnett, Madanlal Musuvathi, Todd Mytkowicz, and Edgardo Zoppi. 2017. Static Analysis for Optimizing Big Data Queries. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 932--937. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Shelly Grossman, Sara Cohen, Shachar Itzhaky, Noam Rinetzky, and Mooly Sagiv. 2017. Verifying Equivalence of Spark Programs. Springer International Publishing, Cham, 282--300.Google ScholarGoogle Scholar
  11. Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Commun. ACM 55, 8 (2012), 97--105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou. 2012. Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). USENIX, Hollywood, CA, 121--133. https://www.usenix.org/conference/osdi12/technical-sessions/presentation/guo Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Red-shift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1917--1923. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. 2014. Major Technical Advancements in Apache Hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 1235--1246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Inigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. 2016. Morpheus: Towards Automated SLOs for Enterprise Clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, GA, 117--134. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/jyothi Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Viktor Leis, Kan Kundhikanjana, Alfons Kemper, and Thomas Neumann. 2015. Efficient Processing of Window Functions in Analytical SQL Queries. Proc. VLDB Endow. 8, 10, 1058--1069. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Parallelizing User-defined Aggregations Using Symbolic Execution. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 153--167. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16). ACM, New York, NY, USA, 326--340. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. 2006. Combinatorial sketching for finite programs. ACM SIGOPS Operating Systems Review 40, 5 (2006), 404--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A Warehousing Solution over a Map-reduce Framework. Proc. VLDB Endow. 2, 2 (Aug. 2009), 1626--1629. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. TPC. 2016. TPCx-BB Benchmark. (2016). http//:www.tpc.org/tpcx-bb/Google ScholarGoogle Scholar
  22. TPC. 2017. TPC-DS Benchmark. (2017). http//:www.tpc.org/tpcds/Google ScholarGoogle Scholar
  23. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 5, 16 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Chenglong Wang, Alvin Cheung, and Rastislav Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-output Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 452--466. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113 Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Sai Zhang and Yuyin Sun. 2013. Automatically synthesizing SQL queries from input-output examples. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, Silicon Valley, CA, USA, November 11-15, 2013. IEEE Computer Society, Washington, DC, USA, 224--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Jingren Zhou, Nicolas Bruno, and Wei Lin. 2012. Advanced Partitioning Techniques for Massively Distributed Computation. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Calisto Zuzarte, Hamid Pirahesh, Wenbin Ma, Qi Cheng, Linqi Liu, and Kwai Wong. 2003. WinMagic: Subquery Elimination Using Window Aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03). ACM, New York, NY, USA, 652--656. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Optimizing Big-Data Queries Using Program Synthesis

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in
            • Published in

              cover image ACM Conferences
              SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
              October 2017
              677 pages
              ISBN:9781450350853
              DOI:10.1145/3132747

              Copyright © 2017 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 14 October 2017

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Research
              • Refereed limited

              Acceptance Rates

              Overall Acceptance Rate131of716submissions,18%

              Upcoming Conference

              SOSP '24

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader