ABSTRACT
Classical query optimization relies on a predefined set of rewrite rules to re-order and substitute SQL operators at a logical level. This paper proposes Blitz, a system that can synthesize efficient query-specific operators using automated program reasoning. Blitz uses static analysis to identify sub-queries as potential targets for optimization. For each sub-query, it constructs a template that defines a large space of possible operator implementations, all restricted to have linear time and space complexity. Blitz then employs program synthesis to instantiate the template and obtain a data-parallel operator implementation that is functionally equivalent to the original sub-query up to a bound on the input size.
Program synthesis is an undecidable problem in general and often difficult to scale, even for bounded inputs. Blitz therefore uses a series of analyses to judiciously use program synthesis and incrementally construct complex operators.
We integrated Blitz with existing big-data query languages by embedding the synthesized operators back into the query as User Defined Operators. We evaluated Blitz on several production queries from Microsoft running on two state-of-the-art query engines: SparkSQL as well as Scope, the big-data engine of Microsoft. Blitz produces correct optimizations despite the synthesis being bounded. The resulting queries have much more succinct query plans and demonstrate significant performance improvements on both big-data systems (1.3x --- 4.7x).
Supplemental Material
- Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383--1394. Google ScholarDigital Library
- Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, and Lidong Zhou. 2014. Apollo: Scalable and Coordinated Scheduling for Cloud-scale Computing. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 285--300. http://dl.acm.org/citation.cfm?id=2685048.2685071 Google ScholarDigital Library
- Damianos Chatziantoniou and Kenneth A. Ross. 1997. Groupwise Processing of Relational Queries. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 476--485. http://dl.acm.org/citation.cfm?id=645923.671003 Google ScholarDigital Library
- Damianos Chatziantoniou and Kenneth A. Ross. 2007. Partitioned Optimization of Complex Queries. Inf. Syst. 32, 2 (April 2007), 248--282. Google ScholarDigital Library
- Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing Database-backed Applications with Query Synthesis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 3--14. Google ScholarDigital Library
- Shumo Chu, Konstantin Weitz, Alvin Cheung, and Dan Suciu. 2017. HoTTSQL: Proving Query Rewrites with Univalent SQL Semantics. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 510--524. Google ScholarDigital Library
- Przemysław Daca, Thomas A. Henzinger, and Andrey Kupriyanov. 2016. Array Folds Logic. 230--248.Google Scholar
- César Galindo-Legaria and Milind Joshi. 2001. Orthogonal Optimization ofSubqueries and Aggregation. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD '01). ACM, New York, NY, USA, 571--581. Google ScholarDigital Library
- Diego Garbervetsky, Zvonimir Pavlinovic, Michael Barnett, Madanlal Musuvathi, Todd Mytkowicz, and Edgardo Zoppi. 2017. Static Analysis for Optimizing Big Data Queries. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). ACM, New York, NY, USA, 932--937. Google ScholarDigital Library
- Shelly Grossman, Sara Cohen, Shachar Itzhaky, Noam Rinetzky, and Mooly Sagiv. 2017. Verifying Equivalence of Spark Programs. Springer International Publishing, Cham, 282--300.Google Scholar
- Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Commun. ACM 55, 8 (2012), 97--105. Google ScholarDigital Library
- Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou. 2012. Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). USENIX, Hollywood, CA, 121--133. https://www.usenix.org/conference/osdi12/technical-sessions/presentation/guo Google ScholarDigital Library
- Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Red-shift and the Case for Simpler Data Warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1917--1923. Google ScholarDigital Library
- Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O'Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, and Xiaodong Zhang. 2014. Major Technical Advancements in Apache Hive. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, New York, NY, USA, 1235--1246. Google ScholarDigital Library
- Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Inigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. 2016. Morpheus: Towards Automated SLOs for Enterprise Clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, GA, 117--134. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/jyothi Google ScholarDigital Library
- Viktor Leis, Kan Kundhikanjana, Alfons Kemper, and Thomas Neumann. 2015. Efficient Processing of Window Functions in Analytical SQL Queries. Proc. VLDB Endow. 8, 10, 1058--1069. Google ScholarDigital Library
- Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Parallelizing User-defined Aggregations Using Symbolic Execution. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 153--167. Google ScholarDigital Library
- Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '16). ACM, New York, NY, USA, 326--340. Google ScholarDigital Library
- Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. 2006. Combinatorial sketching for finite programs. ACM SIGOPS Operating Systems Review 40, 5 (2006), 404--415. Google ScholarDigital Library
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A Warehousing Solution over a Map-reduce Framework. Proc. VLDB Endow. 2, 2 (Aug. 2009), 1626--1629. Google ScholarDigital Library
- TPC. 2016. TPCx-BB Benchmark. (2016). http//:www.tpc.org/tpcx-bb/Google Scholar
- TPC. 2017. TPC-DS Benchmark. (2017). http//:www.tpc.org/tpcds/Google Scholar
- Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 5, 16 pages. Google ScholarDigital Library
- Chenglong Wang, Alvin Cheung, and Rastislav Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-output Examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 452--466. Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113 Google ScholarDigital Library
- Sai Zhang and Yuyin Sun. 2013. Automatically synthesizing SQL queries from input-output examples. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013, Silicon Valley, CA, USA, November 11-15, 2013. IEEE Computer Society, Washington, DC, USA, 224--234. Google ScholarDigital Library
- Jingren Zhou, Nicolas Bruno, and Wei Lin. 2012. Advanced Partitioning Techniques for Massively Distributed Computation. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 13--24. Google ScholarDigital Library
- Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarDigital Library
- Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel Databases Meet MapReduce. The VLDB Journal 21, 5 (Oct. 2012), 611--636. Google ScholarDigital Library
- Calisto Zuzarte, Hamid Pirahesh, Wenbin Ma, Qi Cheng, Linqi Liu, and Kwai Wong. 2003. WinMagic: Subquery Elimination Using Window Aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03). ACM, New York, NY, USA, 652--656. Google ScholarDigital Library
Index Terms
- Optimizing Big-Data Queries Using Program Synthesis
Recommendations
Equivalence and minimization of conjunctive queries under combined semantics
ICDT '12: Proceedings of the 15th International Conference on Database TheoryThe problems of query containment, equivalence, and minimization are fundamental problems in the context of query processing and optimization. In their classic work [2] published in 1977, Chandra and Merlin solved the three problems for the language of ...
Automated Translation of Functional Big Data Queries to SQL
Big data analytics frameworks like Apache Spark and Flink enable users to implement queries over large, distributed databases using functional APIs. In recent years, these APIs have grown in popularity because their functional interfaces abstract away ...
Optimizing queries using materialized views: a practical, scalable solution
Materialized views can provide massive improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer must know how and when to exploit materialized views. This paper presents ...
Comments