
Optimizing Big-Data Queries Using Program Synthesis

Authors:
Matthias Schlaipfer, TU Wien, Vienna, Austria
Kaushik Rajan, Microsoft Research, Bangalore, India
Akash Lal, Microsoft Research, Bangalore, India
Malavika Samak, MIT CSAIL, Cambridge, USA

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles, October 2017, Pages 631–646
https://doi.org/10.1145/3132747.3132773

Published: 14 October 2017

ABSTRACT

Classical query optimization relies on a predefined set of rewrite rules to re-order and substitute SQL operators at a logical level. This paper proposes Blitz, a system that can synthesize efficient query-specific operators using automated program reasoning. Blitz uses static analysis to identify sub-queries as potential targets for optimization. For each sub-query, it constructs a template that defines a large space of possible operator implementations, all restricted to have linear time and space complexity. Blitz then employs program synthesis to instantiate the template and obtain a data-parallel operator implementation that is functionally equivalent to the original sub-query up to a bound on the input size.
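
The template-and-verify loop described above can be illustrated with a toy sketch. It is not Blitz's implementation: the reference sub-query, the tiny template of single-pass per-key folds, the candidate update functions, and all names (`reference`, `UPDATES`, `make_operator`, `synthesize`) are assumptions made up for illustration, and brute-force enumeration stands in for a real program synthesizer.

```python
from itertools import product

def reference(rows):
    """Hypothetical sub-query: per-key maximum, written as a quadratic self-join."""
    return sorted({(k, v) for k, v in rows
                   if not any(k2 == k and v2 > v for k2, v2 in rows)})

# Hypothetical template holes: how a per-key accumulator absorbs the next value.
UPDATES = {
    "sum": lambda acc, v: acc + v,
    "min": min,
    "max": max,
    "last": lambda acc, v: v,
}

def make_operator(update):
    """Instantiate the template: one pass over the rows, one state slot per key."""
    def op(rows):
        state = {}
        for k, v in rows:
            state[k] = update(state[k], v) if k in state else v
        return sorted(state.items())
    return op

def bounded_inputs(keys, vals, max_len):
    """Enumerate every input table with at most max_len rows over a small domain."""
    domain = list(product(keys, vals))
    for n in range(max_len + 1):
        for table in product(domain, repeat=n):
            yield list(table)

def synthesize(spec, max_len=3):
    """Return the name of the first candidate that matches spec on all bounded inputs."""
    for name, update in UPDATES.items():
        op = make_operator(update)
        if all(op(rows) == spec(rows)
               for rows in bounded_inputs((0, 1), (0, 1, 2), max_len)):
            return name
    return None

if __name__ == "__main__":
    print(synthesize(reference))
```

Under these assumptions, `synthesize(reference)` selects the `max` update, giving a linear-time, single-pass operator in place of the quadratic self-join. The check covers only tables of up to three rows over a small domain, so, as in the abstract, equivalence is established only up to a bound on the input size.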

Program synthesis is an undecidable problem in general and often difficult to scale, even for bounded inputs. Blitz therefore uses a series of analyses to judiciously use program synthesis and incrementally construct complex operators.

We integrated Blitz with existing big-data query languages by embedding the synthesized operators back into the query as User Defined Operators. We evaluated Blitz on several production queries from Microsoft running on two state-of-the-art query engines: SparkSQL as well as Scope, the big-data engine of Microsoft. Blitz produces correct optimizations despite the synthesis being bounded. The resulting queries have much more succinct query plans and demonstrate significant performance improvements on both big-data systems (1.3x to 4.7x).

Supplemental Material

opt_query.mp4 (MP4, 2.1 GB)

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Query operators
        Query optimization
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Program analysis
      2. Program specifications

Published in

SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles
October 2017
677 pages
ISBN: 9781450350853
DOI: 10.1145/3132747

Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Program Synthesis
Query Optimization
User-Defined Operators
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 131 of 716 submissions, 18%