research-article

Opening the black boxes in data flow optimization

Authors:
Fabian Hueske

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

,
Mathias Peters

Humboldt-Universität zu Berlin, Germany

Humboldt-Universität zu Berlin, Germany
View Profile

,
Matthias J. Sax

Humboldt-Universität zu Berlin, Germany

Humboldt-Universität zu Berlin, Germany
View Profile

,
Astrid Rheinländer

Humboldt-Universität zu Berlin, Germany

Humboldt-Universität zu Berlin, Germany
View Profile

,
Rico Bergmann

Humboldt-Universität zu Berlin, Germany

Humboldt-Universität zu Berlin, Germany
View Profile

,
Aljoscha Krettek

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

,
Kostas Tzoumas

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

Proceedings of the VLDB Endowment Volume 5 Issue 11pp 1256–1267https://doi.org/10.14778/2350229.2350244

Published:01 July 2012Publication History

Proceedings of the VLDB Endowment

Abstract

Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions.

We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.

References

http://public.web.cern.ch/public/en/LHC/Computing-en.html.Google Scholar
http://www.greenplum.com/technology/mapreduce.Google Scholar
http://www.sable.mcgill.ca/soot/.Google Scholar
http://stratosphere.eu.Google Scholar
A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Pearson, 2006. Google ScholarDigital Library
M. Baker. Next-generation sequencing: Adjusting to data overload. Nature Methods, 7(7):495--499, 2010.Google ScholarCross Ref
D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010. Google ScholarDigital Library
J. Becla, A. Hanushevsky, S. Nikolaev, G. Abdulla, A. S. Szalay, M. A. Nieto-Santisteban, A. Thakar, and J. Gray. Designing a multi-petabyte database for LSST. CoRR, abs/cs/0604112, 2006.Google Scholar
A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011. Google ScholarDigital Library
K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011.Google ScholarDigital Library
V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011. Google ScholarDigital Library
R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarDigital Library
S. Chaudhuri and K. Shim. Including group-by in query optimization. In VLDB, pp. 354--366, 1994. Google ScholarDigital Library
S. Chaudhuri and K. Shim. An overview of cost-based optimization of queries with aggregates. IEEE Data Eng. Bull., 18(3):3--9, 1995.Google Scholar
S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM TODS, 24(2):177--228, 1999. Google ScholarDigital Library
B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing A SQL Implementation On The MapReduce Framework. PVLDB, 4(12):1318--1327, 2011.Google Scholar
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pp. 137--150, 2004. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. CACM, 51(1):107--113, 2008. Google ScholarDigital Library
P. Fender and G. Moerkotte. A new, highly efficient, and easy to implement top-down join enumeration algorithm. In ICDE, pp. 864--875, 2011. Google ScholarDigital Library
E. Friedman, P. M. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402--1413, 2009. Google ScholarDigital Library
G. Graefe. The Cascades framework for query optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.Google Scholar
T. Grust, M. Mayr, J. Rittinger, and T. Schreiber. Ferry: Database-supported program execution. In SIGMOD, pp. 1063--1066, 2009. Google ScholarDigital Library
J. M. Hellerstein. Optimization techniques for queries with expensive methods. ACM TODS, 23(2):113--157, 1998. Google ScholarDigital Library
H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB, 4(11):1111--1122, 2011.Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pp. 59--72, 2007. Google ScholarDigital Library
E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. PVLDB, 4(6):385--396, 2011. Google ScholarDigital Library
G. Moerkotte and T. Neumann. Dynamic programming strikes back. In SIGMOD, pp. 539--552, 2008. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, pp. 1099--1110, 2008. Google ScholarDigital Library
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pp. 23--34, 1979. Google ScholarDigital Library
M. Stonebraker, C. Bear, U. Çetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. B. Zdonik. One size fits all? Part 2: Benchmarking studies. In CIDR, pp. 173--184, 2007.Google Scholar
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
D. Warneke and O. Kao. Nephele: Efficient parallel data processing in the cloud. In SC-MTAGS, 2009. Google ScholarDigital Library
Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pp. 1--14, 2008. Google ScholarDigital Library

Recommendations

Peeking into the optimization of data flow programs with MapReduce-style UDFs
ICDE '13: Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013)

Data flows are a popular abstraction to define dataintensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from ...
Read More
Big data multi-query optimisation with Apache Flink

Big data analytic frameworks, such as MapReduce, Spark and Flink, have recently gained more popularity to process large data. Flink is an open-source Apache-hosted big data analytic framework for processing batch and streaming data. For historical data ...
Read More
Query optimization for massively parallel data processing
SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud Computing

MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. Some vendors have ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 11
July 2012
608 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2012
Published in pvldb Volume 5, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 49
  Total Citations
  View Citations
- 447
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Opening the black boxes in data flow optimization

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Peeking into the optimization of data flow programs with MapReduce-style UDFs

Big data multi-query optimisation with Apache Flink

Query optimization for massively parallel data processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Opening the black boxes in data flow optimization

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Peeking into the optimization of data flow programs with MapReduce-style UDFs

Big data multi-query optimisation with Apache Flink

Query optimization for massively parallel data processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media