Abstract
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions.
We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.
- http://public.web.cern.ch/public/en/LHC/Computing-en.html.Google Scholar
- http://www.greenplum.com/technology/mapreduce.Google Scholar
- http://www.sable.mcgill.ca/soot/.Google Scholar
- http://stratosphere.eu.Google Scholar
- A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Pearson, 2006. Google ScholarDigital Library
- M. Baker. Next-generation sequencing: Adjusting to data overload. Nature Methods, 7(7):495--499, 2010.Google ScholarCross Ref
- D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010. Google ScholarDigital Library
- J. Becla, A. Hanushevsky, S. Nikolaev, G. Abdulla, A. S. Szalay, M. A. Nieto-Santisteban, A. Thakar, and J. Gray. Designing a multi-petabyte database for LSST. CoRR, abs/cs/0604112, 2006.Google Scholar
- A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011. Google ScholarDigital Library
- K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011.Google ScholarDigital Library
- V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011. Google ScholarDigital Library
- R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarDigital Library
- S. Chaudhuri and K. Shim. Including group-by in query optimization. In VLDB, pp. 354--366, 1994. Google ScholarDigital Library
- S. Chaudhuri and K. Shim. An overview of cost-based optimization of queries with aggregates. IEEE Data Eng. Bull., 18(3):3--9, 1995.Google Scholar
- S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM TODS, 24(2):177--228, 1999. Google ScholarDigital Library
- B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing A SQL Implementation On The MapReduce Framework. PVLDB, 4(12):1318--1327, 2011.Google Scholar
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pp. 137--150, 2004. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. CACM, 51(1):107--113, 2008. Google ScholarDigital Library
- P. Fender and G. Moerkotte. A new, highly efficient, and easy to implement top-down join enumeration algorithm. In ICDE, pp. 864--875, 2011. Google ScholarDigital Library
- E. Friedman, P. M. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402--1413, 2009. Google ScholarDigital Library
- G. Graefe. The Cascades framework for query optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.Google Scholar
- T. Grust, M. Mayr, J. Rittinger, and T. Schreiber. Ferry: Database-supported program execution. In SIGMOD, pp. 1063--1066, 2009. Google ScholarDigital Library
- J. M. Hellerstein. Optimization techniques for queries with expensive methods. ACM TODS, 23(2):113--157, 1998. Google ScholarDigital Library
- H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB, 4(11):1111--1122, 2011.Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pp. 59--72, 2007. Google ScholarDigital Library
- E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. PVLDB, 4(6):385--396, 2011. Google ScholarDigital Library
- G. Moerkotte and T. Neumann. Dynamic programming strikes back. In SIGMOD, pp. 539--552, 2008. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, pp. 1099--1110, 2008. Google ScholarDigital Library
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pp. 23--34, 1979. Google ScholarDigital Library
- M. Stonebraker, C. Bear, U. Çetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. B. Zdonik. One size fits all? Part 2: Benchmarking studies. In CIDR, pp. 173--184, 2007.Google Scholar
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
- D. Warneke and O. Kao. Nephele: Efficient parallel data processing in the cloud. In SC-MTAGS, 2009. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pp. 1--14, 2008. Google ScholarDigital Library
Recommendations
Peeking into the optimization of data flow programs with MapReduce-style UDFs
ICDE '13: Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013)Data flows are a popular abstraction to define dataintensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from ...
Big data multi-query optimisation with Apache Flink
Big data analytic frameworks, such as MapReduce, Spark and Flink, have recently gained more popularity to process large data. Flink is an open-source Apache-hosted big data analytic framework for processing batch and streaming data. For historical data ...
Query optimization for massively parallel data processing
SOCC '11: Proceedings of the 2nd ACM Symposium on Cloud ComputingMapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. Some vendors have ...
Comments