skip to main content
research-article

Opening the black boxes in data flow optimization

Published:01 July 2012Publication History
Skip Abstract Section

Abstract

Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions.

We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.

References

  1. http://public.web.cern.ch/public/en/LHC/Computing-en.html.Google ScholarGoogle Scholar
  2. http://www.greenplum.com/technology/mapreduce.Google ScholarGoogle Scholar
  3. http://www.sable.mcgill.ca/soot/.Google ScholarGoogle Scholar
  4. http://stratosphere.eu.Google ScholarGoogle Scholar
  5. A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Pearson, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Baker. Next-generation sequencing: Adjusting to data overload. Nature Methods, 7(7):495--499, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  7. D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Becla, A. Hanushevsky, S. Nikolaev, G. Abdulla, A. S. Szalay, M. A. Nieto-Santisteban, A. Thakar, and J. Gray. Designing a multi-petabyte database for LSST. CoRR, abs/cs/0604112, 2006.Google ScholarGoogle Scholar
  9. A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Chaudhuri and K. Shim. Including group-by in query optimization. In VLDB, pp. 354--366, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Chaudhuri and K. Shim. An overview of cost-based optimization of queries with aggregates. IEEE Data Eng. Bull., 18(3):3--9, 1995.Google ScholarGoogle Scholar
  15. S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM TODS, 24(2):177--228, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing A SQL Implementation On The MapReduce Framework. PVLDB, 4(12):1318--1327, 2011.Google ScholarGoogle Scholar
  17. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pp. 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. CACM, 51(1):107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. P. Fender and G. Moerkotte. A new, highly efficient, and easy to implement top-down join enumeration algorithm. In ICDE, pp. 864--875, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. E. Friedman, P. M. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402--1413, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Graefe. The Cascades framework for query optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.Google ScholarGoogle Scholar
  22. T. Grust, M. Mayr, J. Rittinger, and T. Schreiber. Ferry: Database-supported program execution. In SIGMOD, pp. 1063--1066, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. M. Hellerstein. Optimization techniques for queries with expensive methods. ACM TODS, 23(2):113--157, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB, 4(11):1111--1122, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pp. 59--72, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. PVLDB, 4(6):385--396, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. Moerkotte and T. Neumann. Dynamic programming strikes back. In SIGMOD, pp. 539--552, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, pp. 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pp. 23--34, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. M. Stonebraker, C. Bear, U. Çetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. B. Zdonik. One size fits all? Part 2: Benchmarking studies. In CIDR, pp. 173--184, 2007.Google ScholarGoogle Scholar
  31. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. D. Warneke and O. Kao. Nephele: Efficient parallel data processing in the cloud. In SC-MTAGS, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pp. 1--14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2012
    Published in pvldb Volume 5, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader