ABSTRACT
The MapReduce distributed programming framework is very popular, but currently lacks the optimization techniques that have been standard with relational database systems for many years. This paper proposes Manimal, which uses static code analysis to detect MapReduce program semantics and thereby enable wholly-automatic optimization of MapReduce programs. For example, a programmer's map function that emits data only when an if... statement holds true is essentially encoding a selection condition; code analysis can detect and characterize these conditions. If Manimal has an appropriate index available, it can then alter MapReduce execution to use it.
Manimal can address many different optimization opportunities, including projections, structure-aware data compression, and others. However, this paper illustrates the system by focusing on one: efficient selection. We give a static analysis algorithm that can detect selections in user programs, and cover how Manimal can employ a B+Tree to execute these selections efficiently at runtime. Testing Manimal on several standard MapReduce programs, we show that selection alone can automatically reduce a standard program's runtime to 63% of conventional MapReduce execution time on identical hardware. We also give an in-depth discussion of other optimization targets and detection techniques.
- }}D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD Conference, pages 671--682, 2006. Google ScholarDigital Library
- }}A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009. Google ScholarDigital Library
- }}F. Afrati and J. Ullman. Optimizing Joins in a Map-Reduce Environment. In EDBT, 2010. Google ScholarDigital Library
- }}A. V. Aho, M. Lam, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools: Second Edition. Addison-Wesley, 2007. Google ScholarDigital Library
- }}J. Cho and S. Rajagopalan. A fast regular expression indexing engine. In ICDE, pages 419--430, 2002. Google ScholarDigital Library
- }}J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
- }}B. Hindman, A. Konwinski, M. Zaharia, and I. Stoica. A Common Substrate for Cluster Computing. In HotCloud, June 2009. Google ScholarDigital Library
- }}M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In SOSP, pages 261--276, 2009. Google ScholarDigital Library
- }}Mahout. http://lucene.apache.org/mahout/index.html, 2009.Google Scholar
- }}C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a Not-So-Foreign Language for Data Processing. In SIGMOD Conference, pages 1099--1110, 2008. Google ScholarDigital Library
- }}A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In SIGMOD Conference, pages 165--178, 2009. Google ScholarDigital Library
- }}M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: A Column-Oriented DBMS. In VLDB, pages 553--564, 2005. Google ScholarDigital Library
- }}H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. P. Jr. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In SIGMOD Conference, pages 1029--1040, 2007. Google ScholarDigital Library
- }}M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In OSDI, pages 29--42, 2008. Google ScholarDigital Library
Index Terms
- Manimal: relational optimization for data-intensive programs
Comments