Abstract
MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance, and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to optimize the performance of a batch of jobs. In this paper, we make two contributions to multi-job optimization in the MapReduce framework. Our first contribution is two new techniques. The first is a generalized grouping technique (which generalizes the recently proposed MRShare technique) that merges multiple jobs into a single job, thereby enabling the merged jobs to share both the scan of the input file and the communication of the common map output. The second is a materialization technique that enables multiple jobs to share both the scan of the input file and the communication of the common map output via partial materialization of the map output of some jobs (in the map and/or reduce phase). Our second contribution is a new optimization algorithm that, given an input batch of jobs, produces an optimal plan by a judicious partitioning of the jobs into groups and an optimal assignment of the processing technique to each group. Our experimental results on Hadoop demonstrate that our new approach significantly outperforms the state-of-the-art technique, MRShare, by up to 107%.
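The idea of merging several jobs so that they share one scan of the input can be illustrated with a small, single-process sketch. This is not the paper's algorithm or Hadoop code: the function `run_merged`, the job dictionary layout, and the two example jobs are all hypothetical, and the sketch only shows scan sharing (each record is read once and fed to every job's map function), not the sharing of common map output or the cost-based choice of groups.

```python
from collections import defaultdict


def run_merged(records, jobs):
    """Run every job with a single pass over `records` (hypothetical helper).

    jobs maps job_id -> (map_fn, reduce_fn), where map_fn(record) yields
    (key, value) pairs and reduce_fn(key, values) returns one result.
    """
    groups = defaultdict(list)
    records_read = 0
    for record in records:  # the one shared scan of the input
        records_read += 1
        for job_id, (map_fn, _) in jobs.items():
            for key, value in map_fn(record):
                # Tag each map output with its job so results stay separate.
                groups[(job_id, key)].append(value)

    results = {job_id: {} for job_id in jobs}
    for (job_id, key), values in groups.items():
        reduce_fn = jobs[job_id][1]
        results[job_id][key] = reduce_fn(key, values)
    return results, records_read


# Two illustrative jobs over the same input: word count and longest line.
def wc_map(line):
    for word in line.split():
        yield word, 1


def len_map(line):
    yield "len", len(line)


jobs = {
    "wordcount": (wc_map, lambda key, values: sum(values)),
    "maxlen": (len_map, lambda key, values: max(values)),
}

results, records_read = run_merged(["a b a", "b c"], jobs)
print(results["wordcount"], results["maxlen"], records_read)
```

Run separately, the two jobs would each read both input lines, for four record reads in total; the merged run reads each line once.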
References
- F. N. Afrati and J. D. Ullman. Optimizing joins in a MapReduce environment. In EDBT, 2010.
- J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, 2004.
- I. Elghandour and A. Aboulnaga. ReStore: Reusing results of MapReduce jobs. In VLDB, 2012.
- L. Fegaras, C. Li, and U. Gupta. An optimization framework for map-reduce queries. In EDBT, 2012.
- A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the Pig experience. In VLDB, 2009.
- H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. In VLDB, 2011.
- H. Herodotou, F. Dong, and S. Babu. MapReduce programming and cost-based optimization? Crossing this chasm with Starfish. In VLDB, 2011.
- E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for MapReduce programs. In VLDB, 2011.
- J. Jestes, K. Yi, and F. Li. Building wavelet histograms on large data in MapReduce. In VLDB, 2011.
- F. Li, B. C. Ooi, M. T. Özsu, and S. Wu. Distributed data management using MapReduce. ACM Computing Surveys. To appear in 2014.
- H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for MapReduce workflows. In VLDB, 2012.
- T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: sharing across multiple queries in MapReduce. In VLDB, 2010.
- C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In ATC, 2008.
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, 2008.
- Y. Shi, X. Meng, F. Wang, and Y. Gan. HEDC: a histogram estimator for data in the cloud. In CloudDb, 2012.
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. In VLDB, 2009.
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, and H. Liu. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, 2010.
- G. Wang and C.-Y. Chan. Multi-query optimization in MapReduce framework. Technical Report http://www.comp.nus.edu.sg/~g0800170/techreport-MJQ.pdf, National University of Singapore, February 2013.
- T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.
- S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SOCC, 2011.