research-article

Multi-query optimization in MapReduce framework

Authors:
Guoping Wang

National University of Singapore

National University of Singapore
View Profile

,
Chee-Yong Chan

National University of Singapore

National University of Singapore
View Profile

Proceedings of the VLDB Endowment Volume 7 Issue 3pp 145–156https://doi.org/10.14778/2732232.2732234

Published:01 November 2013Publication History

Proceedings of the VLDB Endowment

Abstract

MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to optimize the performance for a batch of jobs. In this paper, we propose two new techniques for multi-job optimization in the MapReduce framework. The first is a generalized grouping technique (which generalizes the recently proposed MRShare technique) that merges multiple jobs into a single job thereby enabling the merged jobs to share both the scan of the input file as well as the communication of the common map output. The second is a materialization technique that enables multiple jobs to share both the scan of the input file as well as the communication of the common map output via partial materialization of the map output of some jobs (in the map and/or reduce phase). Our second contribution is the proposal of a new optimization algorithm that given an input batch of jobs, produces an optimal plan by a judicious partitioning of the jobs into groups and an optimal assignment of the processing technique to each group. Our experimental results on Hadoop demonstrate that our new approach significantly outperforms the state-of-the-art technique, MRShare, by up to 107%.

References

F. N. Afrati and J. D. Ullman. Optimizing joins in a mapreduce environment. In EDBT, 2010. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
I. Elghandour and A. Aboulnaga. Restore: Reusing results of mapreduce jobs. In VLDB, 2012. Google ScholarDigital Library
L. Fegaras, C. Li, and U. Gupta. An optimization framework for map-reduce queries. In EDBT, 2012. Google ScholarDigital Library
A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. In VLDB, 2009. Google ScholarDigital Library
H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. In VLDB, 2011.Google ScholarDigital Library
H. Herodotou, F. Dong, and S. Babu. Mapreduce programming and cost-based optimization? crossing this chasm with starfish. In VLDB, 2011.Google Scholar
E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for mapreduce programs. In VLDB, 2011. Google ScholarDigital Library
J. Jestes, K. Yi, and F. Li. Building wavelet histograms on large data in mapreduce. In VLDB, 2011. Google ScholarDigital Library
F. Li, B. C. Ooi, M. T. Özsu, and S. Wu. Distributed data management using mapreduce. ACM Computing Surveys. To appear in 2014. Google ScholarDigital Library
H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for mapreduce workflows. In VLDB, 2012. Google ScholarDigital Library
T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: sharing across multiple queries in mapreduce. In VLDB, 2010. Google ScholarDigital Library
C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In ATC, 2008. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In SIGMOD, 2008. Google ScholarDigital Library
Y. Shi, X. Meng, F. Wang, and Y. Gan. Hedc: a histogram estimator for data in the cloud. In CloudDb, 2012. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive-a warehousing solution over a map-reduce framework. In VLDB, 2009. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, and H. Liu. Hive-a petabyte scale data warehouse using hadoop. In ICDE, 2010.Google ScholarCross Ref
G. Wang and C.-Y. Chan. Multi-query optimization in mapreduce framework. Technical Report http://www.comp.nus.edu.sg/~g0800170/techreport-MJQ.pdf, National University of Singapore, February 2013.Google ScholarDigital Library
T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009. Google ScholarDigital Library
S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SOCC, 2011. Google ScholarDigital Library

Recommendations

Data-locality-aware mapreduce real-time scheduling framework

A framework to manage interactive MapReduce applications with deadline constraint.A dispatcher to assign jobs to resources considering blocking and data-locality.A dynamic power management for MapReduce tasks to improve run-time energy efficiency.A ...
Read More
Big data multi-query optimisation with Apache Flink

Big data analytic frameworks, such as MapReduce, Spark and Flink, have recently gained more popularity to process large data. Flink is an open-source Apache-hosted big data analytic framework for processing batch and streaming data. For historical data ...
Read More
Multi-objective scheduling of MapReduce jobs in big data processing

Data generation has increased drastically over the past few years due to the rapid development of Internet-based technologies. This period has been called the big data era. Big data offer an emerging paradigm shift in data exploration and utilization. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 7, Issue 3
November 2013
84 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 November 2013
Published in pvldb Volume 7, Issue 3
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 22
  Total Citations
  View Citations
- 216
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Multi-query optimization in MapReduce framework

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Data-locality-aware mapreduce real-time scheduling framework

Big data multi-query optimisation with Apache Flink

Multi-objective scheduling of MapReduce jobs in big data processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Multi-query optimization in MapReduce framework

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Data-locality-aware mapreduce real-time scheduling framework

Big data multi-query optimisation with Apache Flink

Multi-objective scheduling of MapReduce jobs in big data processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media