research-article

MRShare: sharing across multiple queries in MapReduce

Authors:
Tomasz Nykiel (University of Toronto), Michalis Potamias (Boston University), Chaitanya Mishra (Facebook), George Kollios (Boston University), Nick Koudas (University of Toronto)

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 494–505. https://doi.org/10.14778/1920841.1920906

Published: 01 September 2010

Abstract

Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) that can be processed in batch mode. Because different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can in turn reduce the monetary charges incurred while utilizing the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce.

Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Experiments in our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach and substantial savings.
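To make the grouping idea concrete, here is a minimal sketch. It is not the paper's cost model or algorithm, neither of which is given in the abstract. It assumes a hypothetical toy cost: a job run alone pays one input scan plus the cost of its own intermediate output, while a merged group pays the scan once but has its intermediate output inflated by a per-job overhead. It also assumes groups are consecutive runs of jobs in the given order. The names `group_cost`, `best_grouping`, `scan_cost`, and `overhead` are illustrative.

```python
def group_cost(sizes, scan_cost, overhead):
    """Toy cost of running a group of jobs as one merged job.

    sizes: per-job intermediate output sizes (hypothetical units).
    The input is scanned once for the whole group; the intermediate
    output cost is inflated by `overhead` for each extra job in the group.
    """
    k = len(sizes)
    return scan_cost + sum(sizes) * (1 + overhead * (k - 1))


def best_grouping(sizes, scan_cost, overhead):
    """Cheapest split of the jobs into consecutive groups (assumed order).

    Dynamic programming: best[end] is the minimum toy cost of running
    jobs 0..end-1, and cut[end] is where the last group starts.
    """
    n = len(sizes)
    best = [0.0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + group_cost(sizes[start:end], scan_cost, overhead)
            if cost < best[end]:
                best[end], cut[end] = cost, start
    # Walk the cut points backwards to recover the groups as job indices.
    groups, end = [], n
    while end > 0:
        groups.append(list(range(cut[end], end)))
        end = cut[end]
    return best[n], groups[::-1]
```

Under this toy model, an expensive scan with no sharing overhead favors merging all jobs into one group, whereas a cheap scan with high overhead favors running each job on its own. The trade-off between saved scans and added per-group work is the kind of decision a grouping optimizer has to make.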

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment