Article

Map-reduce-merge: simplified relational data processing on large clusters

Authors:
Hung-chih Yang, Yahoo!, Sunnyvale, CA
Ali Dasdan, Yahoo!, Sunnyvale, CA
Ruey-Lung Hsiao, UCLA, Los Angeles, CA
D. Stott Parker, UCLA, Los Angeles, CA

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 2007, Pages 1029–1040
https://doi.org/10.1145/1247480.1247602

Published: 11 June 2007

ABSTRACT

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.
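
To make the two-function interface concrete, the following is a minimal single-process sketch, not the implementation of any real Map-Reduce system. The driver `map_reduce` and the word-count functions `word_mapper` and `count_reducer` are hypothetical names chosen for illustration; a real deployment would distribute the grouping step across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy driver: apply mapper to each record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each group; keys are visited in sorted order, as after a shuffle-and-sort.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def word_mapper(line):
    """Hypothetical map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    """Hypothetical reduce function: sum the counts collected for one word."""
    return sum(values)


counts = map_reduce(["a b a", "b c"], word_mapper, count_reducer)
print(counts)
```

Only the mapper and reducer are application-specific here; the grouping logic in the driver stays the same for any job, which is the simplicity the abstract refers to.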

However, this model does not directly support processing multiple related heterogeneous datasets. Although processing relational data is a common need, this limitation causes difficulty and inefficiency when Map-Reduce is applied to relational operations such as joins.

We extend Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by the map and reduce modules. We also demonstrate that this new model can express relational algebra operators and implement several join algorithms.
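
The idea of a Merge phase over two key-sorted reduce outputs can be sketched as follows. This is an illustrative single-process approximation under stated assumptions, not the paper's API: the helper names (`map_reduce`, `merge_join`), the mappers and reducers, and the employee and department records are all hypothetical. Each dataset runs through its own map and reduce, and the merge step performs a sort-merge inner join on the shared key.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one dataset through map and reduce; return key-sorted (key, value) pairs."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [(key, reducer(key, groups[key])) for key in sorted(groups)]


def merge_join(left, right, merger):
    """Sort-merge inner join of two key-sorted lists whose keys are unique per side."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, left_value = left[i]
        right_key, right_value = right[j]
        if left_key == right_key:
            out.append(merger(left_key, left_value, right_value))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1  # no partner on the right side for this key
        else:
            j += 1  # no partner on the left side for this key
    return out


# Hypothetical inputs: (name, dept_id, bonus) and (dept_id, multiplier).
employees = [("alice", "d1", 100), ("bob", "d2", 50), ("carol", "d1", 25)]
departments = [("d1", 2), ("d3", 5), ("d2", 3)]

bonus_by_dept = map_reduce(
    employees,
    lambda rec: [(rec[1], rec[2])],   # map: key by dept_id, emit bonus
    lambda key, values: sum(values),  # reduce: total bonus per department
)
multiplier_by_dept = map_reduce(
    departments,
    lambda rec: [(rec[0], rec[1])],   # map: key by dept_id, emit multiplier
    lambda key, values: values[0],    # reduce: one multiplier per department
)

adjusted = merge_join(
    bonus_by_dept,
    multiplier_by_dept,
    lambda key, total, multiplier: (key, total * multiplier),
)
print(adjusted)
```

Because both reduce outputs are already sorted on the join key, the merge step is a single linear pass over each list. Department `d3` has no employees in this sample, so the inner join drops it.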

Published in
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN: 9781595936868
DOI: 10.1145/1247480
General Chairs:
Lizhu Zhou, Tsinghua University, China
Tok Wang Ling, National University of Singapore, Singapore
Program Chair:
Beng Chin Ooi, National University of Singapore, Singapore
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cluster
data processing
distributed
join
map-reduce
map-reduce-merge
parallel
relational
search engine
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%