ABSTRACT
Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.
However,this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins.
We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.
- Apache. Hadoop. http://lucene.apache.org/hadoop/, 2006.Google Scholar
- A. C. Arpaci-Dusseau et al. High-Performance Sorting on Networks of Workstations. In SIGMOD 1997, pages 243--254, 1997.Google Scholar
- E. A. Brewer. Combining Systems and Databases: A Search Engine Retrospective. In J. M. Hellerstein and M. Stonebraker, editors, Readings in Database Systems, Fourth Edition, Cambridge, MA, 2005. MIT Press.Google Scholar
- F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI, pages 205--218, 2006. Google ScholarDigital Library
- L. Chu et al. Optimizing Data Aggregation for Cluster-Based Internet Services. In PPOPP, pages 119--130. ACM, 2003.Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137--150, 2004.Google ScholarDigital Library
- D. J. DeWitt et al. GAMMA-A High Performance Dataflow Database Machine. In VLDB 1986, pages 228--237, 1986. Google ScholarDigital Library
- D. J. DeWitt and Gerber. R. Multiprocessor Hash-Based Join Algorithms. In VLDB 1985, 1985.Google Scholar
- D. J. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6):85--98, 1992. Google ScholarDigital Library
- S. Ghemawat, H. Gobioff, and S. T. Leung. The Google file system. In SOSP, pages 29--43, 2003.Google ScholarDigital Library
- J. Gray. Sort Benchmark. http://research.microsoft.com/barc/SortBenchmark/,2006.Google Scholar
- J. Gray et al. Scientific data management in the coming decade. SIGMOD Record, 34(4):34--41, 2005. Google ScholarDigital Library
- M. Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007.Google ScholarDigital Library
- R. Lämmel. Google's MapReduce Programming Model - Revisited. Draft; Online since 2 January, 2006; 26 pages, 22 Jan. 2006.Google Scholar
- R. Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal, 13(4):227--298, 2005. Google ScholarDigital Library
- Teradata. Teradata. http://www.teradata.com/t/go.aspx, 2006.Google Scholar
- TPC. TPC-H. http://www.tpc.org/tpch/default.asp, 2006.Google Scholar
- Wikipedia. Redundant Array of Inexpensive Nodes. http://en.wikipedia.org/wiki/Redundant Array of Inexpensive Nodes, 2006.Google Scholar
Index Terms
- Map-reduce-merge: simplified relational data processing on large clusters
Recommendations
Scale-out beyond map-reduce
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data miningThe amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated ...
Rainfall Prediction using Artificial Neural Network on Map-Reduce Framework
WCI '15: Proceedings of the Third International Symposium on Women in Computing and InformaticsBig data is a celebrated topic in Business as well as research community for several years. With the revolution of Big Data, it is becoming easy and less expensive to store tremendous amount of data for future analysis. Weather data gets accumulated very ...
A 2-Tier Clustering Algorithm with Map-Reduce
CHINAGRID '10: Proceedings of the The Fifth Annual ChinaGrid ConferenceIn the field of data mining, clustering is one of the important methods. K-Means is a typical distance-based clustering algorithm; 2-tier clustering should implement scalable clustering by means of dividing, sampling and knowledge integrating. Among ...
Comments