skip to main content
10.1145/1247480.1247602acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Map-reduce-merge: simplified relational data processing on large clusters

Published:11 June 2007Publication History

ABSTRACT

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process a vast amount of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing jobs for search engines and machine learning.

However,this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes difficulties and/or inefficiency when Map-Reduce is applied on relational operations like joins.

We improve Map-Reduce into a new model called Map-Reduce-Merge. It adds to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules. We also demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.

References

  1. Apache. Hadoop. http://lucene.apache.org/hadoop/, 2006.Google ScholarGoogle Scholar
  2. A. C. Arpaci-Dusseau et al. High-Performance Sorting on Networks of Workstations. In SIGMOD 1997, pages 243--254, 1997.Google ScholarGoogle Scholar
  3. E. A. Brewer. Combining Systems and Databases: A Search Engine Retrospective. In J. M. Hellerstein and M. Stonebraker, editors, Readings in Database Systems, Fourth Edition, Cambridge, MA, 2005. MIT Press.Google ScholarGoogle Scholar
  4. F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. In OSDI, pages 205--218, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. L. Chu et al. Optimizing Data Aggregation for Cluster-Based Internet Services. In PPOPP, pages 119--130. ACM, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137--150, 2004.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. J. DeWitt et al. GAMMA-A High Performance Dataflow Database Machine. In VLDB 1986, pages 228--237, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. J. DeWitt and Gerber. R. Multiprocessor Hash-Based Join Algorithms. In VLDB 1985, 1985.Google ScholarGoogle Scholar
  9. D. J. DeWitt and J. Gray. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM, 35(6):85--98, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Ghemawat, H. Gobioff, and S. T. Leung. The Google file system. In SOSP, pages 29--43, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Gray. Sort Benchmark. http://research.microsoft.com/barc/SortBenchmark/,2006.Google ScholarGoogle Scholar
  12. J. Gray et al. Scientific data management in the coming decade. SIGMOD Record, 34(4):34--41, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In EuroSys, 2007.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Lämmel. Google's MapReduce Programming Model - Revisited. Draft; Online since 2 January, 2006; 26 pages, 22 Jan. 2006.Google ScholarGoogle Scholar
  15. R. Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal, 13(4):227--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Teradata. Teradata. http://www.teradata.com/t/go.aspx, 2006.Google ScholarGoogle Scholar
  17. TPC. TPC-H. http://www.tpc.org/tpch/default.asp, 2006.Google ScholarGoogle Scholar
  18. Wikipedia. Redundant Array of Inexpensive Nodes. http://en.wikipedia.org/wiki/Redundant Array of Inexpensive Nodes, 2006.Google ScholarGoogle Scholar

Index Terms

  1. Map-reduce-merge: simplified relational data processing on large clusters

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in
              • Published in

                cover image ACM Conferences
                SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
                June 2007
                1210 pages
                ISBN:9781595936868
                DOI:10.1145/1247480
                • General Chairs:
                • Lizhu Zhou,
                • Tok Wang Ling,
                • Program Chair:
                • Beng Chin Ooi

                Copyright © 2007 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 11 June 2007

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • Article

                Acceptance Rates

                Overall Acceptance Rate785of4,003submissions,20%

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader