skip to main content
research-article

Parallel data processing with MapReduce: a survey

Published:11 January 2012Publication History
Skip Abstract Section

Abstract

A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.

References

  1. Mahout: Scalable machine-learning and data-mining library. http://mapout.apache.org, 2010.Google ScholarGoogle Scholar
  2. F.N. Afrati and J.D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th EDBT, pages 99--110, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Ailamaki, D.J. DeWitt, M.D. Hill, and M. Skounakis. Weaving relations for cache performance. The VLDB Journal, pages 169--180, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Anand. Scaling Hadoop to 4000 nodes at Yahoo! http://goo.gl/8dRMq, 2008.Google ScholarGoogle Scholar
  5. S. Babu. Towards automatic optimization of mapreduce programs. In Proceedings of the 1st ACM symposium on Cloud computing, pages 137--142, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Nokia Research Center. Disco: Massive data- minimal code. http://discoproject.org, 2010.Google ScholarGoogle Scholar
  7. S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proceedings of the VLDB, 3(1-2):1459--1468, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. de Kruijf and K. Sankaralingam. Mapreduce for the cell broadband engine architecture. IBM Journal of Research and Development, 53(5):10:1--10:12, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.Google ScholarGoogle Scholar
  10. D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 1, 2008.Google ScholarGoogle Scholar
  11. A. Abouzeid et al. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922--933, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. Proceedings of the VLDB, 4(7), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. A. Matsunaga et al. Cloudblast: Combining mapreduce and virtualization on distributed resources for bioinformatics applications. In Fourth IEEE International Conference on eScience, pages 222--229, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Okcan et al. Processing Theta-Joins using MapReduce. In Proceedings of the 2011 ACM SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Pavlo et al. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD, pages 165--178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Thusoo et al. Hive-a petabyte scale data warehouse using Hadoop. In Proceedings of the 26th IEEE ICDE, pages 996--1005, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  18. A.F. Gates et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414--1425, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. Catanzaro et al. A map reduce framework for programming graphics processors. In Workshop on Software Tools for MultiCore Systems, 2008.Google ScholarGoogle Scholar
  20. B. He et al. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th PACT, pages 260--269, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. B. Li et al. A Platform for Scalable One-Pass Analytics using MapReduce. In Proceedings of the 2011 ACM SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Olston et al. Pig latin: a not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. C. Ordonez et al. Relational versus Non-Relational Database Systems for Data Warehousing. In Proceedings of the ACM DOLAP, pages 67--68, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. Ranger et al. Evaluating mapreduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE HPCA, pages 13--24, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. C. Yang et al. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In Proceedings of the 26th IEEE ICDE, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  26. D. Battré et al. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM symposium on Cloud computing, pages 119--130, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. D. Jiang et al. Map-join-reduce: Towards scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. D. Jiang et al. The performance of mapreduce: An in-depth study. Proceedings of the VLDB Endowment, 3(1-2):472--483, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Logothetis et al. Ad-hoc data processing in the cloud. Proceedings of the VLDB Endowment, 1(2):1472--1475, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Warneke et al. Nephele: efficient parallel data processing in the cloud. In Proceedings of the 2nd MTAGS, pages 1--10, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D.J. DeWitt et al. Clustera: an integrated computation and data management system. Proceedings of the VLDB Endowment, 1(1):28--41, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. E. Anderson et al. Efficiency matters! ACM SIGOPS Operating Systems Review, 44(1):40--45, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. E. Friedman et al. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proceedings of the VLDB Endowment, 2(2):1402--1413, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. E. Jahani et al. Automatic Optimization for MapReduce Programs. Proceedings of the VLDB, 4(6):385--396, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. F. Chang et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):1--26, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. G. Malewicz et al. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD, pages 135--146, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. H. Yang et al. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD, pages 1029--1040, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. J. Dean et al. MapReduce: a flexible data processing tool. Communications of the ACM, 53(1):72--77, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. J. Dittrich et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 3(1-2):515--529, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. J. Ekanayake et al. Mapreduce for data intensive scientific analyses. In the 4th IEEE International Conference on eScience, pages 277--284, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. J. Ekanayake et al. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM HPDC, pages 810--818, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. J. Leverich et al. On the energy (in) efficiency of hadoop clusters. ACM SIGOPS Operating Systems Review, 44(1):61--65, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Jeffrey Dean et al. Mapreduce: Simplified data processing on large clusters. In In Proceedings of the 6th USENIX OSDI, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. K. Morton et al. Estimating the Progress of MapReduce Pipelines. In Proceedings of the the 26th IEEE ICDE, pages 681--684, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  46. K. Morton et al. Paratimer: a progress indicator for mapreduce dags. In Proceedings of the 2010 ACM SIGMOD, pages 507--518, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Kenton et al. Protocol Buffer- Google's data interchange format. http://code.google.com/p/protobuf/.Google ScholarGoogle Scholar
  48. M. Isard et al. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59--72, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. M. Isard et al. Distributed data-parallel computing using a high-level programming language. In Proceedings of the ACM SIGMOD, pages 987--994, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. M. Stonebraker et al. One size fits all? Part 2: Benchmarking results. In Conference on Innovative Data Systems Research (CIDR), 2007.Google ScholarGoogle Scholar
  51. M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64--71, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. M. Zaharia et al. Improving mapreduce performance in heterogeneous environments. In Proceedings of the 8th USENIX OSDI, pages 29--42, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. R. Chaiken et al. Scope: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265--1276, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. R. Farivar et al. Mithra: Multiple data independent tasks on a heterogeneous resource architecture. In IEEE International Conference on Cluster Computing and Workshops, pages 1--10, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  55. R. Pike et al. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. R. Vernica et al. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the 2010 ACM SIGMOD, pages 495--506. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. S. Blanas et al. A comparison of join algorithms for log processing in MaPreduce. In Proceedings of the 2010 ACM SIGMOD, pages 975--986, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. S. Das et al. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD, pages 987--998, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. S. Ghemawat et al. The google file system. ACM SIGOPS Operating Systems Review, 37(5):29--43, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. S. Kavulya et al. An analysis of traces from a production mapreduce cluster. In 10th IEEE/ACM CCGrid, pages 94--103, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. S. Loebman et al. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In IEEE International Conference on Cluster Computing and Workshops, pages 1--10, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  62. S. Melnik et al. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330--339, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. T. Condie et al. MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, pages 21--21, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. T. Nykiel et al. MRshare: Sharing across multiple queries in mapreduce. Proceedings of the VLDB, 3(1-2):494--505, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. W. Jiang et al. A Map-Reduce System with an Alternate API for Multi-core Environments. In Proceedings of the 10th IEEE/ACM CCGrid, pages 84--93, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Y. Bu et al. HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2):285--296, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Y. Chen et al. To compress or not to compress - compute vs. IO tradeoffs for mapreduce energy efficiency. In Proceedings of the first ACM SIGCOMM workshop on Green networking, pages 23--28. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Y. He et al. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In Proceedings of the 2011 IEEE ICDE, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Y. Lin et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. In Proceedings of the 2011 ACM SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Y. Xu et al. Integrating Hadoop and parallel DBMS. In Proceedings of the ACM SIGMOD, pages 969--974, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX OSDI, pages 1--14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. D. Florescu and D. Kossmann. Rethinking cost and performance of database systems. ACM SIGMOD Record, 38(1):43--48, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Saptarshi Guha. RHIPE-R and Hadoop Integrated Processing Environment. http://www.stat.purdue.edu/sguha/rhipe/, 2010.Google ScholarGoogle Scholar
  74. W. Lang and J.M. Patel. Energy management for MapReduce clusters. Proceedings of the VLDB, 3(1-2):129--139, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. J. Lin and C. Dyer. Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies, 3(1):1--177, 2010. Google ScholarGoogle ScholarCross RefCross Ref
  76. O. O'Malley and A.C. Murthy. Winning a 60 second dash with a yellow elephant. Proceedings of sort benchmark, 2009.Google ScholarGoogle Scholar
  77. D.A. Patterson. Technical perspective: the data center is the computer. Communications of the ACM, 51(1):105--105, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. M.C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. M. Stonebraker and U. Cetintemel. One size fits all: An idea whose time has come and gone. In Proceedings of the 21st IEEE ICDE, pages 2--11. IEEE, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. R. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(Suppl 12):S1, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  81. T. White. Hadoop: The Definitive Guide. Yahoo Press, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Parallel data processing with MapReduce: a survey

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGMOD Record
            ACM SIGMOD Record  Volume 40, Issue 4
            December 2011
            60 pages
            ISSN:0163-5808
            DOI:10.1145/2094114
            Issue’s Table of Contents

            Copyright © 2012 Authors

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 11 January 2012

            Check for updates

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader