Abstract
A prominent parallel data processing tool MapReduce is gaining significant momentum from both industry and academia as the volume of data to analyze grows rapidly. While MapReduce is used in many areas where massive data analysis is required, there are still debates on its performance, efficiency per node, and simple abstraction. This survey intends to assist the database and open source communities in understanding various technical aspects of the MapReduce framework. In this survey, we characterize the MapReduce framework and discuss its inherent pros and cons. We then introduce its optimization strategies reported in the recent literature. We also discuss the open issues and challenges raised on parallel data analysis with MapReduce.
- Mahout: Scalable machine-learning and data-mining library. http://mapout.apache.org, 2010.Google Scholar
- F.N. Afrati and J.D. Ullman. Optimizing joins in a map-reduce environment. In Proceedings of the 13th EDBT, pages 99--110, 2010. Google ScholarDigital Library
- A. Ailamaki, D.J. DeWitt, M.D. Hill, and M. Skounakis. Weaving relations for cache performance. The VLDB Journal, pages 169--180, 2001. Google ScholarDigital Library
- A. Anand. Scaling Hadoop to 4000 nodes at Yahoo! http://goo.gl/8dRMq, 2008.Google Scholar
- S. Babu. Towards automatic optimization of mapreduce programs. In Proceedings of the 1st ACM symposium on Cloud computing, pages 137--142, 2010. Google ScholarDigital Library
- Nokia Research Center. Disco: Massive data- minimal code. http://discoproject.org, 2010.Google Scholar
- S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proceedings of the VLDB, 3(1-2):1459--1468, 2010. Google ScholarDigital Library
- M. de Kruijf and K. Sankaralingam. Mapreduce for the cell broadband engine architecture. IBM Journal of Research and Development, 53(5):10:1--10:12, 2009. Google ScholarDigital Library
- J. Dean. Designs, lessons and advice from building large distributed systems. Keynote from LADIS, 2009.Google Scholar
- D. DeWitt and M. Stonebraker. MapReduce: A major step backwards. The Database Column, 1, 2008.Google Scholar
- A. Abouzeid et al. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922--933, 2009. Google ScholarDigital Library
- A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. Proceedings of the VLDB, 4(7), 2011. Google ScholarDigital Library
- A. Matsunaga et al. Cloudblast: Combining mapreduce and virtualization on distributed resources for bioinformatics applications. In Fourth IEEE International Conference on eScience, pages 222--229, 2008. Google ScholarDigital Library
- A. Okcan et al. Processing Theta-Joins using MapReduce. In Proceedings of the 2011 ACM SIGMOD, 2011. Google ScholarDigital Library
- A. Pavlo et al. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD, pages 165--178, 2009. Google ScholarDigital Library
- A. Thusoo et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626--1629, 2009. Google ScholarDigital Library
- A. Thusoo et al. Hive-a petabyte scale data warehouse using Hadoop. In Proceedings of the 26th IEEE ICDE, pages 996--1005, 2010.Google ScholarCross Ref
- A.F. Gates et al. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414--1425, 2009. Google ScholarDigital Library
- B. Catanzaro et al. A map reduce framework for programming graphics processors. In Workshop on Software Tools for MultiCore Systems, 2008.Google Scholar
- B. He et al. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th PACT, pages 260--269, 2008. Google ScholarDigital Library
- B. Li et al. A Platform for Scalable One-Pass Analytics using MapReduce. In Proceedings of the 2011 ACM SIGMOD, 2011. Google ScholarDigital Library
- C. Olston et al. Pig latin: a not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD, pages 1099--1110, 2008. Google ScholarDigital Library
- C. Ordonez et al. Relational versus Non-Relational Database Systems for Data Warehousing. In Proceedings of the ACM DOLAP, pages 67--68, 2010. Google ScholarDigital Library
- C. Ranger et al. Evaluating mapreduce for multi-core and multiprocessor systems. In Proceedings of the 2007 IEEE HPCA, pages 13--24, 2007. Google ScholarDigital Library
- C. Yang et al. Osprey: Implementing MapReduce-style fault tolerance in a shared-nothing distributed database. In Proceedings of the 26th IEEE ICDE, 2010.Google ScholarCross Ref
- D. Battré et al. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM symposium on Cloud computing, pages 119--130, 2010. Google ScholarDigital Library
- D. Jiang et al. Map-join-reduce: Towards scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 2010. Google ScholarDigital Library
- D. Jiang et al. The performance of mapreduce: An in-depth study. Proceedings of the VLDB Endowment, 3(1-2):472--483, 2010. Google ScholarDigital Library
- D. Logothetis et al. Ad-hoc data processing in the cloud. Proceedings of the VLDB Endowment, 1(2):1472--1475, 2008. Google ScholarDigital Library
- D. Warneke et al. Nephele: efficient parallel data processing in the cloud. In Proceedings of the 2nd MTAGS, pages 1--10, 2009. Google ScholarDigital Library
- D.J. DeWitt et al. Clustera: an integrated computation and data management system. Proceedings of the VLDB Endowment, 1(1):28--41, 2008. Google ScholarDigital Library
- E. Anderson et al. Efficiency matters! ACM SIGOPS Operating Systems Review, 44(1):40--45, 2010. Google ScholarDigital Library
- E. Friedman et al. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proceedings of the VLDB Endowment, 2(2):1402--1413, 2009. Google ScholarDigital Library
- E. Jahani et al. Automatic Optimization for MapReduce Programs. Proceedings of the VLDB, 4(6):385--396, 2011. Google ScholarDigital Library
- F. Chang et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):1--26, 2008. Google ScholarDigital Library
- G. Malewicz et al. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD, pages 135--146, 2010. Google ScholarDigital Library
- H. Yang et al. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD, pages 1029--1040, 2007. Google ScholarDigital Library
- J. Dean et al. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- J. Dean et al. MapReduce: a flexible data processing tool. Communications of the ACM, 53(1):72--77, 2010. Google ScholarDigital Library
- J. Dittrich et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 3(1-2):515--529, 2010. Google ScholarDigital Library
- J. Ekanayake et al. Mapreduce for data intensive scientific analyses. In the 4th IEEE International Conference on eScience, pages 277--284, 2008. Google ScholarDigital Library
- J. Ekanayake et al. Twister: A runtime for iterative MapReduce. In Proceedings of the 19th ACM HPDC, pages 810--818, 2010. Google ScholarDigital Library
- J. Leverich et al. On the energy (in) efficiency of hadoop clusters. ACM SIGOPS Operating Systems Review, 44(1):61--65, 2010. Google ScholarDigital Library
- Jeffrey Dean et al. Mapreduce: Simplified data processing on large clusters. In In Proceedings of the 6th USENIX OSDI, pages 137--150, 2004. Google ScholarDigital Library
- K. Morton et al. Estimating the Progress of MapReduce Pipelines. In Proceedings of the the 26th IEEE ICDE, pages 681--684, 2010.Google ScholarCross Ref
- K. Morton et al. Paratimer: a progress indicator for mapreduce dags. In Proceedings of the 2010 ACM SIGMOD, pages 507--518, 2010. Google ScholarDigital Library
- Kenton et al. Protocol Buffer- Google's data interchange format. http://code.google.com/p/protobuf/.Google Scholar
- M. Isard et al. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, pages 59--72, 2007. Google ScholarDigital Library
- M. Isard et al. Distributed data-parallel computing using a high-level programming language. In Proceedings of the ACM SIGMOD, pages 987--994, 2009. Google ScholarDigital Library
- M. Stonebraker et al. One size fits all? Part 2: Benchmarking results. In Conference on Innovative Data Systems Research (CIDR), 2007.Google Scholar
- M. Stonebraker et al. MapReduce and parallel DBMSs: friends or foes? Communications of the ACM, 53(1):64--71, 2010. Google ScholarDigital Library
- M. Zaharia et al. Improving mapreduce performance in heterogeneous environments. In Proceedings of the 8th USENIX OSDI, pages 29--42, 2008. Google ScholarDigital Library
- R. Chaiken et al. Scope: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265--1276, 2008. Google ScholarDigital Library
- R. Farivar et al. Mithra: Multiple data independent tasks on a heterogeneous resource architecture. In IEEE International Conference on Cluster Computing and Workshops, pages 1--10, 2009.Google ScholarCross Ref
- R. Pike et al. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming, 13(4):277--298, 2005. Google ScholarDigital Library
- R. Vernica et al. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the 2010 ACM SIGMOD, pages 495--506. ACM, 2010. Google ScholarDigital Library
- S. Blanas et al. A comparison of join algorithms for log processing in MaPreduce. In Proceedings of the 2010 ACM SIGMOD, pages 975--986, 2010. Google ScholarDigital Library
- S. Das et al. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD, pages 987--998, 2010. Google ScholarDigital Library
- S. Ghemawat et al. The google file system. ACM SIGOPS Operating Systems Review, 37(5):29--43, 2003. Google ScholarDigital Library
- S. Kavulya et al. An analysis of traces from a production mapreduce cluster. In 10th IEEE/ACM CCGrid, pages 94--103, 2010. Google ScholarDigital Library
- S. Loebman et al. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In IEEE International Conference on Cluster Computing and Workshops, pages 1--10, 2009.Google ScholarCross Ref
- S. Melnik et al. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330--339, 2010. Google ScholarDigital Library
- T. Condie et al. MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, pages 21--21, 2010. Google ScholarDigital Library
- T. Nykiel et al. MRshare: Sharing across multiple queries in mapreduce. Proceedings of the VLDB, 3(1-2):494--505, 2010. Google ScholarDigital Library
- W. Jiang et al. A Map-Reduce System with an Alternate API for Multi-core Environments. In Proceedings of the 10th IEEE/ACM CCGrid, pages 84--93, 2010. Google ScholarDigital Library
- Y. Bu et al. HaLoop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2):285--296, 2010. Google ScholarDigital Library
- Y. Chen et al. To compress or not to compress - compute vs. IO tradeoffs for mapreduce energy efficiency. In Proceedings of the first ACM SIGCOMM workshop on Green networking, pages 23--28. ACM, 2010. Google ScholarDigital Library
- Y. He et al. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In Proceedings of the 2011 IEEE ICDE, 2011. Google ScholarDigital Library
- Y. Lin et al. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework. In Proceedings of the 2011 ACM SIGMOD, 2011. Google ScholarDigital Library
- Y. Xu et al. Integrating Hadoop and parallel DBMS. In Proceedings of the ACM SIGMOD, pages 969--974, 2010. Google ScholarDigital Library
- Y. Yu et al. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX OSDI, pages 1--14, 2008. Google ScholarDigital Library
- D. Florescu and D. Kossmann. Rethinking cost and performance of database systems. ACM SIGMOD Record, 38(1):43--48, 2009. Google ScholarDigital Library
- Saptarshi Guha. RHIPE-R and Hadoop Integrated Processing Environment. http://www.stat.purdue.edu/sguha/rhipe/, 2010.Google Scholar
- W. Lang and J.M. Patel. Energy management for MapReduce clusters. Proceedings of the VLDB, 3(1-2):129--139, 2010. Google ScholarDigital Library
- J. Lin and C. Dyer. Data-intensive text processing with MapReduce. Synthesis Lectures on Human Language Technologies, 3(1):1--177, 2010. Google ScholarCross Ref
- O. O'Malley and A.C. Murthy. Winning a 60 second dash with a yellow elephant. Proceedings of sort benchmark, 2009.Google Scholar
- D.A. Patterson. Technical perspective: the data center is the computer. Communications of the ACM, 51(1):105--105, 2008. Google ScholarDigital Library
- M.C. Schatz. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25(11):1363, 2009. Google ScholarDigital Library
- M. Stonebraker and U. Cetintemel. One size fits all: An idea whose time has come and gone. In Proceedings of the 21st IEEE ICDE, pages 2--11. IEEE, 2005. Google ScholarDigital Library
- R. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC bioinformatics, 11(Suppl 12):S1, 2010.Google ScholarCross Ref
- T. White. Hadoop: The Definitive Guide. Yahoo Press, 2010. Google ScholarDigital Library
Index Terms
- Parallel data processing with MapReduce: a survey
Recommendations
Efficient big data processing in Hadoop MapReduce
This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data ...
Prominence of MapReduce in Big Data Processing
CSNT '14: Proceedings of the 2014 Fourth International Conference on Communication Systems and Network TechnologiesBig Data has come up with aureate haste and a clef enabler for the social business, Big Data gifts an opportunity to create extraordinary business advantage and better service delivery. Big Data is bringing a positive change in the decision making ...
Hybrid storage architecture and efficient MapReduce processing for unstructured data
We present a hybrid storage architecture which integrates various kinds of data stores.We propose three partitioning strategies to execute MapReduce-based batch-processing jobs.Our hybrid solution shows 10% to 8.6 times faster performance than the ...
Comments