The family of MapReduce and large-scale data processing systems

Authors:
Sherif Sakr, NICTA and University of New South Wales, Sydney, Australia
Anna Liu, NICTA and University of New South Wales, Sydney, Australia
Ayman G. Fayoumi, King Abdulaziz University, Saudi Arabia
ACM Computing Surveys, Volume 46, Issue 1, Article No. 11, pp. 1–44. https://doi.org/10.1145/2522968.2522979

Published: 11 July 2013

Abstract

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework but target different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
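
To make the programming model concrete, the following is a minimal single-process sketch of a word count written in the map/reduce style. It is an illustration only, not the distributed implementation the article surveys: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the in-memory grouping step stands in for the data distribution, scheduling, and fault tolerance that a real framework would handle on behalf of the application.

```python
from collections import defaultdict


def map_fn(_doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: sum all counts that were grouped under the same word."""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory driver: map, group by key, then reduce.

    A real framework would run mappers and reducers on many machines and
    move the intermediate pairs over the network; here everything happens
    in one process so the data flow is easy to follow.
    """
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # stand-in for the shuffle phase

    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results


# Two tiny example documents (made up for this sketch).
docs = [
    ("d1", "the quick brown fox"),
    ("d2", "the lazy dog and the fox"),
]
word_counts = dict(run_mapreduce(docs, map_fn, reduce_fn))
print(word_counts)
```

The application code is limited to `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the abstract describes as being hidden from the application.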

Supplemental Material

sakr.zip (106.7 KB): supplemental movie, appendix, image, and software files for "The family of MapReduce and large-scale data processing systems".

Published in

ACM Computing Surveys Volume 46, Issue 1
October 2013
551 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2522968

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 11 July 2013
- Accepted: 1 January 2013
- Revised: 1 November 2012
- Received: 1 July 2012
Author Tags
MapReduce
big data
large-scale data processing
Qualifiers
- research-article
- Research
- Refereed