Abstract
Over the last two decades, the continuous increase in computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations, which many follow-up research efforts have tackled since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that adopt ideas similar to those of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
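To make the programming model concrete, the following is a minimal single-process sketch in Python of the map, group-by-key, and reduce phases, using word counting as the example. It is an illustration under stated assumptions, not the framework's implementation: the names `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical, and the data distribution, scheduling, and fault tolerance that a real MapReduce runtime handles are deliberately left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(_doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map phase (hypothetical helper): emit (word, 1) for each word in one record."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase (hypothetical helper): sum all values emitted for one key."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Single-process stand-in for the runtime: map, group by key, then reduce.

    A real framework would run mappers and reducers on many machines and
    move the intermediate pairs across the network; here a dictionary
    plays the role of that grouping step.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
word_counts = run_mapreduce(docs, map_words, reduce_counts)
```

The application author writes only the two small functions; everything inside `run_mapreduce` stands in for what the framework would otherwise hide from the application.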