skip to main content
research-article

The family of mapreduce and large-scale data processing systems

Authors Info & Claims
Published:11 July 2013Publication History
Skip Abstract Section

Abstract

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large-scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

Skip Supplemental Material Section

Supplemental Material

References

  1. Abadi, D. J., Marcus, A., Madden, S., and Hollenbach, K. 2009. SW-store: A vertically partitioned dbms for semantic web data management. VLDB J. 18, 2, 385--406. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Rasin, A., and Silberschatz, A. 2009. HadoopDB: An architectural hybrid of mapreduce and dbms technologies for analytical workloads. Proc. VLDB Endow. 2, 1, 922--933. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Abouzeid, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D., and Silberschatz, A. 2010. HadoopDB in action: Building real world applications. In Proceedings of the 36th ACM SIGMOD International Conference on Management of Data (SIGMOD'10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Afrati, F. and Ullman, J. 2010. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology (EDBT'10). 99--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Afrati, F. N., Sarma, A. D., Menestrina, D., Parameswaran, A. G., and Ullman, J. D. 2012. Fuzzy joins using mapreduce. In Proceedings of the 28th IEEE International Conference on Data Engineering (ICDE'12). 498--509. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Afrati, F. N. and Ullman, J. D. 2011. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Engin. 23, 9, 1282--1298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Alexandrov, A., Battre, D., Ewen, S., Heimel, M., Hueske, F., Kao, O., Markl, V., Nijkamp, E., and Warneke, D. 2010. Massively parallel data analysis with pacts on nephele. Proc. VLDB Endow. 3, 2, 1625--1628. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Alvaro, P., Condie, T., Conway, N., Elmeleegy, K., Hellerstein, J. M., and Sears, R. 2010. Boom analytics: Exploring data-centric, declarative programming for the cloud. In Proceedings of the 5th European Conference on Computer Systems (EuroSys'10). 223--236. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Armbrust, M., Fox, A., Rean, G., Joseph, A., Katz, R., Konwinski, A., Gunho, L., David, P., Rabkin, A., Stoica, I., and Zaharia, M. 2009. Above the clouds: A berkeley view of cloud computing. http://www.cs.columbia.edu/∼roxana/teaching/COMS-E6998-7-Fall-2011/papers/armbrust-tr09.pdf.Google ScholarGoogle Scholar
  10. Babu, S. 2010. Towards automatic optimization of mapreduce programs. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). 137--142. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Balmin, A., Kaldewey, T., and Tata, S. 2012. Clydesdale: Structured data processing on hadoop. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'12). 705--708. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Battre, D., Ewen, S., Hueske, F., Kao, O., Markl, V., and Warneke, D. 2010. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Proceedings of the ACM Symposium on Cloud Computing (SoCC'10). 119--130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Behm, A., Borkar, V. R., Carey, M. J., Grover, R., Li, C., Onose, N., Vernica, R., Deutsch, A., Papakonstantinou, Y., and Tsotras, V. J. 2011. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distrib. Parallel Databases 29, 3, 185--216. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Bell, G., Gray, J., and Szalay, A. S. 2006. Petascale computational systems. IEEE Comput. 39, 1, 110--112. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Beyer, K. S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M. Y., Kanne, C.-C., Ozcan, F., and Shekita, E. J. 2011. Jaql: A scripting language for large scale semistructured data analysis. Proc. VLDB Endow. 4, 12, 1272--1283.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Bhatotia, P., Wieder, A., Rodrigues, R., Acar, U. A., and Pasquini, R. 2011. Incoop: MapReduce for Incremental computations. In Proceedings of the ACM Symposium on Cloud Computing (SoCC'11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Blanas, S., Patel, J., Ercegovac, V., Rao, J., Shekita, E., and Tian, Y. 2010. A comparison of join algorithms for log processing in mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 975--986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Boag, S., Chamberlin, D., Fernandez, M. F., Florescu, D., Robie, J., and Simeon, J. 2010. XQuery 1.0: An xml query language. http://www.w3.org/TR/xquery.Google ScholarGoogle Scholar
  19. Borkar, V., Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Bu, Y., Carey, M., Grover, R., Heilbron, Z., Kim, Y.-S., Li, C., Pirzadeh, P., Onose, N., Vernica, R., and Wen, J. 2012a. ASTERIX: An open source system for big data management and analysis. Proc. VLDB Endow. 5, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Borkar, V. R., Carey, M. J., Grover, R., Onose, N., and Vernica, R. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE'11). 1151--1162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Borkar, V. R., Carey, M. J., and Li, C. 2012b. Inside “big data management”: Ogres, onions, or parfaits? In Proceedings of the 15th International Conference on Extending Database Technology (EDBT'12). 3--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Bray, T., Paoli, J., Sperberg-Mcqueen, C. M., Maler, E., and Yergeau, F. 2008. Extensible markup language (xml) 1.0, 5th ed. http://www.w3.org/TR/REC-xml/.Google ScholarGoogle Scholar
  23. Bu, Y., Howe, B., Balazinska, M., and Ernst, M. 2010. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow. 3, 1, 285--296. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Cafarella, M. J. and Re, C. 2010. Manimal: Relational optimization for data-intensive programs. In Proceedings of the 13th International Workshop on the Web and Databases (WebDB'10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Cary, A., Sun, Z., Hristidis, V., and Rishe, N. 2009. Experiences on processing spatial data with mapreduce. In Proceedings of the 21st International Conference on Scientific and Statistical Database Management (SSDBM'09). 302--319. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Chaiken, R., Jenkins, B., Larson, P., Ramsey, B., Shakib, D., Weaver, S., and Zhou, J. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. Proc. VLDB Endow. 1, 2, 1265--1276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R. R., Bradshaw, R., and Weizenbaum, N. 2010. FlumeJava: Easy, efficient data-parallel pipelines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'10). 363--375. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Chattopadhyay, B., Lin, L., Liu, W., Mittal, S., Aragonda, P., Lychagina, V., Kwon, Y., and Wong, M. 2011. Tenzing a sql implementation on the mapreduce framework. Proc. VLDB Endow. 4, 12, 1318--1327.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Chen, R., Weng, X., He, B., and Yang, M. 2010. Large graph processing in the cloud. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 1123--1126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Condie, T., Chu, D., Hellerstein, J. M., and Maniatis, P. 2008. Evita raced: Metacompilation for declarative networks. Proc. VLDB Endow. 1, 1, 1153--1165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. 2010a. MapReduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI'10). 313--328. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Gerth, J., Talbot, J., Elmeleegy, K., and Sears, R. 2010b. Online aggregation and continuous query support in mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 1115--1118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Cordeiro, R. L. F., Traina, C., Jr., Traina, A. J. M., Lopez, J., Kang, U., and Faloutsos, C. 2011. Clustering very large multi-dimensional datasets with mapreduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). 690--698. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Das, S., Sismanis, Y., Beyer, K., Gemulla, R., Haas, P., and McPherson, J. 2010. Ricardo: Integrating r and hadoop. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 987--998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI'04). 137--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51, 1, 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Dean, J. and Ghemawat, S. 2010. MapReduce: A flexible data processing tool. Comm. ACM 53, 1, 72--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Dewitt, D. J. and Gray, J. 1992. Parallel database systems: The future of high performance database systems. Comm. ACM 35, 6, 85--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Dittrich, J., Quiane -Ruiz, J., Jindal, A., Kargin, Y., Setty, V., and Schad, J. 2010. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc. VLDB Endow. 3, 1, 518--529. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., and Fox, G. 2010. Twister: A runtime for iterative mapreduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC'10). 810--818. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Elghandour, I. and Aboulnaga, A. 2012a. ReStore: Reusing results of mapreduce jobs. Proc. VLDB Endow. 5, 6, 586--597. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Elghandour, I. and Aboulnaga, A. 2012b. ReStore: Reusing results of mapreduce jobs in pig. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 701--704. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Eltabakh, M. Y., Tian, Y., Ozcan, F., Gemulla, R., Krettek, A., and McPherson, J. 2011. CoHadoop: Flexible data placement and its exploitation in hadoop. Proc. VLDB Endow. 4, 9, 575--585. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Ene, A., Im, S., and Moseley, B. 2011. Fast clustering using mapreduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). 681--689. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Fegaras, L., Li, C., Gupta, U., and Philip, J. 2011. XML query optimization in map-reduce. In Proceedings of the International Workshop on the Web and Databases (WebDB).Google ScholarGoogle Scholar
  47. Floratou, A., Patel, J. M., Shekita, E. J., and Tata, S. 2011. Column-oriented storage techniques for mapreduce. Proc. VLDB Endow. 4, 7, 419--429. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Friedman, E., Pawlowski, P., and Cieslewicz, J. 2009. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. Proc. VLDB Endow. 2, 2, 1402--1413. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Gates, A. 2011. Programming Pig. O'Reilly Media. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Gates, A., Natkovich, O., Chopra, S., Kamath, P., Narayanam, S., Olston, C., Reed, B., Srinivasan, S., and Srivastava, U. 2009. Building a highlevel dataflow system on top of mapreduce: The pig experience. Proc. VLDB Endow. 2, 2, 1414--1425. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Ghemawat, S., Gobioff, H., and Leung, S. 2003. The google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP'03). 29--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Ghoting, A., Kambadur, P., Pednault, E. P. D., and Kannan, R. 2011a. NIMBLE: A toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). 334--342. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Ghoting, A., Krishnamurthy, R., Pednault, E. P. D., Reinwald, B., Sindhwani, V., Tatikonda, S., Tian, Y., and Vaithyanathan, S. 2011b. SystemML: Declarative machine learning on mapreduce. In Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE'11). 231--242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Halevy, A. Y. 2001. Answering queries using views: A survey. VLDB J. 10, 4, 270--294. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. He, Y., Lee, R., Huai, Y., Shao, Z., Jain, N., Zhang, X., and Xu, Z. 2011. RCFile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE'11). 1199--1208. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Herodotou, H. 2011. Hadoop performance models. CoRR abs/1106.0940. http://arxiv.org/abs/1106.0940.Google ScholarGoogle Scholar
  57. Herodotou, H. and Babu, S. 2011. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4, 11, 1111--1122.Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Herodotou, H., Dong, F., and Babu, S. 2011a. Mapreduce programming and cost-based optimization? crossing this chasm with starfish. Proc. VLDB Endow. 4, 12, 1446--1449.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., and Babu, S. 2011b. Starfish: A self-tuning system for big data analytics. In Proceedings of the 5th Conference on Innovative Data Systems Research (CIDR'11). 261--272.Google ScholarGoogle Scholar
  60. Hey, T., Tansley, S., and Tolle, K., eds. 2009. The fourth paradigm: Data-intensive scientific discovery. Microsoft Research. http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf.Google ScholarGoogle Scholar
  61. Hindman, B., Konwinski, A., Zaharia, M., and Stoica, I. 2009. A common substrate for cluster computing. In HotCloud Workshop held in conjunction with the USENIX Annual Technical Conference. https://www.usenix.org/legacy/event/hotcloud09/tech/full_papers/hindman.pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Huang, J., Abadi, D. J., and Ren, K. 2011. Scalable sparql querying of large rdf graphs. Proc. VLDB Endow. 4, 11, 1123--1134.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Husain, M. F., McGlothlin, J. P., Masud, M. M., Khan, L. R., and Thuraisingham, B. M. 2011. Heuristics-based query processing for large rdf graphs using cloud computing. IEEE Trans. Knowl. Data Engin. 23, 9, 1312--1327. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys'07). 59--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Jahani, E., Cafarella, M. J., and Re, C. 2011. Automatic optimization for mapreduce programs. Proc. VLDB Endow. 4, 6, 385--396. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Jiang, D., Ooi, B. C., Shi, L., and Wu, S. 2010. The performance of mapreduce: An in-depth study. Proc. VLDB Endow. 3, 1, 472--483. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Jiang, D., Tung, A. K. H., and Chen, G. 2011. MAP-JOIN-REDUCE: Toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Engin. 23, 9, 1299--1311. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Jindal, A., Quiane-Ruiz, J.-A., and Dittrich, J. 2011. Trojan data layouts: Right shoes for a running elephant. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SoCC'11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Kaldewey, T., Shekita, E. J., and Tata, S. 2012. Clydesdale: Structured data processing on mapreduce. In Proceedings of the 15th International Conference on Extending Database Technology (EDBT'12). 15--25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Kang, U., Meeder, B., and Faloutsos, C. 2011a. Spectral analysis for billion-scale graphs: Discoveries and implementation. In Proceedings of the 15th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining (PAKDD'11). 13--25. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Kang, U., Tong, H., Sun, J., Lin, C.-Y., and Faloutsos, C. 2011b. GBASE: A scalable and general graph management system. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11). 1091--1099. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Kang, U., Tsourakakis, C. E., and Faloutsos, C. 2009. PEGASUS: A peta-scale graph mining system. In Proceedings of the 9th IEEE International Conference on Data Mining (ICDM'09). 229--238. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Kang, U., Tsourakakis, C. E., and Faloutsos, C. 2011c. PEGASUS: Mining peta-scale graphs. Knowl. Inf. Syst. 27, 2, 303--325. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Khatchadourian, S., Consens, M. P., and Simeon, J. 2011. Having a chuql at xml on the cloud. In Proceedings of the 5th Alberto Mendelzon International Workshop on Foundations of Data Management (AMW'11).Google ScholarGoogle Scholar
  75. Kim, H., Ravindra, P., and Anyanwu, K. 2011. From sparql to mapreduce: The journey using a nested triple-group algebra. Proc. VLDB Endow. 4, 12, 1426--1429.Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Kolb, L., Thor, A., and Rahm, E. 2012a. Dedoop: Efficient deduplication with hadoop. Proc. VLDB Endow. 5, 12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Kolb, L., Thor, A., and Rahm, E. 2012b. Load balancing for mapreduce-based entity resolution. In Proceedings of the 28th International Conference on Data Engineering (ICDE'12). 618--629. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Kumar, V., andrade, H., Gedik, B., and Wu, K.-L. 2010. DEDUCE: At the intersection of mapreduce and stream processing. In Proceedings of the 13th International Conference on Extending Database Technology (EDBT'10). 657--662. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Lamport, L. 1998. The part-time parliament. ACM Trans. Comput. Syst. 16, 2, 133--169. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. Large Synoptic Survey. 2013. http://www.lsst.org/.Google ScholarGoogle Scholar
  81. Lattanzi, S., Moseley, B., Suri, S., and Vassilvitskii, S. 2011. Filtering: A method for solving graph problems in mapreduce. In Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'11). 85--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Lim, H., Herodotou, H., and Babu, S. 2012. Stubby: A transformation-based optimizer for mapreduce workflows. Proc. VLDB Endow. 5, 12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Lin, J. J. 2009. Brute force and indexed approaches to pairwise document similarity comparisons with mapreduce. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'09). 155--162. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Lin, Y., Agrawal, D., Chen, C., Ooi, B. C., and Wu, S. 2011. Llama: Leveraging columnar storage for scalable join processing in the mapreduce framework. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 961--972. Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. Logothetis, D. and Yocum, K. 2008. Ad-hoc data processing in the cloud. Proc. VLDB Endow. 1, 2, 1472--1475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. Loo, B. T., Condie, T., Hellerstein, J. M., Maniatis, P., Roscoe, T., and Stoica, I. 2005. Implementing declarative overlays. In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP'05). 75--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. 2010. GraphLab: A new framework for parallel machine learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI'10). 340--349.Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J. M. 2012. Distributed graphlab: A framework for machine learning in the cloud. Proc. VLDB Endow. 5, 8, 716--727. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. Malewicz, G., Austern, M., Bik, A., Dehnert, J., Horn, I., Leiser, N., and Czajkowski, G. 2010. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 135--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. Manola, F. and Miller, E. 2004. RDF Primer, W3C Recommendation. http://www.w3.org/TR/REC-rdf-syntax/.Google ScholarGoogle Scholar
  91. Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. 2010. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow. 3, 1, 330--339. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. Metwally, A. and Faloutsos, C. 2012. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc. VLDB Endow. 5, 8, 704--715. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. Morales, G. F., Gionis, A., and Sozio, M. 2011. Social content matching in mapreduce. Proc. VLDB Endow. 4, 7, 460--469. Google ScholarGoogle ScholarDigital LibraryDigital Library
  94. Morton, K., Balazinska, M., and Grossman, D. 2010a. ParaTimer: A progress indicator for mapreduce dags. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 507--518. Google ScholarGoogle ScholarDigital LibraryDigital Library
  95. Morton, K., Friesen, A., Balazinska, M., and Grossman, D. 2010b. Estimating the progress of mapreduce pipelines. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE'10). 681--684.Google ScholarGoogle Scholar
  96. Myung, J., Yeon, J., and Goo Lee, S. 2010. SPARQL basic graph pattern processing with iterative mapreduce. In Proceedings of the Workshop on Massive Data Analytics on the Cloud (MDAC'10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  97. Neumann, T. and Weikum, G. 2008. RDF-3x: A risc-style engine for rdf. Proc. VLDB Endow. 1, 1. Google ScholarGoogle ScholarDigital LibraryDigital Library
  98. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., and Koudas, N. 2010. MRShare: Sharing across multiple queries in mapreduce. Proc. VLDB Endow. 3, 1, 494--505. Google ScholarGoogle ScholarDigital LibraryDigital Library
  99. Odersky, M., Spoon, L., and Venners, B. 2011. Programming in Scala: A Comprehensive Step-by-Step Guide. Artima Inc. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. 2008. Pig latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1099--1110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  101. Panda, B., Herbach, J., Basu, S., and Bayardo, R. J. 2009. PLANET: Massively parallel learning of tree ensembles with mapreduce. Proc. VLDB Endow, 2, 2, 1426--1437. Google ScholarGoogle ScholarDigital LibraryDigital Library
  102. Papadimitriou, S. and Sun, J. 2008. DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08). 512--521. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Patterson, D. A. 2008. Technical perspective: The data center is the computer. Comm. ACM 51, 1, 105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  104. Pavlo, A., Paulson, E., Rasin, A., Abadi, D., Dewitt, D., Madden, S., and Stonebraker, M. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'09). 165--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. 2005. Interpreting the data: Parallel analysis with sawzall. Sci. Program. 13, 4, 277--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Prudhommeaux, E. and Seaborne, A. 2008. SPARQL query language for rdf, w3c recommendation. http://www.w3.org/TR/rdf-sparql-query/.Google ScholarGoogle Scholar
  107. Quiane-Ruiz, J.-A., Pinkel, C., Schad, J., and Dittrich, J. 2011a. RAFT at work: Speeding-up mapreduce applications under task and node failures. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1225--1228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  108. Quiane-Ruiz, J.-A., Pinkel, C., Schad, J., and Dittrich, J. 2011b. RAFTing mapreduce: Fast recovery on the raft. In Proceedings of the 27th IEEE International Conference on Data Engineering (ICDE'11). 589--600. Google ScholarGoogle ScholarDigital LibraryDigital Library
  109. Ravindra, P., Kim, H., and Anyanwu, K. 2011. An intermediate algebra for optimizing rdf graph pattern matching on mapreduce. In Proceedings of the 8th Extended Semantic Web Conference on the Semanic Web: Research and Applications (ESWC'11). 46--61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Schatzle, A., Przyjaciel-Zablocki, M., Hornung, T., and Lausen, G. 2011. PigSPARQL: Mapping sparql to pig latin. In Proceedings of the International Workshop on Semantic Web Information Management (SWIM'11). 65--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Selinger, P. G., Astrahan, M. M., Chamberlin, D. D., Lorie, R. A., and Price, T. G. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'79). 23--34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  112. Stonebraker, M. 1986. The case for shared nothing. IEEE Datab. Engin. Bull. 9, 1, 4--9.Google ScholarGoogle Scholar
  113. Stonebraker, M., Abadi, D., Dewitt, D., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. 2010. MapReduce and parallel dbmss: Friends or foes? Comm. ACM 53, 1, 64--71. Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. Stutz, P., Bernstein, A., and Cohen, W. W. 2010. Signal/collect: Graph algorithms for the (semantic) web. In Proceedings of the International Semantic Web Conference. 764--780. Google ScholarGoogle ScholarDigital LibraryDigital Library
  115. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. 2009. Hive - A warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 2, 1626--1629. Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. Thusoo, A., Sarma, J., Jain, N., Shao, Z., Chakka, P., Zhang, N., Anthony, S., Liu, H., and Murthy, R. 2010a. Hive - A petabyte scale data warehouse using hadoop. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE'10). 996--1005.Google ScholarGoogle Scholar
  117. Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sarma, J. S., Murthy, R., and Liu, H. 2010b. Data warehousing and analytics infrastructure at facebook. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 1013--1020. Google ScholarGoogle ScholarDigital LibraryDigital Library
  118. Ullman, J. D. 1990. Principles of Database and Knowledge Base Systems: Volume II: The New Technologies. W. H. Freeman and Co., New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  119. Valiant, L. G. 1990. A bridging model for parallel computation. Comm. ACM 33, 8, 103--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  120. Vernica, R., Carey, M., and Li, C. 2010. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 495--506. Google ScholarGoogle ScholarDigital LibraryDigital Library
  121. Wang, C., Wang, J., Lin, X., Wang, W., Wang, H., Li, H., Tian, W., Xu, J., and Li, R. 2010. MapDupReducer: Detecting near duplicates over massive datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 1119--1122. Google ScholarGoogle ScholarDigital LibraryDigital Library
  122. Wang, G., Xie, W., Demers, A., and Gehrke, J. 2013. Asynchronous large-scale graph processing made easy. In Proceedings of the 7th Conference on Innovative Data Systems Research (CIDR'13).Google ScholarGoogle Scholar
  123. White, T. 2012. Hadoop: The Definitive Guide. O'Reilly Media. Google ScholarGoogle ScholarDigital LibraryDigital Library
  124. Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. 2011. Efficient similarity joins for near-duplicate detection. ACM Trans. Datab. Syst. 36, 3, 15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  125. Yang, H., Dasdan, A., Hsiao, R., and Parker, D. 2007. Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'07). 1029--1040. Google ScholarGoogle ScholarDigital LibraryDigital Library
  126. Yang, H. and Parker, D. 2009. Traverse: Simplified indexing on large map-reduce-merge clusters. In Proceedings of the 14th International Conference on Database Systems for Advanced Applications (DASFAA'09). 308--322. Google ScholarGoogle ScholarDigital LibraryDigital Library
  127. Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P., and Currey, J. 2008. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). 1--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  128. Zaharia, M., Borthakur, D., Sarma, J. S., Elmeleegy, K., Shenker, S., and Stoica, I. 2010a. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems (EuroSys'10). 265--278. Google ScholarGoogle ScholarDigital LibraryDigital Library
  129. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  130. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. 2010b. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). 10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  131. Zaharia, M., Konwinski, A., Joseph, A., Katz, R., and Stoica, I. 2008. Improving mapreduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08). 29--42. Google ScholarGoogle ScholarDigital LibraryDigital Library
  132. Zhang, Y., Gao, Q., Gao, L., and Wang, C. 2012. IMapReduce: A Distributed Computing Framework for Iterative Computation. J. Grid Comput. 10, 1, 47--68. Google ScholarGoogle ScholarDigital LibraryDigital Library
  133. Zhou, J., Larson, P., and Chaiken, R. 2010. Incorporating partitioning and parallel plans into the SCOPE optimizer. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE'10). 1060--1071.Google ScholarGoogle Scholar
  134. Zukowski, M., Boncz, P. A., Nes, N., and Heman, S. 2005. MonetDB/X100 - A dbms in the cpu cache. IEEE Data Engin. Bull. 28, 2, 17--22.Google ScholarGoogle Scholar

Index Terms

  1. The family of mapreduce and large-scale data processing systems

              Recommendations

              Comments

              Login options

              Check if you have access through your login credentials or your institution to get full access on this article.

              Sign in

              Full Access

              • Published in

                cover image ACM Computing Surveys
                ACM Computing Surveys  Volume 46, Issue 1
                October 2013
                551 pages
                ISSN:0360-0300
                EISSN:1557-7341
                DOI:10.1145/2522968
                Issue’s Table of Contents

                Copyright © 2013 ACM

                Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                Publisher

                Association for Computing Machinery

                New York, NY, United States

                Publication History

                • Published: 11 July 2013
                • Accepted: 1 January 2013
                • Revised: 1 November 2012
                • Received: 1 July 2012
                Published in csur Volume 46, Issue 1

                Permissions

                Request permissions about this article.

                Request Permissions

                Check for updates

                Qualifiers

                • research-article
                • Research
                • Refereed

              PDF Format

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader