ABSTRACT
MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From the years of experience in applying MapReduce to various scientific applications we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes such as Hadoop and DryadLINQ for large scale data parallel applications.
- }}J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107--113, 2008. Google ScholarDigital Library
- }}M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," presented at the Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal, 2007. Google ScholarDigital Library
- }}J. Ekanayake, A. Balkir, T. Gunarathne, G. Fox, C. Poulain, N. Araujo, and R. Barga, "DryadLINQ for Scientific Analyses," presented at the 5th IEEE International Conference on e-Science, Oxford UK, 2009. Google ScholarDigital Library
- }}J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for Data Intensive Scientific Analyses," presented at the Proceedings of the 2008 Fourth IEEE International Conference on eScience, 2008. Google ScholarDigital Library
- }}J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," in Cloud Computing and Software Services: Theory and Techniques, ed: CRC Press (Taylor and Francis).Google Scholar
- }}G. Fox, S.-H. Bae, J. Ekanayake, X. Qiu, and H. Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," presented at the International Advanced Research Workshop on High Performance Computing and Grids (HPC2008), Cetraro, Italy, 2008.Google Scholar
- }}MPI (Message Passing Interface). Available: http://www-unix.mcs.anl.gov/mpi/Google Scholar
- }}PVM (Parallel Virtual Machine). Available: http://www.csm.ornl.gov/pvm/Google Scholar
- }}C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in NIPS, ed: MIT Press, 2006, pp. 281--288.Google Scholar
- }}Apache Hadoop. Available: http://hadoop.apache.org/Google Scholar
- }}Y. Gu and R. L. Grossman, "Sector and Sphere: the design and implementation of a high-performance data cloud," Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, vol. 367, pp. 2429--2445, 2009.Google ScholarCross Ref
- }}Twister: A Runtime for Iterative MapReduce. Available: http://www.iterativemapreduce.org/Google Scholar
- }}Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and C. J., "DryadLINQ: A System for GeneralPurpose Distributed Data-Parallel Computing Using a HighLevel Language," in Symposium on Operating System Design and Implementation (OSDI), 2008. Google ScholarDigital Library
- }}Disco project. Available: http://discoproject.org/Google Scholar
- }}S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29--43, 2003. Google ScholarDigital Library
- }}J. B. MacQueen, "Some Methods for Classification and Analysis of MultiVariate Observations," in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1, L. M. L. Cam and J. Neyman, Eds., ed: University of California Press, 1967.Google Scholar
- }}K. Rose, E. Gurewwitz, and G. Fox, "A deterministic annealing approach to clustering," Pattern Recogn. Lett., vol. 11, pp. 589--594, 1990. Google ScholarDigital Library
- }}S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Available: http://infolab.stanford.edu/~backrub/google.html Google ScholarDigital Library
- }}J. de Leeuw, "Applications of convex analysis to multidimensional scaling," Recent Developments in Statistics, pp. 133--145, 1977.Google Scholar
- }}S. Pallickara and G. Fox, "NaradaBrokering: A Distributed Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids," presented at the Middleware 2003, 2003. Google ScholarDigital Library
- }}ActiveMQ. Available: http://activemq.apache.org/Google Scholar
- }}C. Moretti, H. Bui, K. Hollingsworth, B. Rich, P. Flynn, and D. Thain, "All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids," in IEEE Transactions on Parallel and Distributed Systems, 2010, pp. 33--46. Google ScholarDigital Library
- }}O. Gotoh, "An improved algorithm for matching biological sequences," Journal of Molecular Biology vol. 162, pp. 705--708, 1982.Google ScholarCross Ref
- }}Source Code. Smith Waterman Software. Available: http://jaligner.sourceforge.net/Google Scholar
- }}J. Qiu, J. Ekanayake, T. Gunarathne, J. Y. Choi, S.-H. Bae, Y. Ruan, S. Ekanayake, S. Wu, S. Beason, G. Fox, M. Rho, and H. Tang, "Data Intensive Computing for Bioinformatics," in Data Intensive Distributed Computing, ed: IGI Publishers, 2010.Google Scholar
- }}A. F. A. Smit, R. Hubley, and P. Green. (2004, Repeatmasker. Available: http://www.repeatmasker.orgGoogle Scholar
- }}J. Jurka, "Repbase Update: a database and an electronic journal of repetitive elements," Trends in Genetics, vol. 6, pp. 418--420, 2000.Google ScholarCross Ref
- }}J. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, vol. 29, pp. 1--27, 1964.Google ScholarCross Ref
- }}Y. Takane, Young, F. W., & de Leeuw, J., "Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features," Psychometrika, vol. 42, pp. 7--67, 1977.Google ScholarCross Ref
- }}I. Borg, & Groenen, P. J., Modern Multidimensional Scaling: Theory and Applications: Springer, 2005.Google Scholar
- }}Y. Zhu, S. Ye, and X. Li, "Distributed PageRank computation based on iterative aggregation-disaggregation methods," presented at the Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 2005. Google ScholarDigital Library
- }}S. Kamvar, T. Haveliwala, C. Manning, and G. Golub, "Exploiting the Block Structure of the Web for Computing PageRank," Stanford InfoLab, Technical Report 2003.Google Scholar
- }}The Power Method. Available: http://en.wikipedia.org/wiki/Pagerank#Power_MethodGoogle Scholar
- }}(2009, The ClueWeb09 Dataset. Available: http://boston.lti.cs.cmu.edu/Data/clueweb09/Google Scholar
- }}C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in 13th International Symposium on High-Performance Computer Architecture, 2007, pp. 13--24. Google ScholarDigital Library
- }}C. Team. (2009, Condor DAGMan. Available: http://www.cs.wisc.edu/condor/dagman/.Google Scholar
- }}LINQ Language-Integrated Query. Available: http://msdn.microsoft.com/en-us/netframework/aa904594.aspxGoogle Scholar
- }}Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. v. Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, "Swift: Fast, Reliable, Loosely Coupled Parallel Computation," in IEEE Congress on Services, 2007, pp. 199--206.Google ScholarCross Ref
Index Terms
- Twister: a runtime for iterative MapReduce
Recommendations
Integrating Pig with Harp to support iterative applications with fast cache and customized communication
DataCloud '14: Proceedings of the 5th International Workshop on Data-Intensive Computing in the CloudsUse of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data ...
A comparative analysis of iterative MapReduce systems
EDB '16: Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and TheorySince the development of MapReduce, there have been several efforts to extend data mining and machine learning algorithms for MapReduce. Many of those algorithms are iterative by nature. In order to process them efficiently, Spark as well as research ...
An Experimental Comparison of Iterative MapReduce Frameworks
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementMapReduce has become a dominant framework in big data analysis, and thus there have been significant efforts to implement various data analysis algorithms in MapReduce. Many data analysis algorithms are inherently iterative, repeating the same set of ...
Comments