research-article

Twister: a runtime for iterative MapReduce

Authors:
Jaliya Ekanayake

Indiana University Bloomington

Indiana University Bloomington
View Profile

,
Hui Li

Indiana University Bloomington

Indiana University Bloomington
View Profile

,
Bingjing Zhang

Indiana University Bloomington

Indiana University Bloomington
View Profile

,
Thilina Gunarathne

Indiana University Bloomington

Indiana University Bloomington
View Profile

,
Seung-Hee Bae

Indiana University Bloomington

Indiana University Bloomington
View Profile

,
Judy Qiu

Indiana University Bloomington

Indiana University Bloomington
View Profile

,
Geoffrey Fox

Indiana University Bloomington

Indiana University Bloomington
View Profile

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed ComputingJune 2010Pages 810–818https://doi.org/10.1145/1851476.1851593

Published:21 June 2010Publication History

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

Pages 810–818

ABSTRACT

MapReduce programming model has simplified the implementation of many data parallel applications. The simplicity of the programming model and the quality of services provided by many implementations of MapReduce attract a lot of enthusiasm among distributed computing communities. From the years of experience in applying MapReduce to various scientific applications we identified a set of extensions to the programming model and improvements to its architecture that will expand the applicability of MapReduce to more classes of applications. In this paper, we present the programming model and the architecture of Twister an enhanced MapReduce runtime that supports iterative MapReduce computations efficiently. We also show performance comparisons of Twister with other similar runtimes such as Hadoop and DryadLINQ for large scale data parallel applications.

References

}}J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, pp. 107--113, 2008. Google ScholarDigital Library
}}M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," presented at the Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, Lisbon, Portugal, 2007. Google ScholarDigital Library
}}J. Ekanayake, A. Balkir, T. Gunarathne, G. Fox, C. Poulain, N. Araujo, and R. Barga, "DryadLINQ for Scientific Analyses," presented at the 5th IEEE International Conference on e-Science, Oxford UK, 2009. Google ScholarDigital Library
}}J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for Data Intensive Scientific Analyses," presented at the Proceedings of the 2008 Fourth IEEE International Conference on eScience, 2008. Google ScholarDigital Library
}}J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, and G. Fox, "High Performance Parallel Computing with Clouds and Cloud Technologies," in Cloud Computing and Software Services: Theory and Techniques, ed: CRC Press (Taylor and Francis).Google Scholar
}}G. Fox, S.-H. Bae, J. Ekanayake, X. Qiu, and H. Yuan, "Parallel Data Mining from Multicore to Cloudy Grids," presented at the International Advanced Research Workshop on High Performance Computing and Grids (HPC2008), Cetraro, Italy, 2008.Google Scholar
}}MPI (Message Passing Interface). Available: http://www-unix.mcs.anl.gov/mpi/Google Scholar
}}PVM (Parallel Virtual Machine). Available: http://www.csm.ornl.gov/pvm/Google Scholar
}}C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," in NIPS, ed: MIT Press, 2006, pp. 281--288.Google Scholar
}}Apache Hadoop. Available: http://hadoop.apache.org/Google Scholar
}}Y. Gu and R. L. Grossman, "Sector and Sphere: the design and implementation of a high-performance data cloud," Philosophical transactions. Series A, Mathematical, physical, and engineering sciences, vol. 367, pp. 2429--2445, 2009.Google ScholarCross Ref
}}Twister: A Runtime for Iterative MapReduce. Available: http://www.iterativemapreduce.org/Google Scholar
}}Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and C. J., "DryadLINQ: A System for GeneralPurpose Distributed Data-Parallel Computing Using a HighLevel Language," in Symposium on Operating System Design and Implementation (OSDI), 2008. Google ScholarDigital Library
}}Disco project. Available: http://discoproject.org/Google Scholar
}}S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," SIGOPS Oper. Syst. Rev., vol. 37, pp. 29--43, 2003. Google ScholarDigital Library
}}J. B. MacQueen, "Some Methods for Classification and Analysis of MultiVariate Observations," in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability. vol. 1, L. M. L. Cam and J. Neyman, Eds., ed: University of California Press, 1967.Google Scholar
}}K. Rose, E. Gurewwitz, and G. Fox, "A deterministic annealing approach to clustering," Pattern Recogn. Lett., vol. 11, pp. 589--594, 1990. Google ScholarDigital Library
}}S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Available: http://infolab.stanford.edu/~backrub/google.html Google ScholarDigital Library
}}J. de Leeuw, "Applications of convex analysis to multidimensional scaling," Recent Developments in Statistics, pp. 133--145, 1977.Google Scholar
}}S. Pallickara and G. Fox, "NaradaBrokering: A Distributed Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids," presented at the Middleware 2003, 2003. Google ScholarDigital Library
}}ActiveMQ. Available: http://activemq.apache.org/Google Scholar
}}C. Moretti, H. Bui, K. Hollingsworth, B. Rich, P. Flynn, and D. Thain, "All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids," in IEEE Transactions on Parallel and Distributed Systems, 2010, pp. 33--46. Google ScholarDigital Library
}}O. Gotoh, "An improved algorithm for matching biological sequences," Journal of Molecular Biology vol. 162, pp. 705--708, 1982.Google ScholarCross Ref
}}Source Code. Smith Waterman Software. Available: http://jaligner.sourceforge.net/Google Scholar
}}J. Qiu, J. Ekanayake, T. Gunarathne, J. Y. Choi, S.-H. Bae, Y. Ruan, S. Ekanayake, S. Wu, S. Beason, G. Fox, M. Rho, and H. Tang, "Data Intensive Computing for Bioinformatics," in Data Intensive Distributed Computing, ed: IGI Publishers, 2010.Google Scholar
}}A. F. A. Smit, R. Hubley, and P. Green. (2004, Repeatmasker. Available: http://www.repeatmasker.orgGoogle Scholar
}}J. Jurka, "Repbase Update: a database and an electronic journal of repetitive elements," Trends in Genetics, vol. 6, pp. 418--420, 2000.Google ScholarCross Ref
}}J. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis," Psychometrika, vol. 29, pp. 1--27, 1964.Google ScholarCross Ref
}}Y. Takane, Young, F. W., & de Leeuw, J., "Nonmetric individual differences multidimensional scaling: an alternating least squares method with optimal scaling features," Psychometrika, vol. 42, pp. 7--67, 1977.Google ScholarCross Ref
}}I. Borg, & Groenen, P. J., Modern Multidimensional Scaling: Theory and Applications: Springer, 2005.Google Scholar
}}Y. Zhu, S. Ye, and X. Li, "Distributed PageRank computation based on iterative aggregation-disaggregation methods," presented at the Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 2005. Google ScholarDigital Library
}}S. Kamvar, T. Haveliwala, C. Manning, and G. Golub, "Exploiting the Block Structure of the Web for Computing PageRank," Stanford InfoLab, Technical Report 2003.Google Scholar
}}The Power Method. Available: http://en.wikipedia.org/wiki/Pagerank#Power_MethodGoogle Scholar
}}(2009, The ClueWeb09 Dataset. Available: http://boston.lti.cs.cmu.edu/Data/clueweb09/Google Scholar
}}C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in 13th International Symposium on High-Performance Computer Architecture, 2007, pp. 13--24. Google ScholarDigital Library
}}C. Team. (2009, Condor DAGMan. Available: http://www.cs.wisc.edu/condor/dagman/.Google Scholar
}}LINQ Language-Integrated Query. Available: http://msdn.microsoft.com/en-us/netframework/aa904594.aspxGoogle Scholar
}}Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. v. Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, "Swift: Fast, Reliable, Loosely Coupled Parallel Computation," in IEEE Congress on Services, 2007, pp. 199--206.Google ScholarCross Ref

Index Terms

Twister: a runtime for iterative MapReduce
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
  2. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Distributed programming languages
        Parallel programming languages

Recommendations

Integrating Pig with Harp to support iterative applications with fast cache and customized communication
DataCloud '14: Proceedings of the 5th International Workshop on Data-Intensive Computing in the Clouds

Use of high-level scripting languages to solve big data problems has become a mainstream approach for sophisticated machine learning data analysis. Often data must be used in several steps of a computation to complete a full task. Composing default data ...
Read More
A comparative analysis of iterative MapReduce systems
EDB '16: Proceedings of the Sixth International Conference on Emerging Databases: Technologies, Applications, and Theory

Since the development of MapReduce, there have been several efforts to extend data mining and machine learning algorithms for MapReduce. Many of those algorithms are iterative by nature. In order to process them efficiently, Spark as well as research ...
Read More
An Experimental Comparison of Iterative MapReduce Frameworks
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

MapReduce has become a dominant framework in big data analysis, and thus there have been significant efforts to implement various data analysis algorithms in MapReduce. Many data analysis algorithms are inherently iterative, repeating the same set of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
June 2010
911 pages
ISBN:9781605589428
DOI:10.1145/1851476
General Chairs:
Salim Hariri
University of Arizona
,
Kate Keahey
University of Chicago
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 21 June 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
MapReduce
cloud technologies
iterative algorithms
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate166of966submissions,17%
Upcoming Conference
HPDC '24

Sponsor:

sigarch

The 33rd International Symposium on High-Performance Parallel and Distributed Computing

June 3 - 7, 2024

Pisa , Italy
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 550
  Total Citations
  View Citations
- 2,890
  Total Downloads
- Downloads (Last 12 months)43
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Twister: a runtime for iterative MapReduce

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Integrating Pig with Harp to support iterative applications with fast cache and customized communication

A comparative analysis of iterative MapReduce systems

An Experimental Comparison of Iterative MapReduce Frameworks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Twister: a runtime for iterative MapReduce

HPDC '10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Integrating Pig with Harp to support iterative applications with fast cache and customized communication

A comparative analysis of iterative MapReduce systems

An Experimental Comparison of Iterative MapReduce Frameworks

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media