skip to main content
research-article

HaLoop: efficient iterative data processing on large clusters

Published:01 September 2010Publication History
Skip Abstract Section

Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph analysis, model fitting, and so on. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, it also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by 1.85, and shuffles only 4% of the data between mappers and reducers.

References

  1. http://www.nsf.gov/pubs/2008/nsf08560/nsf08560.htm. Accessed July 7, 2010.Google ScholarGoogle Scholar
  2. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, and Avi Silberschatz. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. VLDB, 2(1):922--933, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. François Bancilhon and Raghu Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD Conference, pages 16--52, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. David J. DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85--98, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Jaliya Ekanayake and Shrideep Pallickara. MapReduce for data intensive scientific analysis. In IEEE eScience, pages 277--284, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Hadoop. http://hadoop.apache.org/. Accessed July 7, 2010.Google ScholarGoogle Scholar
  8. Hdfs. http://hadoop.apache.org/common/docs/current/hdfs_design.html. Accessed July 7, 2010.Google ScholarGoogle Scholar
  9. Hive. http://hadoop.apache.org/hive/. Accessed July 7, 2010.Google ScholarGoogle Scholar
  10. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59--72, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604--632, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Mahout. http://lucene.apache.org/mahout/. Accessed July 7, 2010.Google ScholarGoogle Scholar
  13. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135--146, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999--66, Stanford InfoLab, 1999.Google ScholarGoogle Scholar
  16. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165--178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Weining Zhang, Ke Wang, and Siu-Cheung Chau. Data partition and parallel evaluation of datalog programs. IEEE Trans. Knowl. Data Eng., 7(1):163--176, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 3, Issue 1-2
    September 2010
    1658 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 September 2010
    Published in pvldb Volume 3, Issue 1-2

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader