research-article

HaLoop: efficient iterative data processing on large clusters

Authors:
Yingyi Bu, University of Washington, Seattle, WA
Bill Howe, University of Washington, Seattle, WA
Magdalena Balazinska, University of Washington, Seattle, WA
Michael D. Ernst, University of Washington, Seattle, WA

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp 285–296. https://doi.org/10.14778/1920841.1920881

Published: 01 September 2010


Abstract

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. These platforms lack built-in support for iterative programs, which arise naturally in many applications, including data mining, web ranking, graph analysis, and model fitting. This paper presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets. Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.
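To illustrate why caching helps iterative programs, here is a minimal single-process sketch in Python. It is not HaLoop's API or implementation: the function names, the toy graph, and the "shuffled records" counter are all assumptions made for illustration. It runs a small PageRank-style loop in which the edge list is loop-invariant and the ranks change each iteration, and it counts how many records would be regrouped per iteration with and without reusing the invariant data.

```python
from collections import defaultdict

# Hypothetical toy graph; every node has at least one outgoing edge.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]


def pagerank_iteration(edges_by_src, ranks, damping=0.85):
    """One PageRank step: emit rank shares along edges, then sum per node."""
    contributions = defaultdict(float)
    for src, dsts in edges_by_src.items():
        share = ranks[src] / len(dsts)
        for dst in dsts:
            contributions[dst] += share
    n = len(ranks)
    return {node: (1 - damping) / n + damping * contributions[node]
            for node in ranks}


def run(edges, iterations, cache_invariant):
    """Run a fixed number of iterations and count regrouped ("shuffled") records.

    The counter is an illustrative stand-in for shuffle traffic, not a
    measurement of any real cluster.
    """
    nodes = sorted({node for edge in edges for node in edge})
    ranks = {node: 1.0 / len(nodes) for node in nodes}
    shuffled = 0
    cached_graph = None
    for _ in range(iterations):
        if cached_graph is None or not cache_invariant:
            # Regroup the loop-invariant edge list by source node.
            graph = defaultdict(list)
            for src, dst in edges:
                graph[src].append(dst)
            shuffled += len(edges)
            cached_graph = graph
        # The loop-variant ranks move on every iteration in both modes.
        shuffled += len(ranks)
        ranks = pagerank_iteration(cached_graph, ranks)
    return ranks, shuffled


ranks_plain, moved_plain = run(EDGES, iterations=5, cache_invariant=False)
ranks_cached, moved_cached = run(EDGES, iterations=5, cache_invariant=True)
print(moved_plain, moved_cached)  # prints: 35 19
```

Both runs produce the same ranks; only the amount of regrouped data differs. The counts in this toy example say nothing about the 4% figure reported in the abstract, which comes from the authors' own experiments.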



Published in

Proceedings of the VLDB Endowment Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher
VLDB Endowment

