research-article

Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Authors:
Jens Dittrich, Saarland University
Jorge-Arnulfo Quiané-Ruiz, Saarland University
Alekh Jindal, Saarland University and International Max Planck Research School for Computer Science
Yagiz Kargin, International Max Planck Research School for Computer Science
Vinay Setty, International Max Planck Research School for Computer Science
Jörg Schad, Saarland University

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 515–529. https://doi.org/10.14778/1920841.1920908

Published: 01 September 2010

Abstract

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even "notice" it). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from the inside. This has three important consequences. First, Hadoop++ significantly outperforms Hadoop. Second, any future changes to Hadoop may be used directly with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
