research-article

Only aggressive elephants are fast elephants

Authors:
Jens Dittrich

Information Systems Group, Saarland University

Information Systems Group, Saarland University
View Profile

,
Jorge-Arnulfo Quiané-Ruiz

Information Systems Group, Saarland University

Information Systems Group, Saarland University
View Profile

,
Stefan Richter

Information Systems Group, Saarland University

Information Systems Group, Saarland University
View Profile

,
Stefan Schuh

Information Systems Group, Saarland University

Information Systems Group, Saarland University
View Profile

,
Alekh Jindal

Information Systems Group, Saarland University

Information Systems Group, Saarland University
View Profile

,
Jörg Schad

Information Systems Group, Saarland University

Information Systems Group, Saarland University
View Profile

Proceedings of the VLDB Endowment Volume 5 Issue 11pp 1591–1602https://doi.org/10.14778/2350229.2350272

Published:01 July 2012Publication History

Proceedings of the VLDB Endowment

Abstract

Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast.

We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.

References

S. Agrawal et al. Database Tuning Advisor for Microsoft SQL Server 2005. VLDB, pages 1110--1121, 2004.Google ScholarCross Ref
A. Ailamaki et al. Weaving Relations for Cache Performance. VLDB, pages 169--180, 2001. Google ScholarDigital Library
S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD, pages 975--986, 2010. Google ScholarDigital Library
N. Bruno and S. Chaudhuri. Constrained Physical Design Tuning. VLDB J., 19(1):21--44, 2010. Google ScholarDigital Library
M. J. Cafarella and C. Ré. Manimal: Relational Optimization for Data-Intensive Programs. WebDB, 2010. Google ScholarDigital Library
S. Chaudhuri et al. Index Selection for Databases: A Hardness Study and a Principled Heuristic Solution. TKDE, 16(11):1313--1323, 2004. Google ScholarDigital Library
G. Chen et al. A Framework for Supporting DBMS-like Indexes in the Cloud. PVLDB, 4(11):702--713, 2011.Google ScholarDigital Library
S. Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. PVLDB, 3(1--2):1459--1468, 2010. Google ScholarDigital Library
D. Dash et al. CoPhy: A Scalable, Portable, and Interactive Index Advisor for Large Workloads. PVLDB, 4(6):362--372, 2011. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. CACM, 53(1):72--77, 2010. Google ScholarDigital Library
J. Dittrich and J.-A. Quiané-Ruiz. Efficient Parallel Data Processing in MapReduce Workflows. PVLDB, 5, 2012. Google ScholarDigital Library
J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
M. Y. Eltabakh et al. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. PVLDB, 4(9):575--585, 2011. Google ScholarDigital Library
A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, 4(7):419--429, 2011. Google ScholarDigital Library
http://engineering.twitter.com/2010/04/hadoop-at-twitter.html.Google Scholar
Hadoop Users, http://wiki.apache.org/hadoop/PoweredBy.Google Scholar
H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111--1122, 2011.Google ScholarDigital Library
S. Idreos et al. Here are my Data Files. Here are my Queries. Where are my Results? CIDR, pages 57--68, 2011.Google Scholar
E. Jahani et al. Automatic Optimization for MapReduce Programs. PVLDB, 4(6):385--396, 2011. Google ScholarDigital Library
D. Jiang et al. The Performance of MapReduce: An In-depth Study. PVLDB, 3(1):472--483, 2010. Google ScholarDigital Library
A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. SOCC, 2011. Google ScholarDigital Library
W. Lang and J. M. Patel. Energy Management for MapReduce Clusters. PVLDB, 3(1):129--139, 2010. Google ScholarDigital Library
J. Lin et al. Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics. MapReduce Workshop, 2011. Google ScholarDigital Library
D. Logothetis et al. In-Situ MapReduce for Log Processing. USENIX, 2011. Google ScholarDigital Library
T. Nykiel et al. MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB, 3(1):494--505, 2010. Google ScholarDigital Library
C. Olston. Keynote: Programming and Debugging Large-Scale Data Processing Workflows. SOCC, 2011.Google Scholar
A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, pages 165--178, 2009. Google ScholarDigital Library
J.-A. Quiané-Ruiz, C. Pinkel, J. Schad, and J. Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. ICDE, pages 589--600, 2011. Google ScholarDigital Library
J. Rao and K. Ross. Making B+-Trees Cache Conscious in Main Memory. ACM SIGMOD Record, 29(2):475--486, 2000. Google ScholarDigital Library
J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. PVLDB, 3(1):460--471, 2010. Google ScholarDigital Library
I. Stoica et al. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. SIGCOMM, pages 149--160, 2001. Google ScholarDigital Library
A. Thusoo et al. Data Warehousing and Analytics Infrastructure at Facebook. SIGMOD, pages 1013--1020, 2010. Google ScholarDigital Library
T. White. Hadoop: The Definitive Guide. O'Reilly, 2011. Google ScholarDigital Library
M. Zaharia et al. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys, pages 265--278, 2010. Google ScholarDigital Library

Recommendations

Can the elephants handle the NoSQL onslaught?

In this new era of "big data", traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other ...
Read More
XORing elephants: novel erasure codes for big data

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice ...
Read More
Fast Data Processing with Spark
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 11
July 2012
608 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2012
Published in pvldb Volume 5, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 29
  Total Citations
  View Citations
- 522
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Only aggressive elephants are fast elephants

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Can the elephants handle the NoSQL onslaught?

XORing elephants: novel erasure codes for big data

Fast Data Processing with Spark

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Only aggressive elephants are fast elephants

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Can the elephants handle the NoSQL onslaught?

XORing elephants: novel erasure codes for big data

Fast Data Processing with Spark

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media