Abstract
Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast.
We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
- S. Agrawal et al. Database Tuning Advisor for Microsoft SQL Server 2005. VLDB, pages 1110--1121, 2004.Google ScholarCross Ref
- A. Ailamaki et al. Weaving Relations for Cache Performance. VLDB, pages 169--180, 2001. Google ScholarDigital Library
- S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD, pages 975--986, 2010. Google ScholarDigital Library
- N. Bruno and S. Chaudhuri. Constrained Physical Design Tuning. VLDB J., 19(1):21--44, 2010. Google ScholarDigital Library
- M. J. Cafarella and C. Ré. Manimal: Relational Optimization for Data-Intensive Programs. WebDB, 2010. Google ScholarDigital Library
- S. Chaudhuri et al. Index Selection for Databases: A Hardness Study and a Principled Heuristic Solution. TKDE, 16(11):1313--1323, 2004. Google ScholarDigital Library
- G. Chen et al. A Framework for Supporting DBMS-like Indexes in the Cloud. PVLDB, 4(11):702--713, 2011.Google ScholarDigital Library
- S. Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. PVLDB, 3(1--2):1459--1468, 2010. Google ScholarDigital Library
- D. Dash et al. CoPhy: A Scalable, Portable, and Interactive Index Advisor for Large Workloads. PVLDB, 4(6):362--372, 2011. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. CACM, 53(1):72--77, 2010. Google ScholarDigital Library
- J. Dittrich and J.-A. Quiané-Ruiz. Efficient Parallel Data Processing in MapReduce Workflows. PVLDB, 5, 2012. Google ScholarDigital Library
- J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
- M. Y. Eltabakh et al. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. PVLDB, 4(9):575--585, 2011. Google ScholarDigital Library
- A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, 4(7):419--429, 2011. Google ScholarDigital Library
- http://engineering.twitter.com/2010/04/hadoop-at-twitter.html.Google Scholar
- Hadoop Users, http://wiki.apache.org/hadoop/PoweredBy.Google Scholar
- H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111--1122, 2011.Google ScholarDigital Library
- S. Idreos et al. Here are my Data Files. Here are my Queries. Where are my Results? CIDR, pages 57--68, 2011.Google Scholar
- E. Jahani et al. Automatic Optimization for MapReduce Programs. PVLDB, 4(6):385--396, 2011. Google ScholarDigital Library
- D. Jiang et al. The Performance of MapReduce: An In-depth Study. PVLDB, 3(1):472--483, 2010. Google ScholarDigital Library
- A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. SOCC, 2011. Google ScholarDigital Library
- W. Lang and J. M. Patel. Energy Management for MapReduce Clusters. PVLDB, 3(1):129--139, 2010. Google ScholarDigital Library
- J. Lin et al. Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics. MapReduce Workshop, 2011. Google ScholarDigital Library
- D. Logothetis et al. In-Situ MapReduce for Log Processing. USENIX, 2011. Google ScholarDigital Library
- T. Nykiel et al. MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB, 3(1):494--505, 2010. Google ScholarDigital Library
- C. Olston. Keynote: Programming and Debugging Large-Scale Data Processing Workflows. SOCC, 2011.Google Scholar
- A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, pages 165--178, 2009. Google ScholarDigital Library
- J.-A. Quiané-Ruiz, C. Pinkel, J. Schad, and J. Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. ICDE, pages 589--600, 2011. Google ScholarDigital Library
- J. Rao and K. Ross. Making B+-Trees Cache Conscious in Main Memory. ACM SIGMOD Record, 29(2):475--486, 2000. Google ScholarDigital Library
- J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. PVLDB, 3(1):460--471, 2010. Google ScholarDigital Library
- I. Stoica et al. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. SIGCOMM, pages 149--160, 2001. Google ScholarDigital Library
- A. Thusoo et al. Data Warehousing and Analytics Infrastructure at Facebook. SIGMOD, pages 1013--1020, 2010. Google ScholarDigital Library
- T. White. Hadoop: The Definitive Guide. O'Reilly, 2011. Google ScholarDigital Library
- M. Zaharia et al. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys, pages 265--278, 2010. Google ScholarDigital Library
Recommendations
Can the elephants handle the NoSQL onslaught?
In this new era of "big data", traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other ...
XORing elephants: novel erasure codes for big data
Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice ...
Comments