skip to main content
research-article

Only aggressive elephants are fast elephants

Published:01 July 2012Publication History
Skip Abstract Section

Abstract

Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast.

We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.

References

  1. S. Agrawal et al. Database Tuning Advisor for Microsoft SQL Server 2005. VLDB, pages 1110--1121, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  2. A. Ailamaki et al. Weaving Relations for Cache Performance. VLDB, pages 169--180, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Blanas et al. A Comparison of Join Algorithms for Log Processing in MapReduce. SIGMOD, pages 975--986, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. N. Bruno and S. Chaudhuri. Constrained Physical Design Tuning. VLDB J., 19(1):21--44, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. J. Cafarella and C. Ré. Manimal: Relational Optimization for Data-Intensive Programs. WebDB, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Chaudhuri et al. Index Selection for Databases: A Hardness Study and a Principled Heuristic Solution. TKDE, 16(11):1313--1323, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Chen et al. A Framework for Supporting DBMS-like Indexes in the Cloud. PVLDB, 4(11):702--713, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Chen. Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce. PVLDB, 3(1--2):1459--1468, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. Dash et al. CoPhy: A Scalable, Portable, and Interactive Index Advisor for Large Workloads. PVLDB, 4(6):362--372, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Dean and S. Ghemawat. MapReduce: A Flexible Data Processing Tool. CACM, 53(1):72--77, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Dittrich and J.-A. Quiané-Ruiz. Efficient Parallel Data Processing in MapReduce Workflows. PVLDB, 5, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB, 3(1):518--529, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Y. Eltabakh et al. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. PVLDB, 4(9):575--585, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, 4(7):419--429, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. http://engineering.twitter.com/2010/04/hadoop-at-twitter.html.Google ScholarGoogle Scholar
  16. Hadoop Users, http://wiki.apache.org/hadoop/PoweredBy.Google ScholarGoogle Scholar
  17. H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. PVLDB, 4(11):1111--1122, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Idreos et al. Here are my Data Files. Here are my Queries. Where are my Results? CIDR, pages 57--68, 2011.Google ScholarGoogle Scholar
  19. E. Jahani et al. Automatic Optimization for MapReduce Programs. PVLDB, 4(6):385--396, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. Jiang et al. The Performance of MapReduce: An In-depth Study. PVLDB, 3(1):472--483, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Jindal, J.-A. Quiané-Ruiz, and J. Dittrich. Trojan Data Layouts: Right Shoes for a Running Elephant. SOCC, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. W. Lang and J. M. Patel. Energy Management for MapReduce Clusters. PVLDB, 3(1):129--139, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Lin et al. Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics. MapReduce Workshop, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Logothetis et al. In-Situ MapReduce for Log Processing. USENIX, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. T. Nykiel et al. MRShare: Sharing Across Multiple Queries in MapReduce. PVLDB, 3(1):494--505, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. C. Olston. Keynote: Programming and Debugging Large-Scale Data Processing Workflows. SOCC, 2011.Google ScholarGoogle Scholar
  27. A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD, pages 165--178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. J.-A. Quiané-Ruiz, C. Pinkel, J. Schad, and J. Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. ICDE, pages 589--600, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J. Rao and K. Ross. Making B+-Trees Cache Conscious in Main Memory. ACM SIGMOD Record, 29(2):475--486, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance. PVLDB, 3(1):460--471, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. I. Stoica et al. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. SIGCOMM, pages 149--160, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. Thusoo et al. Data Warehousing and Analytics Infrastructure at Facebook. SIGMOD, pages 1013--1020, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. T. White. Hadoop: The Definitive Guide. O'Reilly, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Zaharia et al. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling. EuroSys, pages 265--278, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2012
    Published in pvldb Volume 5, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader