research-article

Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Published: 01 September 2010

Abstract

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice it'). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through UDFs only and affect Hadoop from the inside. This has three important consequences. First, Hadoop++ significantly outperforms Hadoop. Second, any future changes to Hadoop can be used directly with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.

Published in: Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010 (1658 pages)

Publisher: VLDB Endowment
