Abstract
MapReduce is a computing paradigm that has gained a lot of attention in recent years from both industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts task performance without changing the Hadoop framework at all (Hadoop does not even 'notice' it). To reach this goal, rather than changing a working system (Hadoop), we inject our technology at the right places through user-defined functions (UDFs) only and affect Hadoop from the inside. This has three important consequences. First, Hadoop++ significantly outperforms Hadoop. Second, any future changes to Hadoop can be used directly with Hadoop++ without rewriting any glue code. Third, Hadoop++ does not need to change the Hadoop interface. Our experiments show the superiority of Hadoop++ over both Hadoop and HadoopDB for tasks related to indexing and join processing.
Index Terms
- Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)