ABSTRACT
Data processing needs are changing with the ever-increasing amounts of both structured and unstructured data. While the processing of structured data typically relies on the well-developed field of relational database management systems (RDBMSs), MapReduce is a programming model developed to cope with processing immense amounts of unstructured data. MapReduce, however, offers features and advantages that can also be exploited to process structured data. Several database vendors and researchers have already turned to MapReduce to aid in processing relational data, which requires integrating MapReduce and RDBMS technologies. In this paper, we provide a taxonomy to characterize several existing integration methods. Further, we take a detailed look at DBInputFormat, an interface between Hadoop's MapReduce and a relational database. We identify the challenges posed by such an interface and provide suggestions for improvement.
Index Terms
- Integrating MapReduce and RDBMSs