research-article

Efficient big data processing in Hadoop MapReduce

Authors:
Jens Dittrich, Saarland University
Jorge-Arnulfo Quiané-Ruiz, Saarland University

Proceedings of the VLDB Endowment, Volume 5, Issue 12, pp. 2014–2015. https://doi.org/10.14778/2367502.2367562

Published: 01 August 2012

Abstract

This tutorial is motivated by the clear need of many organizations, companies, and researchers to process big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history: many techniques can be applied to Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we focus on different data management techniques, ranging from job optimization to physical data organization such as data layouts and indexes. Throughout the tutorial, we highlight the similarities and differences between Hadoop MapReduce and parallel DBMSs. Furthermore, we point out unresolved research problems and open issues.

Index Terms

Efficient big data processing in Hadoop MapReduce
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Document representation

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 12
August 2012
340 pages
ISSN:2150-8097
Publisher
VLDB Endowment