nach oben

Erschienen in:

2015 | OriginalPaper | Buchkapitel

An Efficient Solution for Processing Skewed MapReduce Jobs

verfasst von : Reza Akbarinia, Miguel Liroz-Gistau, Divyakant Agrawal, Patrick Valduriez

Erschienen in: Database and Expert Systems Applications

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized in some points, in particular, its poor performance in the case of data skew. There are important cases where a high percentage of processing in the reduce side is done by a few nodes, or even one node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, there is no proposed solution for the cases where most of the intermediate values correspond to a single key, or when the number of keys is less than the number of reduce workers.

In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel, and efficiently deals with the problem of data skew in the reduce side. In FP-Hadoop, there is a new phase, called intermediate reduce (IR), in which blocks of intermediate values, constructed dynamically, are processed by intermediate reduce workers in parallel, by using a scheduling strategy. By using the IR phase, even if all intermediate values belong to only one key, the main part of the reducing work can be done in parallel by using the computing resources of all available workers. We implemented a prototype of FP-Hadoop, and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, e.g. more than 10 times in reduce time and 5 times in total execution time.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel An External Memory Algorithm for All-Pairs Regular Path Problem

Nächstes Kapitel MapReduce-DBMS: An Integration Model for Big Data Management and Optimization

http://dumps.wikimedia.org/other/pagecounts-raw/.

This program is just for illustration; actually, it is possible to write a more efficient code by leveraging the sorting mechanisms of MapReduce.

http://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia.

This mechanism is used for communication between the master and workers.

http://dumps.wikimedia.org/other/pagecounts-raw/.

http://dumps.wikimedia.org/enwiki/latest/.

http://www.umiacs.umd.edu/~jimmylin/Cloud9/docs/index.html.

http://webdatacommons.org/hyperlinkgraph/.

https://code.google.com/p/skewtune/

Hadoop (2014). http://hadoop.apache.org

Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: The HaLoop approach to large-scale iterative data analysis. VLDB J. 21(2), 169–190 (2012)CrossRef

Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI (2004)

Dittrich, J., Quiané-Ruiz, J.A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)

Elghandour, I., Aboulnaga, A.: ReStore: reusing results of MapReduce jobs in pig. In: SIGMOD (2012)

Elmeleegy, K., Olston, C., Reed, B.: SpongeFiles: mitigating data skew in mapreduce using distributed memory. In: SIGMOD (2014)

Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Load balancing in MapReduce based on scalable cardinality estimates. In: ICDE. IEEE, April 2012

Kwon, Y., Balazinska, M., Howe, B., Rolia, J.A.: SkewTune: mitigating skew in MapReduce applications. In: SIGMOD (2012)

Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D., Moon, B.: Parallel data processing with MapReduce: a survey. SIGMOD Rec. 40(4), 11–20 (2011)CrossRefMATH

10.

Ramakrishnan, S.R., Swart, G., Urmanov, A.: Balancing reducer skew in MapReduce workloads using progressive sampling. In: ACM Symposium on Cloud Computing, SoCC (2012)

11.

Rao, S., Ramakrishnan, R., Silberstein, A., Ovsiannikov, M., Reeves, D.: Sailfish: a framework for large scale data processing. In: ACM Symposium on Cloud Computing, SoCC (2012)

12.

White, T.: Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale, 3rd edn. O’Reilly, Sebastopol (2012)

13.

Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauly, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI (2012)

Titel: An Efficient Solution for Processing Skewed MapReduce Jobs
verfasst von: Reza Akbarinia
Miguel Liroz-Gistau
Divyakant Agrawal
Patrick Valduriez
Verlag: Springer International Publishing
Buch: Database and Expert Systems Applications
Print ISBN: 978-3-319-22851-8

Electronic ISBN: 978-3-319-22852-5

Copyright-Jahr: 2015
DOI: https://doi.org/10.1007/978-3-319-22852-5_35

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"