Skip to main content
Erschienen in: The Journal of Supercomputing 1/2019

01.09.2018

Load balancing in join algorithms for skewed data in MapReduce systems

verfasst von: Elaheh Gavagsaz, Ali Rezaee, Hamid Haj Seyyed Javadi

Erschienen in: The Journal of Supercomputing | Ausgabe 1/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Join is an essential tool for data analysis which collected from different data sources. MapReduce has emerged as a prominent programming model for processing of massive data. However, traditional join algorithms based on MapReduce are not efficient when handling skewed data. The presence of data skew in input data leads to considerable load imbalance and performance degradation. This paper proposes a new skew-insensitive method, called fine-grained partitioning for skew data (FGSD) which can improve the load balancing for reduce tasks. The proposed method considers the properties of both input and output data through a proposed stream sampling algorithm. FGSD introduces a new approach for distribution of input data which leads to efficiently handling redistribution and join product skew. The experimental results confirm that our solution can not only achieve higher balancing performance, but also reduce the execution time of a job with varying degrees of the data skew. Furthermore, FGSD does not require any modification to the MapReduce environment and is applicable to complex join.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
12.
Zurück zum Zitat Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MaPreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp 975–986. https://doi.org/10.1145/1807167.1807273 Blanas S, Patel JM, Ercegovac V, Rao J, Shekita EJ, Tian Y (2010) A comparison of join algorithms for log processing in MaPreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp 975–986. https://​doi.​org/​10.​1145/​1807167.​1807273
16.
20.
Zurück zum Zitat DeWitt DJ, Naughton JF, Schneider DA, Seshadri S (1992) Practical Skew Handling in Parallel Joins. In: Proceedings of the 18th International Conference on Very Large Data Bases, pp 27–40 DeWitt DJ, Naughton JF, Schneider DA, Seshadri S (1992) Practical Skew Handling in Parallel Joins. In: Proceedings of the 18th International Conference on Very Large Data Bases, pp 27–40
32.
Zurück zum Zitat Cochran WG (1977) Sampling techniques. Wiley, New YorkMATH Cochran WG (1977) Sampling techniques. Wiley, New YorkMATH
35.
Zurück zum Zitat Meng X (2013) Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, vol 28, pp III-531–III-539 Meng X (2013) Scalable simple random sampling and stratified sampling. In: Proceedings of the 30th International Conference on International Conference on Machine Learning, vol 28, pp III-531–III-539
39.
Zurück zum Zitat Walton CB, Dale AG, Jenevein RM (1991) A taxonomy and performance model of data skew effects in parallel joins. In: Proceedings of the 17th International Conference on Very Large Data Bases, pp 537–548 Walton CB, Dale AG, Jenevein RM (1991) A taxonomy and performance model of data skew effects in parallel joins. In: Proceedings of the 17th International Conference on Very Large Data Bases, pp 537–548
40.
Zurück zum Zitat Harada L, Kitsuregawa M (1995) Dynamic join product skew handling for hash-joins in shared-nothing database systems. In: DASFAA Harada L, Kitsuregawa M (1995) Dynamic join product skew handling for hash-joins in shared-nothing database systems. In: DASFAA
41.
Zurück zum Zitat Jimmy L (2009) The curse of zipf and limits to parallelization: a look at the stragglers problem in MapReduce. In: Proceedings of LSDS-IR Workshop Jimmy L (2009) The curse of zipf and limits to parallelization: a look at the stragglers problem in MapReduce. In: Proceedings of LSDS-IR Workshop
42.
Zurück zum Zitat Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press, Boston Zipf GK (1949) Human behavior and the principle of least effort: an introduction to human ecology. Addison-Wesley Press, Boston
44.
Zurück zum Zitat Altman DG, Bland JM (1996) Statistics notes: detecting skewness from summary information. BMJ 313(7066):1200CrossRef Altman DG, Bland JM (1996) Statistics notes: detecting skewness from summary information. BMJ 313(7066):1200CrossRef
Metadaten
Titel
Load balancing in join algorithms for skewed data in MapReduce systems
verfasst von
Elaheh Gavagsaz
Ali Rezaee
Hamid Haj Seyyed Javadi
Publikationsdatum
01.09.2018
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 1/2019
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-018-2578-0

Weitere Artikel der Ausgabe 1/2019

The Journal of Supercomputing 1/2019 Zur Ausgabe

Premium Partner