Skip to main content
Erschienen in: The Journal of Supercomputing 1/2013

01.10.2013

An improved partitioning mechanism for optimizing massive data analysis using MapReduce

verfasst von: Kenn Slagter, Ching-Hsien Hsu, Yeh-Ching Chung, Daqiang Zhang

Erschienen in: The Journal of Supercomputing | Ausgabe 1/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In the era of Big Data, huge amounts of structured and unstructured data are being produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. In order for MapReduce to work, it has to divide the workload among computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the advent of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid problems inherent from data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the samples are analyzed by the partitioning mechanism. This paper proposes an improved partitioning algorithm that improves load balancing and memory consumption. This is done via an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state of the art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Candan KS, Kim JW, Nagarkar P, Nagendra M, Yu R (2010) RanKloud: scalable multimedia data processing in server clusters. IEEE MultiMed 18(1):64–77 CrossRef Candan KS, Kim JW, Nagarkar P, Nagendra M, Yu R (2010) RanKloud: scalable multimedia data processing in server clusters. IEEE MultiMed 18(1):64–77 CrossRef
2.
Zurück zum Zitat Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrws M, Chandra T, Fikes A, Gruber RE (2006) Bigtable: a distributed storage system for structured data. In: 7th UENIX symposium on operating systems design and implementation, pp 205–218 Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrws M, Chandra T, Fikes A, Gruber RE (2006) Bigtable: a distributed storage system for structured data. In: 7th UENIX symposium on operating systems design and implementation, pp 205–218
3.
Zurück zum Zitat Dean J, GhemawatDean S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51:107–113 CrossRef Dean J, GhemawatDean S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51:107–113 CrossRef
4.
Zurück zum Zitat Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: 19th ACM symposium on operating systems principles (SOSP) Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: 19th ACM symposium on operating systems principles (SOSP)
5.
Zurück zum Zitat Jiang W, Agrawal G (2011) Ex-MATE data intensive computing with large reduction objects and its application to graph mining. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 475–484 Jiang W, Agrawal G (2011) Ex-MATE data intensive computing with large reduction objects and its application to graph mining. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 475–484
6.
Zurück zum Zitat Jin C, Vecchiola C, Buyya R (2008) MRPGA: an extension of MapReduce for parallelizing genetic algorithms. In: IEEE fourth international conference on escience, pp 214–220 CrossRef Jin C, Vecchiola C, Buyya R (2008) MRPGA: an extension of MapReduce for parallelizing genetic algorithms. In: IEEE fourth international conference on escience, pp 214–220 CrossRef
7.
Zurück zum Zitat Kavulya S, Tany J, Gandhi R, Narasimhan P (2010) An analysis of traces from a production MapReduce cluster. In: IEEE/ACM international conference on cluster, cloud and grid computing, pp 94–95 Kavulya S, Tany J, Gandhi R, Narasimhan P (2010) An analysis of traces from a production MapReduce cluster. In: IEEE/ACM international conference on cluster, cloud and grid computing, pp 94–95
8.
Zurück zum Zitat Krishnan A (2005) GridBLAST: a globus-based high-throughput implementation of BLAST in a grid computing framework. Concurr Comput 17(13):1607–1623 CrossRef Krishnan A (2005) GridBLAST: a globus-based high-throughput implementation of BLAST in a grid computing framework. Concurr Comput 17(13):1607–1623 CrossRef
9.
Zurück zum Zitat Liu H, Orban D (2011) Cloud MapReduce: a MapReduce implementation on top of a cloud operating system. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 464–474 Liu H, Orban D (2011) Cloud MapReduce: a MapReduce implementation on top of a cloud operating system. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 464–474
10.
Zurück zum Zitat Hsu C-H, Chen S-C (2012) Efficient selection strategies towards processor reordering techniques for improving data locality in heterogeneous clusters. J Supercomput 60(3):284–300 MathSciNetCrossRef Hsu C-H, Chen S-C (2012) Efficient selection strategies towards processor reordering techniques for improving data locality in heterogeneous clusters. J Supercomput 60(3):284–300 MathSciNetCrossRef
11.
Zurück zum Zitat Matsunaga A, Tsugawa M, Fortes J (2008) Programming abstractions for data intensive computing on clouds and grids. In: IEEE fourth international conference on escience, pp 489–493 Matsunaga A, Tsugawa M, Fortes J (2008) Programming abstractions for data intensive computing on clouds and grids. In: IEEE fourth international conference on escience, pp 489–493
12.
Zurück zum Zitat Miceli C, Miceli M, Jha S, Kaiser H, Merzky A (2009) Programming abstractions for data intensive computing on clouds and grids. In: IEEE/ACM international symposium on cluster computing and the grid, pp 480–483 Miceli C, Miceli M, Jha S, Kaiser H, Merzky A (2009) Programming abstractions for data intensive computing on clouds and grids. In: IEEE/ACM international symposium on cluster computing and the grid, pp 480–483
13.
Zurück zum Zitat Panda B, Riedewald M, Fink D (2010) The model-summary problem and a solution for trees. In: International conference on data engineering, pp 452–455 Panda B, Riedewald M, Fink D (2010) The model-summary problem and a solution for trees. In: International conference on data engineering, pp 452–455
14.
Zurück zum Zitat Papadimitriou S, Sun J (2008) Distributed co-clustering with map-reduce. In: IEEE international conference on data mining, p 519 Papadimitriou S, Sun J (2008) Distributed co-clustering with map-reduce. In: IEEE international conference on data mining, p 519
15.
Zurück zum Zitat Hsu C-H, Chen SC (2010) A two-level scheduling strategy for optimizing communications of data parallel programs in clusters. Int J Ad Hoc Ubiq Comput 6(4):263–269 MathSciNetCrossRef Hsu C-H, Chen SC (2010) A two-level scheduling strategy for optimizing communications of data parallel programs in clusters. Int J Ad Hoc Ubiq Comput 6(4):263–269 MathSciNetCrossRef
16.
Zurück zum Zitat Shafer J, Rixner S, Cox AL (2010) The hadoop distributed filesystem: balancing portability and performance. In: IEEE international symposium on performance analysis of system and software (ISPASS), p 123 Shafer J, Rixner S, Cox AL (2010) The hadoop distributed filesystem: balancing portability and performance. In: IEEE international symposium on performance analysis of system and software (ISPASS), p 123
17.
Zurück zum Zitat Stockinger H, Pagni M, Cerutti L, Falquet L (2006) Grid approach to embarrassingly parallel CPU-intensive bioinformatics problems. In: IEEE international conference on e-science and grid computing Stockinger H, Pagni M, Cerutti L, Falquet L (2006) Grid approach to embarrassingly parallel CPU-intensive bioinformatics problems. In: IEEE international conference on e-science and grid computing
18.
Zurück zum Zitat Tan J, Pan X, Kavulya S, Gandhi R, Narasimhan P (2009) Mochi: visual log-analysis based tools for debugging hadoop. In: USENIX workshop on hot topics in cloud computing (HotCloud) Tan J, Pan X, Kavulya S, Gandhi R, Narasimhan P (2009) Mochi: visual log-analysis based tools for debugging hadoop. In: USENIX workshop on hot topics in cloud computing (HotCloud)
19.
Zurück zum Zitat Hsu C-H, Tsai B-R (2009) Scheduling for atomic broadcast operation in heterogeneous networks with one port model. J Supercomput 50(3):269–288 CrossRef Hsu C-H, Tsai B-R (2009) Scheduling for atomic broadcast operation in heterogeneous networks with one port model. J Supercomput 50(3):269–288 CrossRef
20.
Zurück zum Zitat Vashishtha H, Smit M, Stroulia E (2010) Moving text analysis tools to the cloud. In: IEEE world congress on services, pp 110–112 Vashishtha H, Smit M, Stroulia E (2010) Moving text analysis tools to the cloud. In: IEEE world congress on services, pp 110–112
21.
Zurück zum Zitat Verma A, Llor’a X, Goldberg DE, Campbell RH (2009) Scaling genetic algorithms using MapReduce. In: International conference on intelligent systems design and applications Verma A, Llor’a X, Goldberg DE, Campbell RH (2009) Scaling genetic algorithms using MapReduce. In: International conference on intelligent systems design and applications
22.
Zurück zum Zitat Xu W, Huang L, Fox A, Patterson D, Jordan M (2009) Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles (SOSP) Xu W, Huang L, Fox A, Patterson D, Jordan M (2009) Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles (SOSP)
23.
Zurück zum Zitat Fadika Z, Govindaraju M (2011) DELMA: dynamic elastic MapReduce framework for CPU-intensive applications. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 454–463 Fadika Z, Govindaraju M (2011) DELMA: dynamic elastic MapReduce framework for CPU-intensive applications. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 454–463
24.
Zurück zum Zitat O’Malley O (2008) TeraByte sort on Apache hadoop O’Malley O (2008) TeraByte sort on Apache hadoop
26.
Zurück zum Zitat Hsu C-H, Chen T-L, Park J-H (2008) On improving resource utilization and system throughput of master slave jobs scheduling in heterogeneous systems. J Supercomput 45(1):129–150 CrossRef Hsu C-H, Chen T-L, Park J-H (2008) On improving resource utilization and system throughput of master slave jobs scheduling in heterogeneous systems. J Supercomput 45(1):129–150 CrossRef
28.
Zurück zum Zitat Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: OSDI Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: OSDI
29.
Zurück zum Zitat Lynden S, Tanimura Y, Kojima I, Matono A (2011) Dynamic data redistribution for MapReduce joins. In: IEEE international conference on cloud computing technology and science, pp 713–717 Lynden S, Tanimura Y, Kojima I, Matono A (2011) Dynamic data redistribution for MapReduce joins. In: IEEE international conference on cloud computing technology and science, pp 713–717
30.
Zurück zum Zitat Groot S, Kitsuregawa M (2010) Jumbo: beyond MapReduce for workload balancing. In: VLDB, PhD workshop Groot S, Kitsuregawa M (2010) Jumbo: beyond MapReduce for workload balancing. In: VLDB, PhD workshop
31.
Zurück zum Zitat Heinz S, Zobel J, Williams H (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans Inf Syst 20(12):192–223 CrossRef Heinz S, Zobel J, Williams H (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans Inf Syst 20(12):192–223 CrossRef
32.
Zurück zum Zitat Hsu C-H, Chen S-C, Lan C-Y (2007) Scheduling contention-free irregular redistribution in parallelizing compilers. J Supercomput 40(3):229–247 CrossRef Hsu C-H, Chen S-C, Lan C-Y (2007) Scheduling contention-free irregular redistribution in parallelizing compilers. J Supercomput 40(3):229–247 CrossRef
33.
Zurück zum Zitat Shannon CE (1951) Prediction and entropy of printed English. Bell Syst Tech J 30:50–64 CrossRefMATH Shannon CE (1951) Prediction and entropy of printed English. Bell Syst Tech J 30:50–64 CrossRefMATH
Metadaten
Titel
An improved partitioning mechanism for optimizing massive data analysis using MapReduce
verfasst von
Kenn Slagter
Ching-Hsien Hsu
Yeh-Ching Chung
Daqiang Zhang
Publikationsdatum
01.10.2013
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 1/2013
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-013-0924-9

Weitere Artikel der Ausgabe 1/2013

The Journal of Supercomputing 1/2013 Zur Ausgabe