nach oben

The Journal of Supercomputing

Erschienen in:

01.10.2013

An improved partitioning mechanism for optimizing massive data analysis using MapReduce

verfasst von: Kenn Slagter, Ching-Hsien Hsu, Yeh-Ching Chung, Daqiang Zhang

Erschienen in: The Journal of Supercomputing | Ausgabe 1/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

In the era of Big Data, huge amounts of structured and unstructured data are being produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. In order for MapReduce to work, it has to divide the workload among computers in a network. Consequently, the performance of MapReduce strongly depends on how evenly it distributes this workload. This can be a challenge, especially in the advent of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data. One way to avoid problems inherent from data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is and on how well the samples are analyzed by the partitioning mechanism. This paper proposes an improved partitioning algorithm that improves load balancing and memory consumption. This is done via an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state of the art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory efficient, and more accurate than the current implementation.

Vorheriger Artikel TM: a new and simple topology for interconnection networks

Nächster Artikel A strategy of multi-criteria decision-making task ranking in social-networks

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Candan KS, Kim JW, Nagarkar P, Nagendra M, Yu R (2010) RanKloud: scalable multimedia data processing in server clusters. IEEE MultiMed 18(1):64–77 CrossRef

Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrws M, Chandra T, Fikes A, Gruber RE (2006) Bigtable: a distributed storage system for structured data. In: 7th UENIX symposium on operating systems design and implementation, pp 205–218

Dean J, GhemawatDean S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51:107–113 CrossRef

Ghemawat S, Gobioff H, Leung S-T (2003) The Google file system. In: 19th ACM symposium on operating systems principles (SOSP)

Jiang W, Agrawal G (2011) Ex-MATE data intensive computing with large reduction objects and its application to graph mining. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 475–484

Jin C, Vecchiola C, Buyya R (2008) MRPGA: an extension of MapReduce for parallelizing genetic algorithms. In: IEEE fourth international conference on escience, pp 214–220 CrossRef

Kavulya S, Tany J, Gandhi R, Narasimhan P (2010) An analysis of traces from a production MapReduce cluster. In: IEEE/ACM international conference on cluster, cloud and grid computing, pp 94–95

Krishnan A (2005) GridBLAST: a globus-based high-throughput implementation of BLAST in a grid computing framework. Concurr Comput 17(13):1607–1623 CrossRef

Liu H, Orban D (2011) Cloud MapReduce: a MapReduce implementation on top of a cloud operating system. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 464–474

10.

Hsu C-H, Chen S-C (2012) Efficient selection strategies towards processor reordering techniques for improving data locality in heterogeneous clusters. J Supercomput 60(3):284–300 MathSciNetCrossRef

11.

Matsunaga A, Tsugawa M, Fortes J (2008) Programming abstractions for data intensive computing on clouds and grids. In: IEEE fourth international conference on escience, pp 489–493

12.

Miceli C, Miceli M, Jha S, Kaiser H, Merzky A (2009) Programming abstractions for data intensive computing on clouds and grids. In: IEEE/ACM international symposium on cluster computing and the grid, pp 480–483

13.

Panda B, Riedewald M, Fink D (2010) The model-summary problem and a solution for trees. In: International conference on data engineering, pp 452–455

14.

Papadimitriou S, Sun J (2008) Distributed co-clustering with map-reduce. In: IEEE international conference on data mining, p 519

15.

Hsu C-H, Chen SC (2010) A two-level scheduling strategy for optimizing communications of data parallel programs in clusters. Int J Ad Hoc Ubiq Comput 6(4):263–269 MathSciNetCrossRef

16.

Shafer J, Rixner S, Cox AL (2010) The hadoop distributed filesystem: balancing portability and performance. In: IEEE international symposium on performance analysis of system and software (ISPASS), p 123

17.

Stockinger H, Pagni M, Cerutti L, Falquet L (2006) Grid approach to embarrassingly parallel CPU-intensive bioinformatics problems. In: IEEE international conference on e-science and grid computing

18.

Tan J, Pan X, Kavulya S, Gandhi R, Narasimhan P (2009) Mochi: visual log-analysis based tools for debugging hadoop. In: USENIX workshop on hot topics in cloud computing (HotCloud)

19.

Hsu C-H, Tsai B-R (2009) Scheduling for atomic broadcast operation in heterogeneous networks with one port model. J Supercomput 50(3):269–288 CrossRef

20.

Vashishtha H, Smit M, Stroulia E (2010) Moving text analysis tools to the cloud. In: IEEE world congress on services, pp 110–112

21.

Verma A, Llor’a X, Goldberg DE, Campbell RH (2009) Scaling genetic algorithms using MapReduce. In: International conference on intelligent systems design and applications

22.

Xu W, Huang L, Fox A, Patterson D, Jordan M (2009) Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles (SOSP)

23.

Fadika Z, Govindaraju M (2011) DELMA: dynamic elastic MapReduce framework for CPU-intensive applications. In: IEEE/ACM international symposium on cluster, cloud and grid computing, pp 454–463

24.

O’Malley O (2008) TeraByte sort on Apache hadoop

25.

Apache software foundation (2007) Hadoop. http://hadoop.apache.org/core

26.

Hsu C-H, Chen T-L, Park J-H (2008) On improving resource utilization and system throughput of master slave jobs scheduling in heterogeneous systems. J Supercomput 45(1):129–150 CrossRef

27.

HBase. http://hadoop.apache.org/hbase/

28.

Zaharia M, Konwinski A, Joseph AD, Katz R, Stoica I (2008) Improving MapReduce performance in heterogeneous environments. In: OSDI

29.

Lynden S, Tanimura Y, Kojima I, Matono A (2011) Dynamic data redistribution for MapReduce joins. In: IEEE international conference on cloud computing technology and science, pp 713–717

30.

Groot S, Kitsuregawa M (2010) Jumbo: beyond MapReduce for workload balancing. In: VLDB, PhD workshop

31.

Heinz S, Zobel J, Williams H (2002) Burst tries: a fast, efficient data structure for string keys. ACM Trans Inf Syst 20(12):192–223 CrossRef

32.

Hsu C-H, Chen S-C, Lan C-Y (2007) Scheduling contention-free irregular redistribution in parallelizing compilers. J Supercomput 40(3):229–247 CrossRef

33.

Shannon CE (1951) Prediction and entropy of printed English. Bell Syst Tech J 30:50–64 CrossRefMATH

Titel: An improved partitioning mechanism for optimizing massive data analysis using MapReduce
verfasst von: Kenn Slagter
Ching-Hsien Hsu
Yeh-Ching Chung
Daqiang Zhang
Publikationsdatum: 01.10.2013
Verlag: Springer US
Erschienen in: The Journal of Supercomputing / Ausgabe 1/2013
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-013-0924-9

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Weitere Artikel der Ausgabe 1/2013

QoS based resource provisioning and scheduling in grids

Adaptive resource management scheme for monitoring of CPS

An integrated solution for QoS provision and congestion management in high-performance interconnection networks using deterministic source-based routing

Optimal management of dynamic information in Delay Tolerant Networks

Optimal tilt and orientation maps: a multi-algorithm approach for heterogeneous multicore-GPU systems

Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments