The Journal of Supercomputing

Published: 21.03.2017

A speculative parallel decompression algorithm on Apache Spark

Authors: Zhoukai Wang, Yinliang Zhao, Yang Liu, Zhong Chen, Cuocuo Lv, Yuxiang Li

Published in: The Journal of Supercomputing | Issue 9/2017

Abstract

Data decompression is one of the most important techniques in data processing and is widely used in multimedia information transmission and processing. However, existing decompression algorithms on multicore platforms are time-consuming and do not handle large datasets well. To expand parallelism and improve decompression efficiency on large-scale datasets, this paper proposes a speculative parallel decompression algorithm on Apache Spark, based on software thread-level speculation. By analyzing the structure of the compressed data, the algorithm first uses a function to divide the compressed data into blocks that can be decompressed independently, and then spawns a number of threads to decompress the blocks speculatively in parallel. Finally, the speculative results are merged to form the final output. Compared with the conventional parallel approach on a multicore platform, the proposed algorithm is highly efficient and achieves a high degree of parallelism by making full use of the cluster's resources. Experiments show that the proposed approach achieves an average speedup of 2.6× over the traditional approach. In addition, as the number of worker nodes grows, the execution time decreases gradually and the speedup scales linearly. The results indicate that this speculative parallel algorithm can significantly improve decompression efficiency.
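The divide, speculate, and merge pattern described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: it assumes the input is a concatenation of bzip2 streams, replaces Spark executors with a local thread pool, and uses only the Python standard library. The function names (`candidate_offsets`, `try_decompress`, `decompress_at`, `speculative_decompress`) are hypothetical and chosen for illustration only.

```python
import bz2
import re
from concurrent.futures import ThreadPoolExecutor

# Assumption: a new bzip2 stream starts with "BZh" + level digit, followed by
# the block magic 0x314159265359 ("1AY&SY"). A match is only a guess, because
# the same bytes could in principle occur inside compressed data.
MARKER = re.compile(rb"BZh[1-9]1AY&SY")


def candidate_offsets(data):
    """Divide step: guess the offsets at which independent streams begin."""
    offsets = [m.start() for m in MARKER.finditer(data)]
    if not offsets or offsets[0] != 0:
        offsets.insert(0, 0)  # the first segment always starts at byte 0
    return offsets


def try_decompress(chunk):
    """Speculative task: return the decoded bytes, or None if the guess failed."""
    try:
        return bz2.decompress(chunk)
    except (OSError, ValueError):
        return None


def decompress_at(data, offsets, workers=4):
    """Decompress the segments between offsets in parallel, then merge in order."""
    bounds = list(zip(offsets, list(offsets[1:]) + [len(data)]))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda b: try_decompress(data[b[0]:b[1]]), bounds)
        )

    out, i = [], 0
    while i < len(bounds):
        piece, end = results[i], i
        # Mis-speculation: the boundary after this segment was wrong, so
        # extend the segment over the following ones until it decodes.
        while piece is None:
            end += 1
            if end == len(bounds):
                raise ValueError("input is not a sequence of bzip2 streams")
            piece = try_decompress(data[bounds[i][0]:bounds[end][1]])
        out.append(piece)
        i = end + 1
    return b"".join(out)


def speculative_decompress(data, workers=4):
    """Divide, speculate, and merge in one call."""
    return decompress_at(data, candidate_offsets(data), workers)
```

Calling `speculative_decompress(blob)` on a byte string built by concatenating several `bz2.compress(...)` outputs should return the concatenated original payloads. The recovery loop in `decompress_at` is the part that makes the approach speculative rather than merely parallel: a wrongly guessed boundary costs a sequential retry but does not affect correctness for well-formed input.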

Title: A speculative parallel decompression algorithm on Apache Spark
Authors: Zhoukai Wang, Yinliang Zhao, Yang Liu, Zhong Chen, Cuocuo Lv, Yuxiang Li
Publication date: 21.03.2017
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 9/2017
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-017-2000-3
