Skip to main content
Erschienen in: The Journal of Supercomputing 9/2017

21.03.2017

A speculative parallel decompression algorithm on Apache Spark

verfasst von: Zhoukai Wang, Yinliang Zhao, Yang Liu, Zhong Chen, Cuocuo Lv, Yuxiang Li

Erschienen in: The Journal of Supercomputing | Ausgabe 9/2017

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Data decompression is one of the most important techniques in data processing and has been widely used in multimedia information transmission and processing. However, the existing decompression algorithms on multicore platforms are time-consuming and do not support large data well. In order to expand parallelism and enhance decompression efficiency on large-scale datasets, based on the software thread-level speculation technique, this paper raises a speculative parallel decompression algorithm on Apache Spark. By analyzing the data structure of the compressed data, the algorithm firstly hires a function to divide compressed data into blocks which can be decompressed independently and then spawns a number of threads to speculatively decompress data blocks in parallel. At last, the speculative results are merged to form the final outcome. Comparing with the conventional parallel approach on multicore platform, the proposed algorithm is very efficiency and obtains a high parallelism degree by making the best of the resources of the cluster. Experiments show that the proposed approach could achieve 2.6\(\times \) speedup when comparing with the traditional approach in average. In addition, with the growing number of working nodes, the execution time cost decreases gradually, and the speedup scales linearly. The results indicate that the decompression efficiency can be significantly enhanced by adopting this speculative parallel algorithm.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Nelson M, Gailly J-L (1996) The data compression book, vol 2. M&T Books, New York Nelson M, Gailly J-L (1996) The data compression book, vol 2. M&T Books, New York
2.
Zurück zum Zitat Shvachko K et al. (2010) The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE Shvachko K et al. (2010) The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE
3.
Zurück zum Zitat Shoro AG, Soomro TR et al (2015) Big data analysis: apache spark perspective. Glob J Comput Sci Technol 15(1):7–14 Shoro AG, Soomro TR et al (2015) Big data analysis: apache spark perspective. Glob J Comput Sci Technol 15(1):7–14
4.
Zurück zum Zitat Slagter K et al (2013) An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J Supercomput 66(1):539–555CrossRef Slagter K et al (2013) An improved partitioning mechanism for optimizing massive data analysis using MapReduce. J Supercomput 66(1):539–555CrossRef
5.
Zurück zum Zitat Cui X et al (2014) Optimized big data K-means clustering using MapReduce. J Supercomput 70(3):1249–1259CrossRef Cui X et al (2014) Optimized big data K-means clustering using MapReduce. J Supercomput 70(3):1249–1259CrossRef
6.
Zurück zum Zitat Marcuello P, Tubella J, González A (1999) Value prediction for speculative multithreaded architectures. In: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society Marcuello P, Tubella J, González A (1999) Value prediction for speculative multithreaded architectures. In: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society
7.
Zurück zum Zitat Porwal S et al (2013) Data compression methodologies for lossless data and comparison between algorithms. IJESIT 2(2):142–7 Porwal S et al (2013) Data compression methodologies for lossless data and comparison between algorithms. IJESIT 2(2):142–7
8.
Zurück zum Zitat Liu B et al (2014) A thread partitioning approach for speculative multithreading. J Supercomput 67(3):778–805CrossRef Liu B et al (2014) A thread partitioning approach for speculative multithreading. J Supercomput 67(3):778–805CrossRef
9.
Zurück zum Zitat Jang H, Kim C, Lee JW (2013) Practical speculative parallelization of variable-length decompression algorithms. In: ACM SIGPLAN Notices. ACM Jang H, Kim C, Lee JW (2013) Practical speculative parallelization of variable-length decompression algorithms. In: ACM SIGPLAN Notices. ACM
10.
Zurück zum Zitat Zaharia M et al (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing Zaharia M et al (2010) Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing
13.
Zurück zum Zitat Gilchrist J (2004) Parallel data compression with bzip2. In: Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems Gilchrist J (2004) Parallel data compression with bzip2. In: Proceedings of the 16th IASTED International Conference on Parallel and Distributed Computing and Systems
15.
Zurück zum Zitat Klein ST, Wiseman Y (2003) Parallel Huffman decoding with applications to JPEG files. Comput J 46(5):487–497CrossRefMATH Klein ST, Wiseman Y (2003) Parallel Huffman decoding with applications to JPEG files. Comput J 46(5):487–497CrossRefMATH
16.
Zurück zum Zitat Liu W et al (2006) POSH: a TLS compiler that exploits program structure. In: Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 158–167 Liu W et al (2006) POSH: a TLS compiler that exploits program structure. In: Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp 158–167
17.
Zurück zum Zitat Raman E et al (2008) Spice: speculative parallel iteration chunk execution. In: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization Raman E et al (2008) Spice: speculative parallel iteration chunk execution. In: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization
18.
Zurück zum Zitat Zilles C, Sohi G (2002) Master/slave speculative parallelization. In: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture Zilles C, Sohi G (2002) Master/slave speculative parallelization. In: 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
19.
Zurück zum Zitat Witte EE, Chamberlain RD, Franklin MA (1991) Parallel simulated annealing using speculative computation. IEEE Trans Parallel Distrib Syst 2(4):483–494CrossRef Witte EE, Chamberlain RD, Franklin MA (1991) Parallel simulated annealing using speculative computation. IEEE Trans Parallel Distrib Syst 2(4):483–494CrossRef
20.
Zurück zum Zitat Zhao WJ, Yang HB, Wu Y (2010) Parallel genetic algorithm based on thread-level speculation. In: 2010 International Conference on Audio Language and Image Processing (ICALIP) Zhao WJ, Yang HB, Wu Y (2010) Parallel genetic algorithm based on thread-level speculation. In: 2010 International Conference on Audio Language and Image Processing (ICALIP)
21.
Zurück zum Zitat Zaharia M et al (2012) Fast and interactive analytics over hadoop data with spark. USENIX Login 37(4):45–51 Zaharia M et al (2012) Fast and interactive analytics over hadoop data with spark. USENIX Login 37(4):45–51
22.
Zurück zum Zitat Lin C-Y et al (2014) Large-scale logistic regression and linear support vector machines using spark. In: 2014 IEEE International Conference on Big Data (Big Data). IEEE Lin C-Y et al (2014) Large-scale logistic regression and linear support vector machines using spark. In: 2014 IEEE International Conference on Big Data (Big Data). IEEE
23.
Zurück zum Zitat Qiu H et al (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In: 2014 IEEE International on Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE Qiu H et al (2014) Yafim: a parallel frequent itemset mining algorithm with spark. In: 2014 IEEE International on Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE
24.
Zurück zum Zitat Islam NS et al (2015) Performance characterization and acceleration of in-memory file systems for hadoop and spark applications on HPC clusters. In: 2015 IEEE International Conference on Big Data (Big Data). IEEE Islam NS et al (2015) Performance characterization and acceleration of in-memory file systems for hadoop and spark applications on HPC clusters. In: 2015 IEEE International Conference on Big Data (Big Data). IEEE
25.
Zurück zum Zitat Liang C, Ru L, Zhu X (2007) R-SpamRank: a spam detection algorithm based on link analysis. J Comput Inf Syst 3(4):1705–1712 Liang C, Ru L, Zhu X (2007) R-SpamRank: a spam detection algorithm based on link analysis. J Comput Inf Syst 3(4):1705–1712
26.
Zurück zum Zitat Xu D et al (2011) Predicting epidemic tendency through search behavior analysis. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Xu D et al (2011) Predicting epidemic tendency through search behavior analysis. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence
27.
Zurück zum Zitat Liu Y et al (2011) How do users describe their information need: query recommendation based on snippet click model. Expert Syst Appl 38(11):13847–13856 Liu Y et al (2011) How do users describe their information need: query recommendation based on snippet click model. Expert Syst Appl 38(11):13847–13856
28.
Zurück zum Zitat Amdahl GM (2013) Computer architecture and amdahl’s law. Computer 46(12):38–46CrossRef Amdahl GM (2013) Computer architecture and amdahl’s law. Computer 46(12):38–46CrossRef
29.
Zurück zum Zitat Zeigler BP, Nutaro JJ, Seo C (2015) What’s the best possible speedup achievable in distributed simulation: Amdahl’s law reconstructed. In: Proceedings of the Symposium on Theory of Modeling and Simulation: DEVS Integrative M&S Symposium. Society for Computer Simulation International Zeigler BP, Nutaro JJ, Seo C (2015) What’s the best possible speedup achievable in distributed simulation: Amdahl’s law reconstructed. In: Proceedings of the Symposium on Theory of Modeling and Simulation: DEVS Integrative M&S Symposium. Society for Computer Simulation International
Metadaten
Titel
A speculative parallel decompression algorithm on Apache Spark
verfasst von
Zhoukai Wang
Yinliang Zhao
Yang Liu
Zhong Chen
Cuocuo Lv
Yuxiang Li
Publikationsdatum
21.03.2017
Verlag
Springer US
Erschienen in
The Journal of Supercomputing / Ausgabe 9/2017
Print ISSN: 0920-8542
Elektronische ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-017-2000-3

Weitere Artikel der Ausgabe 9/2017

The Journal of Supercomputing 9/2017 Zur Ausgabe

Premium Partner