Published in: International Journal of Parallel Programming 6/2015

01-12-2015

An Autotuning Engine for the 3D Fast Wavelet Transform on Clusters with Hybrid CPU + GPU Platforms

Authors: Gregorio Bernabé, Javier Cuenca, Domingo Giménez


Abstract

This work presents an optimization method for running the 3D fast wavelet transform (3D-FWT) on a CPU + GPU system. The optimization engine detects the computing components in the system and executes the appropriate kernel on each: a CUDA or OpenCL implementation for GPUs and a pthreads implementation for the CPU. The engine automatically selects parameters such as the block size, the work-group size, or the number of threads to reduce the execution time, and it sends proportionally sized parts of a video sequence to run concurrently on all the computing components of the system. The development and optimization of the 3D-FWT for a hybrid cluster of CPU + GPUs is also analyzed. Different parallel programming paradigms (message passing, shared memory, and GPU SIMD) are combined to fully exploit the computing capacity of the cluster's computational elements. The result is an efficient combination of basic codes previously developed for individual components (CPUs or GPUs) and a significant reduction in the compression time of long video sequences.
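The two ideas in the abstract, empirical parameter selection and proportional work distribution, can be illustrated with a minimal sketch. This is not the authors' engine: the function names, the timing values, and the per-device speeds below are all hypothetical, and the largest-remainder rounding is simply one reasonable way to make the shares add up to the total number of frames.

```python
from typing import Callable, Dict, Iterable


def pick_fastest(candidates: Iterable[int], time_of: Callable[[int], float]) -> int:
    """Return the candidate parameter value with the lowest measured time."""
    return min(candidates, key=time_of)


def split_frames(total_frames: int, speeds: Dict[str, float]) -> Dict[str, int]:
    """Split total_frames among devices in proportion to their speeds.

    Largest-remainder rounding keeps the shares summing to total_frames.
    """
    if total_frames < 0 or not speeds or any(s <= 0 for s in speeds.values()):
        raise ValueError("need non-negative frames and positive speeds")
    total_speed = sum(speeds.values())
    # Ideal (fractional) share of each device.
    exact = {d: total_frames * s / total_speed for d, s in speeds.items()}
    # Round down first, then hand out the frames that are left over.
    shares = {d: int(x) for d, x in exact.items()}
    leftover = total_frames - sum(shares.values())
    by_fraction = sorted(exact, key=lambda d: exact[d] - shares[d], reverse=True)
    for d in by_fraction[:leftover]:
        shares[d] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical test-run timings (seconds) for three block sizes.
    timings = {64: 0.9, 128: 0.6, 256: 0.7}
    print(pick_fastest(timings, timings.__getitem__))
    # Hypothetical device speeds in frames per second.
    print(split_frames(128, {"cpu": 10.0, "gpu0": 30.0, "gpu1": 24.0}))
```

In this sketch a faster device receives a proportionally larger part of the sequence, so all devices would finish at roughly the same time if the measured speeds hold for the full run.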


Metadata
Title
An Autotuning Engine for the 3D Fast Wavelet Transform on Clusters with Hybrid CPU + GPU Platforms
Authors
Gregorio Bernabé
Javier Cuenca
Domingo Giménez
Publication date
01-12-2015
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 6/2015
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-014-0328-3
