Skip to main content
Top
Published in: International Journal of Parallel Programming 6/2015

01-12-2015

BPLG: A Tuned Butterfly Processing Library for GPU Architectures

Authors: J. Lobeiras, M. Amor, R. Doallo

Published in: International Journal of Parallel Programming | Issue 6/2015

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

In order to increase the efficiency of existing software many works are incorporating GPU processing. However, despite the current advances in GPU languages and tools, taking advantage of their parallel architecture is still far more complex than programming standard multi-core CPUs. In this work, we present a library based on a set of building blocks that enable to easily design well-known algorithms with little effort. More specifically, we implement butterfly algorithms with this library, that is, a set of orthogonal signal transforms and an algorithm to solve tridiagonal equations systems. Thanks to the parametrization of the building blocks, the library can be easily tuned depending on the desired GPU architecture. This generic approach can be used to easily design these GPU algorithms while obtaining competitive performance on two recent NVIDIA GPU architectures, which results specially interesting from the productivity point of view.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Khronos OpenCL Working Group: The OpenCL specification (2011) Khronos OpenCL Working Group: The OpenCL specification (2011)
2.
go back to reference NVIDIA: CUDA C Best Practices Guide (SDK Document.). V5.0. (2012) NVIDIA: CUDA C Best Practices Guide (SDK Document.). V5.0. (2012)
3.
go back to reference Vandevoorde, D., Josuttis, N.M.: C++ Templates: The Complete Guide. Addison-Wesley, Boston (2002) Vandevoorde, D., Josuttis, N.M.: C++ Templates: The Complete Guide. Addison-Wesley, Boston (2002)
4.
go back to reference Bell, N.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems, Jade Edition. Morgan Kaufmann (2011) Bell, N.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems, Jade Edition. Morgan Kaufmann (2011)
5.
go back to reference Sander B.,: Bolt: A C++ Templater Library for HSA. Presented in AMD Fusion Developer Summit ’12 (2012) Sander B.,: Bolt: A C++ Templater Library for HSA. Presented in AMD Fusion Developer Summit ’12 (2012)
6.
go back to reference Chu, E., George, A.: Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. Computational Mathematics Series. CRC Press, Boca Raton (2000) Chu, E., George, A.: Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. Computational Mathematics Series. CRC Press, Boca Raton (2000)
7.
go back to reference Hartley, R.V.L.: A more symmetrical Fourier analysis applied to transmission problems. In: Proc. of the Institute of Radio Engineers (IRE), vol. 30(3), pp. 144–150 (1942) Hartley, R.V.L.: A more symmetrical Fourier analysis applied to transmission problems. In: Proc. of the Institute of Radio Engineers (IRE), vol. 30(3), pp. 144–150 (1942)
8.
9.
go back to reference Lobeiras, J., Amor, M., Doallo, R.: SPLG: a tuned signal processing library for GPU architectures. In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013), vol. 1, pp. 184–191 (2013) Lobeiras, J., Amor, M., Doallo, R.: SPLG: a tuned signal processing library for GPU architectures. In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013), vol. 1, pp. 184–191 (2013)
10.
go back to reference Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005)CrossRef Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005)CrossRef
11.
go back to reference Intel: Intel Integrated Performance Primitives for Intel Architecture, Reference Manual. Volume 1: Signal Processing (2012) Intel: Intel Integrated Performance Primitives for Intel Architecture, Reference Manual. Volume 1: Signal Processing (2012)
12.
go back to reference Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N.: SPIRAL: code generation for DSP transforms. In: Proc. of the IEEE, on “Program Generation, Optimization, and Platform Adaptation”, vol. 93(2), pp. 232–275 (2005) Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N.: SPIRAL: code generation for DSP transforms. In: Proc. of the IEEE, on “Program Generation, Optimization, and Platform Adaptation”, vol. 93(2), pp. 232–275 (2005)
13.
go back to reference Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High performance discrete Fourier transforms on graphics processors. In: Proc. of the 2008 ACM/IEEE conference on Supercomputing (SC ’08), pp. 2:1–2:12. IEEE Press (2008) Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High performance discrete Fourier transforms on graphics processors. In: Proc. of the 2008 ACM/IEEE conference on Supercomputing (SC ’08), pp. 2:1–2:12. IEEE Press (2008)
14.
go back to reference Volkov, V., Kazian, B.: Fitting FFT onto the G80 Architecture. University of California, Berkeley, Tech. rep. (2009) Volkov, V., Kazian, B.: Fitting FFT onto the G80 Architecture. University of California, Berkeley, Tech. rep. (2009)
15.
go back to reference Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT Library for CUDA GPUs. In: Proc. of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09), pp. 1–10 (2009) Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT Library for CUDA GPUs. In: Proc. of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09), pp. 1–10 (2009)
16.
go back to reference Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: ICS ’10: Proc. of the 24th ACM Intl. Conference on Supercomputing, pp. 315–324 (2010) Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: ICS ’10: Proc. of the 24th ACM Intl. Conference on Supercomputing, pp. 315–324 (2010)
17.
go back to reference Dotsenko, Y., Baghsorkhi, S.S., Lloyd, B., Govindaraju, N.K.: Auto-tuning of fast Fourier transform on graphics processors. In: Principles and Practice of Parallel Programming (PPoPP ’11), pp. 257–266 (2011) Dotsenko, Y., Baghsorkhi, S.S., Lloyd, B., Govindaraju, N.K.: Auto-tuning of fast Fourier transform on graphics processors. In: Principles and Practice of Parallel Programming (PPoPP ’11), pp. 257–266 (2011)
18.
go back to reference Li, Y., Zhang, Y., Liu, Y., Long, G., Jia, H.: MPFFT: an auto-tuning FFT library for OpenCL GPUs. J. Comput. Sci. Technol. 28(1), 90–105 (2013)CrossRef Li, Y., Zhang, Y., Liu, Y., Long, G., Jia, H.: MPFFT: an auto-tuning FFT library for OpenCL GPUs. J. Comput. Sci. Technol. 28(1), 90–105 (2013)CrossRef
19.
20.
go back to reference AMD: AMD Math Libraries, OpenCL Fast Fourier Transform (clAmdFft) (2012) AMD: AMD Math Libraries, OpenCL Fast Fourier Transform (clAmdFft) (2012)
21.
go back to reference Wang B., Álvarez-Mesa M., Ching C., Juurlink B.: An optimized parallel IDCT on graphics processing units. In: 18th International Conference on Parallel Processing Workshops (EuroPar ’12), pp. 155–164. Springer, Berlin (2013) Wang B., Álvarez-Mesa M., Ching C., Juurlink B.: An optimized parallel IDCT on graphics processing units. In: 18th International Conference on Parallel Processing Workshops (EuroPar ’12), pp. 155–164. Springer, Berlin (2013)
22.
go back to reference Guptda, M., Garg, A.K.: Analysis of image compression algorithm using DCT. Int. J. Eng. Res. Appl. (IJERA) 2(1), 512–521 (2012) Guptda, M., Garg, A.K.: Analysis of image compression algorithm using DCT. Int. J. Eng. Res. Appl. (IJERA) 2(1), 512–521 (2012)
23.
go back to reference Kim, C.G., Choi, Y.S.: A high performance parallel DCT with OpenCL on heterogeneous computing environment. Multimed. Tools Appl. 64(2), 475–489 (2013)CrossRef Kim, C.G., Choi, Y.S.: A high performance parallel DCT with OpenCL on heterogeneous computing environment. Multimed. Tools Appl. 64(2), 475–489 (2013)CrossRef
24.
go back to reference Panella, M., Basset, L.: An efficient GPU implementation of modified discrete cosine transform using CUDA. Int. J. Comput. Sci. Inf. Secur. 10(5), 23–30 (2012) Panella, M., Basset, L.: An efficient GPU implementation of modified discrete cosine transform using CUDA. Int. J. Comput. Sci. Inf. Secur. 10(5), 23–30 (2012)
25.
go back to reference Thomas, L.H.: Elliptic Problems in Linear Differential Equations over a Network. Columbia University, Tech. rep. (1949) Thomas, L.H.: Elliptic Problems in Linear Differential Equations over a Network. Columbia University, Tech. rep. (1949)
26.
go back to reference Polizzi, E., Sameh, A.H.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput. 32(2), 177–194 (2006)MathSciNetCrossRef Polizzi, E., Sameh, A.H.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput. 32(2), 177–194 (2006)MathSciNetCrossRef
27.
go back to reference Intel: Intel Math Kernel Library, Reference Manual. V10.2. (2009) Intel: Intel Math Kernel Library, Reference Manual. V10.2. (2009)
28.
go back to reference Göddeke, D., Strzodka, R.: Cyclic reduction tridiagonal solvers on GPUs applied to mixed precision multigrid. IEEE Trans. Parallel Distrib. Syst. (TPDS) 22(1), 22–32 (2011). (Special Issue on HPC with Accelerators)CrossRef Göddeke, D., Strzodka, R.: Cyclic reduction tridiagonal solvers on GPUs applied to mixed precision multigrid. IEEE Trans. Parallel Distrib. Syst. (TPDS) 22(1), 22–32 (2011). (Special Issue on HPC with Accelerators)CrossRef
29.
go back to reference Chang, L.-W., Stratton, J.A., Kim, H.-S., Hwu, W.W.: A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12), pp. 27:1–27:11. IEEE Computer Society Press (2012) Chang, L.-W., Stratton, J.A., Kim, H.-S., Hwu, W.W.: A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12), pp. 27:1–27:11. IEEE Computer Society Press (2012)
30.
go back to reference Kim, H.-S., Wu, S., Chang, L.-W., Hwu, W.W.: A scalable tridiagonal solver for GPU. In: Intl. Conf. on Parallel Processing, pp. 444–453. IEEE Comp. Society (2011) Kim, H.-S., Wu, S., Chang, L.-W., Hwu, W.W.: A scalable tridiagonal solver for GPU. In: Intl. Conf. on Parallel Processing, pp. 444–453. IEEE Comp. Society (2011)
31.
go back to reference Zhang, Y., Cohen, J., Owens, J.D.: Fast tridiagonal solvers on the GPU. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010), pp. 127–136 (2010) Zhang, Y., Cohen, J., Owens, J.D.: Fast tridiagonal solvers on the GPU. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010), pp. 127–136 (2010)
32.
go back to reference CUDA Data Parallel Primitives Library. V2.1. (2013) CUDA Data Parallel Primitives Library. V2.1. (2013)
33.
34.
go back to reference Wang, X., Mou, Z.G.: A divide-and-conquer method of solving tridiagonal systems on hypercube massively parallel computers. In: IEEE Symposium on Parallel and Distributed Processing, pp. 810–817 (1991) Wang, X., Mou, Z.G.: A divide-and-conquer method of solving tridiagonal systems on hypercube massively parallel computers. In: IEEE Symposium on Parallel and Distributed Processing, pp. 810–817 (1991)
35.
go back to reference Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex fourier series. Math. Comput. 19(90), 297–301 (1965)MATHMathSciNetCrossRef Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex fourier series. Math. Comput. 19(90), 297–301 (1965)MATHMathSciNetCrossRef
36.
go back to reference Stockham, T.G.: High-speed convolution and correlation. In: Proceedings of the Spring Joint Computer Conference, pp. 229–233 (1966) Stockham, T.G.: High-speed convolution and correlation. In: Proceedings of the Spring Joint Computer Conference, pp. 229–233 (1966)
37.
go back to reference Keith, J.: The Regularized Fast Hartley Transform. Signals and Communication Technology. Springer, Berlin (2010) Keith, J.: The Regularized Fast Hartley Transform. Signals and Communication Technology. Springer, Berlin (2010)
Metadata
Title
BPLG: A Tuned Butterfly Processing Library for GPU Architectures
Authors
J. Lobeiras
M. Amor
R. Doallo
Publication date
01-12-2015
Publisher
Springer US
Published in
International Journal of Parallel Programming / Issue 6/2015
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI
https://doi.org/10.1007/s10766-014-0323-8

Other articles of this Issue 6/2015

International Journal of Parallel Programming 6/2015 Go to the issue

Premium Partner