Top

International Journal of Parallel Programming

Published in:

01-12-2015

BPLG: A Tuned Butterfly Processing Library for GPU Architectures

Authors: J. Lobeiras, M. Amor, R. Doallo

Published in: International Journal of Parallel Programming | Issue 6/2015

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

In order to increase the efficiency of existing software many works are incorporating GPU processing. However, despite the current advances in GPU languages and tools, taking advantage of their parallel architecture is still far more complex than programming standard multi-core CPUs. In this work, we present a library based on a set of building blocks that enable to easily design well-known algorithms with little effort. More specifically, we implement butterfly algorithms with this library, that is, a set of orthogonal signal transforms and an algorithm to solve tridiagonal equations systems. Thanks to the parametrization of the building blocks, the library can be easily tuned depending on the desired GPU architecture. This generic approach can be used to easily design these GPU algorithms while obtaining competitive performance on two recent NVIDIA GPU architectures, which results specially interesting from the productivity point of view.

previous article Cluster Cache Monitor: Leveraging the Proximity Data in CMP

next article List Scheduling in Embedded Systems Under Memory Constraints

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Khronos OpenCL Working Group: The OpenCL specification (2011)

NVIDIA: CUDA C Best Practices Guide (SDK Document.). V5.0. (2012)

Vandevoorde, D., Josuttis, N.M.: C++ Templates: The Complete Guide. Addison-Wesley, Boston (2002)

Bell, N.: Thrust: a productivity-oriented library for CUDA. In: GPU Computing Gems, Jade Edition. Morgan Kaufmann (2011)

Sander B.,: Bolt: A C++ Templater Library for HSA. Presented in AMD Fusion Developer Summit ’12 (2012)

Chu, E., George, A.: Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms. Computational Mathematics Series. CRC Press, Boca Raton (2000)

Hartley, R.V.L.: A more symmetrical Fourier analysis applied to transmission problems. In: Proc. of the Institute of Radio Engineers (IRE), vol. 30(3), pp. 144–150 (1942)

Ahmed, N., Natarajan, T., Rao, K.R.: Discrete cosine transform. IEEE Trans. Comput. C–23(1), 90–93 (1974)MathSciNetCrossRef

Lobeiras, J., Amor, M., Doallo, R.: SPLG: a tuned signal processing library for GPU architectures. In: International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2013), vol. 1, pp. 184–191 (2013)

10.

Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216–231 (2005)CrossRef

11.

Intel: Intel Integrated Performance Primitives for Intel Architecture, Reference Manual. Volume 1: Signal Processing (2012)

12.

Püschel, M., Moura, J.M.F., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R.W., Rizzolo, N.: SPIRAL: code generation for DSP transforms. In: Proc. of the IEEE, on “Program Generation, Optimization, and Platform Adaptation”, vol. 93(2), pp. 232–275 (2005)

13.

Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J.: High performance discrete Fourier transforms on graphics processors. In: Proc. of the 2008 ACM/IEEE conference on Supercomputing (SC ’08), pp. 2:1–2:12. IEEE Press (2008)

14.

Volkov, V., Kazian, B.: Fitting FFT onto the G80 Architecture. University of California, Berkeley, Tech. rep. (2009)

15.

Nukada, A., Matsuoka, S.: Auto-tuning 3-D FFT Library for CUDA GPUs. In: Proc. of the Conference on High Performance Computing Networking, Storage and Analysis (SC ’09), pp. 1–10 (2009)

16.

Chen, Y., Cui, X., Mei, H.: Large-scale FFT on GPU clusters. In: ICS ’10: Proc. of the 24th ACM Intl. Conference on Supercomputing, pp. 315–324 (2010)

17.

Dotsenko, Y., Baghsorkhi, S.S., Lloyd, B., Govindaraju, N.K.: Auto-tuning of fast Fourier transform on graphics processors. In: Principles and Practice of Parallel Programming (PPoPP ’11), pp. 257–266 (2011)

18.

Li, Y., Zhang, Y., Liu, Y., Long, G., Jia, H.: MPFFT: an auto-tuning FFT library for OpenCL GPUs. J. Comput. Sci. Technol. 28(1), 90–105 (2013)CrossRef

19.

NVIDIA: CUDA CUFFT Library. V5.0. (2012)

20.

AMD: AMD Math Libraries, OpenCL Fast Fourier Transform (clAmdFft) (2012)

21.

Wang B., Álvarez-Mesa M., Ching C., Juurlink B.: An optimized parallel IDCT on graphics processing units. In: 18th International Conference on Parallel Processing Workshops (EuroPar ’12), pp. 155–164. Springer, Berlin (2013)

22.

Guptda, M., Garg, A.K.: Analysis of image compression algorithm using DCT. Int. J. Eng. Res. Appl. (IJERA) 2(1), 512–521 (2012)

23.

Kim, C.G., Choi, Y.S.: A high performance parallel DCT with OpenCL on heterogeneous computing environment. Multimed. Tools Appl. 64(2), 475–489 (2013)CrossRef

24.

Panella, M., Basset, L.: An efficient GPU implementation of modified discrete cosine transform using CUDA. Int. J. Comput. Sci. Inf. Secur. 10(5), 23–30 (2012)

25.

Thomas, L.H.: Elliptic Problems in Linear Differential Equations over a Network. Columbia University, Tech. rep. (1949)

26.

Polizzi, E., Sameh, A.H.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Comput. 32(2), 177–194 (2006)MathSciNetCrossRef

27.

Intel: Intel Math Kernel Library, Reference Manual. V10.2. (2009)

28.

Göddeke, D., Strzodka, R.: Cyclic reduction tridiagonal solvers on GPUs applied to mixed precision multigrid. IEEE Trans. Parallel Distrib. Syst. (TPDS) 22(1), 22–32 (2011). (Special Issue on HPC with Accelerators)CrossRef

29.

Chang, L.-W., Stratton, J.A., Kim, H.-S., Hwu, W.W.: A scalable, numerically stable, high-performance tridiagonal solver using GPUs. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12), pp. 27:1–27:11. IEEE Computer Society Press (2012)

30.

Kim, H.-S., Wu, S., Chang, L.-W., Hwu, W.W.: A scalable tridiagonal solver for GPU. In: Intl. Conf. on Parallel Processing, pp. 444–453. IEEE Comp. Society (2011)

31.

Zhang, Y., Cohen, J., Owens, J.D.: Fast tridiagonal solvers on the GPU. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2010), pp. 127–136 (2010)

32.

CUDA Data Parallel Primitives Library. V2.1. (2013)

33.

NVIDIA: CUSPARSE Library. V5.0. (2012)

34.

Wang, X., Mou, Z.G.: A divide-and-conquer method of solving tridiagonal systems on hypercube massively parallel computers. In: IEEE Symposium on Parallel and Distributed Processing, pp. 810–817 (1991)

35.

Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex fourier series. Math. Comput. 19(90), 297–301 (1965)MATHMathSciNetCrossRef

36.

Stockham, T.G.: High-speed convolution and correlation. In: Proceedings of the Spring Joint Computer Conference, pp. 229–233 (1966)

37.

Keith, J.: The Regularized Fast Hartley Transform. Signals and Communication Technology. Springer, Berlin (2010)

Title: BPLG: A Tuned Butterfly Processing Library for GPU Architectures
Authors: J. Lobeiras
M. Amor
R. Doallo
Publication date: 01-12-2015
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 6/2015
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-014-0323-8

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Springer Professional "Technik"

Other articles of this Issue 6/2015

The Scalability of Disjoint Data Structures on a New Hardware Transactional Memory System

Fully Optimized Code Block Segmentation Algorithm for LTE-Advanced

A Decomposition-Based Approach for Scalable Many-Field Packet Classification on Multi-core Processors

PageRank Computation Using a Multiple Implicitly Restarted Arnoldi Method for Modeling Epidemic Spread

A Hardware/Software Approach for Database Query Acceleration with FPGAs

Guest Editorial: SBAC-PAD 2013

Premium Partner