Published in: International Journal of Parallel Programming

01-06-2016

Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives

Authors: José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez, Juan Touriño

Abstract

The use of GPUs for general-purpose computation has increased dramatically in recent years, driven by rising demand for computing power and by the tremendous computing capacity that GPUs offer at low cost. As a result, new programming models have been developed to integrate these accelerators with high-level programming languages, giving rise to heterogeneous computing systems. Unfortunately, this heterogeneity is also exposed to the programmer, which complicates its exploitation. This paper presents a new technique to automatically rewrite sequential programs into parallel counterparts targeting GPU-based heterogeneous systems. The original source code is analyzed through domain-independent computational kernels, which hide the complexity of the implementation details by presenting a non-statement-based, high-level, hierarchical representation of the application. Next, a locality-aware technique based on standard compiler transformations is applied to the original code through OpenHMPP directives. Two representative case studies from scientific applications have been selected: the three-dimensional discrete convolution and the single-precision general matrix multiplication. The effectiveness of our technique is corroborated by a performance evaluation on NVIDIA GPUs.
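As a rough illustration of one of the two case studies named in the abstract, the sketch below shows a sequential single-precision matrix multiplication together with a loop-tiled variant, tiling being one standard locality-oriented compiler transformation. It is not the code produced by the paper's technique and it contains no OpenHMPP directives; the function names `sgemm_naive` and `sgemm_tiled`, the row-major layout, and the tile size `TILE` are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical tile edge; a real tool would tune this for the target. */
#define TILE 16

/* C = alpha * A * B + beta * C, row-major storage (an assumption):
 * A is m x k, B is k x n, C is m x n. */
void sgemm_naive(size_t m, size_t n, size_t k, float alpha,
                 const float *A, const float *B, float beta, float *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; l++)
                acc += A[i * k + l] * B[l * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Mathematically the same computation, with the iteration space split
 * into TILE x TILE x TILE blocks so that each block reuses a small
 * working set. Floating-point rounding may differ slightly from the
 * naive version because the summation order changes. */
void sgemm_tiled(size_t m, size_t n, size_t k, float alpha,
                 const float *A, const float *B, float beta, float *C)
{
    /* Scale C by beta once, then accumulate the tile contributions. */
    for (size_t i = 0; i < m * n; i++)
        C[i] *= beta;

    for (size_t ii = 0; ii < m; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t ll = 0; ll < k; ll += TILE)
                for (size_t i = ii; i < min_size(ii + TILE, m); i++)
                    for (size_t j = jj; j < min_size(jj + TILE, n); j++) {
                        float acc = 0.0f;
                        for (size_t l = ll; l < min_size(ll + TILE, k); l++)
                            acc += A[i * k + l] * B[l * n + j];
                        C[i * n + j] += alpha * acc;
                    }
}
```

On a GPU, the same blocking idea is commonly used to stage tiles in fast on-chip memory. How the paper's technique expresses such transformations through OpenHMPP directives is not shown here.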

Title: Locality-Aware Automatic Parallelization for GPGPU with OpenHMPP Directives
Authors: José M. Andión, Manuel Arenaz, François Bodin, Gabriel Rodríguez, Juan Touriño
Publication date: 01-06-2016
Publisher: Springer US
Published in: International Journal of Parallel Programming / Issue 3/2016
Print ISSN: 0885-7458
Electronic ISSN: 1573-7640
DOI: https://doi.org/10.1007/s10766-015-0362-9