Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading

Authors: Farui Wang, Weizhe Zhang, Haonan Guo, Meng Hao, Gangzhao Lu, Zheng Wang

Published in: The Journal of Supercomputing | Issue 5/2021 | 29-10-2020

Abstract

Heterogeneous multicores like GPGPUs are now commonplace in modern computing systems. Although heterogeneous multicores offer the potential for high performance, programmers are struggling to program such systems. This paper presents OAO, a compiler-based approach to automatically translate shared-memory OpenMP data-parallel programs to run on heterogeneous multicores through OpenMP offloading directives. Given the large user base of shared memory OpenMP programs, our approach allows programmers to continue using a single-source-based programming language that they are familiar with while benefiting from the heterogeneous performance. OAO introduces a novel runtime optimization scheme to automatically eliminate unnecessary host–device communication to minimize the communication overhead between the host and the accelerator device. We evaluate OAO by applying it to 23 benchmarks from the PolyBench and Rodinia suites on two distinct GPU platforms. Experimental results show that OAO achieves up to 32× speedup over the original OpenMP version, and can reduce the host–device communication overhead by up to 99% over the hand-translated version.
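The kind of translation the abstract describes can be illustrated with a small hand-written sketch in C. This is not OAO's generated code and it does not reproduce OAO's runtime scheme; it only uses standard OpenMP 4.5 directives (`parallel for`, `target teams distribute parallel for`, `target data`, `map`) to show a shared-memory loop, an offloaded counterpart, and one static way to avoid repeated host–device copies. The function names `saxpy_host`, `saxpy_offload`, and `scale_then_add` are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Shared-memory form: the loop is spread across host CPU threads. */
void saxpy_host(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Offloaded form: the same loop runs on the accelerator. The map
   clauses state what is copied in (x, y) and copied back (y). */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Two dependent kernels sharing data. The enclosing target data region
   keeps x, y, and tmp resident on the device, so the inner map clauses
   find the data already present and trigger no further transfers.
   tmp is device scratch space: its host copy is never updated when the
   loops are actually offloaded. */
void scale_then_add(int n, float a, const float *x, float *y, float *tmp)
{
    assert(n >= 0 && x != NULL && y != NULL && tmp != NULL);
    #pragma omp target data \
        map(to: x[0:n]) map(alloc: tmp[0:n]) map(tofrom: y[0:n])
    {
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(alloc: tmp[0:n])
        for (int i = 0; i < n; i++)
            tmp[i] = a * x[i];

        #pragma omp target teams distribute parallel for \
            map(alloc: tmp[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = tmp[i] + y[i];
    }
}
```

Compiled without OpenMP support, the directives are ignored and all three functions run sequentially on the host with the same results, which makes the sketch easy to check. With an offloading-capable compiler, the difference between the versions lies in where the loops execute and how often the arrays cross the host–device boundary.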

Title: Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading
Authors: Farui Wang, Weizhe Zhang, Haonan Guo, Meng Hao, Gangzhao Lu, Zheng Wang
Publication date: 29-10-2020
Publisher: Springer US
Published in: The Journal of Supercomputing / Issue 5/2021
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-020-03452-2
