Skip to main content

Über dieses Buch

This book constitutes the refereed post-conference proceedings of the 6th International Workshop on Accelerator Programming Using Directives, WACCPD 2019, held in Denver, CO, USA, in November 2019.

The 7 full papers presented have been carefully reviewed and selected from 13 submissions. The papers share knowledge and experiences to program emerging complex parallel computing systems. They are organized in the following three sections: porting scientific applications to heterogeneous architectures using directives; directive-based programming for math libraries; and performance portability for heterogeneous architectures.



Porting Scientific Applications to Heterogeneous Architectures Using Directives


GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC

Accelerating applications with portability and maintainability is one of the big challenges in science and engineering. Previously, we have developed a fast implicit low-order three-dimensional finite element solver, which has a complicated algorithm including artificial intelligence and transprecision computing. In addition, all possible tunings for the target architecture were implemented; accordingly, the solver has inferior portability and maintainability. In this paper, we apply OpenACC to the solver. The directive-based implementation of OpenACC enables GPU computation to be introduced with a smaller developmental cost even for complex codes. In performance measurements on AI Bridging Cloud Infrastructure (ABCI), we evaluated that a reasonable speedup was attained on GPUs, given that the elapsed time of the entire solver was reduced to 1/14 of that on CPUs based on the original CPU implementation. Our proposed template to use transprecision computing with our custom FP21 data type is available to the public; therefore, it can provide a successful example for other scientific computing applications.
Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Akira Naruse, Maddegedara Lalith, Muneo Hori

Acceleration in Acoustic Wave Propagation Modelling Using OpenACC/OpenMP and Its Hybrid for the Global Monitoring System

CTBTO is operating and maintaining the international monitoring system of Seismic, Infrasound, Hydroacoustic and Airborne radionuclide facilities to detect a nuclear explosion over the globe. The monitoring network of CTBTO, especially with regard to infrasound and hydroacoustic, is quite unique because the network covers over the globe, and the data is opened to scientific use. CTBTO has been developing and improving the methodologies to analyze observed signals intensively. In this context, hydroacoustic modelling software, especially which that solves the partial differential equation directly, is of interest. As seen in the analysis of the Argentinian submarine accident, the horizontal reflection can play an important role in identifying the location of an underwater event, and as such, accurate modelling software may help analysts find relevant waves efficiently. Thus, CTBTO has been testing a parabolic equation based model (3D-SSFPE) and building a finite difference time domain (FDTD) model. At the same time, using such accurate models require larger computer resources than simplified methods such as ray-tracing. Thus we accelerated them using OpenMP and OpenACC, or the hybrid of those. As a result, in the best case scenarios, (1) 3D-SSFPE was accelerated by approximately 19 times to the original Octave code, employing the GPU-enabled Octfile technology, and (2) FDTD was accelerated by approximately 160 times to the original Fortran code using the OpenMP/OpenACC hybrid technology, on our DGX—Station with V100 GPUs.
Noriyuki Kushida, Ying-Tsong Lin, Peter Nielsen, Ronan Le Bras

Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC

Using GPUs to accelerate the performance of HPC applications has recently gained great momentum. Energy Exascale Earth System Model (E3SM) is a state-of-the-science earth system model development and simulation project and has gained national recognition. It has a large code base with over a million lines of code. How to make effective use of GPUs, however, remains a challenge. In this paper, we use the modal aerosol module (MAM) of E3SM as a driving example to investigate how to effectively offload computational tasks to GPUs using the OpenACC directives. In particular, we are interested in the performance advantage of using GPUs and in understanding performance-limiting factors from both application characteristics and the GPU or OpenACC.
Hongzhang Shan, Zhengji Zhao, Marcus Wagner

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOPBCG GPU implementation achieves a 2.8\({\times }\)–4.3\({\times }\) speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9\({\times }\) and 48.2\({\times }\) speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.
Fazlay Rabbi, Christopher S. Daley, Hasan Metin Aktulga, Nicholas J. Wright

Directive-Based Programming for Math Libraries


Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries

The US Department of Energy (DOE) started operating two GPU-based pre-exascale supercomputers in 2018 and plans to deploy another pre-exascale in 2020, and three exascale supercomputers in 2021/2022. All of the systems are GPU-enabled systems, and they plan to provide optimized vendor-promoted programming models for their GPUs such as CUDA, HIP and SYCL. However, due to their limited functional portability, it is challenging for HPC application developers to maintain their applications in an efficient and effective way with good productivity across all US DOE pre-exascale/exascale systems. Directive-based programming models for accelerators can be one of the solutions for HPC applications on the DOE supercomputers. In this study, we employ OpenMP and OpenACC offloading models to port and re-implement the RI-MP2 Fortran kernel of the GAMESS application on a pre-exascale GPU system, Summit. We compare and evaluate the performance of the re-structured offloading kernels with the original OpenMP threading kernel. We also evaluate the performance of multiple math libraries on the NVIDIA V100 GPU in the RI-MP2 kernel. Using the optimized directive-based offloading implementations, the RI-MP2 kernel on a single V100 GPU becomes more than 7 times faster than on dual-socket Power9 processors, which is near the theoretical speed-up based on peak performance ratios. MPI+directive-based offloading implementations of the RI-MP2 kernel perform more than 40 times faster than a MPI+OpenMP threading implementation on the same number of Summit nodes. This study demonstrates how directive-based offloading implementations can perform near what we expect based on machine peak ratios.
JaeHyuk Kwack, Colleen Bertoni, Buu Pham, Jeff Larkin

Performance Portability for Heterogeneous Architectures


Performance Portable Implementation of a Kinetic Plasma Simulation Mini-App

Performance portability is considered to be an inevitable requirement in the exascale era. We explore a performance portable approach for fusion plasma turbulence simulation code employing kinetic model, namely the GYSELA code. For this purpose, we extract the key features of GYSELA such as the high dimensionality and the semi-Lagrangian scheme, and encapsulate them into a mini-application which solves the similar but a simplified Vlasov-Poisson system. We implement the mini-app with a mixed OpenACC/OpenMP and Kokkos, where we suppress unnecessary duplications of code lines. For a reference case with the problem size of \(128^4\), the Skylake (Kokkos), Nvidia Tesla P100 (OpenACC), and P100 (Kokkos) versions achieve an acceleration of 1.45, 12.95, and 17.83, respectively, with respect to the baseline OpenMP version on Intel Skylake. In addition to the performance portability, we discuss the code readability and productivity of each implementation. Based on our experience, Kokkos can offer a readable and productive code at the cost of initial porting efforts, which would be enormous for a large scale simulation code like GYSELA.
Yuuichi Asahi, Guillaume Latu, Virginie Grandgirard, Julien Bigot

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the “logical vector length” (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.
Damodar Sahasrabudhe, Eric T. Phipps, Sivasankaran Rajamanickam, Martin Berzins


Weitere Informationen

Premium Partner