
2018 | Book

Accelerator Programming Using Directives

4th International Workshop, WACCPD 2017, Held in Conjunction with the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017, Denver, CO, USA, November 13, 2017, Proceedings


About this Book

This book constitutes the refereed post-conference proceedings of the 4th International Workshop on Accelerator Programming Using Directives, WACCPD 2017, held in Denver, CO, USA, in November 2017.
The 9 full papers presented have been carefully reviewed and selected from 14 submissions. The papers share knowledge and experiences in programming emerging complex parallel computing systems. They are organized in the following three sections: applications; runtime environments; and program evaluation.

Table of Contents

Frontmatter

Applications

Frontmatter
An Example of Porting PETSc Applications to Heterogeneous Platforms with OpenACC
Abstract
In this paper, we document the workflow of our practice to port a PETSc application with OpenACC to a supercomputer, Titan, at Oak Ridge National Laboratory. Our experience shows that a few lines of code modification with OpenACC directives can yield a speedup of 1.34x in a PETSc-based Poisson solver (conjugate gradient method with algebraic multigrid preconditioner). This demonstrates the feasibility of enabling GPU capability in PETSc with OpenACC. We hope our work can serve as a reference to those who are interested in porting their legacy PETSc applications to modern heterogeneous platforms.
Pi-Yueh Chuang, Fernanda S. Foertter
Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model
Abstract
In this work we use the GPU porting task for the operative Japanese weather prediction model “ASUCA” as an opportunity to examine productivity issues with OpenACC when applied to structured grid problems. We then propose “Hybrid Fortran”, an approach that combines the advantages of directive-based methods (no rewrite of existing code necessary) with those of stencil DSLs (memory layout is abstracted). This gives the ability to define multiple parallelizations with different granularities in the same code. Without compromising on performance, this approach enables a major reduction in the code changes required to achieve a hybrid GPU/CPU parallelization, as demonstrated with our ASUCA implementation using Hybrid Fortran.
Michel Müller, Takayuki Aoki
Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation Using OpenACC
Abstract
In this paper, we develop a low-order three-dimensional finite-element solver for fast multiple-case crust deformation computation on GPU-based systems. Based on a high-performance solver designed for massively parallel CPU-based systems, we modify the algorithm to reduce random data access and then insert OpenACC directives. By developing an algorithm appropriate for each computer architecture, we achieve higher performance. The developed solver on ten Reedbush-H nodes (20 P100 GPUs) attained a speedup of 14.2 times over the original solver on 20 K computer nodes. On the newest Volta-generation V100 GPUs, the solver attained a further 2.52 times speedup with respect to P100 GPUs. As a demonstrative example, we computed 368 cases of crustal deformation analyses of northeast Japan with 400 million degrees of freedom. The entire procedure of algorithm modification and porting took only two weeks, showing that a large performance improvement was achieved at low development cost. With the developed solver, we can expect improved reliability of crust-deformation analyses through many-case analyses on a wide range of GPU-based systems.
Takuma Yamaguchi, Kohei Fujita, Tsuyoshi Ichimura, Muneo Hori, Maddegedara Lalith, Kengo Nakajima

Runtime Environments

Frontmatter
The Design and Implementation of OpenMP 4.5 and OpenACC Backends for the RAJA C++ Performance Portability Layer
Abstract
Portability abstraction layers such as RAJA enable users to quickly change how a loop nest is executed with minimal modifications to high-level source code. Directive-based programming models such as OpenMP and OpenACC provide easy-to-use annotations on for-loops and regions which change the execution pattern of user code. Directive-based language backends for RAJA have previously been limited to a few options due to multiplicative clauses creating a version explosion. In this work, we introduce an updated implementation of two directive-based backends which helps mitigate the aforementioned version explosion problem by leveraging the C++ type system and template meta-programming concepts. We implement partial OpenMP 4.5 and OpenACC backends for the RAJA portability layer which can apply loop transformations and specify how loops should be executed. We evaluate our approach by analyzing compilation and runtime overhead for both backends using PGI 17.7 and IBM clang (OpenMP 4.5) on a collection of computation kernels.
William Killian, Tom Scogland, Adam Kunen, John Cavazos
Enabling GPU Support for the COMPSs-Mobile Framework
Abstract
Using the GPUs embedded in mobile devices allows for increasing the performance of the applications running on them while reducing the energy consumption of their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing in mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework – an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud – to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns.
Francesc Lordan, Rosa M. Badia, Wen-Mei Hwu
Concurrent Parallel Processing on Graphics and Multicore Processors with OpenACC and OpenMP
Abstract
Hierarchical parallel computing is rapidly becoming ubiquitous in high performance computing (HPC) systems. Programming models used commonly in turbomachinery and other engineering simulation codes have traditionally relied upon distributed-memory parallelism with MPI and have ignored thread and data parallelism. This paper presents methods for programming multi-block codes for concurrent computation on host multicore CPUs and many-core accelerators such as graphics processing units. Portable, standardized language directives are used to expose data and thread parallelism within the hybrid shared- and distributed-memory simulation system. A single-source/multiple-object strategy is used to simplify code management and allow for heterogeneous computing. Automated load balancing is implemented to determine what portions of the domain are computed by the multicore CPUs and GPUs. Preliminary results indicate that a moderate overall speed-up is possible by taking advantage of all processors and accelerators on a given HPC node.
Christopher P. Stone, Roger L. Davis, Daryl Y. Lee

Program Evaluation

Frontmatter
Exploration of Supervised Machine Learning Techniques for Runtime Selection of CPU vs. GPU Execution in Java Programs
Abstract
While multi-core CPUs and many-core GPUs are both viable platforms for parallel computing, programming models for them can impose large burdens upon programmers due to their complex and low-level APIs. Since managed languages like Java are designed to be run on multiple platforms, parallel language constructs and APIs such as Java 8 Parallel Stream APIs can enable high-level parallel programming with the promise of performance portability for mainstream (“non-ninja”) programmers. To achieve this goal, it is important for the selection of the hardware device to be automated rather than be specified by the programmer, as is done in current programming models. Due to a variety of factors affecting performance, predicting a preferable device for faster performance of individual kernels remains a difficult problem. While a prior approach uses machine learning to address this challenge, there is no comparable study on good supervised machine learning algorithms and good program features to track. In this paper, we explore (1) program features to be extracted by a compiler and (2) various machine learning techniques that improve accuracy in prediction, thereby improving performance. The results show that an appropriate selection of program features and machine learning algorithm can further improve accuracy. In particular, support vector machines (SVMs), logistic regression, and J48 decision tree are found to be reliable techniques for building accurate prediction models from just two, three, or four program features, achieving accuracies of 99.66%, 98.63%, and 98.28% respectively from 5-fold-cross-validation.
Gloria Y. K. Kim, Akihiro Hayashi, Vivek Sarkar
Automatic Testing of OpenACC Applications
Abstract
PCAST (PGI Compiler-Assisted Software Testing) is a feature being developed in the PGI compilers and runtime to help users automate testing high performance numerical programs. PCAST normally works by running a known working version of a program and saving intermediate results to a reference file, then running a test version of a program and comparing the intermediate results against the reference file. Here, we describe the special case of using PCAST on OpenACC programs running on a GPU-accelerated system. Instead of saving to and comparing against a reference file, the compiler generates code to run each compute region on both the host CPU and the GPU. The values computed on the host and GPU are then compared, using OpenACC data directives and clauses to decide what data to compare.
Khalid Ahmad, Michael Wolfe
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Abstract
Accelerator devices are increasingly used to build large supercomputers and current installations usually include more than one accelerator per system node. To keep all devices busy, kernels have to be executed concurrently, which can be achieved via asynchronous kernel launches. This work compares the performance of an implementation of the Conjugate Gradient method with CUDA, OpenCL, and OpenACC on NVIDIA Pascal GPUs. Furthermore, it takes a look at Intel Xeon Phi coprocessors when programmed with OpenCL and OpenMP. In doing so, it tries to answer the question of whether the higher abstraction level of directive-based models is inferior to lower-level paradigms in terms of performance.
Jonas Hahnfeld, Christian Terboven, James Price, Hans Joachim Pflug, Matthias S. Müller
Backmatter
Metadata
Title
Accelerator Programming Using Directives
edited by
Prof. Sunita Chandrasekaran
Guido Juckeland
Copyright Year
2018
Electronic ISBN
978-3-319-74896-2
Print ISBN
978-3-319-74895-5
DOI
https://doi.org/10.1007/978-3-319-74896-2
