Skip to main content

2018 | Book

Evolving OpenMP for Evolving Architectures

14th International Workshop on OpenMP, IWOMP 2018, Barcelona, Spain, September 26–28, 2018, Proceedings

Editors: Bronis R. de Supinski, Pedro Valero-Lara, Xavier Martorell, Sergi Mateo Bellido, Jesus Labarta

Publisher: Springer International Publishing

Book Series : Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 14th International Workshop on Open MP, IWOMP 2018, held in Barcelona, Spain, in September 2018.

The 16 full papers presented in this volume were carefully reviewed and selected for inclusion in this book. The papers are organized in topical sections named: best paper; loops and OpenMP; OpenMP in heterogeneous systems; OpenMP improvements and innovations; OpenMP user experiences: applications and tools; and tasking evaluations.

Table of Contents


Best Paper

The Impact of Taskyield on the Design of Tasks Communicating Through MPI
The OpenMP tasking directives promise to help expose a higher degree of concurrency to the runtime than traditional worksharing constructs, which is especially useful for irregular applications. In combination with process-based parallelization such as MPI, the taskyield construct in OpenMP can become a crucial aspect as it helps to hide communication latencies by allowing a thread to execute other tasks while completion of the communication operation is pending. Unfortunately, the OpenMP standard only provides little guarantees on the characteristics of the taskyield operation. In this paper, we explore different potential implementations of taskyield and present a portable black-box tool for detecting the actual implementation used in existing OpenMP compilers/runtimes. Furthermore, we discuss the impact of the different taskyield implementations on the task design of the communication-heavy Blocked Cholesky Factorization and the difference in performance that can be observed, which we found to be over 20 %.
Joseph Schuchart, Keisuke Tsugane, José Gracia, Mitsuhisa Sato

Loops and OpenMP

OpenMP Loop Scheduling Revisited: Making a Case for More Schedules
In light of continued advances in loop scheduling, this work revisits the OpenMP loop scheduling by outlining the current state of the art in loop scheduling and presenting evidence that the existing OpenMP schedules are insufficient for all combinations of applications, systems, and their characteristics. A review of the state of the art shows that due to the specifics of the parallel applications, the variety of computing platforms, and the numerous performance degradation factors, no single loop scheduling technique can be a ‘one-fits-all’ solution to effectively optimize the performance of all parallel applications in all situations. The impact of irregularity in computational workloads and hardware systems, including operating system noise, on the performance of parallel applications results in performance loss and has often been neglected in loop scheduling research, in particular the context of OpenMP schedules. Existing dynamic loop self-scheduling techniques, such as trapezoid self-scheduling, factoring and weighted factoring, offer an unexplored potential to alleviate this degradation in OpenMP due to the fact that they explicitly target the minimization of load imbalance and scheduling overhead. Through theoretical and experimental evaluation, this work shows that these loop self-scheduling methods provide a benefit in the context of OpenMP. In conclusion, OpenMP must include more schedules to offer a broader performance coverage of applications executing on an increasing variety of heterogeneous shared memory computing platforms.
Florina M. Ciorba, Christian Iwainsky, Patrick Buder
A Proposal for Loop-Transformation Pragmas
Pragmas for loop transformations, such as unrolling, are implemented in most mainstream compilers. They are used by application programmers because of their ease of use compared to directly modifying the source code of the relevant loops. We propose additional pragmas for common loop transformations that go far beyond the transformations today’s compilers provide and should make most source rewriting for the sake of loop optimization unnecessary. To encourage compilers to implement these pragmas, and to avoid a diversity of incompatible syntaxes, we would like to spark a discussion about an inclusion to the OpenMP standard.
Michael Kruse, Hal Finkel
Extending OpenMP to Facilitate Loop Optimization
OpenMP provides several mechanisms to specify parallel source-code transformations. Unfortunately, many compilers perform these transformations early in the translation process, often before performing traditional sequential optimizations, which can limit the effectiveness of those optimizations. Further, OpenMP semantics preclude performing those transformations in some cases prior to the parallel transformations, which can limit overall application performance.
In this paper, we propose extensions to OpenMP that require the application of traditional sequential loop optimizations. These extensions can be specified to apply before, as well as after, other OpenMP loop transformations. We discuss limitations implied by existing OpenMP constructs as well as some previously proposed (parallel) extensions to OpenMP that could benefit from constructs that explicitly apply sequential loop optimizations. We present results that explore how these capabilities can lead to as much as a 20% improvement in parallel loop performance by applying common sequential loop optimizations.
Ian Bertolacci, Michelle Mills Strout, Bronis R. de Supinski, Thomas R. W. Scogland, Eddie C. Davis, Catherine Olschanowsky

OpenMP in Heterogeneous Systems

Manage OpenMP GPU Data Environment Under Unified Address Space
OpenMP has supported the offload of computations to accelerators such as GPUs since version 4.0. A crucial aspect in OpenMP offloading is to manage the accelerator data environment. Currently, this has to be explicitly programmed by users, which is non-trival and often results in suboptimal performance. The unified memory feature available in recent GPU architectures introduces another option, implicit management. However, our experiments show that it incurs several performance issues, especially under GPU memory oversubscription. In this paper, we propose a compiler and runtime collaborative approach to manage OpenMP GPU data under unified memory. In our framework, the compiler performs data reuse analysis to assist runtime data management. The runtime combines static and dynamic information to make optimized data management decisions. We have implement the proposed technology in the LLVM framework. The evaluation shows our method can achieve significant performance improvement for OpenMP GPU offloading.
Lingda Li, Hal Finkel, Martin Kong, Barbara Chapman
OpenMP 4.5 Validation and Verification Suite for Device Offload
OpenMP has been widely adopted for shared memory systems for over a decade. With the heterogeneity trend in architectures rapidly growing, the programming model needed to evolve such that applications could not only be ported to traditional CPUs but also to accelerators often acting as discrete or integrated devices to CPUs. To that end, OpenMP started to provide support for heterogeneous systems since 2013 when the version 4.0 of the specification was ratified. OpenMP 4.5 is being enhanced to cover major requirements of Exascale Computing Project (ECP) applications. As a result it is time-critical to ensure that the implementations of the 4.5 features are correct and conforming to the specification. This paper focuses on building a Validation and Verification testsuite that will test and present results for several offloading features implemented in compilers such as Clang, IBM XL C/C++, CCE, and GCC. We have results for our testsuite on TITAN, Summitdev and Summit at the Oak Ridge National Lab. We will highlight some of the ambiguities we encountered in the process of validating and verifying feature implementations. We also make the testsuite available for anyone to use and will walk the readers through the infrastructure and the workflow of the testsuite. A website has been built to capture our efforts narrated in this paper https://​crpl.​cis.​udel.​edu/​ompvvsollve.
Jose Monsalve Diaz, Swaroop Pophale, Oscar Hernandez, David E. Bernholdt, Sunita Chandrasekaran
Trade-Off of Offloading to FPGA in OpenMP Task-Based Programming
In High-Performance Computing (HPC), Field Programmable Gate Array (FPGA) is attracting increased attention as an accelerator because its performance has been dramatically improved in recent years. On the other hand, task-based programming recently supported in OpenMP 4.0 enables to expose much parallelism by executing several tasks of the program in the form of a task graph. To accelerate the task-based parallel program by FPGA, it is useful for some dominant tasks frequently executed in parallel to be offloaded to FPGA as an asynchronous FPGA task. We present a performance optimization based on the trade-off between the kernel size and the number of asynchronously executed kernels in parallel in OpenMP task-based programming with FPGA tasks to make use of FPGA hardware resources efficiently. Since a “program” for FPGA is directly converted into the hardware, the hardware resource limitation raises a new issue in optimization on which and how to offload a task to FPGA. Taking task-based block Cholesky factorization as a motivating example, we present the trade-off on how to offload dominant “GEMM” task frequently executed in parallel in the execution of the task-graph. We found that under the limitation of the hardware resource, multiple small kernels are better than a single big high-performance kernel because of higher throughput and higher kernel frequency.
Yutaka Watanabe, Jinpil Lee, Taisuke Boku, Mitsuhisa Sato

OpenMP Improvements and Innovations

Compiler Optimizations for OpenMP
Modern compilers support OpenMP as a convenient way to introduce parallelism into sequential languages like C/C++ and Fortran, however, its use also introduces immediate drawbacks. In many implementations, due to early outlining and the indirection though the OpenMP runtime, the front-end creates optimization barriers that are impossible to overcome by standard middle-end compiler passes. As a consequence, the OpenMP-annotated program constructs prevent various classic compiler transformations like constant propagation and loop invariant code motion. In addition, analysis results, especially alias information, is severely degraded in the presence of OpenMP constructs which can severely hurt performance.
In this work we investigate to what degree OpenMP runtime aware compiler optimizations can mitigate these problems. We discuss several transformations that explicitly change the OpenMP enriched compiler intermediate representation. They act as stand-alone optimizations but also enable existing optimizations that were not applicable before. This is all done in the existing LLVM/Clang compiler toolchain without introducing a new parallel representation. Our optimizations do not only improve the execution time of OpenMP annotated programs but also help to determine the caveats for transformations on the current representation of OpenMP.
Johannes Doerfert, Hal Finkel
Supporting Function Variants in OpenMP
Although the OpenMP API is supported across a wide and diverse set of architectures, different models of programming – and in extreme cases, different programs altogether – may be required to achieve high levels of performance on different platforms. We reduce the complexity of maintaining multiple implementations through a proposed extension to the OpenMP API that enables developers to specify that different code paths should be executed under certain compile-time conditions, including properties of: active OpenMP constructs; the targeted device; and available OpenMP runtime extensions. Our proposal directly addresses the complexities of modern applications, allowing for OpenMP contextual information to be passed across function call boundaries, translation units and library interfaces. This can greatly simplify the task of developing and maintaining a code with specializations that address performance for distinct platforms and environments.
S. John Pennycook, Jason D. Sewall, Alejandro Duran
Towards an OpenMP Specification for Critical Real-Time Systems
OpenMP is increasingly being considered as a convenient parallel programming model to cope with the performance requirements of critical real-time systems. Recent works demonstrate that OpenMP enables to derive guarantees on the functional and timing behavior of the system, a fundamental requirement of such systems. These works, however, focus only on the exploitation of fine grain parallelism and do not take into account the peculiarities of critical real-time systems, commonly composed of a set of concurrent functionalities. OpenMP allows exploiting the parallelism exposed within real-time tasks and among them. This paper analyzes the challenges of combining the concurrency model of real-time tasks with the parallel model of OpenMP. We demonstrate that OpenMP is suitable to develop advanced critical real-time systems by virtue of few changes on the specification, which allow the scheduling behavior desired (regarding execution priorities, preemption, migration and allocation strategies) in such systems.
Maria A. Serrano, Sara Royuela, Eduardo Quiñones

OpenMP User Experiences: Applications and Tools

Performance Tuning to Close Ninja Gap for Accelerator Physics Emulation System (APES) on Intel Xeon Phi Processors
Radio frequency field and particle interaction is of critical importance in modern synchrotrons. Accelerator Physics Emulation System (APES) is a C++ code written with the purpose of simulating the particle dynamics in ring-shaped accelerators. During the tracking process, the particles interact with each other indirectly through the EM field excited by the charged particles in the RF cavity. This a hot spot in the algorithm that takes up roughly 90% of the execution time. We show how a set of well-known code restructuring and algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to provide more than 7x performance improvements. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code.
Tianmu Xin, Zhengji Zhao, Yue Hao, Binping Xiao, Qiong Wu, Alexander Zaltsman, Kevin Smith, Xinmin Tian
Visualization of OpenMP* Task Dependencies Using Intel® Advisor – Flow Graph Analyzer
With the introduction of task dependences, the OpenMP API considerably extended the expressiveness of its task-based parallel programming model. With task dependences, programmers no longer have to rely on global synchronization mechanisms like task barriers. Instead they can locally synchronize a restricted subset of generated tasks by expressing an execution order through the depend clause. With the OpenMP tools interface of Technical Report 6 of the OpenMP API specification, it becomes possible to monitor task creation and execution along with the corresponding dependence information of these tasks. We use this information to construct a Task Dependence Graph (TDG) for the Flow Graph Analyzer (FGA) tool of Intel® Advisor. The TDG representation is used in FGA for deriving metrics and performance prediction and analysis of task-based OpenMP codes. We apply the FGA tool to two sample application kernels and expose issues in their usage of OpenMP tasks.
Vishakha Agrawal, Michael J. Voss, Pablo Reble, Vasanth Tovinkere, Jeff Hammond, Michael Klemm
A Semantics-Driven Approach to Improving DataRaceBench’s OpenMP Standard Coverage
DataRaceBench is a benchmark suite designed to systematically and quantitatively evaluate the effectiveness of data race detection tools. Its initial release in 2017 contained 72 C99 microbenchmarks with and without data races and was successfully used to evaluate several popular data race detection tools.
In this paper, we describe a novel semantics-driven approach to improving DataRaceBench’s OpenMP standard coverage. Based on a traditional definition of data races, we define several semantic categories for parallelism, data-sharing attributes, and synchronization. This allows us to assign semantic labels to constructs, clauses and data-sharing rules in the OpenMP 4.5 specification. Based on these labels we then analyze the coverage of the initial release of DataRaceBench and add 44 new C and C++ microbenchmarks to improve the OpenMP standard coverage. Finally, we re-evaluate two popular data race detection tools with the new microbenchmarks, and show that the new version of DataRaceBench gives new insights about the selected tools.
Chunhua Liao, Pei-Hung Lin, Markus Schordan, Ian Karlin

Tasking Evaluations

On the Impact of OpenMP Task Granularity
Tasks are a good support for composition. During the development of a high-level component model for HPC, we have experimented to manage parallelism from components using OpenMP tasks. Since version 4-0, the standard proposes a model with dependent tasks that seems very attractive because it enables the description of dependencies between tasks generated by different components without breaking maintainability constraints such as separation of concerns. The paper presents our feedback on using OpenMP in our context. We discover that our main issues are a too coarse task granularity for our expected performance on classical OpenMP runtimes, and a harmful task throttling heuristic counter-productive for our applications. We present a completion time breakdown of task management in the Intel OpenMP runtime and propose extensions evaluated on a testbed application coming from the Gysela application in plasma physics.
Thierry Gautier, Christian Perez, Jérôme Richard
Mapping OpenMP to a Distributed Tasking Runtime
Tasking was introduced in OpenMP 3.0 and every major release since has added features for tasks. However, OpenMP tasks coexist with other forms of parallelism which have influenced the design of their features. HPX is one of a new generation of task-based frameworks with the goal of extreme scalability. It is designed from the ground up to provide a highly asynchronous task-based interface for shared memory that also extends to distributed memory. This work introduces a new OpenMP runtime called OMPX, which provides a means to run OpenMP applications that do not use its accelerator features on top of HPX in shared memory. We describe the OpenMP and HPX execution models, and use microbenchmarks and application kernels to evaluate OMPX and compare their performance.
Jeremy Kemp, Barbara Chapman
Assessing Task-to-Data Affinity in the LLVM OpenMP Runtime
In modern shared-memory NUMA systems which typically consist of two or more multi-core processor packages with local memory, affinity of data to computation is crucial for achieving high performance with an OpenMP program. OpenMP* 3.0 introduced support for task-parallel programs in 2008 and has continued to extend its applicability and expressiveness. However, the ability to support data affinity of tasks is missing. In this paper, we investigate several approaches for task-to-data affinity that combine locality-aware task distribution and task stealing. We introduce the task affinity clause that will be part of OpenMP 5.0 and provide the reasoning behind its design. Evaluation with our experimental implementation in the LLVM OpenMP runtime shows that task affinity improves execution performance up to 4.5x on an 8-socket NUMA machine and significantly reduces runtime variability of OpenMP tasks. Our results demonstrate that a variety of applications can benefit from task affinity and that the presented clause is closing the gap of task-to-data affinity in OpenMP 5.0.
Jannis Klinkenberg, Philipp Samfass, Christian Terboven, Alejandro Duran, Michael Klemm, Xavier Teruel, Sergi Mateo, Stephen L. Olivier, Matthias S. Müller
Evolving OpenMP for Evolving Architectures
Bronis R. de Supinski
Pedro Valero-Lara
Xavier Martorell
Sergi Mateo Bellido
Jesus Labarta
Copyright Year
Electronic ISBN
Print ISBN