
2016 | Book

OpenMP: Memory, Devices, and Tasks

12th International Workshop on OpenMP, IWOMP 2016, Nara, Japan, October 5-7, 2016, Proceedings


About this Book

This book constitutes the proceedings of the 12th International Workshop on OpenMP, IWOMP 2016, held in Nara, Japan, in October 2016.
The 24 full papers presented in this volume were carefully reviewed and selected from 28 submissions. They were organized in topical sections named: applications, locality, task parallelism, extensions, tools, accelerator programming, and performance evaluations and optimization.

Table of Contents

Frontmatter

Applications

Frontmatter
Estimation of Round-off Errors in OpenMP Codes
Abstract
It is crucial to control round-off error propagation in numerical simulations, because such errors can significantly affect computed results, especially in parallel codes such as OpenMP ones. In this paper, we present a new version of the CADNA library that enables the numerical validation of OpenMP codes. At a reasonable cost in execution time, it estimates which digits in computed results are affected by round-off errors and detects numerical instabilities that may occur during execution. The benefits of this new OpenMP-enabled CADNA version are demonstrated on various applications, along with performance results on multi-core and many-core (Intel Xeon Phi) architectures.
Pacôme Eberhart, Julien Brajard, Pierre Fortin, Fabienne Jézéquel
OpenMP Parallelization and Optimization of Graph-Based Machine Learning Algorithms
Abstract
We investigate the OpenMP parallelization and optimization of two novel data classification algorithms. The new algorithms are based on graph and PDE solution techniques and provide significant accuracy and performance advantages over traditional data classification algorithms in serial mode. The methods leverage the Nyström extension to calculate eigenvalues/eigenvectors of the graph Laplacian; this is a self-contained module that can be used in conjunction with other graph-Laplacian-based methods such as spectral clustering. We use performance tools to identify the hotspots and memory access patterns of the serial codes and use OpenMP as the parallelization language to parallelize the most time-consuming parts. Where possible, we also use library routines. We then optimize the OpenMP implementations and detail the performance on traditional supercomputer nodes (in our case a Cray XC30), and test the optimization steps on emerging testbed systems based on Intel's Knights Corner and Knights Landing processors. We show both performance improvement and strong scaling behavior. A large number of optimization techniques and analyses are necessary before the algorithm reaches almost ideal scaling.
Zhaoyi Meng, Alice Koniges, Yun (Helen) He, Samuel Williams, Thorsten Kurth, Brandon Cook, Jack Deslippe, Andrea L. Bertozzi

Locality

Frontmatter
Evaluating OpenMP Affinity on the POWER8 Architecture
Abstract
As we move toward pre-Exascale systems, two of the DOE leadership-class systems will consist of very powerful OpenPOWER compute nodes that will be more complex to program. These systems will have massive amounts of parallelism, with threads running on POWER9 cores as well as on accelerators. Advances in memory interconnects, such as NVLINK, will provide a unified shared memory address space spanning different types of memory (HBM, DRAM, etc.). In preparation for such systems, we need to improve our understanding of how OpenMP supports the concept of affinity, as well as memory placement, on POWER8 systems. Data locality and affinity are key program optimizations for exploiting a node's compute and memory capabilities: they achieve good performance by minimizing data motion across NUMA domains and accessing the cache efficiently. This paper is a first step toward evaluating the current affinity features of OpenMP 4.0 on POWER8 processors and measuring their effects on a system with two POWER8 sockets. We experiment with the different affinity settings provided by OpenMP 4.0 to quantify the cost of good data locality versus poor data locality, and measure their effects via hardware counters. We also identify which affinity settings benefit most from data locality. Based on this study, we describe the current state of the art, the challenges we faced in quantifying the effects of affinity, and ideas on how OpenMP 5.0 should be improved to address affinity in the context of NUMA domains and accelerators.
Swaroop Pophale, Oscar Hernandez
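A minimal sketch of the kind of experiment the abstract above describes, using standard OpenMP 4.0 affinity controls; the kernel and array size are illustrative, not taken from the paper:

    /* affinity_demo.c -- compile with: gcc -fopenmp affinity_demo.c
     * Try, e.g.:
     *   OMP_PLACES=cores OMP_PROC_BIND=close  ./a.out   (pack threads together)
     *   OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out   (spread across NUMA domains)
     */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof *a);

        /* First touch: pages land on the NUMA node of the initializing
         * thread, so initialize with the same binding used for compute. */
        #pragma omp parallel for
        for (long i = 0; i < N; i++) a[i] = 1.0;

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < N; i++) sum += a[i];

        printf("threads=%d sum=%g\n", omp_get_max_threads(), sum);
        free(a);
        return 0;
    }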
Workstealing and Nested Parallelism in SMP Systems
Abstract
We present a workstealing scheduler and show its use in two separate areas: (1) to enable hierarchical parallelism and per-core load balancing in stencil codes, and (2) to reduce overhead in per-thread load balancing in particle codes.
Larry Meadows, Simon J. Pennycook, Alex Duran, Terry Wilmarth, Jim Cownie
Description, Implementation and Evaluation of an Affinity Clause for Task Directives
Abstract
OpenMP 4.0 introduced dependent tasks, which give the programmer a way to express fine-grain parallelism. Using appropriate OS support (such as NUMA libraries), the runtime can rely on the information in the depend clause to dynamically map the tasks to the architecture topology. Controlling data locality is one of the key factors in reaching a high level of performance when targeting NUMA architectures. On this topic, OpenMP does not yet give the programmer much flexibility, leaving the runtime to decide where a task should be executed. In this paper, we present a class of applications that would benefit from having such control and flexibility over task and data placement. We also propose our own interpretation of the new affinity clause for the task directive, which is being discussed by the OpenMP Architecture Review Board. This clause enables the programmer to give the runtime hints about task placement during program execution, which can be used to control data mapping on the architecture. In our proposal, the programmer can express affinity between a task and the following resources: a thread, a NUMA node, and data. We then present an implementation of this proposal in the Clang 3.8 compiler, along with the corresponding extensions in our OpenMP runtime libKOMP. Finally, we present a preliminary evaluation of this work, running two task-based OpenMP kernels on a 192-core NUMA architecture, which shows noticeable improvements in both performance and scalability.
Philippe Virouleau, Adrien Roussel, François Broquedis, Thierry Gautier, Fabrice Rastello, Jean-Marc Gratien
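A sketch in the spirit of that proposal; the syntax below is illustrative only, since the exact clause wording was still under discussion by the ARB and is not standard OpenMP 4.x:

    /* Hypothetical affinity hints, one per resource kind in the proposal.
     * Not standard OpenMP 4.x syntax -- for illustration only. */
    #pragma omp task affinity(data: block)   /* run near the NUMA node holding 'block' */
    process(block);

    #pragma omp task affinity(node: 2)       /* prefer NUMA node 2 */
    background_work();

    #pragma omp task affinity(thread: 0)     /* prefer the thread with id 0 */
    small_follow_up();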

Task Parallelism

Frontmatter
NUMA-Aware Task Performance Analysis
Abstract
The tasking feature enriches OpenMP with a way to express parallelism more generally than before: it can be applied not only to loops but also to recursive algorithms, without the need for nested parallel regions. However, the performance of a tasking program is strongly influenced by the task scheduling inside the OpenMP runtime. Especially on large NUMA systems, and when tasks work on shared data structures that are split across NUMA nodes, the runtime's influence is significant. A programmer has no easy way to examine these performance-relevant decisions taken by the runtime, neither with functionality provided by OpenMP nor with external performance tools. We therefore present a method, based on the Score-P measurement infrastructure, that allows task-parallel programs on NUMA systems to be analyzed in more depth, letting the user see whether tasks were executed by the creating thread or remotely, on the same or a different socket. As an example, the Intel and GNU compilers were used to execute the same task-parallel code, where a performance difference of 8x was observed, due mainly to task scheduling. We evaluate the presented method by investigating both execution runs and highlight the differences in the task scheduling applied.
Dirk Schmidl, Matthias S. Müller
OpenMP Extension for Explicit Task Allocation on NUMA Architecture
Abstract
Most modern HPC systems consist of a number of cores grouped into multiple NUMA nodes; the latest Intel processors even have multiple NUMA nodes inside a single chip. Task parallelism using OpenMP dependent tasks is a promising programming model for many-core architectures because it can exploit parallelism in irregular applications with fine-grain synchronization. However, the current specification lacks functionality for improving data locality in task parallelism. In this paper, we propose an extension to the OpenMP task construct that specifies the location of tasks, exploiting locality in an explicit manner. A prototype compiler is implemented based on GCC. A performance evaluation using the KASTORS benchmark suite shows that our approach can reduce remote page accesses. The Jacobi kernel using our approach performs 3.6 times better than GCC when using 36 threads on a 36-core, 4-NUMA-node machine.
Jinpil Lee, Keisuke Tsugane, Hitoshi Murai, Mitsuhisa Sato
Approaches for Task Affinity in OpenMP
Abstract
OpenMP tasking supports parallelization of irregular algorithms. Recent OpenMP specifications extended tasking to increase functionality and to support optimizations, for instance with the taskloop construct. However, task scheduling remains opaque, which leads to inconsistent performance on NUMA architectures. We assess design issues for task affinity and explore several approaches to enable it. We evaluate these proposals with implementations in the Nanos++ and LLVM OpenMP runtimes that improve performance by up to 40 % and significantly reduce execution time variation.
Christian Terboven, Jonas Hahnfeld, Xavier Teruel, Sergi Mateo, Alejandro Duran, Michael Klemm, Stephen L. Olivier, Bronis R. de Supinski
Towards Unifying OpenMP Under the Task-Parallel Paradigm
Implementation and Performance of the taskloop Construct
Abstract
OpenMP 4.5 introduced a task-parallel version of the classical thread-parallel for-loop construct: the taskloop construct. With this new construct, programmers can choose between the two parallel paradigms when parallelizing their for loops. However, it is unclear where and when each approach should be used to write efficient parallel applications.
In this paper, we explore the taskloop construct. We study performance differences between traditional thread-parallel for loops and the new taskloop directive. We introduce an efficient implementation and compare it to other taskloop implementations using micro- and kernel-benchmarks, as well as an application. We show that our taskloop implementation results, on average, in a 3.2 % increase in peak performance compared to corresponding parallel-for loops.
Artur Podobas, Sven Karlsson
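The two parallelization styles compared in the paper, side by side in OpenMP 4.5 syntax; the loop body and grainsize are placeholders:

    /* Thread-parallel: iterations statically split across the team. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* Task-parallel (OpenMP 4.5): the loop is divided into tasks that
     * any thread may pick up, which helps under load imbalance. */
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(1024)
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];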
A Case for Extending Task Dependencies
Abstract
Tasks offer a natural mechanism to express asynchronous operations in OpenMP, as well as to express parallel patterns with dynamic sizes and shapes. Since the release of OpenMP 4, task dependencies have made an already flexible tool practical in many more situations. Even so, while tasks can be made asynchronous with respect to the encountering thread, there is no mechanism to tie an OpenMP task into a truly asynchronous operation outside of OpenMP without blocking an OpenMP thread. Additionally, producer/consumer parallel patterns, or more generally pipeline-parallel patterns, suffer from the lack of a convenient and efficient point-to-point synchronization and data-passing mechanism. This paper presents a set of extensions, leveraging the task and dependency mechanisms, that help users and implementers tie tasks into other asynchronous systems and express pipeline parallelism more naturally, while decreasing the overhead of passing data between otherwise small tasks by as much as 80 %.
Tom Scogland, Bronis de Supinski
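For reference, the baseline OpenMP 4.0 mechanism the extensions build on: point-to-point producer/consumer ordering via the depend clause (names are illustrative):

    int buf;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: buf)   /* producer */
        buf = produce();

        #pragma omp task depend(in: buf)    /* consumer: starts only after producer */
        consume(buf);
    }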
OpenMP as a High-Level Specification Language for Parallelism
And its use in Evaluating Parallel Programming Systems
Abstract
While OpenMP is the de facto standard shared-memory parallel programming model, a number of alternative programming models and runtime systems have arisen in recent years. Fairly evaluating these programming systems can be challenging and can require significant manual effort on the part of researchers. However, facilitating these comparisons is important as a way of advancing both the available OpenMP runtimes and the research being done with these novel programming systems.
In this paper we present the OpenMP-to-X framework, an open source tool for mapping OpenMP constructs and APIs to other parallel programming systems. We apply OpenMP-to-X to the HClib parallel programming library, and use it to enable a fair and objective comparison of performance and programmability among HClib, GNU OpenMP, and Intel OpenMP. We use this investigation to expose performance bottlenecks in both the Intel OpenMP and HClib runtimes, to motivate improvements to the HClib programming model and runtime, and to propose potential extensions to the OpenMP standard. Our performance analysis shows that, across a wide range of benchmarks, HClib demonstrates significantly less volatility in its performance with a median standard deviation of 1.03 % in execution times and outperforms the two OpenMP implementations on 15 out of 24 benchmarks.
Max Grossman, Jun Shirako, Vivek Sarkar
Scaling FMM with Data-Driven OpenMP Tasks on Multicore Architectures
Abstract
Poor scalability on parallel architectures can be attributed to several factors, among which idle times, data movement, and runtime overhead are predominant. Conventional parallel loops and nested parallelism have proved successful for regular computational patterns. For more complex and irregular cases, however, these methods often perform poorly because they consider only a subset of these costs. Although data-driven methods are gaining popularity for efficiently utilizing computational cores, their data movement and runtime costs can be prohibitive for highly dynamic and irregular algorithms, such as fast multipole methods (FMMs). Furthermore, loop tiling, a technique that promotes data locality and has been successful for regular parallel methods, has received little attention in the context of dynamic and irregular parallelism.
We present a method to exploit loop tiling in data-driven parallel methods. Here, we specify a methodology to spawn work units characterized by a high data-locality potential. Work units operate on tiled computational patterns and serve as building blocks in an OpenMP task-based data-driven execution. In particular, by adjusting work-unit granularity, idle times and runtime overheads are also taken into account. We apply this method to a popular FMM implementation and show that, with careful tuning, the new method outperforms existing parallel-loop and user-level thread-based implementations by up to fourfold on 48 cores.
Abdelhalim Amer, Satoshi Matsuoka, Miquel Pericàs, Naoya Maruyama, Kenjiro Taura, Rio Yokota, Pavan Balaji
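A minimal sketch of the tiling-plus-dependencies idea, under stated assumptions: a 1-D array, a TILE-sized work unit, and an update_tile function are all illustrative, not the paper's FMM code:

    /* Spawn one task per tile; array-section dependencies order neighbouring
     * tiles without a global barrier, so work units stay data-driven. */
    #pragma omp parallel
    #pragma omp single
    for (int i = TILE; i < n - TILE; i += TILE) {
        #pragma omp task depend(inout: a[i:TILE]) \
                         depend(in: a[i-TILE:TILE], a[i+TILE:TILE])
        update_tile(&a[i], TILE);
    }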

Extensions

Frontmatter
Reducing the Functionality Gap Between Auto-Vectorization and Explicit Vectorization
Compress/Expand and Histogram
Abstract
Explicit vectorization of C/C++ and Fortran application programs was pioneered by Intel® Cilk™ Plus and then inherited and enhanced by the OpenMP 4.0 and 4.5 standards. There is a known functionality gap: some auto-vectorizable code has no adequate syntax support for explicit vector programming. In this paper, we propose and discuss a few syntax extensions to close this gap for the compress/expand and histogram idioms, which are common in high-performance computing.
Hideki Saito, Serge Preis, Nikolay Panchenko, Xinmin Tian
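The histogram idiom behind the proposal; in explicit OpenMP 4.x SIMD the conflicting updates have no dedicated syntax, and the closest existing workaround is an ordered simd block (hist, idx and n are placeholders):

    /* hist[idx[i]]++ may collide when idx[] repeats a value, so a plain
     * '#pragma omp simd' would race. OpenMP 4.5 can only serialize the
     * update, losing most of the vectorization benefit. */
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        #pragma omp ordered simd
        hist[idx[i]]++;
    }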
A Proposal to OpenMP for Addressing the CPU Oversubscription Challenge
Abstract
OpenMP has become a successful programming model for developing multi-threaded applications. However, challenges remain in OpenMP's interoperability, both within itself and with other parallel programming APIs. In this paper, we explore typical use cases that expose OpenMP's interoperability challenges and report the solutions proposed by the OpenMP Interoperability language subcommittee for addressing the resource oversubscription issue. The solutions include OpenMP runtime routines for changing the wait policy of idling threads, which may be ACTIVE (SPIN_BUSY or SPIN_PAUSE) or PASSIVE (SPIN_YIELD or SUSPEND), for improved resource management, and routines that support contributing OpenMP threads to other thread libraries or tasks. Our initial implementations extend two OpenMP runtime libraries, Intel OpenMP (IOMP) and GNU OpenMP (GOMP). The evaluation results demonstrate the effectiveness of the proposed approach in addressing the CPU oversubscription challenge, and detailed analysis provides heuristics for selecting an optimal wait policy according to the oversubscription ratio.
Yonghong Yan, Jeff R. Hammond, Chunhua Liao, Alexandre E. Eichenberger
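A sketch of how the proposed routines might be used; the routine name and constants below mirror the proposal's terminology, but the signatures are assumptions, not an implemented standard API:

    /* Hypothetical, based on the subcommittee proposal -- not standard OpenMP. */
    omp_set_wait_policy(SPIN_PAUSE);     /* ACTIVE: idle threads spin, issuing pause */

    #pragma omp parallel
    {
        /* ... OpenMP phase ... */
    }

    omp_set_wait_policy(SUSPEND);        /* PASSIVE: suspend idle threads so their
                                            cores can be used by another runtime
                                            (e.g. MPI or TBB) without oversubscription */
    call_other_runtime();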

Tools

Frontmatter
Testing Infrastructure for OpenMP Debugging Interface Implementations
Abstract
With complex codes moving to systems of greater on-node parallelism using OpenMP, debugging these codes is becoming increasingly challenging. While debuggers can significantly aid programmers, OpenMP support within existing debuggers is either largely ineffective or unsustainable. The OpenMP tools working group is working to specify a debugging interface for the OpenMP standard to be implemented by every OpenMP runtime implementation. To increase the acceptance of this interface by runtime implementers and to ensure the quality of these interface implementations, availability of a common testing infrastructure compatible with any runtime implementation is critical. In this paper, we present a promising software architecture for such a testing infrastructure.
Joachim Protze, Dong H. Ahn, Ignacio Laguna, Martin Schulz, Matthias S. Müller
The Secrets of the Accelerators Unveiled: Tracing Heterogeneous Executions Through OMPT
Abstract
Heterogeneous systems are an important trend in the future of supercomputers, yet they can be hard to program, and developers still lack powerful tools to understand how well their accelerated codes perform and how to improve them.
With different types of hardware accelerators available, each with its own specific low-level programming API, there is not yet a clear consensus on a standard way to retrieve information about an accelerator's performance. To improve this situation, OMPT, a novel performance-monitoring interface, is being considered for integration into the OpenMP standard. OMPT allows analysis tools to monitor the execution of parallel OpenMP applications by providing detailed information about the activity of the runtime through a standard API. For accelerator devices, OMPT also facilitates the exchange of performance information between the runtime and the analysis tool. We implement the part of the OMPT specification that covers accelerators in both the Nanos++ parallel runtime system and the Extrae tracing framework, obtaining detailed performance information about the execution of the tasks issued to accelerator devices in order to conduct insightful analyses.
Our work extends previous efforts in the field to expose detailed information from the OpenMP and OmpSs runtimes regarding the activity and performance of task-based parallel applications. In this paper, we focus on the evaluation of FPGA devices, studying the performance of two kernels common in scientific algorithms: matrix multiplication and Cholesky decomposition. Furthermore, this development applies seamlessly to the analysis of GPGPU accelerators and Intel® Xeon Phi™ co-processors operating under the OmpSs programming model.
Germán Llort, Antonio Filgueras, Daniel Jiménez-González, Harald Servat, Xavier Teruel, Estanislao Mercadal, Carlos Álvarez, Judit Giménez, Xavier Martorell, Eduard Ayguadé, Jesús Labarta
Language-Centric Performance Analysis of OpenMP Programs with Aftermath
Abstract
We present a new set of tools for the language-centric performance analysis and debugging of OpenMP programs that allows programmers to relate dynamic information from parallel execution to OpenMP constructs. Users can visualize execution traces, examine aggregate metrics on parallel loops and tasks, such as load imbalance or synchronization overhead, and obtain detailed information on specific events, such as the partitioning of a loop’s iteration space, its distribution to workers according to the scheduling policy and fine-grain synchronization. Our work is based on the Aftermath performance analysis tool and a ready-to-use, instrumented version of the LLVM/clang OpenMP run-time with negligible overhead for tracing. By analyzing the performance of the MG application of the NPB suite, we show that language-centric performance analysis in general and our tools in particular can help improve the performance of large-scale OpenMP applications significantly.
Andi Drebes, Jean-Baptiste Bréjon, Antoniu Pop, Karine Heydemann, Albert Cohen

Accelerator Programming

Frontmatter
Pragmatic Performance Portability with OpenMP 4.x
Abstract
In this paper we investigate current compiler technologies supporting OpenMP 4.x features across a range of devices: the Cray compiler 8.5.0 targeting an Intel Xeon Broadwell and an NVIDIA K20x, IBM's OpenMP 4.5 Clang branch (clang-ykt) targeting an NVIDIA K20x, the Intel compiler 16 targeting an Intel Xeon Phi Knights Landing, and GCC 6.1 targeting an AMD APU. We outline the mechanisms they use to map the OpenMP model onto their target architectures and conduct performance testing with a number of representative data-parallel kernels. We then discuss the current state of play in terms of performance portability and propose some straightforward guidelines for writing performance-portable code, derived from our observations. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of the required directives across compilers and architectures is feasible.
Matt Martineau, James Price, Simon McIntosh-Smith, Wayne Gaudin
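An example of the pre-processor specialization the paper anticipates, on a representative data-parallel kernel; the TARGET_GPU macro is illustrative, not one of the paper's build flags:

    /* One source, two directive sets: offload to a device when built for one,
     * fall back to host threads + SIMD otherwise. */
    #ifdef TARGET_GPU
    #pragma omp target teams distribute parallel for simd \
                map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #else
    #pragma omp parallel for simd
    #endif
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];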
Multiple Target Task Sharing Support for the OpenMP Accelerator Model
Abstract
The use of GPU accelerators is becoming common in HPC platforms due to their performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to use them concurrently, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e. accelerators in conjunction with the vector and heavily multithreaded capabilities of multicore processors). The compiler is responsible for generating device-specific code for each device kind, delegating to the runtime system the dynamic scheduling of tasks onto the available devices. The newly proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM POWER8. From the evaluation we observe speed-ups in the 8-20 % range compared to versions in which only the GPU is used, reaching 96 % of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler.
Guray Ozen, Sergi Mateo, Eduard Ayguadé, Jesús Labarta, James Beyer
Early Experiences Porting Three Applications to OpenMP 4.5
Abstract
Many application developers need code that runs efficiently on multiple architectures but cannot afford to maintain architecture-specific codes. With the addition of target directives to support offload to accelerators, OpenMP now has the machinery to support performance-portable code development. In this paper, we describe ports of the Kripke, Cardioid, and LULESH applications to OpenMP 4.5 and discuss our successes and failures. Challenges encountered include how OpenMP interacts with C++, including classes with virtual methods and lambda functions. The lack of deep-copy support in OpenMP also increased code complexity. Finally, the GPU's inability to handle virtual function calls required code restructuring. Despite these challenges, we demonstrate that OpenMP obtains performance within 10 % of hand-written CUDA for memory-bandwidth-bound kernels in LULESH. In addition, we show that, with a minor change to the OpenMP standard, register usage for OpenMP code can be reduced by up to 10 %.
Ian Karlin, Tom Scogland, Arpith C. Jacob, Samuel F. Antao, Gheorghe-Teodor Bercea, Carlo Bertolli, Bronis R. de Supinski, Erik W. Draeger, Alexandre E. Eichenberger, Jim Glosli, Holger Jones, Adam Kunen, David Poliakoff, David F. Richards
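The deep-copy limitation in its minimal form, under illustrative names (Field and make_field are not from the paper): mapping a struct copies only the pointer bits, so in OpenMP 4.5 the payload must be mapped through a separate pointer by hand:

    typedef struct { int n; double *data; } Field;

    Field f = make_field(1000);
    double *d = f.data;   /* map the payload via a plain pointer */

    /* map(f) copies the struct (including the host pointer value) but not
     * what it points to -- OpenMP 4.5 has no automatic deep copy. */
    #pragma omp target data map(to: f) map(tofrom: d[0:f.n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < f.n; i++)
            d[i] *= 2.0;
    }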
Design and Preliminary Evaluation of Omni OpenACC Compiler for Massive MIMD Processor PEZY-SC
Abstract
PEZY-SC is a novel massive Multiple Instruction Multiple Data (MIMD) processor used as an accelerator and characterized by high power efficiency. OpenACC is a standard directive-based programming model for accelerators with which programmers can concisely offload data and computation to accelerators. In this paper, we present the design and a preliminary implementation of an OpenACC compiler for PEZY-SC. Our compiler translates C code with OpenACC directives to the corresponding PZCL code, PZCL being the programming environment for PEZY-SC. The evaluation shows that the OpenACC version achieves over 98 % of the performance of the PZCL version on N-body and up to 88 % on the NAS Parallel Benchmarks CG kernel. In addition, we examined optimization techniques such as kernel merging and explicit context switching to exploit the PEZY-SC MIMD architecture, which differs from Single Instruction Multiple Data (SIMD) graphics processing units. We found these optimizations useful for improving performance, and they will be implemented in a future release.
Akihiro Tabuchi, Yasuyuki Kimura, Sunao Torii, Hideo Matsufuru, Tadashi Ishikawa, Taisuke Boku, Mitsuhisa Sato
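For reference, the kind of OpenACC input such a compiler consumes (a generic vector-add, not one of the paper's benchmarks):

    /* C + OpenACC: data movement and kernel offload expressed with directives. */
    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }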

Performance Evaluations and Optimization

Frontmatter
Evaluating OpenMP Implementations for Java Using PolyBench
Abstract
This paper proposes a benchmark suite to evaluate the performance and scalability of (unofficial) OpenMP implementations for Java. The benchmark suite is based on our Java port of PolyBench, a polyhedral benchmark suite. We selected PolyBench over other existing benchmarks, such as JGF, because it allows us to use the OpenMP C version as a performance and scalability reference. Further, PolyBench was conceived as a benchmark suite for analysing the optimisation capabilities of compilers. It is interesting to study these capabilities in the OpenMP context of a dynamically compiled language like Java, in comparison to statically compiled C. We apply the benchmark suite to two Java OpenMP implementations, Pyjama and JOMP, and compare them with C code compiled by GCC, both optimised and unoptimised. The sometimes surprising and unexpected results shed light on the appropriateness of Java as an OpenMP platform, the areas for improvement, and the usefulness of this benchmark suite.
Xing Fan, Rui Feng, Oliver Sinnen, Nasser Giacaman
Transactional Memory for Algebraic Multigrid Smoothers
Abstract
This paper extends our earlier investigations comparing transactional memory to traditional OpenMP synchronization mechanisms [7, 8]. We study similar issues for algebraic multigrid (AMG) smoothers in hypre [16], a mature and widely used production-quality linear solver library. We compare the transactional version of the Gauss-Seidel AMG smoother to an omp critical version and the default hybrid Gauss-Seidel smoother, as well as the ℓ1 variations of both the Gauss-Seidel and Jacobi smoothers. Importantly, we present results for real-life 2-D and 3-D problems discretized by the finite element method that demonstrate the TM option can outperform the existing methods, often by orders of magnitude, in terms of the recently introduced performance measure of run time per quality.
Barna L. Bihari, Ulrike M. Yang, Michael Wong, Bronis R. de Supinski
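The synchronization choice at issue, sketched on a generic sparse Gauss-Seidel sweep; offdiag_dot and the arrays are placeholders, and tm_atomic stands in for a compiler-specific transactional construct that is not standard OpenMP:

    /* omp critical: one coarse lock serializes every conflicting update. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        u[i] = (f[i] - offdiag_dot(i, u)) / diag[i];
    }

    /* Transactional alternative (compiler-specific, e.g. on BG/Q):
     *   #pragma tm_atomic
     *   u[i] = (f[i] - offdiag_dot(i, u)) / diag[i];
     * Conflicts are detected and rolled back only when they actually occur,
     * which pays off when collisions are rare. */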
Supporting Adaptive Privatization Techniques for Irregular Array Reductions in Task-Parallel Programming Models
Abstract
Irregular array-type reductions represent a recurring algorithmic pattern in many scientific applications. Their scalable execution on modern systems is not trivial, as their irregular memory-access pattern prohibits efficient use of the memory subsystem, and costly techniques are needed to eliminate data races. A closer look at algorithms, memory-access patterns and support techniques reveals that no one-size-fits-all solution exists, and that approaches are needed which can adapt to individual properties while maintaining programming transparency. In this work we propose a solution framework that generalizes the concept of privatization to support a variety of techniques, implements an inspector-executor to provide memory-access analytics to the runtime for automatic tuning, and shows which language extensions are needed. A reference implementation in OmpSs, a task-parallel programming model, demonstrates the programmability and scalability of this solution.
Jan Ciesko, Sergi Mateo, Xavier Teruel, Xavier Martorell, Eduard Ayguadé, Jesus Labarta
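The pattern in question, in its standard OpenMP 4.5 form; the array-section reduction privatizes the array per thread, which is exactly the memory cost an adaptive framework must manage (names are placeholders):

    /* Each thread gets a private copy of hist[0:nbins], combined at the end.
     * Memory grows with the thread count -- fine for small arrays, costly
     * for large, sparsely updated ones. */
    #pragma omp parallel for reduction(+: hist[0:nbins])
    for (int i = 0; i < n; i++)
        hist[bin[i]] += w[i];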
Backmatter
Metadata
Title
OpenMP: Memory, Devices, and Tasks
Edited by
Naoya Maruyama
Bronis R. de Supinski
Mohamed Wahib
Copyright Year
2016
Electronic ISBN
978-3-319-45550-1
Print ISBN
978-3-319-45549-5
DOI
https://doi.org/10.1007/978-3-319-45550-1