
Table of Contents


Fifth International Workshop on OpenMP, IWOMP 2009

Performance and Applications

Parallel Simulation of Bevel Gear Cutting Processes with OpenMP Tasks

Modeling of bevel gear cutting processes requires highly flexible data structures and algorithms. We compare OpenMP 3.0 tasks with previously applied approaches, such as nested parallel sections and stack-based algorithms, when parallelizing recursive procedures written in Fortran 95 that operate on binary tree structures.
Paul Kapinos, Dieter an Mey

Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP

Recently, multicore technology has been introduced to embedded systems in order to improve performance and reduce power consumption. In the present study, three SMP multicore processors for embedded systems and a multicore processor for a desktop PC are evaluated using a parallel benchmark written in OpenMP. The results indicate that, even if memory performance is low, applications that are not memory-intensive exhibit large speedups from parallelization. The results also indicate a large performance improvement due to parallelization using OpenMP, despite its low cost.
Toshihiro Hanawa, Mitsuhisa Sato, Jinpil Lee, Takayuki Imada, Hideaki Kimura, Taisuke Boku

Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and allows us to unambiguously leverage their known semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.
Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock, Thomas Panas

Scalability Evaluation of Barrier Algorithms for OpenMP

OpenMP relies heavily on barrier synchronization to coordinate the work of threads that are performing the computations in a parallel region. A good implementation of barriers is thus an important part of any implementation of this API. As the number of cores in shared and distributed shared memory machines continues to grow, the quality of the barrier implementation is critical for application scalability. There are a number of known algorithms for providing barriers in software. In this paper, we consider some of the most widely used approaches for implementing barriers on large-scale shared-memory multiprocessor systems: a "blocking" implementation that de-schedules a waiting thread, a "centralized" busy wait, and three forms of distributed busy-wait implementations. We have implemented the barrier algorithms in the runtime library associated with a research compiler, OpenUH. We first compare the impact of these algorithms on the overheads incurred for OpenMP constructs that involve a barrier, possibly implicitly. We then show how the different barrier implementations influence the performance of two different OpenMP application codes.
Ramachandra Nanjegowda, Oscar Hernandez, Barbara Chapman, Haoqiang H. Jin

Use of Cluster OpenMP with the Gaussian Quantum Chemistry Code: A Preliminary Performance Analysis

The Intel Cluster OpenMP (CLOMP) compiler and associated runtime environment offer the potential to run OpenMP applications over a few nodes of a cluster. This paper reports on our efforts to use CLOMP with the Gaussian quantum chemistry code. Sample results on a four node quad core Intel cluster show reasonable speedups. In some cases it is found preferable to use multiple nodes compared to using multiple cores within a single node. The performances of the different benchmarks are analyzed in terms of page faults and by using a critical path analysis technique.
Rui Yang, Jie Cai, Alistair P. Rendell, V. Ganesh

Runtime Environments

Evaluating OpenMP 3.0 Run Time Systems on Unbalanced Task Graphs

The UTS benchmark is used to evaluate task parallelism in OpenMP 3.0 as implemented in a number of recently released compilers and run-time systems. UTS performs a parallel search of an irregular and unpredictable search space, as arises, for example, in combinatorial optimization problems. As such, UTS presents a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies. Scalability and overheads are compared for OpenMP 3.0, Cilk, and an OpenMP implementation of the benchmark without tasks that performs all scheduling, load balancing, and termination detection explicitly. Current OpenMP 3.0 implementations generally exhibit poor behavior on the UTS benchmark.
Stephen L. Olivier, Jan F. Prins

Dynamic Task and Data Placement over NUMA Architectures: An OpenMP Runtime Perspective

Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid memory access penalties. Directive-based programming languages such as OpenMP provide programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system.
Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into “scheduling hints” to solve thread/memory affinity issues. It enables dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. First experiments show that mixed solutions (migrating threads and data) outperform next-touch-based data distribution policies and open possibilities for new optimizations.
François Broquedis, Nathalie Furmento, Brice Goglin, Raymond Namyst, Pierre-André Wacrenier

Scalability of Gaussian 03 on SGI Altix: The Importance of Data Locality on CC-NUMA Architecture

Performance anomalies when running Gaussian frequency calculations in parallel on SGI Altix computers with CC-NUMA memory architecture are analyzed using performance tools that access hardware counters. The bottleneck is the frequent and nearly simultaneous data loads, by all threads involved in the calculation, of data allocated on the node where the master thread runs. Code changes that ensure these data loads are localized improve performance by a factor close to two. The improvements carry over to other molecular models and other types of calculations. An extension of, or an alternative to, OpenMP's FirstPrivate clause could facilitate these code transformations.
Roberto Gomperts, Michael Frisch, Jean-Pierre Panziera

Tools and Benchmarks

Providing Observability for OpenMP 3.0 Applications

Providing observability for OpenMP applications is a technically challenging task. Most current tools treat OpenMP applications as native multi-threaded applications. They expose too much implementation detail while failing to present useful information at the OpenMP abstraction level. In this paper, we present a rich data model that captures the runtime behavior of OpenMP applications. By carefully designing interactions between all involved components (compiler, OpenMP runtime, collector, and analyzer), we are able to collect all needed information and keep overall runtime overhead and data volume low.
Yuan Lin, Oleg Mazurov

A Microbenchmark Suite for Mixed-Mode OpenMP/MPI

With the current prevalence of multi-core processors in HPC architectures, mixed-mode programming, using both MPI and OpenMP in the same application, is becoming increasingly important. However, no low-level synthetic benchmarks exist to test the performance of this programming model. We have designed and implemented a set of microbenchmarks for mixed-mode programming, including both point-to-point and collective communication patterns. These microbenchmarks have been run on a number of current HPC architectures: the results show some interesting performance differences between the architectures and highlight some possible inefficiencies in the implementation of MPI on multi-core systems.
J. Mark Bull, James P. Enright, Nadia Ameer

Performance Profiling for OpenMP Tasks

Tasking in OpenMP 3.0 allows irregular parallelism to be expressed much more easily and it is expected to be a major step towards the widespread adoption of OpenMP for multicore programming. We discuss the issues encountered in providing monitoring support for tasking in an existing OpenMP profiling tool with respect to instrumentation, measurement, and result presentation.
Karl Fürlinger, David Skinner

Proposed Extensions to OpenMP

Tile Reduction: The First Step towards Tile Aware Parallelization in OpenMP

Tiling is widely used by compilers and programmers to optimize scientific and engineering code for better performance. Many parallel programming languages support tiles/tiling directly through first-class language constructs or library routines. However, the current OpenMP programming language is tile oblivious, although it is the de facto standard for writing parallel programs on shared memory systems. In this paper, we introduce tile aware parallelization into OpenMP. We propose tile reduction, an OpenMP tile aware parallelization technique that allows reduction to be performed on multi-dimensional arrays. The paper makes three contributions: (a) it is the first paper to propose and discuss tile aware parallelization in OpenMP, and we argue that it is not only necessary but also possible to have tile aware parallelization in OpenMP; (b) it introduces the methods used to implement tile reduction, including the required OpenMP API extension and the associated code generation techniques; (c) we have applied tile reduction to a set of benchmarks. The experimental results show that tile reduction can make parallelization more natural and flexible. Not only can it expose more parallelism in a program, it can also improve the program's data locality.
Ge Gan, Xu Wang, Joseph Manzano, Guang R. Gao

A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

OpenMP has evolved recently towards expressing unstructured parallelism, targeting the parallelization of a broader range of applications in the current multicore era. Homogeneous multicore architectures from major vendors have become mainstream, but there are clear indications that a better performance/power ratio can be achieved using more specialized hardware (accelerators), such as SSE-based units or GPUs, clearly deviating from the easy-to-understand shared-memory homogeneous architectures. This paper investigates whether OpenMP can still thrive in this new scenario and proposes a possible way to extend the current specification to reasonably integrate heterogeneity while preserving simplicity and portability. The paper builds on a previous proposal that extended tasking with dependencies. The runtime is in charge of data movement, task scheduling based on these data dependencies, and the appropriate selection of the target accelerator depending on system configuration and resource availability.
Eduard Ayguade, Rosa M. Badia, Daniel Cabrera, Alejandro Duran, Marc Gonzalez, Francisco Igual, Daniel Jimenez, Jesus Labarta, Xavier Martorell, Rafael Mayo, Josep M. Perez, Enrique S. Quintana-Ortí

Identifying Inter-task Communication in Shared Memory Programming Models

Modern computers often use multi-core architectures, ranging from clusters of homogeneous cores for high performance computing to the heterogeneous architectures typically found in embedded systems. To efficiently program such architectures, it is important to be able to partition and map programs onto the cores of the architecture. We believe that communication patterns need to become explicit in the source code to make it easier to analyze and partition parallel programs. Extraction of these patterns is difficult to automate due to limitations in compiler techniques when determining the effects of pointers.
In this paper, we propose an OpenMP extension which allows programmers to explicitly declare the pointer-based data sharing between coarse-grain program parts. We present a dependency directive, expressing the input and output relation between program parts and pointers to shared data, as well as a set of runtime operations which are necessary to enforce declarations made by the programmer. The cost and scalability of the runtime operations are evaluated using micro-benchmarks and a benchmark from the NAS parallel benchmark suite. The measurements show that the overhead of the runtime operations is small; in fact, no performance degradation is found when using them in the NAS benchmark.
Per Larsen, Sven Karlsson, Jan Madsen
