
2008 | Book

OpenMP in a New Era of Parallelism

4th International Workshop, IWOMP 2008, West Lafayette, IN, USA, May 12-14, 2008, Proceedings

Edited by: Rudolf Eigenmann, Bronis R. de Supinski

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

OpenMP is a widely accepted, standard application programming interface (API) for high-level shared-memory parallel programming in Fortran, C, and C++. Since its introduction in 1997, OpenMP has gained support from most high-performance compiler and hardware vendors. Under the direction of the OpenMP Architecture Review Board (ARB), the OpenMP specification has evolved, including the recent release of Specification 3.0. Active research in OpenMP compilers, runtime systems, tools, and environments drives its evolution, including new features such as tasking. The community of OpenMP researchers and developers in academia and industry is united under cOMPunity (www.compunity.org). This organization has held workshops on OpenMP around the world since 1999: the European Workshop on OpenMP (EWOMP), the North American Workshop on OpenMP Applications and Tools (WOMPAT), and the Asian Workshop on OpenMP Experiences and Implementation (WOMPEI) attracted annual audiences from academia and industry. The International Workshop on OpenMP (IWOMP) consolidated these three workshop series into a single annual international event that rotates across the previous workshop sites. The first IWOMP meeting was held in 2005, in Eugene, Oregon, USA. IWOMP 2006 took place in Reims, France, and IWOMP 2007 in Beijing, China. Each workshop drew over 60 participants from research and industry throughout the world. IWOMP 2008 continued the series with technical papers, panels, tutorials, and OpenMP status reports. The first IWOMP workshop was organized under the auspices of cOMPunity.

Table of Contents

Frontmatter

Fourth International Workshop on OpenMP, IWOMP 2008

OpenMP Overheads, Hybrid Models

A Microbenchmark Study of OpenMP Overheads under Nested Parallelism
Abstract
In this work we present a microbenchmark methodology for assessing the overheads associated with nested parallelism in OpenMP. Our techniques are based on extensions to the well-known EPCC microbenchmark suite that allow measuring the overheads of OpenMP constructs when they are employed at inner levels of parallelism. The methodology is simple yet powerful, and it has enabled us to gain interesting insight into problems related to implementing and supporting nested parallelism. We measure and compare a number of commercial and freeware compilation systems. Our general conclusion is that while nested parallelism is fortunately supported by many current implementations, the performance of this support is rather problematic. There seem to be issues that have not yet been addressed effectively, as most OpenMP systems do not exhibit a graceful reaction when made to execute inner levels of concurrency.
Vassilios V. Dimakopoulos, Panagiotis E. Hadjidoukas, Giorgos Ch. Philos
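
A minimal illustrative sketch (not the authors' benchmark) of the pattern whose overhead such a microbenchmark measures: an OpenMP construct executed at an inner level of parallelism. The thread counts and the empty body are arbitrary choices for illustration.

#include <cstdio>
#include <omp.h>

int main() {
    omp_set_nested(1);                       // enable nested parallelism, if supported
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(4)      // outer level
    {
        #pragma omp parallel num_threads(4)  // inner level: the construct under test
        {
            // empty body: elapsed time approximates construct overhead
        }
    }
    double t1 = omp_get_wtime();
    std::printf("nested region time: %g s\n", t1 - t0);
    return 0;
}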
CLOMP: Accurately Characterizing OpenMP Application Overheads
Abstract
Despite its ease of use, OpenMP has failed to gain widespread use on large-scale systems, largely due to its failure to deliver sufficient performance. Our experience indicates that the cost of initiating OpenMP regions is simply too high for the desired OpenMP usage scenario of many applications. In this paper, we introduce CLOMP, a new benchmark to characterize this aspect of OpenMP implementations accurately. CLOMP complements the existing EPCC benchmark suite to provide simple, easy-to-understand measurements of OpenMP overheads in the context of application usage scenarios. Our results for several OpenMP implementations demonstrate that CLOMP identifies the amount of work required to compensate for the overheads observed with EPCC. Further, we show that CLOMP also captures limitations for OpenMP parallelization on NUMA systems.
Greg Bronevetsky, John Gyllenhaal, Bronis R. de Supinski
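
A hedged sketch in the spirit of a break-even measurement (not the CLOMP code itself): time the same small loop serially and inside a parallel worksharing region; if the parallel time is larger, region overhead dominates at this problem size. N = 1000 is an arbitrary choice.

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000;                      // deliberately small amount of work
    static double a[N] = {0};

    double t0 = omp_get_wtime();
    for (int i = 0; i < N; ++i) a[i] += 1.0; // serial baseline
    double serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for                 // same work, plus region overhead
    for (int i = 0; i < N; ++i) a[i] += 1.0;
    double parallel = omp_get_wtime() - t0;

    std::printf("serial %g s, parallel %g s\n", serial, parallel);
    return 0;
}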
Detection of Violations to the MPI Standard in Hybrid OpenMP/MPI Applications
Abstract
The MPI standard allows the use of multiple threads per process. The main idea was that an MPI call executed by one thread should not block other threads. In the MPI-2 standard this was refined by introducing the so-called level of thread support, which describes how threads may interact with MPI. Multi-threaded usage is restricted by several rules stated in the MPI standard. In this paper we describe work on an MPI checker called MARMOT [1] to enhance its capabilities towards a verification that ensures that these rules are not violated. A first implementation is capable of detecting violations if they actually occur in a run made with MARMOT. As most of these violations occur due to missing thread synchronization, it is likely that they don't appear in every run of the application. To detect whether there is a run that violates one of the MPI restrictions, it is necessary to analyze the OpenMP usage. Thus we introduce artificial data races that only occur if the application violates one of the MPI rules. By this design, all tools capable of detecting data races can also detect violations of some of the MPI rules. To confirm this idea we used the Intel® Thread Checker.
Tobias Hilbrich, Matthias S. Müller, Bettina Krammer
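
A hedged sketch of the kind of rule violation such a checker targets (not MARMOT's own code): the program requests MPI_THREAD_SERIALIZED but lets all OpenMP threads call MPI concurrently, which the MPI standard forbids at that thread level.

#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

    #pragma omp parallel
    {
        // ERROR under MPI_THREAD_SERIALIZED: multiple threads enter MPI
        // at the same time without synchronization between them (and
        // concurrent collectives on one communicator are invalid anyway).
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}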
Early Experiments with the OpenMP/MPI Hybrid Programming Model
Abstract
The paper describes some very early experiments on new architectures that support the hybrid programming model. Our results are promising in that OpenMP threads interact with MPI as desired, allowing OpenMP-agnostic tools to be used. We explore three environments: a "typical" Linux cluster, a new large-scale machine from SiCortex, and the new IBM BG/P, which have quite different compilers and runtime systems for both OpenMP and MPI. We look at a few simple diagnostic programs and one "application-like" test program. We demonstrate the use of a tool that can examine the detailed sequence of events in a hybrid program and illustrate that a hybrid computation might not always proceed as expected.
Ewing Lusk, Anthony Chan
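
A minimal sketch of the common conforming hybrid pattern such experiments exercise, assuming MPI_THREAD_FUNNELED (only the master thread makes MPI calls); the loop and its bounds are arbitrary illustration.

#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+ : local)  // thread-parallel work
    for (int i = 0; i < 1000000; ++i)
        local += 1.0 / (i + 1.0);

    double global = 0.0;                           // MPI call outside the
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE,     // region: master thread only
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum = %g\n", global);

    MPI_Finalize();
    return 0;
}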

OpenMP for Clusters

First Experiences with Intel Cluster OpenMP
Abstract
MPI and OpenMP are the de-facto standards for distributed-memory and shared-memory parallelization, respectively. By employing a hybrid approach, that is, combining OpenMP and MPI parallelization in one program, a cluster of SMP systems can be exploited. Nevertheless, mixing programming paradigms and writing explicit message-passing code might increase the parallel program development time significantly. Intel Cluster OpenMP is the first commercially available OpenMP implementation for a cluster, aiming to combine the ease of use of the OpenMP parallelization paradigm with the cost efficiency of a commodity cluster. In this paper we present our first experiences with Intel Cluster OpenMP.
Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner
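
A heavily hedged sketch of what a Cluster OpenMP program looks like: as we understand Intel's product, variables shared across nodes had to be marked sharable (the directive spelling below follows Intel's documentation as we recall it and may differ in detail); the rest is ordinary OpenMP.

#include <cstdio>
#include <omp.h>

int sum = 0;
#pragma intel omp sharable(sum)   // place 'sum' in DSM-managed shared memory

int main() {
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 1; i <= 100; ++i)
        sum += i;                 // threads may run on different cluster nodes
    std::printf("sum = %d\n", sum);
    return 0;
}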
Micro-benchmarks for Cluster OpenMP Implementations: Memory Consistency Costs
Abstract
The OpenMP memory model allows a temporary view of shared memory that only needs to be made consistent when barrier or flush directives, including implicit ones, are encountered. While this relaxed memory consistency model is key to developing cluster OpenMP implementations, it means that the memory performance of any given implementation is greatly affected by which memory is used, when it is used, and by which threads. In this work we propose a micro-benchmark that can be used to measure memory consistency costs, and we present results for its application to two contrasting cluster OpenMP implementations, comparing these results with data obtained from a hardware-supported OpenMP environment.
H. J. Wong, J. Cai, A. P. Rendell, P. Strazdins
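
A minimal sketch of the relaxed-consistency behavior whose cost such a micro-benchmark measures: each thread's temporary view is reconciled with memory only at flushes (explicit here, implicit at barriers), exactly the points where a cluster implementation pays its consistency cost. The producer/consumer pairing follows the standard OpenMP flush idiom.

#include <cstdio>
#include <omp.h>

int main() {
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {   // producer
            data = 42;
            #pragma omp flush(data, flag)  // publish data before the flag
            flag = 1;
            #pragma omp flush(flag)
        } else {                           // consumer
            int ready = 0;
            while (!ready) {
                #pragma omp flush(flag)    // re-read the flag from memory
                ready = flag;
            }
            #pragma omp flush(data)        // make the producer's write visible
            std::printf("data = %d\n", data);
        }
    }
    return 0;
}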
Incorporation of OpenMP Memory Consistency into Conventional Dataflow Analysis
Abstract
Current OpenMP compilers are often limited in their analysis and optimization of OpenMP programs by the challenge of incorporating OpenMP memory consistency semantics into conventional data flow algorithms. An important reason for this is that data flow analyses within current compilers traverse the program's control-flow graph (CFG), and the CFG does not accurately model the memory consistency specifications of OpenMP. In this paper, we present techniques to incorporate memory consistency semantics into conventional dataflow analysis by transforming the program's CFG into an OpenMP Producer-Consumer Flow Graph (PCFG), where a path exists from writes to reads of shared data if and only if a dependence is implied by the OpenMP memory consistency model. We present algorithms for these transformations, prove the correctness of these algorithms and discuss a case where this transformation is used.
Ayon Basumallik, Rudolf Eigenmann
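
An illustrative sketch (not the paper's algorithm) of the kind of cross-thread dependence a plain CFG misses: the write below reaches the reads in every thread only through the implicit barrier and flush at the end of the single construct, which is precisely the write-to-read path a PCFG represents explicitly.

#include <cstdio>
#include <omp.h>

int main() {
    int shared_x = 0;
    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        shared_x = 1;         // write; the construct ends with a barrier + flush

        int local = shared_x; // every thread reads after the barrier
        (void)local;
    }
    std::printf("%d\n", shared_x);
    return 0;
}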
STEP: A Distributed OpenMP for Coarse-Grain Parallelism Tool
Abstract
To benefit from distributed architectures, many applications need a coarse-grain parallelisation of their programs. In order to help a non-expert parallel programmer take advantage of this possibility, we have developed a tool called STEP (Système de Transformation pour l'Exécution Parallèle). From a code decorated with OpenMP directives, this source-to-source transformation tool automatically produces another code based on the message-passing programming model. Thus, the programs of the legacy application can easily and reliably evolve without the burden of restructuring the code so as to insert calls to message-passing API primitives. This tool deals with difficulties inherent in coarse-grain parallelisation such as inter-procedural analyses and irregular code.
Daniel Millot, Alain Muller, Christian Parrot, Frédérique Silber-Chaussumier

OpenMP Tasking Models and Extensions

Evaluation of OpenMP Task Scheduling Strategies
Abstract
OpenMP is in the process of adding a tasking model that allows the programmer to specify independent units of work, called tasks, but does not specify how the scheduling of these tasks should be done (although it imposes some restrictions). We have evaluated different scheduling strategies (schedulers and cut-offs) with several applications. We found that work-first schedulers seem to have the best performance, but, because of the restrictions that OpenMP imposes, a breadth-first scheduler is a better choice as the default for an OpenMP runtime.
Alejandro Duran, Julita Corbalán, Eduard Ayguadé
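
A minimal sketch of a depth-based cut-off of the kind such evaluations compare: beyond the threshold depth, the task's if clause evaluates to false, so the task executes immediately (undeferred) instead of being queued for the scheduler. The threshold of 4 and fib(30) are arbitrary.

#include <cstdio>
#include <omp.h>

static long fib(int n, int depth) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a) if(depth < 4)  // cut-off: undeferred past depth 4
    a = fib(n - 1, depth + 1);
    #pragma omp task shared(b) if(depth < 4)
    b = fib(n - 2, depth + 1);
    #pragma omp taskwait                      // wait for both child tasks
    return a + b;
}

int main() {
    long r = 0;
    #pragma omp parallel
    #pragma omp single                        // one thread seeds the task tree
    r = fib(30, 0);
    std::printf("fib(30) = %ld\n", r);
    return 0;
}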
Extending the OpenMP Tasking Model to Allow Dependent Tasks
Abstract
Tasking in OpenMP 3.0 has been conceived to handle the dynamic generation of unstructured parallelism. New directives have been added allowing the user to identify units of independent work (tasks) and to define points at which to wait for the completion of tasks (task barriers). In this paper we propose an extension to allow the runtime detection of dependencies between generated tasks, broadening the range of applications that can benefit from tasking, or improving performance when load balancing or locality are critical issues. Furthermore, the paper describes our proof-of-concept implementation (SMP Superscalar) and shows preliminary performance results on an SGI Altix 4700.
Alejandro Duran, Josep M. Perez, Eduard Ayguadé, Rosa M. Badia, Jesus Labarta
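
The paper's proposed clauses predate standardization, so this sketch instead uses the depend clause that OpenMP 4.0 later adopted for the same idea: the runtime orders tasks according to their declared inputs and outputs, with no explicit task barrier between them.

#include <cstdio>

int main() {
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)               // producer of x
        x = 42;
        #pragma omp task depend(in: x) depend(out: y) // ordered after the producer
        y = x + 1;
        #pragma omp taskwait
        std::printf("y = %d\n", y);
    }
    return 0;
}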
OpenMP Extensions for Generic Libraries
Abstract
This paper proposes extensions to the OpenMP standard to provide first-class support for parallelizing generic libraries such as the C++ Standard Library (SL). Generic libraries are especially known for their efficiency, reusability, and composability. As such, with the advent of ubiquitous parallelism, generic libraries offer an excellent avenue for parallelizing existing applications that use these libraries without requiring the applications to be rewritten. OpenMP, which would be ideal for executing such parallelizations, does not support many of the modern C++ idioms, such as iterators and function objects, that are used extensively in generic libraries. Accordingly, we propose extensions to OpenMP to better support modern C++ idioms to aid in the parallelization of generic libraries and applications built with those libraries.
Prabhanjan Kambadur, Douglas Gregor, Andrew Lumsdaine
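
An illustrative sketch of the gap the paper addresses: idiomatic generic code hides its loop inside an algorithm driven by iterators and a function object, a shape that classic OpenMP worksharing cannot annotate, forcing an index-based rewrite. Doubler is a hypothetical function object for illustration.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Doubler {                  // a function object, a core generic-library idiom
    void operator()(int& x) const { x *= 2; }
};

int main() {
    std::vector<int> v(1000, 1);

    // Idiomatic generic code: the loop is hidden inside the algorithm,
    // so classic OpenMP worksharing cannot be applied to it directly.
    std::for_each(v.begin(), v.end(), Doubler());

    // The OpenMP-friendly form requires rewriting to an index loop:
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v.size()); ++i)
        v[i] *= 2;

    std::printf("v[0] = %d\n", v[0]);
    return 0;
}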
Streams: Emerging from a Shared Memory Model
Abstract
To date, OpenMP has been considered the workhorse for data parallelism and, more recently, task-level parallelism. The model has been one of shared memory working in parallel on arrays of a uniform nature, but many applications do not meet these often restrictive access patterns. With the development of accelerators on the one hand, and the move beyond the node to the cluster on the other, OpenMP's shared memory approach does not easily capture the complex memory hierarchies found in these heterogeneous systems.
Streams provide a natural approach to coupling data with its corresponding access patterns. Data within a stream can be easily and efficiently distributed across complex memory hierarchies, while retaining a shared memory point of view for the application programmer.
In this paper we present a modest extension to OpenMP to support data partitioning and streaming. Rather than add numerous new directives, our approach is to utilize existing streaming technology and extend OpenMP simply to control streams in the context of threading. The integration of streams allows the programmer to easily connect distinct compute components in an efficient manner, supporting both the conventional shared memory model of OpenMP and the transparent integration of local non-shared memory.
Benedict R. Gaster

Applications, Scheduling, Tools

On Multi-threaded Satisfiability Solving with OpenMP
Abstract
The Boolean satisfiability problem (SAT) is a well-known NP-complete problem, which is widely studied because of its conceptual simplicity. Nowadays the number of existing parallel SAT solvers is quite small. Furthermore, they are generally designed for large clusters using the message-passing paradigm. These solvers are coarse-grained applications, since they divide the search tree among the processors to avoid communication and synchronization. In this paper mtss, for Multi-Threaded Sat Solver, is introduced. It is a fine-grained parallel SAT solver for shared memory. It defines a rich thread in charge of the search-tree evaluation and a set of poor threads that help the rich one by simplifying the open node. mtss is well suited to multi-core CPUs since it reduces memory allocation during the search.
Pascal Vander-Swalmen, Gilles Dequen, Michaël Krajecki
Parallelism and Scalability in an Image Processing Application
Abstract
The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor systems-on-chip. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism, and it is difficult to extract more since the application has small data sets and parallelization overhead is relatively high. There is also a fair amount of load imbalance, which is made worse by a non-uniform memory latency. Even so, we show that with some tuning, relative speedups in excess of 9 on a 16-CPU system can be reached.
Morten S. Rasmussen, Matthias B. Stuart, Sven Karlsson
Scheduling Dynamic OpenMP Applications over Multicore Architectures
Abstract
Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines.
In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel region, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the forestGOMP OpenMP platform [TBG+07]. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
François Broquedis, François Diakhaté, Samuel Thibault, Olivier Aumage, Raymond Namyst, Pierre-André Wacrenier
Visualizing the Program Execution Control Flow of OpenMP Applications
Abstract
One important aspect of understanding the behavior of an application with respect to its performance, overhead, and scalability characteristics is knowledge of its control flow. In comparison to sequential applications, the situation is more complicated in multithreaded parallel programs because each thread defines its own independent control flow. On the other hand, for the most common usage models of OpenMP, the threads operate in a largely uniform way, synchronizing frequently at sequence points and diverging only to operate on different data items in worksharing constructs.
This paper presents an approach to capture and visualize the control flow of OpenMP applications in a compact way that does not require a full trace of program execution events but is instead based on a straightforward extension to the data collected by an existing profiling tool.
Karl Fürlinger, Shirley Moore
Backmatter
Metadata
Title
OpenMP in a New Era of Parallelism
Edited by
Rudolf Eigenmann
Bronis R. de Supinski
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-79561-2
Print ISBN
978-3-540-79560-5
DOI
https://doi.org/10.1007/978-3-540-79561-2
