
2011 | Book

Euro-Par 2011 Parallel Processing

17th International Conference, Euro-Par 2011, Bordeaux, France, August 29 - September 2, 2011, Proceedings, Part II

Editors: Emmanuel Jeannot, Raymond Namyst, Jean Roman

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

The two-volume set LNCS 6852/6853 constitutes the refereed proceedings of the 17th International Euro-Par Conference held in Bordeaux, France, in August/September 2011. The 81 revised full papers presented were carefully reviewed and selected from 271 submissions. The papers are organized in topical sections on support tools and environments; performance prediction and evaluation; scheduling and load-balancing; high-performance architectures and compilers; parallel and distributed data management; grid, cluster and cloud computing; peer-to-peer computing; distributed systems and algorithms; parallel and distributed programming; parallel numerical algorithms; multicore and manycore programming; theory and algorithms for parallel computation; high-performance networks and communication; and mobile and ubiquitous computing.

Table of Contents

Frontmatter

Topic 9: Parallel and Distributed Programming

Introduction

Developing parallel or distributed applications is a hard task that requires advanced algorithms, realistic modeling, efficient design tools, high-level programming abstractions, high-performance implementations, and experimental evaluation. Ongoing research in this field emphasizes the design and development of correct, high-performance, portable, and scalable parallel programs. Related to these central needs, important work addresses methods for reusability, performance prediction, large-scale deployment, self-adaptivity, and fault tolerance. Given the rich history in this field, practical applicability of proposed methods, models, algorithms, or techniques is a key requirement for timely research. This topic focuses on parallel and distributed programming in general, except for work specifically targeting multicore and manycore architectures, which has matured into a Euro-Par topic of its own.

Pierre Manneback, Thierry Gautier, Gudula Rünger, Manuel Prieto Matias
Parallel Scanning with Bitstream Addition: An XML Case Study

A parallel scanning method using the concept of bitstream addition is introduced and studied in application to the problem of XML parsing and well-formedness checking. On processors supporting W-bit addition operations, the method can perform up to W finite state transitions per instruction. The method is based on the concept of parallel bitstream technology, in which parallel streams of bits are formed such that each stream comprises bits in one-to-one correspondence with the character code units of a source data stream. Parsing routines are initially prototyped in Python using its native support for unbounded integers to represent arbitrary-length bitstreams. A compiler then translates the Python code into low-level C-based implementations. These low-level implementations take advantage of the SIMD (single-instruction multiple-data) capabilities of commodity processors to yield a dramatic speed-up over traditional alternatives employing byte-at-a-time parsing.

Robert D. Cameron, Ehsan Amiri, Kenneth S. Herdy, Dan Lin, Thomas C. Shermer, Fred P. Popowich
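The core trick behind bitstream addition can be illustrated with Python's unbounded integers, the same prototyping vehicle the abstract mentions. This is a minimal sketch, not the authors' code: the function names (`bitstream`, `scan_thru`) and the tiny XML fragment are mine, chosen only to show how one addition advances many scan markers at once.

```python
# Illustrative sketch: a bitstream is a Python int whose bit i
# corresponds to character position i of the input text.

def bitstream(text, predicate):
    """Build a bitstream whose bit i is set iff predicate(text[i])."""
    bits = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            bits |= 1 << i
    return bits

def scan_thru(markers, charclass):
    """Advance every marker bit through its run of charclass bits.

    A single addition carries each marker across its run in parallel,
    which is the sense in which a W-bit adder performs up to W
    finite-state transitions per instruction.
    """
    return (markers + charclass) & ~charclass

text = "<name>value</name>"
lt = bitstream(text, lambda c: c == "<")          # positions of '<'
name_chars = bitstream(text, lambda c: c.isalnum())

# Markers start just after each '<'; one addition scans past a tag name.
markers = lt << 1
after_names = scan_thru(markers, name_chars)
```

Here the first marker (just after the opening `<`) lands on the `>` at position 5 after a single addition, regardless of the tag-name length.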
HOMPI: A Hybrid Programming Framework for Expressing and Deploying Task-Based Parallelism

This paper presents HOMPI, a framework for programming and executing task-based parallel applications on clusters of multiprocessors and multi-cores, while providing interoperability with existing programming systems such as MPI and OpenMP. HOMPI facilitates expressing irregular and adaptive master-worker and divide-and-conquer applications while avoiding explicit MPI calls. It also allows hybrid shared-memory / message-passing programming, fully exploiting the availability of multiprocessor and multi-core nodes, as it integrates by design with OpenMP; the runtime infrastructure presents a unified substrate that handles local threads and remote tasks seamlessly, allowing both programming flexibility and increased performance opportunities.

Vassilios V. Dimakopoulos, Panagiotis E. Hadjidoukas
A Failure Detector for Wireless Networks with Unknown Membership

The distributed computing scenario is rapidly evolving to integrate self-organizing and dynamic wireless networks. Unreliable failure detectors are classical mechanisms which provide information about process failures and can help systems cope with the high dynamism of these networks. A number of failure detection algorithms have been proposed so far; nonetheless, most of them assume global knowledge of the membership as well as full communication connectivity; additionally, they are timer-based, requiring that some bound on message transmission will eventually hold. These assumptions are no longer appropriate to the new scenario. This paper presents a new failure detector protocol which implements a new class of detectors, namely $\diamondsuit S^{\cal M}$, which adapts the properties of the $\diamondsuit S$ class to a dynamic network with an unknown membership. It has the interesting feature of being time-free, so that it does not rely on timers to detect failures; moreover, it tolerates mobility of nodes and message losses.

Fabíola Greve, Pierre Sens, Luciana Arantes, Véronique Simon
Towards Systematic Parallel Programming over MapReduce

MapReduce is a useful and popular programming model for data-intensive distributed parallel computing. But it is still a challenge to develop parallel programs with MapReduce systematically, since it is usually not easy to derive a proper divide-and-conquer algorithm that matches MapReduce. In this paper, we propose a homomorphism-based framework named Screwdriver for systematic parallel programming with MapReduce, making use of the program calculation theory of list homomorphisms. Screwdriver is implemented as a Java library on top of Hadoop. For any problem which can be resolved by two sequential functions that satisfy the requirements of the third homomorphism theorem, Screwdriver can automatically derive a parallel algorithm as a list homomorphism and transform the initial sequential programs to an efficient MapReduce program. Users need neither to care about parallelism nor to have deep knowledge of MapReduce. In addition to the simplicity of the programming model of our framework, such a calculational approach enables us to resolve many problems that it would be nontrivial to resolve directly with MapReduce.

Yu Liu, Zhenjiang Hu, Kiminori Matsuzaki
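The list-homomorphism idea that Screwdriver automates can be sketched independently of Hadoop. This is an illustrative example, not the Screwdriver API: maximum prefix sum is a classic case where the per-chunk result must carry an extra component (the chunk sum) so that an associative combine exists, which is exactly the shape MapReduce's map (per-chunk fold) and reduce (combine) phases exploit.

```python
# Illustrative sketch of a list homomorphism for maximum prefix sum.
from functools import reduce

def fold_chunk(chunk):
    """Per-chunk result: (total sum, best prefix sum including empty prefix)."""
    s, mps = 0, 0
    for x in chunk:
        s += x
        mps = max(mps, s)
    return (s, mps)

def combine(a, b):
    """Associative combine: prefixes reaching into the right chunk are
    offset by the total sum of the left chunk."""
    (s1, m1), (s2, m2) = a, b
    return (s1 + s2, max(m1, s1 + m2))

data = [3, -4, 2, 2, -5, 1]
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]   # "map" phase
total, best = reduce(combine, map(fold_chunk, chunks))      # "reduce" phase
```

Because `combine` is associative, the chunked computation agrees with the sequential fold over the whole list, no matter how the input is split, which is the property a MapReduce execution needs.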
Correlated Set Coordination in Fault Tolerant Message Logging Protocols

Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small, even under the most optimistic assumptions. Message logging, one of the most scalable checkpoint-restart techniques, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes, but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.

Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack J. Dongarra

Topic 10: Parallel Numerical Algorithms

Introduction

The solution of Computational Science problems relies on the availability of accurate and efficient numerical algorithms and software capable of harnessing the processing power of modern parallel and distributed computers. Such algorithms and software make it possible to prototype and develop new large-scale applications, as well as to improve existing ones, by including up-to-date numerical methods, or well-assessed ones re-designed in the light of the new architectures.

Martin Berzins, Daniela di Serafino, Martin Gander, Luc Giraud
A Bit-Compatible Parallelization for ILU(k) Preconditioning

ILU(k) is a commonly used preconditioner for iterative linear solvers for sparse, non-symmetric systems. It is often preferred for the sake of its stability. We present TPILU(k), the first efficiently parallelized ILU(k) preconditioner that maintains this important stability property. Even better, TPILU(k) preconditioning produces an answer that is bit-compatible with sequential ILU(k) preconditioning. In terms of performance, TPILU(k) preconditioning is shown to run faster whenever more cores are made available to it, while continuing to be as stable as sequential ILU(k). This is in contrast to some competing methods that may become unstable if the degree of thread parallelism is raised too far. Where Block Jacobi ILU(k) fails in an application, it can be replaced by TPILU(k) in order to maintain good performance, while also achieving full stability. As a further optimization, TPILU(k) offers an optional level-based incomplete inverse method as a fast approximation for the original ILU(k) preconditioned matrix. Although this enhancement is not bit-compatible with classical ILU(k), it is bit-compatible with the output of the single-threaded version of the same algorithm. In experiments on a 16-core computer, the enhanced TPILU(k)-based iterative linear solver performed up to 9 times faster. As we approach an era of many-core computing, the ability to efficiently take advantage of many cores will become ever more important.

Xin Dong, Gene Cooperman
Parallel Inexact Constraint Preconditioners for Saddle Point Problems

In this paper we propose a parallel implementation of the FSAI preconditioner to accelerate the PCG method in the solution of symmetric positive definite linear systems of very large size. This preconditioner is used as a building block for the construction of an indefinite Inexact Constraint Preconditioner (ICP) for saddle point-type linear systems arising from Finite Element (FE) discretization of 3D coupled consolidation problems. The FSAI-ICP preconditioner, based on an efficient approximation of the inverse of the (1,1) block, proves very effective in the acceleration of the BiCGSTAB iterative solver in parallel environments. Numerical results on a number of realistic test cases of size up to $6 \times 10^6$ unknowns and $3 \times 10^8$ nonzeros show the almost perfect scalability of the overall code up to 512 processors.

Luca Bergamaschi, Angeles Martinez
Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms

Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon's algorithm and be faster in practice. "3D" algorithms arrange the $p$ processors in a 3D array, and store redundant copies of the matrices on each of $p^{1/3}$ layers. "2D" algorithms such as Cannon's algorithm store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store $c$ copies of the data, for any $c \in \{1, 2, \ldots, \lfloor p^{1/3} \rfloor\}$, to reduce the bandwidth cost of Cannon's algorithm by a factor of $c^{1/2}$ and the latency cost by a factor of $c^{3/2}$. We also show that these costs reach the lower bounds, modulo polylog($p$) factors. We introduce a novel algorithm for 2.5D LU decomposition. To the best of our knowledge, this LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. Our 2.5D LU algorithm uses communication-avoiding pivoting, a stable alternative to partial pivoting. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while $c$ copies of the data can also reduce the bandwidth by a factor of $c^{1/2}$, the latency must increase by a factor of $c^{1/2}$, so that the 2D LU algorithm ($c = 1$) in fact minimizes latency. We provide implementations and performance results for 2D and 2.5D versions of all the new algorithms. Our results demonstrate that 2.5D matrix multiplication and LU algorithms strongly scale more efficiently than 2D algorithms. Each of our 2.5D algorithms performs over 2X faster than the corresponding 2D algorithm for certain problem sizes on 65,536 cores of a BG/P supercomputer.

Edgar Solomonik, James Demmel
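The bandwidth/latency tradeoff stated in the abstract can be written as a small cost model. This is a sketch using only the asymptotic factors quoted above, not the paper's exact constants: the function name and the per-message/per-word parameters `alpha` and `beta` are illustrative.

```python
# Illustrative cost model for 2.5D matrix multiplication: relative to a
# 2D (Cannon-style) run on p processors, storing c copies divides the
# bandwidth cost by sqrt(c) and the latency cost by c**1.5 (c = 1 is 2D).
from math import sqrt

def matmul_costs_25d(n, p, c, alpha=1.0, beta=1.0):
    """Return (latency_cost, bandwidth_cost) for n x n matmul.

    alpha = cost per message, beta = cost per word (both illustrative).
    """
    words = n * n / sqrt(c * p)     # words moved: factor sqrt(c) below 2D
    msgs = sqrt(p) / (c ** 1.5)     # messages: factor c**1.5 below 2D
    return alpha * msgs, beta * words

lat2d, bw2d = matmul_costs_25d(n=4096, p=1024, c=1)
lat25d, bw25d = matmul_costs_25d(n=4096, p=1024, c=4)
```

With $c = 4$, the model shows bandwidth cost dropping by $\sqrt{4} = 2$ and latency cost by $4^{3/2} = 8$, matching the factors claimed in the abstract.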

Topic 11: Multicore and Manycore Programming

Introduction

Modern multicore and manycore systems offer impressive performance for various applications. However, achieving this performance is a challenging task. While multicore and manycore processors alleviate several problems related to single-core processors (known as the memory wall, power wall, and instruction-level parallelism wall), they raise the issue of the programmability wall. The multicore and manycore programmability wall calls for new parallel programming methods and tools. Therefore, this topic focuses on novel solutions for efficient programming of multicore and manycore processors in the context of general-purpose and embedded systems.

Sabri Pllana, Jean-François Méhaut, Eduard Ayguade, Herbert Cornelius, Jacob Barhen
Hardware and Software Tradeoffs for Task Synchronization on Manycore Architectures

Manycore architectures – hundreds to thousands of cores per processor – are seen by many as a natural evolution of multicore processors. To take advantage of this massive parallelism in practice requires a productive parallel programming model, and an efficient runtime for the scheduling and coordination of concurrent tasks. A critical prerequisite for an efficient runtime is a scalable synchronization mechanism to support task coordination at different levels of granularity.

This paper describes the implementation of a high-level synchronization construct called phasers on the IBM Cyclops64 manycore processor, and compares phasers to lower-level synchronization primitives currently available to Cyclops64 programmers. Phasers support synchronization of dynamic tasks by allowing tasks to register and deregister with a phaser object. They provide a general unification of point-to-point and collective synchronization with easy-to-use interfaces, thereby offering productivity advantages over hardware primitives when used on manycores. We have experimented with several approaches to phaser implementation using software, hardware, and a combination of both to explore their portability and performance. The results show that a highly-optimized phaser implementation delivers performance comparable to that obtained with lower-level synchronization primitives. We also demonstrate the success of the hardware optimizations proposed for phasers.

Yonghong Yan, Sanjay Chatterjee, Daniel A. Orozco, Elkin Garcia, Zoran Budimlić, Jun Shirako, Robert S. Pavel, Guang R. Gao, Vivek Sarkar
OpenMPspy: Leveraging Quality Assurance for Parallel Software

OpenMP is widely used in practice to create parallel software; however, software quality assurance tool support is still immature. OpenMPspy introduces a new approach, with a short-term and a long-term perspective, to help software engineers write better parallel programs in OpenMP. On the one hand, OpenMPspy acts like an online debugger that statically detects incorrect construct usage and reports problems while programmers are typing code in Eclipse. We detect simple slips as well as more complex anti-patterns that can lead to correctness and performance problems. On the other hand, OpenMPspy can aggregate statistics about OpenMP language usage and bug patterns from many projects. Insights generated from such data help OpenMP language designers improve the usability of constructs and reduce error potential, thus enhancing parallel software quality in the long run. Using OpenMPspy, this paper presents one of the first detailed empirical studies of over 40 programs with more than 4 million lines of code, showing how OpenMP constructs are actually used in practice. Our results reveal that constructs believed to be frequently used are actually rarely used. Our insights give OpenMP language and compiler designers a clearer picture of where to focus efforts for future improvements.

Victor Pankratius, Fabian Knittel, Leonard Masing, Martin Walser
A Generic Parallel Collection Framework

Most applications manipulate structured data. Modern languages and platforms provide collection frameworks with basic data structures like lists, hashtables and trees. These data structures have a range of predefined operations which include mapping, filtering or finding elements. Such bulk operations traverse the collection and process the elements sequentially. Their implementation relies on iterators, which are not applicable to parallel operations due to their sequential nature.

We present an approach to parallelizing collection operations in a generic way, used to factor out common parallel operations in collection libraries. Our framework is easy to use and straightforward to extend to new collections. We show how to implement concrete parallel collections such as parallel arrays and parallel hash maps, proposing an efficient solution to parallel hash map construction. Finally, we give benchmarks showing the performance of parallel collection operations.

Aleksandar Prokopec, Phil Bagwell, Tiark Rompf, Martin Odersky
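The split-process-combine pattern that underlies parallel collection frameworks can be sketched in a few lines. This is an illustrative example, not the paper's Scala framework: the function `par_filter` and its chunking scheme are mine, standing in for the generic splitter/combiner machinery the abstract describes.

```python
# Illustrative sketch of a parallel bulk operation on a collection:
# split the collection into chunks, process chunks in a thread pool,
# then combine the partial results in order.
from concurrent.futures import ThreadPoolExecutor

def par_filter(pred, xs, chunks=4):
    step = max(1, (len(xs) + chunks - 1) // chunks)
    parts = [xs[i:i + step] for i in range(0, len(xs), step)]   # split
    with ThreadPoolExecutor() as pool:                          # process
        results = list(pool.map(lambda part: [x for x in part if pred(x)],
                                parts))
    combined = []
    for r in results:                                           # combine
        combined += r
    return combined

evens = par_filter(lambda x: x % 2 == 0, list(range(10)))
```

The point of factoring the operation this way is that only the split and combine steps depend on the concrete collection type; the parallel scheduling in the middle is shared by every bulk operation.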
Progress Guarantees When Composing Lock-Free Objects

Highly concurrent and reliable data objects are vital for parallel programming. Lock-free shared data objects are highly concurrent and guarantee that at least one operation, from a set of concurrently executed operations, finishes after a finite number of steps regardless of the state of the other operations. Lock-free data objects provide progress guarantees at the object level. In this paper, we first examine the progress guarantees provided by lock-free shared data objects that have been constructed by composing other lock-free data objects. We observe that although lock-free data objects are composable with respect to linearizability, they are not composable with respect to progress guarantees. More specifically, we show that when a lock-free data object is used as a component (is shared) by two or more lock-free data objects concurrently, these objects can no longer guarantee lock-free progress. This makes it impossible for programmers to directly compose lock-free data objects and guarantee lock-freedom. To help programmability in concurrent settings, this paper presents a new synchronization mechanism for composing lock-free data objects. The proposed mechanism provides an interface to be used when calling a lock-free object from other lock-free objects, and guarantees lock-free progress for every object constructed. We also present an experimental evaluation of the performance cost that the new mechanism introduces in exchange for these progress guarantees.

Nhan Nguyen Dang, Philippas Tsigas
Engineering a Multi-core Radix Sort

We present a fast radix sorting algorithm that builds upon a microarchitecture-aware variant of counting sort. Taking advantage of virtual memory and making use of write-combining yields a per-pass throughput corresponding to at least 89% of the system’s peak memory bandwidth. Our implementation outperforms Intel’s recently published radix sort by a factor of 1.64. It also compares favorably to the reported performance of an algorithm for Fermi GPUs when data-transfer overhead is included. These results indicate that scalar, bandwidth-sensitive sorting algorithms remain competitive on current architectures. Various other memory-intensive applications can benefit from the techniques described herein.

Jan Wassenberg, Peter Sanders
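The counting-sort core that a radix sort builds on is easy to show in schematic form. This is an illustrative sketch only: the paper's contribution is a microarchitecture-aware variant using virtual memory and write-combining, which plain Python cannot express, so the code below shows just the histogram / prefix-sum / scatter structure of each pass.

```python
# Illustrative LSD radix sort built on counting-sort passes (schematic;
# not the paper's bandwidth-optimized implementation).
def radix_sort(keys, radix_bits=8, key_bits=32):
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        counts = [0] * (mask + 1)
        for k in keys:                       # 1) histogram the current digit
            counts[(k >> shift) & mask] += 1
        pos, start = [0] * (mask + 1), 0
        for d in range(mask + 1):            # 2) exclusive prefix sum -> offsets
            pos[d], start = start, start + counts[d]
        out = [0] * len(keys)
        for k in keys:                       # 3) stable scatter by digit
            d = (k >> shift) & mask
            out[pos[d]] = k
            pos[d] += 1
        keys = out
    return keys

sorted_keys = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```

The scatter in step 3 is the bandwidth-bound phase; the paper's write-combining optimization targets exactly these scattered stores.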
Accelerating Code on Multi-cores with FastFlow

FastFlow is a programming framework specifically targeting cache-coherent shared-memory multi-cores. It is implemented as a stack of C++ template libraries built on top of lock-free (and memory-fence-free) synchronization mechanisms. Its philosophy is to combine programmability with performance. This paper presents a new FastFlow programming methodology aimed at parallelizing existing sequential code by offloading it onto a dynamically created software accelerator. The new methodology has been validated using a set of simple micro-benchmarks and some real applications.

Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, Massimo Torquati
A Novel Shared-Memory Thread-Pool Implementation for Hybrid Parallel CFD Solvers

The Computational Fluid Dynamics (CFD) solver TAU for unstructured grids is widely used in the European aerospace industry. TAU runs on High-Performance Computing (HPC) clusters with several thousand cores using MPI-based domain decomposition. In order to make more efficient use of current multi-core CPUs and to prepare TAU for the many-core era, a shared-memory parallelization has been added to one of TAU's solvers to obtain a hybrid parallelization: MPI-based domain decomposition plus multi-threaded processing of a domain.

For the edge-based solver considered, a simple loop-based approach via OpenMP FOR directives would – due to the Amdahl trap – not deliver the required speed-up. A more sophisticated, thread-pool-based shared-memory parallelization has been developed which allows for a relaxed thread synchronization with automatic and dynamic load balancing.

In this paper we describe the concept behind this shared-memory parallelization and explain how the multi-threaded computation of a domain works. Some details of its implementation in TAU, as well as some first performance results, are presented. We emphasize that the concept is not TAU-specific; in fact, this design pattern appears to be quite generic and may well be applied to other grid/mesh/graph-based codes.

Jens Jägersküpper, Christian Simmendinger
A Fully Empirical Autotuned Dense QR Factorization for Multicore Architectures

Tuning numerical libraries has become more difficult over time, as systems get more sophisticated. In particular, modern multicore machines make the behaviour of algorithms hard to forecast and model. In this paper, we tackle the issue of tuning a dense QR factorization on multicore architectures using a fully empirical approach. We exhibit a few strong empirical properties that enable us to efficiently prune the search space. Our method is automatic, fast and reliable. The tuning process is fully performed at install time in less than one hour and ten minutes on five out of seven platforms. We achieve an average performance varying from 97% to 100% of the optimum performance depending on the platform. This work is a basis for autotuning the PLASMA library and enabling easy performance portability across hardware systems.

Emmanuel Agullo, Jack Dongarra, Rajib Nath, Stanimire Tomov
Parallelizing a Real-Time Physics Engine Using Transactional Memory

The simulation of the dynamics and kinematics of solid bodies is an important problem in a wide variety of fields in computing ranging from animation and interactive environments to scientific simulations. While rigid body simulation has a significant amount of potential parallelism, efficiently synchronizing irregular accesses to the large amount of mutable shared data in such programs remains a hurdle. There has been a significant amount of interest in transactional memory systems for their potential to alleviate some of the problems associated with fine-grained locking and more broadly for writing correct and efficient parallel programs. While results so far are promising, the effectiveness of TM systems has so far been predominantly evaluated on small benchmarks and kernels.

In this paper we present our experiences in parallelizing ODE, a real-time physics engine that is widely used in commercial and open-source games. Rigid body simulation in ODE consists of two main phases that are amenable to effective coarse-grained parallelization and that are also suitable for using transactions to orchestrate shared-data synchronization. We found ODE to be a good candidate for applying parallelism and transactions: it is a large real-world application, it contains a large amount of potential parallelism, it exhibits irregular access patterns, and the amount of contention may vary at runtime. We present an experimental evaluation of our implementation of the parallel transactional ODE engine that shows speedups of up to 1.27x relative to the sequential version.

Jaswanth Sreeram, Santosh Pande

Topic 12: Theory and Algorithms for Parallel Computation

Introduction

Parallelism permeates all levels of current computing systems. It can be observed in systems as varied as multiple single-CPU machines, large server farms, and geographically dispersed “volunteers” who collaborate over the Internet. The effective use of parallelism depends crucially on the availability of faithful, yet tractable, models of computation for algorithm design and analysis and of efficient strategies for solving key computational problems on prominent classes of computing platforms. No less important are good models of the way the different components/subsystems of a platform are interconnected.

Kunal Agrawal, Panagiota Fatourou, Arnold L. Rosenberg, Frédéric Vivien
Petri-nets as an Intermediate Representation for Heterogeneous Architectures

Many modern systems provide heterogeneous parallelism, for example NUMA multi-core processors and CPU-GPU combinations. Placement, scheduling and indeed algorithm choices affect the overall execution time and, for portable programs, must adapt to the target machine at either load time or run time. We see these choices as preserving I/O determinism but exposing performance non-determinism. We use Petri nets as an intermediate representation for programs to give a unified view of all forms of performance non-determinism, including some scenarios which other models cannot support. Although finding optimum executions in these nets is NP-hard, efficient heuristics for approximating them would lead to performant, portable execution across arbitrary heterogeneous architectures.

Peter Calvert, Alan Mycroft
A Bi-Objective Scheduling Algorithm for Desktop Grids with Uncertain Resource Availabilities

In this work, we consider the execution of applications on desktop grids. Such parallel systems use idle computing resources of desktops distributed over the Internet for running massively parallel computations. The applications are composed of workflows of independent non-preemptive sequential jobs that are submitted by successive batches. Then, the corresponding jobs are executed on the distributed available resources according to some scheduling policy.

However, most resources are not continuously available over time, since users donate their idle CPU time only during periods when they are not using their desktops. Moreover, even if the dates of unavailability periods are estimated in advance, they are subject to uncertainties. This may drastically impact overall performance by delaying the completion time of the applications.

The aim of this paper is to study how to efficiently schedule a set of jobs in the presence of unavailability periods on identical machines. At the same time, we are interested in reducing the impact of disturbances to the unavailability periods. This is achieved by maximizing the stability, which measures the distance between the makespan of the disturbed instance and that of the initial one. Our main contribution is the design of a new parametrized algorithm and the analysis of its performance through structural properties. This algorithm reduces the impact of disturbances on availability periods without worsening the makespan too much. Its interest is assessed by running simulations based on realistic workflows. Moreover, theoretical results are obtained under the assumption that the size of every availability interval is at least twice the size of the largest job.

Louis-Claude Canon, Adel Essafi, Grégory Mounié, Denis Trystram
New Multithreaded Ordering and Coloring Algorithms for Multicore Architectures

We present new multithreaded vertex ordering and distance-k graph coloring algorithms that are well-suited for multicore platforms. The vertex ordering techniques rely on various notions of "degree", are known to be effective in reducing the number of colors used by a greedy coloring algorithm, and are generic enough to be applicable to contexts other than coloring. We employ approximate degree computation in the ordering algorithms, and speculation and iteration in the coloring algorithms, as our primary tools for breaking sequentiality and achieving effective parallelization. The algorithms have been implemented using OpenMP, and experiments conducted on Intel Nehalem and other multi-core machines using various types of graphs attest that the algorithms provide scalable runtime performance. The number of colors the algorithms use is often close to optimal. The techniques used for computing the ordering and coloring in parallel are applicable to other problems where there is an inherent ordering to the computations that needs to be relaxed to increase concurrency.

Md. Mostofa Ali Patwary, Assefaw H. Gebremedhin, Alex Pothen
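The sequential core that the paper parallelizes is a greedy coloring applied in a degree-based vertex order. This is an illustrative sketch of that baseline only (distance-1 coloring, exact degrees, largest-degree-first); the paper's contribution, approximate degrees plus speculative parallel coloring with iterative conflict repair, is not shown. The function name and the tiny example graph are mine.

```python
# Illustrative sequential baseline: greedy distance-1 coloring
# in a largest-degree-first vertex order.
def greedy_color(adj):
    """adj: dict mapping vertex -> set of neighbors. Returns vertex -> color."""
    order = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    color = {}
    for v in order:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:       # smallest color unused by colored neighbors
            c += 1
        color[v] = c
    return color

# A 4-cycle with one chord: contains triangles, so 3 colors are needed.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
coloring = greedy_color(graph)
```

In the parallel versions, many vertices are colored speculatively at once and any resulting conflicts (adjacent vertices receiving the same color) are detected and re-colored in a subsequent iteration.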

Topic 13: High Performance Networks and Communication

Introduction

The Euro-Par topic on high-performance networks and communications is devoted to communication issues in scalable compute and storage systems, such as tightly coupled parallel computers, clusters, and networks of workstations, including hierarchical and hybrid designs featuring several levels of possibly different interconnects. All aspects of communication in modern compute and storage systems are of interest, for example advances in the design, implementation, and evaluation of interconnection networks, network interfaces, system and storage area networks, on-chip interconnects, communication protocols and interfaces, routing and communication algorithms, and communication aspects of parallel and distributed algorithms.

Jesper Larsson Träff, Brice Goglin, Ulrich Bruening, Fabrizio Petrini
Kernel-Based Offload of Collective Operations – Implementation, Evaluation and Lessons Learned

Optimized implementations of blocking and nonblocking collective operations are crucial for scalable high-performance applications. Offloading such collective operations into the communication layer can improve performance and asynchronous progression of the operations. However, it is important that such offloading schemes remain flexible in order to support user-defined (sparse neighbor) collective communications. In this work, we describe an operating-system kernel-based architecture for implementing an interpreter for the flexible Group Operation Assembly Language (GOAL) framework to offload collective communications. We describe an optimized scheme to store the schedules that define the collective operations, and show an extension to profile the performance of the kernel layer. Our microbenchmarks demonstrate the effectiveness of the approach, and we show performance improvements over traditional progression in user space. We also discuss complications with the design and offloading strategies in general.

Timo Schneider, Sven Eckelmann, Torsten Hoefler, Wolfgang Rehm
A High Performance Superpipeline Protocol for InfiniBand

InfiniBand high-performance networks require that the buffers used for sending or receiving data be registered. Since memory registration is an expensive operation, some communication libraries use caching (rcache) to amortize its cost, and copy data into pre-registered buffers for small messages. In this paper, we present a software protocol for InfiniBand that always uses a memory copy, and amortizes the cost of this copy with a superpipeline that overlaps the memory copy and the RDMA transfer. We propose a performance model of our protocol to study its behavior and optimize its parameters. We have implemented our protocol in the NewMadeleine communication library. The results of MPI benchmarks show a significant improvement for cache-unfriendly applications that do not repeatedly reuse the same memory blocks, without degradation for cache-friendly applications.

Alexandre Denis

Topic 14: Mobile and Ubiquitous Computing

Introduction

The tremendous advances in wireless networks, mobile computing, and sensor networks, along with the rapid growth of small, portable, and powerful computing devices, offer opportunities for pervasive computing and communications. Topic 14 deals with cutting-edge research in various aspects of the theory and practice of mobile computing and wireless and mobile networking, including architectures, algorithms, networks, protocols, modeling and performance, applications, services, and data management.

Eric Fleury, Qi Han, Pedro Marron, Torben Weis
ChurnDetect: A Gossip-Based Churn Estimator for Large-Scale Dynamic Networks

With the ever-increasing scale of dynamic wireless networks (such as MANETs, WSNs, and VANETs), there is a growing need to perform aggregate computations, such as online detection of network churn, via distributed, robust, and scalable algorithms. In this paper we introduce the ChurnDetect algorithm, a novel solution to the distributed churn estimation problem. Our solution consists of a gossip-based algorithm that incorporates a periodic reset mechanism (introduced as DiffusionReset). The main difference from the existing state of the art is that ChurnDetect requires nodes neither to advertise their departure from the network nor to detect neighbors leaving it. In our solution, all nodes interact with each other wirelessly using a gossip-like approach, thus keeping the message complexity to a minimum. We use only easily accessible information (i.e., about new nodes joining the network) rather than presuming knowledge of nodes leaving the system, since that is infeasible for most distributed applications. We provide convergence proofs for ChurnDetect and present a number of results based on simulations and an implementation on our local testbed. We characterize the performance of the algorithm, showcasing its distributed, lightweight nature. The analysis leads to the conclusion that ChurnDetect is an attractive alternative to existing work on online churn estimation for dynamic wireless networks.
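Gossip-based aggregation of the kind ChurnDetect builds on can be illustrated with the classic push-sum protocol, in which every node repeatedly splits its state with a random peer and all local estimates converge to the global average. This is a generic textbook sketch, not the paper's DiffusionReset mechanism.

```python
import random

# Push-sum gossip averaging: each node holds a (sum, weight) pair and each
# round pushes half of its pair to one uniformly random peer. The ratio
# sum/weight at every node converges to the global average of the inputs.
# The RNG is seeded only to make this sketch reproducible.

def gossip_average(values, rounds, rng=None):
    rng = rng or random.Random(0)
    sums = list(map(float, values))
    weights = [1.0] * len(values)
    for _ in range(rounds):
        for i in range(len(values)):
            j = rng.randrange(len(values))  # random peer (possibly self)
            half_s, half_w = sums[i] / 2, weights[i] / 2
            sums[i], weights[i] = half_s, half_w  # keep half ...
            sums[j] += half_s                     # ... push half
            weights[j] += half_w
    return [s / w for s, w in zip(sums, weights)]
```

Mass conservation (the totals of sums and weights never change) is what guarantees that the common limit of the ratios is exactly the true average.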

Andrei Pruteanu, Venkat Iyer, Stefan Dulman

Topic 15: High-Performance and Scientific Applications

Introduction

As demand for high-resolution and high-fidelity modeling and simulation increases, desktop computers and small clusters of processors have proven insufficient to carry out the calculations in many scientific, engineering, and industrial applications. Indeed, many such applications typically require a significant amount of computing time or need to process large amounts of data. The High-Performance and Scientific Applications topic highlights recent progress in the use of high-performance parallel and scientific computing, with an emphasis on successes, advances, and lessons learned in the development and implementation of novel scientific, engineering, and industrial applications.

Olivier Coulaud, Kengo Nakajima, Esmond G. Ng, Mariano Vazquez
Real Time Contingency Analysis for Power Grids

Modern power grids are continuously monitored by trained system operators equipped with sophisticated monitoring and control systems. Despite such precautionary measures, large blackouts that affect more than a million consumers occur quite frequently. To prevent such blackouts, it is important to perform high-order contingency analysis in real time. However, contingency analysis is computationally very expensive, as many different combinations of power system component failures must be analyzed. Analyzing several million such possible combinations can take an inordinately long time, and it is not possible for conventional systems to predict blackouts in time to take the necessary corrective actions.

To address this issue, we present a scalable parallel implementation of a probabilistic contingency analysis scheme that processes only the most severe and most probable contingencies. We evaluate our implementation by analyzing the benchmark IEEE 300-bus and 118-bus test grids. We perform contingency analysis up to level eight (contingency chains of length eight) and can correctly predict blackouts in real time with a high degree of accuracy. To the best of our knowledge, this is the first implementation of real-time contingency analysis beyond level two.

Anshul Mittal, Jagabondhu Hazra, Nikhil Jain, Vivek Goyal, Deva P. Seetharam, Yogish Sabharwal
CRSD: Application Specific Auto-tuning of SpMV for Diagonal Sparse Matrices

Sparse matrix-vector multiplication (SpMV) is an important computational kernel in scientific applications. Its performance depends highly on the nonzero distribution of the sparse matrices. In this paper, we propose a new storage format for diagonal sparse matrices, named Compressed Row Segment with Diagonal-pattern (CRSD). We design diagonal patterns to represent the diagonal distribution. As the diagonal distributions are similar within matrices from one application, some diagonal patterns remain unchanged. First, we sample one matrix to obtain the unchanged diagonal patterns. Next, optimal SpMV codelets are generated automatically for those diagonal patterns. Finally, we combine the generated codelets into the optimal SpMV implementation. In addition, the information collected during the auto-tuning process is also utilized in the parallel implementation to achieve load balance. Experimental results demonstrate that the speedup reaches up to 2.37 (1.70 on average) in comparison with DIA and 4.60 (2.10 on average) in comparison with CSR under the same number of threads on two mainstream multi-core platforms.
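For context, SpMV in the baseline DIA (diagonal) format that CRSD is compared against can be sketched as follows: each stored diagonal is a dense array plus an offset from the main diagonal. This is a generic illustration of the baseline, not the paper's CRSD code.

```python
# SpMV for an n x n matrix stored in DIA format: diagonals[k][i] holds
# A[i, i + offsets[k]], zero-padded where i + offsets[k] falls outside
# the matrix. Computes y = A @ x.

def dia_spmv(n, offsets, diagonals, x):
    y = [0.0] * n
    for off, diag in zip(offsets, diagonals):
        for i in range(n):
            j = i + off
            if 0 <= j < n:          # skip padded entries
                y[i] += diag[i] * x[j]
    return y
```

DIA wastes storage when diagonals are ragged, which is one motivation for pattern-aware formats such as CRSD.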

Xiangzheng Sun, Yunquan Zhang, Ting Wang, Guoping Long, Xianyi Zhang, Yan Li
The LOFAR Beam Former: Implementation and Performance Analysis

Traditional radio telescopes use large, steel dishes to observe radio sources. The LOFAR radio telescope is different, and uses tens of thousands of fixed, non-movable antennas instead, a novel design that promises ground-breaking research in astronomy. The antennas observe omnidirectionally, and sky sources are observed by signal-processing techniques that combine the data from all antennas.

Another new feature of LOFAR is the elaborate use of software to do signal processing in real time, where traditional telescopes use custom-built hardware. The use of software leads to an instrument that is inherently more flexible. However, the enormous data rate (198 Gb/s of input data) and processing requirements compel the use of a supercomputer: we use an IBM Blue Gene/P.

This paper presents a collection of new processing pipelines, collectively called the beam-forming pipelines, that greatly enhance the functionality of the telescope. Where our first pipeline could only correlate data to create sky images, the new pipelines allow the discovery of unknown pulsars, observations of known pulsars, and, in the future, observations of cosmic rays and transient events. Unlike traditional telescopes, we can observe in hundreds of directions simultaneously. This is useful, for example, to search the sky for new pulsars. The use of software allows us to quickly add new functionality and to adapt to new insights that fully exploit the novel features and the power of our unique instrument. We also describe our optimisations to use the Blue Gene/P at very high efficiencies, maximising the effectiveness of the entire telescope. A thorough performance study identifies the limits of our system.

Jan David Mol, John W. Romein
Application-Specific Fault Tolerance via Data Access Characterization

Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application’s execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.

Nawab Ali, Sriram Krishnamoorthy, Niranjan Govind, Karol Kowalski, Ponnuswamy Sadayappan
High-Performance Numerical Optimization on Multicore Clusters

This paper presents a software infrastructure for high-performance numerical optimization on clusters of multicore systems. At the core, a runtime system implements a programming and execution environment for irregular and adaptive task-based parallelism. Building on this, we extract and exploit the parallelism of a global optimization application at multiple levels, including Hessian calculations and Newton-based local optimizations. We discuss parallel implementation details and task distribution schemes for managing nested parallelism. Finally, we report experimental performance results for all the components of our software system on a multicore cluster.

Panagiotis E. Hadjidoukas, Constantinos Voglis, Vassilios V. Dimakopoulos, Isaac E. Lagaris, Dimitris G. Papageorgiou
Parallel Monte-Carlo Tree Search for HPC Systems

Monte-Carlo Tree Search (MCTS) is a simulation-based search method that has brought great success to applications such as Computer Go in the past few years. The power of MCTS strongly depends on the number of simulations computed per time unit and the amount of memory available to store data gathered during simulation. High-performance computing systems such as large compute clusters provide vast computation and memory resources and thus seem to be natural targets for running MCTS. However, so far only a few publications deal with parallelizing MCTS for distributed-memory machines. In this paper, we present a novel approach to the parallelization of MCTS that allows for an equal spreading of both the work and the memory load among all compute nodes within a distributed-memory HPC system. We describe our approach, termed UCT-Treesplit, and evaluate its performance on the example of a state-of-the-art Go engine.
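The UCT rule that gives UCT-Treesplit its name selects, at each tree node, the child maximising mean reward plus an exploration bonus. A minimal single-node sketch of that selection step (the paper's contribution, distributing the tree across nodes, is beyond this illustration):

```python
import math

# UCT child selection: score = mean reward + c * sqrt(ln(N) / n), where N is
# the parent's visit count and n the child's. Unvisited children win outright.

def uct_select(children, c=1.4142):
    """children: list of (visits, total_reward) pairs; returns the index of
    the child to descend into."""
    parent_visits = sum(v for v, _ in children)
    best, best_score = 0, float("-inf")
    for idx, (visits, reward) in enumerate(children):
        if visits == 0:
            return idx              # always expand unvisited children first
        score = reward / visits + c * math.sqrt(math.log(parent_visits) / visits)
        if score > best_score:
            best, best_score = idx, score
    return best
```

The exploration constant `c` trades off exploiting high-mean children against revisiting rarely-tried ones; its value here is the common sqrt(2) default, not a value from the paper.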

Tobias Graf, Ulf Lorenz, Marco Platzner, Lars Schaefers
Petascale Block-Structured AMR Applications without Distributed Meta-data

Adaptive mesh refinement (AMR) applications to solve partial differential equations (PDE) are very challenging to scale efficiently to the petascale regime.

We describe optimizations to the Chombo AMR framework that enable it to scale efficiently to petascale on the Cray XT5. We describe an example of a hyperbolic solver (inviscid gas dynamics) and a matrix-free geometric multigrid elliptic solver. Both show good weak scaling to 131K processors without any thread-level or SIMD vector parallelism.

This paper describes the algorithms used to compress the Chombo metadata and the optimizations of the Chombo infrastructure that are necessary for this scaling result. Achieving petascale performance without distributing the metadata is a significant advance that allows for much simpler and faster AMR codes.

Brian Van Straalen, Phil Colella, Daniel T. Graves, Noel Keen
Accelerating Anisotropic Mesh Adaptivity on nVIDIA’s CUDA Using Texture Interpolation

Anisotropic mesh smoothing is used to generate optimised meshes for Computational Fluid Dynamics (CFD). Adapting the size and shape of elements in an unstructured mesh to a specification encoded in a metric tensor field is done by relocating mesh vertices. This computationally intensive task can be accelerated by engaging nVIDIA’s CUDA-enabled GPUs. This article describes the algorithmic background, the design choices and the implementation details that led to a mesh-smoothing application running in double-precision on a Tesla C2050 board. Engaging CUDA’s texturing hardware to manipulate the metric tensor field accelerates execution by up to 6.2 times, leading to a total speedup of up to 148 times over the serial CPU code and up to 15 times over the 12-threaded OpenMP code.

Georgios Rokos, Gerard Gorman, Paul H. J. Kelly

Topic 16: GPU and Accelerators Computing

Introduction

Recent years have seen great research interest in exploiting GPUs and accelerators for computation, as shown by the latest TOP500 editions, whose very top entries rely fully on them. Their potential computational power and energy efficiency are appealing; programming them, however, proves very challenging, as not only do task-offloading and data-transfer issues arise, but programming paradigms themselves appear to need reconsideration. Fully tapping into this new kind of computing resource thus stands out as an open issue, particularly when using regular CPUs and several accelerators conjointly. This is why we welcome this year the opening of a new “GPU and Accelerators Computing” topic alongside the established Euro-Par topics. The focus of this topic covers all areas related to accelerators: architecture, languages, compilers, libraries, runtimes, debugging and profiling tools, algorithms, applications, etc.

Wolfgang Karl, Samuel Thibault, Stanimire Tomov, Taisuke Boku
Model-Driven Tile Size Selection for DOACROSS Loops on GPUs

DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and to exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than for DOALL loops on GPUs. This paper presents a model-driven approach to automating this process. Validation using 1D, 2D and 3D SOR solvers shows that our framework can find tile sizes for these representative DOACROSS loops that achieve performance close to the best observed for a range of problem sizes tested.
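The wavefront parallelism that skewing exposes can be illustrated on a 2D Gauss-Seidel-style update, a relative of the SOR solvers used for validation: each cell depends on its left and upper neighbours, so all cells on one anti-diagonal are independent and could run in parallel once the previous diagonal is finished. This sketch shows only the skewed traversal, not the paper's tiling framework.

```python
# Skewed (wavefront) traversal of a DOACROSS iteration space: the outer loop
# runs over anti-diagonals d = i + j; every cell on a diagonal depends only
# on cells from diagonal d - 1, so the inner loop carries no dependence.

def wavefront_sweep(grid):
    """In-place sweep of grid[i][j] = 0.5 * (grid[i-1][j] + grid[i][j-1])
    over the interior of an n x m grid, diagonal by diagonal."""
    n, m = len(grid), len(grid[0])
    for d in range(2, n + m - 1):                     # skewed outer loop
        for i in range(max(1, d - m + 1), min(n, d)): # parallelizable inner loop
            j = d - i
            grid[i][j] = 0.5 * (grid[i - 1][j] + grid[i][j - 1])
    return grid
```

Because every dependence crosses from one diagonal to the next, the result is identical to the ordinary row-by-row sweep; tiling this skewed space is what raises the tile-size-selection problem the paper models.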

Peng Di, Jingling Xue
Iterative Sparse Matrix-Vector Multiplication for Integer Factorization on GPUs

The Block Wiedemann (BW) and the Block Lanczos (BL) algorithms are frequently used to solve sparse linear systems over GF(2). Iterative sparse matrix-vector multiplication is the most time-consuming operation of these approaches. The necessity to accelerate this step is motivated by the application of these algorithms to the very large matrices arising in the linear algebra step of the Number Field Sieve (NFS) for integer factorization. In this paper we derive an efficient CUDA implementation of this operation using a newly designed hybrid sparse matrix format. This leads to speedups between 4 and 8 on a single GPU for a number of tested NFS matrices compared to an optimized multi-core implementation.
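The kernel itself is easy to state: over GF(2), addition is XOR, so each output bit is the parity of the overlap between a row's nonzero columns and the set bits of the input vector. A scalar sketch of the idea (the paper's CUDA kernel and hybrid matrix format are far more elaborate):

```python
# Sparse matrix-vector multiplication over GF(2). The matrix is a list of
# column-index lists (one per row); the vector x is packed into an integer,
# bit j being entry j. y[i] = XOR over the row's columns of x's bits.

def gf2_spmv(rows, x):
    y = 0
    for i, cols in enumerate(rows):
        bit = 0
        for j in cols:
            bit ^= (x >> j) & 1     # accumulate A[i][j] * x[j] mod 2
        y |= bit << i
    return y
```

Bit-packing the vector is what lets real implementations process 32 or 64 vector entries per machine word, which is central to the block variants of Wiedemann and Lanczos.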

Bertil Schmidt, Hans Aribowo, Hoang-Vu Dang
Lessons Learned from Exploring the Backtracking Paradigm on the GPU

We explore the backtracking paradigm with properties seen as sub-optimal for GPU architectures, using as a case study the maximal clique enumeration problem, and find that the presence of these properties limits GPU performance to approximately 1.4–2.25 times a single CPU core. The GPU performance “lessons” we find critical to providing this performance include a coarse- and fine-grain parallelization of the search space, a low-overhead load-balanced distribution of work, global memory latency hiding through coalescence, saturation, and shared memory utilization, and the use of GPU output buffering as a solution to irregular workloads and a large solution domain. We also find a strong reliance on an efficient global problem structure representation that bounds any efficiencies gained from these lessons, and discuss the meaning of these results for backtracking problems in general.

John Jenkins, Isha Arkatkar, John D. Owens, Alok Choudhary, Nagiza F. Samatova
Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design

The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for complying applications and platforms, performance portability on such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as in the type, number, and complexity of their computational units. To characterize and compare the OpenCL performance of existing and future devices, we propose uCLbench, a suite of microbenchmarks.

We present measurements for eight hardware architectures – four GPUs, three CPUs and one accelerator – and illustrate how the results accurately reflect the unique characteristics of the respective platforms. In addition to measuring quantities traditionally benchmarked on CPUs, such as arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL, such as the dynamic branching penalties prevalent on GPUs. We demonstrate how our results can be used to guide algorithm design and optimization for any given platform on an example kernel that represents the key computation of a linear multigrid solver. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.

Peter Thoman, Klaus Kofler, Heiko Studt, John Thomson, Thomas Fahringer
Backmatter
Metadata
Title
Euro-Par 2011 Parallel Processing
Editors
Emmanuel Jeannot
Raymond Namyst
Jean Roman
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-23397-5
Print ISBN
978-3-642-23396-8
DOI
https://doi.org/10.1007/978-3-642-23397-5
