2012 | Book

Euro-Par 2012 Parallel Processing

18th International Conference, Euro-Par 2012, Rhodes Island, Greece, August 27-31, 2012. Proceedings

Editors: Christos Kaklamanis, Theodore Papatheodorou, Paul G. Spirakis

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the thoroughly refereed proceedings of the 18th International Conference, Euro-Par 2012, held on Rhodes Island, Greece, in August 2012. The 75 revised full papers presented were carefully reviewed and selected from 228 submissions. The papers are organized in topical sections on support tools and environments; performance prediction and evaluation; scheduling and load balancing; high-performance architectures and compilers; parallel and distributed data management; grid, cluster and cloud computing; peer-to-peer computing; distributed systems and algorithms; parallel and distributed programming; parallel numerical algorithms; multicore and manycore programming; theory and algorithms for parallel computation; high performance network and communication; mobile and ubiquitous computing; high performance and scientific applications; GPU and accelerator computing.

Table of Contents

Frontmatter

Invited Talk

Selfish Distributed Optimization

In this talk, we present a selection of important concepts and results in algorithmic game theory from recent years, some of which received the 2012 Gödel Prize, along with some applications in distributed settings.

A famous solution concept for non-cooperative games is the Nash equilibrium. In a Nash equilibrium, no selfish player can unilaterally deviate from his current strategy and improve his profit. Nash dynamics is a method to compute a Nash equilibrium. Here, in each round, a single player is allowed to perform a selfish step, i.e. unilaterally change his strategy and improve his cost. The Nash dynamics terminates if it does not run into a cycle. This is always the case if the game has a potential function. In this case, computing a Nash equilibrium is a $\mathcal{PLS}$ problem (Polynomial Local Search) and belongs to the large class of well-studied local optimization problems.
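
A toy sketch of the selfish-step dynamics described above (illustrative only, not taken from the talk): two players choose between two links in an unweighted congestion game, and improving moves are applied until no player can lower his cost; Rosenthal's potential function guarantees termination.

```python
# Toy illustration of Nash dynamics in an unweighted congestion game
# (not from the talk): players repeatedly perform selfish steps until a
# pure Nash equilibrium is reached.

def player_cost(player, profile, delays):
    """Cost of `player`: sum of the delays of his resources, where a
    resource's delay depends on how many players currently use it."""
    load = {}
    for strategy in profile:
        for r in strategy:
            load[r] = load.get(r, 0) + 1
    return sum(delays[r][load[r] - 1] for r in profile[player])

def nash_dynamics(strategies, profile, delays):
    """Let one player at a time switch to a strictly cheaper strategy."""
    improved = True
    while improved:
        improved = False
        for p in range(len(profile)):
            current = player_cost(p, profile, delays)
            for s in strategies[p]:
                old = profile[p]
                profile[p] = s
                if player_cost(p, profile, delays) < current:
                    improved = True
                    break
                profile[p] = old
            if improved:
                break
    return profile  # no player can improve unilaterally: a pure Nash equilibrium

# Two players, two links; delays[r][k-1] is link r's delay when k players use it.
delays = {"a": [1, 3], "b": [2, 2]}
strategies = [[("a",), ("b",)], [("a",), ("b",)]]
print(nash_dynamics(strategies, [("a",), ("a",)], delays))  # -> [('b',), ('a',)]
```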

Inspired by real-world networks, network congestion games have come under intense scrutiny in recent years. Network congestion games model selfish routing of unsplittable units. These units may be weighted or unweighted. Weighted congestion games do not necessarily have a pure Nash equilibrium. In contrast, an unweighted congestion game always has a potential function. Computing a pure Nash equilibrium for an unweighted congestion game is $\mathcal{PLS}$-complete.

The absence of a central coordinating authority can result in a loss of performance due to the selfishness of the participants. This situation is formalized in the notion of the “Price of Anarchy”. The Price of Anarchy is defined as the worst-case ratio between the maximal social cost in a Nash equilibrium and the optimal social cost. We present recent results for congestion games and for the special case of load balancing.
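
In symbols (standard definition), $\mathrm{PoA} = \max_{s \in NE} \mathrm{SC}(s) / \mathrm{SC}(s^{*})$, where $NE$ is the set of Nash equilibria, $\mathrm{SC}$ is the social cost, and $s^{*}$ is a socially optimal strategy profile.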

Classical game theory assumes that each player acts rationally and wants to improve his profit. This is not realistic in a distributed setting, since it requires that each player have complete information about the state of the system. We introduce the concept of selfish distributed load balancing and describe recent results.

We will also consider distributed algorithms for network creation games. In the past, network creation games have mostly been studied under the assumption that the players have a global view on the network, or more precisely, that the players are able to compute the average distance or the maximum distance to the nodes they want to interact with in the given network, depending on the objective function. A player may then decide to add one or more edges for some extra cost or to drop an edge. We will look at network creation games from a different angle. In our case, the players have fixed distances to each other that are based on some underlying metric (determined by, for example, the geographic positions of the players), and the goal is to study the networks formed when players selfishly add and remove edges based on that metric. We show that for certain metrics, such as the line metric, tree metrics, and the Euclidean metric, certain selfish behavior that requires only a local view of the players on the network will lead to stable networks that give a good approximation of the underlying metric.

Burkhard Monien, Christian Scheideler

Topic 1: Support Tools and Environments

Topic 1: Support Tools and Environments

Despite an impressive body of research, parallel and distributed computing remains a complex task prone to subtle software issues that can affect both the correctness and the performance of the computation. Fittingly, this topic has been listed as Topic 1 of the Euro-Par conference series for many years, emphasising its importance to the parallel and distributed systems community. The increasing demand to distribute computing over large-scale parallel and distributed platforms, such as grids and large clusters, often combined with the use of hardware accelerators, coincides with increasing pressure to make computing more dependable. To address these challenges, the parallel and distributed computing community continuously requires better tools and environments to design, program, debug, test, tune, and monitor programs that must execute on parallel and distributed systems. This topic aims to bring together tool designers, developers, and users to share their concerns, ideas, solutions, and products covering a wide range of platforms, including homogeneous and heterogeneous multi-core architectures. Contributions with solid theoretical foundations and experimental validation on production-level parallel and distributed systems were particularly valued. This year we encouraged submissions proposing intelligent monitoring and diagnosis tools and environments that can exploit behavioral knowledge to detect programming bugs or performance bottlenecks and help ensure correct and efficient parallel program execution.

Omer Rana, Marios Dikaiakos, Daniel S. Katz, Christine Morin
Tulipse: A Visualization Framework for User-Guided Parallelization

Parallelization of existing code for modern multicore processors is tedious as the person performing these tasks must understand the algorithms, data structures and data dependencies in order to do a good job. Current options available to the programmer include either automatic parallelization or a complete rewrite in a parallel programming language. However, there are limitations with these options. In this paper, we propose a framework that enables the programmer to visualize information critical for semi-automated parallelization. The framework, called Tulipse, offers a program structure view that is augmented with key performance information, and a loop-nest dependency view that can be used to visualize data dependencies gathered from static or dynamic analyses. Our paper will demonstrate how these two new perspectives aid in the parallelization of code.

Yi Wen Wong, Tomasz Dubrownik, Wai Teng Tang, Wen Jun Tan, Rubing Duan, Rick Siow Mong Goh, Shyh-hao Kuo, Stephen John Turner, Weng-Fai Wong
Enabling Cloud Interoperability with COMPSs

The advent of Cloud computing has given researchers the ability to access resources that satisfy their growing needs, which could not be met by traditional computing resources such as PCs and locally managed clusters. On the other hand, this ability has opened new challenges for the execution of their computational work and for managing massive amounts of data across resources provided by different private and public infrastructures.

COMP Superscalar (COMPSs) is a programming framework that provides a programming model and a runtime that ease the development of applications for distributed environments and their execution on a wide range of computational infrastructures. COMPSs has recently been extended to interoperate with several cloud technologies such as Amazon, OpenNebula, Emotive, and other OCCI-compliant offerings.

This paper presents the extensions of this interoperability layer to support the execution of COMPSs applications on the Windows Azure Platform. The framework has been evaluated by porting a data mining workflow to COMPSs and executing it on a hybrid testbed.

Fabrizio Marozzo, Francesc Lordan, Roger Rafanell, Daniele Lezzi, Domenico Talia, Rosa M. Badia
Pattern-Independent Detection of Manual Collectives in MPI Programs

In parallel applications, a significant amount of communication occurs in a collective fashion to perform, for example, broadcasts, reductions, or complete exchanges. Although the MPI standard defines many convenience functions for this purpose, which not only improve code readability and maintenance but are usually also highly efficient, many application programmers still create their own, manual implementations using point-to-point communication. We show how instances of such hand-crafted collectives can be automatically detected. Matching pre- and post-conditions of hashed message exchanges recorded in event traces, our method is independent of the specific communication pattern employed. We demonstrate that replacing detected broadcasts in the HPL benchmark can yield significant performance improvements.

Alexandru Calotoiu, Christian Siebert, Felix Wolf
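
A minimal sketch of the trace-matching idea (hypothetical and greatly simplified; not the paper's pre-/post-condition algorithm): a group of point-to-point records is flagged as a manual broadcast once one root's payload, identified by its hash, has reached every rank, regardless of the relay pattern used.

```python
# Hypothetical sketch (not the paper's algorithm): flag a set of point-to-point
# trace records as a manual broadcast when a single root's payload hash is
# eventually received by every other rank, whatever the forwarding pattern.
def looks_like_broadcast(records, num_ranks):
    """records: iterable of (sender, receiver, payload_hash) trace events."""
    received = {}  # payload_hash -> set of ranks holding the payload
    origin = {}    # payload_hash -> rank that first sent it
    for sender, receiver, h in records:
        origin.setdefault(h, sender)
        holders = received.setdefault(h, {origin[h]})
        if sender in holders:          # only count forwards of the original data
            holders.add(receiver)
        if len(holders) == num_ranks:  # every rank now has the root's payload
            return origin[h]
    return None

trace = [(0, 1, "abc"), (0, 2, "abc"), (1, 3, "abc")]  # binomial-tree style
print(looks_like_broadcast(trace, 4))  # -> 0 (rank 0 is the root)
```
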
A Type-Based Approach to Separating Protocol from Application Logic
A Case Study in Hybrid Computer Programming

Numerous programming models have been introduced to allow programmers to utilize new accelerator-based architectures. While OpenCL and CUDA provide low-level access to accelerator programming, the task cries out for a higher-level abstraction. Of the higher-level programming models which have emerged, few are intended to co-exist with mainstream, general-purpose languages while supporting tunability, composability, and transparency of implementation. In this paper, we propose extensions to the type systems (implementable as syntactically neutral annotations) of traditional, general-purpose languages that allow programmers to work at a higher level of abstraction with respect to memory, deferring much of the tedium of data management and movement code to an automatic code generation tool. Furthermore, our technique, based on formal term rewriting, allows user-defined reduction rules to optimize low-level operations and exploit domain- and/or application-specific knowledge.

Geoffrey C. Hulette, Matthew J. Sottile, Allen D. Malony

Topic 2: Performance Prediction and Evaluation

Topic 2: Performance Prediction and Evaluation

In recent years, a range of novel methodologies and tools have been developed for the purpose of evaluation, design, and model reduction of existing and emerging parallel and distributed systems. At the same time, the coverage of the term ‘performance’ has constantly broadened to include reliability, robustness, energy consumption, and scalability in addition to classical performance-oriented evaluations of system functionalities. Indeed, the increasing diversification of parallel systems, from cloud computing to exascale, fueled by technological advances, is placing greater emphasis on methods and tools that address these more comprehensive concerns. The aim of the Performance Prediction and Evaluation topic is to bring together system designers and researchers involved with the qualitative and quantitative evaluation and modeling of large-scale parallel and distributed applications and systems to focus on current critical areas of performance prediction and evaluation theory and practice.

Allen D. Malony, Helen Karatza, William Knottenbelt, Sally McKee
Energy Consumption Modeling for Hybrid Computing

Energy efficiency is increasingly critical for embedded systems and mobile devices, where continuous operation depends on battery life. In order to increase energy efficiency, chip manufacturers are developing heterogeneous CMP chips.

We present analytical models based on an energy consumption metric to analyze the performance gains and energy consumption of various architectural design choices for hybrid CPU-GPU chips. We also analyze the power consumption implications of different processing modes and various chip configurations. The analysis shows clearly that greater parallelism is the most important factor affecting energy saving.

Ami Marowka
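
For intuition, an illustrative Amdahl-style model (not necessarily the paper's) is $E \approx P_{CPU}(1-f)\,T_1 + P_{GPU}\, f\, T_1 / s$, where $T_1$ is the sequential execution time, $f$ the fraction of work offloaded to the GPU, $s$ the GPU's speedup over a single core, and $P_{CPU}$, $P_{GPU}$ the respective power draws; under such a model, the exploitable parallel fraction $f$ largely determines how much energy can be saved, in line with the conclusion above.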
HPC File Systems in Wide Area Networks: Understanding the Performance of Lustre over WAN

Using the first commercially available 100 Gbps Ethernet technology with a link of varying length, we have evaluated the performance of the Lustre file system and its networking layer under different latency scenarios. The results led us to a better understanding of the impact that network latency has on Lustre’s performance, in particular on Lustre’s networking layer, striped small I/O, and the parallel creation of files inside a common directory. The main contribution of this work is the derivation of useful rules of thumb to help users and system administrators predict the variation in Lustre’s performance produced by changes in the latency of the I/O network.

Alvaro Aguilera, Michael Kluge, Thomas William, Wolfgang E. Nagel
Understanding I/O Performance Using I/O Skeletal Applications

We address the difficulty involved in obtaining meaningful measurements of I/O performance in HPC applications, as well as the further challenge of understanding the causes of I/O bottlenecks in these applications. The need for I/O optimization is critical given the difficulty in scaling I/O to ever increasing numbers of processing cores. To address this need, we have pioneered a new approach to the analysis of I/O performance using automatic generation of I/O benchmark codes given a high-level description of an application’s I/O pattern. By combining this with low-level characterization of the performance of the various components of the underlying I/O method we are able to produce a complete picture of the I/O behavior of an application.

We compare the performance measurements obtained using Skel, the tool that implements our approach, with those of an instrumented version of the original application to show that our approach is accurate. We demonstrate the use of Skel to compare the performance of several I/O methods. Finally we show that the detailed breakdown of timing information produced by Skel provides better understanding of the reasons for the performance differences between the examined I/O methods. We conclude that our approach facilitates faster, more accurate and more meaningful I/O performance testing, allowing application I/O performance to be predicted, and new systems and I/O methods to be evaluated.

Jeremy Logan, Scott Klasky, Hasan Abbasi, Qing Liu, George Ostrouchov, Manish Parashar, Norbert Podhorszki, Yuan Tian, Matthew Wolf
ASK: Adaptive Sampling Kit for Performance Characterization

Characterizing performance is essential to optimize programs and architectures. The open source Adaptive Sampling Kit (ASK) measures the performance trade-offs in large design spaces. Exhaustively sampling all points is computationally intractable. Therefore, ASK concentrates exploration in the most irregular regions of the design space through multiple adaptive sampling methods. The paper presents the ASK architecture and a set of adaptive sampling strategies, including a new approach: Hierarchical Variance Sampling. ASK’s usage is demonstrated on two performance characterization problems: memory stride accesses and stencil codes. ASK builds precise models of performance with a small number of measures. It considerably reduces the cost of performance exploration. For instance, the stencil code design space, which has more than $31\cdot10^{8}$ points, is accurately predicted using only 1,500 points.

Pablo de Oliveira Castro, Eric Petit, Jean Christophe Beyler, William Jalby
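
A minimal sketch of variance-guided adaptive sampling in the spirit of Hierarchical Variance Sampling (hypothetical code, not ASK itself): the design space is split into regions and the remaining measurement budget is spent where the observed responses vary the most.

```python
# Hypothetical sketch of variance-guided adaptive sampling (not ASK's code):
# split a 1-D design space into regions and keep sampling the region whose
# measured responses show the largest variance so far.
import random
from statistics import pvariance

def adaptive_sample(measure, lo, hi, regions=8, initial=3, budget=40):
    bounds = [(lo + i * (hi - lo) / regions, lo + (i + 1) * (hi - lo) / regions)
              for i in range(regions)]
    samples = [[] for _ in range(regions)]
    # Seed every region with a few random points.
    for i, (a, b) in enumerate(bounds):
        for _ in range(initial):
            x = random.uniform(a, b)
            samples[i].append((x, measure(x)))
    # Spend the remaining budget on the most irregular region.
    for _ in range(budget - regions * initial):
        variances = [pvariance([y for _, y in s]) for s in samples]
        i = max(range(regions), key=lambda k: variances[k])
        a, b = bounds[i]
        x = random.uniform(a, b)
        samples[i].append((x, measure(x)))
    return [pt for region in samples for pt in region]

# Example: a response surface that is flat on the left, irregular on the right.
points = adaptive_sample(lambda x: x * x if x > 0.5 else 0.0, 0.0, 1.0)
print(len(points), "samples, most of them in the irregular half")
```
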
CRAW/P: A Workload Partition Method for the Efficient Parallel Simulation of Manycores

This paper addresses workload partition strategies in the simulation of manycore architectures. The key observation behind this paper is that, compared to traditional multicores, manycores feature more non-uniform memory access and unpredictable network traffic; these features degrade the simulation speed and accuracy of Parallel Discrete Event Simulators (PDES) when static workload partition schemes are used. Based on this observation, we propose an adaptive workload partition method: Core/Router-Adaptive Workload Partition (CRAW/P). The method delivers more speedup and accuracy than static partition schemes by partitioning the simulation of the on-chip network independently from that of the cores and by synchronizing them differently. Using a PDES simulator, we evaluate the performance of CRAW/P in simulating a 256-core general-purpose manycore processor. Running SPLASH-2 benchmark applications, the experimental results demonstrate that it delivers a speed improvement of 28%–67% over static partition schemes and reduces timing errors to below 10% in very relaxed simulation (quantum size of 64).

Shuai Jiao, Paolo Ienne, Xiaochun Ye, Da Wang, Dongrui Fan, Ninghui Sun

Topic 3: Scheduling and Load Balancing

Topic 3: Scheduling and Load Balancing

More than ever, parallelism is available today at every level of computing systems, including dedicated embedded systems, basic instructions and registers, hardware accelerators, multi-core platforms, and computational grids. Despite many efforts and positive results obtained during the past years, such systems are still not fully exploited. Scheduling concerns the use and optimization of resource allocation in parallel and distributed systems. There are many issues to study in order to better share the load and to achieve better reliability and adaptivity under computing, bandwidth, or memory constraints; all of them are crucial for making better use of parallel and distributed systems. It is a big challenge to study the related techniques at both the application and system levels. At the application level, the choice of an adequate computational model, the design of dynamic algorithms that are able to adapt to particular characteristics, the mapping of applications onto the underlying computing platforms, and the actual utilization of the systems are particularly relevant.

Denis Trystram, Ioannis Milis, Zhihui Du, Uwe Schwiegelshohn
Job Scheduling Using Successive Linear Programming Approximations of a Sparse Model

In this paper we tackle the well-known problem of scheduling a collection of parallel jobs on a set of processors, either in a cluster or in a multiprocessor computer. For the makespan objective, i.e., the completion time of the last job, this problem has been shown to be NP-hard, and several heuristics have already been proposed to minimize the execution time. We introduce a novel approach based on successive linear programming (LP) approximations of a sparse model. The idea is to relax an integer linear program and use $\ell_p$-norm-based operators to force the solver to find almost-integer solutions that can be assimilated to an integer solution. We consider the case where jobs are either rigid or moldable. A rigid parallel job is performed with a predefined number of processors, while a moldable job can define the number of processors that it is using just before it starts its execution. We compare the scheduling approach with the classic Largest Task First list-based algorithm and show that our approach provides good results for small instances of the problem. The contributions of this paper are both the integration of mathematical methods in the scheduling world and the design of a promising approach which gives good results for scheduling problems with fewer than a hundred processors.

Stephane Chretien, Jean-Marc Nicod, Laurent Philippe, Veronika Rehn-Sonigo, Lamiel Toch
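
To make the flavor of such a relaxation concrete (a simplified textbook version for sequential jobs on identical processors, not the paper's exact model): minimize the makespan $C_{\max}$ subject to $\sum_i x_{ij} = 1$ for every job $j$, $\sum_j p_j x_{ij} \le C_{\max}$ for every processor $i$, and $x_{ij} \ge 0$; an added $\ell_p$-norm penalty then pushes each assignment vector $(x_{ij})_i$ toward a 0/1 vector, so the fractional solution can be read off as an (almost) integral schedule.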
Speed Scaling on Parallel Processors with Migration

We study the problem of scheduling a set of jobs with release dates, deadlines and processing requirements (works), on parallel speed-scalable processors so as to minimize the total energy consumption. We consider that both preemption and migration of jobs are allowed. We formulate the problem as a convex program and we propose a polynomial-time combinatorial algorithm which is based on a reduction to the maximum flow problem. We extend our algorithm to the multiprocessor speed scaling problem with preemption and migration where the objective is the minimization of the maximum lateness under a budget of energy.

Eric Angel, Evripidis Bampis, Fadi Kacem, Dimitrios Letsios
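
In the standard speed-scaling model, a processor running at speed $s$ consumes power $s^{\alpha}$ for some constant $\alpha > 1$, so executing $w_j$ units of work within a total allotted time $t_j$ costs at least $w_j^{\alpha}/t_j^{\alpha-1}$ energy. A convex program of this kind (a sketch, not necessarily the authors' exact formulation) minimizes $\sum_j w_j^{\alpha}/t_j^{\alpha-1}$ over the allotted times $t_j$, subject to the $t_j$ being realizable by a preemptive, migratory schedule within each job's $[r_j, d_j]$ window, a feasibility condition that can be checked via maximum flow, consistent with the reduction mentioned above.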
Dynamic Distributed Scheduling Algorithm for State Space Search

Petascale computing requires complex runtime systems that need to consider load balancing along with low time and message complexity for scheduling massive scale parallel computations. Simultaneous consideration of these objectives makes online distributed scheduling a very challenging problem. For state space search applications such as UTS, NQueens, Balanced Tree Search, SAT and others, the computations are highly irregular and data dependent. Here, prior scheduling approaches such as [16], [14], [7], HotSLAW [10], which are dominantly locality-aware work-stealing driven, could lead to low parallel efficiency and scalability along with potentially high stack memory usage.

In this paper we present a novel distributed scheduling algorithm (LDSS) for multi-place parallel computations that uses a unique combination of d-choice randomized remote (inter-place) spawns and topology-aware randomized remote work steals to reduce the overheads in the scheduler and dynamically maintain load balance across the compute nodes of the system. Our design was implemented using the GASNet API and POSIX threads. For the UTS (Unbalanced Tree Search) benchmark (using up to 4096 nodes of Blue Gene/P), we deliver the best parallel efficiency (92%) for a 295-billion-node binomial tree, better than [16] (87%), and demonstrate super-linear speedup on a 1-trillion-node (largest studied so far) geometric tree along with a higher tree-node processing rate. We also deliver up to 40% better performance than Charm++. Further, our memory utilization is lower compared to HotSLAW. Moreover, for NQueens (N = 18), we demonstrate superior parallel efficiency (92%) compared to Charm++ (85%).

Ankur Narang, Abhinav Srivastava, Ramnik Jain, R. K. Shyamasundar
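
A minimal sketch of the d-choice idea behind the randomized remote spawns (hypothetical code, not the authors' GASNet implementation): probe d randomly chosen places and spawn at the least loaded one, which keeps the load across places balanced.

```python
# Hypothetical sketch of d-choice randomized spawn placement (not LDSS itself):
# probe d random places and push the new task to the least loaded one.
import random

def pick_place(loads, d=2):
    """loads: list with the current queue length of every place."""
    candidates = random.sample(range(len(loads)), d)
    return min(candidates, key=lambda p: loads[p])

loads = [0] * 16
for _ in range(1000):            # spawn 1000 remote tasks
    loads[pick_place(loads)] += 1
print(max(loads) - min(loads), "max-min load gap after 1000 d-choice spawns")
```
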
Using Load Information in Work-Stealing on Distributed Systems with Non-uniform Communication Latencies

We evaluate four state-of-the-art work-stealing algorithms for distributed systems with non-uniform communication latencies (Random Stealing, Hierarchical Stealing, Cluster-aware Random Stealing and Adaptive Cluster-aware Random Stealing) on a set of irregular Divide-and-Conquer (D&C) parallel applications. We also investigate the extent to which these algorithms could be improved if dynamic load information is available, and how accurate this information needs to be. We show that, for highly-irregular D&C applications, the use of load information can significantly improve application speedups, whereas there is little improvement for less irregular ones. Furthermore, we show that when load information is used, Cluster-aware Random Stealing gives the best speedups for both regular and irregular D&C applications.

Vladimir Janjic, Kevin Hammond
Energy Efficient Frequency Scaling and Scheduling for Malleable Tasks

We give an efficient algorithm for solving the following scheduling problem to optimality: assign n jobs to m processors such that they all meet a common deadline T and energy consumption is minimized by appropriately controlling the clock frequencies of the processors. Jobs are malleable, i.e., their amount of parallelism can be flexibly adapted. In contrast to previous work on energy-efficient scheduling, we allow more realistic energy consumption functions including a minimum and maximum clock frequency and a linear term in energy consumption. We need certain assumptions on the speedup functions of the jobs, which we show to apply to a large class of practically occurring functions.

Peter Sanders, Jochen Speck
Scheduling MapReduce Jobs in HPC Clusters

MapReduce (MR) has become a de facto standard for large-scale data analysis. Moreover, it has also attracted the attention of the HPC community due to its simplicity, efficiency and highly scalable parallel model. However, MR implementations present some issues that may complicate execution in existing HPC clusters, especially concerning job submission. While MR requires no strict parameters to submit a job, in a typical HPC cluster users must specify the number of nodes and the amount of time required to complete the job execution. This paper presents the MR Job Adaptor, a component to optimize the scheduling of MR jobs along with HPC jobs in an HPC cluster. Experiments performed using real-world HPC and MapReduce workloads have shown that the MR Job Adaptor can properly transform MR jobs to be scheduled in an HPC cluster, minimizing job turnaround time and exploiting unused resources in the cluster.

Marcelo Veiga Neves, Tiago Ferreto, César De Rose
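
As a rough illustration of the kind of transformation such an adaptor performs (a hypothetical heuristic; the parameters and formula are not taken from the paper): derive the node count and walltime a batch scheduler expects from an MR job description, capped by the resources currently idle in the cluster.

```python
# Hypothetical sketch (not the MR Job Adaptor itself): map an MR job
# description to the (nodes, walltime) request an HPC scheduler expects.
import math

def adapt_mr_job(map_tasks, avg_task_seconds, slots_per_node, idle_nodes):
    # Aim for roughly four waves of map tasks, but never ask for more
    # nodes than are currently idle.
    nodes = min(idle_nodes, max(1, math.ceil(map_tasks / (slots_per_node * 4))))
    waves = math.ceil(map_tasks / (nodes * slots_per_node))
    walltime = math.ceil(waves * avg_task_seconds * 1.2)  # 20% safety margin
    return {"nodes": nodes, "walltime_seconds": walltime}

print(adapt_mr_job(map_tasks=800, avg_task_seconds=30,
                   slots_per_node=16, idle_nodes=10))
```
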
A Job Scheduling Approach for Multi-core Clusters Based on Virtual Malleability

Many commercial job scheduling strategies in multiprocessing systems tend to minimize the waiting times of short jobs. However, long jobs cannot be left aside, as their impact on the performance of the system is also decisive. In this work we propose a job scheduling strategy that maximizes resource utilization and improves overall performance by allowing jobs to adapt to variations in the load. The experimental evaluation includes both simulations and executions of real workloads. The results show that our strategy provides significant improvements over the traditional EASY backfilling policy, especially at medium to high machine loads.

Gladys Utrera, Siham Tabik, Julita Corbalan, Jesús Labarta

Topic 4: High-Performance Architecture and Compilers

Topic 4: High-Performance Architecture and Compilers

High-performance architecture and compilation are the foundation on which modern computer systems are built. The two sub-topics are strongly related, and only in combination can they deliver the performance levels we have come to expect from such systems. The topic is quite broad, with sub-areas of interest ranging from multicore and multi-threaded processors to large-scale parallel machines, and from program analysis, program transformation, automatic discovery and management of parallelism, and programmer productivity tools to concurrent and sequential languages and other compiler issues.

Alex Veidenbaum, Nectarios Koziris, Toshinori Sato, Avi Mendelson
Dynamic Last-Level Cache Allocation to Reduce Area and Power Overhead in Directory Coherence Protocols

Last-level caches (LLC) play an important role in current and future chip multiprocessors, since they constitute the last opportunity to avoid expensive off-chip accesses. In a tiled CMP, the LLC is typically shared by all cores but physically distributed across the chip, thus providing a global, banked memory capacity with high associativity. The memory hierarchy is orchestrated through a directory-based coherence protocol, typically associated with the LLC banks. The LLC (and directory structure) occupies a significant chip area and contributes substantially to the global chip leakage energy. To counteract these effects, we propose in this paper a reorganization of the LLC cache and the directory that decouples tag and data entry allocation and exploits the high percentage of private data typically found in CMP systems. Private blocks are kept in L1 caches, whereas the LLC area is reorganized to reduce L2 entries while still allowing directory entries for private data, thus maximizing on-chip memory reuse. This is achieved with no performance drop in terms of execution time. Evaluation results demonstrate a negligible impact on performance while achieving 45% area savings and 75% static power savings. For more aggressive designs, we achieve 80% area and 82% static power savings, while impacting performance by 10%.

Mario Lodde, Jose Flich, Manuel E. Acacio
A Practical Approach to DOACROSS Parallelization

Loops with cross-iteration dependences (doacross loops) often contain significant amounts of parallelism that can potentially be exploited on modern manycore processors. However, most production-strength compilers focus their automatic parallelization efforts on doall loops, and consider doacross parallelism to be impractical due to the space inefficiencies and the synchronization overheads of past approaches. This paper presents a novel and practical approach to automatically parallelizing doacross loops for execution on manycore-SMP systems. We introduce a compiler-and-runtime optimization called dependence folding that bounds the number of synchronization variables allocated per worker thread (processor core) to be at most the maximum depth of a loop nest being considered for automatic parallelization. Our approach has been implemented in a development version of the IBM XL Fortran V13.1 commercial parallelizing compiler and runtime system. For four benchmarks where automatic doall parallelization was largely ineffective (speedups of under 2×), our implementation delivered speedups of 6.5×, 9.0×, 17.3×, and 17.5× on a 32-core IBM Power7 SMP system, thereby showing that doacross parallelization can be a valuable technique to complement doall parallelization.

Priya Unnikrishnan, Jun Shirako, Kit Barton, Sanjay Chatterjee, Raul Silvera, Vivek Sarkar
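
A schematic sketch of the dependence-folding idea (illustrative only; the paper's implementation lives inside the XL Fortran compiler and runtime, not in Python): each worker owns a single "folded" synchronization variable recording its last completed iteration, rather than one flag per iteration, and a worker that needs iteration i-1 simply waits until the owner's counter reaches i-1.

```python
# Hypothetical sketch of dependence folding for a doacross loop (not the XL
# compiler's implementation): one sync counter per worker instead of one flag
# per iteration.
import threading

N, WORKERS = 16, 4
done = [-1] * WORKERS                 # one folded sync variable per worker
cond = threading.Condition()
a = [0] * (N + 1)

def owner(i):                         # iterations are distributed cyclically
    return i % WORKERS

def worker(w):
    for i in range(w, N, WORKERS):
        if i > 0:                     # cross-iteration dependence: needs a[i]
            with cond:
                cond.wait_for(lambda: done[owner(i - 1)] >= i - 1)
        a[i + 1] = a[i] + 1           # the "loop body" of iteration i
        with cond:
            done[w] = i               # publish progress in the folded variable
            cond.notify_all()

threads = [threading.Thread(target=worker, args=(w,)) for w in range(WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(a == list(range(N + 1)))        # True: dependences were respected
```
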
Exploiting Semantics of Virtual Memory to Improve the Efficiency of the On-Chip Memory System

Different virtual memory regions (e.g., stack and heap) have different properties and characteristics. For example, stack data are thread-private by definition while heap data can be shared between threads. Compared with heap memory, stack memory tends to take a large number of accesses to a rather small number of pages. These facts have been largely ignored by designers. In this paper, we propose two novel designs that exploit stack memory’s unique characteristics to optimize the on-chip memory system.

The first design is Anticipatory Superpaging: automatically create superpages for stack memory at the first page fault in a potential superpage, increasing TLB reach and reducing TLB misses. It is transparent to applications and does not require the kernel to employ online analysis algorithms or page copying. The second design is Stack-Aware Cache Placement: stack accesses are routed to their local slices in a distributed shared cache, while non-stack accesses are still routed using cacheline interleaving. The primary benefit of this mechanism is reduced power consumption of the on-chip interconnect. Our simulations show that the first innovation reduces TLB misses by 10%–20%, and the second one reduces interconnect power consumption by over 14%.

Bin Li, Zhen Fang, Li Zhao, Xiaowei Jiang, Lin Li, Andrew Herdrich, Ravishankar Iyer, Srihari Makineni
From Serial Loops to Parallel Execution on Distributed Systems

Programmability and performance portability are two major challenges in today’s dynamic environment. Algorithm designers targeting efficient algorithms should focus on designing high-level algorithms exhibiting maximum parallelism, while relying on compilers and run-time systems to discover and exploit this parallelism, delivering sustainable performance on a variety of hardware. The compiler tool presented in this paper can analyze the data flow of serial codes with imperfectly nested, affine loop nests and if statements, commonly found in scientific applications. This tool operates as the front-end compiler for the DAGuE run-time system by automatically converting serial codes into the symbolic representation of their data flow. We show how the compiler analyzes the data flow, and demonstrate that scientifically important, dense linear algebra operations can benefit from this analysis and deliver high performance on large-scale platforms.

George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Jack Dongarra

Topic 5: Parallel and Distributed Data Management

Topic 5: Parallel and Distributed Data Management

The ever-increasing data volumes used to empower contemporary data-intensive applications, as well as aggregations of computing systems, call for novel approaches and efficient techniques in the management of geographically dispersed data. Despite recent advances, Internet-scale requirements for both applications and underlying systems require effective provisioning, staging, manipulation, continuous maintenance, and monitoring of data hosted in multiple, pre-existing, autonomous, distributed, and often heterogeneous systems. Evidently, the notions of parallelism and concurrent execution at all levels remain key elements in attaining scalability and effective management for nearly all modern data-intensive applications. Moreover, as underlying computing environments get transformed through the introduction of novel infrastructures, enhanced capacities and extended functionalities, new solutions are sought to cope with these changes.

Domenico Talia, Alex Delis, Haimonti Dutta, Arkady Zaslavsky
DS-Means: Distributed Data Stream Clustering

This paper proposes DS-means, a novel algorithm for clustering distributed data streams. Given a network of computing nodes, each of them receiving its share of a distributed data stream, our goal is to obtain a common clustering under the following restrictions: (i) the number of clusters is not known in advance, and (ii) nodes are not allowed to share single points of their datasets, but only aggregate information. A motivating example for DS-means is the decentralized detection of botnets, where a collection of independent ISPs may want to detect common threats but are unwilling to share their precious users’ data. In DS-means, nodes execute a distributed version of K-means on each chunk of data they receive to provide a compact representation of the data of the entire network. Later, X-means is executed on this representation to obtain an estimate of the number of clusters. A number of experiments on both synthetic and real-life datasets show that our algorithm is precise, efficient, and robust.

Alessio Guerrieri, Alberto Montresor
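
A greatly simplified sketch of the pipeline described above (1-D data, a toy merge step in place of X-means; not the authors' code): nodes summarize their stream chunks as (centroid, count) pairs, and only these aggregates, never raw points, are combined to estimate the global clustering.

```python
# Hypothetical sketch of a DS-means-style pipeline (greatly simplified, 1-D):
# every node sends (centroid, count) summaries; the coordinator merges them.
def local_kmeans(points, k, iters=10):
    centers = points[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: abs(p - centers[j]))].append(p)
        centers = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
    return [(centers[j], len(groups[j])) for j in range(k)]

def merge(summaries, tol=1.0):
    """Coordinator: merge per-node centroids that fall within `tol` of each other."""
    merged = []
    for c, w in sorted(s for node in summaries for s in node):
        if merged and abs(c - merged[-1][0]) < tol:
            c0, w0 = merged[-1]
            merged[-1] = ((c0 * w0 + c * w) / (w0 + w), w0 + w)
        else:
            merged.append((c, w))
    return merged

node_a = local_kmeans([0.1, 0.2, 9.8, 10.1], k=2)
node_b = local_kmeans([0.0, 0.3, 10.0, 10.2], k=2)
print(merge([node_a, node_b]))   # ~two global clusters near 0 and 10
```
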
3D Inverted Index with Cache Sharing for Web Search Engines

Web search engines achieve efficient performance by partitioning and replicating the indexing data structure used to support query processing. Current practice simply partitions and replicates the text collection on the set of cluster processors and then constructs in each processor an index data structure. This paper proposes a different approach by constructing an index data structure that properly considers the fact that data is partitioned and replicated. This leads to a so-called 3D indexing strategy that outperforms current approaches. Performance is further boosted by introducing an application caching scheme devised to hold most frequently issued queries.

Esteban Feuerstein, Veronica Gil-Costa, Mauricio Marin, Gabriel Tolosa, Ricardo Baeza-Yates
Quality-of-Service for Consistency of Data Geo-replication in Cloud Computing

Today we are increasingly dependent on critical data stored in cloud data centers across the world. To deliver high availability and augmented performance, different replication schemes are used to maintain consistency among replicas. With classical consistency models, performance is necessarily degraded, and thus most highly scalable cloud data centers sacrifice consistency to some extent in exchange for lower latencies to end users. Moreover, those cloud systems blindly allow stale data to exist for some constant period of time and disregard the semantics and importance the data might have, which undoubtedly could be used to gear consistency more wisely, combining stronger and weaker levels of consistency. To tackle this inherent and well-studied trade-off between availability and consistency, we propose the use of VFC³, a novel consistency model for replicated data across data centers with framework and library support to enforce increasing degrees of consistency for different types of data (based on their semantics). It targets cloud tabular data stores, offering rationalization of resources (especially bandwidth) and improvement of QoS (performance, latency, and availability) by providing strong consistency where it matters most and relaxing it on less critical classes or items of data.

Sérgio Esteves, João Silva, Luís Veiga
A Fault-Tolerant Cache Service for Web Search Engines: RADIC Evaluation

Large Web search engines are constructed as a collection of services that are deployed on dedicated clusters of distributed-memory processors. In particular, efficient query throughput heavily relies on using result cache services devoted to maintaining the answers to most frequent queries. Load balancing and fault tolerance are critical to this service. A previous paper [7] described the design of a result cache service based on consistent hashing and a strategy for enabling fault tolerance. This paper goes further into implementation details and experiments related to the basic scheme to support fault-tolerance which is critical for overall performance. To this end, we evaluate the performance of the RADIC scheme [14] for fault-tolerance under demanding scenarios imposed in the caching service.

Carlos Gómez-Pantoja, Dolores Rexachs, Mauricio Marin, Emilio Luque

Topic 6: Grid, Cluster and Cloud Computing

Topic 6: Grid, Cluster and Cloud Computing

Grid and cloud computing have changed the IT landscape in the way we access and manage IT infrastructures. Both technologies provide easy-to-use and on-demand access to large-scale infrastructures. Grid and cloud computing are major research areas with strong involvement from both academia and industry. Although significant progress has been made in the design, deployment, operation and use of such infrastructures, many key research challenges remain to achieve the goal of user-friendly, efficient, and reliable grid and cloud infrastructures. Research issues cover many areas of computer science to address the fundamental capabilities and services that are required in a heterogeneous environment, such as adaptability, scalability, reliability and security, and to support applications as diverse as ubiquitous local services, enterprise-scale virtual organizations, and internet-scale distributed supercomputing. While there are several differences, grid and cloud computing are closely related in their research issues. Both areas will greatly benefit from interactions with the many related areas of computer science, making Euro-Par an excellent venue to present results and discuss issues.

Erik Elmroth, Paraskevi Fragopoulou, Artur Andrzejak, Ivona Brandic, Karim Djemame, Paolo Romano
Scalable Reed-Solomon-Based Reliable Local Storage for HPC Applications on IaaS Clouds

With increasing interest among mainstream users in running HPC applications, Infrastructure-as-a-Service (IaaS) cloud computing platforms represent a viable alternative to the acquisition and maintenance of expensive hardware, often beyond the financial capabilities of such users. One of the critical needs of HPC applications is efficient, scalable, and persistent storage. Unfortunately, the storage options proposed by cloud providers are not standardized and typically use a different access model. In this context, the local disks on the compute nodes can be used to save large data sets, such as the data generated by checkpoint-restart (CR). This local storage offers high throughput and scalability, but it needs to be combined with persistency techniques such as block replication or erasure codes. One of the main challenges such techniques face is to minimize the performance overhead and I/O resource utilization (i.e., storage space and bandwidth), while at the same time guaranteeing high reliability of the saved data. This paper introduces a novel persistency technique that leverages Reed-Solomon (RS) encoding to save data in a reliable fashion. Compared to traditional approaches that rely on block replication, we demonstrate about 50% higher throughput while reducing network bandwidth and storage utilization by a factor of 2 for the same targeted reliability level. This is achieved both by modeling and by real-life experimentation on hundreds of nodes.

Leonardo Bautista Gomez, Bogdan Nicolae, Naoya Maruyama, Franck Cappello, Satoshi Matsuoka
Caching VM Instances for Fast VM Provisioning: A Comparative Evaluation

One of the key metrics of performance in an infrastructure cloud is the speed of provisioning a virtual machine (or a virtual appliance) on request. A VM is instantiated from an image file stored in the image repository. Since image files are large, often gigabytes in size, transferring a file from the repository to a compute node running the hypervisor can take time on the order of minutes. In addition, booting an image file can be a time-consuming process if several applications are pre-installed. Using caching to pre-fetch items that may be requested in the future is known to reduce service latency. In order to overcome the delays in transfer and booting time, we prepare a VM a priori and save it in a standby state in a “cache” space collocated with the compute nodes. On receiving a matching request, the VM from the cache is instantly served to the user, thereby reducing service time. In this paper, we compare multiple approaches for pre-provisioning and evaluate their benefits. Based on usage data collected from an enterprise cloud, and through simulation, we show that a reduction of 60% in service time is achievable.

Pradipta De, Manish Gupta, Manoj Soni, Aditya Thatte
Improving Scheduling Performance Using a Q-Learning-Based Leasing Policy for Clouds

Academic data centers commonly handle a major share of scientific computing. Depending on upcoming research projects, the user-generated workload may change. Especially in phases of high computational demand, it may be useful to temporarily extend the local site. This can be done by leasing computing resources from a cloud computing provider, e.g. Amazon EC2, to improve the service for the local user community. We present a reinforcement-learning-based policy which controls the maximum leasing size with regard to the current resource/workload state and the balance between scheduling benefits and costs in an online, adaptive fashion. Further, we provide an appropriate model to evaluate such policies and present heuristics to determine upper and lower reference values for the performance evaluation under the given model. Using event-driven simulation and real workload traces, we are able to investigate the dynamics of the learning policy and to demonstrate its adaptivity to workload changes. By showing its performance as a ratio between costs and scheduling improvement with regard to the upper and lower reference heuristics, we demonstrate the benefit of our concept.

Alexander Fölling, Matthias Hofmann
Impact of Variable Priced Cloud Resources on Scientific Workflow Scheduling

We analyze the problem of provisioning Cloud instances to large scientific workflows whose computational requirements are not met by the available Grid resources. We propose an extension to the dynamic critical path scheduling algorithm to deal with the general resource leasing model encountered in today’s commercial Clouds. We analyze the availability of the cheaper but unreliable Spot instances and study their potential to compensate for the unavailability of Grid resources for large workflow executions. Experimental results demonstrate that Spot instances represent a 60% cheaper but equally reliable alternative to Standard instances, provided that a correct user bid is made.

Simon Ostermann, Radu Prodan

Topic 7: Peer to Peer Computing

Topic 7: Peer to Peer Computing

Peer-to-peer (P2P) systems enable computers to share information and other resources with their networked peers in large-scale distributed computing environments. The resulting overlay networks are inherently decentralized, self-organizing, and self-coordinating. Well-designed P2P systems should be adaptive to peer arrivals and departures, resilient to failures, tolerant of network performance variations, and scalable to huge numbers of peers (tens of thousands to millions). As P2P research becomes more mature, new challenges emerge to support complex and heterogeneous decentralized environments for sharing and managing data, resources, and knowledge with highly dynamic and unpredictable usage patterns. This topic provides a forum for researchers to present new contributions to P2P systems, technologies, middleware, and applications that address key research issues and challenges.

Alberto Montresor, Evaggelia Pitoura, Anwitaman Datta, Spyros Voulgaris
ID-Replication for Structured Peer-to-Peer Systems

Structured overlay networks, like any distributed system, use replication to avoid losing data in the presence of failures. In this paper, we discuss the shortcomings of existing replication schemes and propose a replication technique called ID-Replication. ID-Replication allows different replication degrees for keys in the system, thus allowing popular data to have more copies. We discuss how ID-Replication is less sensitive to churn than existing replication schemes, which makes it better suited for building consistent services on top of overlays. Furthermore, we show why ID-Replication is simpler to load-balance and more secure than successor-list replication. We evaluate our scheme in detail and compare it with successor-list replication.

Tallat M. Shafaat, Bilal Ahmad, Seif Haridi
Changing the Unchoking Policy for an Enhanced Bittorrent

In this paper, we propose a novel optimistic unchoking approach for the BitTorrent protocol whose key objective is to improve the quality of inter-connections amongst peers. In turn, this yields enhanced data distribution without penalizing underutilized and/or idle peers. The suggested policy takes into consideration the number of peers currently interested in downloading from a client that is to be unchoked. Our conjecture is that clients having few peers interested in downloading data from them should be favored with optimistic unchoke intervals. This will enable the clients in question to receive data, since they become unchoked faster, and consequently they will trigger the interest of additional peers. In contrast, clients with plenty of “interested” peers should receive a lower priority to be selected as “planned optimistic unchoked”, as they likely have enough data to forward and have saturated their uplinks. In this context, we increase the aggregate probability that the swarm obtains a higher number of interested-in-cooperation and directly connected peers, leading to improved peer inter-connection. Experimental results indicate that our approach significantly outperforms the existing optimistic unchoking policy.

Vaggelis Atlidakis, Mema Roussopoulos, Alex Delis
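
A small sketch of the proposed selection rule (hypothetical code, not the modified client): the probability of being picked for a planned optimistic unchoke decreases with the number of peers already interested in the candidate.

```python
# Hypothetical sketch of the proposed unchoking policy (not the actual client
# change): favor choked peers that few others are interested in, so
# under-connected peers get a chance to obtain data and attract interest.
import random

def pick_optimistic_unchoke(choked_peers, interested_count):
    """choked_peers: candidate peer ids; interested_count[p]: number of peers
    currently interested in downloading from p."""
    weights = [1.0 / (1 + interested_count.get(p, 0)) for p in choked_peers]
    return random.choices(choked_peers, weights=weights, k=1)[0]

interest = {"p1": 9, "p2": 0, "p3": 1}
picks = [pick_optimistic_unchoke(["p1", "p2", "p3"], interest) for _ in range(1000)]
print({p: picks.count(p) for p in ("p1", "p2", "p3")})  # "p2" is favored
```
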
Peer-to-Peer Multi-class Boosting

We focus on the problem of data mining over large-scale fully distributed databases, where each node stores only one data record. We assume that a data record is never allowed to leave the node it is stored at. Possible motivations for this assumption include privacy or a lack of a centralized infrastructure. To tackle this problem, earlier we proposed the generic gossip learning framework (GoLF), but so far we have studied only basic linear algorithms. In this paper we implement the well-known boosting technique in GoLF. Boosting techniques have attracted growing attention in machine learning due to their outstanding performance in many practical applications. Here, we present an implementation of a boosting algorithm that is based on FilterBoost. Our main algorithmic contribution is a derivation of a pure online multi-class version of FilterBoost, so that it can be employed in GoLF. We also propose improvements to GoLF, with the aim of maximizing the diversity of the evolving models gossiped in the network, a feature that we show to be important. We evaluate the robustness and the convergence speed of the algorithm empirically over three benchmark databases. We compare the algorithm with the sequential AdaBoost algorithm and we test its performance in a failure scenario involving message drop and delay, and node churn.

István Hegedűs, Róbert Busa-Fekete, Róbert Ormándi, Márk Jelasity, Balázs Kégl

Topic 8: Distributed Systems and Algorithms

Topic 8: Distributed Systems and Algorithms

Distributed Computing becomes more and more crucial with the prevalence of technological advances that make Global Computing a reality in the modern world. Indeed, it is hard to imagine an application or computational activity that falls outside Distributed Computing. With the wide advent of distributed systems, we are faced with the real challenges of distributed computation: How do we cope with asynchrony and failures? How (and how well) do we achieve load balancing? How do we model and analyze malicious and selfish behavior? How do we address mobility, heterogeneity, and the dynamic nature of participating processes? What can we achieve in the presence of disconnecting operations that cause network partitioning?

Andrzej Goscinski, Marios Mavronicolas, Weisong Shi, Teo Yong Meng
Towards Load Balanced Distributed Transactional Memory

We consider the problem of implementing transactional memory in d-dimensional mesh networks. We present and analyze MultiBend, a novel load-balanced directory-based protocol, which is designed for the data-flow distributed implementation of software transactional memory. It supports three basic operations, publish, lookup, and move, on a shared object. A pleasing aspect of MultiBend is that it is load balanced (it minimizes maximum node and edge utilization), which is achieved by using paths of multiple bends in the mesh. This protocol guarantees an ${\cal O}(d^2\log n)$ approximation for the load and also for the distance stretch of move requests, where n is the number of nodes in the network. For fixed d, both the load and the move stretch are optimal within a constant and a loglog factor, respectively. It also guarantees an ${\cal O}(d^2)$ approximation for lookup requests, which is optimal within a constant factor for fixed d. To the best of our knowledge, this is the first distributed directory protocol that is load balanced.

Gokarna Sharma, Costas Busch
CUDA-For-Clusters: A System for Efficient Execution of CUDA Kernels on Multi-core Clusters

Rapid advancements in multi-core processor architectures coupled with low-cost, low-latency, high-bandwidth interconnects have made clusters of multi-core machines a common computing resource. Unfortunately, writing good parallel programs that efficiently utilize all the resources in such a cluster is still a major challenge. Various programming languages have been proposed as a solution to this problem, but are yet to be adopted widely to run performance-critical code mainly due to the relatively immature software framework and the effort involved in re-writing existing code in the new language. In this paper, we motivate and describe our initial study in exploring CUDA as a programming language for a cluster of multi-cores. We develop CUDA-For-Clusters (CFC), a framework that transparently orchestrates execution of CUDA kernels on a cluster of multi-core machines. The well-structured nature of a CUDA kernel, the growing popularity, support and stability of the CUDA software stack collectively make CUDA a good candidate to be considered as a programming language for a cluster. CFC uses a mixture of source-to-source compiler transformations, a work distribution runtime and a light-weight software distributed shared memory to manage parallel executions. Initial results on running several standard CUDA benchmark programs achieve impressive speedups of up to 7.5X on a cluster with 8 nodes, thereby opening up an interesting direction of research for further investigation.

Raghu Prabhakar, R. Govindarajan, Matthew J. Thazhuthaveetil
From a Store-Collect Object and Ω to Efficient Asynchronous Consensus

This paper presents an efficient algorithm that builds a consensus object. This algorithm is based on an Ω failure detector (to obtain consensus liveness) and a store-collect object (to maintain its safety). A store-collect object provides the processes with two operations: a store operation, which allows the invoking process to deposit a new value while discarding the value it previously deposited, and a collect operation, which returns to the invoking process a set of pairs (i, val), where val is the last value deposited by process $p_i$. A store-collect object has no sequential specification.

While store-collect objects have been used as base objects to design wait-free constructions of more sophisticated objects (such as snapshot or renaming objects), as far as we know they have not been explicitly used to build consensus objects. The proposed store-collect-based algorithm, which is round-based, has several noteworthy features. First, it uses a single store-collect object (and not an object per round). Second, during a round, a process invokes the store operation at most once, and the value val it deposits is a simple pair 〈r, v〉 where r is a round number and v a proposed value. Third, a process is directed to skip rounds according to its view of the current global state (thereby saving useless computation rounds). Finally, the algorithm benefits from the adaptive wait-free implementations that have been proposed for store-collect objects, namely, the number of shared memory accesses involved in a collect operation is O(k), where k is the number of processes that have invoked the store operation. This makes the proposed algorithm particularly efficient and interesting for multiprocess programs made up of asynchronous crash-prone processes that run on top of multicore architectures.

Michel Raynal, Julien Stainer
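
A toy shared-memory rendition of the store-collect interface described above (illustrative only; the actual object is a wait-free shared-memory construction, not a lock-based class):

```python
# Hypothetical sketch of the store-collect interface (not the paper's wait-free
# construction): store() overwrites the caller's previously deposited value,
# collect() returns the current set of (process, last value) pairs.
import threading

class StoreCollect:
    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()     # the real object is lock-free

    def store(self, process_id, value):
        with self._lock:
            self._values[process_id] = value   # discards the previous value

    def collect(self):
        with self._lock:
            return set(self._values.items())

sc = StoreCollect()
sc.store(1, (3, "a"))      # process 1 deposits the pair <round, proposal>
sc.store(2, (3, "b"))
sc.store(1, (4, "a"))      # overwrites process 1's earlier deposit
print(sc.collect())        # {(1, (4, 'a')), (2, (3, 'b'))}
```
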
An Investigation into the Performance of Reduction Algorithms under Load Imbalance

Today, most reduction algorithms are optimized for balanced workloads; they assume all processes will start the reduction at about the same time. However, in practice this is not always the case and significant load imbalances may occur and affect the performance of said algorithms. In this paper we investigate the impact of such imbalances on the most commonly employed reduction algorithms and propose a new algorithm specifically adapted to the presented context. Firstly, we analyze the optimistic case where we have a priori knowledge of all imbalances and propose a near-optimal solution. In the general case, where we do not have any foreknowledge of the imbalances, we propose a dynamically rebalanced tree reduction algorithm. We show experimentally that this algorithm performs better than the default OpenMPI and MVAPICH2 implementations.

Petar Marendić, Jan Lemeire, Tom Haber, Dean Vučinić, Peter Schelkens
Achieving Reliability in Master-Worker Computing via Evolutionary Dynamics

This work considers Internet-based task computations in which a master process assigns tasks, over the Internet, to rational workers and collects their responses. The objective is for the master to obtain the correct task outcomes. For this purpose, we formulate and study the dynamics of evolution of Internet-based master-worker computations through reinforcement learning.

Evgenia Christoforou, Antonio Fernández Anta, Chryssis Georgiou, Miguel A. Mosteiro, Angel (Anxo) Sánchez

Topic 9: Parallel and Distributed Programming

Topic 9: Parallel and Distributed Programming

This topic provides a forum for the presentation of the latest research results and practical experience in parallel and distributed programming in general, except for work specifically targeting multicore and manycore architectures, which has matured into a Euro-Par topic of its own.

Sergei Gorlatch, Rizos Sakellariou, Marco Danelutto, Thilo Kielmann
Dynamic Thread Mapping Based on Machine Learning for Transactional Memory Applications

Thread mapping is an appealing approach to efficiently exploit the potential of modern chip-multiprocessors. However, efficient thread mapping relies upon matching the behavior of an application with system characteristics. In particular, Software Transactional Memory (STM) introduces another dimension due to its runtime system support. In this work, we propose a dynamic thread mapping approach to automatically infer a suitable thread mapping strategy for transactional memory applications composed of multiple execution phases with potentially different transactional behavior in each phase. At runtime, it profiles the application at specific periods and consults a decision tree generated by a Machine Learning algorithm to decide if the current thread mapping strategy should be switched to a more adequate one. We implemented this approach in a state-of-the-art STM system, making it transparent to the user. Our results show that the proposed dynamic approach presents performance improvements up to 31% compared to the best static solution.
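The control loop behind such an approach can be pictured with a deliberately simplified Python sketch: profile the runtime periodically, feed the measurements to a decision tree, and switch mappings when the tree recommends it. The features, thresholds and mapping names below are hypothetical; the paper learns its decision tree offline with machine learning and works inside an STM runtime.

```python
import random
import time

def decide_mapping(abort_ratio, llc_miss_ratio):
    """Hypothetical hand-written decision tree; the paper learns one offline."""
    if abort_ratio > 0.3:
        return "compact"          # highly contended: keep threads on one socket
    if llc_miss_ratio > 0.2:
        return "scatter"          # memory-bound: spread threads across sockets
    return "round-robin"

def profile_stm():
    """Stand-in for reading STM runtime counters over the last period."""
    return {"abort_ratio": random.random(), "llc_miss_ratio": random.random()}

current = "round-robin"
for period in range(5):           # a real runtime would loop for the program's lifetime
    stats = profile_stm()
    wanted = decide_mapping(stats["abort_ratio"], stats["llc_miss_ratio"])
    if wanted != current:
        print(f"period {period}: switching mapping {current} -> {wanted}")
        current = wanted          # a real system would re-pin the threads here
    time.sleep(0.01)              # shortened profiling period
```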

Márcio Castro, Luís Fabrício Wanderley Góes, Luiz Gustavo Fernandes, Jean-François Méhaut
A Checkpoint-on-Failure Protocol for Algorithm-Based Recovery in Standard MPI

Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems hinder ubiquitous adoption of fault tolerance techniques: 1) traditional checkpoint-based approaches incur a steep overhead on failure-free operations, and 2) the dominant programming paradigm for parallel applications (the MPI standard) offers extremely limited support for software-level fault tolerance approaches. In this paper, we present an approach that relies exclusively on the features of a high quality implementation, as defined by the current MPI standard, to enable algorithm-based recovery, without incurring the overhead of customary periodic checkpointing. The validity and performance of this approach are evaluated on large scale systems, using the QR factorization as an example.

Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, George Bosilca, Jack Dongarra
Hierarchical Partitioning Algorithm for Scientific Computing on Highly Heterogeneous CPU + GPU Clusters

A hierarchical level of heterogeneity exists in many modern high performance clusters, in the form of heterogeneity between computing nodes and, within a node, with the addition of specialized accelerators such as GPUs. To achieve high performance of scientific applications on these platforms it is necessary to perform load balancing. In this paper we present a hierarchical matrix partitioning algorithm based on realistic performance models at each level of hierarchy. To minimise the total execution time of the application, it iteratively partitions a matrix between nodes and partitions these sub-matrices between the devices in a node. This is a self-adaptive algorithm that dynamically builds the performance models at run-time, and it employs an algorithm to minimise the total volume of communication. This algorithm allows scientific applications to perform load-balanced matrix operations with nested parallelism on hierarchical heterogeneous platforms. To show the effectiveness of the algorithm we applied it to a fundamental operation in scientific parallel computing, matrix multiplication. Large scale experiments on a heterogeneous multi-cluster site incorporating multicore CPUs and GPU nodes show that the presented algorithm outperforms current state of the art approaches and successfully load balances very large problems.
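The core idea of partitioning in proportion to measured performance can be sketched in a few lines of Python (the node and device speeds below are made up, and the paper’s algorithm additionally builds its performance models at run-time and minimises communication volume):

```python
def proportional_split(total, speeds):
    """Split `total` rows in proportion to `speeds`, keeping integer counts."""
    raw = [total * s / sum(speeds) for s in speeds]
    parts = [int(x) for x in raw]
    # hand out the rows lost to integer truncation, largest remainders first
    for i in sorted(range(len(raw)), key=lambda i: raw[i] - parts[i], reverse=True):
        if sum(parts) == total:
            break
        parts[i] += 1
    return parts

n_rows = 10000
node_speeds = [1.0, 2.5, 4.0]                              # relative node performance
device_speeds = {0: [1.0], 1: [1.0, 1.5], 2: [1.0, 3.0]}   # CPU/GPU speeds per node

node_rows = proportional_split(n_rows, node_speeds)        # level 1: between nodes
for node, rows in enumerate(node_rows):
    dev_rows = proportional_split(rows, device_speeds[node])   # level 2: within a node
    print(f"node {node}: {rows} rows -> devices {dev_rows}")
```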

David Clarke, Aleksandar Ilic, Alexey Lastovetsky, Leonel Sousa
Encapsulated Synchronization and Load-Balance in Heterogeneous Programming

Programming models and techniques to exploit parallelism in accelerators, such as GPUs, are different from those used in traditional parallel models for shared- or distributed-memory systems. It is a challenge to blend different programming models to coordinate and exploit devices with very different characteristics and computation powers. This paper presents a new extensible framework model to encapsulate run-time decisions related to data partition, granularity, load balance, synchronization, and communication for systems including assorted GPUs. Thus, the main parallel code becomes independent of them, using internal topology and system information to transparently adapt the computation to the system. The programmer can develop specific functions for each architecture, or use existing specialized library functions for different CPU-core or GPU architectures. The high-level coordination is expressed using a programming model built on top of message-passing, providing portability across distributed- or shared-memory systems. We show with an example how to produce parallel code that runs efficiently on systems ranging from a Beowulf cluster to a machine with mixed GPUs. Our experimental results show how the run-time system, guided by hints about the computational-power ratios of different devices, can automatically partition and distribute large computations across heterogeneous systems, improving the overall performance.

Yuri Torres, Arturo Gonzalez-Escribano, Diego Llanos
Transactional Access to Shared Memory in StarSs, a Task Based Programming Model

With an increase in the number of processors on a single chip, programming environments which facilitate the exploitation of parallelism on multicore architectures have become a necessity. StarSs is a task-based programming model that enables flexible and high-level programming. Although task synchronization in StarSs is based on data flow and dependency analysis, some applications (e.g. reductions) require locks to access shared data.

Transactional Memory is an alternative to lock-based synchronization for controlling access to shared data. In this paper we explore the idea of integrating a lightweight Software Transactional Memory (STM) library, TinySTM, into an implementation of StarSs (SMPSs). The SMPSs runtime and the compiler have been modified to include and use calls to the STM library. We evaluated this approach on four applications and observed better performance in applications with high lock contention.

Rahulkumar Gayatri, Rosa M. Badia, Eduard Ayguade, Mikel Luján, Ian Watson
On-the-Fly Task Execution for Speeding Up Pipelined MapReduce

The MapReduce programming model is widely acclaimed as a key solution to designing data-intensive applications. However, many of the computations that fit this model cannot be expressed as a single MapReduce execution, but require a more complex design. Such applications, consisting of multiple jobs chained into a long-running execution, are called pipeline MapReduce applications. Standard MapReduce frameworks are not optimized for the specific requirements of pipeline applications, yielding performance issues. In order to optimize the execution of pipelined MapReduce applications, we propose a mechanism for creating map tasks along the pipeline as soon as their input data becomes available. We implemented our approach in the Hadoop MapReduce framework. The benefits of our dynamic task scheduling are twofold: reducing job-completion time and increasing cluster utilization by involving more resources in the computation. Experimental evaluation performed on the Grid’5000 testbed shows that our approach delivers performance gains between 9% and 32%.

Diana Moise, Gabriel Antoniu, Luc Bougé
Assessing the Performance and Scalability of a Novel Multilevel K-Nomial Allgather on CORE-Direct Systems

In this paper, we propose a novel allgather algorithm, Reindexed Recursive K-ing (RRK), which leverages flexibility in the algorithm’s tree topology and its ability to make asynchronous progress, coupled with the Core-Direct communication offload capability, to optimize MPI_Allgather for Core-Direct enabled systems. In particular, the RRK introduces a reindexing scheme which ensures contiguous data transfers while adding only a single additional send and receive operation for any radix, k, or communicator size, N. This allows us to improve algorithm scalability by avoiding the use of a scatter/gather elements (SGE) list on InfiniBand networks. The implementation of the RRK algorithm and its evaluation show that it performs and scales well on Core-Direct systems for a wide range of message sizes and various communicator configurations.
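For orientation, the following Python simulation shows a plain recursive k-ing (k-nomial) allgather, the pattern the RRK algorithm builds on, assuming the number of processes is an exact power of the radix k; the paper’s reindexing for arbitrary communicator sizes and the Core-Direct offload are not modeled:

```python
def recursive_king_allgather(P, k):
    """Simulate a k-nomial allgather for P == k**m processes.
    data[p] is the set of ranks whose block process p currently holds."""
    assert k >= 2
    m = 0
    while k ** m < P:
        m += 1
    assert k ** m == P, "this simple sketch assumes P is an exact power of k"

    data = [{p} for p in range(P)]
    for r in range(m):                       # one round per base-k digit of the rank
        stride = k ** r
        new_data = [set(d) for d in data]
        for p in range(P):
            digit = (p // stride) % k
            group_base = p - digit * stride  # rank with digit r zeroed
            for j in range(k):               # exchange with the k-1 peers in the digit group
                peer = group_base + j * stride
                new_data[p] |= data[peer]
        data = new_data
    return data

result = recursive_king_allgather(P=27, k=3)
assert all(blocks == set(range(27)) for blocks in result)
print("27 processes, radix 3: every rank holds all 27 blocks after 3 rounds")
```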

Joshua S. Ladd, Manjunath Gorentla Venkata, Richard Graham, Pavel Shamis

Topic 10: Parallel Numerical Algorithms

Topic 10: Parallel Numerical Algorithms

The solution of large-scale problems in Computational Science and Engineering relies on the availability of accurate, robust and efficient numerical algorithms and software that are able to exploit the power offered by modern computer architectures. Such algorithms and software provide building blocks for prototyping and developing novel applications, and for improving existing ones, by relieving the developers from details concerning numerical methods as well as their implementation in new computing environments.

Iain Duff, Efstratios Gallopoulos, Daniela di Serafino, Bora Ucar
Avoiding Communication through a Multilevel LU Factorization

Due to the evolution of massively parallel computers towards deeper levels of parallelism and memory hierarchy, and due to the exponentially increasing ratio of the time required to transfer data, either through the memory hierarchy or between different compute units, to the time required to compute floating point operations, algorithms are confronted with two challenges. They need not only to exploit multiple levels of parallelism, but also to reduce the communication between the compute units at each level of the hierarchy of parallelism and between the different levels of the memory hierarchy.

In this paper we present an algorithm for performing the LU factorization of dense matrices that is suitable for computer systems with two levels of parallelism. This algorithm is able to minimize both the volume of communication and the number of messages transferred at every level of the two-level hierarchy of parallelism. We present its implementation for a cluster of multicore processors based on MPI and Pthreads. We show that this implementation leads to better performance than routines implementing the LU factorization in well-known numerical libraries. For matrices that are tall and skinny, that is, with many more rows than columns, our algorithm outperforms the corresponding algorithm from ScaLAPACK by a factor of 4.5 on a cluster of 32 nodes, each node having two quad-core Intel Xeon EMT64 processors.

Simplice Donfack, Laura Grigori, Amal Khabou
Locality Improvement of Data-Parallel Adams–Bashforth Methods through Block-Based Pipelining of Time Steps

Adams–Bashforth methods are a well-known class of explicit linear multi-step methods for the solution of initial value problems of ordinary differential equations. This article discusses different data-parallel implementation variants with different loop structures and communication patterns and compares the resulting locality and scalability. In particular, pipelining of time steps is employed to improve the locality of memory references. The comparison is based on detailed runtime experiments performed on parallel computer systems with different architectures, including the two supercomputer systems JUROPA and HLRB II.
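As background on the method class itself (standard textbook material, not taken from the paper), the two-step Adams–Bashforth method advances y' = f(t, y) by y_{n+1} = y_n + h(3/2 f(t_n, y_n) - 1/2 f(t_{n-1}, y_{n-1})); a minimal sequential Python version looks as follows:

```python
def adams_bashforth2(f, t0, y0, h, steps):
    """Two-step Adams-Bashforth: y_{n+1} = y_n + h*(3/2*f_n - 1/2*f_{n-1}).
    The very first step is bootstrapped with explicit Euler."""
    t, y = t0, y0
    f_prev = f(t, y)
    y = y + h * f_prev              # Euler bootstrap for y_1
    t += h
    for _ in range(steps - 1):
        f_curr = f(t, y)
        y = y + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t += h
    return t, y

# y' = -y, y(0) = 1  ->  exact solution y(t) = exp(-t)
t_end, y_end = adams_bashforth2(lambda t, y: -y, 0.0, 1.0, h=0.01, steps=100)
print(t_end, y_end)                 # approximately 1.0, 0.3679
```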

Matthias Korch
Parallel SOR for Solving the Convection Diffusion Equation Using GPUs with CUDA

In this paper we study a parallel form of the SOR method for the numerical solution of the Convection Diffusion equation suitable for GPUs using CUDA. To exploit the parallelism offered by GPUs we consider the fine-grain parallelism model. This is achieved by considering the local relaxation version of SOR. More specifically, we use SOR with red-black ordering with two sets of parameters ω_ij and ω′_ij. The parameter ω_ij is associated with each red (i+j even) grid point (i, j), whereas the parameter ω′_ij is associated with each black (i+j odd) grid point (i, j). The use of a parameter for each grid point avoids the global communication required in the adaptive determination of the best value of ω and also increases the convergence rate of the SOR method [3]. We present our strategy and the results of our effort to exploit the computational capabilities of GPUs under the CUDA environment. Additionally, a program for the CPU was developed as a performance reference. Significant performance improvement was achieved with the three developed GPU kernel variations, which proved to have different pros and cons.
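A minimal NumPy sketch of one red-black SOR sweep with a per-point relaxation parameter shows why the two colours can be updated in parallel (points of one colour have no neighbours of the same colour). It uses the plain 5-point Laplace stencil and a constant ω array for simplicity; the convection terms and the local adaptive choice of ω_ij from the paper are omitted:

```python
import numpy as np

def red_black_sor_sweep(u, omega_red, omega_black):
    """One red-black SOR sweep for the 5-point Laplace stencil on the interior
    of u (Dirichlet boundary kept fixed).  omega_* hold one parameter per point.
    Within a colour, updates are independent, which is what makes the method
    suitable for fine-grain GPU parallelism."""
    n, m = u.shape
    for colour, omega in ((0, omega_red), (1, omega_black)):   # red: i+j even
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                if (i + j) % 2 != colour:
                    continue
                gauss_seidel = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                       u[i, j - 1] + u[i, j + 1])
                u[i, j] = (1.0 - omega[i, j]) * u[i, j] + omega[i, j] * gauss_seidel
    return u

n = 32
u = np.zeros((n, n))
u[0, :] = 1.0                          # hot top boundary
omega = np.full((n, n), 1.5)           # constant omega, for illustration only
for _ in range(200):
    red_black_sor_sweep(u, omega, omega)
print(u[n // 2, n // 2])               # interior value after 200 sweeps
```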

Yiannis Cotronis, Elias Konstantinidis, Maria A. Louka, Nikolaos M. Missirlis

Topic 11: Multicore and Manycore Programming

Topic 11: Multicore and Manycore Programming

Modern multicore and manycore systems enjoy the benefits of technology scaling and promise impressive performance. However, harvesting this potential is not straightforward. While multicore and manycore processors alleviate several problems that are related to single-core processors – known as the memory-, power-, or instruction-level parallelism-wall – they raise the issue of programmability and programming effort. This topic focuses on novel solutions for multicore and manycore programmability and efficient programming in the context of general-purpose systems.

Eduard Ayguade, Dionisios Pnevmatikatos, Rudolf Eigenmann, Mikel Luján, Sabri Pllana
Efficient Support for In-Place Metadata in Transactional Memory

Implementations of Software Transactional Memory (STM) algorithms associate metadata with the memory locations accessed during a transaction’s lifetime. This metadata may be stored either in-place, by wrapping every memory cell in a container that includes the memory cell itself and the corresponding metadata; or out-place (also called external), by resorting to a mapping function that associates the memory cell address with an external table entry containing the corresponding metadata. The implementation techniques for these two approaches are very different, and each STM framework is usually biased towards one of them, only allowing the efficient implementation of STM algorithms following that approach and hence inhibiting a fair comparison with STM algorithms falling into the other approach. In this paper we introduce a technique to implement in-place metadata that does not wrap memory cells, thus overcoming the bias by allowing STM algorithms to directly access the transactional metadata. The proposed technique is available as an extension to the DeuceSTM framework, and enables the efficient implementation of a wide range of STM algorithms and their fair (unbiased) comparison in a common STM infrastructure. We illustrate the benefits of our approach by analyzing its impact on two popular TM algorithms, TL2 and multi-versioning (biased towards out-place and in-place metadata, respectively), under two different transactional workloads.

Ricardo J. Dias, Tiago M. Vale, João M. Lourenço
Folding of Tagged Single Assignment Values for Memory-Efficient Parallelism

The dynamic-single-assignment property for shared data accesses can establish data race freedom and determinism in parallel programs. However, memory management is a well known challenge in making dynamic-single-assignment practical, especially when objects can be accessed through tags that can be computed by any step.

In this paper, we propose a new memory management approach based on user-specified folding functions that map logical dynamic-single-assignment (DSA) tags into dynamic-multiple-assignment (DMA) tags. We also compare folding with get-counts, an approach in which the user specifies a reference count for each single-assignment value. The context for our work is parallel programming models in which shared data accesses are coordinated by put/get operations on tagged DSA data structures. These models include dataflow programs with I-structures, functional subsets of parallel programs based on tuple spaces (notably, Linda), and programs written in the Concurrent Collections (CnC) coordination language. Our conclusion, based on experimental evaluation of five CnC programs, is that folding and get-counts can offer significant memory efficiency improvements, and that folding can handle cases that get-counts cannot.
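A tiny Python sketch (a hypothetical API, not CnC’s) may help picture what a folding function buys: put/get operate on logical single-assignment tags, but storage is indexed by the folded tag, so a bounded window of slots is reused once the fold says older values can be overwritten.

```python
class FoldedItemCollection:
    """Put/get on logical single-assignment (DSA) tags; storage is indexed by
    the folded (DMA) tag, so only a window of values is kept in memory."""
    def __init__(self, fold):
        self.fold = fold          # maps a logical DSA tag -> a reusable DMA slot
        self.slots = {}           # DMA slot -> (logical tag, value)

    def put(self, tag, value):
        slot = self.fold(tag)
        self.slots[slot] = (tag, value)     # overwrites the slot's previous value

    def get(self, tag):
        slot = self.fold(tag)
        stored_tag, value = self.slots[slot]
        if stored_tag != tag:
            raise KeyError(f"value for tag {tag} was already folded away")
        return value

# A stencil-like pipeline that only ever reads the last two time steps can
# fold the (step, i) tag onto step % 2, keeping memory usage bounded:
items = FoldedItemCollection(fold=lambda tag: (tag[0] % 2, tag[1]))
items.put((0, 5), 1.0)
items.put((1, 5), 2.0)
print(items.get((1, 5)))     # 2.0
items.put((2, 5), 3.0)       # reuses the slot previously holding step 0
```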

Dragoş Sbîrlea, Kathleen Knobe, Vivek Sarkar
High-Level Support for Pipeline Parallelism on Many-Core Architectures

With the increasing diversity of many-core architectures, the challenges of parallel programming and code portability will sharply rise. The EU project PEPPHER addresses these issues with a component-based approach to application development on top of a task-parallel execution model. Central to this approach are multi-architectural components which encapsulate different implementation variants of application functionality tailored for different core types. An intelligent runtime system selects and dynamically schedules component implementation variants for efficient parallel execution on heterogeneous many-core architectures. On top of this model we have developed language, compiler and runtime support for a specific class of applications that can be expressed using the pipeline pattern. We propose C/C++ language annotations for specifying pipeline patterns and describe the associated compilation and runtime infrastructure. Experimental results indicate that with our high-level approach performance comparable to manual parallelization can be achieved.

Siegfried Benkner, Enes Bajrovic, Erich Marth, Martin Sandrieser, Raymond Namyst, Samuel Thibault
Node.Scala: Implicit Parallel Programming for High-Performance Web Services

Event-driven programming frameworks such as Node.JS have recently emerged as a promising option for Web service development. Such frameworks feature a simple programming model with implicit parallelism and asynchronous I/O. The benefits of the event-based programming model in terms of concurrency management need to be balanced against its limitations in terms of scalability on multicore architectures and against the impossibility of sharing a common memory space between multiple Node.JS processes. In this paper we present Node.Scala, an event-based programming framework for the JVM which overcomes the limitations of current event-driven frameworks. Node.Scala introduces safe stateful programming for event-based services. The programming model of Node.Scala allows threads to safely share state in a standard event-based programming model. The runtime system of Node.Scala automatically parallelizes and synchronizes state access to guarantee correctness. Experiments show that services developed in Node.Scala yield linear scalability and high throughput when deployed on multicore machines.

Daniele Bonetta, Danilo Ansaloni, Achille Peternier, Cesare Pautasso, Walter Binder
Task-Parallel Programming on NUMA Architectures

The multicore era has led to a renaissance of shared memory parallel programming models. Moreover, the introduction of task-level parallelization raises the level of abstraction compared to thread-centric expression of parallelism. However, tasks might exhibit poor performance on NUMA systems if locality cannot be controlled and non-local data is accessed.

This work investigates various approaches to expressing task parallelism using the OpenMP tasking model, from a programmer’s point of view. We describe and compare task creation strategies and devise methods to preserve locality on NUMA architectures while optimizing the degree of parallelism. Our proposals are evaluated on reasonably large NUMA systems with both important application kernels and real-world simulation codes.

Christian Terboven, Dirk Schmidl, Tim Cramer, Dieter an Mey
Speeding Up OpenMP Tasking

In this work we present a highly efficient implementation of OpenMP tasks. It is based on a runtime infrastructure architected for data locality, a crucial prerequisite for exploiting the NUMA nature of modern multicore multiprocessors. In addition, we employ fast work-stealing structures, based on a novel, efficient and fair blocking algorithm. Synthetic benchmarks show up to a 6-fold increase in throughput (tasks completed per second), while for a task-based OpenMP application suite we measured up to 87% reduction in execution times, as compared to other OpenMP implementations.

Spiros N. Agathos, Nikolaos D. Kallimanis, Vassilios V. Dimakopoulos
An Efficient Unbounded Lock-Free Queue for Multi-core Systems

The use of efficient synchronization mechanisms is crucial for implementing fine grained parallel programs on modern shared cache multi-core architectures. In this paper we study this problem by considering Single-Producer/Single-Consumer (SPSC) coordination using unbounded queues. A novel unbounded SPSC algorithm capable of reducing the raw synchronization latency and speeding up Producer-Consumer coordination is presented. The algorithm has been extensively tested on a shared-cache multi-core platform and a sketch proof of correctness is presented. The queues proposed have been used as basic building blocks to implement the FastFlow parallel framework, which has been demonstrated to offer very good performance for fine-grain parallel applications.
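The essence of an SPSC queue can be sketched in Python as an unbounded linked list in which the producer only ever touches the tail and the consumer only the head (an illustration only: Python’s GIL hides the memory-ordering and cache-line issues that a real lock-free implementation such as FastFlow’s must handle, and this is not the paper’s algorithm):

```python
import threading

class _Node:
    __slots__ = ("value", "next")
    def __init__(self, value=None):
        self.value = value
        self.next = None

class SPSCQueue:
    """Unbounded SPSC queue: a dummy node separates the producer-owned tail
    from the consumer-owned head, so no locks are needed as long as there is
    exactly one producer thread and one consumer thread."""
    def __init__(self):
        dummy = _Node()
        self.head = dummy        # touched only by the consumer
        self.tail = dummy        # touched only by the producer

    def push(self, value):       # called by the single producer only
        node = _Node(value)
        self.tail.next = node
        self.tail = node

    def pop(self):               # called by the single consumer only
        nxt = self.head.next
        if nxt is None:
            return None          # queue currently empty
        self.head = nxt
        return nxt.value

q = SPSCQueue()
consumed = []
def consumer():
    while len(consumed) < 1000:
        item = q.pop()
        if item is not None:
            consumed.append(item)

t = threading.Thread(target=consumer)
t.start()
for i in range(1000):
    q.push(i)
t.join()
assert consumed == list(range(1000))   # FIFO order preserved
```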

Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, Massimo Torquati

Topic 12: Theory and Algorithms for Parallel Computation

Topic 12: Theory and Algorithms for Parallel Computation

Parallelism permeates all levels of current computing systems, from single CPU machines, to large server farms, to geographically dispersed “volunteers” who collaborate over the Internet. The effective use of parallelism depends crucially on the availability of faithful, yet tractable, models of computation for algorithm design and analysis, and on efficient strategies for solving key computational problems on prominent classes of computing platforms. No less important are good models of the way the different components/subsystems of a platform are interconnected. With the development of new genres of computing platforms, such as multicore parallel machines, desktop grids, clouds, and hybrid GPU/CPU-based systems, new models and paradigms are needed that will allow parallel programming to advance into mainstream computing. Topic 12 focuses on contributions providing new results on foundational issues regarding parallelism in computing, and/or proposing improved approaches to the solution of specific algorithmic problems.

Geppino Pucci, Christos Zaroliagis, Kieran T. Herley, Henning Meyerhenke
A Lower Bound Technique for Communication on BSP with Application to the FFT

Communication complexity is defined, within the Bulk Synchronous Parallel (BSP) model of computation, as the sum of the degrees of all the supersteps. A lower bound to the communication complexity is derived for a given class of DAG computations in terms of the switching potential of a DAG, that is, the number of permutations that the DAG can realize when viewed as a switching network. The proposed technique yields a novel and tight lower bound for the FFT graph.

Gianfranco Bilardi, Michele Scquizzato, Francesco Silvestri
A Fast Parallel Algorithm for Minimum-Cost Small Integral Flows

We present a new approach to the minimum-cost integral flow problem for small values of the flow. It reduces the problem to the tests of simple multi-variable polynomials over a finite field of characteristic two for non-identity with zero. In effect, we show that a minimum-cost flow of value k in a network with n vertices, a sink and a source, integral edge capacities and positive integral edge costs polynomially bounded in n can be found by a randomized PRAM, with errors of exponentially small probability in n, running in O(k log(kn) + log²(kn)) time and using 2^k (kn)^O(1) processors. Thus, in particular, for the minimum-cost flow of value O(log n), we obtain an RNC² algorithm.

Andrzej Lingas, Mia Persson

Topic 13: High Performance Network and Communication

Topic 13: High Performance Network and Communication

This topic on

High-Performance Network and Communication

is devoted to communication issues in scalable compute and storage systems, such as parallel computers, networks of workstations, and clusters. All aspects of communication in modern systems were solicited, including advances in the design, implementation, and evaluation of interconnection networks, network interfaces, system and storage area networks, on-chip interconnects, communication protocols, routing and communication algorithms, and communication aspects of parallel and distributed algorithms.

Chris Develder, Emmanouel Varvarigos, Admela Jukan, Dimitra Simeonidou
Topology Configuration in Hybrid EPS/OCS Interconnects

We consider a hybrid Electronic Packet Switched (EPS) and Optical Circuit Switched (OCS) interconnection network (IN) for future HPC and DC systems. Given the point-to-point communication graph of an application, we present a heuristic algorithm that partitions logical parallel tasks to compute resources and configures the (re-configurable) optical part of the hybrid IN to efficiently serve point-to-point communication. We measure the performance of a hybrid IN employing the proposed algorithm using real workloads, as well as extrapolated traffic, and compare it against application mapping on conventional fixed, electronic-only INs based on toroidal topologies.

Konstantinos Christodoulopoulos, Marco Ruffini, Donal O’Mahony, Kostas Katrinis
Towards an Efficient Fat–Tree like Topology

Topology and routing algorithm are two key design parameters for interconnection networks. They largely determine the performance achieved by the network, as well as its complexity and cost. Many of the commodity interconnects for clusters are based on the fat-tree topology, which allows both a rich interconnection among the nodes of the network and the use of adaptive routing. In this paper, we analyze how the routing algorithm affects the complexity of the switch, and, considering this, we also propose and analyze some extensions of the fat-tree topology to take advantage of the available hardware resources. We analyze not only the impact of these extensions on performance but also their influence on switch complexity and cost.

D. Bermúdez Garzón, C. Gómez, M. E. Gómez, P. López, J. Duato
An Adaptive, Scalable, and Portable Technique for Speeding Up MPI-Based Applications

This paper presents a portable optimization for MPI communications, called PRAcTICaL-MPI (Portable Adaptive Compression Library-MPI). PRAcTICaL-MPI reduces the data volume exchanged among processes by using lossless compression and offers two main advantages. Firstly, it is independent of the MPI implementation and the application used. Secondly, it allows for turning the compression on and off and selecting the most appropriate compression algorithm at run-time, depending on the characteristics of each message and on network performance.

We have validated PRAcTICaL-MPI in different MPI implementations and HPC clusters. The evaluation shows that by compressing MPI messages with the best algorithm, and only when it is worthwhile, we obtain a great reduction in the overall execution time for many of the scenarios considered.
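The “only when it is worthwhile” decision can be illustrated with a toy Python function that compares an estimated raw transfer time against compression time plus compressed transfer time (zlib here, a made-up bandwidth figure, and a far cruder rule than PRAcTICaL-MPI’s run-time heuristics):

```python
import time
import zlib

def send_maybe_compressed(payload: bytes, bandwidth_bytes_per_s: float):
    """Return the payload to put on the wire, choosing compression only if
    (compression time + compressed transfer) beats the raw transfer time."""
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level=1)   # fast level, suited to messaging
    compress_time = time.perf_counter() - t0

    raw_transfer = len(payload) / bandwidth_bytes_per_s
    comp_transfer = compress_time + len(compressed) / bandwidth_bytes_per_s
    if comp_transfer < raw_transfer:
        return b"Z" + compressed                   # 1-byte header: compressed
    return b"R" + payload                          # 1-byte header: raw

# A highly compressible message on a moderate link: compression pays off.
msg = b"0.000000 " * 100000
wire = send_maybe_compressed(msg, bandwidth_bytes_per_s=100e6)
print(wire[:1], len(msg), "->", len(wire) - 1)
```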

Rosa Filgueira, Malcolm Atkinson, Alberto Nuñez, Javier Fernández
Cost-Effective Contention Avoidance in a CMP with Shared Memory Controllers

Efficient CMP utilisation requires virtualisation. This forces multiple applications to contend for the same network resources and memory bandwidth. In this paper we study the cause and effect of network congestion with respect to traffic local to the applications, and traffic caused by memory access. This reveals that applications close to the memory controller suffer because of congestion caused by memory controller traffic from other applications. We present a simple mechanism to reduce head-of-line blocking in the switches, which efficiently reduces network congestion, increases network performance, and evens out the performance differences between the CMP applications.

Samuel Rodrigo, Frank Olaf Sem-Jacobsen, Hervé Tatenguem, Tor Skeie, Davide Bertozzi

Topic 14: Mobile and Ubiquitous Computing

Topic 14: Mobile and Ubiquitous Computing

The tremendous advances in wireless networks, mobile computing, and sensor networks, along with the rapid growth of small, portable and powerful computing devices, offer more and more opportunities for pervasive computing and communications. This topic deals with cutting-edge research in various aspects related to the theory and practice of mobile computing or wireless and mobile networking. These aspects include architectures, algorithms, networks, protocols, modeling and performance issues, data management, and novel applications and services. The aim of this topic is to bring together computer scientists and engineers from both academia and industry working in this exciting and emerging area of pervasive computing and communications, to share their ideas and results with their peers.

Paolo Santi, Sotiris Nikoletseas, Cecilia Mascolo, Thiemo Voigt
Watershed-Based Clustering for Energy Efficient Data Gathering in Wireless Sensor Networks with Mobile Collector

This paper presents a clustering protocol combined with a mobile sink (MS) solution for efficient data gathering in wireless sensor networks (WSNs). The main insight for the cluster creation method is drawn from the image processing field, namely from the watershed transformation which is widely used for image segmentation. The proposed algorithm creates multi-hop clusters whose clusterheads (CHs), as well as all cluster members near the CHs, have high energy reserves. As these are exactly the nodes most burdened with relaying data from other cluster members, the higher levels of available energy at these nodes ultimately prolong the network lifetime. After cluster creation, a MS periodically visits each CH and collects the data from cluster members already gathered at the CH. Simulation results show the higher performance of the proposed scheme in comparison to other competing approaches from the literature.

Charalampos Konstantopoulos, Basilis Mamalis, Grammati Pantziou, Vasileios Thanasias
Distribution of Liveness Property Connectivity Interval in Selected Mobility Models of Wireless Ad Hoc Networks

The ad hoc network liveness property disallows permanent partitioning by requiring (informally) that, from each time moment, reliable direct connectivity must emerge between some nodes from every (non-empty) subset of hosts and its complementary set within some finite, but unknown, connectivity time interval I. An analysis of the connectivity interval is important because its finite values legitimise the liveness property assumption. Moreover, since the connectivity interval constitutes a crucial factor of message dissemination time in ad hoc networks, its distribution significantly affects the efficiency of all protocols based on the liveness property. Therefore, in this paper, we present the distribution of the connectivity interval determined experimentally by simulation of several entity and group mobility models and real-life GPS traces of mobile nodes. We also conduct a statistical analysis of the obtained results and show how the connectivity interval correlates with other network parameters.

Jerzy Brzeziński, Michał Kalewski, Marcin Kosiba, Marek Libuda

Topic 15: High Performance and Scientific Applications

Topic 15: High Performance and Scientific Applications

Many fields of science and engineering are characterized by an increasing demand for computational resources. Coupled with important algorithmic advances, high performance computing allows for the gain of new results and insights by utilizing powerful computers and big storage systems. Workflows in science and engineering produce huge amounts of data through numerical simulations and derive new knowledge by a subsequent analysis of this data. Progress in these fields depends on the availability of HPC environments, from medium sized teraflops systems up to leading petaflops systems.

Thomas Ludwig, Costas Bekas, Alice Koniges, Kengo Nakajima
Memory-Access Optimization of Parallel Molecular Dynamics Simulation via Dynamic Data Reordering

Dynamic irregular applications such as molecular dynamics (MD) simulation often suffer considerable performance deterioration during execution. To address this problem, an optimal data-reordering schedule has been developed for runtime memory-access optimization of MD simulations on parallel computers. Analysis of the memory-access penalty during MD simulations shows that the performance improvement from computation and data reordering degrades gradually as data translation lookaside buffer misses increase. We have also found correlations between the performance degradation and physical properties such as the simulated temperature, as well as computational parameters such as the spatial-decomposition granularity. Based on a performance model and pre-profiling of data fragmentation behaviors, we have developed an optimal runtime data-reordering schedule, thereby achieving speedups of 1.35, 1.36 and 1.28, respectively, for MD simulations of silica at temperatures of 300 K, 3,000 K and 6,000 K.

Manaschai Kunaseth, Ken-ichi Nomura, Hikmet Dursun, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta
On Analyzing Quality of Data Influences on Performance of Finite Elements Driven Computational Simulations

For multi-scale simulations, the quality of the input data as well as the quality of algorithms and computing environments will strongly impact the intermediate results, the final outcome, and the performance of the simulation. To date, little attention has been paid to understanding the impact of quality of data (QoD) on such multi-scale simulations. In this paper, we present a critical analysis of how QoD influences the results and performance of basic simulation building blocks for multi-scale simulations. We analyze the impact of QoD for Finite Element Method (FEM) based simulation building blocks, and study the dependencies between the QoD of input data and results as well as the performance of the simulation. We devise and implement novel QoD metrics for data intensive, FEM-based simulations and show experiments with real-world applications by demonstrating how QoD metrics can be efficiently used to control and tune the execution of FEM-based simulations at runtime.

Michael Reiter, Hong-Linh Truong, Schahram Dustdar, Dimka Karastoyanova, Robert Krause, Frank Leymann, Dieter Pahr
Performance Evaluation and Optimization of Nested High Resolution Weather Simulations

Weather models with high spatial and temporal resolutions are required for accurate prediction of meso-micro scale weather phenomena. Using these models for operational purposes requires forecasts with sufficient lead time, which in turn calls for large computational power. There exist many prior studies on the performance of weather models on single domain simulations with a uniform horizontal resolution. However, there has not been much work on high resolution nested domains that are essential for high-fidelity weather forecasts.

In this paper, we focus on improving and analyzing the performance of nested domain simulations using WRF on IBM Blue Gene/P. We demonstrate a significant reduction (up to 29%) in runtime via a combination of compiler optimizations, mapping of process topology to the physical torus topology, overlapping communication with computation, and parallel communications along torus dimensions. We also conduct a detailed performance evaluation using four nested domain configurations to assess the benefits of the different optimizations as well as the scalability of different WRF operations. Our analysis indicates that the choice of nesting configuration is critical for good performance. To aid WRF practitioners in making this choice, we describe a performance modeling approach that can predict the total simulation time in terms of the domain and processor configurations with a very high accuracy (< 8%) using a regression-based model learned from empirical timing data.
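As a generic illustration of fitting such a regression-based timing model (the features and the synthetic measurements below are invented and are not the WRF model from the paper), a NumPy least-squares fit might look like this:

```python
import numpy as np

# Hypothetical measurements: (grid points, processes) -> runtime in seconds
grid_points = np.array([1e6, 1e6, 4e6, 4e6, 16e6, 16e6])
procs       = np.array([256,  1024, 256,  1024, 1024, 4096])
runtime     = np.array([120., 40.,  470., 130., 520., 150.])

# Assumed model form: time ~ a * (grid_points / procs) + b * log2(procs) + c
X = np.column_stack([grid_points / procs, np.log2(procs), np.ones_like(procs)])
coeffs, *_ = np.linalg.lstsq(X, runtime, rcond=None)

def predict(points, p):
    """Predicted runtime for an unseen (domain size, process count) pair."""
    return coeffs @ np.array([points / p, np.log2(p), 1.0])

print(predict(8e6, 2048))
```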

Preeti Malakar, Vaibhav Saxena, Thomas George, Rashmi Mittal, Sameer Kumar, Abdul Ghani Naim, Saiful Azmi bin Hj Husain
Optimized Hybrid Parallel Lattice Boltzmann Fluid Flow Simulations on Complex Geometries

Computational fluid dynamics (CFD) has become more and more important in the last decades, accelerating research in many different areas for a variety of applications. In this paper, we present an optimized hybrid parallelization strategy capable of solving large-scale fluid flow problems on complex computational domains. The approach relies on the combination of lattice Boltzmann methods (LBM) for the fluid flow simulation, octree data structures for a sparse block-wise representation and decomposition of the geometry, as well as graph partitioning methods optimizing load balance and communication costs. The approach is realized in the framework of the open source library OpenLB and evaluated for the simulation of respiration in a subpart of a human lung. The efficiency gains are discussed by comparing the results of the fully optimized approach with those of simpler approaches realized previously.

Jonas Fietz, Mathias J. Krause, Christian Schulz, Peter Sanders, Vincent Heuveline
Topology-Aware Mappings for Large-Scale Eigenvalue Problems

Obtaining highly accurate predictions for properties of light atomic nuclei using the Configuration Interaction (CI) approach requires computing the lowest eigenvalues and associated eigenvectors of a large many-body nuclear Hamiltonian matrix, Ĥ. Since Ĥ is a large sparse matrix, a parallel iterative eigensolver designed for multi-core clusters is used. Due to the extremely large size of Ĥ, thousands of compute nodes are required. Communication overhead may hinder the scalability of the eigensolver at such scales. In this paper, we discuss how to reduce such overhead. In particular, we quantitatively show that topology-aware mapping of computational tasks to physical processors on large-scale multi-core clusters may have a significant impact on efficiency. For typical large-scale eigenvalue calculations, we obtain up to a factor of 2.5 improvement in overall performance by using a topology-aware mapping.

Hasan Metin Aktulga, Chao Yang, Esmond G. Ng, Pieter Maris, James P. Vary
Fast and Effective Lossy Compression Algorithms for Scientific Datasets

This paper focuses on developing effective and efficient algorithms for compressing scientific simulation data computed on structured and unstructured grids. A paradigm for lossy compression of this data is proposed in which the data computed on the grid is modeled as a graph, which gets decomposed into sets of vertices which satisfy a user defined error constraint ε. Each set of vertices is replaced by a constant value with reconstruction error bounded by ε. A comprehensive set of experiments is conducted by comparing these algorithms and other state-of-the-art scientific data compression methods. Over our benchmark suite, our methods obtained compression of 1% of the original size with average PSNR of 43.00 and 3% of the original size with average PSNR of 63.30. In addition, our schemes outperform other state-of-the-art lossy compression approaches and require on the average 25% of the space required by them for similar or better PSNR levels.
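The error-bounded idea is easy to picture in one dimension (a path graph): grow a run of values while a single constant can still reconstruct every sample within ε, then emit that constant. The Python sketch below illustrates only this greedy special case, not the paper’s graph decomposition or its actual compression format:

```python
def constant_piece_compress(values, eps):
    """Greedy 1-D version: split `values` into runs whose midrange constant
    reconstructs every sample within +/- eps.  Returns (count, constant) pairs."""
    pieces = []
    lo = hi = values[0]
    count = 1
    for v in values[1:]:
        new_lo, new_hi = min(lo, v), max(hi, v)
        if new_hi - new_lo <= 2 * eps:        # midrange still within eps of all samples
            lo, hi, count = new_lo, new_hi, count + 1
        else:
            pieces.append((count, (lo + hi) / 2))
            lo = hi = v
            count = 1
    pieces.append((count, (lo + hi) / 2))
    return pieces

def decompress(pieces):
    out = []
    for count, const in pieces:
        out.extend([const] * count)
    return out

data = [0.00, 0.01, 0.02, 0.50, 0.52, 0.51, 1.00]
pieces = constant_piece_compress(data, eps=0.05)
recon = decompress(pieces)
assert all(abs(a - b) <= 0.05 for a, b in zip(data, recon))
print(pieces)   # [(3, 0.01), (3, 0.51), (1, 1.0)]
```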

Jeremy Iverson, Chandrika Kamath, George Karypis

Topic 16: GPU and Accelerators Computing

Topic 16: GPU and Accelerators Computing

Accelerator-based computing systems invest significant fractions of hardware real estate to execute critical computation with vastly higher efficiency than general-purpose CPUs. Amdahl’s Law of the Multi-core Era suggests that such a heterogeneous approach to parallel computing is bound to deliver better scalability and power-efficiency than homogeneous system scaling. While General Purpose Graphics Processing Units (GPGPUs) have catalyzed research in this area, new ideas emerge to help us model, deconstruct and analyze the performance of accelerators, develop new standards for programming accelerators at a high level of abstraction, and port end-to-end applications to accelerator-based systems. Topic 16 provides a forum to discuss advances in all aspects of GPU- and accelerator-based computing.

Alex Ramirez, Dimitrios S. Nikolopoulos, David Kaeli, Satoshi Matsuoka
OpenACC — First Experiences with Real-World Applications

Today’s trend to use accelerators like GPGPUs in heterogeneous computer systems has entailed several low-level APIs for accelerator programming. However, programming these APIs is often tedious and therefore unproductive. To tackle this problem, recent approaches employ directive-based high-level programming for accelerators. In this work, we present our first experiences with OpenACC, an API consisting of compiler directives to offload loops and regions of C/C++ and Fortran code to accelerators. We compare the performance of OpenACC to PGI Accelerator and OpenCL for two real-world applications and evaluate programmability and productivity. We find that OpenACC offers a promising ratio of development effort to performance and that a directive-based approach to program accelerators is more efficient than low-level APIs, even if suboptimal performance is achieved.

Sandra Wienke, Paul Springer, Christian Terboven, Dieter an Mey
accULL: An OpenACC Implementation with CUDA and OpenCL Support

The irruption into the HPC scene of hardware accelerators, like GPUs, has made unprecedented performance available to developers. However, even expert developers may not be ready to exploit the new complex processor hierarchies. We need to find a way to leverage the programming effort for these devices at the programming language level; otherwise, developers will spend most of their time focusing on device-specific code instead of implementing algorithmic enhancements. The recent advent of the OpenACC standard for heterogeneous computing represents an effort in this direction. This initiative, combined with future releases of the OpenMP standard, will converge into a fully heterogeneous framework that will cope with the programming requirements of future computer architectures. In this work we present accULL, a novel implementation of the OpenACC standard, based on the combination of a source-to-source compiler and a runtime library. To our knowledge, our approach is the first providing support for both OpenCL and CUDA platforms under this new standard.

Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande
Understanding the Performance of Concurrent Data Structures on Graphics Processors

In this paper we revisit the design of concurrent data structures – specifically queues – and examine their performance portability with regard to the move from conventional CPUs to graphics processors. We have looked at both lock-based and lock-free algorithms and have, for comparison, implemented and optimized the same algorithms on both graphics processors and multi-core CPUs. Particular interest has been paid to study the difference between the old Tesla and the new Fermi and Kepler architectures in this context. We provide a comprehensive evaluation and analysis of our implementations on all examined platforms. Our results indicate that the queues are in general performance portable, but that platform specific optimizations are possible to increase performance. The Fermi and Kepler GPUs, with optimized atomic operations, are observed to provide excellent scalability for both lock-based and lock-free queues.

Daniel Cederman, Bapi Chatterjee, Philippas Tsigas
A New Programming Paradigm for GPGPU

Graphics Processing Units (GPUs) have become a valuable support for High Performance Computing (HPC) applications. However, despite the many improvements of General Purpose GPUs, the current programming paradigms available, such as NVIDIA’s CUDA, are still low-level and require strong programming effort, especially for irregular applications where dynamic load balancing is a key point to reach high performance.

This paper introduces a new hybrid programming scheme for general purpose graphics processors using two levels of parallelism. At the upper level, a program creates, in a lazy fashion, tasks to be scheduled on the different Streaming Multiprocessors (MP), as defined in NVIDIA’s architecture. We have embedded inside the GPU a well-known work-stealing algorithm to dynamically balance the workload. At the lower level, tasks exploit each Streaming Processor (SP) following a data-parallel approach. Preliminary comparisons on data-parallel iteration over vectors show that this approach is competitive on regular workloads with the standard CUDA library Thrust, which is based on static scheduling. Nevertheless, our approach outperforms Thrust-based scheduling on irregular workloads.

Julio Toss, Thierry Gautier
GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement

In hardware-aware high performance computing, block-asynchronous iteration and mixed precision iterative refinement are two techniques that may be used to leverage the computing power of SIMD accelerators like GPUs in the iterative solution of linear equation systems. Although they use a very different approach for this purpose, they share the basic idea of compensating for the convergence properties of an inferior numerical algorithm by a more efficient usage of the provided computing power. In this paper, we analyze the potential of combining both techniques. To this end, we derive a mixed precision iterative refinement algorithm using a block-asynchronous iteration as an error correction solver, and compare its performance with a pure implementation of a block-asynchronous iteration and an iterative refinement method using double precision for the error correction solver. For matrices from the University of Florida Matrix collection, we report the convergence behaviour and provide the total solver runtime using different GPU architectures.

Hartwig Anzt, Piotr Luszczek, Jack Dongarra, Vincent Heuveline
GPURoofline: A Model for Guiding Performance Optimizations on GPUs

Performance optimization on GPUs requires deep technical knowledge of the underlying hardware. Modern GPU architectures are becoming more and more diversified, which further exacerbates the already difficult problem. This paper presents GPURoofline, an empirical model for guiding optimizations on GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architectures implement high performance GPU kernels. The model addresses this problem by exploring potential performance bottlenecks and evaluating whether specific optimization techniques bring any performance improvement. To demonstrate the usage of the model, we optimize four representative kernels with different computation densities, namely matrix transpose, Laplace transform, integral and face-detection, on both NVIDIA and AMD GPUs. Experimental results show that, under the guidance of GPURoofline, the performance of those kernels achieves speedups of 3.74 to 14.8 times compared to their naïve implementations on both NVIDIA and AMD GPU platforms.
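For context, the classic Roofline bound that this kind of model extends caps attainable throughput at min(peak compute, arithmetic intensity × peak memory bandwidth); the helper below uses invented hardware numbers and is not the empirical GPURoofline model itself:

```python
def roofline_bound(intensity_flops_per_byte, peak_gflops, peak_gb_per_s):
    """Attainable GFLOP/s under the classic Roofline model."""
    return min(peak_gflops, intensity_flops_per_byte * peak_gb_per_s)

# Hypothetical GPU: 1000 GFLOP/s peak compute, 150 GB/s memory bandwidth.
for intensity in (0.25, 1.0, 4.0, 16.0):      # FLOPs per byte moved
    print(intensity, roofline_bound(intensity, 1000.0, 150.0))
# Low-intensity kernels (e.g. a transpose) are bandwidth-bound; only above
# roughly 6.7 FLOPs/byte does this hypothetical device become compute-bound.
```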

Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, Yan Li
Building a Collision for 75-Round Reduced SHA-1 Using GPU Clusters

SHA-1 is one of the most widely used cryptographic hash functions. An important property of all cryptographic hash functions is collision resistance, that is, the infeasibility of finding two different input messages such that they have the same hash values. Our work improves on differential attacks on SHA-1 and its reduced variants. In this work we describe porting collision search using the method of characteristics to a GPU cluster. The method of characteristics employs backtracking search, which leads to low GPU performance due to branch divergence if implemented naively. Using a number of optimizations, we reduce branch divergence and achieve a GPU usage efficiency of 50%, which gives a 39× acceleration over a single CPU core. With the help of our application running on a 512-GPU cluster, we were able to find a collision for a version of SHA-1 reduced to 75 rounds, which is currently (February 2012) the world’s best result in terms of number of rounds for SHA-1.

Andrew V. Adinetz, Evgeny A. Grechnikov
GPU-Vote: A Framework for Accelerating Voting Algorithms on GPU

Voting algorithms, such as histogram and Hough transforms, are frequently used algorithms in various domains, such as statistics and image processing. Algorithms in these domains may be accelerated using GPUs. Implementing voting algorithms efficiently on a GPU however is far from trivial due to irregularities and unpredictable memory accesses. Existing GPU implementations therefore target only specific voting algorithms while we propose in this work a methodology which targets voting algorithms in general.

This methodology is used in gpu-vote, a framework to accelerate current and future voting algorithms on a GPU without significant programming effort. We classify voting algorithms into four categories. We describe a transformation to merge categories, which enables gpu-vote to have a single implementation for all voting algorithms. Despite the generality of gpu-vote, being able to handle various voting algorithms, its performance is not compromised. Compared to recently published GPU implementations of the Hough transform and the histogram algorithms, gpu-vote yields an 11% and a 38% lower execution time, respectively. Additionally, we give an accurate and intuitive performance prediction model for the generalized GPU voting algorithm. Our model can predict the execution time of gpu-vote within an average absolute error of 5%.

Gert-Jan van den Braak, Cedric Nugteren, Bart Mesman, Henk Corporaal
Backmatter
Metadata
Title: Euro-Par 2012 Parallel Processing
Editors: Christos Kaklamanis, Theodore Papatheodorou, Paul G. Spirakis
Copyright Year: 2012
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-32820-6
Print ISBN: 978-3-642-32819-0
DOI: https://doi.org/10.1007/978-3-642-32820-6
