
2010 | Book

Euro-Par 2010 - Parallel Processing

16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010, Proceedings, Part I

Edited by: Pasqua D’Ambra, Mario Guarracino, Domenico Talia

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Topic 1: Support Tools and Environments

Support Tools and Environments

Despite an impressive body of research, parallel and distributed computing remains a complex task prone to subtle software issues that can affect both the correctness and the performance of the computation. The increasing demand to distribute computing over large-scale parallel and distributed platforms, such as grids and large clusters, often combined with the use of hardware accelerators, overlaps with an increasing pressure to make computing more dependable. To address these challenges, the parallel and distributed computing community continuously requires better tools and environments to design, program, debug, test, tune, and monitor parallel programs. This topic aims to bring together tool designers, developers, and users to share their concerns, ideas, solutions, and products covering a wide range of platforms, including homogeneous and heterogeneous multi-core architectures. Contributions with solid theoretical foundations and experimental validations on production-level parallel and distributed systems were particularly valued. This year, we encouraged submissions proposing intelligent monitoring and diagnosis tools and environments, which can exploit behavior knowledge to detect programming bugs or performance bottlenecks and help ensure correct and efficient parallel program execution.

Omer Rana, Giandomenico Spezzano, Michael Gerndt, Daniel S. Katz
Starsscheck: A Tool to Find Errors in Task-Based Parallel Programs

Star Superscalar is a task-based programming model. The programmer starts with an ordinary C program, and adds pragmas to mark functions as tasks, identifying their inputs and outputs. When the main thread reaches a task, an instance of the task is added to a run-time dependency graph, and later scheduled to run on a processor. Variants of Star Superscalar exist for the Cell Broadband Engine and SMPs.

Star Superscalar relies on the annotations provided by the programmer. If these are incorrect, the program may exhibit race conditions or exceptions deep inside the run-time system.

This paper introduces Starsscheck, a tool based on Valgrind, which helps debug Star Superscalar programs. Starsscheck verifies that the pragma annotations are correct, producing a warning if a task or the main thread performs an invalid access. The tool can be adapted to support similar programming models such as TPC. For most benchmarks, Starsscheck is faster than memcheck, the default Valgrind tool.
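
To make the annotation model concrete, here is a minimal sketch in the SMPSs/StarSs pragma style; the function, sizes, and names are our own illustration, not taken from the paper:

    #include <stddef.h>

    #define N 1024

    /* An SMPSs/StarSs-style task annotation: the pragma declares which
     * arrays the task reads and writes, and the runtime builds its
     * dependency graph from it. If c were wrongly declared as input,
     * the graph would be wrong; Starsscheck warns about exactly this
     * kind of invalid access. */
    #pragma css task input(a, b) output(c)
    void vector_add(const float a[N], const float b[N], float c[N])
    {
        for (size_t i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }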

Paul M. Carpenter, Alex Ramirez, Eduard Ayguade
Automated Tuning in Parallel Sorting on Multi-core Architectures

Empirical search is an emerging strategy used in systems like ATLAS, FFTW and SPIRAL to find the parameter values of the implementation that deliver near-optimal performance for a particular machine. However, this approach has only proven successful for scientific kernels or serial symbolic sorting. Even commercial libraries like Intel MKL or IBM ESSL do not include parallel versions of sorting routines. In this paper we study empirical search in the generation of parallel sorting routines for multi-core systems. Parallel sorting presents new challenges in that the relative performance of the algorithms depends not only on the characteristics of the architecture and input data, but also on the data partitioning schemes and thread interactions. We have studied parallel sorting algorithms including quick sort, cache-conscious radix sort, multiway merge sort, sample sort and quick-radix sort, and have built a sorting library using empirical search and artificial neural networks. Our results show that this sorting library can generate the best parallel sorting algorithms for different input sets on both x86 and SPARC multi-core architectures, with peak speedups of 2.2x and 3.9x, respectively.
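
As a hedged illustration of the empirical-search idea (a toy harness of our own, not the authors' library), one can time each candidate routine on a sample input and keep the fastest; a real generator would also search over partition sizes and thread counts and could train a neural network on the measurements:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    typedef void (*sort_fn)(int *a, size_t n);

    static int cmp_int(const void *x, const void *y)
    {
        int a = *(const int *)x, b = *(const int *)y;
        return (a > b) - (a < b);
    }

    static void sort_qsort(int *a, size_t n) { qsort(a, n, sizeof *a, cmp_int); }

    static void sort_insertion(int *a, size_t n)  /* wins only for small n */
    {
        for (size_t i = 1; i < n; i++) {
            int v = a[i]; size_t j = i;
            while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
            a[j] = v;
        }
    }

    /* Empirical search: run each candidate on a copy of the sample
     * input and keep the one with the lowest measured time. */
    static sort_fn pick_best(const int *sample, size_t n)
    {
        sort_fn cand[] = { sort_qsort, sort_insertion };
        const char *name[] = { "qsort", "insertion" };
        sort_fn best = cand[0];
        double best_t = 1e30;
        int *buf = malloc(n * sizeof *buf);
        for (int c = 0; c < 2; c++) {
            memcpy(buf, sample, n * sizeof *buf);
            clock_t t0 = clock();
            cand[c](buf, n);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("%-9s %.6f s\n", name[c], t);
            if (t < best_t) { best_t = t; best = cand[c]; }
        }
        free(buf);
        return best;
    }

    int main(void)
    {
        size_t n = 10000;  /* one sample size; a real search tries many */
        int *a = malloc(n * sizeof *a);
        for (size_t i = 0; i < n; i++) a[i] = rand();
        pick_best(a, n);
        free(a);
        return 0;
    }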

Haibo Lin, Chao Li, Qian Wang, Yi Zhao, Ninghe Pan, Xiaotong Zhuang, Ling Shao
Estimating and Exploiting Potential Parallelism by Source-Level Dependence Profiling

Manual parallelization of programs is known to be difficult and error-prone, and there are currently few ways to measure the amount of potential parallelism in the original sequential code.

We present an extension of Embla, a Valgrind-based dependence profiler that links dynamic dependences back to source code. This new tool estimates potential task-level parallelism in a sequential program and helps programmers exploit it at the source level. Using the popular fork-join model, our tool provides a realistic estimate of the potential speed-up for parallelization with frameworks like Cilk, TBB or OpenMP 3.0. Estimates can be given for several different parallelization models, varying in programmer effort and in the capabilities required of the underlying implementation. Our tool also outputs source-level dependence information to aid the parallelization of programs with lots of inherent parallelism, as well as critical paths to suggest algorithmic rewrites of programs with little of it.
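
The fork-join estimate can be summarized by the classic work/span bound; the sketch below is our illustration of that bound, not Embla's actual code:

    #include <stdio.h>

    /* With total work T1, critical-path length Tinf, and P processors,
     * a greedy fork-join scheduler needs at most T1/P + Tinf time, so
     * T1 / (T1/P + Tinf) is a realistic speed-up estimate; it never
     * exceeds the inherent parallelism T1/Tinf. */
    static double predicted_speedup(double t1, double tinf, int p)
    {
        return t1 / (t1 / p + tinf);
    }

    int main(void)
    {
        /* hypothetical profile: 1e9 units of work, 1e6 on the critical path */
        for (int p = 1; p <= 64; p *= 2)
            printf("P=%2d  predicted speed-up %.1f\n",
                   p, predicted_speedup(1e9, 1e6, p));
        return 0;
    }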

We validate our claims by running our tool over serial elisions of sample Cilk programs, finding additional inherent parallelism not exploited by the Cilk code, as well as over serial C benchmarks where the profiling results suggest parallelism-enhancing algorithmic rewrites.

Jonathan Mak, Karl-Filip Faxén, Sverker Janson, Alan Mycroft
Source-to-Source Optimization of CUDA C for GPU Accelerated Cardiac Cell Modeling

Large and complex systems of ordinary differential equations (ODEs) arise in diverse areas of science and engineering, and pose special challenges on a streaming processor owing to the large amount of state they manipulate. We describe a set of domain-specific source transformations on CUDA C that improved performance by 6.7x on a system of ODEs arising in cardiac electrophysiology running on the nVidia GTX-295, without requiring expert knowledge of the GPU. Our transformations should apply to a wide range of reaction-diffusion systems.
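
The abstract does not spell out the individual transformations; as one hedged example of the kind of source-level change that helps large-state ODE systems on a streaming processor, converting the state from an array of structures to a structure of arrays makes the accesses of consecutive threads contiguous (in CUDA C, i would be the thread index):

    #define NCELLS 4096

    /* Before: array of structures. The state variables of one cell are
     * adjacent, so threads with consecutive i stride through memory. */
    struct cell { float v, m, h; };
    struct cell cells_aos[NCELLS];

    /* After: structure of arrays. Each variable is contiguous across
     * cells, which is what coalesced GPU loads require. */
    struct {
        float v[NCELLS], m[NCELLS], h[NCELLS];
    } cells_soa;

    /* One explicit-Euler step for cell i; the right-hand side is a
     * placeholder, not a cardiac model. */
    void step_cell(int i, float dt)
    {
        float v = cells_soa.v[i], m = cells_soa.m[i], h = cells_soa.h[i];
        float dv = -v * m * h;
        cells_soa.v[i] = v + dt * dv;
    }

    int main(void)
    {
        for (int i = 0; i < NCELLS; i++) {
            cells_soa.v[i] = -80.0f; cells_soa.m[i] = 0.05f; cells_soa.h[i] = 0.6f;
        }
        for (int i = 0; i < NCELLS; i++)
            step_cell(i, 0.01f);
        return 0;
    }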

Fred V. Lionetti, Andrew D. McCulloch, Scott B. Baden
Efficient Graph Partitioning Algorithms for Collaborative Grid Workflow Developer Environments

Collaborative editing systems allow a group of users to view and edit a shared item from geographically dispersed sites. Consistency maintenance in the face of concurrent accesses to shared entities is one of the core issues in the design of these systems. This paper introduces a lock-based solution and three associated algorithms by which grid workflow developer environments can enable concurrent access to grid applications for multiple users. The methods ensure that collaborators cannot break the consistency criteria of workflows by introducing cycles or invalid edges into them. A formal analysis of the three algorithms is provided, focusing on the number of users that can simultaneously edit the same graph. Based on these findings, an integrated algorithm is defined that allows even more users to collaborate during workflow development.

Gergely Sipos, Péter Kacsuk
Profile-Driven Selective Program Loading

Complex software systems use many shared libraries frequently composed of large off-the-shelf components, from which only a limited number of functions are actually used. Historically, demand paging prevented this from wasting large amounts of memory. However, many high-end systems lack virtual memory and must therefore load entire shared libraries into each node’s memory. In this paper we propose a system that decreases the memory footprint of applications by selectively loading only the used portions of shared libraries. After profiling executables and shared libraries, our system rewrites all target shared libraries with a new function ordering and updated ELF program headers, so that the loader only loads those functions that are likely to be used by a given application; a fallback user-level paging system recovers from failures in our analysis. We present a case study showing that our system achieves more than an 80% reduction in the number of pages loaded for several HPC applications while causing no performance overhead for reasonably long-running programs.

Tugrul Ince, Jeffrey K. Hollingsworth
Characterizing the Impact of Using Spare-Cores on Application Performance

Increased parallelism on a single processor is driving improvements in peak performance at both the node and system levels. However, achievable performance, in particular for production scientific applications, is not always directly proportional to the core count. Performance is often limited by constraints in the memory hierarchy and by node interconnectivity. Even on state-of-the-art processors containing between four and eight cores, many applications cannot take full advantage of the compute performance of all cores. This trend is expected to increase on future processors as the core count per processor grows. In this work we characterize the use of spare cores (cores that do not provide any improvement in application performance) on current multi-core processors. Using a pulse-width modulation method, we examine the possible performance profile of using a spare core and quantify under what circumstances its use will not impact application performance. We show that, for current AMD and Intel multi-core processors, spare cores can be used for substantial computational tasks, but can impact application performance when they use shared caches or access main memory significantly.

José Carlos Sancho, Darren J. Kerbyson, Michael Lang

Topic 2: Performance Prediction and Evaluation

Performance Prediction and Evaluation

In recent years a range of novel methodologies and tools have been developed for the evaluation, design, and model reduction of existing and emerging parallel and distributed systems. At the same time, the coverage of the term “performance” has broadened to include reliability, robustness, energy consumption, and scalability, in addition to the classical performance-oriented evaluation of system functionality. The aim of the conference topic “Performance Prediction and Evaluation” was to bring together system designers and researchers involved with the qualitative and quantitative evaluation and modeling of large-scale parallel and distributed applications and systems (e.g., Grids, Cloud computing environments, multicore architectures).

Stephen Jarvis, Massimo Coppola, Junwei Cao, Darren Kerbyson
A Model for Space-Correlated Failures in Large-Scale Distributed Systems

Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, in distributed systems a single failure can trigger within a short time span several more failures, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, techniques for dealing with failures require good failure models. However, not many such models are available, and the available models are valid only for a few or even a single distributed system. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components: the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Last, as a result of our work, six of the studied traces have been made available through the Failure Trace Archive (http://fta.inria.fr).
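
As a hedged sketch of how such a three-component model drives a simulation (the exponential distributions and parameters below are placeholders, not the paper's fitted values):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Draw from an exponential distribution with the given mean. */
    static double exp_sample(double mean)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        return -mean * log(u);
    }

    int main(void)
    {
        double t = 0.0;
        /* placeholder parameters: mean inter-arrival 3600 s between
         * groups, mean group size 4, mean downtime 600 s per group */
        for (int g = 0; g < 5; g++) {
            t += exp_sample(3600.0);              /* group inter-arrival */
            int size = 1 + (int)exp_sample(3.0);  /* group size          */
            double down = exp_sample(600.0);      /* resource downtime   */
            printf("t=%8.0f s: group of %d correlated failures, "
                   "downtime %.0f s\n", t, size, down);
        }
        return 0;
    }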

Matthieu Gallet, Nezih Yigitbasi, Bahman Javadi, Derrick Kondo, Alexandru Iosup, Dick Epema
Architecture Exploration for Efficient Data Transfer and Storage in Data-Parallel Applications

Due to the complexity of modern data-parallel applications such as image processing, automatic approaches to infer suitable and efficient hardware realizations are increasingly required. Typically, the optimization of the data transfer and storage micro-architecture plays a key role for data parallelism. In this paper, we propose a comprehensive method to explore the mapping of a high-level representation of an application onto a customizable hardware accelerator. The high-level representation is in a language called Array-OL. The customizable architecture uses FIFO queues and a double-buffering mechanism to mask the latency of data transfers and external memory access. The mapping of a high-level representation onto the given architecture is performed by applying a set of loop transformations in Array-OL. A method based on integer partitions is used to reduce the space of explored solutions.

Rosilde Corvino, Abdoulaye Gamatié, Pierre Boulet
jitSim: A Simulator for Predicting Scalability of Parallel Applications in Presence of OS Jitter

Traditionally, operating system jitter has been a source of performance degradation for parallel applications running on large numbers of processors. While some large-scale HPC systems, such as Blue Gene/L and Cray XT4, mitigate jitter by using a specialized light-weight operating system on compute nodes, other clusters have attempted using HPC-ready commodity operating systems such as ZeptoOS (based on Linux). However, as large systems continue to be designed to work with commodity OSes, OS jitter remains an active area of research within the HPC community. While it is true that some specialized commodity OSes like ZeptoOS have relatively low jitter levels, there is still a need for a quick and easy set of tools that can predict the impact of OS jitter for a given configuration and processor count. Such tools are also required to validate and compare any new techniques or OS enhancements that mitigate jitter. Emulating jitter on a large “jitter-free” platform, using either synthetic jitter or real traces from commodity OSes, has been proposed as one useful mechanism to study scalability behavior in the presence of jitter. However, this requires access to large-scale jitter-free systems, which are few in number and not easily accessible. As new systems are built that should scale up to a million tasks and more, the emulation approach remains limited by the largest jitter-free system available. In this paper we present jitSim, a framework for predicting the scalability of parallel compute-intensive applications in the presence of OS jitter using trace-driven simulation. The framework can be used to quickly simulate the effects of jitter characteristic of a given OS using a given trace, and to predict scalability up to arbitrarily large task counts. Our methodology comprises the collection of real jitter traces and the measurement of network latency, message-passing stack latency, and shared memory latency. The simulation framework takes the above as inputs and simulates multiple parallel tasks, each starting at a randomly chosen point in the jitter trace and executing a compute phase. We validate the simulation results against real data and demonstrate the efficacy of the framework by evaluating various jitter mitigation techniques through simulation.
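
The core of a trace-driven jitter simulation is simple: in a bulk-synchronous phase, each task finishes after its compute time plus whatever jitter its node injects, and the phase ends when the slowest task does. A minimal sketch with synthetic jitter standing in for real traces:

    #include <stdio.h>
    #include <stdlib.h>

    /* Synthetic stand-in for a per-node OS jitter trace: returns the
     * extra delay (in us) injected into one compute phase on one node. */
    static double jitter_sample(void)
    {
        /* 2% of phases hit a 500 us daemon; otherwise 0-10 us noise */
        if (rand() % 50 == 0) return 500.0;
        return (double)(rand() % 11);
    }

    /* Phase time of a P-task bulk-synchronous step: the max over tasks. */
    static double phase_time(int p, double compute_us)
    {
        double worst = 0.0;
        for (int i = 0; i < p; i++) {
            double t = compute_us + jitter_sample();
            if (t > worst) worst = t;
        }
        return worst;
    }

    int main(void)
    {
        double compute_us = 1000.0;
        for (int p = 1; p <= 1 << 20; p <<= 4) {
            double slowdown = phase_time(p, compute_us) / compute_us;
            printf("tasks=%7d  phase slowdown %.2fx\n", p, slowdown);
        }
        return 0;
    }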

Pradipta De, Vijay Mann
pCFS vs. PVFS: Comparing a Highly-Available Symmetrical Parallel Cluster File System with an Asymmetrical Parallel File System

pCFS is a highly available, parallel, symmetrical (nodes perform both compute and I/O work) cluster file system that we have designed to run in medium-sized clusters. In this paper, using exactly the same hardware and Linux version across all nodes, we compare pCFS with two distinct configurations of PVFS: one using internal disks, and therefore not able to provide any tolerance against disk and/or I/O node failures, and another where PVFS I/O servers access LUNs in a disk array and thus provide high availability (named HA-PVFS in the following). We start by measuring the I/O bandwidth and CPU consumption of the PVFS and HA-PVFS setups; then, the same set of tests is performed with pCFS. We conclude that, on the same hardware, pCFS compares very favourably with HA-PVFS, offering the same or higher I/O bandwidth at much lower CPU consumption.

Paulo A. Lopes, Pedro D. Medeiros
Comparing Scalability Prediction Strategies on an SMP of CMPs

Diminishing performance returns and increasing power consumption of single-threaded processors have made chip multiprocessors (CMPs) an industry imperative. Unfortunately, poor software/hardware interaction and bottlenecks in shared hardware structures can prevent scaling to many cores. In fact, adding a core may harm performance and increase power consumption. Given these observations, we compare two approaches to predicting parallel application scalability: multiple linear regression and artificial neural networks (ANNs). We throttle concurrency to levels with higher predicted power/performance efficiency. We perform experiments on a state-of-the-art, dual-processor, quad-core platform, showing that both methodologies achieve high accuracy and identify energy-efficient concurrency levels in multithreaded scientific applications. The ANN approach has advantages, but the simpler regression-based model achieves slightly higher accuracy and performance. The approaches exhibit median error of 7.5% and 5.6%, and improve performance by an average of 7.4% and 9.5%, respectively.

Karan Singh, Matthew Curtis-Maury, Sally A. McKee, Filip Blagojević, Dimitrios S. Nikolopoulos, Bronis R. de Supinski, Martin Schulz

Topic 3: Scheduling and Load-Balancing

Scheduling and Load Balancing

Scheduling and load balancing techniques are crucial for implementing efficient parallel and distributed applications and for making best use of parallel and distributed systems. This includes planning and optimization of resource allocation as well as coping with the dynamics of the systems.

Ramin Yahyapour, Raffaele Perego, Frédéric Desprez, Leah Epstein, Francesc Guim Bernat
A Fast 5/2-Approximation Algorithm for Hierarchical Scheduling

We present in this article a new approximation algorithm for scheduling a set of n independent rigid (meaning requiring a fixed number of processors) jobs on a hierarchical parallel computing platform. A hierarchical parallel platform is a collection of k parallel machines of different sizes (numbers of processors). The jobs are submitted to a central queue, and each job must be allocated to one of the k parallel machines (and then scheduled on some processors of this machine), targeting the minimization of the maximum completion time (makespan). We assume that no job requires more resources than are available on the smallest machine.

This problem is hard, and it has been previously shown that there is no polynomial approximation algorithm with a ratio lower than 2 unless P = NP. The proposed scheduling algorithm achieves a 5/2 ratio and runs in O(log(n p_max) k n log(n)) time, where p_max is the maximum processing time of the jobs. Our results also apply to the Multi Strip Packing problem, where the jobs (rectangles) must be allocated on contiguous processors.
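
Restated in standard notation (a summary of the abstract's guarantee, not an additional result):

    \[
    C_{\max}(S) \;\le\; \tfrac{5}{2}\, C_{\max}^{\mathrm{OPT}},
    \qquad
    T(n) \;=\; O\bigl(\log(n\,p_{\max})\, k\, n \log n\bigr),
    \]

    where $S$ is the produced schedule, $C_{\max}^{\mathrm{OPT}}$ the optimal
    makespan, and $p_{\max}$ the maximum processing time; no polynomial-time
    algorithm can guarantee a ratio below $2$ unless $\mathrm{P}=\mathrm{NP}$.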

Marin Bougeret, Pierre-François Dutot, Klaus Jansen, Christina Otte, Denis Trystram
Non-clairvoyant Scheduling of Multiple Bag-of-Tasks Applications

The bag-of-tasks application model, albeit simple, arises in many application domains and has received a lot of attention in the scheduling literature. Previous works propose either theoretically sound solutions that rely on unrealistic assumptions, or ad-hoc heuristics with no performance guarantees. This work attempts to bridge this gap through the design of non-clairvoyant heuristics based on solid theoretical foundations. The performance achieved by these heuristics is studied via simulations, with a view to comparing them both to previously proposed solutions and to theoretical upper bounds on achievable performance. An interesting theoretical result of this work is that a straightforward on-demand heuristic delivers asymptotically optimal performance when either the communications or the computations can be neglected.

Henri Casanova, Matthieu Gallet, Frédéric Vivien
Extremal Optimization Approach Applied to Initial Mapping of Distributed Java Programs

An extremal optimization algorithm for the initial placement of Java programs on clusters of Java Virtual Machines (JVMs) is presented. The JVMs run on multicore processors under the ProActive Java execution framework. Java programs are represented as Directed Acyclic Graphs in which tasks correspond to methods of distributed active Java objects that communicate using an RMI mechanism. The presented probabilistic extremal optimization approach is based on a local fitness function composed of two sub-functions, which use the elimination of task-execution delays after the reception of required data and the imbalance of task execution across processors as heuristics for improving extremal optimization solutions. The evolution of an extremal optimization solution is governed by task clustering, supported by identification of the dominant path in the graph. The applied task mapping is based on dynamic measurements of the current loads of the JVMs and of inter-JVM communication link bandwidth. JVM loads are approximated by observing the average idle time that threads report to the OS; current link bandwidth is determined by observing the average number of RMI calls performed per second.

Ivanoe De Falco, Eryk Laskowski, Richard Olejnik, Umberto Scafuri, Ernesto Tarantino, Marek Tudruj
A Delay-Based Dynamic Load Balancing Method and Its Stability Analysis and Simulation

Delays commonly exist in load balancing systems for parallel computing. They can cause unstable oscillatory behavior and strongly affect the performance of the load balancing system. We present a time-delay feedback control model to describe a dynamic load balancing system in a parallel environment. By choosing proper Lyapunov-Krasovskii functionals and using the Moon inequality and the Schur complement lemma, the optimal control law is obtained not only for different delay conditions but also at scalable system sizes. Finally, multi-threaded simulation experiments confirm the validity of the introduced theory and method.

Qingyang Meng, Jianzhong Qiao, Shukuan Lin, Enze Wang, Peng Han
Code Scheduling for Optimizing Parallelism and Data Locality

As chip multiprocessors proliferate, programming support for these devices is likely to receive a lot of attention in the near future. Parallelism and data locality are two critical issues in a chip multiprocessor environment. Unfortunately, most of the published work in the literature focuses only on one of these problems, and this can prevent one from achieving the best possible performance. The main goal of this paper is to propose and evaluate a compiler-directed code parallelization scheme, which considers both parallelism and data locality at the same time. Our compiler captures the inherent parallelism and data reuse in the application code being analyzed using a novel representation called the locality-parallelism graph (LPG). Our partitioning/scheduling algorithm assigns the nodes of this graph to the processors in the architecture and schedules them for execution. We implemented this algorithm and evaluated its effectiveness using a set of benchmark codes. The results collected so far indicate that our approach improves overall execution latency significantly. In this paper, we also introduce an ILP (Integer Linear Programming) based formulation of the problem, and implement the schedule obtained by the ILP solver. The results indicate that our approach gets within 4% of the ILP solution.

Taylan Yemliha, Mahmut Kandemir, Ozcan Ozturk, Emre Kultursay, Sai Prashanth Muralidhara
Hierarchical Work-Stealing

This paper studies dynamic load-balancing on hierarchical platforms. In particular, we consider applications involving heavy communications on a distributed platform. The work-stealing algorithm introduced by Blumofe and Leiserson is a commonly used technique to balance load in a distributed environment, but it suffers from poor performance with some communication-intensive applications. We describe several variants of this algorithm found in the literature and in different grid middlewares such as Satin and Kaapi. In addition, we propose two new variants of the work-stealing algorithm: HWS and PWS. These algorithms improve performance by taking the network structure into account. We conduct a theoretical analysis of HWS in the case of fork-join task graphs and prove that HWS reduces communication overhead. We also present experimental results comparing the most relevant algorithms. Experiments on Grid’5000 show that HWS and PWS obtain performance gains of up to twenty percent compared to the classical work-stealing algorithm. Moreover, in some cases, PWS and HWS achieve speedup where classical work-stealing policies result in slowdown.
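
A hedged sketch of the principle behind network-aware victim selection (our illustration, not the authors' HWS/PWS implementation): an idle worker first tries victims in its own cluster and crosses cluster boundaries only when local steals fail, keeping stolen-task traffic off the slow links:

    #include <stdio.h>
    #include <stdlib.h>

    #define NWORKERS    64
    #define PER_CLUSTER 8   /* cluster id of worker w is w / PER_CLUSTER */

    /* Stub: a real runtime would try to pop a task from the victim's
     * deque; here we just pretend it succeeds now and then. */
    static int try_steal(int victim)
    {
        (void)victim;
        return rand() % 4 == 0;
    }

    /* Network-aware victim selection: several attempts inside the
     * thief's own cluster before a single random remote attempt. */
    static int pick_and_steal(int self)
    {
        int base = (self / PER_CLUSTER) * PER_CLUSTER;
        for (int attempt = 0; attempt < 4; attempt++) {
            int victim = base + rand() % PER_CLUSTER;
            if (victim != self && try_steal(victim))
                return victim;
        }
        int victim = rand() % NWORKERS;          /* remote fallback */
        return (victim != self && try_steal(victim)) ? victim : -1;
    }

    int main(void)
    {
        int victim = pick_and_steal(3);
        printf(victim >= 0 ? "worker 3 stole from %d\n"
                           : "steal failed (%d)\n", victim);
        return 0;
    }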

Jean-Noël Quintin, Frédéric Wagner
Optimum Diffusion for Load Balancing in Mesh Networks

This paper studies the Diffusion method for the load balancing problem in the case of weighted mesh graphs. Closed-form formulae for the optimum values of the edge weights are determined using local Fourier analysis. It is shown that an extrapolated version of Diffusion (EDF) can become twice as fast for orthogonal mesh graphs. As a byproduct of our analysis, it is also shown that EDF on tori is four times faster than on meshes.
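
The diffusion update analyzed here is local and simple; a hedged illustration on a 1-D chain (the paper treats weighted 2-D meshes and tori and derives the optimum edge weights analytically):

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double load[N] = { 24, 0, 0, 8, 0, 0, 0, 0 };
        double next[N];
        double w = 0.5;  /* edge weight; the paper derives optimum values */

        /* Each sweep moves load along every edge in proportion to the
         * load difference; the vector converges to the uniform average. */
        for (int sweep = 0; sweep < 200; sweep++) {
            for (int i = 0; i < N; i++) {
                double flow = 0.0;
                if (i > 0)     flow += w * (load[i - 1] - load[i]);
                if (i < N - 1) flow += w * (load[i + 1] - load[i]);
                next[i] = load[i] + flow;
            }
            for (int i = 0; i < N; i++) load[i] = next[i];
        }
        for (int i = 0; i < N; i++)
            printf("%.2f ", load[i]);     /* every entry is close to 4.0 */
        printf("\n");
        return 0;
    }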

George S. Markomanolis, Nikolaos M. Missirlis
A Dynamic, Distributed, Hierarchical Load Balancing for HLA-Based Simulations on Large-Scale Environments

The dynamic management of load in large-scale distributed systems is essential for the performance of simulations, due to the influence that computing capacity and work load have on execution time. The High Level Architecture (HLA) was designed to provide management services for organizing distributed simulations, but the framework does not offer tools for controlling the load imbalances of distributed simulations. Many approaches have been proposed to provide a generic solution for simulation load imbalances, but these schemes are limited to balancing issues tied to specific simulation or environment characteristics. Focusing on balancing the computational load of HLA-based simulations, an approach based on a centralized method has been previously proposed, but that solution performs load redistributions through a central element, introducing global synchronization into the system. To avoid the issues caused by centralization, a distributed, hierarchical balancing design is proposed here that dynamically organizes the load through three phases: monitoring, redistribution, and migration. The proposed scheme improves fault tolerance, decreases balancing overhead, and reduces delays and bottlenecks, while exhibiting performance similar to the centralized approach in the experiments.

Robson Eduardo De Grande, Azzedine Boukerche

Topic 4: High Performance Architectures and Compilers

High Performance Architectures and Compilers

This topic deals with architecture design and compilation for high performance systems. The areas of interest range from microprocessors to large-scale parallel machines; from general-purpose platforms to specialized hardware (e.g., graphic coprocessors, low-power embedded systems); and from hardware design to compiler technology. On the compilation side, topics of interest include programmer productivity issues, concurrent and/or sequential language aspects, program analysis, transformation, automatic discovery and/or management of parallelism at all levels, and the interaction between the compiler and the rest of the system. On the architecture side, the scope spans system architectures, processor micro-architecture, memory hierarchy, and multi-threading, and the impact of emerging trends. All the papers submitted to this track highlight the growing significance of Chip Multi-Processors (CMP) and Simultaneous Multi-Threaded (SMT) processors in contemporary high-performance architectures.

Pedro C. Diniz, Marco Danelutto, Denis Barthou, Marc Gonzales, Michael Hübner
Power-Efficient Spilling Techniques for Chip Multiprocessors

Current trends in CMPs indicate that the core count will increase in the near future. One of the main performance limiters of these forthcoming microarchitectures is the latency and high demand of the on-chip network and of off-chip memory communication. To optimize the usage of on-chip memory space and reduce off-chip traffic, several techniques have been proposed that use the N-chance forwarding mechanism, a solution for distributing unused cache space in chip multiprocessors. This technique, however, can in some cases lead to unnecessary extra network traffic or inefficient cache allocation. This paper presents two alternative power-efficient spilling methods that improve the efficiency of the N-chance forwarding mechanism. Compared to traditional Spilling, our Distance-Aware Spilling technique provides an energy efficiency improvement (MIPS³/W) of 16% on average and a 14% reduction of network usage in a ring configuration, while increasing performance by 6%. Our Selective Spilling technique avoids most unnecessary reallocations and doubles the reuse of spilled blocks, reducing network traffic by an average of 22%. A combination of both techniques reduces network usage by 30% on average without degrading performance, allowing a 9% increase in energy efficiency.

Enric Herrero, José González, Ramon Canal
Scalable Object-Aware Hardware Transactional Memory

A Hardware Transactional Memory (HTM) aids the construction of lock-free regions within applications, with fewer concerns about correctness and potentially greater performance through optimistic concurrency. Object-aware hardware adds a level of indirection to memory accesses: memory addresses become a combination of the object being accessed and the offset within it. The hardware maintains a mapping from objects to memory locations, just as a mapping from virtual to real memory is handled through page tables. In a scalable object-aware system, the directories are addressed by object identifiers.

In this paper we extend a scalable object-aware memory system to implement an HTM. Our object-aware protocol permits locks on directories to be avoided for objects that are only read during a transaction. Working at the granularity of an object allows entries within the directories to be associated with multiple cache lines, as opposed to one, and reduces the amount of network traffic. Finally, our commit protocol dispenses with the need for a centrally controlled transaction ID order.

Behram Khan, Matthew Horsnell, Mikel Lujan, Ian Watson
Efficient Address Mapping of Shared Cache for On-Chip Many-Core Architecture

The performance of the on-chip cache is critical for the processor. The multi-thread programming model usually employed by on-chip many-core architectures may affect cache access patterns and, eventually, cache conflict-miss behavior. However, this behavior is still unclear, and little is known about the effectiveness of XOR mapping schemes for many-core systems. In this paper we focus on these problems. We propose an XOR-based address mapping scheme for on-chip many-core architectures that increases the performance of the cache system, and we evaluate the proposed scheme on various applications, including a bioinformatics application, matrix multiplication, LU decomposition, and FFT from the Splash2 benchmarks. Experimental results show that the proposed scheme reduces the conflict misses of the shared cache by about 53% on average and improves overall performance by about 6%. The results also show that the XOR scheme is more cost-effective than a victim cache scheme.
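
A hedged sketch of what an XOR-based set-index computation looks like (bit widths and the exact folding are placeholders; the paper's mapping may differ):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BITS 6    /* 64-byte lines           */
    #define INDEX_BITS 10   /* 1024 sets (placeholder) */
    #define SET_MASK   ((1u << INDEX_BITS) - 1)

    /* Conventional modulo mapping: low address bits select the set, so
     * power-of-two strides hit the same set and conflict. */
    static unsigned set_modulo(uint64_t addr)
    {
        return (addr >> BLOCK_BITS) & SET_MASK;
    }

    /* XOR mapping: fold higher tag bits into the index, spreading
     * strided accesses across sets. */
    static unsigned set_xor(uint64_t addr)
    {
        uint64_t idx = addr >> BLOCK_BITS;
        uint64_t tag = addr >> (BLOCK_BITS + INDEX_BITS);
        return (unsigned)((idx ^ tag) & SET_MASK);
    }

    int main(void)
    {
        /* a power-of-two stride: the modulo mapping sends every access
         * to set 0, while the XOR mapping spreads them out */
        for (uint64_t a = 0; a < 4; a++) {
            uint64_t addr = a << (BLOCK_BITS + INDEX_BITS);
            printf("addr %#10llx  modulo set %4u  xor set %4u\n",
                   (unsigned long long)addr, set_modulo(addr), set_xor(addr));
        }
        return 0;
    }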

Fenglong Song, Dongrui Fan, Zhiyong Liu, Junchao Zhang, Lei Yu, Weizhi Xu
Thread Owned Block Cache: Managing Latency in Many-Core Architecture

The shared last-level cache is crucial to performance; however, the multi-thread programming model incurs serious contention in the shared cache. In this paper, to reduce average cache access latency, we propose two schemes. The first is an implicitly dynamic cache partitioning scheme, i.e. block agglutinating, whose purpose is to isolate conflicting data blocks. The second is a novel hardware buffer, called the thread-owned block cache, i.e. the TOB Cache, whose purpose is to store conflicting data blocks. Extensive analysis of the proposed schemes with Splash2 benchmarks and bioinformatics workloads is performed using a cycle-accurate many-core simulator. Experimental results show that the proposed schemes reduce the conflict miss rate of the shared cache by 40% compared to a traditional shared cache. Compared with a victim cache, the average load latency of the shared cache and of the primary data cache is reduced by about 26% and 12%, respectively; primary data cache miss penalties are reduced by about 14%, and IPC is improved by 17%.

Fenglong Song, Zhiyong Liu, Dongrui Fan, Hao Zhang, Lei Yu, Shibin Tang
Extending the Cell SPE with Energy Efficient Branch Prediction

Energy-efficient dynamic branch predictors are proposed for the Cell SPE, which normally depends on compiler-inserted hint instructions to predict branches. All designed schemes use a Branch Target Buffer (BTB) to store the branch target address and the prediction, which is computed using a bimodal counter. One prediction scheme pre-decodes instructions when they are fetched from the local store and accesses the BTB only for branch instructions, thereby saving power compared to conventional dynamic predictors that access the BTB for every instruction. In addition, several ways to leverage the existing hint instructions for the dynamic branch predictor are studied. We also introduce branch warning instructions which initiate branch prediction before the actual branch instruction is fetched. They allow fetching the instructions starting at the branch target and thus completely remove the branch penalty for correctly predicted branches. For a 256-entry BTB, a speedup of up to 18.8% is achieved. The power consumption of the branch prediction schemes is estimated at 1% or less of the total power dissipation of the SPE and the average energy-delay product is reduced by up to 6.2%.
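
A hedged sketch of the bimodal BTB scheme the abstract describes (table size and indexing are placeholders): each entry holds a tag, the branch target, and a 2-bit saturating counter, and is consulted only for instructions pre-decoded as branches:

    #include <stdint.h>
    #include <stdio.h>

    #define BTB_ENTRIES 256

    struct btb_entry {
        uint32_t tag;     /* branch PC           */
        uint32_t target;  /* predicted target    */
        uint8_t  ctr;     /* 2-bit counter, 0..3 */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Predict taken iff the counter is in its upper half (>= 2). */
    static int predict(uint32_t pc, uint32_t *target)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (e->tag != pc) return 0;       /* miss: predict fall-through */
        *target = e->target;
        return e->ctr >= 2;
    }

    /* On branch resolution: saturating increment/decrement. */
    static void update(uint32_t pc, int taken, uint32_t target)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (e->tag != pc) { e->tag = pc; e->ctr = taken ? 2 : 1; }
        else if (taken  && e->ctr < 3) e->ctr++;
        else if (!taken && e->ctr > 0) e->ctr--;
        e->target = target;
    }

    int main(void)
    {
        uint32_t tgt = 0;
        update(0x400, 1, 0x480);   /* branch at 0x400, taken to 0x480 */
        update(0x400, 1, 0x480);
        printf("taken=%d target=%#x\n", predict(0x400, &tgt), tgt);
        return 0;
    }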

Martijn Briejer, Cor Meenderinck, Ben Juurlink

Topic 5: Parallel and Distributed Data Management

Parallel and Distributed Data Management

The manipulation and handling of an ever increasing volume of data by current data-intensive applications requires novel techniques for efficient data management. Despite recent advances in every aspect of data management (storage, access, querying, analysis, mining), future applications are expected to scale to even higher degrees, not only in terms of volumes of data handled but also in terms of users and resources, often making use of multiple, pre-existing autonomous, distributed or heterogeneous resources. The notion of parallelism and concurrent execution at all levels remains a key element in achieving scalability and managing efficiently such data-intensive applications, but the changing nature of the underlying environments requires new solutions to cope with such changes. In this context, this topic sought papers in all aspects of data management (including databases and data-intensive applications) whose focus relates to some form of parallelism and concurrency.

Rizos Sakellariou, Salvatore Orlando, Josep Lluis Larriba-Pey, Srinivasan Parthasarathy, Demetrios Zeinalipour-Yazti
Federated Enactment of Workflow Patterns

In this paper we address two research questions concerning workflows: 1) how do we abstract and catalogue recurring workflow patterns? and 2) how do we facilitate optimisation of the mapping from workflow patterns to actual resources at runtime? Our aim here is to explore techniques that are applicable to large-scale workflow compositions, where the resources could change dynamically during the lifetime of an application. We achieve this by introducing a registry-based mechanism where pattern abstractions are catalogued and stored. In conjunction with an enactment engine, which communicates with this registry, concrete computational implementations and resources are assigned to these patterns, conditional on the execution parameters. Using a data mining application from the life sciences, we demonstrate this new approach.

Gagarine Yaikhom, Chee Sun Liew, Liangxiu Han, Jano van Hemert, Malcolm Atkinson, Amy Krause
A Distributed Approach to Detect Outliers in Very Large Data Sets

We propose a distributed approach to the problem of distance-based outlier detection in very large data sets. The presented algorithm is based on the concept of an outlier detection solving set [1], a small subset of the data set that can be provably used for predicting novel outliers. The algorithm exploits parallel computation to meet two basic needs: (i) reducing the run time with respect to the centralized version and (ii) dealing with distributed data sets. The former goal is achieved by decomposing the overall computation into cooperating parallel tasks. Besides preserving the correctness of the result, the proposed schema exhibits excellent performance: experimental results show that the run time scales with the number of nodes. The latter goal is accomplished by executing each of these parallel tasks on only a portion of the entire data set, so that the proposed algorithm is suitable for use over distributed data sets. Importantly, while solving the distance-based outlier detection task in the distributed scenario, our method computes an outlier detection solving set of the overall data set of the same quality as that computed by the corresponding centralized method.

Fabrizio Angiulli, Stefano Basta, Stefano Lodi, Claudio Sartori

Topic 6: Grid, Cluster and Cloud Computing

Grid, Cluster and Cloud Computing

Grid computing is a major research area with strong involvement from both academia and the computing industry. The common vision is that grid computing represents the culmination of truly general distributed computing across various resources in a ubiquitous, open-ended infrastructure to support a wide range of different application areas. Although significant progress has been made in the design and deployment of grids, many challenges still remain before the goal of a user-friendly, efficient, and reliable grid can be realized. Grid research issues cover many areas of computer science to address the fundamental capabilities and services that are required in a heterogeneous environment, such as adaptability, scalability, reliability and security, and to support applications as diverse as ubiquitous local services, enterprise-scale virtual organizations, and internet-scale distributed supercomputing. Cloud computing is also emerging as an alternative platform for large-scale distributed applications, where resources are typically provided by a single administrative domain in a pay-per-use mode. To some, cloud computing is a natural evolution of grid computing; to others, it is a complementary and perhaps competing technology. Grid and cloud research will greatly benefit from interactions with the many related areas of computer science, making Euro-Par an excellent venue to present results and discuss issues.

K. Keahey, D. Laforenza, A. Reinefeld, P. Ritrovato, D. Thain, N. Wilkins-Diehr
Deployment of a Hierarchical Middleware

Accessing the power of distributed resources can nowadays easily be done using a middleware based on a client/server approach. Several architectures exist for such middleware, and the most scalable ones rely on a hierarchical design. Determining the best shape for the hierarchy, the one giving the best throughput of services, is not an easy task.

We first propose a computation and communication model for such hierarchical middleware. Our model takes into account the deployment of several services in the hierarchy. Then, based on this model, we propose an algorithm for automatically constructing a hierarchy. The algorithm aims at offering users the best obtained-to-requested throughput ratio, while providing fairness on this ratio for the different kinds of services and using as few resources as possible. Finally, we compare our model with experimental results on a real middleware called Diet.

Eddy Caron, Benjamin Depardon, Frédéric Desprez
Toward Real-Time, Many-Task Applications on Large Distributed Systems

In the age of Grid, Cloud, and volunteer computing, massively parallel applications are deployed over tens or hundreds of thousands of resources for short periods of time to complete immense computations. In this work, we consider the problem of deploying such applications with stringent real-time requirements. One major challenge is the server-side management of these tasks, which often number in the tens or hundreds of thousands on a centralized server. We design and implement a real-time task management system for many-task computing, called RT-BOINC. The system gives low O(1) worst-case execution time for task management operations such as task scheduling, state transitioning, and validation. We implement this system on top of BOINC, a common middleware for volunteer computing. Using micro- and macro-benchmarks executed in emulation experiments, we show that RT-BOINC provides significantly lower worst-case execution times and lessens the gap between average and worst-case performance compared with the original BOINC implementation.
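
A hedged sketch of the kind of data-structure choice that yields O(1) worst-case operations (our illustration, not RT-BOINC's code): preallocated task slots plus a free-list stack make allocation and state transitions constant-time, avoiding the variable-cost searches a database-backed server performs:

    #include <stdio.h>

    #define MAX_TASKS 4096

    enum state { FREE, READY, RUNNING, DONE };

    static enum state task_state[MAX_TASKS];
    static int free_list[MAX_TASKS];  /* stack of unused task ids */
    static int free_top;

    static void init(void)
    {
        for (int i = 0; i < MAX_TASKS; i++) free_list[i] = i;
        free_top = MAX_TASKS;
    }

    /* O(1): pop a preallocated slot, or -1 if the table is full. */
    static int task_alloc(void)
    {
        return free_top > 0 ? free_list[--free_top] : -1;
    }

    /* O(1): a state transition is a single array write. */
    static void task_set_state(int id, enum state s)
    {
        task_state[id] = s;
        if (s == FREE) free_list[free_top++] = id;  /* recycle slot */
    }

    int main(void)
    {
        init();
        int id = task_alloc();
        task_set_state(id, READY);
        task_set_state(id, RUNNING);
        task_set_state(id, DONE);
        task_set_state(id, FREE);
        printf("task %d cycled through all states in O(1) steps\n", id);
        return 0;
    }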

Sangho Yi, Derrick Kondo, David P. Anderson
Scheduling Scientific Workflows to Meet Soft Deadlines in the Absence of Failure Models

Highly distributed systems such as Clouds and Grids are used to execute complex scientific workflow applications by researchers from various areas of science. While scientists rightfully expect efficient and reliable execution of their applications, current systems often cannot deliver the required Quality of Service. We propose a dynamic execution and scheduling heuristic able to schedule workflow applications with a high degree of fault tolerance, while taking into account soft deadlines. Experimental results show that our method meets soft deadlines in volatile highly distributed systems in the absence of historic failure trace data or complex failure models of the target system.

Kassian Plankensteiner, Radu Prodan, Thomas Fahringer
A GPGPU Transparent Virtualization Component for High Performance Computing Clouds

The GPU Virtualization Service (gVirtuS) presented in this work tries to fill the gap between in-house hosted computing clusters equipped with GPGPU devices and pay-for-use high-performance virtual clusters deployed via public or private computing clouds. gVirtuS allows an instanced virtual machine to access GPGPUs in a transparent and hypervisor-independent way, with an overhead only slightly greater than that of a real machine/GPGPU setup. The performance of the components of gVirtuS is assessed through a suite of tests in different deployment scenarios, such as providing GPGPU power to cloud-based HPC clusters and sharing remotely hosted GPGPUs among HPC nodes.

Giulio Giunta, Raffaele Montella, Giuseppe Agrillo, Giuseppe Coviello
What Is the Price of Simplicity?
A Cross-Platform Evaluation of the SAGA API

The abundance of middleware to access grids and clouds and their often complex APIs hinders ease of programming and portability. The Open Grid Forum (OGF) has therefore initiated the development and standardization of SAGA: a Simple API for Grid Applications. SAGA provides a simple yet powerful API with high-level constructs that abstract from the details of the underlying infrastructure. In this paper we investigate the price that possibly comes with such an API. We discuss the effects on expressiveness and ease of programming, and analyze the performance overhead of three different SAGA implementations (written in Java, Python, and C++) on various middleware. We conclude that SAGA is a good pragmatic approach to make grids easily accessible. The API considerably improves usability and uniformity, but offers a compromise between expressiveness and runtime dependencies. The overall performance of the tested implementations is acceptable, but the strict API semantics require various runtime checks that occasionally cause significant overhead, depending on the underlying infrastructure.

Mathijs den Burger, Ceriel Jacobs, Thilo Kielmann, Andre Merzky, Ole Weidner, Hartmut Kaiser
User-Centric, Heuristic Optimization of Service Composition in Clouds

With the advent of Cloud computing, there is high potential for third-party solution providers such as composite service providers, aggregators, or resellers to tie together services from different clouds to fulfill the pay-per-use demands of their customers. Customer satisfaction, which is primarily based on the fulfillment of user-centric objectives, is a crucial success factor in such a service market. If clients’ requirements change over time, even after the desired solution has been composed, the composition approach may fail. On the other hand, business prospects expand with the possibility of reselling already-designed solutions to different customers once the underlying services become available again. Service composition strategies must cope with these dynamic situations.

In this paper we address these challenges in the context of customer-driven service selection. We present a formal approach to map customer requirements onto functional and non-functional attributes of the services. We define a happiness measure to guarantee user satisfaction and devise a parallelizable service composition algorithm to maximize this measure. We also devise a heuristic approach, based on historical information about service compositions, to react rapidly to changes in client requirements at design time and to indicate run-time remedies, for instance for service failures. The heuristic algorithm is also useful for recomposing similar solutions for different clients with matching requirements. Our algorithms are evaluated using a simulation developed on the workflow tool Kepler, coupled with a C++ implementation of the optimization algorithms.

Kevin Kofler, Irfan ul Haq, Erich Schikuta
A Distributed Market Framework for Large-Scale Resource Sharing

Current distributed computing infrastructures, such as peer-to-peer networks, grids, and more recently clouds, make sharing and trading resources ubiquitous. In these large distributed systems, rational users are both providers and consumers of resources. Currently, there is growing interest in exploiting economic models that incentivize rational users for the allocation of shared computing resources. However, when the number of resource types and users increases, the computational complexity of the allocation algorithms grows rapidly and efficiency deteriorates. In this paper, we propose a scalable distributed market framework for the allocation of shared resources in large distributed systems. We use mechanism design to create a pricing scheme that allocates a request for multiple resource types by trading economic efficiency for computational efficiency, strategy-proofness, and budget-balance. To address scalability, our proposed framework leverages a peer-to-peer overlay for resource discovery and management. We prototype our framework using FreePastry, a popular overlay network based on the Pastry protocol. We show that our scheme is efficient and scalable using both simulation experiments and results from a deployment on PlanetLab.

Marian Mihailescu, Yong Meng Teo
Using Network Information to Perform Meta-scheduling in Advance in Grids

In extremely heterogeneous and distributed systems, like Grid environments, it is quite difficult to provide quality of service (QoS). In addition, the dynamic behaviour of the resources makes the time needed to complete the execution of a job highly variable, so fulfilling user QoS requirements in a Grid is still an open issue. The main aim of this work is to provide QoS in Grid environments through network-aware job scheduling in advance. This paper presents a technique to manage the idle/busy periods of resources using red-black trees which treats the network as a first-level resource. Besides, no a priori knowledge of the duration of jobs is required, as opposed to other works. A performance evaluation using a real testbed is presented which illustrates the efficiency of this approach in meeting the QoS requirements of users, and highlights the importance of taking the network into account when predicting the duration of jobs.

Luis Tomás, Agustín Caminero, Blanca Caminero, Carmen Carrión

Topic 7: Peer to Peer Computing

Peer-to-Peer Computing

After several years of intensive investigation, peer-to-peer computing has established itself as an accepted research topic in the general area of distributed systems. Going beyond the initial file-sharing applications that spurred productive research and development, peer-to-peer computing is associated with inherently decentralized, self-organizing, and self-coordinating large-scale systems. Performance requirements include adaptivity to churn, high resilience to failures, tolerance to network performance variations, and scalability to huge numbers of peers (tens of thousands to millions), but also stronger consistency and security.

Adriana Iamnitchi, Paolo Trunfio, Jonathan Ledlie, Florian Schintke
Overlay Management for Fully Distributed User-Based Collaborative Filtering

Offering personalized recommendation as a service in fully distributed applications such as file sharing, distributed search, social networking, P2P television, etc., is an increasingly important problem. In such networked environments, recommender algorithms should meet the same performance and reliability requirements as in centralized services. Achieving this is a challenge because a large amount of distributed data needs to be managed, and at the same time additional constraints need to be taken into account, such as balancing resource usage over the network. In this paper we focus on a common component of many fully distributed recommender systems, namely the overlay network. We point out that the overlay topologies that are typically defined by node similarity have highly unbalanced degree distributions in a wide range of available benchmark datasets: a fact that has important, but so far largely overlooked, consequences for the load balancing of overlay protocols. We propose algorithms with favorable convergence speed and prediction accuracy that also take load balancing into account. We perform extensive simulation experiments with the proposed algorithms, and compare them with known algorithms from related work on well-known benchmark datasets.

Róbert Ormándi, István Hegedűs, Márk Jelasity
Dynamic Publish/Subscribe to Meet Subscriber-Defined Delay and Bandwidth Constraints

Current distributed publish/subscribe systems assume that all participants have similar QoS requirements and contribute equally to the system’s resources. However, in many real-world applications, the message delay tolerance of individual peers may differ widely. Disseminating messages according to individual delay requirements not only allows the satisfaction of user-specific needs but also significantly improves the utilization of the resources in a publish/subscribe system. In this paper, we propose a peer-to-peer-based approach to satisfy the individual delay requirements of subscribers in the presence of bandwidth constraints. Our approach allows subscribers to dynamically adjust the granularity of their subscriptions according to their bandwidth constraints and delay requirements. Subscribers maintain the publish/subscribe overlay in a decentralized manner by establishing connections to peers that provide messages meeting exactly their subscription granularity and complying with their delay requirements. Evaluations show that for practical workloads, the proposed system scales to a large number of subscribers and performs robustly in a very dynamic setting.

Muhammad Adnan Tariq, Gerald G. Koch, Boris Koldehofe, Imran Khan, Kurt Rothermel
Combining Hilbert SFC and Bruijn Graphs for Searching Computing Markets in a P2P System

This paper proposes an efficient and scalable computational resource discovery overlay orientated towards P2P computing. Our proposal gathers the peers into markets according to their computational resources. Each market is arranged in an N-tree and the trees are linked by a Bruijn graph.

The tree topology allows efficient searching of available resources in a specific market, while the Bruijn graph provides good scalability because search complexity does not depend on the number of markets. A Hilbert function is used to arrange the markets in one ordered, mono-dimensional space. In this way, the proposed architecture exploits the Bruijn and N-tree topologies together with the Hilbert function. A look-up query mechanism for simple and multiple queries with a low algorithmic cost is also introduced over this architecture. The performance of our proposal was analysed by means of simulation, in relation to the widely used Chord overlay in the case of simple queries and to the Baton algorithm in the case of range queries. Furthermore, a large number of experiments demonstrate the proper behaviour of the system. The results obtained reveal the competitiveness of our proposal.

Damia Castellà, Hector Blanco, Francesc Giné, Francesc Solsona
Sampling Bias in BitTorrent Measurements

Real-world measurements play an important role in understanding the characteristics and in improving the operation of BitTorrent, which is currently a popular Internet application. Much like measuring the Internet, the complexity and scale of the BitTorrent network make a single, complete measurement impractical. While a large number of measurements have already employed diverse sampling techniques to study parts of the BitTorrent network, until now there has existed no investigation of their sampling bias, that is, of their ability to objectively represent the characteristics of BitTorrent. In this work we present the first study of sampling bias in BitTorrent measurements. We first introduce a novel taxonomy of sources of sampling bias in BitTorrent measurements. We then investigate the sampling among fifteen long-term BitTorrent measurements completed between 2004 and 2009, and find that different data sources and measurement techniques can lead to significantly different measurement results. Last, we formulate three recommendations to improve the design of future BitTorrent measurements, and estimate the cost of using these recommendations in practice.

Boxun Zhang, Alexandru Iosup, Johan Pouwelse, Dick Epema, Henk Sips
A Formal Credit-Based Incentive Model for Sharing Computer Resources

Peer-to-Peer (P2P) computing, the harnessing of idle CPU cycles through the Internet, offers new research challenges in the domain of distributed computing. This paper presents an incentive mechanism based on credits, designed to operate on different types of shared computing networks such as P2P, P2P Grid, Opportunistic Grid, Desktop Grid, and volunteer computing platforms. The main contribution is a new reinvestment policy, called Weighted, that increases peer participation significantly. This mechanism reflects P2P user dynamics, penalizes free-riders efficiently, and encourages peer participation. Simulation results show that our policy outperforms alternative approaches, maximizing system throughput and limiting free-riding behaviour. Furthermore, a mathematical model has been proposed and analysed in order to formalize the policy and to set up the configuration of the principal parameters of the incentive mechanism.

Josep Rius, Ignasi Barri, Fernando Cores, Francesc Solsona

Topic 8: Distributed Systems and Algorithms

Distributed Systems and Algorithms

Parallel computing is increasingly exposed to the developments and challenges of distributed systems, such as asynchrony, long latencies, network partitions, failures, disconnected operations, heterogeneity and protocol standardization. Furthermore, distributed systems are becoming larger, more diverse and more dynamic (changing topology, highly dynamic number of participants). This Euro-Par topic provides a forum for research and practice on new advances in distributed computing and distributed algorithms. Submissions were encouraged across the whole area, with emphasis on the design and practice of distributed algorithms, scalability, concurrency, performance evaluation, and self-organized distributed systems.

Pascal Felber, Ricardo Jimenez-Peris, Giovanni Schmid, Pierre Sens
Improving Message Logging Protocols Scalability through Distributed Event Logging

Message logging is an attractive solution for providing fault tolerance to message-passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well-known optimization that saves message payloads in the sender's memory, so that only the events corresponding to message receptions have to be logged reliably by an event logger. In existing work on message logging, the event logger has always been considered a centralized process, limiting the scalability of message logging protocols. In this paper, we propose a distributed event logger. This new event logger takes advantage of multi-core processors to execute in parallel with application processes, and makes use of the nodes' volatile memory to save events reliably. We propose a simple gossip-based dissemination protocol to make application processes aware of newly stable events. We evaluated our distributed event logger in the Open MPI library with an optimistic and a pessimistic message logging protocol. Experiments show that distributed event logging improves the scalability of message logging protocols.
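The gossip-based dissemination can be sketched in a few lines of Python (a toy model: the paper's logger lives inside the Open MPI library, and the fanout value here is an assumption):

```python
import random

# Toy model of gossip-based dissemination of stable events; the real
# implementation runs inside the Open MPI library alongside the application.
class EventLogger:
    def __init__(self, node_id, fanout=2):
        self.node_id = node_id
        self.fanout = fanout      # assumed gossip fanout
        self.peers = []           # other EventLogger instances
        self.stable = set()       # receive events known to be safely logged

    def log_event(self, event):
        self.stable.add(event)    # saved in the node's volatile memory

    def gossip_round(self):
        # push our knowledge of stable events to a few random peers,
        # so every process eventually learns which events are stable
        for peer in random.sample(self.peers, min(self.fanout, len(self.peers))):
            peer.stable |= self.stable
```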

Thomas Ropars, Christine Morin
Value-Based Sequential Consistency for Set Objects in Dynamic Distributed Systems

This paper introduces a shared object, namely a set object, that allows processes to add and remove values as well as take a snapshot of its content. A new consistency condition suited to such an object, named value-based sequential consistency, is introduced; it is weaker than linearizability. The paper also addresses the construction of a set object in a synchronous anonymous distributed system where participants can continuously join and leave the system. Interestingly, the protocol is proved correct under the assumption that some constraint on the churn is satisfied. This shows that the notion of “provably correct software” can be applied to dynamic systems.
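For reference, a sequential sketch of the object's interface (operation names are assumptions inferred from the abstract; the paper's contributions, the value-based consistency condition and the churn-tolerant construction, are not reproduced here):

```python
# Sequential reference sketch only; the paper builds this object in a
# dynamic distributed system and relaxes its consistency condition.
class SetObject:
    def __init__(self):
        self._values = set()

    def add(self, v):
        self._values.add(v)

    def remove(self, v):
        self._values.discard(v)   # removing an absent value is a no-op

    def get(self):
        """Snapshot of the current content."""
        return frozenset(self._values)
```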

Roberto Baldoni, Silvia Bonomi, Michel Raynal
Robust Self-stabilizing Construction of Bounded Size Weight-Based Clusters

We propose the first robust self-stabilizing protocol that builds 1-hop clusters of bounded size, with weight-based clusterhead selection. The protocol quickly (within 4 rounds) reaches a safe configuration in which the safety property is satisfied: network nodes are partitioned into bounded-size clusters (though clusterheads are not necessarily the most suitable nodes). During convergence to a legitimate configuration, where further desired properties are guaranteed, the safety property is preserved, thereby ensuring the continuous operation of hierarchical protocols.
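The weight-based selection rule can be illustrated with a centralized one-pass sketch (the size bound and tie-breaking are illustrative assumptions; the paper's protocol is distributed and self-stabilizing):

```python
SIZE_BOUND = 3  # illustrative bound on cluster size, including the head

def elect_clusterheads(weight, neighbours):
    """Centralized sketch of the selection rule only.
    weight: node -> weight; neighbours: node -> set of 1-hop neighbours."""
    head, load = {}, {}
    for v in sorted(weight, key=weight.get, reverse=True):   # heaviest first
        # join the heaviest neighbouring clusterhead that still has room
        for c in sorted(neighbours[v], key=weight.get, reverse=True):
            if head.get(c) == c and load[c] < SIZE_BOUND:
                head[v] = c
                load[c] += 1
                break
        else:
            head[v] = v        # no head has room: v leads its own cluster
            load[v] = 1
    return head

w = {"a": 9, "b": 7, "c": 5, "d": 3}
nbrs = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
print(elect_clusterheads(w, nbrs))   # {'a': 'a', 'b': 'a', 'c': 'a', 'd': 'd'}
```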

Colette Johnen, Fouzi Mekhaldi
Adaptive Conflict Unit Size for Distributed Optimistic Synchronization

Distributed and parallel applications often require accessing shared data. Distributed transactional memory is an emerging concept for concurrent shared data access. By using optimistic synchronization, transactional memory is simpler to use and less error-prone than explicit lock-based synchronization. However, distributed transactional memories are particularly sensitive to phenomena such as true sharing and false sharing, which are caused by correlated data access patterns on multiple nodes. In this paper, we propose a transparent technique that adaptively manages conflict unit sizes for distributed optimistic synchronization in order to relieve application developers from reasoning about such sharing phenomena. Experiments with micro-benchmarks and an on-line data processing application similar to Twitter (using the MapReduce computing model) show the benefits of the proposed approach.
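The adaptation idea can be sketched as follows (thresholds and the split rule are assumed for illustration; the paper's mechanism is transparent and built into the distributed transactional memory): a conflict between transactions that touched disjoint byte ranges of the same unit indicates false sharing and suggests splitting the unit, while overlapping ranges indicate true sharing, which splitting cannot remove.

```python
from dataclasses import dataclass

MIN_UNIT = 64  # bytes; illustrative lower bound on conflict-unit size

@dataclass
class ConflictUnit:
    size: int
    true_conflicts: int = 0

    def split(self):
        self.size //= 2  # finer granularity; a real system re-maps addresses

def on_conflict(unit, ranges_a, ranges_b):
    """ranges_*: byte ranges [lo, hi) written by two conflicting transactions."""
    overlap = any(alo < bhi and blo < ahi
                  for alo, ahi in ranges_a
                  for blo, bhi in ranges_b)
    if not overlap and unit.size > MIN_UNIT:
        unit.split()              # false sharing: shrink the conflict unit
    elif overlap:
        unit.true_conflicts += 1  # true sharing: splitting will not help
```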

Kim-Thomas Rehmann, Marc-Florian Müller, Michael Schöttner
Frame Allocation Algorithms for Multi-threaded Network Cameras

This paper presents a first attempt at a challenging problem, proposing novel and effective algorithms to efficiently distribute video frames from network cameras to many concurrent clients.

The usual scenario studied consists of a camera generating video frames at a given rate and distributing them over a network to several concurrent clients. In general, the idea is to allocate one thread per client at the camera, sharing a pool of one-frame buffers. The algorithms studied consider both the allocation of buffers to new frames and the allocation of frames to clients.

We study different combinations of algorithms, buffers and clients in order to find an optimal solution for the usual scenarios faced when the network camera is under heavy use. The main conclusion is that frame allocation algorithms have a strong impact on system performance: under the same conditions, client performance improves from 4 to 25 frames per second with the best algorithm combination at the camera.
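One simple combination of the ideas above can be sketched in Python (illustrative only; the paper evaluates several allocation policies against each other):

```python
import threading

# Minimal sketch of the shared buffer-pool idea: the camera thread writes
# frames round-robin into the pool, and every client thread is handed the
# freshest completed frame.
class FramePool:
    def __init__(self, n_buffers):
        self._lock = threading.Lock()
        self._buffers = [None] * n_buffers
        self._latest = -1                      # index of the freshest frame

    def put_frame(self, frame):                # called by the camera thread
        with self._lock:
            slot = (self._latest + 1) % len(self._buffers)
            self._buffers[slot] = frame        # overwrite the oldest buffer
            self._latest = slot

    def get_frame(self):                       # called by per-client threads
        with self._lock:
            return self._buffers[self._latest] if self._latest >= 0 else None
```

With this policy, slow clients skip frames rather than delay the camera, which is one way the choice of allocation algorithm translates into the per-client frame rates the paper measures.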

José Miguel Piquer, Javier Bustos-Jiménez
Scalable Distributed Simulation of Large Dense Crowds Using the Real-Time Framework (RTF)

The simulation of large groups (crowds) of individuals is a complex and challenging task. It requires creating an adequate model that takes into account psychological features of individuals, as well as developing and implementing efficient computation and communication strategies in order to manage the immense computational workload implied by the numerous interactions within a crowd.

This paper develops a novel model for a realistic, real-time simulation of large and dense crowds, focusing on evacuation scenarios and the modeling of panic situations. Our approach ensures that both global navigation and local motion are modeled close to reality, and the user can flexibly change both the simulation environment and parameters at runtime.

Because of the model's high computational intensity, we implement the simulation in a distributed manner on multiple server machines, using the RTF (Real-Time Framework) middleware. We implement state replication as an alternative to the traditional state distribution via zoning. We show that RTF enables high-level development of distributed simulations and supports efficient runtime execution on multiple servers.
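To make the zoning/replication distinction concrete, here is a minimal Python sketch (illustrative only; the function names and the strip-based zoning are assumptions, not RTF's API): with zoning, the server that owns an agent's spatial zone holds its state, whereas with replication every server holds the full state and only the per-step computation is partitioned.

```python
# Illustrative contrast; not RTF's API. Zones are assumed to be strips
# along the x axis for simplicity.
def owner_by_zoning(agent_pos, zones):
    """zones: list of (x_min, x_max, server_id); the owning server stores
    the agent's state and computes its updates."""
    for x_min, x_max, server in zones:
        if x_min <= agent_pos[0] < x_max:
            return server
    raise ValueError("position outside all zones")

def compute_server_by_replication(agent_id, n_servers):
    """With full state replication, any server can read any agent; each
    agent's update is computed on one server (here: round-robin by id)
    and broadcast to the other replicas."""
    return agent_id % n_servers

print(owner_by_zoning((12.5, 3.0), [(0, 10, "s0"), (10, 20, "s1")]))  # s1
print(compute_server_by_replication(agent_id=7, n_servers=3))         # 1
```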

Ole Scharf, Sergei Gorlatch, Felix Blanke, Christoph Hemker, Sebastian Westerheide, Tobias Priebs, Christoph Bartenhagen, Alexander Ploss, Frank Glinka, Dominik Meilaender
The x-Wait-Freedom Progress Condition

The liveness of concurrent objects despite asynchrony and failures is a fundamental problem. To that end several progress conditions have been proposed. Wait-freedom is the strongest of these conditions: it states that any object operation must terminate if the invoking process does not crash. Obstruction-freedom is a weaker progress condition as it requires progress only when a process executes in isolation for a long enough period.

This paper explores progress conditions in n-process asynchronous read/write systems enriched with base objects with consensus number x, 1 < x ≤ n (i.e., objects that wait-free solve consensus in a set of x processes). It is easy to solve consensus in such a system if progress is required only when one of the x processes allowed to access the underlying consensus object invokes this object and does not crash. This paper proposes and investigates a stronger progress condition that we call x-wait-freedom (n-wait-freedom is wait-freedom). While it needs no more assumptions than the previous condition to ensure progress, it identifies additional scenarios in which progress is required even though none of the x processes allowed to access the underlying consensus object participates. The paper then presents, and proves correct, a consensus algorithm that satisfies this progress condition.

Damien Imbs, Michel Raynal
Backmatter
Metadata
Title
Euro-Par 2010 - Parallel Processing
edited by
Pasqua D’Ambra
Mario Guarracino
Domenico Talia
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-15277-1
Print ISBN
978-3-642-15276-4
DOI
https://doi.org/10.1007/978-3-642-15277-1
