Topic 1: Support Tools and Environments

Support tools and environments are vital to the production of efficient parallel and distributed applications. This year, eleven papers were submitted to this topic area, from which five were accepted as full papers.

Bronis R. de Supinski, Matthias Brehm, Luiz DeRose, Tomás Margalef

IOAgent: A Parallel I/O Workload Generator

The purpose of this paper is twofold. First, we present IOAgent, a tool that makes it easy to generate synthetic workloads for parallel environments. IOAgent has been implemented for Linux and takes into account different I/O characteristics, such as synchronous and asynchronous calls, buffered and unbuffered accesses, as well as different numbers of disks, intermediate buffers and agents simulating the workload. Second, we propose statistical models that help us analyze the I/O behaviour of an IBM eServer OpenPower 710 with 4 SCSI drives. The observations used to build the model have been obtained using IOAgent.
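
As a rough illustration of agent-based synthetic I/O workload generation (all names and parameters below are our own invention, not IOAgent's actual interface), a minimal sketch might look like this:

```python
import os
import threading

def io_agent(path, n_blocks=64, block_size=4096, buffered=True):
    """One synthetic agent: sequentially writes n_blocks blocks to path.

    buffered=True uses Python's buffered file object; buffered=False
    issues raw os.write() calls on an unbuffered descriptor.
    """
    data = b"\0" * block_size
    if buffered:
        with open(path, "wb") as f:
            for _ in range(n_blocks):
                f.write(data)
    else:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        try:
            for _ in range(n_blocks):
                os.write(fd, data)
        finally:
            os.close(fd)

def run_workload(directory, n_agents=4, **kwargs):
    """Run n_agents concurrent agents, one file per agent."""
    threads = [
        threading.Thread(
            target=io_agent,
            args=(os.path.join(directory, f"agent{i}.dat"),),
            kwargs=kwargs,
        )
        for i in range(n_agents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A real workload generator would additionally vary access patterns, mix reads with writes, and use asynchronous I/O; the sketch only shows the agent/parameter structure.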

Sergio Gómez-Villamor, Victor Muntés-Mulero, Marta Pérez-Casany, John Tran, Steve Rees, Josep-L. Larriba-Pey

TDP_SHELL: An Interoperability Framework for Resource Management Systems and Run-Time Monitoring Tools

Resource management systems and tool support are two important factors for efficiently developing applications in large clusters. On the one hand, management systems (in the form of batch queue systems) are responsible for all issues related to executing jobs on the existing machines. On the other hand, run-time tools (in the form of debuggers, tracers, performance analyzers, etc.) are used to guarantee the correctness and the efficiency of execution. Executing an application under the control of both a resource management system and a run-time tool is still a challenging problem in most cases. Using run-time tools might be difficult or even impossible in usual environments due to the restrictions imposed by resource managers. We propose TDP-Shell as a framework for providing the necessary mechanisms to enable and simplify using run-time tools under a specific resource management system. We have analyzed the essential interactions between common run-time tools and resource management systems and implemented a pilot TDP-Shell. The paper describes the main components of TDP-Shell and its use with some illustrative examples.

Vicente Ivars, Ana Cortes, Miquel A. Senar

Supporting Cache Locality Optimization with a Toolset

Cache performance significantly influences the computation power of modern processors. With the trend in microprocessor design, for both general use and embedded systems, towards chip multiprocessors, cache performance becomes even more important, because an off-chip access is rather expensive in comparison with on-chip references. This means cache locality optimization remains a hot research area for the next generation of computer architectures.

In this paper we present a tool environment that aims to provide programmers with sufficient support in the task of optimizing source code for better runtime cache behavior. This environment contains a set of tools ranging from profiling, analysis, and simulation tools for gathering performance data, to visualization tools for graphical presentation and platforms for program development. Together, these tools establish a feedback loop for tuning cache performance on current and emerging uniprocessor and multiprocessor systems.

Jie Tao, Wolfgang Karl

Model-Based Performance Diagnosis of Master-Worker Parallel Computations

Parallel performance tuning naturally involves a diagnosis process to locate and explain sources of program inefficiency. We propose an approach that exploits parallel computation patterns (models) for diagnosis discovery. Knowledge of performance problems and inference rules for hypothesis search are engineered from model semantics and analysis expertise. In this manner, the performance diagnosis process can be automated as well as adapted to parallel model variations. We demonstrate the implementation of model-based performance diagnosis on the classic Master-Worker pattern. Our results suggest that pattern-based performance knowledge can provide effective guidance for locating and explaining performance bugs at a high level of program abstraction.

Li Li, Allen D. Malony

Specification of Inefficiency Patterns for MPI-2 One-Sided Communication

Automatic performance analysis of parallel programs can be accomplished by scanning event traces of program execution for patterns representing inefficient behavior. The temporal and spatial relationships between individual runtime events recorded in the event trace allow the recognition of wait states as a result of suboptimal parallel interaction. In our earlier work [1], we have shown how patterns related to point-to-point and collective communication can be easily specified using common abstractions that represent execution-state information and links between related events. In this article, we present new abstractions targeting remote memory access (also referred to as one-sided communication) as defined in the MPI-2 standard. We also describe how the general structure of these abstractions differs from our earlier work to accommodate the more complicated sequence of data-transfer and synchronization operations required for this type of communication. To demonstrate the benefits of our methodology, we specify typical performance properties related to one-sided communication.

Andrej Kühnal, Marc-André Hermanns, Bernd Mohr, Felix Wolf

Topic 2: Performance Prediction and Evaluation

Parallel computing enables solutions to computational problems that are impossible on sequential systems due to their limited performance. To meet this objective, it is critical that users can both measure performance on a given system and predict the performance for other systems. Achieving high performance on parallel computer systems is the product of an intimate combination of hardware architecture (processor, memory, interconnection network), system software, runtime environment, algorithms, and application design. Performance evaluation is the science of understanding these factors that contribute to the overall expression of parallel performance on real machines and on systems yet to be realized. Benchmarking and performance characterization methodologies and tools provide an empirical foundation for performance evaluation. Performance prediction techniques provide a means to model performance behaviors and properties as system, algorithm, and software features change, particularly in the context of large-scale parallelism. These two areas are closely related since most prediction requires data to be gathered from measured runs of a program, to identify application signatures or to understand the performance characteristics of current machines.

Jesús Labarta, Bernd Mohr, Allan Snavely, Jeffrey Vetter

Hierarchical Model Validation of Symbolic Performance Models of Scientific Kernels

Multi-resolution validation of hierarchical performance models of scientific applications is critical primarily for two reasons. First, the step-by-step validation determines the correctness of all essential components or phases in a science simulation. Second, a model that is validated at multiple resolution levels is the first step towards generating predictive performance models, not only for existing systems but also for emerging systems and future problem sizes. We present the design and validation of hierarchical performance models of two scientific benchmarks using a new technique called modeling assertions (MA). Our MA prototype framework generates symbolic performance models that can be evaluated efficiently by generating equivalent model representations in Octave and MATLAB. The multi-resolution modeling and validation is conducted on two contemporary massively parallel systems, the Cray XT3 and the Blue Gene/L. The workload distribution and growth-rate predictions generated by the MA models are confirmed by the experimental data collected on the MPP platforms. In addition, the physical memory requirements generated by the MA models are verified against the runtime values on the Blue Gene/L system, which provides 512 MBytes and 256 MBytes of physical memory in its two execution modes.

Sadaf R. Alam, Jeffrey S. Vetter

Tuning Application in a Multi-cluster Environment

The joining of geographically distributed heterogeneous clusters of workstations through the Internet can be a simple and effective approach to speeding up a parallel application. This paper describes a methodology for migrating a parallel application from a single cluster to a collection of clusters while guaranteeing a minimum level of efficiency. The methodology is applied to a parallel scientific application running on three geographically scattered clusters located in Argentina, Brazil and Spain. Experimental results show that the speedup and efficiency estimations provided by the methodology are more than 90% accurate. Without tuning, the application obtains 45% of the maximum speedup, whereas 94% of that maximum is attained once the tuning process is applied. In both cases efficiency is over 90%.
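
The speedup and efficiency figures quoted above follow the standard definitions; a minimal helper (our own naming, not from the paper) makes the relation explicit:

```python
def speedup(t_serial, t_parallel):
    """Ratio of serial to parallel execution time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, n_procs):
    """Parallel efficiency: fraction of ideal linear speedup achieved.

    E.g. a job taking 1000 s serially and 50 s on 24 processors has
    speedup 20 and efficiency 20/24, i.e. roughly 83%.
    """
    return speedup(t_serial, t_parallel) / n_procs
```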

Eduardo Argollo, Adriana Gaudiani, Dolores Rexachs, Emilio Luque

Analyzing the Interaction of OpenMP Programs Within Multiprogramming Environments on a Sun Fire E25K System with PARbench

Nowadays, most high performance computing systems run in multiprogramming mode, with several user programs simultaneously utilizing the available CPUs. Even though most current SMP systems are implemented as ccNUMA to reduce the bottleneck of main memory access, user programs still interact, as they share other system resources and influence the scheduler's decisions with the load they generate. PARbench was designed to generate complete load scenarios based on synthetic jobs and to measure job behavior during the execution of these scenarios. The Sun Fire E25K is a ccNUMA system with up to 72 dual-core CPUs and a crossbar-based interconnection network. This paper describes the results of an examination of such a system with PARbench. First, PARbench was used to investigate the performance impact caused by the interaction of jobs on fully loaded and overloaded machines. Second, the impact of operating system tasks on the performance of OpenMP-parallelized programs under full load, as created by the cluster batch engine, is quantified, especially when these system tasks are not considered in the initial load calculation. Additionally, the generated scenarios were used for a statistical analysis of the scheduling of OpenMP programs, focusing on data locality and migration frequency.

Rick Janda, Wolfgang E. Nagel, Bernd Trenkler

Early Experiences with KTAU on the IBM BG/L

The influences of OS and system-specific effects on application performance are increasingly important in high performance computing. In this regard, OS kernel measurement is necessary to understand the interrelationship of system and application behavior. This can be viewed from two perspectives: kernel-wide and process-centric. An integrated methodology and framework to observe both views in HPC systems using OS kernel measurement has remained elusive. We demonstrate a new tool called KTAU (Kernel TAU) that aims to provide parallel kernel performance measurement from both perspectives. KTAU extends the TAU performance system with kernel-level monitoring, while leveraging TAU’s measurement and analysis capabilities. As part of the ZeptoOS scalable operating systems project, we report early experiences using KTAU in ZeptoOS on the IBM BG/L system.

Aroon Nataraj, Allen D. Malony, Alan Morris, Sameer Shende

PAM-SoC: A Toolchain for Predicting MPSoC Performance

In the past, research on Multiprocessor Systems-on-Chip (MPSoC) has focused mainly on increasing the available processing power on a chip, while less effort was put into specific system-level performance analysis or into behavior prediction. This paper introduces PAM-SoC, a light-weight predictor for MPSoC system-level performance. Being based on Pamela, a static performance predictor for parallel applications, PAM-SoC can compute its prediction in seconds for cases in which cycle-accurate simulation takes tens of minutes. The paper includes a set of PAM-SoC validation experiments, as well as two sets of experiments that show how PAM-SoC can be used for either application tuning or MPSoC platform tuning in early system design phases.

Ana Lucia Varbanescu, Henk Sips, Arjan van Gemund

Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack

To leverage high speed interconnects like InfiniBand, it is important to minimize the communication overhead. The most significant such overhead is the registration of communication memory.

In this paper, we present our analysis of the memory registration process inside the Mellanox InfiniBand driver and possible ways out of this bottleneck. We evaluate and characterize the most time consuming parts in the execution path of the memory registration function using the Read Time Stamp Counter (RDTSC) instruction. We present measurements on AMD Opteron and Intel Xeon systems with different types of Host Channel Adapters for PCI-X and PCI-Express. Finally, we conclude with first results using Linux hugepage support to shorten the time of registering a memory region.
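
A common mitigation of repeated registration cost, independent of this particular paper, is a registration (pin-down) cache that memoizes registered regions. A language-neutral sketch, with a placeholder standing in for the expensive driver call (nothing here is actual InfiniBand API):

```python
class RegistrationCache:
    """Sketch of a pin-down cache: registering a communication buffer is
    expensive (page pinning, address translation), so memoize the handle
    per (address, length) and pay the cost only on first use."""

    def __init__(self, register_fn):
        self._register_fn = register_fn   # the expensive driver call
        self._cache = {}                  # (addr, length) -> handle
        self.misses = 0

    def register(self, addr, length):
        key = (addr, length)
        handle = self._cache.get(key)
        if handle is None:
            self.misses += 1
            handle = self._register_fn(addr, length)
            self._cache[key] = handle
        return handle
```

A production cache would also handle partial overlaps, eviction, and invalidation when buffers are freed; the sketch only shows the memoization that amortizes the registration cost measured in the paper.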

Frank Mietke, Robert Rex, Robert Baumgartl, Torsten Mehlan, Torsten Hoefler, Wolfgang Rehm

Optimization of Dense Matrix Multiplication on IBM Cyclops-64: Challenges and Experiences

This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. On such architectures a more economical use of on-chip storage resources appears to discourage the use of caches, while tremendous on-chip memory bandwidth is provided per unit of storage area.

This paper presents an in-depth case study of a collection of well known optimization methods and tries to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for applications like matrix multiplication.
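
As a generic illustration of the blocking idea that underlies memory-hierarchy-aware matrix multiplication (not the authors' C64 code), a tiled multiply can be sketched as:

```python
def matmul_tiled(A, B, tile=32):
    """Blocked matrix multiply: operate on tile x tile submatrices so each
    block of A and B is reused while it is resident in fast memory.

    A is n x m, B is m x p, both given as lists of lists of floats.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, m, tile):
            for jj in range(0, p, tile):
                # multiply one pair of tiles and accumulate into C
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, m)):
                        a = A[i][k]
                        row_b = B[k]
                        row_c = C[i]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += a * row_b[j]
    return C
```

On a machine with explicitly managed on-chip memory, the tile size would be chosen so that the working set of the three tiles fits in the fast storage level.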

Ziang Hu, Juan del Cuvillo, Weirong Zhu, Guang R. Gao

Optimizing OpenMP Parallelized DGEMM Calls on SGI Altix 3700

Using functions of parallelized mathematical libraries is a common way to accelerate numerical applications. Computer architectures with shared memory characteristics support different approaches for the implementation of such libraries, usually OpenMP or MPI.

This paper compares the performance of DGEMM calls (double-precision floating-point matrix multiplication) using different OpenMP-parallelized numerical libraries, namely Intel MKL and SGI SCSL, and shows how these calls can be optimized. Additionally, we look at the memory placement policy and give hints for initializing data. Our examinations focus on an SGI Altix 3700 Bx2 system, using BenchIT [1] as a very convenient performance measurement suite.

Daniel Hackenberg, Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger

Topic 3: Scheduling and Load Balancing

An increasing variety of parallel and distributed systems is being developed throughout the world. Parallelism is available at granularities ranging from multi-core processors, via SMPs and clusters, to Grids. However, much of the computing power in these systems remains unusable because adequate algorithms and software for managing these resources are unavailable.

Michael Bender, Dror Feitelson, Allan Gottlieb, Uwe Schwiegelshohn

The Price of Approximate Stability for Scheduling Selfish Tasks on Two Links

We consider a

scheduling game

, where a set of selfish agents (traffic loads) want to be routed in exactly one of the two parallel links of a system. Every agent aims to minimize

her own completion time

, while the social objective is the


, i.e. the time at which the last agent finishes her execution. We study the problem of optimizing the makespan under the constraint that the obtained schedule is a (pure) Nash equilibrium, i.e. a schedule in which no agent has incentive to unilaterally change her strategy (link). We consider a relaxation of the notion of


by considering


-approximate Nash equilibria where an agent does not have


incentive (w.r.t. the value of


) to unilaterally change her strategy. Our main contribution is the study of the


between the

approximation ratio

for the makespan and the value of


. We first give an algorithm which provides a solution with an approximation ratio of


for the makespan and which is a 3-approximate Nash equilibrium, provided that the local policy of each link is

Longest Processing Time

(LPT). Furthermore, we show that a slight modification of the classical

Polynomial Time Approximation Scheme

(PTAS) of Graham allows to obtain a schedule whose makespan is arbitrarily close to the optimum while keeping a constant value for


. Finally, we give bounds establishing relations between the value of


and the best possible value of the approximation ratio, provided that the local policies of the links are LPT.
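
The LPT local policy referred to above is simple to state; a sketch for the two-link case (function names are our own):

```python
def lpt_two_links(loads):
    """Longest Processing Time on two links: consider tasks in
    decreasing order of load and always place the next task on the
    currently less loaded link.

    Returns (makespan, assignment) where assignment[i] is 0 or 1.
    """
    order = sorted(range(len(loads)), key=lambda i: -loads[i])
    totals = [0.0, 0.0]
    assignment = [0] * len(loads)
    for i in order:
        link = 0 if totals[0] <= totals[1] else 1
        assignment[i] = link
        totals[link] += loads[i]
    return max(totals), assignment
```

The classic worst case for LPT on two machines is loads [3, 3, 2, 2, 2]: LPT yields makespan 7, while the optimum (3+3 versus 2+2+2) is 6.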

Eric Angel, Evripidis Bampis, Fanny Pascual

Master-Slave Tasking on Asymmetric Networks

This paper presents new techniques for master-slave tasking on tree-shaped networks with fully heterogeneous communication and processing resources. A large number of independent, equal-sized tasks are distributed from the master node to the slave nodes for processing and return of result files. The network links present bandwidth asymmetry, i.e. the send and receive bandwidths of a link may be different. The nodes can overlap computation with at most one send and one receive operation. A centralized algorithm that maximizes the platform throughput under static conditions is presented. Thereafter, we propose several distributed heuristics making scheduling decisions based on information estimated locally. Extensive simulations demonstrate that distributed heuristics are better suited to cope with dynamic environments, but also compete well with centralized heuristics in static environments.

Cyril Banino-Rokkones, Olivier Beaumont, Lasse Natvig

Using On-the-Fly Simulation for Estimating the Turnaround Time on Non-dedicated Clusters

The computation capacity of the workstations in an open laboratory at almost any university is enough to execute not only the local workload but also some distributed computation. Unfortunately, the local workload introduces considerable uncertainty into the predictability of the system, which hinders the applicability of job scheduling strategies.

In this work, we introduce into our job scheduling system, termed CISNE, a simulator that allows its scheduling decisions to be enhanced by estimating the future cluster state. This estimation process is backed by analytic procedures, which are also described in this study. Likewise, the simulation lets us guarantee a bound on the turnaround time for the parallel user. This paper analyses the performance of the simulation process in relation to different scheduling policies. The results reveal that policies that respect an FCFS order for the waiting jobs are more predictable than those that alter the job ordering, like backfilling.
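
As a greatly simplified illustration of estimating turnaround by simulation (far simpler than CISNE's simulator), an FCFS queue on a single resource can be simulated as follows; all names are our own:

```python
def fcfs_turnaround(jobs):
    """Simulate a single-resource FCFS queue.

    jobs: list of (arrival_time, run_time) pairs, sorted by arrival.
    Returns the list of turnaround times (finish - arrival) per job.
    """
    clock = 0.0
    turnaround = []
    for arrival, run_time in jobs:
        start = max(clock, arrival)   # wait until the resource is free
        clock = start + run_time
        turnaround.append(clock - arrival)
    return turnaround
```

Because FCFS never reorders the queue, a job's predicted turnaround depends only on the jobs ahead of it, which is exactly why FCFS-respecting policies are more predictable than reordering policies such as backfilling.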

Mauricio Hanzich, Josep L. Lérida, Matías Torchinsky, Francesc Giné, Porfidio Hernández, Emilio Luque

An Adaptive Scheduling Method for Grid Computing

This paper presents an adaptive scheduling method, which can be used for parallel applications whose total workload is unknown a priori. This method can deal with the unpredictable execution conditions commonly encountered on grids. To address this scheduling problem, parameters which quantify the dynamic nature of the execution conditions had to be defined. The proposed scheduling method is based on an on-line algorithm so as to be adaptable to the varying execution conditions, but avoids the idle periods inherent to this on-line algorithm.

Salah-Salim Boutammine, Daniel Millot, Christian Parrot

On the Placement of Reservations into Job Schedules

We present a new method for determining placements of flexible reservation requests into a schedule. For each considered placement, the method inserts a placeholder into the schedule and simulates the processing of the batch jobs currently known to the system. Each placement is evaluated with respect to well-known scheduling metrics. This information may be used by a Grid reservation service to choose the placement of a reservation most likely to succeed. According to the results of extensive simulations, the method grants more reservations and improves the performance of local jobs compared to our previously used approach.

Thomas Röblitz, Krzysztof Rzadca

A Practical Approach of Diffusion Load Balancing Algorithms

In this paper, a practical approach to diffusion load balancing algorithms and its implementation are studied. Three problems are investigated. The first is the determination of the load balancing parameters without any global knowledge. The second consists in estimating the cost and the benefit of a load exchange. The last concerns detecting the convergence of the load balancing algorithm; for this point, we give an algorithm based on simulated annealing that avoids convergence towards a stepped load distribution, which can occur with discrete loads. Several simulations close this paper and illustrate the impact of the various methods and algorithms introduced.
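
The basic first-order diffusion scheme underlying such algorithms can be sketched as follows (a generic textbook version with our own naming, not the authors' implementation):

```python
def diffusion_step(load, neighbors, alpha):
    """One synchronous first-order diffusion step: each node i moves
    alpha * (load[j] - load[i]) from/to each neighbor j, so load flows
    from heavier to lighter nodes while the total load is conserved."""
    new = load[:]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            new[i] += alpha * (load[j] - load[i])
    return new

def balance(load, neighbors, alpha=0.25, steps=100):
    """Iterate the diffusion step a fixed number of times."""
    for _ in range(steps):
        load = diffusion_step(load, neighbors, alpha)
    return load
```

On a connected graph and with a small enough alpha (below the reciprocal of the maximum degree), the iteration converges to the uniform average load; detecting that convergence without global knowledge is precisely one of the problems the paper addresses.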

Emmanuel Jeannot, Flavien Vernier

Fast Diffusion Load Balancing Algorithms on Torus Graphs

In this paper, we consider the application of accelerated methods in order to increase the rate of convergence of diffusive iterative load balancing algorithms. In particular, we compare the application of Semi-Iterative, Second Degree and Variable Extrapolation techniques to the basic Diffusion method and the Extrapolated Diffusion method on torus graphs. It is shown that our methods require approximately 30% fewer iterations to reach the balanced state compared to the existing ones.

Gregory Karagiorgos, Nikolaos M. Missirlis, Filippos Tzaferis

A Parallel Shape Optimizing Load Balancer

Load balancing is an important issue in parallel numerical simulations. However, state-of-the-art libraries addressing this problem show several deficiencies: they are hard to parallelize, focus on small edge-cuts rather than few boundary vertices, and often produce disconnected partitions.

We present a distributed implementation of a load balancing heuristic for parallel adaptive FEM simulations. It is based on a disturbed diffusion scheme embedded in a learning framework. This approach incorporates a high degree of exploitable parallelism and computes well-shaped partitions, as shown in previous publications. Our focus lies on improving the condition of the involved matrix and on solving the resulting linear systems with local accuracy. This helps to omit unnecessary computations and allows the domain decomposition to be replaced by an alternative data distribution scheme that reduces the communication overhead, as shown by experiments with our new MPI-based implementation.

Henning Meyerhenke, Stefan Schamberger

Improvement of the Efficiency of Genetic Algorithms for Scalable Parallel Graph Partitioning in a Multi-level Framework

Parallel graph partitioning is a difficult issue, because the best sequential graph partitioning methods known to date are based on iterative local optimization algorithms that neither parallelize nor scale well. On the other hand, evolutionary algorithms are highly parallel and scalable, but converge very slowly as problem size increases. This paper presents methods that can dramatically reduce the problem space when graph partitioning is performed in a multi-level framework, thus enabling the use of evolutionary algorithms as possible candidates, among others, for the realization of efficient and scalable parallel graph partitioning tools. Results obtained on the recursive bipartitioning problem with a multi-threaded genetic algorithm are presented, which show that this approach outperforms existing state-of-the-art parallel partitioners.

Cédric Chevalier, François Pellegrini

Probabilistic Self-Scheduling

Scheduling for large parallel systems such as clusters and grids presents new challenges due to multiprogramming/polyprocessing [1]. In such systems, several jobs (each consisting of a number of parallel tasks) of multiple users may run at the same time. Processors are allocated to the different jobs either statically or dynamically; further, a processor may be taken away from a task of one job and reassigned to a task of another job. Thus, the number of processors available to a job varies with time. Although several approaches have been proposed in the past for scheduling tasks on multiprocessors, they assume dedicated availability of processors. Consequently, the existing scheduling approaches are not suitable for multiprogrammed systems. In this paper, we present a novel probabilistic approach for scheduling parallel tasks on multiprogrammed parallel systems. The key characteristic of the proposed scheme is its self-adaptive nature, i.e., it is responsive to systemic parameters such as the number of processors available. Self-adaptation helps achieve better load balance between the different processors and helps reduce the synchronization overhead (number of allocation points). Experimental results show the effectiveness of our technique.
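
For contrast with the probabilistic scheme, the classic guided self-scheduling rule that such approaches refine can be sketched as follows (naming is our own):

```python
import math

def guided_chunks(n_iters, n_procs):
    """Classic guided self-scheduling: each idle processor grabs
    ceil(remaining / n_procs) iterations, so chunks shrink as the loop
    drains, trading scheduling overhead against load balance.

    Returns the sequence of chunk sizes handed out.
    """
    remaining = n_iters
    chunks = []
    while remaining > 0:
        c = math.ceil(remaining / n_procs)
        chunks.append(c)
        remaining -= c
    return chunks
```

Note that this rule assumes n_procs is fixed; the point of the paper's probabilistic scheme is precisely to adapt when the number of available processors varies over time.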

Milind Girkar, Arun Kejariwal, Xinmin Tian, Hideki Saito, Alexandru Nicolau, Alexander Veidenbaum, Constantine Polychronopoulos

Data Sharing Conscious Scheduling for Multi-threaded Applications on SMP Machines

Extensive use of multi-threaded applications that run on SMP machines justifies modifications in thread scheduling algorithms to consider threads’ characteristics in order to improve performance. Current schedulers (e.g. in Linux, AIX) avoid migrating tasks between CPUs unless absolutely necessary. Unwarranted data cache misses occur when tasks that share data run on different CPUs, or are far apart time-wise on the same CPU. This work presents an extension to the Linux scheduler that exploits inter-task data relations to reduce data cache misses in multi-threaded applications running on SMP platforms, thus improving runtime, memory throughput, and energy consumption. Our approach schedules the tasks to the CPU that holds the relevant data rather than to the one with the highest affinity. We observed improvements in CPU time and throughput on several benchmarks. For the Chat benchmark, the improvement in CPU time and cache misses is over 30% on average.

Shlomit S. Pinter, Marcel Zalmanovici

Topic 4: Compilers for High Performance

Welcome to Euro-Par’s Topic 4, which provides a forum for the presentation of the latest research results and practical experience in Compilers for High Performance. Topic 4 deals with compilation for high performance architectures. Papers were invited targeting both general-purpose platforms and specialised hardware designs such as graphic coprocessors or low-power embedded systems. We were also keen to solicit papers that explore more general issues, such as program analysis, languages, run-time systems, and feedback-directed optimisation.

William Jalby, Oscar Plata, Barbara Chapman, Paul Kelly

Compiler Technology for Blue Gene Systems

Standard compilers are incapable of fully harnessing the enormous performance potential of Blue Gene systems. To reach the leading position in the Top500 supercomputing list, IBM had to put considerable effort into coding and tuning a limited range of low-level numerical kernel routines by hand. In this paper the Vienna MAP compiler is presented, which particularly targets the signal transform codes ubiquitous in compute-intensive scientific applications. Compiling such code, MAP reaches as much as 80% of the optimum performance of Blue Gene systems. In one application code, MAP enabled a sustained performance of 60 Tflop/s to be reached on BlueGene/L.

Stefan Kral, Markus Triska, Christoph W. Ueberhuber

SCAN: A Heuristic for Near-Optimal Software Pipelining

Software pipelining is a classic compiler optimization that improves the performance of inner loops on instruction-level parallel processors. In the context of embedded computing, applications are compiled prior to manufacturing the system, so it is possible to invest large amounts of time in compiler optimizations.

Traditionally, software pipelining is performed by heuristics such as iterative modulo scheduling. Optimal software pipelining can be formulated as an integer linear program; however, these formulations can take exponential time to solve. As a result, the size of loops that can be optimally software pipelined is quite limited.

In this article, we present the SCAN heuristic, which makes it possible to benefit from the integer linear programming formulations of software pipelining even on loops of significant size. The principle of the SCAN heuristic is to iteratively constrain the software pipelining problem until its integer linear programming formulation is solvable in reasonable time.

We applied the SCAN heuristic to a multimedia benchmark for the ST200 VLIW processor. We show that it almost always computes an optimal solution for loops that are intractable by classic integer linear programming approaches. This improves performance by up to 33.3% over the heuristic modulo scheduling of the production ST200 compiler.
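
Modulo-scheduling approaches such as those discussed above search for a schedule at the smallest feasible initiation interval (II). A sketch of the standard resource-constrained lower bound on the II, ResMII, with our own naming:

```python
import math

def res_mii(op_counts, unit_counts):
    """Resource-constrained minimum initiation interval.

    For each resource class, the number of operations the loop body
    issues on that class, divided by the number of functional units of
    that class, rounded up; the II can be no smaller than the worst
    class. (The full lower bound also considers recurrences: RecMII.)

    op_counts:   {resource_class: ops per iteration}
    unit_counts: {resource_class: available units}
    """
    return max(
        math.ceil(op_counts[r] / unit_counts[r])
        for r in op_counts
    )
```

Schedulers start the search at max(ResMII, RecMII) and increase the II until a valid schedule is found; ILP formulations instead prove whether a schedule exists at a given II, which is what SCAN constrains to keep tractable.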

F. Blachot, Benoît Dupont de Dinechin, Guillaume Huard

Code Generation for STA Architecture

This paper presents a novel compiler backend which generates assembly code for the Synchronous Transfer Architecture (STA). STA is a Very Long Instruction Word (VLIW) architecture that, in addition, uses a non-orthogonal Instruction Set Architecture (ISA). Generating efficient code for this architecture requires highly optimizing techniques. The compiler backend presented in this paper is based on Integer Linear Programming (ILP). Experimental results show that the generated assembly code consumes much less execution time than code generated in traditional ways, and that code generation can be accomplished in acceptable time.

J. Guo, T. Limberg, E. Matus, B. Mennenga, R. Klemm, G. Fettweis

Multi-dimensional Kernel Generation for Loop Nest Software Pipelining

Single-dimension Software Pipelining (SSP) has been proposed as an effective software pipelining technique for multi-dimensional loops [16]. This paper introduces for the first time the scheduling methods that actually produce the kernel code. Because of the multi-dimensional nature of the problem, the scheduling problem is more complex and challenging than with traditional modulo scheduling. The scheduler must handle multiple subkernels and initiation rates under specific scheduling constraints, while producing a solution that minimizes the execution time of the final schedule.

In this paper three approaches are proposed: the level-by-level method, which schedules operations in loop-level order, starting from the innermost, and does not let other operations interfere with the already scheduled levels; the flat method, which schedules operations from different loop levels with the same priority; and the hybrid method, which uses the level-by-level mechanism for the innermost level and the flat solution for the other levels. The methods subsume Huff’s modulo scheduling [8] for single loops as a special case. We also break a scheduling constraint introduced in earlier publications and allow for a more compact kernel. The proposed approaches were implemented in the Open64/ORC compiler, and evaluated on loop nests from the Livermore, SPEC2000 and NAS benchmarks.

Alban Douillet, Hongbo Rong, Guang R. Gao

Towards a Versatile Pointer Analysis Framework

Current pointer analysis techniques fail to find parallelism in heap accesses. However, some of them are still capable of obtaining valuable information about the way dynamic memory is used in pointer-based programs. It would be desirable to have a unified framework with a broadened perspective that can take the best out of available techniques and compensate for their weaknesses. We present an early view of such a framework, featuring a graph-based shape analysis technique. We describe some early experiments that obtain detailed information about how dynamic memory is arranged in the heap. Furthermore, we document how def-use information can be used to greatly optimize shape analysis.

R. Castillo, A. Tineo, F. Corbera, A. Navarro, R. Asenjo, E. L. Zapata

Topic 5: Parallel and Distributed Databases, Data Mining and Knowledge Discovery

Managing and efficiently analysing the vast amounts of data produced by a huge variety of data sources is one of the big challenges in computer science. The development and implementation of algorithms and applications that can extract information diamonds from these ultra-large, and often distributed, databases is a key challenge for the design of future data management infrastructures. Today’s data-intensive applications often suffer from performance problems and an inability to scale to high numbers of distributed data sources. Therefore, distributed and parallel databases have a key part to play in overcoming resource bottlenecks, achieving guaranteed quality of service and providing system scalability. The increased availability of distributed architectures, clusters, Grids and P2P systems, supported by high performance networks and intelligent middleware provides parallel and distributed databases and digital repositories with a great opportunity to cost-effectively support key everyday applications. Further, there is the prospect of data mining and knowledge discovery tools adding value to these vast new data resources by automatically extracting useful information from them.

Patrick Valduriez, Wolfgang Lehner, Domenico Talia, Paul Watson

Dynamic and Distributed Reconciliation in P2P-DHT Networks

Optimistic replication can provide high data availability for collaborative applications in large-scale distributed systems (grid, P2P, and mobile systems). However, if data reconciliation is performed by a single node, data availability remains an important issue since the reconciler node can fail. Thus, reconciliation should also be distributed and reconciliation data should be replicated. We have previously proposed a distributed version of the IceCube semantic reconciliation engine designed for cluster networks. However, that solution is not suitable for P2P networks, which are usually built on top of the Internet; in this case, network costs must be considered. The main contribution of this paper is a distributed reconciliation algorithm designed for P2P networks. We first propose a P2P-DHT cost model for computing communication costs in a DHT overlay network. Second, taking this model into account, we propose a cost model for computing the cost of each reconciliation step. Third, we propose an algorithm that dynamically selects the best nodes for each reconciliation step. Our algorithm yields high data availability with acceptable performance and limited overhead.
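The dynamic node-selection step described above can be illustrated with a small sketch: given an estimated cost per candidate node for a reconciliation step, pick the cheapest. The additive latency-plus-transfer cost model and all names below are illustrative assumptions, not the paper's actual DHT cost model.

```python
# Hypothetical sketch: pick the cheapest node for a reconciliation step.
# The additive latency + transfer-time cost model is an assumption for
# illustration, not the paper's DHT cost model.
def step_cost(latency_ms, transfer_kb, bandwidth_kb_per_ms):
    """Estimated cost of running one reconciliation step on a node."""
    return latency_ms + transfer_kb / bandwidth_kb_per_ms

def select_best_node(nodes):
    """Dynamically select the best node for the current step."""
    return min(nodes, key=lambda n: step_cost(n["latency_ms"],
                                              n["transfer_kb"], n["bw"]))

nodes = [
    {"id": "n1", "latency_ms": 40, "transfer_kb": 200, "bw": 10},
    {"id": "n2", "latency_ms": 5,  "transfer_kb": 200, "bw": 2},
    {"id": "n3", "latency_ms": 10, "transfer_kb": 200, "bw": 10},
]
best = select_best_node(nodes)
```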

Vidal Martins, Esther Pacitti

HyParSVM – A New Hybrid Parallel Software for Support Vector Machine Learning on SMP Clusters

In this paper we describe a new hybrid distributed/shared-memory parallel software for support vector machine learning on large data sets. The support vector machine (SVM) method is a well-known and reliable machine learning technique for classification and regression tasks. Based on a recently developed shared-memory decomposition algorithm for support vector machine classifier design, we increased the level of parallelism by implementing a cross-validation routine based on message passing. With this extension we obtained a flexible parallel SVM software that can be used on high-end machines with SMP architectures to process the large data sets that increasingly arise in bioinformatics and other fields of research.
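The cross-validation-level parallelism described above can be sketched as follows, with a thread pool standing in for the message-passing layer and a toy nearest-mean classifier standing in for the SVM solver; all data and names are illustrative assumptions.

```python
# Illustrative sketch: parallel k-fold cross validation. A thread pool
# stands in for MPI workers and a toy nearest-mean classifier stands in
# for the SVM solver.
from concurrent.futures import ThreadPoolExecutor

data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0),
        (10.0, 1), (11.0, 1), (12.0, 1), (13.0, 1)]
folds = [[data[i], data[i + 4]] for i in range(4)]  # one sample per class per fold

def fold_accuracy(fold_id):
    """Train on all folds but one, evaluate on the held-out fold."""
    test = folds[fold_id]
    train = [s for j, f in enumerate(folds) if j != fold_id for s in f]
    mean = {c: sum(x for x, lab in train if lab == c) /
               sum(1 for _, lab in train if lab == c) for c in (0, 1)}
    predict = lambda x: min(mean, key=lambda c: abs(x - mean[c]))
    return sum(predict(x) == lab for x, lab in test) / len(test)

with ThreadPoolExecutor(max_workers=4) as pool:
    accuracies = list(pool.map(fold_accuracy, range(4)))
cv_accuracy = sum(accuracies) / len(accuracies)
```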

Tatjana Eitrich, Wolfgang Frings, Bruno Lang

Supporting a Real-Time Distributed Intrusion Detection Application on GATES

Increasingly, a number of applications across computer science and other science and engineering disciplines rely on, or can potentially benefit from, analysis and monitoring of data streams. We view the problem of flexible and adaptive processing of distributed data streams as a grid computing problem. In our recent work, we have been developing a middleware, GATES (Grid-based AdapTive Execution on Streams), for enabling grid-based processing of distributed data streams.

This paper reports an application study using the GATES middleware system. We focus on the problem of intrusion detection. We have created a distributed and self-adaptive real-time implementation of the algorithm proposed by Eskin using our middleware. The main observations from our experiments are as follows. First, our distributed implementation can achieve detection rates very close to those of a centralized algorithm. Second, our implementation is able to effectively adjust the adaptation parameters.

Qian Zhu, Liang Chen, Gagan Agrawal

On the Use of Semantic Annotations for Supporting Provenance in Grids

There has been strong demand for provenance support in grid applications, which enables users to trace how a particular result was arrived at by identifying the resources, configurations and execution settings involved. In this paper we analyse the requirements for provenance support and discuss the nature and characteristics of provenance data on the Grid. We define a new concept called augmented provenance, which enhances conventional provenance data with extensive metadata and semantics. A hybrid approach is proposed for the creation and management of augmented provenance, in which semantic annotation is used to generate semantic provenance data and a database management system is used for execution data management. The approach has been applied to a real-world application, and tools and GUIs have been developed to facilitate provenance management and exploitation.

Liming Chen, Zhuoan Jiao, Simon J. Cox

Topic 6: Grid and Cluster Computing: Models, Middleware and Architectures

Grid computing is a major research area with strong involvement from both academia and the computing industry. The common vision is that Grid computing represents the culmination of truly general distributed computing across various resources in a ubiquitous, open-ended infrastructure to support a wide range of different application areas. Recently, the CoreGRID Executive Committee reached an agreement on the following definition: a Grid is "a fully distributed, dynamically reconfigurable, scalable and autonomous infrastructure to provide location independent, pervasive, reliable, secure and efficient access to a coordinated set of services encapsulating and virtualizing resources (computing power, storage, instruments, data, etc.) in order to generate knowledge". Although significant progress has been made in the design and deployment of Grids, many challenges still remain before the goal of a user-friendly, efficient, and reliable grid can be realized. Grid research issues cover many areas of computer science to address the fundamental capabilities and services that are required in a heterogeneous environment, such as adaptability, scalability, reliability and security, and to support applications as diverse as ubiquitous local services, enterprise-scale virtual organizations, and Internet-scale distributed supercomputing.

Domenico Laforenza, Alexander Reinefeld, Dieter Kranzlmüller, Luc Moreau

Supporting Efficient Execution of MPI Applications Across Multiple Sites

One of the main goals of the CrossGrid Project [1] is to provide explicit support to parallel and interactive compute- and data-intensive applications. The CrossBroker job manager provides services as part of the CrossGrid middleware and allows execution of parallel MPI applications on Grid resources in a transparent and automatic way. This document describes the design and implementation of the key components responsible for an efficient and reliable execution of MPI jobs split over multiple Grid sites, executed either in an on-line or batch manner. We also provide details on the overheads introduced by our system, as well as an experimental study showing that our system is well-suited for embarrassingly parallel applications.

Enol Fernández, Elisa Heymann, Miquel Àngel Senar

Private Virtual Cluster: Infrastructure and Protocol for Instant Grids

The current complexity of Grid technologies, the lack of security of P2P systems and the rigidity of VPN technologies make sharing resources belonging to different institutions technically difficult. We propose a new approach called "Instant Grid" (IG), which combines Grid, P2P and VPN approaches to allow simple deployment of applications over different administration domains. Three main requirements must be fulfilled to make Instant Grids realistic: 1) simple networking configuration (firewall and NAT), 2) no degradation of resource security, and 3) no need to re-implement existing distributed applications. In this paper, we present Private Virtual Cluster (PVC), a low-level middleware that meets these three requirements. To demonstrate its properties, we have connected a set of firewall-protected PCs with PVC and conducted experiments to evaluate the networking performance and the capability to execute unmodified MPI applications.

Ala Rezmerita, Tangui Morlier, Vincent Neri, Franck Cappello

Reducing Communication Overhead and Page Faults in SDSM Platforms

In this paper we present a new dynamic cache coherence protocol for Software Distributed Shared Memory (SDSM) systems that adopt the scope-consistency model [7]. We initially outline our basic protocol, called the Reduced Message Protocol (RMP), and then propose two enhancements: the Multiple Home RMP (RMP-MH) and the Lock Migration RMP (RMP-LM). The experimentation we conducted with the proposed protocols exhibits significant improvements, reducing two of the major latency factors in SDSM platforms: the total number of communication messages and the overall number of page faults. To demonstrate the efficiency and effectiveness of the RMP protocols, we used SPLASH as well as synthetic application benchmarks.

Artemis A. Christopoulou, Eleftherios D. Polychronopoulos

Flexible I/O Support for Reconfigurable Grid Environments

With the growing computational power of current supercomputers, scientific computing applications can work on larger problems. The corresponding increase in dataset size is often correlated with an increase in the storage needed for the results. Current storage area networks (SANs) balance I/O load on multiple disks using high-speed networks, but are integrated at the operating system level, demanding administrative intervention if the usage topology changes. While this is practical for single sites or fairly static grid environments, it is hard to extend to a user-defined per-job basis. Reconfigurable grid environments, where computing and storage resources are coupled on a per-job basis, need a more flexible approach for parallel I/O on remote locations.

This paper gives a detailed overview of the abilities of the transparent remote access provided by our middleware, developed as part of a larger grid project. We show how it manages flexible and transparent access to remote I/O resources in a reconfigurable grid environment, supporting the definition of the amount and location of persistent storage services on a per-job basis.

Marc-André Hermanns, Rudolf Berrendorf, Marcel Birkner, Jan Seidel

Storage Exchange: A Global Trading Platform for Storage Services

The Storage Exchange (SX) is a new platform allowing storage to be treated as a tradeable resource. Organisations with varying storage requirements can use the SX platform to trade and exchange storage services. Organisations have the ability to federate their storage, be it dedicated or scavenged, and advertise it to a global storage market. In this paper we discuss the high-level architecture employed by our platform and investigate a sealed Double Auction market model. We implement and experiment with the following clearing algorithms: maximise surplus, optimise utilisation, and an efficient combination of both.
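A sealed double auction cleared for maximum surplus can be sketched minimally: match the highest bids with the lowest asks while each pair still yields a positive gain. This is a generic illustration of the surplus-maximising objective, not the SX platform's actual clearing algorithms.

```python
# Generic sketch of surplus-maximising double-auction clearing; not the
# paper's implementation. Prices are per unit of storage.
def clear_double_auction(bids, asks):
    """Match highest bids with lowest asks while the bid exceeds the ask;
    this pairing maximises the total surplus (sum of bid - ask gains)."""
    bids, asks = sorted(bids, reverse=True), sorted(asks)
    trades, surplus = [], 0
    for bid, ask in zip(bids, asks):
        if bid < ask:          # no further profitable match possible
            break
        trades.append((bid, ask))
        surplus += bid - ask
    return trades, surplus

trades, surplus = clear_double_auction([10, 8, 6, 4], [3, 5, 7, 9])
```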

Martin Placek, Rajkumar Buyya

Vigne: Towards a Self-healing Grid Operating System

We consider building a Grid Operating System in order to relieve users and programmers from the burden of dealing with the highly distributed and volatile resources of computational grids. To tolerate the volatility of the nodes, the system should be self-healing, that is, continuously adapt to additions, removals, and failures of nodes. We present the self-healing architecture of the Vigne Grid Operating System through three of its services: system membership, application management, and volatile data management. The experimental results obtained show that our approach is feasible.

Louis Rilling

Problems for Resource Brokering in Large and Dynamic Grid Environments

Running workloads in a Grid environment may become a challenging problem when no appropriate means are available for resource brokering. Often, resources are provided under various administrative policies and agreements that must be known in order to make adequate scheduling decisions. Thus, providing suitable solutions for resource management is important if we want to cope with the increased scale and complexity of such distributed systems. In this paper we explore the key requirements a brokering infrastructure must meet in large and dynamic Grid environments and illustrate how these requirements are addressed by a specialized infrastructure, DI-GRUBER, a distributed usage service level agreement (uSLA) brokering service. The accuracy of the brokering infrastructure as a function of its connectivity, and the performance gains when a client scheduling policy is employed, are analyzed in detail. In addition, a performance comparison with a P2P-based distributed lookup service illustrates the differences between two technologies that address similar problems: Grids focus on federated resource sharing scenarios, while P2P systems focus on self-organizing distributed resource sharing in which most communication is symmetric.

Catalin L. Dumitrescu

Topic 7: Parallel Computer Architecture and Instruction Level Parallelism

We welcome you to the two Parallel Computer Architecture and Instruction Level Parallelism sessions of the Euro-Par 2006 conference, held in Dresden, Germany. The call for papers for this Euro-Par topic area sought papers on all hardware/software aspects of parallel computer architecture, processor architecture and microarchitecture. This year 12 papers were submitted to this topic area, of which 5 were accepted as full papers for the conference (a 41% acceptance rate).

Eduard Ayguadé, Wolfgang Karl, Koen De Bosschere, Jean-Francois Collard

Optimal Integrated VLIW Code Generation with Integer Linear Programming

We give an Integer Linear Programming (ILP) solution that fully integrates all steps of code generation,


. instruction selection, register allocation and instruction scheduling, on the basic block level for VLIW processors.

In earlier work, we contributed a dynamic programming (DP) based method for optimal integrated code generation, implemented in our retargetable code generator OPTIMIST. In this paper we give first results to evaluate and compare our ILP formulation with our DP method on a VLIW processor. We also demonstrate how to precondition the ILP model by a heuristic relaxation of the DP method to improve ILP optimization time.

Andrzej Bednarski, Christoph Kessler

Speeding-Up Synchronizations in DSM Multiprocessors

Synchronization in parallel programs is a major performance bottleneck. Shared data is protected by locks and a lot of time is spent in the competition arising at the lock hand-off. In this period of time, a large amount of traffic is targeted to the line holding the lock variable. In order to be serialized, the requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper we focus on systems whose coherence controllers buffer requests.

During lock hand-off only the requests from the winning processor contribute to the computation progress, because the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism named Request Bypass, which allows requests from the winning processor to bypass the requests buffered in the home coherence controller holding the lock line. The mechanism does not require compiler or programmer support, nor ISA or coherence protocol changes.

By simulating a 32-processor system we show that Request Bypass reduces execution time and lock stall time by up to 35% and 75%, respectively. The programs limited by synchronization benefit the most from Request Bypass.

A. de Dios, B. Sahelices, P. Ibáñez, V. Viñals, J. M. Llabería

Design and Effectiveness of Small-Sized Decoupled Dispatch Queues

Continuing demands for high degrees of Instruction Level Parallelism (ILP) require large dispatch queues in modern superscalar microprocessors. However, such large queues are inevitably accompanied by high circuit complexity which correspondingly limits the pipeline clock rates. This is due to the fact that most of today’s designs are based upon a centralized dispatch queue which depends on globally broadcasting operations to wake up and select the ready instructions. As an alternative to this conventional design, we propose the design of hierarchically distributed dispatch queues, based on the access/execute decoupled architecture model. Simulation results based on 14 data intensive benchmarks show that our DDQ (Decoupled Dispatch Queues) design achieves performance comparable to a superscalar machine with a large dispatch queue. We also show that our DDQ can be designed with small-sized, distributed dispatch queues which consequently can be implemented with low hardware complexity and high clock rates.

Won W. Ro, Jean-Luc Gaudiot

Sim-async: An Architectural Simulator for Asynchronous Processor Modeling Using Distribution Functions

In this paper we present sim-async, an architectural simulator able to model a 64-bit asynchronous superscalar microarchitecture. The aim of this tool is to serve designers in the study of different architectural proposals for asynchronous processors. Sim-async models the data-dependent timing of the processor modules by using distribution functions that describe the probability of a given delay being spent on a computation. This idea of characterizing the timing of the modules at the architectural level of abstraction using distribution functions is introduced for the first time in this work. In addition, sim-async models the delays of all the relevant hardware involved in the asynchronous communication between stages.

To tackle the development of sim-async we have modified the source code of SimpleScalar by substituting the simulator's core with our own execution engine, which provides the functionality of a parameterizable microarchitecture adapted to the Alpha ISA. The correctness of sim-async was checked by comparing the outputs of the SPEC2000 benchmarks with SimpleScalar executions, and the asynchronous behavior was successfully tested against a synchronous configuration of sim-async.
J. M. Colmenar, O. Garnica, J. Lanchares, J. I. Hidalgo, G. Miñana, S. Lopez

A Hybrid Hardware/Software Generated Prefetching Thread Mechanism on Chip Multiprocessors

This paper proposes a hybrid hardware/software generated prefetching thread mechanism on Chip Multiprocessors (CMPs). Two kinds of prefetching threads appear in our hybrid mechanism. Most threads are Dynamic Prefetching Threads, which are automatically generated, triggered, spawned and managed by hardware; the others are Static Prefetching Threads, targeting the critical delinquent loads which cannot be accurately or timely predicted by Dynamic Prefetching Threads. Static Prefetching Threads are statically generated by a binary-level optimization tool with the guidance of profiling information. Also, some aggressive thread construction policies are proposed. Furthermore, the necessary hardware infrastructure for CMPs supporting this hybrid mechanism is described. For a set of memory-limited benchmarks with complicated access patterns, an average speedup of 3.1% is achieved on a dual-core CMP when constructing basic hardware-generated prefetching threads, and this gain grows to 31% when adopting our hybrid mechanism.

Hou Rui, Longbing Zhang, Weiwu Hu

Topic 8: Distributed Systems and Algorithms

Parallel computing is strongly influenced by the challenges of distributed systems, such as the need for a Single System Image, resource sharing and allocation, failures and the need for fault tolerance, long latencies, network partition, disconnected operation, the demands of users wishing to solve ever more computationally and communication-demanding problems, and the opportunities created by grids and Web services. Distributed computing is now the computing mainstream; it is based on different forms of distributed systems: clusters, grids, peer-to-peer systems, web services, service-oriented architectures. This topic provides a forum for research and practice, of interest to both academia and industry, about distributed computing and distributed algorithms. Submissions were encouraged in all areas of distributed systems and algorithms relevant to parallel computing, with emphasis on the design and practice of distributed algorithms, the analysis of the behaviour of distributed systems and algorithms, distributed fault tolerance, distributed operating systems and databases, scalability, concurrency and performance in distributed systems, resource sharing and load balancing in distributed systems, distributed algorithms in telecommunications, distributed mobile computing, resource and service discovery, security in distributed systems, and standards and middleware for the distribution of parallel computations. Twenty papers were submitted to this topic. The subjects were varied, but a common theme of many of them is recovery, resource allocation, mutual exclusion, garbage collection and coordination. Other themes include load balancing, scheduling and consensus algorithms. Eight papers have been accepted.

Andrzej Goscinski, Gudula Rünger, Edgar Gabriel, Christine Morin

Distributed Approximation Allocation Resources Algorithm for Connecting Groups

This paper presents a distributed algorithm to allocate resources (links of a network) for interconnecting machines (forming a group) spread in a network. This is what we call a connection structure for this group of machines. An important innovative feature of our construction method is that we prove (not just simulate on particular and restricted cases) that this structure has good properties in terms of, e.g., induced distances (for latency considerations) and cost (for cost considerations). Hence, we propose a distributed multicriteria approximation algorithm.
Fabien Baille, Lelia Blin, Christian Laforest

Rollback-Recovery Protocol Guarantying MR Session Guarantee in Distributed Systems with Mobile Clients

This paper presents the rVsMR rollback-recovery protocol for distributed mobile systems, which guarantees the Monotonic Reads consistency model even in the case of server failures. The proposed protocol employs known rollback-recovery techniques; however, while applying them, the semantics of session guarantees is taken into account. Consequently, the rVsMR protocol is optimized with respect to the requirements of session guarantees. The paper includes a proof of the safety property of the presented protocol.

Jerzy Brzeziński, Anna Kobusińska, Michał Szychowiak

A Practical Single-Register Wait-Free Mutual Exclusion Algorithm on Asynchronous Networks

This paper is motivated by a practical need of asynchronous network systems: a wait-free distributed mutual exclusion algorithm. Such an algorithm is very appealing when a process runs on an asynchronous network system and its timing constraints are so restricted that the process cannot perform a local spin in a wait queue, which forces it to abort whenever it cannot access the critical region immediately. The algorithm proposed in this paper is devised to eliminate the need for processes to send messages to determine whether the critical region has been entered by another process, an unfavorable drawback of a naive transformation of shared-memory mutual exclusion algorithms to an asynchronous network model. This drawback leads to an unbounded message explosion, which is very critical in real network systems. The design of the algorithm is simple, and the algorithm is practical enough to be used in current distributed systems. The algorithm has O(1) message complexity, which is suboptimal, between two consecutive runs of the critical section.

Hyungsoo Jung, Heon Y. Yeom

Optimal and Practical WAB-Based Consensus Algorithms

In this paper we introduce two new WAB-based consensus algorithms for the crash-recovery model. The first one, B*-Consensus, is resilient to up to n/2 permanent faults and can solve consensus in three communication steps. R*-Consensus, our second algorithm, is n/3-resilient and can solve consensus in two communication steps. These algorithms are optimal with respect to the time complexity versus resilience tradeoff. We compare our algorithms to other consensus algorithms in the crash-recovery model.

Lasaro Camargos, Edmundo R. M. Madeira, Fernando Pedone

Self-stabilizing Deadlock Detection Under the OR Requirement Model

This article introduces a self-stabilizing deadlock-detection algorithm for the OR model. The algorithm is complete, because it detects all deadlocks, and it is correct, because it does not detect false deadlocks. Because of the self-stabilization property, the algorithm supports dynamic changes in the wait-for graph on which it works, as well as transient faults; moreover, it can be started in an arbitrary state. Previous deadlock-detection algorithms for the OR model are not guaranteed to recover from transient faults, nor can they be started in an arbitrary state. Once the algorithm terminates, each process knows whether or not it is deadlocked; moreover, deadlocked processes know whether they cause or only suffer from the deadlock.
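The OR-model setting can be illustrated with a small centralized sketch based on the classical characterization that a blocked process (which needs only one of its requested resources) is deadlocked iff it cannot reach any active process in the wait-for graph, and that the deadlocked processes in a knot cause the deadlock while the remaining ones only suffer from it. The self-stabilizing, distributed aspects of the paper's algorithm are not modeled here.

```python
# Centralized sketch of OR-model deadlock detection; not the paper's
# self-stabilizing distributed algorithm. Edges mean "waits for any of".
def reachable(graph, start):
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        for w in graph.get(v, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def analyze(graph):
    """Return (deadlocked, in_knot): in_knot processes cause the deadlock,
    the remaining deadlocked ones only suffer from it."""
    nodes = set(graph) | {w for vs in graph.values() for w in vs}
    active = {v for v in nodes if not graph.get(v)}   # not waiting at all
    deadlocked = {v for v in nodes
                  if graph.get(v) and not (reachable(graph, v) & active)}
    in_knot = {v for v in deadlocked
               if all(v in reachable(graph, w) for w in reachable(graph, v))}
    return deadlocked, in_knot

# p waits for q; q and r wait for each other (a knot); s is active.
wfg = {"p": ["q"], "q": ["r"], "r": ["q"], "s": []}
deadlocked, in_knot = analyze(wfg)
```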

Christian F. Orellana, Cristian Ruz, S. Yadran Eterovic

Incremental Distributed Garbage Collection Using Reverse Reference Tracking

Most modern middleware systems like Java Beans and .NET provide automatic garbage collection (GC). In spite of the many distributed solutions proposed in the literature, collection is typically limited to a single node, and simple leasing techniques are used for remote references. In this paper we present a new incremental multistage GC. It has been implemented in the Plurix operating system but might easily be applied to other platforms. The scheme works incrementally and avoids blocking remote nodes. The reverse reference tracking scheme efficiently detects acyclic garbage and is also used for finding cyclic garbage without precomputing a global root set. To minimize network communication, cycle detection is split into a local and a global detection part. Keeping the object markers in a separate stack avoids invalidation of replicated objects. Performance measurements show that the proposed distributed GC scheme scales very nicely.
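A toy model of the reverse-reference idea: each object records its referrers, so acyclic garbage shows up as objects with no referrers, while a separate reachability pass catches cycles. Names are illustrative assumptions; the incremental, distributed aspects of the Plurix scheme are not modeled.

```python
# Toy sketch of reverse reference tracking; the incremental and
# distributed parts of the Plurix scheme are deliberately omitted.
class Obj:
    def __init__(self, name):
        self.name, self.refs, self.referrers = name, set(), set()
    def point_to(self, other):
        self.refs.add(other)
        other.referrers.add(self)   # the reverse reference

def acyclic_garbage(objs, roots):
    """No referrers and not a root: dead, without any global scan."""
    return {o for o in objs if o not in roots and not o.referrers}

def cyclic_garbage(objs, roots):
    """Reachability pass from the roots catches unreachable cycles too."""
    live, stack = set(roots), list(roots)
    while stack:
        o = stack.pop()
        for r in o.refs:
            if r not in live:
                live.add(r)
                stack.append(r)
    return set(objs) - live

root, a, b, c, d = (Obj(n) for n in "rabcd")
root.point_to(a)                 # a is live
b.point_to(c); c.point_to(b)     # unreachable cycle: invisible to referrer counts
garbage1 = acyclic_garbage({root, a, b, c, d}, {root})
garbage2 = cyclic_garbage({root, a, b, c, d}, {root})
```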

M. Schoettner, R. Goeckelmann, S. Frenz, M. Fakler, P. Schulthess

Run-Time Switching Between Total Order Algorithms

Total order broadcast protocols are a fundamental building block in the construction of many fault-tolerant distributed applications. Unfortunately, total order is an intrinsically expensive operation. Moreover, certain algorithms perform better in specific scenarios and under given network properties. This paper proposes and evaluates an adaptive protocol that is able to dynamically switch between different total order algorithms. The protocol achieves the best possible performance by selecting, at each moment, the algorithm that is most appropriate to the present network conditions. Experimental results show that, using our protocol, adaptation can be achieved with negligible interference with the data flow.
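The switching idea can be simulated minimally: the switch marker is itself totally ordered by the old algorithm, so every replica changes over at the same point in the delivery sequence. The two toy ordering algorithms below are placeholders, not the protocols evaluated in the paper.

```python
# Minimal simulation of run-time switching between two total-order
# algorithms; both "algorithms" are deliberately trivial placeholders.
class Sequencer:                 # algorithm A: central sequence numbers
    def __init__(self):
        self.seq = 0
    def order(self, msgs):
        out = []
        for m in msgs:
            out.append((self.seq, m))
            self.seq += 1
        return [m for _, m in out]

class SenderOrder:               # algorithm B: deterministic sort
    def order(self, msgs):
        return sorted(msgs)

def deliver_all(phase_a_msgs, phase_b_msgs, replicas=3):
    a, b = Sequencer(), SenderOrder()
    ordered = a.order(phase_a_msgs + ["SWITCH"])   # marker ordered by A
    prefix = ordered[:ordered.index("SWITCH")]     # delivered before switch
    suffix = b.order(phase_b_msgs)                 # delivered after switch
    # Every replica sees the marker at the same position, hence the
    # same total order across the switch.
    return [prefix + suffix for _ in range(replicas)]

logs = deliver_all(["m1", "m2"], ["z9", "a1"])
```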

José Mocito, Luís Rodrigues

On Greedy Graph Coloring in the Distributed Model

In the paper we consider distributed algorithms for greedy graph coloring. For the largest-first (LF) approach, we propose a new distributed algorithm which is shown to color a graph in an expected time of O(Δ log n log Δ) rounds, and we prove that any distributed LF-coloring algorithm requires at least Ω(Δ) rounds. We discuss the quality of the obtained colorings in the general case and for particular graph classes. Finally, we show that other greedy graph coloring approaches, such as smallest-last (SL) or dynamic-saturation (SLF), are not suitable for application in distributed computing, requiring Ω(n) rounds.
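For reference, the sequential largest-first greedy coloring that the distributed LF approach emulates can be sketched as follows; this is the greedy baseline only, not the paper's distributed protocol.

```python
# Sequential largest-first (LF) greedy coloring: visit vertices in order
# of decreasing degree, giving each the smallest color unused by its
# already-colored neighbors. Sketch of the baseline, not the protocol.
def lf_greedy_coloring(graph):
    order = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
    color = {}
    for v in order:
        used = {color[w] for w in graph[v] if w in color}
        color[v] = next(c for c in range(len(graph)) if c not in used)
    return color

# Wheel W5: a 5-cycle plus a hub adjacent to every cycle vertex.
graph = {
    "hub": ["a", "b", "c", "d", "e"],
    "a": ["hub", "b", "e"], "b": ["hub", "a", "c"],
    "c": ["hub", "b", "d"], "d": ["hub", "c", "e"],
    "e": ["hub", "d", "a"],
}
colors = lf_greedy_coloring(graph)
```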

Adrian Kosowski, Łukasz Kuszner

Topic 9: Parallel Programming: Models, Methods and Languages

This topic provides a forum for the presentation of research results and practical experience in the development of parallel programs. Advances in algorithmic and programming models, design methods, languages, and interfaces are needed to produce correct, portable parallel software with predictable performance on different parallel and distributed architectures.

The topic emphasizes results that improve the process of developing high-performance programs, including high-integrity programs that are scalable with both problem size and complexity. Of particular interest are novel techniques by which parallel software can be assembled from reusable parallel components without compromising efficiency. Related to this is the need for parallel software to adapt, both to available resources and to the problem being solved.

José C. Cunha, Sergei Gorlatch, Daniel Quinlan, Peter H. Welch

Surrounding Theorem: Developing Parallel Programs for Matrix-Convolutions

Computations on two-dimensional arrays such as matrices and images are among the most fundamental and ubiquitous tasks in computational science and its vast application areas, but the development of efficient parallel programs on two-dimensional arrays is known to be hard. To solve this problem, we have proposed a skeletal framework on two-dimensional arrays based on the theory of constructive algorithmics. It enables users, even those with little knowledge of parallel machines, to systematically develop both correct and efficient parallel programs on two-dimensional arrays. In this paper, we apply our framework to the matrix-convolutions often used in image filters and difference methods. We show the efficacy of the framework by giving a general parallel program for the matrix-convolutions described with the skeletons, and a theorem that optimizes the general program into an application-specific one.
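A matrix convolution of the kind targeted by the framework can be written as a pointwise sum over a shifted neighborhood; this plain sketch shows the computation the skeletons capture, not the paper's skeletal program.

```python
# Plain matrix convolution: each output cell is the kernel-weighted sum
# of its neighborhood, with out-of-bounds cells treated as `pad`.
# Illustrative sketch only; the paper's skeletons are not reproduced.
def convolve(image, kernel, pad=0):
    n, m = len(image), len(image[0])
    kn, km = len(kernel), len(kernel[0])
    cn, cm = kn // 2, km // 2
    def px(i, j):
        return image[i][j] if 0 <= i < n and 0 <= j < m else pad
    return [[sum(kernel[a][b] * px(i + a - cn, j + b - cm)
                 for a in range(kn) for b in range(km))
             for j in range(m)] for i in range(n)]

box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]   # unnormalized box filter
image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
result = convolve(image, box)
```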

Kento Emoto, Kiminori Matsuzaki, Zhenjiang Hu, Masato Takeichi
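The kind of computation the skeleton targets can be illustrated by a plain sequential 3×3 matrix-convolution; the function name and kernel below are illustrative only and are not the paper's skeleton API.

```python
def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to the interior of a 2D list-of-lists.

    Boundary cells are left at zero; a skeletal framework would
    parallelize the loop nest over blocks of the array.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):          # interior rows
        for j in range(1, w - 1):      # interior columns
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += kernel[di + 1][dj + 1] * image[i + di][j + dj]
            out[i][j] = acc
    return out
```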

Dynamic Task Generation and Transformation Within a Nestable Workpool Skeleton

Within a classical workpool skeleton, a master process employs a set of worker processes to solve tasks contained in a task pool. In contrast to the usual statically fixed task set, some applications generate tasks dynamically. In addition, the need for dynamic task pool transformation often arises, for example to combine newly generated partial tasks into full tasks. We present an extended workpool skeleton for a parallel Haskell dialect which provides both features and employs careful stream-processing and a termination detection mechanism. We also show how to nest the skeleton to alleviate the bottleneck that a single master presents. Furthermore, we demonstrate its efficiency by its fruitful use in the parallelisation of a DNA sequence alignment algorithm.

S. Priebe
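The contract of a workpool with dynamic task generation can be sketched sequentially: processing a task may return new tasks, which re-enter the pool, and the pool terminates once it drains. This is an illustrative simulation only; the paper's skeleton runs workers in parallel under a master and additionally supports pool transformation, elided here.

```python
from collections import deque

def workpool(initial_tasks, process):
    """Run tasks until the pool drains.

    process(task) must return (result, new_tasks); dynamically generated
    new_tasks go back into the pool, as in the extended skeleton.
    """
    pool = deque(initial_tasks)
    results = []
    while pool:                 # termination: pool empty, nothing in flight
        task = pool.popleft()
        result, new_tasks = process(task)
        results.append(result)
        pool.extend(new_tasks)  # dynamic task generation
    return results
```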

Data Parallel Iterators for Hierarchical Grid and Tree Algorithms

The data parallel programming language construct of a “foreach” loop is proposed in the context of hierarchically nested arrays and unbalanced k-ary trees used in high performance applications. In order to perform an initial evaluation, an implementation of an automatic parallelization system for C++ programs is introduced, which consists of a preprocessor and a matching library for distributed memory, shared memory and mixed model parallelism. For a full compile time dependence analysis and a tight distributed memory parallelization, some additional application knowledge about alignment of arrays or indirect data access can be put into the data declarations of the application’s code. Results for a multigrid and a fast multipole benchmark code illustrate the concept.

Gerhard Zumbusch

Implementing Irregular Parallel Algorithms with OpenMP

Writing irregular parallel algorithms with OpenMP has rarely been practised in the past. Yet it is possible, and in this paper we use a simple breadth-first search application as an example to expose a stepping stone and a deficiency of the OpenMP specification: it is very difficult to cancel the threads in a parallel region. A way to work around the issue within the existing OpenMP specification is sketched, while our main contribution is a proposal for an extension of OpenMP that solves the issue in an easier way.

Michael Süß, Claudia Leopold
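The workaround the abstract alludes to is cooperative: since threads in a parallel region cannot be cancelled, each worker polls a shared flag and returns voluntarily. A minimal sketch of that pattern, transposed from OpenMP to Python threads (function names are illustrative, not the paper's code):

```python
import threading

def parallel_search(data, target, num_threads=4):
    """Search chunks of data in parallel; workers poll a shared flag,
    emulating the cooperative 'cancellation' workaround: no thread is
    killed, each checks the flag and returns early once a hit is found."""
    found = threading.Event()
    hits = []
    def worker(chunk):
        for x in chunk:
            if found.is_set():      # cooperative cancellation point
                return
            if x == target:
                hits.append(x)
                found.set()         # ask the other workers to stop
    n = max(1, len(data) // num_threads)
    threads = [threading.Thread(target=worker, args=(data[i:i + n],))
               for i in range(0, len(data), n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bool(hits)
```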

Toward Enhancing OpenMP’s Work-Sharing Directives

OpenMP provides a portable programming interface for shared memory parallel computers (SMPs). Although this interface has proven successful for small SMPs, it requires greater flexibility in light of the steadily growing size of individual SMPs and the recent advent of multithreaded chips. In this paper, we describe two application development experiences that exposed these expressivity problems in the current OpenMP specification. We then propose mechanisms to overcome these limitations, including thread subteams and thread topologies. Thus, we identify language features that improve OpenMP application performance on emerging and large-scale platforms while preserving ease of programming.

Barbara M. Chapman, Lei Huang, Haoqiang Jin, Gabriele Jost, Bronis R. de Supinski

Toward a Definition of and Linguistic Support for Partial Quiescence

The global quiescence of a distributed computation (or distributed termination detection) is an important problem. Some concurrent programming languages and systems provide global quiescence detection as a built-in feature so that programmers do not need to write special synchronization code to detect quiescence. This paper introduces partial quiescence (PQ), which generalizes quiescence detection to a specified part of a distributed computation. Partial quiescence is useful, for example, when two independent concurrent computations that both rely on global quiescence need to be combined into a single program. The paper describes how we have designed and implemented a PQ mechanism within an experimental version of the JR concurrent programming language. Our early results are promising qualitatively and quantitatively.

Billy Yan-Kit Man, Hiu Ning (Angela) Chan, Andrew J. Gallagher, Appu S. Goundan, Aaron W. Keen, Ronald A. Olsson

Tying Memory Management to Parallel Programming Models

Stand-alone threading libraries lack sophisticated memory management techniques. In this paper, we present a methodology that allows threading libraries that implement non-preemptive parallel programming models to reduce their memory requirements, based on the properties of those models. We applied the methodology to NthLib, which is an implementation of the Nano-Threads programming model, and evaluated it on an Intel based multiprocessor system with HyperThreading and on the SMTSIM simulator. Our results indicate that not only do memory requirements drop drastically, but execution time also improves compared to the original implementation. This allows more fine-grained parallel tasks, and larger numbers of them, to be created.

Ioannis E. Venetis, Theodore S. Papatheodorou

Topic 10: Parallel Numerical Algorithms

Since the early days of supercomputing, numerical routines have generated the highest demand for computing power anywhere, making their efficient parallelization one of the core methodological tasks in high-performance computing. Still, many of today’s fastest computers in the world are mostly used for the solution of huge systems of equations as they arise in the simulation of complex large-scale problems in engineering and science.

Michel Cosnard, Hans-Joachim Bungartz, Efstratios Gallopoulos, Yousef Saad

Parallel LOD Scheme for 3D Parabolic Problem with Nonlocal Boundary Condition

A parallel LOD algorithm for solving a 3D problem with a nonlocal boundary condition is considered. The algorithm is implemented using a parallel array object tool, so that the parallel algorithm follows semi-automatically from the serial one. Results of computational experiments are presented.

Raimondas Čiegis

Online Checkpointing for Parallel Adjoint Computation in PDEs: Application to Goal-Oriented Adaptivity and Flow Control

The computation of derivatives for the optimization of time-dependent flow problems is based on the integration of the adjoint differential equation. For this purpose, the knowledge of the complete forward solution is required. Similar information is needed for a posteriori error estimation with respect to a given functional. In the area of flow control, especially for three dimensional problems, it is usually impossible to store the full forward solution due to the lack of memory capacities. Additionally, adaptive time-stepping procedures are needed for efficient integration schemes in time. Therefore, standard optimal offline checkpointing strategies are usually not well-suited in that framework.

We present a new online procedure for determining the checkpoint distribution on the fly. Complexity estimates and consequences for storing and retrieving the checkpoints using parallel I/O are discussed. The resulting checkpointing approach is integrated in HiFlow, a multipurpose parallel finite-element package with a strong emphasis on computational fluid dynamics, reactive flows and related subjects. Using an adjoint-based error control for prototypical three dimensional flow problems, numerical experiments demonstrate the effectiveness of the proposed approach.

Vincent Heuveline, Andrea Walther
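The trade-off behind checkpointing can be seen in a deliberately simple static variant: store every k-th forward state, then recover any intermediate state for the adjoint (reverse) sweep by recomputation from the nearest checkpoint. This sketch assumes a fixed stride, unlike the paper's online strategy, which places checkpoints adaptively; all names are illustrative.

```python
def forward(u0, step, nsteps, k):
    """Run the forward sweep, storing a checkpoint every k steps."""
    checkpoints = {0: u0}
    u = u0
    for i in range(1, nsteps + 1):
        u = step(u)
        if i % k == 0:
            checkpoints[i] = u
    return u, checkpoints

def state_at(i, checkpoints, step):
    """Recover forward state i from the nearest earlier checkpoint by
    recomputation, as the adjoint integration would on its way back."""
    base = max(j for j in checkpoints if j <= i)
    u = checkpoints[base]
    for _ in range(i - base):
        u = step(u)
    return u
```

Memory drops from nsteps states to nsteps/k checkpoints, at the price of at most k-1 recomputed steps per state request.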

Parallel Fault Tolerant Algorithms for Parabolic Problems

With the increasing number of processors available in today’s high performance computing systems, the mean time between failures of these machines is decreasing. The ability of hardware and software components to handle process failures is therefore becoming increasingly important. The objective of this paper is to present a fault tolerant approach for the implicit forward time integration of parabolic problems using explicit formulas. This technique allows the application to recover from process failures and to reconstruct the lost data of the failed process(es), avoiding the roll-back operation required in most checkpoint-restart schemes. The benchmark used to highlight the new algorithms is the two dimensional heat equation solved with a first order implicit Euler scheme.

Hatem Ltaief, Marc Garbey, Edgar Gabriel
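The explicit formulas mentioned above rest on the standard finite-difference update for the heat equation. As a minimal illustration (1D and explicit, whereas the paper's benchmark is the 2D implicit Euler scheme), one time step with fixed boundary values looks as follows; `nu` stands for the ratio dt/dx² times the diffusivity:

```python
def heat_step(u, nu):
    """One explicit Euler step for the 1D heat equation on a uniform grid
    with fixed (Dirichlet) boundary values; nu = diffusivity * dt / dx**2."""
    return [u[0]] + [
        u[i] + nu * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]
```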

Parallel Solution of Large-Scale and Sparse Generalized Algebraic Riccati Equations

We discuss a parallel algorithm for the solution of large-scale generalized algebraic Riccati equations of large dimension. We survey the numerical algorithms underlying the implementation of the method, in particular, a Newton-type iterative solver for the generalized Riccati equation and an LR-ADI solver for the generalized Lyapunov equation. Experimental results on a cluster of Intel Xeon processors illustrate the benefits of our approach.

José M. Badía, Peter Benner, Rafael Mayo, Enrique S. Quintana-Ortí

Applicability of Load Balancing Strategies to Data-Parallel Embedded Runge-Kutta Integrators

Embedded Runge-Kutta methods are among the most popular methods for the solution of non-stiff initial value problems of ordinary differential equations (ODEs). We investigate the use of load balancing strategies in a data-parallel implementation of embedded Runge-Kutta integrators. Since the parallelism contained in the function evaluation of the ODE system is typically very fine-grained, our aim is to find out whether the employment of load balancing strategies can be profitable in spite of the additional overhead they involve.

Matthias Korch, Thomas Rauber
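An embedded Runge-Kutta method advances the solution with a higher-order formula while reusing the same stage evaluations for a lower-order one, whose difference yields an error estimate for step-size control. A textbook Heun(2)/Euler(1) pair sketches the idea (the paper's integrators and load balancing are not reproduced here):

```python
def heun_euler_step(f, t, y, h):
    """One step of the embedded Heun(2)/Euler(1) pair for y' = f(t, y).

    Returns the second-order solution and an error estimate obtained
    for free from the embedded first-order solution.
    """
    k1 = f(t, y)                        # stage 1, shared by both formulas
    k2 = f(t + h, y + h * k1)           # stage 2
    y_high = y + h * (k1 + k2) / 2.0    # Heun's method, order 2
    y_low = y + h * k1                  # explicit Euler, order 1
    return y_high, abs(y_high - y_low)
```

In a data-parallel implementation, the function evaluations f(t, y) over the ODE system's components are the fine-grained work the load balancing strategies target.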

A Software Framework for the Portable Parallelization of Particle-Mesh Simulations

We present a software framework for the transparent and portable parallelization of simulations using particle-mesh methods. Particles are used to transport physical properties and a mesh is required in order to reinitialize the distorted particle locations, ensuring the convergence of the method. Field quantities are computed on the particles using fast multipole methods or by discretizing and solving the governing equations on the mesh. This combination of meshes and particles presents a challenging set of parallelization issues. The present library addresses these issues for a wide range of applications, and it enables orders of magnitude increase in the number of computational elements employed in particle methods. We demonstrate the performance and scalability of the library on several problems, including the first-ever billion particle simulation of diffusion in real biological cell geometries.

I. F. Sbalzarini, J. H. Walther, B. Polasek, P. Chatelain, M. Bergdorf, S. E. Hieber, E. M. Kotsalis, P. Koumoutsakos

Parallelization of a Discrete Radiosity Method

In this paper we present three different parallelizations of a discrete radiosity method achieved on a cluster of workstations. This radiosity method has lower complexity than most radiosity algorithms and is based on the discretization of surfaces into voxels rather than patches. The first two parallelizations distribute the tasks; they achieve good performance in time but do not distribute the data. The third parallelization distributes the voxels and requires the transmission of a small part of the voxels between machines. It improves time while distributing the data.

Rita Zrour, Pierre Chatelier, Fabien Feschet, Rémy Malgouyres

Parallelising Matrix Operations on Clusters for an Optimal Control-Based Quantum Compiler

Quantum control plays a key role in quantum technology, e.g. for steering quantum hardware systems, spectrometers or superconducting solid-state devices. In terms of computation, quantum systems provide a unique potential for coherent parallelisation that may exponentially speed up algorithms, as in Shor’s prime factorisation. Translating quantum software into a sequence of classical controls steering the quantum hardware, i.e. the quantum compilation task, lends itself to be tackled by optimal control. It is computationally demanding, since the classical resources needed grow exponentially with the size of the quantum system. Here we show concepts of parallelisation tailored to run on high-end computer clusters speeding up matrix multiplication, exponentials, and trace evaluations used in numerical quantum control. In systems of 10 spin qubits, the time gain is beyond a factor of 500 on a 128-CPU cluster as compared to standard techniques on a single CPU.
T. Gradl, A. Spörl, T. Huckle, S. J. Glaser, T. Schulte-Herbrüggen

Topic 11: Distributed and High-Performance Multimedia

It is increasingly common for information – whether it be scientific, industrial, or otherwise – to be composed of multiple media items, e.g., video-based, image-based, linguistic, or auditory items. As digital video may produce over 100 Mbytes of data per second, and image sets routinely require Terabytes of storage space, traditional resource management techniques in both end-systems and networks are rapidly becoming bottlenecks in the handling of such information. Moreover, in emerging multimedia applications, the generation, processing, storage, indexing, querying, retrieval, delivery, shielding, and visualization of multimedia content are fundamentally intertwined processes, all taking place at the same time and – potentially – in different administrative domains.

Geoff Coulson, Harald Kosch, Odej Kao, Frank Seinstra

Supporting Reconfigurable Parallel Multimedia Applications

Programming multimedia applications for System-on-Chip (SoC) architectures is difficult because streaming communication, user event handling, reconfiguration, and parallelism have to be dealt with. We present Hinch, a runtime system for multimedia applications, that efficiently exploits parallelism by running the application in a dataflow style. The application has to be implemented as components that communicate using streams. Reconfigurability is supported by a generic component interface. Measurements have been performed on a SpaceCake SoC architecture simulator. Hinch can easily be ported to other shared-memory architectures.

Maik Nijhuis, Herbert Bos, Henri E. Bal

Providing VCR in a Distributed Client Collaborative Multicast Video Delivery Scheme

In order to design a highly scalable video delivery technology for VoD systems, two representative solutions have been developed: multicast and P2P. Each has limitations when it comes to implementing VCR interactions to offer true-VoD services. With multicast delivery schemes, part of the system resources has to be exclusively allocated to implement VCR operations, so the initial VoD system performance is considerably reduced. P2P technology is able to decentralize the video delivery process among all the clients. However, P2P solutions target video streaming systems on the Internet and do not implement VCR interactivity; they are therefore not suitable for true-VoD systems. In this paper, we propose the design of VCR mechanisms for a P2P multicast delivery scheme. The new mechanisms coordinate all the clients to implement the VCR operations using multicast communications. We compared our design with previous schemes, and the results show that our approach is able to reduce the resource requirements by up to 16%.

X. Y. Yang, P. Hernández, F. Cores, A. Ripoll, R. Suppi, E. Luque

Linear Hashtable Motion Estimation Algorithm for Distributed Video Processing

This paper presents a parallel Linear Hashtable Motion Estimation Algorithm (LHMEA). Most parallel video compression algorithms focus on the Group of Pictures (GOP). Based on the LHMEA we proposed earlier [1][2], we developed a parallel motion estimation algorithm that focuses on parallelism inside a frame. We divide each reference frame into equally sized regions, which are processed in parallel to increase the encoding speed significantly. The theoretical and practical speedups of parallel LHMEA for different numbers of PCs in the cluster are compared and discussed. Motion Vectors (MV) are generated in the first-pass LHMEA and used as predictors for the second-pass Hexagonal Search (HEXBS) motion estimation, which only searches a small number of Macroblocks (MBs). We evaluated a distributed parallel implementation of the LHMEA part of the TPA for real-time video compression.

Yunsong Wu, Graham Megson

Topic 12: Theory and Algorithms for Parallel Computation

Parallelism exists at all levels in computing systems, from circuits to grids. Effective use of parallelism relies crucially on the availability of suitable models of computation for algorithm design and analysis, and on efficient strategies for the solution of key computational problems on prominent classes of platforms. The study of foundational and algorithmic issues has led to many important advances in parallel computing and has been well represented in the Euro-Par community over the past two decades. A distinctive feature of this topic is the variety of results it has reported over the years, addressing classical problems as well as the new challenges posed by emerging computing paradigms. This year was no different.

Danny Krizanc, Michael Kaufmann, Pierre Fraigniaud, Christos Zaroliagis

A Hierarchical CLH Queue Lock

Modern multiprocessor architectures such as CC-NUMA machines or CMPs have nonuniform communication architectures that render programs sensitive to memory access locality. A recent paper by Radović and Hagersten shows that performance gains can be obtained by developing general-purpose mutual-exclusion locks that encourage threads with high mutual memory locality to acquire the lock consecutively, thus reducing the overall cost due to cache misses. Radović and Hagersten present the first such hierarchical locks. Unfortunately, their locks are backoff locks, which are known to incur higher cache miss rates than queue-based locks, suffer from various fundamental fairness issues, and are hard to tune so as to maximize locality of lock accesses.

Extending queue-locking algorithms to be hierarchical requires that requests from threads with high mutual memory locality be consecutive in the queue. Until now, it was not clear that one could design such locks because collecting requests locally and moving them into a global queue seemingly requires a level of coordination whose cost would defeat the very purpose of hierarchical locking.

This paper presents a hierarchical version of the Craig, Landin, and Hagersten CLH queue lock, which we call the HCLH queue lock. In this algorithm, threads build implicit local queues of waiting threads, splicing them into a global queue at the cost of only a single CAS operation.

In a set of microbenchmarks run on a large scale multiprocessor machine and a state-of-the-art multi-threaded multi-core chip, the HCLH algorithm exhibits better performance and significantly better fairness than the hierarchical backoff locks of Radović and Hagersten.

Victor Luchangco, Dan Nussbaum, Nir Shavit
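The flat CLH lock that HCLH extends can be sketched compactly: each acquiring thread enqueues a fresh node and spins only on its predecessor's flag. The sketch below is illustrative, not the paper's hierarchical algorithm: the atomic exchange on the tail is emulated with a small guard mutex, and the spin yields the interpreter instead of busy-waiting on a cache line.

```python
import threading
import time

class CLHLock:
    """Sketch of the flat CLH queue lock (the basis that HCLH extends)."""

    class Node:
        __slots__ = ("locked",)
        def __init__(self):
            self.locked = True

    def __init__(self):
        self.tail = CLHLock.Node()
        self.tail.locked = False        # dummy node: lock starts free
        self._swap = threading.Lock()   # emulates an atomic exchange
        self._local = threading.local()

    def acquire(self):
        node = CLHLock.Node()
        with self._swap:                # atomically: pred, tail = tail, node
            pred, self.tail = self.tail, node
        while pred.locked:              # wait on the predecessor only
            time.sleep(0)               # yield; real CLH spins on a local line
        self._local.node = node

    def release(self):
        self._local.node.locked = False  # hand the lock to our successor
```

Because each thread waits on a different node, handoff is FIFO-fair; HCLH additionally groups waiters by locality before splicing them into the global queue.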

Competitive Freshness Algorithms for Wait-Free Data Objects

Wait-free concurrent data objects are widely used in multiprocessor systems and real-time systems. Their popularity results from the fact that they avoid locking and that concurrent operations on such data objects are guaranteed to finish in a bounded number of steps regardless of interference from other operations. The data objects allow high access parallelism and guarantee correctness of the concurrent access with respect to its semantics. In such a highly-concurrent environment, where many wait-free write-operations updating the object state can overlap a single read-operation, the age/freshness of the state returned by this read-operation is a significant measure of the object quality, especially for real-time systems.

In this paper, we first propose a freshness measure for wait-free concurrent data objects. Subsequently, we model the freshness problem as an online problem and present two algorithms for it. The first is a deterministic algorithm with an asymptotically optimal competitive ratio depending on a parameter α, where α is a function of the execution-time upper bound of wait-free operations. The second is a competitive randomized algorithm with competitive ratio $\frac{\ln \alpha}{1 + \ln 2 - \frac{2}{\sqrt{\alpha}}}$.
Peter Damaschke, Phuong Hoai Ha, Philippas Tsigas

A Parallel Algorithm for the Two-Dimensional Cutting Stock Problem

Cutting Stock Problems (CSP) arise in many production industries where large stock sheets must be cut into smaller pieces. We present a parallel algorithm, based on the Viswanathan and Bagchi (VB) algorithm, for solving the Two-Dimensional Cutting Stock Problem (2DCSP). The algorithm guarantees the processing of best nodes first and does not introduce any redundant combinations other than those already present in the sequential version. The improvement is orthogonal to any other sequential improvements. Computational results of an OpenMP implementation confirm the optimality of the algorithm. We also produce a new syntax-based reformulation of the 2DCSP which leads to a concise representation of the solutions, and introduce a highly efficient data structure to store subproblems.

Luis García, Coromoto León, Gara Miranda, Casiano Rodríguez

A BSP/CGM Algorithm for Finding All Maximal Contiguous Subsequences of a Sequence of Numbers

Given a sequence A of real numbers, we wish to find a list of all non-overlapping contiguous subsequences of A that are maximal. A maximal subsequence M has the property that no proper subsequence of M has a greater sum of values. Furthermore, M may not be contained properly within any subsequence of A with this property. This problem can be solved sequentially in linear time. We present a BSP/CGM algorithm that uses p processors and takes O(n/p) time and O(n/p) space per processor. The algorithm uses a constant number of communication rounds of size at most O(n/p). Thus the algorithm achieves linear speed-up and is highly scalable.

Carlos Eduardo Rodrigues Alves, Edson Norberto Cáceres, Siang Wun Song
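The single-solution special case of this problem, finding one maximum-sum contiguous subsequence, has a well-known linear-time scan (Kadane's algorithm), sketched below for orientation; the paper's algorithm parallelizes the harder all-maximal variant, which this sketch does not cover.

```python
def max_subsequence(a):
    """Kadane's linear-time scan: return (best_sum, (start, end)) for one
    maximum-sum contiguous subsequence of the non-empty list a."""
    best, best_range = a[0], (0, 0)
    cur, cur_start = a[0], 0
    for i in range(1, len(a)):
        if cur < 0:                 # a negative running prefix never helps
            cur, cur_start = a[i], i
        else:
            cur += a[i]
        if cur > best:
            best, best_range = cur, (cur_start, i)
    return best, best_range
```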

On-Line Adaptive Parallel Prefix Computation

We consider parallel prefix computation on processors of different and possibly changing speeds. Extending previous works on identical processors, we provide a lower bound for this problem. We introduce a new adaptive algorithm which is based on the on-line recursive coupling of an optimal sequential algorithm and a parallel one, non-optimal but recursive and fine-grain. The coupling relies on a work-stealing scheduler. Its theoretical performance, analysed for processors of different and changing speeds, is close to the lower bound both on identical processors and on processors of changing speeds. Experiments performed on an eight-processor machine confirm this theoretical result.

Jean-Louis Roch, Daouda Traoré, Julien Bernard
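For orientation, the classic non-adaptive two-pass parallel prefix pattern is sketched below as a sequential simulation: each of p "processors" scans its own block, block totals are combined into offsets, and the blocks are then fixed up independently. The paper's contribution, the adaptive coupling under work-stealing, is not reproduced here.

```python
def block_prefix_sums(a, p):
    """Two-pass block-parallel inclusive prefix sums, simulated sequentially.

    Pass 1: each block is scanned independently (one block per processor).
    Pass 2: each block's elements get the offset of all preceding blocks.
    """
    n = len(a)
    size = -(-n // p)                        # ceil(n / p) elements per block
    blocks = [a[i:i + size] for i in range(0, n, size)]
    local = []
    for b in blocks:                         # pass 1: independent local scans
        s, out = 0, []
        for x in b:
            s += x
            out.append(s)
        local.append(out)
    offset, result = 0, []
    for out in local:                        # pass 2: apply block offsets
        result.extend(x + offset for x in out)
        offset += out[-1]
    return result
```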

Topic 13: Routing and Communication in Interconnection Networks

Welcome to the Euro-Par 2006 Topic 13 on Routing and Communication in Interconnection Networks. Interconnection networks are a key area in the quest for performance in parallel and distributed computers, and this topic is dedicated to techniques that improve the state of the art in interconnecting parallel computers, workstations or clusters.

Jose A. Gregorio, Bettina Schnor, Angelos Bilas, Olav Lysne

A Model for the Development of AS Fabric Management Protocols

Advanced Switching (AS) is a switching fabric architecture based on the PCI Express technology. In order to support high availability, AS includes important features, such as device hot addition and removal, redundant pathways, and fabric management failover. This work presents an AS model developed in OPNET. The contribution of this tool is that it can help researchers to design and evaluate management mechanisms for this new technology. It can also be used to analyze other key aspects of the architecture, such as routing, congestion, and quality of service.

Antonio Robles-Gómez, Eva M. García, Aurelio Bermúdez, Rafael Casado, Francisco J. Quiles

On the Influence of the Selection Function on the Performance of Fat-Trees

Fat-tree topology has become very popular among switch manufacturers. Routing in fat-trees is composed of two phases, an adaptive upwards phase and a deterministic downwards phase. The unique downwards path to the destination depends on the switch that has been reached in the upwards phase. As adaptive routing is used in the ascending phase, several output ports are possible at each switch, and the final choice depends on the selection function. The impact of the selection function on performance has been studied previously for direct networks, where it did not prove to be very important. In fat-trees, the decisions made by the selection function in the upwards phase can be critical, since they determine the switch reached, and therefore the unique downwards path to the destination. In this paper, we analyze the effect of the selection function on fat-trees. Several selection functions are defined, compared and evaluated. The evaluation shows that the selection function has a great impact on fat-tree performance.

F. Gilabert, M. E. Gómez, P. López, J. Duato

Scalable Ethernet Clos-Switches

The scalability of cluster computers utilizing Gigabit Ethernet as an interconnect is limited by the unavailability of scalable switches that provide full bisectional bandwidth. Clos’ idea of connecting small crossbar switches into a large, non-blocking crossbar – wide-spread in the field of high-performance networks – is not applicable in a straightforward manner to Ethernet fabrics. This paper presents techniques necessary to implement such large crossbar switches based on available Gigabit Ethernet technology. We point out the ability to build Gigabit Ethernet crossbar switches of up to 1152 ports providing full bisectional bandwidth. The cost of our configuration is about €125 per port, with an observed latency of less than 10 μsec. We measured a bi-directional point-to-point throughput of 210 MB/s using the ParaStation Cluster middleware [2].

Norbert Eicker, Thomas Lippert
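The 1152-port figure is consistent with a three-stage folded Clos fabric built from 48-port building blocks, an assumption not stated explicitly in the abstract. The port arithmetic for a full-bisection leaf-spine construction from k-port switches can be sketched as:

```python
def clos_end_ports(k):
    """End-node ports of a full-bisection folded Clos (leaf-spine) fabric
    built from k-port switches: each of k leaf switches devotes k/2 ports
    to hosts and k/2 uplinks, one to each of k/2 spine switches.

    Returns (host_ports, total_switches). k must be even.
    """
    assert k % 2 == 0
    leaves = k              # one spine port per leaf limits leaves to k
    spines = k // 2         # one uplink per leaf goes to each spine
    host_ports_per_leaf = k // 2
    return leaves * host_ports_per_leaf, leaves + spines
```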

Towards a Cost-Effective Interconnection Network Architecture with QoS and Congestion Management Support

Congestion management and quality of service (QoS) provision are two important issues in current network design. The most popular techniques proposed for both issues require the existence of specific resources in the interconnection network, usually a high number of separate queues at switch ports. Therefore, the implementation of these techniques is expensive or even infeasible. However, two novel, efficient, and cost-effective techniques for provision of QoS and for congestion management have been proposed recently. In this paper, we combine those techniques to build a single interconnection network architecture, providing an excellent performance while reducing the number of required resources.

A. Martínez, P. J. García, F. J. Alfaro, J. L. Sánchez, J. Flich, F. J. Quiles, J. Duato

Topic 14: Mobile and Ubiquitous Computing

Mobile networking, mobile systems and applications and ubiquitous computing infrastructures are of strongly growing importance in the IT sector in general, and particularly for the parallel and distributed computing community. Mobile internet access has become a commonplace service. Many conventional software products such as e-mail systems, databases, or enterprise resource planning have been adapted towards mobile usage patterns. The world of ubiquitous information processing will soon be revolutionizing our day-to-day routines. Major components are sensor networks, radio frequency identification technology, and whole new layers of data management assembling their low-level signals into high-level knowledge.

Alois Ferscha, Alexander Schill, Gianluigi Ferrari, Valerie Issarny

Multi-rated Packet Transmission Scheme for IEEE 802.11 WLAN Networks

In a multirate wireless network such as IEEE 802.11 WLAN, the connection having a good channel condition uses a high transmission rate and the connection having a poor channel condition uses a low transmission rate. However, this coexistence of different transmission rates degrades the total system performance of the network. In order to eliminate this performance abnormality and improve protocol capacity, we propose a new packet transmission algorithm, the RAT (Rate-Adapted Transmission) scheme. The RAT scheme distributes the wireless channel fairly based on the channel occupancy time. Moreover, it efficiently transmits packets even in a single station using rate-based queue management. Therefore, the RAT scheme obtains not only the inter-rate contention gain among stations but also the intra-rate contention gain among connections in a single station. By simulation, we show that the proposed RAT scheme is superior to the default IEEE 802.11 MAC DCF access method and the modified OAR (Opportunistic Auto Rate) scheme.

Namgi Kim

Comparison of Different Methods for Next Location Prediction

Next location prediction anticipates a person’s movement based on the history of previous sojourns. It is useful for proactive actions taken to assist the person in a ubiquitous environment. This paper evaluates next location prediction methods: dynamic Bayesian network, multi-layer perceptron, Elman net, Markov predictor, and state predictor. For the Markov and state predictors we additionally use an optimization, the confidence counter. The criteria for the comparison are the prediction accuracy, the quantity of useful predictions, the stability, the learning, the relearning, the memory and computing costs, the modelling costs, the expandability, and the ability to predict the time of entering the next location. For the evaluation we use the same benchmarks, containing movement sequences of real persons within an office building.

Jan Petzold, Faruk Bagci, Wolfgang Trumler, Theo Ungerer
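Among the compared methods, the Markov predictor is the simplest to state: count observed transitions between locations and predict the most frequent successor of the current location. A minimal order-1 sketch (class and method names are illustrative, not the paper's implementation):

```python
from collections import defaultdict, Counter

class MarkovPredictor:
    """Order-1 Markov next-location predictor: learns transition counts
    from movement sequences and predicts the most frequent successor."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, sequence):
        for here, nxt in zip(sequence, sequence[1:]):
            self.counts[here][nxt] += 1

    def predict(self, here):
        if not self.counts[here]:
            return None                 # location never observed: no prediction
        return self.counts[here].most_common(1)[0][0]
```

A confidence counter, as used in the paper's optimization, would additionally suppress predictions whose recent accuracy falls below a threshold.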

SEER: Scalable Energy Efficient Relay Schemes in MANETs

In Mobile Ad Hoc Networks (MANETs), broadcasting is widely used to support many applications. Several adaptive broadcast schemes have been proposed to reduce the number of rebroadcasts, and can consequently reduce the chance of contention and collision among neighboring nodes. In practice, broadcasting is power intensive, especially in dense networks. Thus, a good energy-efficient relay scheme should be able to further maximize the system lifetime without sacrificing the reachability of broadcasting. In this paper, we propose two Scalable Energy Efficient Relay (SEER) schemes that use probabilistic approaches to achieve higher performance and to prolong the system lifetime. In these schemes, each node uses an energy-based heuristic to independently determine an appropriate rebroadcast probability. Nodes with more residual energy are responsible for forwarding more broadcast messages. One important feature is that this heuristic knowledge is obtained by self-contained local operation. To further improve the effectiveness of broadcasting, we also study how to dynamically adjust the rebroadcast probability according to node mobility. The simulation results show that our proposed approach outperforms the related scheme when the number of broadcast messages, broadcast reachability, and system lifetime are all taken into consideration.

Lin-Fei Sung, Cheng-Lin Wu, Yi-Kai Chiang, Shyh-In Hwang

Multicost Routing over an Infinite Time Horizon in Energy and Capacity Constrained Wireless Ad-Hoc Networks

In this work we study the dynamic one-to-one communication problem in energy- and capacity-constrained wireless ad-hoc networks. The performance of such networks is evaluated under random traffic generation and continuous energy recharging at the nodes over an infinite-time horizon. We are interested in the maximum throughput that can be sustained by the network with the node queues being finite and in the average packet delay for a given throughput. We propose a multicost energy-aware routing algorithm and compare its performance to that of minimum-hop routing. The results of our experiments show that generally the energy-aware algorithm achieves a higher maximum throughput than the minimum-hop algorithm. More specifically, when the network is mainly energy-constrained and for the 2-dimensional topology considered, the throughput of the proposed energy-aware routing algorithm is found to be almost twice that of the minimum-hop algorithm.

Christos A. Papageorgiou, Panagiotis C. Kokkinos, Emmanouel A. Varvarigos

An Adaptive Self-organization Protocol for Wireless Sensor Networks

This paper proposes a new self-organization protocol for low-power, battery-operated sensors in wireless sensor networks. In our protocol, the sensor network forms a hierarchical architecture rooted at a sink node, constructed using a spanning tree algorithm. Our protocol uses control messages to construct this hierarchical architecture and, by exchanging messages between nodes, adaptively maintains the network topology by reorganizing the tree as the network evolves. We performed simulations to evaluate the performance of our protocol in wireless sensor networks, comparing it with conventional approaches. The results show that our protocol outperforms the other protocols over a broad range of parameters.

Kil-Woong Jang, Byung-Soon Kim

COPRA – A Communication Processing Architecture for Wireless Sensor Networks

Typical sensor nodes are composed of cheap hardware because they have to be affordable in great numbers. This means that memory and communication bandwidth are small, CPUs are slow, and energy is limited. It also means that all unnecessary software components must be omitted, so application-specific communication protocols become necessary. As it is cumbersome to write these from scratch every time, a configurable framework is needed. COPRA provides such an architectural framework that allows the construction of application-specific communication protocol stacks from prefabricated components.

Reinhardt Karnapke, Joerg Nolte

DAEDALUS – A Peer-to-Peer Shared Memory System for Ubiquitous Computing

Data sharing at large scale and with tolerance of high volatility requires peer-to-peer solutions where traditional multiprocessor shared memory systems are not applicable. The efficiency of such P2P shared memory systems depends, in particular, on scale, dynamics, and concurrent write accesses. We have developed a P2P shared memory solution, DAEDALUS, based on Sun's JXTA framework, and integrated an efficient stochastic locking protocol, proper resource clustering, and semi-hierarchical grouping of nodes. We evaluated its applicability under heavy load, scale, and node mobility. Here, DAEDALUS outperformed a client/server system and solved its inherent scalability problem.

Peter Ibach, Vladimir Stantchev, Christian Keller

Context Awareness: An Experiment with Hoarding

Computer mobility allows people to use computers in varied and changing environments. This variability forces applications to adapt, thus requiring awareness of the computational and physical environment (e.g. information about power management, network connections, synchronization opportunities, storage, computation, location-based services, etc.).

An important application for mobility is hoarding, i.e. automatic file replication between devices. To be accurate and not obstructive to the user, the hoarding mechanism requires both context awareness (e.g. amount of usable storage) and estimation of future environment conditions (e.g. network connection, tasks to be performed by the user in the near future, etc.). However, making applications context-aware is hindered by the complexity of dealing with the large variety of different modules, sensors and service platforms, i.e. there is no middleware supporting such applications and their development in a uniform and integrated way.

This paper presents the architecture for an environment awareness system (EAS) and how it applies to hoarding. EAS is a middleware component that acts as an intermediary between applications and all mechanisms that assess the surrounding environment. It lets applications query and combine environment properties in a standardized way. Crucial for the success of automatic file hoarding is the EAS’s capability of supporting environment prediction based on simple reasoning and pattern detection. Thus, applications may advise users accordingly or even make decisions on their behalf.

João Garcia, Luís Veiga, Paulo Ferreira

A Client-Server Approach to Enhance Interactive Virtual Environments on Mobile Devices over Wireless Ad Hoc Networks

Interactive 3D environments have been studied for years and represent an important application area of computer graphics. However, high quality virtual environment interaction requires powerful computers equipped with expensive graphics accelerator cards. The high 3D data volume and the dynamic nature of bandwidth pose significant challenges to providing smooth virtual navigation on thin mobile devices over wireless ad hoc networks. In this paper, we show that it is possible to provide a virtual environment walkthrough on mobile devices through a client-server approach. Although mobile devices have low processing power and memory, they can still render images with relative ease. Based on this fact, instead of using traditional geometry-rendering techniques and locally rendering complex scenes, we employ an image-based mechanism on the client that uses images provided by a remote server through an interactive streaming transport protocol. We propose a bandwidth feedback algorithm together with rate control and virtual user path prediction to better adapt the system to changing bandwidth. We also discuss our ideas and present an extensive set of simulations to evaluate the performance of our solutions.

Azzedine Boukerche, Richard Werner Nelem Pazzi, Tingxue Huang

Topic 15: Peer-to-Peer and Web Computing

Topic 15: Peer-to-Peer and Web Computing

Peer-to-peer (P2P) systems have become a major area of research in the past few years. Their potential was first revealed by the hugely popular P2P file sharing applications, which allow any computer (as a peer), anywhere in a large scale distributed computing environment, to share information and resources with others. The computing environments promoted by P2P systems and technology are decentralized in nature, exploring a symmetric pairwise interaction model. They are self-organized and self-coordinated, dynamically adapted to peer arrivals and departures, and highly resilient to failures. As P2P research becomes more mature, new challenges emerge to support complex and heterogeneous distributed environments for sharing and managing data, resources, and knowledge, with highly volatile and dynamic usage patterns. This topic provides a forum for researchers to present new contributions on P2P technologies, applications, and systems, identifying key research issues and new challenges.

Henrique J. Domingos, Anne-Marie Kermarrec, Pascal Felber, Mark Jelasity

Top k RDF Query Evaluation in Structured P2P Networks

Berners-Lee’s vision of the Semantic Web describes the idea of providing machine readable and processable information using key technologies such as ontologies and automated reasoning in order to create intelligent agents.

The prospective amount of machine readable information available in the future will be large. Thus, heterogeneity and scalability will be central issues, rendering exhaustive searches and central storage of data infeasible. This paper presents a scalable peer-to-peer based approach to distributed querying of Semantic Web information that allows ordering of entries in result sets and limiting the size of result sets, which is necessary to prevent results with millions of matches. The system relies on the graph-based W3C standard Resource Description Framework (RDF) for knowledge description. Thereby, it enables queries on large, distributed RDF graphs.

Dominic Battré, Felix Heine, Odej Kao

Roogle: Supporting Efficient High-Dimensional Range Queries in P2P Systems

Multi-dimensional range query is an important query type and is especially useful when the user does not know exactly what he is looking for. However, due to improper indexing methods and high routing latency, existing schemes cannot perform well in high-dimensional situations. In this paper, we propose Roogle, a decentralized non-flooding P2P search engine that can efficiently support high-dimensional range queries in P2P systems. Roogle makes improvements on both indexing and routing. The high-dimensional data is indexed based on the maximum or minimum value among all dimensions. This simple indexing method performs rather well in high-dimensional situations and tolerates data points with missing values or different dimensionality. To speed up query routing, Roogle is built on top of our proposed structured overlay, Aurelia, which achieves better routing performance by exploiting node heterogeneity. Aurelia also guarantees data locality and efficiently supports range queries. Experimental results from simulation validate the scalability and efficiency of Roogle.
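The max-value indexing idea can be sketched as follows (an illustrative reading of the abstract, not Roogle's actual implementation): each point is keyed by the maximum value over its dimensions, so a one-dimensional ordered index yields candidate points for a range query, tolerating missing values and differing dimensionality.

```python
import bisect

class MaxValueIndex:
    """Index high-dimensional points by the maximum value across their
    dimensions (one of the two variants the abstract mentions). Tolerates
    missing values (None) and points of different dimensionality."""

    def __init__(self):
        self._keys = []    # sorted max-values
        self._points = []  # points, kept aligned with _keys

    def insert(self, point):
        key = max(v for v in point if v is not None)
        i = bisect.bisect_left(self._keys, key)
        self._keys.insert(i, key)
        self._points.insert(i, point)

    def range_query(self, lo, hi):
        """Candidates whose max-dimension key falls in [lo, hi]; a real
        system would post-filter per-dimension bounds at the requester."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return self._points[i:j]
```

In a P2P setting the sorted key space would be partitioned across the overlay's nodes rather than held in two local lists.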

Di Wu, Ye Tian, Kam-Wing Ng

Creating and Maintaining Replicas in Unstructured Peer-to-Peer Systems

In peer-to-peer systems, replication is an important issue as it improves search performance and data availability. It has been shown that optimal replication is attained when the number of replicas per item is proportional to the square root of their popularity. In this paper, we focus on updates in the case of optimal replication. In particular, we propose a new practical strategy for achieving square root replication called pull-then-push replication (PtP). With PtP, after a successful search, the requesting node enters a replicate-push phase where it transmits copies of the item to its neighbors. We show that updating the replicas can be significantly improved through an update-push phase where the node that created the copies propagates any updates it has received using similar parameters as in replicate-push. Our experimental results show that replicate-push coupled with an update-push strategy achieves good replica placement and consistency with small message overhead.
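The square-root rule the paper builds on can be made concrete: given a total replica budget R and per-item query popularities q_i, item i receives R·√q_i / Σ_j √q_j replicas. A minimal sketch (function and item names are illustrative):

```python
import math

def square_root_allocation(popularity, total_replicas):
    """Split a replica budget so each item's share is proportional to the
    square root of its query popularity (the cited optimality condition)."""
    roots = {item: math.sqrt(q) for item, q in popularity.items()}
    norm = sum(roots.values())
    return {item: total_replicas * r / norm for item, r in roots.items()}

# An item 100x more popular receives only 10x the replicas:
alloc = square_root_allocation({"hot": 100, "cold": 1}, total_replicas=110)
```

The sub-linear growth is the point: popular items are replicated enough to be found quickly, without starving rare items of replicas.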

Elias Leontiadis, Vassilios V. Dimakopoulos, Evaggelia Pitoura

DOH: A Content Delivery Peer-to-Peer Network

Many SMEs and non-profit organizations suffer when flash-crowd effects make their Web servers unavailable as their web site suddenly becomes popular. One solution to the flash-crowd problem is to place the web site on a scalable CDN (Content Delivery Network) that replicates the content and distributes the load in order to improve response time.

In this paper, we present our approach to building a scalable Web hosting environment as a CDN on top of a structured peer-to-peer system of collaborative web servers, integrated to share the load and to improve the overall system performance, scalability, availability and robustness. Unlike cluster-based solutions, it can run on heterogeneous hardware over geographically dispersed areas. To validate and evaluate our approach, we have developed a system prototype called DOH (DKS Organized Hosting), a CDN implemented on top of the DKS (Distributed K-nary Search) structured P2P system with DHT (Distributed Hash Table) functionality [9]. The prototype is implemented in Java, using the DKS middleware, the Jetty web server, and a modified JavaFTP server. The proposed CDN design has been evaluated by simulation and by experiments on the prototype.

Jimmy Jernberg, Vladimir Vlassov, Ali Ghodsi, Seif Haridi

Topic 16: Applications of High-Performance and Grid Computing

Topic 16: Applications of High-Performance and Grid Computing

The use of high performance and grid computing has spread rapidly, revolutionising the ability of scientists and engineers to tackle the challenges they face. Driven by commoditisation and open standards, the widespread availability of parallel computers, large data storage, fast networks, maturing Grid middleware, and distributed service-oriented technologies has led to the development and deployment of large scale distributed simulation and data analysis solutions in many areas. The papers in this topic highlight recent progress in applications of all aspects of distributed computing technologies, with an emphasis on successes, advances, and lessons learned in the development, implementation, and deployment of novel scientific, engineering and industrial applications on high performance and grid computing platforms. Today's large computational solutions often require access to or generate large volumes of data; indeed, today seamless data access and management can be as important to the underlying computational algorithm as raw computing power. Papers in the sessions highlight data intensive applications which couple High Performance/Grid computing with large-scale data access/management.

Simon J. Cox, Thomas Lippert, Giovanni Erbacci, Denis Trystram

Task Pool Teams Implementation of the Master Equation Approach for Random Sierpinski Carpets

We consider the use of task pool teams in implementing the master equation on random Sierpinski carpets. Though the basic idea of dynamic storage of the probability density reported earlier applies straightforwardly to random carpets, the randomized construction breaks most of the simplifications possible for regular carpets. In addition, parallel implementations show highly irregular communication patterns. We compare four implementations on three different Beowulf cluster architectures, differing mainly in the throughput and latency of their interconnection networks. It appears that task pool teams provide a powerful programming paradigm for handling the irregular communication patterns that arise in our application and offer a promising approach to efficiently handling the problems that appear with such randomized structures. This will allow for highly improved modelling of anomalous diffusion in porous media, taking the random structure of real materials into account.

K. H. Hoffmann, M. Hofmann, G. Rünger, S. Seeger

A Preliminary Out-of-Core Extension of a Parallel Multifrontal Solver

The memory usage of sparse direct solvers can be the bottleneck when solving large-scale problems. This paper describes a first implementation of an out-of-core extension to a parallel multifrontal solver. We show that larger problems can be solved on limited-memory machines with reasonable performance, and we illustrate the behaviour of our parallel out-of-core factorization. Then we use simulations to discuss how our algorithms can be modified to solve much larger problems.

Emmanuel Agullo, Abdou Guermouche, Jean-Yves L’Excellent

A Parallel Adaptive Cartesian PDE Solver Using Space–Filling Curves

In this paper, we present a parallel multigrid PDE solver working on adaptive hierarchical Cartesian grids. The presentation is restricted to the linear elliptic operator of second order, but extensions are possible and have already been realised as prototypes. Within the solver, the handling of the vertices and the degrees of freedom associated with them is implemented solely using stacks and iterates of a Peano space-filling curve. Thus, due to the structuredness of the grid, two administrative bits per vertex are sufficient to store both geometry and grid refinement information. The implementation and parallel extension, using a space-filling curve to obtain a load-balanced domain decomposition, will be formalised. In view of the fact that we are using a multigrid solver of linear complexity, it has to be ensured that communication cost and, hence, the parallel algorithm's overall complexity do not exceed this linear behaviour.

Hans-Joachim Bungartz, Miriam Mehl, Tobias Weinzierl

Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes

The paper focuses on a parallel implementation of a simulated annealing algorithm. In order to take advantage of the properties of modern clustered SMP architectures, a hybrid method using OpenMP nested in MPI is advocated. We describe the development of a reference implementation. Furthermore, several load balancing strategies are introduced: time scheduling at the annealing-process level, clustering at the basic annealing-step level, and suspending within the basic annealing step. The application of the algorithm to VRPTW, a generally accepted benchmark problem, is used to illustrate their positive influence on execution time and the quality of results.
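For reference, the sequential scheme being parallelized is the textbook simulated annealing loop with Metropolis acceptance; a minimal sketch (all parameters and the cooling schedule are illustrative, not the paper's configuration):

```python
import math
import random

def anneal(cost, neighbour, x0, t0=1.0, cooling=0.95,
           steps_per_temp=100, t_min=1e-3, rng=random):
    """Generic sequential simulated annealing: at each temperature, perform
    a batch of basic annealing steps, accepting worsening moves with
    probability exp(-delta/T); then cool geometrically."""
    x, t = x0, t0
    best = x
    while t > t_min:
        for _ in range(steps_per_temp):   # the "basic annealing steps"
            y = neighbour(x, rng)
            delta = cost(y) - cost(x)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                x = y
                if cost(x) < cost(best):
                    best = x
        t *= cooling                      # geometric cooling schedule
    return best
```

The parallelization questions the paper studies arise from this structure: the temperature levels are sequential, but the steps within a level (and the cost evaluations) can be distributed, which is where load balancing matters.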

Agnieszka Debudaj-Grabysz, Rolf Rabenseifner

A Grid Computing Based Virtual Laboratory for Environmental Simulations

Grid computing technology permits the coordinated, efficient and effective use of (geographically spread) computational and storage resources, with the aim of achieving high throughput for CPU-intensive applications.

In this paper we describe the development of a virtual laboratory for environmental applications. The software infrastructure and the related interface are developed for the straightforward use of shared and distributed observations, software, computing and storage resources. The user can design and execute his experiments by building up and assembling data acquisition procedures, numerical models, and applications for rendering output data, with limited knowledge of grid computing, thereby focusing his attention on the application.

Our solution aims at the goal of developing black-box grid applications for earth observation, marine and environmental sciences.

I. Ascione, G. Giunta, P. Mariani, R. Montella, A. Riccio

Exploiting Throughput for Pipeline Execution in Streaming Image Processing Applications

There is a large range of image processing applications that act on an input sequence of image frames that are continuously received. Throughput is a key performance measure to be optimized when executing them. In this paper we propose a new task replication methodology for optimizing throughput for an image processing application in the field of medicine. The results show that by applying the proposed methodology we are able to achieve the desired throughput in all cases, in such a way that the input frames can be processed at any given rate.
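The rationale for task replication can be seen with a simple throughput model (illustrative, not the paper's methodology): a pipeline's frame rate is bounded by its slowest stage, and replicating a stage k ways divides that stage's effective service time by k.

```python
import math

def pipeline_throughput(stage_times, replicas):
    """Frames/sec of a pipeline bounded by its slowest stage; replicating
    a stage divides its effective service time by the replica count."""
    effective = [t / r for t, r in zip(stage_times, replicas)]
    return 1.0 / max(effective)

def replicas_for_target(stage_times, target_fps):
    """Minimum replicas per stage so every stage keeps up with the target rate."""
    period = 1.0 / target_fps
    return [math.ceil(t / period) for t in stage_times]

# Three stages taking 20, 50 and 10 ms per frame; to sustain 50 frames/sec
# the 50 ms stage must be replicated three ways:
demand = replicas_for_target([0.020, 0.050, 0.010], target_fps=50)
```

Here `demand` comes out as `[1, 3, 1]`: only the bottleneck stage needs extra copies, which is the economy the replication methodology exploits.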

F. Guirado, A. Ripoll, C. Roig, A. Hernàndez, E. Luque

dCache, Storage System for the Future

In 2007, the most challenging high energy physics experiment ever, the Large Hadron Collider (LHC) at CERN, will produce a sustained data stream on the order of 300 MB/sec, equivalent to a stack of CDs as high as the Eiffel Tower once per week. While being produced, this data is distributed and persistently stored at several dozen sites around the world, forming the LHC data grid. The destination sites are expected to provide the necessary middleware, so-called Storage Elements, offering standard protocols to receive the data and to store it in the site-specific Storage Systems. A major player in the set of Storage Elements is the dCache/SRM system. dCache/SRM has proven capable of managing the storage and exchange of several hundreds of terabytes of data, transparently distributed among dozens of disk storage nodes. One of the key design features of dCache is that although the location and multiplicity of the data is autonomously determined by the system, based on configuration, CPU load and disk space, the name space is uniquely represented within a single file system tree. The system has been shown to significantly improve the efficiency of connected tape storage systems through caching, 'gather & flush' and scheduled staging techniques. Furthermore, it optimizes the throughput to and from data clients as well as smoothing the load of the connected disk storage nodes by dynamically replicating datasets upon detection of load hot spots. The system is tolerant against failures of its data servers, which enables administrators to use commodity disk storage components. Access to the data is provided by various standard protocols. Furthermore, the software comes with an implementation of the Storage Resource Manager protocol (SRM), which is evolving into an open standard for grid middleware to communicate with site-specific storage fabrics.

Patrick Fuhrmann, Volker Gülzow

Computing the Diameter of 17-Pancake Graph Using a PC Cluster

The n-pancake graph is a graph whose vertices are the permutations of n symbols, and each pair of vertices is connected with an edge if and only if the corresponding permutations can be transformed into each other by a prefix reversal. Since the n-pancake graph has n! vertices, it is known to be a hard problem to compute its diameter using an algorithm polynomial in the number of vertices. Fundamental approaches to the diameter computation have been proposed. However, computing the diameter of the 15-pancake graph has been the practical limit. In order to compute the diameters of larger pancake graphs, it is indispensable to establish a sustainable parallel system with enough scalability. Therefore, in this study, we have proposed an improved algorithm to compute the diameter, developed a sustainable parallel system with the Condor/MW framework, and computed the diameters of the 16- and 17-pancake graphs using PC clusters.
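The underlying graph is easy to state in code: vertices are permutations, edges are prefix reversals, and since the pancake graph is vertex-transitive, a BFS from the identity permutation yields the diameter. The sketch below (plain Python, nothing like the paper's parallel system) is feasible only for small n, which is precisely why a scalable cluster approach is needed for n = 17:

```python
import math
from collections import deque

def pancake_neighbours(perm):
    """All permutations reachable by one prefix reversal
    (flipping the top i pancakes, for i = 2..n)."""
    return [perm[:i][::-1] + perm[i:] for i in range(2, len(perm) + 1)]

def pancake_diameter(n):
    """BFS from the identity permutation; vertex-transitivity makes the
    identity's eccentricity equal to the graph diameter. O(n!) states,
    so this brute force is feasible only for small n."""
    start = tuple(range(n))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in pancake_neighbours(v):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    assert len(dist) == math.factorial(n)  # the n-pancake graph has n! vertices
    return max(dist.values())
```

For n = 5 this explores only 120 vertices; at n = 17 the graph has 17! ≈ 3.6 × 10^14 vertices, hence the need for the improved algorithm and PC-cluster parallelism described above.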

Shogo Asai, Yuusuke Kounoike, Yuji Shinano, Keiichi Kaneko

Topic 17: High-Performance Bioinformatics

Topic 17: High-Performance Bioinformatics

High performance computational biology and bioinformatics are increasingly required to extract valuable biological and biomedical knowledge from ever-increasing biological data. New computational techniques and new theoretical models are required to simulate complex biological behavior of biological systems. Topic 17 focuses on high-performance and high-throughput computing necessary for management of biological data, extraction of meaning from biological data and using such data in modeling and simulation of biological systems.

Craig A. Stewart, Michael Schroeder, Concettina Guerra, Konagaya Akihiko

Multidimensional Dynamic Programming for Homology Search on Distributed Systems

Alignment problems in computational biology have recently attracted much attention because of the rapid growth of sequence databases. By computing alignments, we can understand similarity among sequences. Dynamic programming is a technique to find the optimal alignment, but it requires very long computation time. We have shown that dynamic programming for more than two sequences can be efficiently processed on a compact system which consists of an off-the-shelf FPGA board and its host computer. The performance is, however, not enough for comparing long sequences. In this paper, we describe a computation method for multidimensional dynamic programming on distributed systems. The method is now being tested using two nodes connected by Ethernet. According to our experiments, it is possible to achieve a 5.1 times speedup with 16 nodes, and more speedup can be expected when comparing longer sequences using more nodes. The performance is affected only a little by data transfer delay when comparing long sequences. Therefore, our method can be mapped onto any kind of network with large delays.

Shingo Masuno, Tsutomu Maruyama, Yoshiki Yamaguchi, Akihiko Konagaya

Load Balancing and Parallel Multiple Sequence Alignment with Tree Accumulation

The multiple sequence alignment program ClustalW is time consuming but commonly used to compare protein sequences. ClustalW includes two main time-consuming parts: pairwise alignment and progressive alignment. Due to the irregular tree-based computation in progressive alignment, available parallel programs cannot achieve reasonable speedups for large numbers of sequences. In this paper, progressive alignment is reduced to a tree accumulation problem. Load balancing has been ignored in previous efficient parallel tree accumulations. We propose a load balancing strategy for parallelizing tree accumulation in progressive alignment. The new parallel progressive alignment algorithm, reduced to tree accumulation with load balancing, greatly reduces the overall running time and achieves reasonable speedups.

Guangming Tan, Liu Peng, Shengzhong Feng, Ninghui Sun

ZIB Structure Prediction Pipeline: Composing a Complex Biological Workflow Through Web Services

In life sciences, scientists are confronted with an exponential growth of biological data, especially in the genomics and proteomics area. The efficient management and use of these data, and its transformation into knowledge are basic requirements for biological research. Therefore, integration of diverse applications and data from geographically distributed computing resources will become a major issue. We will present the status of our efforts for the realization of an automated protein prediction pipeline as an example for a complex biological workflow scenario in a Grid environment based on Web services. This case study demonstrates the ability of an easy orchestration of complex biological workflows based on Web services as building blocks and Triana as workflow engine.

Patrick May, Hans-Christian Ehrlich, Thomas Steinke

Evaluation of Parallel Paradigms on Anisotropic Nonlinear Diffusion

Anisotropic Nonlinear Diffusion (AND) is a powerful noise reduction technique in the field of computer vision. This method is based on a Partial Differential Equation (PDE) tightly coupled with a massive set of eigensystems. Denoising large 3D images in biomedicine and structural cellular biology by AND is extremely expensive from a computational point of view, and the requirements may become so huge that parallel computing turns out to be essential. This work addresses the parallel implementation of AND. The parallelization is carried out by means of three paradigms: (1) Shared address space paradigm, (2) Message passing paradigm, and (3) Hybrid paradigm. The three parallel approaches have been evaluated on two parallel platforms: (1) a DSM (Distributed Shared Memory) platform based on cc-NUMA memory access and (2) a cluster of Symmetric biprocessors. An analysis of the performance of the three strategies has been accomplished to determine which is the most suitable paradigm for each platform.

S. Tabik, E. M. Garzón, I. García, J. J. Fernández

Improving the Research Environment of High Performance Computing for Non-cluster Experts Based on Knoppix Instant Computing Technology

We have designed and implemented a new portable system that can rapidly construct a computer environment in which high-throughput research applications can be performed instantly. One challenge in the instant computing area is constructing a cluster system instantly and then readily restoring it to its former state. This paper presents an approach to instant computing using Knoppix technology that allows even a non-computer specialist to easily construct and operate a Beowulf cluster. In the present bio-research field, there is an urgent need to address the problem posed by the demand for high-performance computing. We therefore set out to propose a way to build an environment in which a cluster computer system can be instantly set up. Through such research, we believe that this technology can accelerate scientific research. However, when employing this technology in bio-research, a capacity barrier exists when selecting a clustered Knoppix system for a data-driven bioinformatics application. We have overcome this barrier by using a virtual integrated RAM-DISK adapted to a parallel file system. As an actual example using a reference application, we have chosen InterProScan, an integrated application provided by the European Bioinformatics Institute (EBI) that utilizes many databases and scan methods. InterProScan is capable of scaling its workload with local computational resources, though biology researchers and even bioinformatics researchers find such extensions difficult to set up.
We have achieved the goal of allowing even researchers who are not cluster experts to easily build a system of "Knoppix for the InterProScan 4.1 High Throughput Computing Edition". The system we developed is capable of not only constructing a cluster computer environment composed of 32 computers in about ten minutes (as opposed to six hours when done manually), but also restoring the original environment by rebooting the pre-existing operating system. The goal of our instant cluster computing is to provide an environment in which any target application can be built instantly from anywhere.

Fumikazu Konishi, Manabu Ishii, Shingo Ohki, Yusuke Hamano, Shuichi Fukuda, Akihiko Konagaya

Topic 18: Embedded Parallel Systems

Topic 18: Embedded Parallel Systems

Multi-processor systems implemented in System-on-a-Chip technology (MPSoC) are emerging for processing embedded applications such as consumer electronics, mobile phones, computer graphics, and medical imaging, to name a few. Contrary to cluster and grid processing, their design and required compilation techniques are driven by multiple conflicting design objectives simultaneously such as power consumption, speed, monetary cost, and physical as well as memory size. Here, new specification techniques, special parallelization and mapping techniques are needed in order to embed computations optimally into the parallel architecture. Various architectural concepts ranging from fine-grain to coarse-grain parallel SoC architectures with focus on dynamic programmability or reconfigurability are currently emerging in academia and industry.

Jürgen Teich, Stefanos Kaxiras, Toomas Plaks, Krisztián Flautner

Efficient Realization of Data Dependencies in Algorithm Partitioning Under Resource Constraints

Mapping algorithms efficiently to parallel architectures is very important for the cost-effective design of many modern technical products. In this paper, we present a solution to the problem of efficiently realizing uniform data dependencies on processor arrays. In contrast to existing approaches, we formulate an optimization problem that considers the cost of both channels and registers. Furthermore, a solution to the optimization problem determines which channels shall be implemented and specifies the control for the realization of the uniform data dependencies. We illustrate our method on an edge detection algorithm.

Sebastian Siegel, Renate Merker

FPGA Implementation of a Prototype Hierarchical Control Network for Large-Scale Signal Processing Applications

The performance of a high-throughput, large-scale signal processing system must not be compromised by the control and monitoring flow that is inherently part of the system. In particular, the interfacing of data flow and control flow components should be such that control does not obstruct the signal flow, which has higher priority. We assume that the signal processing is modeled as a distributed hierarchy of data flow networks, and that the control and monitoring is modeled as a distributed hierarchy of communicating Finite State Machines. The interfaces between leaf nodes of the control and monitoring network and the signal processing nodes in the dataflow networks are specified in such a way that the semantics of both network types are preserved. In this paper, we present the prototyping of a control network and its interfacing with a data flow network on an FPGA-based platform, and we analyze the performance of the interfacing in a case study. The HDL code involved in the interfaces is generated in a semi-automated way.

Jérôme Lemaitre, Ed Deprettere

An Embedded Systems Programming Environment for C

Resource constraints are a major concern with the design, development, and deployment of embedded systems. Embedded systems are highly hardware-dependent and have little computational power. Mobile embedded systems are further constrained by their limited battery capacity. Many of these systems are still programmed in assembly language because there is a lack of efficient programming environments.

To overcome or at least alleviate the restrictions, we propose a light-weight and versatile programming environment for the C programming language that offers mixed-mode execution, i.e., code is either executed on the CPU or on a virtual machine (VM). This mixed-mode execution environment combines the advantages of highly compressed bytecode with the speed of machine code.

We have implemented the programming environment and conducted experiments with selected programs from the MiBench and SPEC 2000 suites. The VM has a footprint of 12 KB on the Intel IA32. Initial results show that the performance of the virtual machine is typically 2 to 36 times slower than binary execution, with compressed code occupying only 36%–57% of the machine code size. Combining sequences of VM instructions into new VM instructions (superinstructions) increases execution speed and reduces the VM code size. Preliminary experiments indicate a speedup by a factor of 3.
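The superinstruction idea can be illustrated with a toy bytecode interpreter (entirely hypothetical, unrelated to the authors' VM): fusing a frequent instruction pair into one opcode removes one dispatch per occurrence and shrinks the bytecode.

```python
def run(code, stack=None):
    """Toy stack VM. Fusing the frequent (PUSH k, ADD) pair into the
    superinstruction PUSH_ADD saves one dispatch and one code word."""
    stack = stack if stack is not None else []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == "PUSH":
            stack.append(code[pc + 1]); pc += 2
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b); pc += 1
        elif op == "PUSH_ADD":            # superinstruction: PUSH k; ADD
            stack.append(stack.pop() + code[pc + 1]); pc += 2
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack

# Same result, one fewer dispatch and one fewer code word:
plain = run(["PUSH", 2, "PUSH", 3, "ADD", "HALT"])   # 6 code words
fused = run(["PUSH", 2, "PUSH_ADD", 3, "HALT"])      # 5 code words
```

In a real mixed-mode VM the candidate pairs would be chosen from profile data, and the fused handlers generated at VM build time.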

Bernd Burgstaller, Bernhard Scholz, Anton Ertl

