
2013 | Book

Euro-Par 2012: Parallel Processing Workshops

BDMC, CGWS, HeteroPar, HiBB, OMHI, Paraphrase, PROPER, Resilience, UCHPC, VHPC, Rhodes Islands, Greece, August 27-31, 2012. Revised Selected Papers

Editors: Ioannis Caragiannis, Michael Alexander, Rosa Maria Badia, Mario Cannataro, Alexandru Costan, Marco Danelutto, Frédéric Desprez, Bettina Krammer, Julio Sahuquillo, Stephen L. Scott, Josef Weidendorfer

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the thoroughly refereed post-conference proceedings of the workshops of the 18th International Conference on Parallel Computing, Euro-Par 2012, held in Rhodes Islands, Greece, in August 2012. The papers of the 10 workshops (BDMC, CGWS, HeteroPar, HiBB, OMHI, Paraphrase, PROPER, Resilience, UCHPC, VHPC) focus on the promotion and advancement of all aspects of parallel and distributed computing.

Table of Contents

Frontmatter

1st Workshop on Big Data Management in Clouds – BDMC2012

1st Workshop on Big Data Management in Clouds – BDMC2012

As data volumes increase at exponential speed in more and more application fields of science, the challenges posed by handling Big Data in the Exabyte era gain increasing importance. High-energy physics, statistics, climate modeling, cosmology, genetics and bio-informatics are just a few examples of fields where it becomes crucial to efficiently manipulate Big Data, which is typically shared at large scale. Rapidly storing this data, protecting it from loss and analyzing it to understand the results are significant challenges, made more difficult by decades of improvements in computation capabilities that have been unmatched in storage. For many applications, the overall performance and scalability becomes clearly driven by the performance of the data handling subsystem. As we anticipate Exascale systems in 2020, there is a growing consensus in the scientific community that revolutionary new approaches are needed in computational science data management. These new trends lead us to rethink the traditional file-based data management abstraction for large-scale applications. Moreover, for obvious cost-related reasons, new architectures are clearly needed, as well as alternative infrastructures to supercomputers, like hybrid or HPC clouds.

Alexandru Costan, Ciprian Dobre
MRBS: Towards Dependability Benchmarking for Hadoop MapReduce

MapReduce is a popular programming model for distributed data processing. Extensive research has been conducted on the reliability of MapReduce, ranging from adaptive and on-demand fault-tolerance to new fault-tolerance models. However, realistic benchmarks are still missing to analyze and compare the effectiveness of these proposals. To date, most MapReduce fault-tolerance solutions have been evaluated using microbenchmarks in an ad-hoc and overly simplified setting, which may not be representative of real-world applications. This paper presents MRBS, a comprehensive benchmark suite for evaluating the dependability of MapReduce systems. MRBS includes five benchmarks covering several application domains and a wide range of execution scenarios such as data-intensive vs. compute-intensive applications, or batch applications vs. online interactive applications. MRBS makes it possible to inject various types of faults at different rates, and produces extensive reliability, availability and performance statistics. The paper illustrates the use of MRBS with Hadoop clusters.

Amit Sangroya, Damián Serrano, Sara Bouchenak
Caju: A Content Distribution System for Edge Networks

More and more, users store their data in the cloud. When this content is later retrieved, the retrieval has to respect quality of service (QoS) constraints. In order to reduce transfer latency, data is replicated. The idea is to keep data close to users and to take advantage of providers' home storage. However, to minimize the cost of their platform, cloud providers need to limit the amount of storage used. This is all the more crucial for large content.

This problem is hard because the distribution of popularity among the stored pieces of data is highly non-uniform: several pieces of data will never be accessed, while others may be retrieved thousands of times. Thus, the trade-off between storage usage and the QoS of data retrieval has to take the data popularity into account.

This paper presents our architecture, which gathers several storage domains composed of small-sized datacenters and edge devices, and it shows the importance of adapting the replication degree to data popularity.

Our simulations, using realistic workloads, show that a simple cache mechanism provides an eight-fold decrease in the number of SLA violations, requires up to 10 times less storage capacity for replicas, and reduces aggregate bandwidth and the number of flows by half.

Guthemberg Silvestre, Sébastien Monnet, Ruby Krishnaswamy, Pierre Sens
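Caju's actual replication and caching policy is not detailed in this abstract; as a purely illustrative sketch of why even a simple cache absorbs most requests under skewed popularity, the following hypothetical LRU front-end (names and workload invented for illustration, not taken from the paper) counts hits and misses against a storage back-end:

```python
from collections import OrderedDict

class LRUCache:
    """A minimal least-recently-used cache placed in front of a storage domain."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, fetch):
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)      # mark as most recently used
            return self.store[key]
        self.misses += 1
        value = fetch(key)                   # fall back to the storage back-end
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used entry
        return value

# With a skewed access pattern, a tiny cache serves most requests locally.
cache = LRUCache(capacity=2)
for k in [1, 2, 1, 3, 1, 2]:
    cache.get(k, fetch=lambda key: "content-%d" % key)
```

Under the highly non-uniform popularity described above, hot items stay cached while cold items are evicted, which is the intuition behind the reported drop in SLA violations.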
Data Security Perspectives in the Framework of Cloud Governance

The adoption of the Cloud Computing paradigm by Small and Medium Enterprises allows them to associate and to create virtualized forms of enterprises or clusters that better sustain the competition with large enterprises sharing the same markets. At the same time, the lack of security standards in Cloud Computing makes Small and Medium Enterprises reluctant to fully move their activities into the Cloud. We have proposed a Cloud Governance architecture which relies on the mOSAIC project's cloud management solution called Cloud Agency, implemented as a multi-agent system. The Cloud Governance solution is based on various datastores that manage the data produced and consumed during the services' lifecycle. This paper focuses on determining the requirements that must be met by the various databases that compose the most complex datastore of the proposed architecture, called the Service Datastore, and on emphasizing the threats and security risks that the individual database entities must face.

Adrian Copie, Teodor-Florin Fortiş, Victor Ion Munteanu

CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing – CGWS2012

CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing – CGWS2012

CoreGRID is a European research Network of Excellence (NoE) that was initiated in 2004 as part of the EU FP6 research framework. CoreGRID partners, from 44 different countries, developed theoretical foundations and software infrastructures for large-scale, distributed Grid and P2P applications. An ERCIM-sponsored CoreGRID Working Group was established to ensure the continuity of the CoreGRID programme after the original funding period of the NoE. The working group extended its interests to include the emerging field of service-based cloud computing due to its great importance to the European software industry. The working group's main goals consist in (i) sustaining the operation of the CoreGRID Network, (ii) establishing a forum encouraging collaboration between the Grid and P2P Computing research communities, and (iii) encouraging research on the role of cloud computing as a new paradigm for distributed computing in e-Science.

Frédéric Desprez, Domenico Talia, Ramin Yahyapour
Evaluating Cloud Storage Services for Tightly-Coupled Applications

The emergence of Cloud computing has given rise to numerous attempts to study the portability of scientific applications to this new paradigm. Tightly-coupled applications are a common class of scientific HPC applications, which exhibit specific requirements previously addressed by supercomputers. A key challenge towards the adoption of the Cloud paradigm for such applications is data management. In this paper, we argue that Cloud storage services represent a suitable data storage and sharing option for Cloud applications. We evaluate a distributed storage plugin for Cumulus, an S3-compatible open-source Cloud service, and we conduct a series of experiments with an atmospheric modeling application running in a private Cloud deployed on the Grid’5000 testbed. Our results, obtained on up to 144 parallel processes, show that the application is able to scale with the size of the data and the number of processes, while storing 50 GB of output data on a Cloud storage service.

Alexandra Carpen-Amarie, Kate Keahey, John Bresnahan, Gabriel Antoniu
Targeting Distributed Systems in FastFlow

FastFlow is a structured parallel programming framework targeting shared memory multi-core architectures. In this paper we introduce a FastFlow extension aimed at supporting also a network of multi-core workstations. The extension supports the execution of FastFlow programs by coordinating, in a structured way, the fine grain parallel activities running on a single workstation. We discuss the design and the implementation of this extension, presenting preliminary experimental results validating it on state-of-the-art networked multi-core nodes.

Marco Aldinucci, Sonia Campa, Marco Danelutto, Peter Kilpatrick, Massimo Torquati
Throughput Optimization for Pipeline Workflow Scheduling with Setup Times

We tackle pipeline workflow applications that are executed on a distributed platform with setup times. Several computation stages are interconnected as a linear application graph; each stage holds a buffer of limited size where intermediate results are stored, and a processor setup time occurs when passing from one stage to another. In this paper, we focus on interval mappings (consecutive stages mapped on a same processor), and the objective is throughput optimization. Even when neglecting setup times, the problem is NP-hard on heterogeneous platforms and we therefore restrict to homogeneous resources. We provide an optimal algorithm for constellations with identical buffer capacities. When buffer sizes are not fixed, we deal with the problem of allocating the buffers in shared memory and present a b/(b + 1)-approximation algorithm.

Anne Benoit, Mathias Coqblin, Jean-Marc Nicod, Laurent Philippe, Veronika Rehn-Sonigo
Meteorological Simulations in the Cloud with the ASKALON Environment

Precipitation in mountainous regions is an essential process in meteorological research for its strong impact on the hydrological cycle. To support scientists, we present the design of a meteorological application using the ASKALON environment comprising graphical workflow modeling and execution in a Cloud computing environment. We illustrate performance results that demonstrate that, although limited by Amdahl’s law, our workflow can gain important speedup when executed in a virtualized Cloud environment with important operational cost reductions. Results from the meteorological research show the usefulness of our model for determining precipitation distribution in the case of two field campaigns over Norway.

Gabriela Andreea Morar, Felix Schüller, Simon Ostermann, Radu Prodan, Georg Mayr
A Science-Gateway Workload Archive to Study Pilot Jobs, User Activity, Bag of Tasks, Task Sub-steps, and Workflow Executions

Archives of distributed workloads acquired at the infrastructure level are known to lack information about users and application-level middleware. Science gateways provide consistent access points to the infrastructure, and therefore are an interesting information source to cope with this issue. In this paper, we describe a workload archive acquired at the science-gateway level, and we show its added value on several case studies related to user accounting, pilot jobs, fine-grained task analysis, bag of tasks, and workflows. Results show that science-gateway workload archives can detect workload wrapped in pilot jobs, improve user identification, give information on distributions of data transfer times, make bag-of-task detection accurate, and retrieve characteristics of workflow executions. Some limits are also identified.

Rafael Ferreira da Silva, Tristan Glatard
Energy Adaptive Mechanism for P2P File Sharing Protocols

Peer-to-peer (P2P) file sharing applications have gained considerable popularity and are quite bandwidth- and energy-intensive. With the increased usage of P2P applications on mobile devices, their battery life has become a significant concern. In this paper, we propose a novel mechanism for energy adaptation in P2P file sharing protocols that significantly enhances the possibility of a client completing a file download before exhausting its battery. The underlying idea is to group mobile clients based on their energy budget and impose restrictions on bandwidth usage, and hence on energy consumption. This allows us to provide favoured treatment to low-energy devices, while still ensuring long-term fairness through a credit-based mechanism and preventing free riding. Furthermore, we show how the proposed mechanism can be implemented in a popular P2P file sharing application, the BitTorrent protocol, and analyze it through a comprehensive set of simulations.

Mayank Raj, Krishna Kant, Sajal K. Das

Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar’2012)

Tenth International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar’2012)

The International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar'2012) held its tenth edition in Rhodes Island, Greece. For the fourth time, the workshop was organized in conjunction with the Euro-Par annual series of international conferences dedicated to the promotion and advancement of all aspects of parallel computing.

Rosa M. Badia
Unleashing CPU-GPU Acceleration for Control Theory Applications

In this paper we review the effect of two high-performance techniques for the solution of matrix equations arising in control theory applications on CPU-GPU platforms, in particular advanced optimization via look-ahead and iterative refinement. Our experimental evaluation on the latest GPU generation from NVIDIA, “Kepler”, shows the slight advantage of matrix inversion via Gauss-Jordan elimination, when combined with look-ahead, over the traditional LU-based procedure, as well as the clear benefits of using mixed precision and iterative refinement for the solution of Lyapunov equations.

Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí, Alfredo Remón
clOpenCL - Supporting Distributed Heterogeneous Computing in HPC Clusters

Clusters that combine heterogeneous compute device architectures, coupled with novel programming models, have created a true alternative to traditional (homogeneous) cluster computing, making it possible to leverage the performance of parallel applications. In this paper we introduce clOpenCL, a platform that supports the simple deployment and efficient running of OpenCL-based parallel applications that may span several cluster nodes, expanding the original single-node OpenCL model. clOpenCL is deployed through user-level services, thus allowing OpenCL applications from different users to share the same cluster nodes and their compute devices. Data exchanges between distributed clOpenCL components rely on Open-MX, a high-performance communication library. We also present extensive experimental data and key conditions that must be addressed when exploiting clOpenCL with real applications.

Albano Alves, José Rufino, António Pina, Luís Paulo Santos
Mastering Software Variant Explosion for GPU Accelerators

Mapping algorithms in an efficient way to the target hardware poses a challenge for algorithm designers. This is particularly true for heterogeneous systems hosting accelerators like graphics cards. While algorithm developers have profound knowledge of the application domain, they often lack detailed insight into the underlying hardware of accelerators needed to exploit the provided processing power. Therefore, this paper introduces a rule-based, domain-specific optimization engine for generating the most appropriate code variant for different Graphics Processing Unit (GPU) accelerators. The optimization engine relies on knowledge fused from the application domain and the target architecture. It is embedded into a framework for designing imaging algorithms in a Domain-Specific Language (DSL). We show that this makes it possible to have one common description of an algorithm in the DSL and to select the optimal target code variant for different GPU accelerators and target languages like CUDA and OpenCL.

Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, Wieland Eckert
Exploring Heterogeneous Scheduling Using the Task-Centric Programming Model

Computer architecture technology is moving towards more heterogeneous solutions, which will contain a number of processing units with different capabilities that may increase the performance of the system as a whole. However, with increased performance comes increased complexity; complexity that is now barely handled in homogeneous multiprocessing systems. The present study tries to solve a small piece of the heterogeneous puzzle; how can we exploit all system resources in a performance-effective and user-friendly way? Our proposed solution includes a run-time system capable of using a variety of different heterogeneous components while providing the user with the already familiar task-centric programming model interface. Furthermore, when dealing with non-uniform workloads, we show that traditional approaches based on centralized or work-stealing queue algorithms do not work well and propose a scheduling algorithm based on trend analysis to distribute work in a performance-effective way across resources.

Artur Podobas, Mats Brorsson, Vladimir Vlassov
Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems

In this paper, we analyze the potential of using weights for block-asynchronous relaxation methods on GPUs. For this purpose, we introduce different weighting techniques similar to those applied in block-smoothers for multigrid methods. For test matrices taken from the University of Florida Matrix Collection we report the convergence behavior and the total runtime for the different techniques. Analyzing the results, we observe that using weights may accelerate the convergence rate of block-asynchronous iteration considerably. While component-wise relaxation methods are seldom directly applied to systems of linear equations, when used as smoothers in a multigrid framework they often provide an important contribution to finite element solvers. Since the parallelization potential of classical smoothers like SOR and Gauss-Seidel is usually very limited, replacing them by weighted block-asynchronous smoothers may be beneficial to the overall multigrid performance. Due to the increase of heterogeneity in today's architecture designs, the significance of and need for highly parallel asynchronous smoothers is expected to grow.

Hartwig Anzt, Stanimire Tomov, Jack Dongarra, Vincent Heuveline
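As a point of reference for the weighting techniques discussed above, a weighted (damped) Jacobi sweep is the synchronous, component-wise ancestor of such block-asynchronous smoothers. The sketch below is illustrative only (plain single-threaded Python on a tiny invented system), not the authors' GPU implementation:

```python
def weighted_jacobi(A, b, x, omega=0.8, iters=200):
    """Damped Jacobi relaxation:
    x_i <- (1 - omega) * x_i + omega * (b_i - sum_{j != i} a_ij * x_j) / a_ii
    In a block-asynchronous scheme, blocks of components would be updated
    independently without global synchronization; here all components are
    updated together for clarity."""
    n = len(b)
    for _ in range(iters):
        x = [(1.0 - omega) * x[i]
             + omega * (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

# Small diagonally dominant system for which the sweep converges.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]
x = weighted_jacobi(A, b, [0.0, 0.0, 0.0])
```

The weight omega damps each update; choosing it well is exactly what accelerates convergence in the weighted schemes studied in the paper.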
An Optimized Parallel IDCT on Graphics Processing Units

In this paper we present an implementation of the H.264/AVC Inverse Discrete Cosine Transform (IDCT) optimized for Graphics Processing Units (GPUs) using OpenCL. By exploiting the fact that most of the input data of the IDCT for real videos are zero-valued coefficients, a new compacted data representation is created that allows for several optimizations. Experimental evaluations conducted on different GPUs show average speedups from 1.7× to 7.4× compared to an optimized single-threaded SIMD CPU version.

Biao Wang, Mauricio Alvarez-Mesa, Chi Ching Chi, Ben Juurlink
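The key optimization described above, skipping zero-valued coefficient blocks, can be illustrated with a generic floating-point 2D IDCT. This is a sketch, not the H.264 integer transform and not the authors' OpenCL kernel: an all-zero coefficient block decodes to an all-zero pixel block, so only non-zero blocks need to be transformed.

```python
import math

def idct2(coeffs):
    """Naive 2D inverse DCT (type-III) for an n x n coefficient block."""
    n = len(coeffs)
    alpha = lambda u: math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for m in range(n):
        for p in range(n):
            s = 0.0
            for u in range(n):
                for v in range(n):
                    s += (alpha(u) * alpha(v) * coeffs[u][v]
                          * math.cos((2 * m + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * p + 1) * v * math.pi / (2 * n)))
            out[m][p] = s
    return out

def idct_blocks(blocks):
    """Transform only blocks with at least one non-zero coefficient;
    zero blocks decode to zero pixels and are skipped entirely."""
    n = len(blocks[0])
    zero = [[0.0] * n for _ in range(n)]
    return [idct2(b) if any(any(row) for row in b) else zero for b in blocks]
```

A compacted representation takes this further by storing only the non-zero blocks (plus their positions), so GPU threads never touch the zero-valued data at all.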
Multi-level Parallelization of Advanced Video Coding on Hybrid CPU+GPU Platforms

A dynamic model for parallel H.264/AVC video encoding on hybrid GPU+CPU systems is proposed. The entire inter-prediction loop of the encoder is parallelized on both the CPU and the GPU, and a computationally efficient model is proposed to dynamically distribute the computational load among these processing devices on hybrid platforms. The presented model includes both dependency aware task scheduling and load balancing algorithms. According to the obtained experimental results, the proposed dynamic load balancing model is able to push forward the computational capabilities of these hybrid parallel platforms, achieving a speedup of up to 2 when compared with other equivalent state-of-the-art solutions. With the presented implementation, it was possible to encode 25 frames per second for HD 1920×1080 resolution, even when exhaustive motion estimation is considered.

Svetislav Momcilovic, Nuno Roma, Leonel Sousa
Multi-GPU Implementation of the NICAM Atmospheric Model

Climate simulation models are used for a variety of scientific problems, and the accuracy of climate prognoses is mostly limited by the resolution of the models. Finer resolution results in more accurate prognoses but, at the same time, significantly increases computational complexity. This explains the increasing interest in High Performance Computing (HPC), and GPU computing in particular, for climate simulations. We present an efficient implementation of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on a multi-GPU environment. We have obtained performance results for up to 320 GPUs. These results were compared with the parallel CPU version and demonstrate that our GPU implementation gives 3 times higher performance than the parallel CPU version. We have also developed and validated a performance model for a full-GPU implementation of NICAM. Results show a 4.5x potential acceleration over the parallel CPU version. We believe that our results are general, in that similar applications could achieve similar speedups, and that the performance model can predict their degree over CPUs.

Irina Demeshko, Naoya Maruyama, Hirofumi Tomita, Satoshi Matsuoka
MPI vs. BitTorrent: Switching between Large-Message Broadcast Algorithms in the Presence of Bottleneck Links

Collective communication in high-performance computing is traditionally implemented as a sequence of point-to-point communication operations. For example, in MPI a broadcast is often implemented using a linear or binomial tree algorithm. These algorithms are inherently unaware of any underlying network heterogeneity. Integrating topology awareness into the algorithms is the traditional way to address this heterogeneity, and it has been demonstrated to greatly optimize tree-based collectives. However, recent research in distributed computing shows that in highly heterogeneous networks an alternative class of collective algorithms - BitTorrent-based multicasts - has the potential to outperform topology-aware tree-based collective algorithms. In this work, we experimentally compare the performance of BitTorrent and tree-based large-message broadcast algorithms in a typical heterogeneous computational cluster. We address the following question: Can the dynamic data exchange in BitTorrent be faster than the static data distribution via trees even in the context of high-performance computing? We find that both classes of algorithms have a justification of use for different settings. While on single switch clusters linear tree algorithms are optimal, once multiple switches and a bottleneck link are introduced, BitTorrent broadcasts – which utilize the network in a more adaptive way – outperform the tree-based MPI implementations.

Kiril Dichev, Alexey Lastovetsky
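The tree-based side of this comparison can be made concrete with a toy, step-counting simulation of a pipelined linear-tree broadcast (a sketch, not MPI code: ranks and links are modeled as Python lists). On a single switch, chunking the message lets transfers overlap, so P ranks receive k chunks in (P - 1) + (k - 1) steps instead of (P - 1) * k:

```python
def pipeline_broadcast(message, n_ranks, chunk_size):
    """Simulate a pipelined linear-tree broadcast: rank 0 holds the full
    message, and in every step each rank forwards its oldest not-yet-sent
    chunk to its successor."""
    chunks = [message[i:i + chunk_size]
              for i in range(0, len(message), chunk_size)]
    received = [list(chunks)] + [[] for _ in range(n_ranks - 1)]
    sent = [0] * n_ranks          # chunks each rank has already forwarded
    steps = 0
    while any(len(received[r]) < len(chunks) for r in range(1, n_ranks)):
        # Process ranks back to front so a chunk moves one hop per step.
        for r in range(n_ranks - 2, -1, -1):
            if sent[r] < len(received[r]):
                received[r + 1].append(received[r][sent[r]])
                sent[r] += 1
        steps += 1
    return ["".join(received[r]) for r in range(n_ranks)], steps
```

A BitTorrent-style multicast instead lets receivers exchange chunks among themselves, which adapts better when a bottleneck link would otherwise serialize the whole tree.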
MIP Model Scheduling for Multi-Clusters

Multi-cluster environments are composed of multiple clusters that act collaboratively, thus allowing computational problems that require more resources than those available in a single cluster to be treated. However, the degree of complexity of the scheduling process is greatly increased by the resources heterogeneity and the co-allocation process, which distributes the tasks of parallel jobs across cluster boundaries.

In this paper, the authors propose a new MIP model which determines the best scheduling for all the jobs in the queue, identifying their resource allocation and execution order to minimize the overall makespan. The results show that the proposed technique produces a highly compact scheduling of the jobs, yielding better resource utilization and a lower overall makespan. This makes the proposed technique especially useful for environments dealing with limited resources and large applications.

Héctor Blanco, Fernando Guirado, Josep Lluís Lérida, V. M. Albornoz
Systematic Approach in Optimizing Numerical Memory-Bound Kernels on GPU

The use of GPUs has been very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high performance numerical libraries like CUBLAS, MAGMA, and CULA provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both CPUs and GPUs. GPUs usually score better performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs, by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology proved to outperform existing state-of-the-art GPU implementations for the symmetric matrix-vector multiplication (SYMV), characterized by an irregular data access pattern, in a recent work (Abdelfattah et al., VECPAR 2012). We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it very influential in calculating the Hessenberg and bidiagonal reductions of general matrices (radar applications), which are the first step toward computing eigenvalues and singular values, respectively. Considering small and medium size matrices (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performance engenders an average 30% (SP) and 15% (DP) improvement in Hessenberg reduction, and up to 25% (SP) and 14% (DP) improvement for the bidiagonal reduction, over the implementation provided by CUBLAS 5.0.

Ahmad Abdelfattah, David Keyes, Hatem Ltaief
HiBB 2012: 3rd Workshop on High Performance Bioinformatics and Biomedicine

High-throughput technologies (e.g. microarray and mass spectrometry) and clinical diagnostic tools (e.g. medical imaging) are producing an increasing amount of experimental and clinical data, ushering in the so-called age of Big Data in the Biosciences. In such a scenario, large-scale databases and bioinformatics tools are key instruments for organizing and exploring biological and biomedical data with the aim of discovering new knowledge in biology and medicine. However, the storage, preprocessing and analysis of experimental data are becoming the main bottleneck of the biomedical analysis pipeline.

Mario Cannataro
Using Clouds for Scalable Knowledge Discovery Applications

Cloud platforms provide scalable processing and data storage and access services that can be exploited for implementing high-performance knowledge discovery systems and applications. This paper discusses the use of Clouds for the development of scalable distributed knowledge discovery applications. Service-oriented knowledge discovery concepts are introduced, and a framework for supporting high-performance data mining applications on Clouds is presented. The system architecture, its implementation, and current work aimed at supporting the design and execution of knowledge discovery applications modeled as workflows are described.

Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
P3S: Protein Structure Similarity Search

Similarity search in protein structure databases is an important task of computational biology. To reduce the time required to search for similar structures, indexing techniques are often introduced. However, as the indexing phase is computationally very expensive, it becomes useful only when a large number of searches is expected (so that the expensive indexing cost is amortized by the cheaper search cost). This is a typical situation for a public similarity search service. In this article we introduce the P3S web application (http://siret.cz/p3s), which, given a query structure, identifies the set of the most similar structures in a database. The result set can be browsed interactively, including visual inspection of the structure superposition, or it can be downloaded as a zip archive. P3S employs the SProt similarity measure and an indexing technique based on the LAESA method, both introduced recently by our group. Together with the measure and the index, the method presents an effective and efficient tool for querying protein structure databases.

Jakub Galgonek, Tomáš Skopal, David Hoksza
On the Parallelization of the SProt Measure and the TM-Score Algorithm

Similarity measures for protein structures are quite complex and require significant computational time. We propose a parallel approach to this problem to fully exploit the computational power of current CPU architectures. This paper summarizes experience and insights acquired from the parallel implementation of the SProt similarity method, its database access method, and also the well-known TM-score algorithm. The implementation scales almost linearly with the number of CPUs and achieves a 21.4× speedup on a 24-core system. The implementation is currently employed in the web application http://siret.cz/p3s.

Jakub Galgonek, Martin Kruliš, David Hoksza
Stochastic Simulation of the Coagulation Cascade: A Petri Net Based Approach

In this paper we develop a Stochastic Petri Net (SPN) based model that introduces uncertainty to capture the variability of biological systems. The coagulation cascade, one of the most complex biochemical networks, has been widely analyzed in the literature, mostly with ordinary differential equations, outlining the general behavior but without pointing out the intrinsic variability of the system. Moreover, computer simulation allows the assessment of the reactions over a broad range of conditions and provides a useful tool for the development and management of several observational studies, potentially customizable for each patient. We describe an SPN model for the Tissue Factor induced coagulation cascade that is more intuitive and suitable, in terms of bioclinical manageability, than models that have hitherto appeared in the literature. The SPN has been simulated using the Tau-Leaping Stochastic Simulation Algorithm, and in order to simulate a large number of models and test different scenarios, we run the simulations using High Performance Computing. We analyze different settings of the model representing the cases of “healthy” and “unhealthy” subjects, comparing their average behavior and their inter- and intra-variability in order to gain valuable biological insights.

Davide Castaldi, Daniele Maccagnola, Daniela Mari, Francesco Archetti
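The Tau-Leaping algorithm mentioned above advances the system by a fixed time step and draws the number of reaction firings from a Poisson distribution, instead of simulating every single event. A minimal sketch for one hypothetical degradation reaction A -> 0 (illustrative only; the paper's coagulation model involves many coupled reactions):

```python
import math
import random

def poisson(lam, rng):
    """Knuth's inversion method; adequate for the small means used here."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def tau_leap_decay(n0, c, tau, t_end, seed=42):
    """Tau-leaping for the reaction A -> 0 with rate constant c: in each
    leap of length tau, the number of firings is Poisson-distributed with
    mean a(n) * tau, where the propensity is a(n) = c * n."""
    rng = random.Random(seed)
    n, t = n0, 0.0
    trajectory = [(t, n)]
    while t < t_end and n > 0:
        firings = min(n, poisson(c * n * tau, rng))  # never drive n negative
        n -= firings
        t += tau
        trajectory.append((t, n))
    return trajectory

# Molecule count decays roughly as n0 * exp(-c * t), with stochastic noise.
traj = tau_leap_decay(n0=1000, c=0.5, tau=0.01, t_end=5.0)
```

Repeating such runs with different seeds is what yields the inter- and intra-variability statistics discussed in the abstract, and is why the authors turn to High Performance Computing for large batches of simulations.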
Processing the Biomedical Data on the Grid Using the UNICORE Workflow System

The huge amount of biological and biomedical data increases the demand for significant disk space and computing power to store and process them. From its beginning, the Grid has been considered as a possibility to provide such resources for the life sciences community. In this paper the authors focus on the UNICORE system, which enables scientists to access Grid resources in a seamless and secure way. The authors have used the UNICORE middleware to automate the experimental and computational procedure for determining the spectrum of mutations in mitochondrial genomes of normal and colorectal cancer cells. The computational and storage resources have been provided by the Polish National Grid Infrastructure PL-Grid.

Marcelina Borcz, Rafał Kluszczyński, Katarzyna Skonieczna, Tomasz Grzybowski, Piotr Bała
Multicore and Cloud-Based Solutions for Genomic Variant Analysis

Genomic variant analysis is a complex process that makes it possible to find and study genome mutations. For this purpose, analyses and tests from both biological and statistical points of view must be conducted. Biological data for this kind of analysis are typically stored in the Variant Call Format (VCF), in gigabyte-sized files that cannot be efficiently processed using conventional software.

In this paper, we introduce part of the High Performance Genomics (HPG) project, whose goal is to develop a collection of efficient and open-source software applications for the genomics area. The paper is mainly focused on HPG Variant, a suite for obtaining the effect of mutations and conducting genome-wide and family-based analyses, using a multi-tier architecture based on the CellBase database and a RESTful web service API. Two user clients are also provided: an HTML5 web client and a command-line interface, both using a back-end parallelized with OpenMP. Along with HPG Variant, a library for handling VCF files and a collection of utilities for preprocessing VCF files have been developed.

Positive performance results are shown in comparison with other applications such as PLINK, GenABEL, SNPTEST or VCFtools.
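The fixed-column layout of VCF mentioned in this abstract can be illustrated with a tiny parser for one data line (a hypothetical sketch for illustration only; it is not part of the HPG Variant library, and real VCF files need header handling and per-sample columns as well):

```python
def parse_vcf_line(line):
    """Parse the eight mandatory columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),                  # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),            # several alternate alleles allowed
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        # INFO is a ";"-separated list of key=value pairs or bare flags
        "info": dict(kv.split("=", 1) if "=" in kv else (kv, True)
                     for kv in info.split(";")),
    }
```

Processing gigabyte-sized files line by line in this style keeps memory use flat, which is one reason a dedicated, parallelized VCF library pays off.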

Cristina Y. González, Marta Bleda, Francisco Salavert, Rubén Sánchez, Joaquín Dopazo, Ignacio Medina
A Novel Implementation of Double Precision and Real Valued ICA Algorithm for Bioinformatics Applications on GPUs

Several applications in the field of bioinformatics require extracting individual source signals from a large amount of observed data (signal mixtures). Among the available solutions, one possible approach is independent component analysis (ICA). However, this computationally intensive algorithm is not suitable for many real-time or large-scale data applications, which calls for speeding up its execution. Recently, graphics processing units (GPUs) have emerged as general-purpose parallel processing accelerators. This platform has the potential to be leveraged in processing the large amounts of signals received from medical devices such as EEG and ECG tools. This work provides an implementation of an ICA algorithm, Joint Approximate Diagonalization of Eigen-matrices (JADE), on a low-cost programmable graphics card using the CUDA programming toolkit. With this implementation, we achieved an overall speedup of over 7.9x for estimating 64 components, each with 9760 samples.

Amin Foshati, Farshad Khunjush
PROGENIA: An Approach for Grid Interoperability at Workflow Level

This paper addresses the problem of simulating complex large-scale experiments by using the PROGENIA Workflow Management System (WFMS), developed in the ProGenGrid (Proteomics and Genomics Grid) research project at the University of Salento, and deployed and tested in a real Grid-based problem solving environment for bioinformatics named LIBI (International Laboratory of Bioinformatics). PROGENIA aims to achieve interoperability at the workflow level, supporting the deployment of a workflow on different Grids based on the Globus, gLite and UNICORE middlewares. By using specific adapters, the workflow engine acts as a meta-scheduler, submitting jobs to different Grids. The meta-scheduler selects the available resources from a list of resources previously configured by the PROGENIA administrator, using the interfaces of the PROGENIA editor to configure a Grid (user administration, virtual organizations, resource and software management). PROGENIA will allow domain researchers to share and reuse their scientific workflows across distributed computing infrastructures. PROGENIA has been tested in several bioinformatics case studies. In particular, a test case related to protein multi-alignment, executed on the gLite and Globus middlewares, is presented.

Maria Mirto, Marco Passante, Giovanni Aloisio

OMHI 2012: First International Workshop on On-chip Memory Hierarchies and Interconnects: Organization, Management and Implementation

OMHI 2012: First International Workshop on On-chip Memory Hierarchies and Interconnects: Organization, Management and Implementation

Current CMPs include large amounts of on-chip memory storage, organized either as caches or as main memory, to avoid the huge latencies of accessing off-chip DRAM. To address internal data access latencies, a fast on-chip network interconnects the memory hierarchy within the processor chip. As a consequence, the performance, area, and power consumption of current chip multiprocessors (CMPs) are largely dominated by the design of the on-chip memory hierarchy and interconnect. This problem is aggravated by the increasing number of cores, since a wider and likely deeper on-chip memory hierarchy is required.

Julio Sahuquillo, María E. Gómez, Salvador Petit
Allocating Irregular Partitions in Mesh-Based On-Chip Networks

Modern CMPs require sophisticated resource management in order to provide good utilisation of the chip resources. Good allocation algorithms exist for compute clusters, but these are restricted to specific routing algorithms and are not easily transferable to the on-chip domain. We present a novel resource allocation algorithm, TSB, that allows partitions of any shape supported by an algorithm implemented using LBDR/uLBDR, and show that it has low complexity and utilisation comparable to UDFlex.

Samuel Rodrigo, Frank Olaf Sem-Jacobsen, Tor Skeie
Detecting Sharing Patterns in Industrial Parallel Applications for Embedded Heterogeneous Multicore Systems

Embedded devices are becoming ubiquitous, and mobile devices are becoming ever more computationally powerful. These embedded architectures present new challenges, since they execute several applications that must preserve security, share information coherently, scale, and provide the required levels of performance, while at the same time being power-efficient. The vIrtical project focuses on these challenges.

In this context, as a starting point, we tackle the characterization of applications targeted for the hardware platform developed, that is, a heterogeneous multicore SoC. The aim is to analyze memory sharing patterns in order to exploit them to make the coherence protocols more scalable and power-efficient.

We have identified that 60% of the accessed blocks are data blocks, and of those only 40% require coherence maintenance.

Albert Esteve, María Soler, Maria Engracia Gómez, Antonio Robles, José Flich
Addressing Link Degradation in NoC-Based ULSI Designs

Process variability makes silicon devices increasingly less predictable, forcing chip designers to devise techniques to avoid losing performance while preserving yield. NoC links are also affected by process variation. In fact, the probability of having faulty links in a NoC may increase considerably in future CMP systems, which are expected to be implemented with 22nm technology by 2015.

In this paper we propose a new technique to overcome the presence of failures in NoC links. The proposed mechanism, a variable phit-size NoC architecture, is intended to face both manufacturing defects and variation-induced timing errors. Our mechanism adapts link operation to the real conditions of the manufactured chip and is therefore able to keep links working in the presence of variations.

Simulation results show that most of the still available bandwidth present in links affected by process variation can be retrieved, thus avoiding the performance degradation that other mechanisms, like reducing link frequency, would introduce.

Carles Hernández, Federico Silla, José Duato
Performance and Energy Efficiency Analysis of Data Reuse Transformation Methodology on Multicore Processor

Memory latency and energy efficiency are two key constraints in high performance computing systems. Data reuse transformations aim at reducing memory latency by exploiting temporal locality in data accesses. Simultaneously, modern multicore processors provide the opportunity of improving performance with reduced energy dissipation through parallelization. In this paper, we investigate to what extent data reuse transformations, in combination with a parallel programming model on a multicore processor, can meet the challenges of memory latency and energy efficiency constraints. As a test case, a “full-search motion estimation” kernel is run on the Intel® Core™ i7-2600 processor. The Energy Delay Product (EDP) is used as a metric to compare energy efficiencies. The achieved results show that performance and energy efficiency can be improved by factors of more than 6 and 15, respectively, by exploiting a data reuse transformation methodology and a parallel programming model in a multicore system.
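The EDP metric used in this abstract is simply energy multiplied by execution time, so improvements compound when both drop. A minimal sketch of how such a comparison could be computed (function names are our own, not the paper's):

```python
def energy_delay_product(energy_joules, runtime_seconds):
    """EDP = energy * delay; lower is better."""
    return energy_joules * runtime_seconds

def edp_improvement(baseline, optimized):
    """Factor by which the optimized run improves EDP over the baseline.

    Each argument is a (energy_joules, runtime_seconds) pair.
    """
    return energy_delay_product(*baseline) / energy_delay_product(*optimized)
```

For instance, a baseline run of 200 J in 10 s versus an optimized run of 80 J in 2 s yields an EDP improvement factor of 12.5, even though energy alone improved only 2.5x.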

Abdullah Al Hasib, Per Gunnar Kjeldsberg, Lasse Natvig
Effects of Process Variation on the Access Time in SRAM Cells

As technology advances continue to shrink transistor features, microscopic variations in the number and location of dopant atoms in the channel region induce increasing electrical deviations in device parameters such as the threshold voltage. Deviations refer to mismatches with respect to the device parameters set at design time. These deviations are especially important in SRAM cells, whose transistors are constructed with minimum geometry to fulfill area constraints, since they can cause some cells to fail.

In this paper, we study the impact of threshold voltage variations on the stability of the cell for a 16nm technology node. The failure probability has been studied for the four types of SRAM failures: write, access, read, and hold. We found that, under the assumed experimental conditions, the first two types of failures can be reduced by increasing the wordline pulse width of the cell. Experimental results show that access failures can be reduced by up to 43.9% and write failures by around 23.4% by enlarging the wordline pulse to 5 times the nominal width.

Vicent Lorente, Julio Sahuquillo
Task Scheduling on Manycore Processors with Home Caches

Modern manycore processors feature a highly scalable and software-configurable cache hierarchy. For performance, manycore programmers will not only have to efficiently utilize the large number of cores but also understand and configure the cache hierarchy to suit the application. Relief from this manycore programming nightmare can be provided by task-based programming models, where programmers parallelize using tasks and an architecture-specific runtime system maps tasks to cores and, in addition, configures the cache hierarchy. In this paper, we focus on the cache hierarchy of the Tilera TILEPro64 processor, which features a software-configurable coherence waypoint called the home cache. We first show the runtime system performance bottleneck of scheduling tasks oblivious to the nature of home caches. We then demonstrate a technique in which the runtime system controls the assignment of home caches to memory blocks and schedules tasks to minimize home cache access penalties. Test results of our technique show a significant execution time improvement on selected benchmarks, leading to the conclusion that, by taking processor architecture features into account, task-based programming models can indeed provide continued performance and allow programmers to transition smoothly from the multicore to the manycore era.
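The assignment idea described in this abstract — place a block's home cache where the block is accessed most, so coherence traffic stays local — can be sketched abstractly (a hypothetical illustration with our own data model; the real runtime works on TILEPro64 hardware structures):

```python
def assign_homes(tile_accesses):
    """Pick a home tile per memory block from observed access counts.

    tile_accesses: dict tile -> {block: access_count}.
    Each block's home becomes the tile that accesses it most often, so the
    majority of its coherence traffic avoids remote home-cache hops.
    """
    best = {}  # block -> (tile, count)
    for tile, accesses in tile_accesses.items():
        for block, count in accesses.items():
            if count > best.get(block, (None, -1))[1]:
                best[block] = (tile, count)
    return {block: tile for block, (tile, _) in best.items()}
```

A scheduler can then prefer placing a task on the tile that homes most of the blocks the task touches, which is the access-penalty minimization the abstract refers to.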

Ananya Muddukrishna, Artur Podobas, Mats Brorsson, Vladimir Vlassov

ParaPhrase Workshop 2012

ParaPhrase Workshop 2012

ParaPhrase (Parallel Patterns for Adaptive Heterogeneous Multicore Systems) is a three-year FP7 EU-funded project that started in October 2011. The project aims to develop and deploy new high-level design patterns for parallel applications that support alternative parallel implementations that can be initially mapped and subsequently dynamically re-mapped to the available heterogeneous (CPU+GPU) hardware. The ParaPhrase approach leverages a two-level (or ultimately multi-level) model of parallelism, where the implementations of parallel programs are expressed in terms of interacting components, and where components from different applications are collectively mapped to the available system resources.

M. Danelutto, K. Hammond, H. Gonzalez-Velez
Using the SkelCL Library for High-Level GPU Programming of 2D Applications

Application programming for GPUs (Graphics Processing Units) is complex and error-prone, because the popular approaches, CUDA and OpenCL, are intrinsically low-level and offer no special support for systems consisting of multiple GPUs. The SkelCL library offers pre-implemented recurring computation and communication patterns (skeletons) which greatly simplify programming for single- and multi-GPU systems. In this paper, we focus on applications that work on two-dimensional data. We extend SkelCL with a matrix data type and the MapOverlap skeleton, which specifies computations that depend on neighboring elements in a matrix. The abstract data types and the high-level data (re)distribution mechanism of SkelCL shield the programmer from the low-level data transfers between the system’s main memory and multiple GPUs. We demonstrate how the extended SkelCL is used to implement real-world image processing applications on two-dimensional data. We show that, from both a productivity and a performance point of view, it is beneficial to use the high-level abstractions of SkelCL.
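The MapOverlap idea — apply a function to each matrix element together with a fixed-radius neighborhood, with out-of-bounds accesses replaced by a padding value — can be sketched on the CPU as follows (an illustration of the concept only; SkelCL itself is a C++/OpenCL library with a different API):

```python
def map_overlap(matrix, f, radius, pad):
    """Apply f to each element's (2*radius+1)^2 neighborhood.

    f receives the neighborhood as a list of rows; elements outside the
    matrix are replaced by the padding value `pad`, mirroring the boundary
    handling a MapOverlap-style skeleton must provide.
    """
    rows, cols = len(matrix), len(matrix[0])

    def at(i, j):
        return matrix[i][j] if 0 <= i < rows and 0 <= j < cols else pad

    return [[f([[at(i + di, j + dj) for dj in range(-radius, radius + 1)]
                for di in range(-radius, radius + 1)])
             for j in range(cols)]
            for i in range(rows)]
```

Typical image filters fit this shape directly, e.g. a box blur is `map_overlap(img, lambda nb: sum(map(sum, nb)) / 9, 1, 0)`; a GPU skeleton additionally tiles the matrix and manages halo transfers between devices.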

Michel Steuwer, Sergei Gorlatch, Matthias Buß, Stefan Breuer
Structured Data Access Annotations for Massively Parallel Computations

We describe an approach aimed at addressing the issue of joint exploitation of control (stream) and data parallelism in a skeleton based parallel programming environment, based on annotations and refactoring. Annotations drive efficient implementation of a parallel computation. Refactoring is used to transform the associated skeleton tree into a more efficient, functionally equivalent skeleton tree. In most cases, cost models are used to drive the refactoring process. We show how sample use case applications/kernels may be optimized and discuss preliminary experiments with FastFlow assessing the theoretical results.

Marco Aldinucci, Sonia Campa, Peter Kilpatrick, Massimo Torquati

PROPER 2012: Fifth Workshop on Productivity and Performance – Tools for HPC Application Development

PROPER 2012: Fifth Workshop on Productivity and Performance – Tools for HPC Application Development

Using simulation codes in science and engineering has become commonplace. Writing such code and ensuring that it runs correctly and efficiently on large numbers of processors and cores is, however, still challenging. Software tools can assist developers of parallel applications in their often tedious tasks of debugging, correctness checking, and measuring and analyzing performance. Thus, such tools can considerably accelerate the development process of complex simulation codes.

Bettina Krammer
Performance Engineering: From Numbers to Insight

The ultimate purpose of running simulation tasks on high performance computers is to solve numerical problems. The performance of an algorithm, or rather of an implementation, is significant in several respects: either a given problem should be solved in the least possible amount of time, or a larger problem should be solved in an “acceptable” time; in both cases, the resources used must be utilized as efficiently as possible so that overall throughput and return on investment are maximized for all users of a system.

Georg Hager
Runtime Function Instrumentation with EZTrace

High-performance computing relies more and more on complex hardware: multiple computers, multiprocessor computers, multi-core processing units, multiple general-purpose graphics processing units, and so on. To efficiently exploit the power of current computing architectures, modern applications rely on a high level of parallelism. To analyze and optimize these applications, it is necessary to track the software behavior with minimal impact on the software, extracting the time consumption of code sections as well as resource usage (e.g., network messages).

In this paper, we present a method for instrumenting functions in a binary application. This method collects data at the entry and exit of a function, allowing the execution of an application to be analyzed. We implemented this mechanism in EZTrace, and the evaluation shows a significant improvement compared to other instrumentation tools.

Charles Aulagnon, Damien Martin-Guillerez, François Rué, François Trahay
Compiler Help for Binary Manipulation Tools

Parsing machine code is the first step for most analyses performed on binary files. These analyses build control flow graphs (CFGs). In this work we propose a compilation mechanism that augments binary files with information about where each basic block is located and how the blocks are connected to each other. This information makes it unnecessary to analyze most instructions in a binary during the initial CFG build process. As a result, binary analysis tools experience dramatically increased parsing speeds: 3.8x on average.

Tugrul Ince, Jeffrey K. Hollingsworth
On the Instrumentation of OpenMP and OmpSs Tasking Constructs

Parallelism has become more and more commonplace with the advent of multicore processors. Although different parallel programming models have arisen to exploit the computing capabilities of such processors, developing applications that benefit from these processors may not be easy. Worse, the performance achieved by the parallel version of an application may not be what the developer expected, as a result of suboptimal utilization of the resources offered by the processor.

We present in this paper a fruitful synergy between a shared memory parallel compiler and runtime and a performance extraction library. The objective of this work is not only to shorten the performance analysis life-cycle when parallelizing an application, but also to enrich the analysis of the parallel application by incorporating data that is only known on the compiler and runtime side. Additionally, we present performance results obtained from the execution of instrumented applications and evaluate the overhead of the instrumentation.

Harald Servat, Xavier Teruel, Germán Llort, Alejandro Duran, Judit Giménez, Xavier Martorell, Eduard Ayguadé, Jesús Labarta
Strategies for Real-Time Event Reduction

One of the most urgent issues in event tracing is the number of resulting event trace files pushing against the limits of today’s parallel file systems. To address this issue, we present strategies for real-time event reduction, which guarantee that data of an event tracing measurement fits into a single memory buffer. Therefore, they are a key step towards a complete in-memory event tracing workflow enabling event trace analysis on very high scales without the limitations of today’s parallel file systems. In addition, we define criteria to compare different reduction strategies and evaluate their benefits. Furthermore, we show how traditional memory buffering can be enhanced to realize these strategies with minimal overhead.

Michael Wagner, Wolfgang E. Nagel
A Scalable InfiniBand Network Topology-Aware Performance Analysis Tool for MPI

Over the last decade, InfiniBand (IB) has become an increasingly popular interconnect for deploying modern supercomputing systems. As supercomputing systems grow in size and scale, the impact of the IB network topology on the performance of high performance computing (HPC) applications also increases. Depending on the kind of network (fat tree, torus, or mesh), the number of network hops involved in data transfer varies. No tool currently exists that allows users of such large-scale clusters to analyze and visualize the communication pattern of HPC applications in a network topology-aware manner. In this paper, we take up this challenge and design a scalable, low-overhead InfiniBand Network Topology-Aware Performance Analysis Tool for MPI: INTAP-MPI. INTAP-MPI allows users to analyze and visualize the communication pattern of HPC applications on any IB network (fat tree, torus, or mesh). We integrate INTAP-MPI into the MVAPICH2 MPI library, allowing users of HPC clusters to seamlessly use it for analyzing their applications. Our experimental analysis shows that INTAP-MPI is able to profile and visualize the communication pattern of applications with very low memory and performance overhead at scale.

Hari Subramoni, Jerome Vienne, Dhabaleswar K. (DK) Panda
Performance Patterns and Hardware Metrics on Modern Multicore Processors: Best Practices for Performance Engineering

Many tools and libraries employ hardware performance monitoring (HPM) on modern processors, and using this data for performance assessment and as a starting point for code optimizations is very popular. However, such data is only useful if it is interpreted with care, and if the right metrics are chosen for the right purpose. We demonstrate the sensible use of hardware performance counters in the context of a structured performance engineering approach for applications in computational science. Typical performance patterns and their respective metric signatures are defined, and some of them are illustrated using case studies. Although these generic concepts do not depend on specific tools or environments, we restrict ourselves to modern x86-based multicore processors and use the likwid-perfctr tool under the Linux OS.

Jan Treibig, Georg Hager, Gerhard Wellein

Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes; e.g., high performance cluster systems designed to support tightly coupled scientific simulation codes typically utilize high-speed interconnects, while commercial cloud systems designed to support software as a service (SaaS) do not. However, in order to support HPC, all must utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.

Stephen L. Scott, Chokchai (Box) Leangsuksun
High Performance Reliable File Transfers Using Automatic Many-to-Many Parallelization

Shift is a lightweight framework for high performance local and remote file transfers that provides resiliency across a wide variety of failure scenarios. Shift supports multiple file transport protocols with automatic selection of the most appropriate mechanism between each pair of participating hosts allowing it to adapt to heterogeneous clients with differing software and network access restrictions. File system information is gathered from clients and servers to detect file system equivalence and enable path rewriting so that multiple clients can be automatically spawned in parallel to carry out both single and multi-file transfers to multiple servers selected according to load and availability. This improves both reliability and performance by eliminating single points of failure and overcoming single system bottlenecks. End-to-end integrity is provided using cryptographic hashes at the source and destination with support for partial file retransmission of only corrupted portions. This paper presents the design and implementation of Shift and details the mechanisms utilized to enhance the reliability and performance of file transfers.

Paul Z. Kolano
A Reliability Model for Cloud Computing for High Performance Computing Applications

With virtualization technology, Cloud computing utilizes resources more efficiently. A physical server can deploy many virtual machines and operating systems. However, with the increase in software and hardware components, more failures are likely to occur in the system. Hence, one should understand failure behavior in the Cloud environment in order to better utilize the cloud resources. In this work, we propose a reliability model and estimate the mean time to failure and failure rate based on a system of k nodes and s virtual machines under four scenarios. Results show that if the failure of the hardware and/or the software in the system exhibits a degree of dependency, the system becomes less reliable, which means that the failure rate of the system increases and the mean time to failure decreases. Additionally, an increase in the number of nodes decreases the reliability of the system.
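The simplest of the scenarios this abstract alludes to — k nodes each hosting s VMs, all failing independently with exponential lifetimes, where any single failure brings the service down — gives a closed-form MTTF. The following sketch illustrates that baseline case with our own parameter names (the paper's full model also covers dependent failures, which lower the MTTF further):

```python
def system_failure_rate(k, s, node_rate, vm_rate):
    """Failure rate of a series system of k nodes, each hosting s VMs.

    Assumes independent exponential failures: rates of the k nodes and the
    k*s VMs simply add up.
    """
    return k * node_rate + k * s * vm_rate

def mttf(k, s, node_rate, vm_rate):
    """Mean time to failure of the series system = 1 / total failure rate."""
    return 1.0 / system_failure_rate(k, s, node_rate, vm_rate)
```

Because the rates add, doubling the number of nodes halves the MTTF in this model, matching the abstract's observation that adding nodes decreases system reliability.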

Thanadech Thanakornworakij, Raja F. Nassar, Chokchai Leangsuksun, Mihaela Păun
The Viability of Using Compression to Decrease Message Log Sizes

Fault-tolerance and its associated overheads are of great concern for current and future extreme-scale systems. The dominant mechanism used today, coordinated checkpoint/restart, places great demands on the I/O system and requires frequent synchronization. Uncoordinated checkpointing with message logging addresses many of these limitations at the cost of increasing the storage needed to hold message logs. These storage requirements are critical to the scalability of extreme-scale systems. In this paper, we investigate the viability of using standard compression algorithms to reduce message log sizes for a number of key high-performance computing workloads. Using these workloads we show that, while not a universal solution for all applications, compression has the potential to significantly reduce message log sizes for a great number of important workloads.
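The viability question this abstract studies can be prototyped in a few lines: concatenate logged messages and measure how much a standard compressor shrinks them (a toy sketch using zlib/DEFLATE as a stand-in; the paper evaluates real workloads and compressors, and real logs must also record message metadata and ordering):

```python
import zlib

def log_compression_ratio(messages, level=6):
    """Compress a concatenated message log with DEFLATE.

    Returns original_size / compressed_size; values well above 1 indicate
    the log is a good candidate for compression.
    """
    log = b"".join(messages)
    compressed = zlib.compress(log, level)
    return len(log) / len(compressed)
```

Logs from regular communication patterns (e.g. repeated halo exchanges with similar payloads) tend to compress very well, which is exactly the effect the paper exploits.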

Kurt B. Ferreira, Rolf Riesen, Dorian Arnold, Dewan Ibtesham, Ron Brightwell
Resiliency in Exascale Systems and Computations Using Chaotic-Identity Maps

For exascale computing systems, we propose (i) light-weight computational modules that utilize chaotic computations and customized identity maps to detect component failures, and (ii) statistical estimation methods that generate robustness estimates for the system and computations based on the module outputs. The diagnosis modules execute multiple Poincaré and identity maps, which are customized to detect certain classes of failures in the compute nodes and interconnects. We propose statistical methods that generate robustness estimates for the system using the outputs of pipelined chains of diagnosis modules.

Nageswara S. V. Rao
Programming Model Extensions for Resilience in Extreme Scale Computing

The challenge of resilience is becoming increasingly important on the path to exascale capability in High Performance Computing (HPC) systems. With clock frequencies unlikely to increase as aggressively as they have in the past, future large-scale HPC systems aspiring to exaflop capability will need an exponential increase in the number of ALUs and memory modules deployed in their design [Kogge 2008]. The Mean Time to Failure (MTTF) of the system, however, scales inversely with the number of components in the system. Furthermore, these systems will be constructed from devices that are far less reliable than those used today, as transistor geometries shrink and failures due to chip manufacturing variability, transistor aging, and transient soft errors become more prevalent. Therefore, the sheer scale of future exascale supercomputers, together with the shrinking VLSI geometries, will conspire to make faults and failures increasingly the norm rather than the exception.

Saurabh Hukerikar, Pedro C. Diniz, Robert F. Lucas
User Level Failure Mitigation in MPI

In a constant effort to deliver steady performance improvements, the size of High Performance Computing (HPC) systems, as observed in the Top 500 ranking, has grown tremendously over the last decade. This trend, along with the resultant decrease in the Mean Time Between Failures (MTBF), is unlikely to stop; thus, many computing nodes will inevitably fail during application execution [5]. It is alarming that the most popular fault tolerant approaches see their efficiency plummet at exascale [3,4], calling for more efficient approaches evolving around application-centric failure mitigation strategies [7].

Wesley Bland

UCHPC 2012: Fifth Workshop on UnConventional High Performance Computing

UCHPC 2012: Fifth Workshop on UnConventional High Performance Computing

As the word “UnConventional” in the title suggests, the workshop focuses on hardware and platforms for HPC that were not intended for HPC in the first place. Reasons for their use could be raw computing power, good performance per watt, or low cost in general. To make the best use of such unconventional hardware, new programming approaches and paradigms are often required. Thus, a second focus of the workshop, introduced for the first time with UCHPC 2012, is on innovative new programming models for unconventional hardware and on how to best combine its computing power with more conventional systems.

Anders Hast, Josef Weidendorfer, Jan-Philipp Weiss
A Methodology for Efficient Use of OpenCL, ESL and FPGAs in Multi-core Architectures

OpenCL has been proposed as an open standard for application development on heterogeneous multi-core architectures, utilizing different CPU, DSP and GPU types and configurations. Recently, technological advances in FPGA devices have turned the parallel processing community towards them. However, FPGA programming requires expertise in a different field as well as the appropriate tools and methodologies. A feasible solution introduced recently is the adoption of ESL and high-level synthesis methodologies, which support FPGA programming from C/C++. Based on high-level synthesis, this paper presents a methodology for using OpenCL as an FPGA programming environment. Specifically, the opportunities as well as the obstacles imposed on the application developer by the FPGA computing platform and the adoption of C/C++ as the input language are presented, and a systematic way to explore both data-level and thread-level parallelism is given. The resulting methodology can be used for the deployment of parallel applications over a wide range of diverse CPU, DSP, GPU and FPGA multi-core configurations.

Alexandros Bartzas, George Economakos
Efficient Design Space Exploration of GPGPU Architectures

The goal of this work is to revisit GPU design and introduce a fast, low-cost and effective approach to optimizing resource allocation in future GPUs. We achieve this goal by using the Plackett-Burman methodology to explore the design space efficiently. We further formulate the design exploration problem as one of constraint optimization. Our approach produces the optimal configuration in 84% of the cases, and when it does not, it produces the second-best configuration with a performance penalty of less than 3.5%. Moreover, our method reduces the number of explorations one needs to perform by as much as 78%.

Ali Jooya, Amirali Baniasadi, Nikitas J. Dimopoulos
Spin Glass Simulations on the Janus Architecture: A Desperate Quest for Strong Scaling

We describe Janus, an application-driven architecture for Monte Carlo simulations of spin glasses. Janus is a massively parallel architecture based on reconfigurable FPGA nodes; it offers two orders of magnitude better performance than commodity systems for spin glass applications. The first-generation Janus machine has been operational since early 2008; we are currently developing a new generation, which will come online in early 2013. In this paper we present the Janus architecture, describe both implementations, and compare their performance with that of commodity systems.

M. Baity-Jesi, R. A. Baños, A. Cruz, L. A. Fernandez, J. M. Gil-Narvion, A. Gordillo-Guerrero, M. Guidetti, D. Iñiguez, A. Maiorano, F. Mantovani, E. Marinari, V. Martin-Mayor, J. Monforte-Garcia, A. Muñoz-Sudupe, D. Navarro, G. Parisi, S. Perez-Gaviro, M. Pivanti, F. Ricci-Tersenghi, J. Ruiz-Lorenzo, S. F. Schifano, B. Seoane, A. Tarancon, P. Tellez, R. Tripiccione, D. Yllanes

7th Workshop on Virtualization in High-Performance Cloud Computing – VHPC2012

7th Workshop on Virtualization in High-Performance Cloud Computing – VHPC2012

Virtualization has become a common abstraction layer in modern data centers, enabling resource owners to manage complex infrastructure independently of their applications. At the same time, virtualization is becoming a driving technology for a wide range of industry-grade IT services. The cloud concept includes the notion of a separation between resource owners and users, adding services such as hosted application frameworks and queueing. Built on the same infrastructure, clouds carry significant potential for use in high-performance scientific computing. The ability of clouds to provision and release vast computing resources dynamically, at close to the marginal cost of providing the services, is unprecedented in the history of scientific and commercial computing.

Michael Alexander, Gianluigi Zanetti, Anastassios Nanos
Pre-Copy and Post-Copy VM Live Migration for Memory Intensive Applications

Virtualization technology provides a means for server consolidation, reducing the number of physical servers required to run a given workload. Virtual machine (VM) live migration transfers a running VM between physical hosts while remaining transparent to the running application. Memory-intensive applications tend to obstruct the original pre-copy live migration process and may cause the migration to fail, because memory pages are dirtied by the running application faster than they can be transferred. The focus of this paper is to present several techniques that can be applied to both pre-copy and post-copy live migration to better support the migration of memory-intensive applications.
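
The failure mode described above can be seen in a back-of-the-envelope model of iterative pre-copy: each round resends the pages dirtied during the previous round, and migration only converges if the dirtying rate is below the transfer rate. The thresholds and the constant dirty-rate model below are illustrative assumptions, not figures from the paper:

```python
# Schematic model of iterative pre-copy live migration: keep resending
# dirtied pages until the remaining set is small enough for a short
# stop-and-copy phase, or give up after max_rounds.

def precopy_rounds(total_pages, dirty_rate, pages_per_round,
                   max_rounds=30, stop_threshold=64):
    """Return (rounds used, pages left for the stop-and-copy phase)."""
    remaining = total_pages
    for rounds in range(1, max_rounds + 1):
        sent = min(remaining, pages_per_round)
        # While `sent` pages travel, the workload re-dirties a fraction of them.
        dirtied = int(sent * dirty_rate)
        remaining = remaining - sent + dirtied
        if remaining <= stop_threshold:
            return rounds, remaining
    return max_rounds, remaining  # non-convergence: fall back (e.g. to post-copy)

# Moderate dirtying converges in a few rounds...
rounds, left = precopy_rounds(total_pages=100_000, dirty_rate=0.2,
                              pages_per_round=100_000)
# ...but a memory-intensive workload (dirty_rate near 1) never does.
```

This is why memory-intensive workloads motivate post-copy variants, where the VM is resumed on the destination first and pages are fetched on demand.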

Aidan Shribman, Benoit Hudzia
Xen2MX: Towards High-Performance Communication in the Cloud

Efficient VM communication in Cloud computing infrastructures is an important aspect of HPC application deployment in clusters of VMs. In this paper we present Xen2MX, a high-performance messaging protocol, binary compatible with Myrinet/MX and wire compatible with MXoE. Its design is based on MX and its port over generic Ethernet adapters, Open-MX. Xen2MX combines the zero-copy characteristics of Open-MX with Xen’s memory sharing techniques, in order to construct the most efficient data path for high-performance communication achievable with software techniques. Using Xen2MX, we are able to reduce the round-trip latency to 14 μs, compared to directly attached devices (13 μs) and to a software bridge setup (44 μs).

Anastassios Nanos, Nectarios Koziris
Themis: Energy Efficient Management of Workloads in Virtualized Data Centers

Virtualized data centers facilitate higher resource utilization and energy efficiency through consolidation. However, mixing service-oriented workloads with throughput-oriented (batch) jobs is typically avoided due to complex interactions and widely different quality-of-service (QoS) requirements. We introduce a complete VM resource management framework, called Themis, which manages combined service and batch jobs, maximizing the energy-efficient throughput of the latter without sacrificing the service guarantees of the former. Themis’ resource management policy outperforms previously proposed policies by up to 35% on average in work done per Joule when measured on a data center testbed.
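
The trade-off such a policy navigates can be reduced to a simple decision rule: among candidate power/performance states, maximize batch work done per Joule subject to the co-located service staying within its latency bound. The state names and numbers below are illustrative assumptions, not measurements from the paper:

```python
# Toy version of a QoS-constrained energy-efficiency decision: pick the
# power state with the best batch jobs-per-Joule among those that keep
# the co-located service within its latency bound.

def pick_state(candidates, latency_bound_ms):
    """candidates: list of (name, batch_jobs_per_sec, watts, service_latency_ms)."""
    feasible = [c for c in candidates if c[3] <= latency_bound_ms]
    if not feasible:
        raise ValueError("no power state satisfies the QoS bound")
    # work per Joule = (jobs/s) / (J/s) = jobs per Joule
    return max(feasible, key=lambda c: c[1] / c[2])

states = [
    ("low",  40.0,  60.0, 9.0),   # slow but frugal
    ("mid",  70.0, 100.0, 6.0),
    ("high", 90.0, 160.0, 4.0),   # fast but power-hungry
]
best = pick_state(states, latency_bound_ms=10.0)
```

With a loose bound the most efficient state wins even though it is not the fastest; tightening the bound forces the policy toward faster, less efficient states, which is exactly the tension between service guarantees and energy-efficient batch throughput.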

Gaurav Dhiman, Vasileios Kontorinis, Raid Ayoub, Liuyi Zhang, Chris Sadler, Dean Tullsen, Tajana Simunic Rosing
Runtime Virtual Machine Recontextualization for Clouds

We introduce and define the concept of recontextualization for cloud applications by extending contextualization, i.e. the dynamic configuration of virtual machines (VMs) upon initialization, with autonomous updates at runtime. Recontextualization allows VM images and instances to be dynamically reconfigured without restarts or downtime, and the concept is applicable to all aspects of configuring a VM, from virtual hardware to multi-tier software stacks. Moreover, we propose a runtime cloud recontextualization mechanism based on virtual device management that enables recontextualization without the need to customize the guest VM. We illustrate our concept and validate our mechanism via a use case demonstration: the reconfiguration of a cross-cloud migratable monitoring service in a dynamic cloud environment. We discuss the details of the interoperable recontextualization mechanism and its architecture, and demonstrate a proof-of-concept implementation. A performance evaluation illustrates the feasibility of the approach and shows that the recontextualization mechanism performs adequately, with an overhead of 18% of the total migration time.

Django Armstrong, Daniel Espling, Johan Tordsson, Karim Djemame, Erik Elmroth
GaaS: Customized Grids in the Clouds

Cloud Computing has been widely adopted as a new paradigm for providing resources because of the advantages it brings to both users and providers. Although it initially targeted enterprises wishing to reduce their equipment management costs, it has rapidly been recognized both as an enabler of new applications and as a means of allowing enterprises of all sizes to run highly demanding applications. Recently, Cloud providers have been trying to attract new applications, such as scientific ones, that today already benefit from distributed environments like Grids. This work presents a way to remove the paradigm mismatch between Cloud and Grid Computing, enabling the use of Cloud-provided resources through well-established Grid-like interfaces and avoiding the need for users to learn new resource access and usage models. The proposed approach is validated through the development of a prototype implementation and its integration into a working Grid environment.

G. B. Barone, R. Bifulco, V. Boccia, D. Bottalico, R. Canonico, L. Carracciuolo
Backmatter
Metadata
Title
Euro-Par 2012: Parallel Processing Workshops
Editors
Ioannis Caragiannis
Michael Alexander
Rosa Maria Badia
Mario Cannataro
Alexandru Costan
Marco Danelutto
Frédéric Desprez
Bettina Krammer
Julio Sahuquillo
Stephen L. Scott
Josef Weidendorfer
Copyright Year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-36949-0
Print ISBN
978-3-642-36948-3
DOI
https://doi.org/10.1007/978-3-642-36949-0