2012 | Book

Euro-Par 2011: Parallel Processing Workshops

CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29 – September 2, 2011, Revised Selected Papers, Part II

Editors: Michael Alexander, Pasqua D’Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Di Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, Josef Weidendorfer

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the thoroughly refereed post-conference proceedings of the workshops of the 17th International Conference on Parallel Computing, Euro-Par 2011, held in Bordeaux, France, in August 2011. The papers of the 12 workshops CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, and VHPC focus on the promotion and advancement of all aspects of parallel and distributed computing.

Table of Contents

Frontmatter

HiBB 2011: 2nd Workshop on High-Performance Bioinformatics and Biomedicine

HiBB 2011: 2nd Workshop on High Performance Bioinformatics and Biomedicine

The availability of high-throughput technologies, such as microarray and mass spectrometry, and the diffusion of genomics and proteomics studies to large populations, are producing an increasing amount of experimental and clinical data. Biological databases and bioinformatics tools are key resources for organizing and exploring such biological and biomedical data with the aim of discovering new knowledge in biology and medicine. However, the storage, preprocessing and analysis of experimental data are becoming the main bottleneck of the analysis pipeline.

Mario Cannataro
On Parallelizing On-Line Statistics for Stochastic Biological Simulations

This work concerns a general technique to enrich parallel versions of stochastic simulators for biological systems with tools for on-line statistical analysis of the results. In particular, within the FastFlow parallel programming framework, we describe the methodology and the implementation of a parallel Monte Carlo simulation infrastructure extended with user-defined on-line data filtering and mining functions. The simulator and the on-line analysis were validated on large multi-core platforms and representative proof-of-concept biological systems.
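To make the idea concrete, here is a minimal sketch of a parallel Monte Carlo driver with on-line statistics, written with plain std::thread rather than the authors' FastFlow framework; the exponential draw merely stands in for one stochastic simulation run, and the names are ours.

```cpp
// Minimal sketch (plain threads, not the authors' FastFlow code): each
// worker runs Monte Carlo trials and keeps Welford on-line statistics,
// which are merged so results are available while the simulation runs.
#include <cstdio>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct OnlineStats {                     // Welford's on-line algorithm
    long n = 0; double mean = 0.0, m2 = 0.0;
    void add(double x) {
        ++n; double d = x - mean; mean += d / n; m2 += d * (x - mean);
    }
    void merge(const OnlineStats& o) {   // Chan et al. pairwise merge
        if (o.n == 0) return;
        double d = o.mean - mean; long t = n + o.n;
        mean += d * o.n / t;
        m2 += o.m2 + d * d * double(n) * o.n / t;
        n = t;
    }
};

int main() {
    const int workers = 4, trials = 1000000;
    OnlineStats global; std::mutex mtx;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&, w] {
            std::mt19937 rng(w);         // independent per-worker stream
            std::exponential_distribution<> draw(1.0);
            OnlineStats local;
            for (int i = 0; i < trials; ++i)
                local.add(draw(rng));    // stand-in for one stochastic run
            std::lock_guard<std::mutex> g(mtx);
            global.merge(local);         // on-line reduction step
        });
    for (auto& t : pool) t.join();
    std::printf("mean=%f var=%f\n", global.mean,
                global.n > 1 ? global.m2 / (global.n - 1) : 0.0);
}
```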

Marco Aldinucci, Mario Coppo, Ferruccio Damiani, Maurizio Drocco, Eva Sciacca, Salvatore Spinella, Massimo Torquati, Angelo Troina
Scalable Sequence Similarity Search and Join in Main Memory on Multi-cores

Similarity-based queries play an important role in many large scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are held entirely in main memory. Parallelization of searches and joins is performed using MapReduce as the underlying execution paradigm. We show that our data structure is capable of supporting many real-world sequence-comparison applications in main memory. Our evaluation reveals that PeARL reaches a significant performance gain compared to single-threaded solutions. However, the evaluation also shows that scalability should be further improved, e.g., by reducing sequential parts of the algorithms.
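As a rough illustration of the indexing idea (a toy stand-in, not PeARL itself, which uses compressed tries and MapReduce-style parallel joins), the sketch below builds a plain trie over the DNA alphabet and answers lookups within a bounded Hamming distance; independent queries shard naturally across threads, which is where the map step would come in.

```cpp
// Toy trie index over {A,C,G,T} with a mismatch-bounded lookup.
#include <array>
#include <cstdio>
#include <memory>
#include <string>

struct Node {
    std::array<std::unique_ptr<Node>, 4> child; // A, C, G, T
    bool terminal = false;
};

int idx(char c) { return c=='A'?0 : c=='C'?1 : c=='G'?2 : 3; }

void insert(Node* n, const std::string& s) {
    for (char c : s) {
        auto& ch = n->child[idx(c)];
        if (!ch) ch = std::make_unique<Node>();
        n = ch.get();
    }
    n->terminal = true;
}

// Does some indexed string match q with at most k mismatches?
bool search(const Node* n, const std::string& q, size_t pos, int k) {
    if (k < 0) return false;
    if (pos == q.size()) return n->terminal;
    static const char ALPHA[] = "ACGT";
    for (int b = 0; b < 4; ++b)
        if (n->child[b] &&
            search(n->child[b].get(), q, pos + 1, k - (ALPHA[b] != q[pos])))
            return true;
    return false;
}

int main() {
    Node root;
    for (auto* s : {"ACGT", "ACGA", "TTTT"}) insert(&root, s);
    std::printf("%d\n", search(&root, "ACGG", 0, 1)); // 1: ACGT is 1 away
    // Independent queries are embarrassingly parallel: shard them across
    // threads (the map step); a join then merges per-shard hit lists.
}
```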

Astrid Rheinländer, Ulf Leser
Enabling Data and Compute Intensive Workflows in Bioinformatics

Accelerated growth in the field of bioinformatics has resulted in large data sets being produced and analyzed. With this rapid growth has come the need to analyze these data in a quick, easy, scalable, and reliable manner on a variety of computing infrastructures including desktops, clusters, grids and clouds. This paper presents the application of workflow technologies, and, specifically, Pegasus WMS, a robust scientific workflow management system, to a variety of bioinformatics projects from RNA sequencing, proteomics, and data quality control in population studies using GWAS data.

Gaurang Mehta, Ewa Deelman, James A. Knowles, Ting Chen, Ying Wang, Jens Vöckler, Steven Buyske, Tara Matise
Homogenizing Access to Highly Time-Consuming Biomedical Applications through a Web-Based Interface

The exponential increase in the production of biomedical data is forcing a higher level of automation in its analysis. Therefore, biomedical researchers have to entrust bioinformaticians with developing software able to process huge amounts of data on high performance Unix-based servers. However, most of this software is developed with a very basic, text-based user interface, usually because of a lack of time. In addition, the applications are developed as independent tools, yielding a set of specific programs with very different usage patterns. This implies that final users continuously need developer support. Even worse, in many situations only the developers themselves are able to run the software every time it is required. In this contribution we present a Web-based user interface that homogenizes the way users interact with the applications installed on a server. This way, authorized users can add their applications to the Web site at very low cost, and researchers with no special computational skills can perform analyses by themselves, gaining the independence to run applications whenever they want at the cost of very little effort. The application is portable to any Unix-like system with a PHP+MySQL server.

Luigi Grasso, Nuria Medina-Medina, Rosana Montes-Soldado, María M. Abad-Grau
Distributed Management and Analysis of Omics Data

The omics term refers to different biology disciplines such as genomics, proteomics, or interactomics. The suffix -ome is used to indicate the objects of study of such disciplines, such as the genome, proteome, or interactome, and usually refers to a totality of some sort. This paper introduces omics data and the main computational techniques for their storage, preprocessing and analysis. The increasing availability of omics data due to the advent of high-throughput technologies poses novel issues in data management and analysis that can be faced by parallel and distributed storage systems and algorithms. After a survey of the main omics databases, preprocessing techniques and analysis approaches, the paper describes some recent bioinformatics tools in genomics, proteomics and interactomics that use a distributed approach.

Mario Cannataro, Pietro Hiram Guzzi

Managing and Delivering Grid Services (MDGS)

Managing and Delivering Grid Services (MDGS)

The aim of the MDGS workshop is to bring together Grid experts from the (Grid) infrastructure community with experts in IT service management in order to present and discuss the state of the art in managing the delivery of ICT services, and how to apply these concepts and techniques to Grid environments. Up to now, work in this area has proceeded mostly on a best-effort basis, with little use made of the processes and approaches of professional (often commercial) IT service management (ITSM).

Thomas Schaaf, Adam S. Z. Belloum, Owen Appleton, Joan Serrat-Fernández, Tomasz Szepieniec
Resource Allocation for the French National Grid Initiative

Distribution of resources between different communities in production grids is the combined result of needs and policies: where the users’ needs dictate what is required, resource providers’ policies define how much is offered and how it is offered. From a provider point of view, getting a comprehensive and fair understanding of resources distribution is then a key element for the establishment of any scientific policy, and a prerequisite for delivering a high quality of service to users.

The resource allocation model currently applied within most national grid initiatives (NGIs) was designed to meet the needs of the EGEE (Enabling Grids for E-sciencE) projects and should now be revised: NGIs especially need to assess how resources and services are delivered to their national community, and to expose the return on investment for resources delivered to international communities.

The French NGI “France Grilles” is currently investigating this route, trying to define key principles for a national resource allocation strategy that would address this concern while allowing for the proper definition of service level agreements (SLAs) between users, providers and the NGI itself.

After establishing clear definitions of the communities we are dealing with, we look at how resource allocation is done in other environments such as high performance computing (HPC), and at which concepts we could reuse from there while keeping the specificities of the Grid. We then review different use cases and scenarios before concluding with a proposal which, together with open questions, could constitute a base for a resource allocation strategy for the French national grid.

Gilles Mathieu, Hélène Cordier
On Importance of Service Level Management in Grids

The recent years saw an evolution of Grid technologies from early ideas to production deployments. At the same time, the expectations for Grids shifted from idealistic hopes — buoyed by the successes of the initial testbeds — to disillusionment with available implementations when applied to large-scale general purpose computing. In this paper, we argue that a mature e-Infrastructure aiming to bridge the gaps between visions and realities cannot be delivered without introducing Service Level Management (SLM). To support this thesis, we present an analysis of the Grid foundations and definitions showing that SLM-related ideas were incorporated in them from the beginning. Next, we describe how implementing SLM in Grids could improve the usability and user experience of the infrastructure, both for its customers and service providers. We also present a selection of real-life Grid application scenarios that are important for the research communities supported by the Grid, but cannot be efficiently supported without an SLM process in place. In addition, the paper contains an introduction to SLM, a discussion of what introducing SLM to Grids might mean in practice, and a review of the current efforts already applied in this field.

Tomasz Szepieniec, Joanna Kocot, Thomas Schaaf, Owen Appleton, Matti Heikkurinen, Adam S. Z. Belloum, Joan Serrat-Fernández, Martin Metzker
On-Line Monitoring of Service-Level Agreements in the Grid

Monitoring of Service Level Agreements is a crucial phase of SLA management. In the most challenging case, monitoring of SLA fulfillment is required in (near) real time and needs to combine performance data regarding multiple distributed services and resources. Currently existing Grid monitoring and information services do not provide adequate on-line monitoring capabilities for this case. We present an application of Complex Event Processing (CEP) principles and technologies for on-line SLA monitoring in the Grid. The capabilities of the presented SLA monitoring framework include (1) on-demand definition of SLA metrics using a high-level query language; (2) real-time calculation of the defined SLA metrics; (3) advanced query capabilities which allow for defining high-level complex metrics derived from basic metrics. SLA monitoring of data-intensive grid jobs serves as a case study to demonstrate the capabilities of the approach.
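A minimal sketch of the on-line principle (invented metric, events and threshold; this is not the paper's query language): derive a rolling-window throughput metric from a stream of monitoring events and flag SLA violations as events arrive rather than after the fact.

```cpp
// Rolling 60 s throughput derived from a stream of monitoring events,
// checked against an (assumed) SLA floor on every arrival.
#include <cstdio>
#include <deque>

struct Event { double t_sec; double bytes; };   // one monitoring record

class RollingThroughput {
    std::deque<Event> win_;
    double window_, sum_ = 0;
public:
    explicit RollingThroughput(double window_sec) : window_(window_sec) {}
    double push(const Event& e) {               // returns bytes/sec
        win_.push_back(e); sum_ += e.bytes;
        while (win_.front().t_sec < e.t_sec - window_) {
            sum_ -= win_.front().bytes; win_.pop_front();
        }
        return sum_ / window_;
    }
};

int main() {
    RollingThroughput tp(60.0);
    const double sla_floor = 1e6;               // agreed minimum: 1 MB/s
    for (double t = 0; t < 300; t += 1.0) {     // synthetic event stream
        double rate = tp.push({t, t < 150 ? 2e6 : 0.5e6});
        if (rate < sla_floor)
            std::printf("t=%5.0f s: SLA violation, %.1f KB/s\n", t, rate / 1e3);
    }
}
```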

Bartosz Balis, Renata Slota, Jacek Kitowski, Marian Bubak
Challenges of Future e-Infrastructure Governance

A shift of interest of both providers and consumers, from resource provisioning towards a system of infrastructure services and towards a governance system for e-Infrastructures based on a user-centric approach, can be observed nowadays. Applying service level management tools and procedures in e-Infrastructure service provision allows users, service providers and funding agencies to examine e-Infrastructure services in view of individual use cases. The shift should be sustained by legal structures, strategic and financial plans, as well as by openness, neutrality and diversity of resources and services. e-IRG, as an e-Infrastructure policy forum, envisioned these trends and needs and expressed its position in a recent white paper, which is briefly presented here and discussed from the perspective of building the future research agendas of individual teams.

Dana Petcu
Influences between Performance Based Scheduling and Service Level Agreements

The allocation of resources to jobs running on e-Science infrastructures is a key issue for scientific communities. To improve the efficiency of computational jobs, we propose an SLA-aware architecture whose core is a scheduler relying on resource performance information. For performance characterization we propose a two-level benchmark that includes tests corresponding to specific e-Science applications. We evaluate the proposal through simulation results for this architecture.

Antonella Galizia, Alfonso Quarati, Michael Schiffers, Mark Yampolskiy
User Centric Service Level Management in mOSAIC Applications

Service Level Agreements (SLAs) aim at offering a simple and clear way to build an agreement between final users and service providers in order to establish what is effectively granted by the cloud provider. In this paper we present the SLA-related activities in mOSAIC, a European-funded project that aims at exploiting a new programming model which fully embraces the flexibility and dynamicity of the cloud environment, in order to build a dedicated solution for SLA management. The key idea of SLA management in mOSAIC is that it is impossible to offer a single, static, general-purpose solution for SLA management of any kind of application, but it is possible to offer a set of micro-functionalities that can easily be integrated with one another to build a solution dedicated to the application developer's problem. Thanks to the mOSAIC API approach (which enables easy interoperability among mOSAIC components), it is possible to build applications enriched with user-oriented SLA management from the very early development stages.

Massimiliano Rak, Rocco Aversa, Salvatore Venticinque, Beniamino Di Martino
Service Level Management for Executable Papers

Reproducibility is considered one of the main principles of the scientific method and refers to the ability of an experiment to be accurately reproduced by a third party; in complex experiments, every detail matters to ensure correct reproducibility. In the context of ICCS 2011, Elsevier organized the Executable Paper Grand Challenge, a contest to improve the way scientific information is communicated and used. While the contest focused on developing methods and techniques to realize the idea of executable papers, in this paper we focus on the operational issues related to the creation of a viable service with a predefined QoS.

Reginald Cushing, Spiros Koulouzis, Rudolf Strijkers, Adam S. Z. Belloum, Marian Bubak
Change Management in e-Infrastructures to Support Service Level Agreements

Service Level Agreements (SLAs) are a common instrument for outlining the responsibility scope of collaborating organizations. They are indispensable for a wide range of industrial and business applications. However, until now SLAs have not received much attention from the research organizations that are cooperating to provide comprehensive and sustainable computing infrastructures, or e-Infrastructures (eIS), to support the European scientific community. Since many eIS projects have left their development state and are now offering highly mature services, the IT service management aspect becomes relevant.

In this article we concentrate on the inter-organizational change management process. At present, it is very common for eIS changes to be autonomously managed by the individual resource providers. Yet such changes can affect the overall eIS availability and thus have an impact on SLA metrics, such as performance characteristics and quality of service. We introduce the problem field with the help of a case study that outlines and compares the change management processes defined by PRACE and by LRZ, one of the PRACE eIS partners and resource providers. Our analysis shows that each of the organizations adopts and follows a distinct and incompatible operational model. Following that, we demonstrate how UMM, a modeling method based on UML and developed by UN/CEFACT, can be applied to the design of an inter-organizational change management process. The advantage of this approach is the ability to design both internal and inter-organizational processes with the help of uniform methods. An evaluation of the proposed technique and a conclusion end our article.

Silvia Knittl, Thomas Schaaf, Ilya Saverchenko

PROPER 2011: Fourth Workshop on Productivity and Performance: Tools for HPC Application Development

PROPER 2011: Fourth Workshop on Productivity and Performance Tools for HPC Application Development

The PROPER workshop addresses the need for productivity and performance in high performance computing. Productivity is an important objective during the development phase of HPC applications and their later production phase. Paying attention to performance is important for achieving efficient usage of HPC machines. At the same time, it is needed for scalability, which is crucial in two ways: first, to use higher degrees of parallelism to reduce the wall clock time; and second, to cope with the next bigger problem, which requires more CPUs, memory, etc. to be computed at all.

Michael Gerndt
Scout: A Source-to-Source Transformator for SIMD-Optimizations

We present Scout, a configurable source-to-source transformation tool designed to automatically vectorize C source code. Scout provides the means to vectorize loops using SIMD instructions at source level. Our main focus during the development of Scout was maximum flexibility of the tool in two ways: being capable of vectorizing a wide range of loop constructs, and being capable of targeting various modern SIMD architectures. Scout supports several SIMD instruction sets such as SSE or AVX and is easily extensible to upcoming ones.

In the second part of the paper we present results of applying Scout's vectorizing capabilities to CFD production codes of the German Aerospace Center. The complex loops used in these codes often inhibit automatic vectorization by usual C compilers. In contrast, Scout is able to vectorize most of these loops. We measured the resulting speedup for SSE and AVX platforms.
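The kind of rewrite Scout automates can be pictured with a hand-vectorized example (illustrative only, SSE and single precision; Scout itself generates such code from the original loop):

```cpp
// A scalar update loop and its 4-wide SSE equivalent.
#include <immintrin.h>

void saxpy_scalar(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

void saxpy_sse(float* y, const float* x, float a, int n) {
    __m128 va = _mm_set1_ps(a);                  // broadcast a to 4 lanes
    int i = 0;
    for (; i + 4 <= n; i += 4) {                 // vectorized main loop
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                           // scalar remainder loop
        y[i] += a * x[i];
}
```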

Olaf Krzikalla, Kim Feldhoff, Ralph Müller-Pfefferkorn, Wolfgang E. Nagel
Scalable Automatic Performance Analysis on IBM BlueGene/P Systems

Nowadays scientific endeavor becomes ever hungrier for the computational power of state-of-the-art supercomputers. However, the current trend of performance increase comes along with a tremendous increase in power consumption. One approach to overcoming this issue is the tight coupling of simplified low-frequency cores into a massively parallel system, such as the IBM BlueGene/P (BG/P), combining hundreds of thousands of cores. In addition to a revolutionary system design, this scale requires new approaches to application development and performance tuning. In this paper we present a new scalable, BG/P-tailored design for the automatic performance analysis tool Periscope, which we have elicited and implemented to feature optimal system utilization, minimal monitoring intrusion and high scalability.

Yury Oleynik, Michael Gerndt
An Approach to Creating Performance Visualizations in a Parallel Profile Analysis Tool

With increases in the scale of parallelism, the dimensionality and complexity of parallel performance measurements have placed greater challenges on analysis tools. Performance visualization can assist in understanding performance properties and relationships. However, the creation of new visualizations is in practice not supported by existing parallel profiling tools: users must work with the presentation types provided by a tool and have limited means to change their design. Here we present an approach for creating new performance visualizations within an existing parallel profile analysis tool. The approach separates visual layout design from the underlying performance data model, making custom visualizations such as performance over system topologies straightforward to implement and adjust for various use cases.

Wyatt Spear, Allen D. Malony, Chee Wai Lee, Scott Biersdorff, Sameer Shende
INAM - A Scalable InfiniBand Network Analysis and Monitoring Tool

As InfiniBand (IB) clusters grow in size and scale, predicting the behavior of the IB network in terms of link usage and performance becomes an increasingly challenging task. There currently exists no open source tool that allows users to dynamically analyze and visualize the communication pattern and link usage in the IB network. In this context, we design and develop INAM, a scalable InfiniBand Network Analysis and Monitoring tool. INAM monitors IB clusters in real time and queries the various subnet management entities in the IB network to gather the performance counters specified by the IB standard. We provide an easy-to-use web-based interface to visualize performance counters and subnet management attributes of a cluster on an on-demand basis. It is also capable of capturing the communication characteristics of a subset of links in the network. Our experimental results show that INAM is able to accurately visualize the link utilization as well as the communication pattern of target applications.

N. Dandapanthula, H. Subramoni, J. Vienne, K. Kandalla, S. Sur, Dhabaleswar K. Panda, Ron Brightwell
Auto-tuning for Energy Usage in Scientific Applications

The power wall has become a dominant impeding factor in the realm of exascale system design. It is therefore important to understand how to most effectively create software to minimize its power usage while maintaining satisfactory levels of performance. This work uses existing software and hardware facilities to tune applications to minimize for several combinations of power and performance. The tuning is done with respect to software-level performance-related tunables and processor clock frequency. These tunable parameters are explored via an offline search to find the parameter combinations that are optimal with respect to performance (or delay, D), energy (E), energy × delay (E × D) and energy × delay × delay (E × D²). These searches are employed on a parallel application that solves Poisson’s equation using stencils. We show that the parameter configuration that minimizes energy consumption can save, on average, 5.4% energy with a performance loss of 4% when compared to the configuration that minimizes runtime.
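The four objective functions are easy to operationalize; the sketch below, with invented measurements mirroring the quoted 5.4%/4% trade-off, picks the optimal configuration under each of D, E, E×D and E×D².

```cpp
// Select the best configuration under each tuning objective.
// The (runtime, energy) numbers are invented for the example.
#include <cstdio>
#include <vector>

struct Config { const char* name; double d_sec; double e_joule; };

int main() {
    std::vector<Config> cands = {
        {"freq=2.6GHz,tile=32", 100.0, 5000.0},
        {"freq=2.0GHz,tile=32", 104.0, 4730.0},  // ~5.4% less E, 4% slower
        {"freq=2.0GHz,tile=64", 110.0, 4900.0},
    };
    auto best = [&](auto metric, const char* label) {
        const Config* b = &cands[0];
        for (const auto& c : cands)
            if (metric(c) < metric(*b)) b = &c;
        std::printf("%-6s -> %s\n", label, b->name);
    };
    best([](const Config& c){ return c.d_sec; },                      "D");
    best([](const Config& c){ return c.e_joule; },                    "E");
    best([](const Config& c){ return c.e_joule * c.d_sec; },          "E*D");
    best([](const Config& c){ return c.e_joule * c.d_sec * c.d_sec; },"E*D^2");
}
```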

Ananta Tiwari, Michael A. Laurenzano, Laura Carrington, Allan Snavely
Automatic Source Code Transformation for GPUs Based on Program Comprehension

This work presents a technique to transform sequential source code for execution on parallel architectures such as heterogeneous many-core systems or GPUs. Source code is parsed and basic algorithmic concepts are discovered from it in order to feed a knowledge base. A reasoner, by consulting algorithmic rules, can compose these basic concepts to pinpoint code regions representing a known algorithm. This code can then be annotated and/or transformed in a source-to-source process. A prototype tool has been built and tested on a case study analysing the source code of a matrix multiplication. After recognition of the algorithm, the code is modified with calls to the NVIDIA cuBLAS GPU library.

Pasquale Cantiello, Beniamino Di Martino
Enhancing Brainware Productivity through a Performance Tuning Workflow

Operation costs of high performance computers, like cooling and energy, drive HPC centers towards improving the efficient usage of their resources. Performance tuning through experts is here an indispensable ingredient to ensure efficient HPC operation. This “brainware” component, in addition to software and hardware, is in fact crucial to ensure continued performance of codes in light of diversifying and changing hardware platforms. However, as tuning experts are a scarce and costly resource themselves, processes should be developed that ensure the quality of the performance tuning process. This is not to dampen human ingenuity, but to ensure that tuning effort time is limited to achieve a realistic substantial gain, and that code changes are accepted by users and made part of their code distribution. In this paper, we therefore formalize a service-based Performance Tuning Workflow to standardize the tuning process and to improve the usage of tuning-expert time.

Christian Iwainsky, Ralph Altenfeld, Dieter an Mey, Christian Bischof

Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids

Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Clusters, Clouds, and Grids are three different computational paradigms with the intent or potential to support High Performance Computing (HPC). Currently, they consist of hardware, management, and usage models particular to different computational regimes, e.g., high performance systems designed to support tightly coupled scientific simulation codes, and commercial cloud systems designed to support software as a service (SaaS). However, in order to support HPC, all must at least utilize large numbers of resources, and hence effective HPC in any of these paradigms must address the issue of resiliency at large scale.

Stephen L. Scott, Chokchai (Box) Leangsuksun
The Malthusian Catastrophe Is Upon Us! Are the Largest HPC Machines Ever Up?

Thomas Malthus, an English political economist who lived from 1766 to 1834, predicted that the earth’s population would be limited by starvation since population growth increases geometrically and the food supply only grows linearly. He said, “the power of population is indefinitely greater than the power in the earth to provide subsistence for man,” thus defining the Malthusian Catastrophe. There is a parallel between this prediction and the conventional wisdom regarding super-large machines: application problem size and machine complexity is growing geometrically, yet mitigation techniques are only improving linearly.

To examine whether the largest machines are usable, the authors collected and examined component failure rates and Mean Time Between System Failure data from the world’s largest production machines, including Oak Ridge National Laboratory’s Jaguar and the University of Tennessee’s Kraken. The authors also collected MTBF data for a variety of Cray XT series machines from around the world, representing over 6 Petaflops of compute power. An analysis of the data is provided as well as plans for future work. High performance computing’s Malthusian Catastrophe hasn’t happened yet, and advances in system resiliency should keep this problem at bay for many years to come.

Patricia Kovatch, Matthew Ezell, Ryan Braby
Simulating Application Resilience at Exascale

The reliability mechanisms for future exascale systems will be a key aspect of their scalability and performance. With the expected jump in hardware component counts, faults will become increasingly common compared to today’s systems. Under these circumstances, the costs of current and emergent resilience methods need to be reevaluated. This includes the cost of recovery, which is often ignored in current work, and the impact of hardware features such as heterogeneous computing elements and non-volatile memory devices. We describe a simulation and modeling framework that enables the measurement of various resilience algorithms with varying application characteristics. For this framework we outline the simulator’s requirements, its application communication pattern generators, and a few of the key hardware component models.

Rolf Riesen, Kurt B. Ferreira, Maria Ruiz Varela, Michela Taufer, Arun Rodrigues
Framework for Enabling System Understanding

Building the effective HPC resilience mechanisms required for viability of next generation supercomputers will require in depth understanding of system and component behaviors. Our goal is to build an integrated framework for high fidelity long term information storage, historic and run-time analysis, algorithmic and visual information exploration to enable system understanding, timely failure detection/prediction, and triggering of appropriate response to failure situations. Since it is unknown what information is relevant and since potentially relevant data may be expressed in a variety of forms (e.g., numeric, textual), this framework must provide capabilities to process different forms of data and also support the integration of new data, data sources, and analysis capabilities. Further, in order to ensure ease of use as capabilities and data sources expand, it must also provide interactivity between its elements. This paper describes our integration of the capabilities mentioned above into our OVIS tool.

J. Brandt, F. Chen, A. Gentile, Chokchai (Box) Leangsuksun, J. Mayo, P. Pebay, D. Roe, N. Taerat, D. Thompson, M. Wong
Cooperative Application/OS DRAM Fault Recovery

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.
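The flavor of such cross-layer cooperation can be sketched as follows; the poisoned-entry query is a hypothetical stand-in for the OS notification interface, and the toy fixpoint iteration stands in for the Trilinos solver. The point is that the application repairs the affected entries and keeps iterating rather than rolling back.

```cpp
// Sketch of the cross-layer idea (invented API, not the Trilinos or
// paper interface): the OS reports which entries were hit by an
// uncorrected DRAM error; the solver restores them from a cheap copy
// and re-converges instead of aborting.
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical OS/runtime query: entries hit since the last call.
std::vector<int> poisoned_entries() {
    return (std::rand() % 20 == 0) ? std::vector<int>{3} : std::vector<int>{};
}

int main() {
    const int n = 8;
    std::vector<double> x(n, 0.0), backup = x;
    for (int it = 0; it < 200; ++it) {
        for (int hit : poisoned_entries())
            x[hit] = backup[hit];          // repair instead of rollback
        for (int i = 0; i < n; ++i)        // one Jacobi-style sweep of
            x[i] = 0.5 * (x[i] + 1.0);     // x = (x + b)/2, fixpoint x = 1
        if (it % 10 == 0) backup = x;      // cheap periodic copy
    }
    std::printf("x[0]=%f (converges to 1 despite injected faults)\n", x[0]);
}
```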

Patrick G. Bridges, Mark Hoemmen, Kurt B. Ferreira, Michael A. Heroux, Philip Soltero, Ron Brightwell
A Tunable, Software-Based DRAM Error Detection and Correction Library for HPC

Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to greatly increase due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft error rate. As a result, additional software will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by implementing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC is able to achieve SDC protection with a 50% resource overhead, less than the 100% needed for double modular redundancy.
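A toy version of checksum-guarded memory conveys the mechanism (LIBSDC performs this transparently and on demand; here sealing and verification are explicit, and recovery is only hinted at):

```cpp
// Per-page checksums: seal after legitimate writes, verify before use.
#include <cstdint>
#include <cstdio>

constexpr size_t PAGE = 4096;

uint64_t fnv1a(const uint8_t* p, size_t n) {     // simple checksum
    uint64_t h = 14695981039346656037ull;
    for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

struct GuardedPage {
    uint8_t  data[PAGE];
    uint64_t sum = 0;
    void seal()         { sum = fnv1a(data, PAGE); }       // after writes
    bool verify() const { return fnv1a(data, PAGE) == sum; } // before reads
};

int main() {
    GuardedPage pg{};
    pg.seal();
    pg.data[123] ^= 1 << 4;                      // simulate a silent bit flip
    if (!pg.verify())
        std::puts("SDC detected: recover from a redundant copy/checkpoint");
}
```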

David Fiala, Kurt B. Ferreira, Frank Mueller, Christian Engelmann
Reducing the Impact of Soft Errors on Fabric-Based Collective Communications

Collective operations can have a big impact on the performance of scientific applications, especially at large scale. Recently, fabric-based collectives have been proposed to address some scalability issues caused by OS jitter. However, soft errors are becoming the next factor that might significantly degrade collective performance at scale. This paper evaluates two approaches to mitigate the negative effect of soft errors on fabric-based collectives. These approaches are based on replicating the individual packets of the collective multiple times. One of them replicates packets through independent output ports at every switch (spatial replication), whereas the other uses only one output port but sends multiple packets through it consecutively (temporal replication). Results on a 1,728-node cluster show that temporal replication achieves 50% better performance than spatial replication in the presence of random soft errors.

José Carlos Sancho, Ana Jokanovic, Jesus Labarta
Evaluating Application Vulnerability to Soft Errors in Multi-level Cache Hierarchy

As the capacity of caches increases dramatically with new processors, soft errors originating in cache memories have become a major reliability concern for high performance processors. This paper presents an application-specific soft error vulnerability analysis in order to understand an application's responses to soft errors from different levels of caches. Based on a high-performance processor simulator called Graphite, we have implemented a fault injection framework that can selectively inject bit flips into different levels of caches. We simulated a wide range of relevant bit error patterns and measured the applications' vulnerabilities to bit errors. Our experimental results show the differing vulnerabilities of applications to bit errors in different levels of caches (e.g., for a given cache, the application failure rate of one program is more than double that of another); the results also indicate the probabilities of different failure behaviors for the given applications.
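The injection primitive itself is simple; here is a sketch of the kind of kernel such a framework uses (standalone and invented here, whereas the paper's version hooks into Graphite's cache models):

```cpp
// Flip one uniformly chosen bit in a chosen cache level's data array.
#include <cstdint>
#include <random>
#include <vector>

struct CacheLevel {
    std::vector<uint8_t> data;                   // modeled cache contents
    explicit CacheLevel(size_t bytes) : data(bytes, 0) {}
};

void inject_bit_flip(CacheLevel& c, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> byte(0, c.data.size() - 1);
    std::uniform_int_distribution<int>    bit(0, 7);
    c.data[byte(rng)] ^= uint8_t(1) << bit(rng); // one transient soft error
}

int main() {
    std::mt19937 rng(42);
    CacheLevel l1(32 * 1024), l2(256 * 1024);    // selectively target levels
    inject_bit_flip(l1, rng);
    inject_bit_flip(l2, rng);
    // A study then runs the application on the simulator and classifies
    // each outcome: masked, silent data corruption, or failure.
}
```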

Zhe Ma, Trevor Carlson, Wim Heirman, Lieven Eeckhout
Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience

As the high performance computing (HPC) community continues to push for ever larger machines, reliability remains a serious obstacle. Further, as feature sizes and voltages decrease, the rate of transient soft errors is on the rise. HPC programmers of today have to deal with these faults to a small degree, and it is expected that this will only become a larger problem as systems continue to scale.

In this paper we present SEFI, the Soft Error Fault Injection framework, a tool for profiling software for its susceptibility to soft errors. In particular, we focus in this paper on logic soft error injection. Using QEMU, the open source virtual machine and processor emulator, we demonstrate modifying emulated machine instructions to introduce soft errors. We conduct experiments by modifying the virtual machine itself in a way that does not require intimate knowledge of the tested application. With this technique, we show that we are able to inject simulated soft errors into the logic operations of a target application without affecting other applications or the operating system sharing the VM. We present some initial results and discuss where we think this work will be useful in next generation hardware/software co-design.

Nathan DeBardeleben, Sean Blanchard, Qiang Guan, Ziming Zhang, Song Fu
High Availability on Cloud with HA-OSCAR

Cloud computing provides virtual resources so that end users or organizations can buy computing power as a public utility. Cloud service providers, however, must strive to ensure good QoS by offering highly available services with dynamically scalable resources. HA-OSCAR is an open source High Availability (HA) solution for HPC/cloud that offers component redundancy, failure detection, and automatic fail-over. In this paper, we describe HA-OSCAR as a cloud platform and analyze the system availability of two potential cloud computing systems, an OSCAR-V cluster and HA-OSCAR-V. We also present a case study on improving Nimbus, a popular cloud IaaS toolkit. The results show that a system that deploys HA-OSCAR has a significantly higher degree of availability.

Thanadech Thanakornworakij, Rajan Sharma, Blaine Scroggs, Chokchai (Box) Leangsuksun, Zeno Dixon Greenwood, Pierre Riteau, Christine Morin
On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance

The increasing size and complexity of high performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. In this work, we explore the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. Leveraging a simple model for checkpoint compression viability, we conclude that checkpoint data compression should be considered as part of a scalable checkpoint/restart solution, and discuss additional scenarios and improvements that may make checkpoint data compression even more viable.
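One plausible form of such a viability condition (notation ours, not copied from the paper): with checkpoint size S, compression throughput β_c, compression factor f > 1 and commit bandwidth β_w, compression pays off when compressing and then writing the smaller checkpoint beats writing the raw one:

```latex
\frac{S}{\beta_c} + \frac{S/f}{\beta_w} \;<\; \frac{S}{\beta_w}
\quad\Longleftrightarrow\quad
\beta_c \;>\; \frac{\beta_w}{1 - 1/f}, \qquad f > 1.
```

For instance, under this model a compression factor of 2 makes compression worthwhile whenever the compressor sustains at least twice the commit bandwidth.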

Dewan Ibtesham, Dorian Arnold, Kurt B. Ferreira, Patrick G. Bridges
Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging?

Given the ever-increasing size of supercomputers, fault resilience and the ability to tolerate faults have become more of a necessity than an option. Checkpoint-restart protocols have been widely adopted as a practical solution to provide reliability. However, traditional checkpointing mechanisms suffer from a heavy I/O bottleneck while dumping process snapshots to a shared filesystem. In this context, we study the benefits of data staging, using a proposed hierarchical and modular data staging framework that reduces the burden of checkpointing on client nodes without penalizing them in terms of performance. During a checkpointing operation in this framework, the compute nodes transmit their process snapshots to a set of dedicated staging I/O servers through a high-throughput RDMA-based data pipeline. Unlike conventional checkpointing mechanisms that block an application until the checkpoint data has been written to a shared filesystem, we allow the application to resume its execution immediately after the snapshots have been pipelined to the staging I/O servers, while data is simultaneously being moved from these servers to a backend shared filesystem. This framework eases the bottleneck caused by simultaneous writes from multiple clients to the underlying storage subsystem. The staging framework considered in this study is able to reduce the time penalty an application pays to save a checkpoint by 8.3 times.

Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram, Dhabaleswar K. Panda
Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol

Failure-free execution will become rare in future exascale computers; thus, fault tolerance is now an active field of research. In this paper, we study the impact of decomposing an application into much more parallelism than the physical parallelism on the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance workload after failure without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant over-decomposition factors. With over-decomposition, restarted execution on the remaining nodes after failures shows very good performance compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42%. We also consider a partial restart protocol that reduces the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases, and thanks to over-decomposition, this partial restart time can represent only 54% of the global restart time.

Xavier Besseron, Thierry Gautier

UCHPC 2011: Fourth Workshop on UnConventional High-Performance Computing

UCHPC 2011: Fourth Workshop on UnConventional High Performance Computing

As the word “UnConventional” in the title suggests, the workshop focuses on hardware or platforms used for HPC which were not intended for HPC in the first place. Reasons could be raw computing power, good performance per watt, or low cost in general. Thus, UCHPC tries to capture HPC solutions which are unconventional today but perhaps conventional tomorrow. For example, the computing power of gaming platforms has recently risen rapidly. This motivated the use of GPUs for computing (GPGPU), and even the building of computational grids from game consoles. The recent trend of integrating GPUs on processor chips seems to be very beneficial for the use of both parts for HPC. Other examples of “unconventional” hardware are embedded low-power processors, upcoming manycore architectures, FPGAs, and DSPs. Thus, interesting devices for research in unconventional HPC are not only standard server or desktop systems, but also relatively cheap devices that are mass-market products, such as smartphones, netbooks, tablets, and small NAS servers. For example, smartphones seem to become more performance-hungry every day. Only imagination sets the limit on using such devices for HPC. The goal of the workshop is to present the latest research on how hardware and software that is (yet) unconventional for HPC is or can be used to reach goals such as the best performance per watt. UCHPC also covers corresponding programming models, compiler techniques, and tools.

Anders Hast, Josef Weidendorfer, Jan-Philipp Weiss
PACUE: Processor Allocator Considering User Experience

GPU-accelerated applications, including GPGPU ones, are common on modern PCs. If many applications compete for the same GPU, performance decreases significantly. Some applications have a large impact on user experience; therefore, for such applications, we have to limit GPU utilization by the other applications. It might seem straightforward to modify applications to switch compute devices dynamically for intelligent resource allocation. Unfortunately, we cannot do so due to software distribution policies or other reasons. In this paper, we propose PACUE, which allows the end system to allocate compute devices to applications arbitrarily. In addition, PACUE guesses the optimal compute device for each application according to user preference. We implemented the dynamic compute device redirector of PACUE, including OpenCL API hooking and device camouflaging features, as well as the framework of the PACUE resource manager. We demonstrate that PACUE achieves dynamic compute device redirection on one out of two real applications and on all of 20 sample codes.
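API hooking plus device camouflage can be pictured with an LD_PRELOAD shim (hypothetical policy and code, not PACUE's implementation): intercept clGetDeviceIDs and rewrite GPU requests to the CPU when the GPU should be kept free for a user-facing application.

```cpp
// Build (Linux, assumed): g++ -shared -fPIC hook.cpp -o hook.so -ldl
// Run: LD_PRELOAD=./hook.so ./opencl_app
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <dlfcn.h>

static bool gpu_reserved_for_foreground() { return true; } // stub policy

extern "C" cl_int clGetDeviceIDs(cl_platform_id platform,
                                 cl_device_type type,
                                 cl_uint num_entries,
                                 cl_device_id* devices,
                                 cl_uint* num_devices) {
    using fn_t = cl_int (*)(cl_platform_id, cl_device_type, cl_uint,
                            cl_device_id*, cl_uint*);
    static fn_t real = (fn_t)dlsym(RTLD_NEXT, "clGetDeviceIDs");
    if (!real) return CL_INVALID_VALUE;
    if ((type & CL_DEVICE_TYPE_GPU) && gpu_reserved_for_foreground())
        type = CL_DEVICE_TYPE_CPU;       // camouflage: hand back the CPU
    return real(platform, type, num_entries, devices, num_devices);
}
```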

Tetsuro Horikawa, Michio Honda, Jin Nakazawa, Kazunori Takashio, Hideyuki Tokuda
Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation

Multi-core parallelism and accelerators are becoming common features of today's computer systems, as they allow for computational power without sacrificing energy efficiency. Due to heterogeneity, tuning for each type of compute unit and adequate load balancing are essential. This paper proposes static and dynamic solutions for load balancing in the context of an application for visualizing high-dimensional simulation data. The application relies on the sparse grid technique for data compression. Its performance-critical part is the interpolation routine used for decompression. Results show that our load balancing scheme allows for an efficient acceleration of interpolation on heterogeneous systems containing multi-core CPUs and GPUs.
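The static variant reduces to one proportion (a generic sketch with illustrative calibration numbers, not the paper's measured values): give each device a share of the interpolation points proportional to its measured throughput, so both finish at roughly the same time.

```cpp
// Static CPU/GPU split proportional to calibrated throughputs.
#include <cstdio>

int main() {
    const long n_points = 10'000'000;
    // Throughputs from a short calibration run (illustrative numbers):
    const double cpu_pts_per_s = 2.0e6, gpu_pts_per_s = 6.0e6;
    long n_gpu = (long)(n_points * gpu_pts_per_s /
                        (cpu_pts_per_s + gpu_pts_per_s));
    long n_cpu = n_points - n_gpu;
    // Both devices now need ~n_cpu/cpu_rate == n_gpu/gpu_rate seconds,
    // which is the load-balance condition; dynamic schemes re-measure
    // and re-split at runtime instead.
    std::printf("CPU: %ld points, GPU: %ld points\n", n_cpu, n_gpu);
}
```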

Alin Muraraşu, Josef Weidendorfer, Arndt Bode
Performance Evaluation of a Multi-GPU Enabled Finite Element Method for Computational Electromagnetics

We study the performance of a multi-GPU enabled numerical methodology for the simulation of electromagnetic wave propagation in complex domains and heterogeneous media. For this purpose, the system of time-domain Maxwell equations is discretized by a discontinuous finite element method which is formulated on an unstructured tetrahedral mesh and which relies on a high order interpolation of the electromagnetic field components within a mesh element. The resulting numerical methodology is adapted to parallel computing on a cluster of GPU acceleration cards by adopting a hybrid strategy which combines a coarse grain SPMD programming model for inter-GPU parallelization and a fine grain SIMD programming model for intra-GPU parallelization. The performance improvement resulting from this multiple-GPU algorithmic adaptation is demonstrated through three-dimensional simulations of the propagation of an electromagnetic wave in the head of a mobile phone user.

Tristan Cabel, Joseph Charles, Stéphane Lanteri
Study of Hierarchical N-Body Methods for Network-on-Chip Architectures

In this paper, we study two hierarchical N-Body methods for Network-on-Chip (NoC) architectures. Modern Chip Multiprocessor (CMP) designs are mainly based on shared-bus communication architectures, which suffer from high communication delays as the number of cores increases; NoC-based architectures have therefore been proposed. The N-Body problem is a classical problem of approximating the motion of bodies. Two methods, namely Barnes-Hut (Barnes) and the Fast Multipole Method (FMM), have been developed for fast simulation. The two algorithms have been implemented and studied on conventional computer systems and Graphics Processing Units (GPUs). However, the evaluation of N-Body methods on a NoC platform, a promising unconventional multicore architecture, has not been well addressed. We define a NoC model based on state-of-the-art systems and present evaluation results from a cycle-accurate full system simulator. Experiments show that Barnes scales better (53.7x for Barnes versus 36.6x for FMM on 64 processing elements) and requires less cache than FMM. However, we observe hot-spot traffic in Barnes. Our analysis and experimental results provide a guideline for studying N-Body methods on a NoC platform.
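For reference, the kernel that distinguishes Barnes-Hut from direct summation, and whose tree traversal shapes the traffic observed on the chip, is the θ opening criterion; here is a generic textbook sketch, independent of the paper's simulator setup:

```cpp
// Barnes-Hut force accumulation: a cell whose size/distance ratio is
// below theta is treated as a single point mass; otherwise open it.
#include <cmath>
#include <vector>

struct Cell {
    double cx, cy, mass, size;          // center of mass, total mass, width
    std::vector<Cell*> children;        // empty => leaf holding one body
};

void accumulate_force(const Cell& c, double x, double y,
                      double& fx, double& fy, double theta = 0.5) {
    double dx = c.cx - x, dy = c.cy - y;
    double d = std::sqrt(dx * dx + dy * dy) + 1e-12; // avoid div by zero
    if (c.children.empty() || c.size / d < theta) {
        double f = c.mass / (d * d * d);  // G = 1; far cell ~ point mass
        fx += f * dx; fy += f * dy;
    } else {
        for (const Cell* ch : c.children)
            accumulate_force(*ch, x, y, fx, fy, theta);
    }
}
```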

Thomas Canhao Xu, Pasi Liljeberg, Hannu Tenhunen
Extending a Highly Parallel Data Mining Algorithm to the Intel ® Many Integrated Core Architecture

Extracting knowledge from vast datasets is a major challenge in data-driven applications, such as classification and regression, which are mostly compute bound. In this paper, we extend our SG++ algorithm to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). The ease of porting an application to the Intel MIC Architecture is shown: porting existing SSE code is very easy and straightforward. We evaluate the current prototype pre-release coprocessor board codenamed Intel® “Knights Ferry”. We utilize the pragma-based offloading programming model offered by the Intel® Composer XE for Intel MIC Architecture, generating both the host and the coprocessor code. We compare the achieved performance with an NVIDIA C2050 accelerator and show that the pre-release Knights Ferry coprocessor delivers better performance than the C2050 and exceeds the C2050 when comparing the productivity aspect of implementing algorithms for the coprocessors.

Alexander Heinecke, Michael Klemm, Dirk Pflüger, Arndt Bode, Hans-Joachim Bungartz

VHPC 2011: 6th Workshop on Virtualization in High-Performance Cloud Computing

VHPC 2011: 6th Workshop on Virtualization in High-Performance Cloud Computing

Virtualization has become a common abstraction layer in modern data centers, enabling resource owners to manage complex infrastructure independently of their applications. Conjointly, virtualization is becoming a driving technology for a manifold of industry-grade IT services. The cloud concept includes the notion of a separation between resource owners and users, adding services such as hosted application frameworks and queuing. Utilizing the same infrastructure, clouds carry significant potential for use in high-performance scientific computing. The ability of clouds to provide for requests and releases of vast computing resources dynamically and close to the marginal cost of providing the services is unprecedented in the history of scientific and commercial computing. Distributed computing concepts that leverage federated resource access are popular within the grid community, but have not yet seen the desired levels of deployment. Also, many scientific datacenters have not adopted virtualization or cloud concepts yet. This workshop aims to bring together industrial providers with the scientific community in order to foster discussion, collaboration and mutual exchange of knowledge and experience. This year's workshop featured 9 papers on diverse topics in HPC virtualization. Papers of note include Kim et al., proposing group-based cloud memory deduplication, and Nanos et al., presenting results from a high-performance cluster interconnect prototype for VMs with a user-level RDMA protocol over standard 10Gbps Ethernet. The chairs would like to thank the Euro-Par organizers and the members of the program committee, along with the speakers and attendees, whose interaction contributed to a stimulating environment. VHPC plans to continue the successful co-location with Euro-Par in 2012.

Michael Alexander, Gianluigi Zanetti
Group-Based Memory Deduplication for Virtualized Clouds

In virtualized clouds, machine memory is known as the resource that primarily limits the consolidation level, due to the high cost of hardware extension and power consumption. To address this limitation, various memory deduplication techniques have been proposed to increase available machine memory by eliminating memory redundancy. Existing memory deduplication techniques, however, lack isolation support, which is a crucial factor of cloud quality of service and trustworthiness. This paper presents a group-based memory deduplication scheme that ensures isolation between customer groups colocated on a physical machine. In addition to isolation support, our scheme enables per-group customization of memory deduplication according to each group's memory demand and workload characteristics.
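The isolation rule can be captured in a few lines (scheme invented for illustration, not the authors' kernel code): page contents are merged only when the owners share a customer group, so deduplication never creates a sharing channel across group boundaries.

```cpp
// Group-aware page deduplication in miniature: the share table is
// keyed by (group, content hash), so identical pages in different
// groups are never merged.
#include <cstdint>
#include <map>
#include <utility>

using PageHash = uint64_t;
using PageId   = long;

std::map<std::pair<int, PageHash>, PageId> shared; // (group, hash) -> page

// Returns the canonical page to map (copy-on-write), or the page
// itself if it is the first of its kind within the group.
PageId dedup(int group, PageHash h, PageId page) {
    auto key = std::make_pair(group, h);
    auto it = shared.find(key);
    if (it == shared.end()) { shared[key] = page; return page; }
    return it->second;                  // merge inside the group only
}

int main() {
    PageId a = dedup(/*group*/1, 0xABCD, 100); // first copy, kept
    PageId b = dedup(1, 0xABCD, 101);          // same group: merged -> 100
    PageId c = dedup(2, 0xABCD, 102);          // other group: NOT merged
    return (a == 100 && b == 100 && c == 102) ? 0 : 1;
}
```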

Sangwook Kim, Hwanju Kim, Joonwon Lee
A Smart HPC Interconnect for Clusters of Virtual Machines

In this paper, we present the design of a VM-aware, high-performance cluster interconnect architecture over 10Gbps Ethernet. Our framework provides a direct data path to the NIC for applications that run on VMs, leaving non-critical paths (such as control) to be handled by intermediate virtualization layers. As a result, we are able to multiplex and prioritize network access per VM. We evaluate our design via a prototype implementation that integrates RDMA semantics into the privileged guest of the Xen virtualization platform. Our framework allows VMs to communicate with the network using a simple user-level RDMA protocol. Preliminary results show that our prototype achieves 681MiB/sec over generic 10GbE hardware and relieves the guest from CPU overheads, while limiting the guest’s CPU utilisation to 34%.

Anastassios Nanos, Nikos Nikoleris, Stratos Psomadakis, Elisavet Kozyri, Nectarios Koziris
Coexisting Scheduling Policies Boosting I/O Virtual Machines

Deploying multiple Virtual Machines (VMs) running various types of workloads on current many-core cloud computing infrastructures raises an important issue: The Virtual Machine Monitor (VMM) has to efficiently multiplex VM accesses to the hardware. We argue that altering the scheduling concept can optimize the system’s overall performance.

Currently, the Xen VMM achieves near-native performance when multiplexing VMs with homogeneous workloads; yet with a mixture of VMs running different types of workloads concurrently, I/O performance suffers. Taking into account the complexity of designing and implementing a universal scheduler, let alone the probability that such an effort would be fruitless, we focus on a system with multiple coexisting scheduling policies that service VMs according to their workload characteristics. Thus, VMs can benefit from various schedulers, either existing or new, that are optimal for each specific case.

In this paper, we design a framework that provides three basic coexisting scheduling policies and implement it in the Xen paravirtualized environment. Evaluating our prototype, we observe 2.3 times faster I/O service and link saturation, while the CPU-intensive VMs achieve more than 80% of their original performance.

Dimitris Aragiorgis, Anastassios Nanos, Nectarios Koziris
PIGA-Virt: An Advanced Distributed MAC Protection of Virtual Systems

Efficient Mandatory Access Control (MAC) of Virtual Machines remains an open problem for efficiently protecting cloud systems. For example, the MAC protection must allow some information flows between two virtual machines while preventing other flows between those same machines. To solve these problems, the virtual environment must guarantee in-depth protection in order to control the information flows that start in one Virtual Machine (VM) and finish in another. In contrast with existing MAC approaches, PIGA-Virt is a MAC protection controlling the different levels of a virtual system. It eases the management of the required security objectives and guarantees them while efficiently controlling the information flows. PIGA-Virt supports a large range of predefined protection canvases whose efficiency has been demonstrated during the ANR Sec&Si security challenge. The paper shows how the PIGA-Virt approach guarantees advanced confidentiality and integrity properties by controlling complex combinations of transitive information flows passing through intermediate resources. As far as we know, PIGA-Virt is the first operational solution providing in-depth MAC protection, addressing advanced security requirements and efficiently controlling information flows inside and between virtual machines. Moreover, the solution is independent of the underlying hypervisor. Performance figures and protection scenarios are given for protecting KVM virtual machines.

J. Briffaut, E. Lefebvre, J. Rouzaud-Cornabas, C. Toinard
An Economic Approach for Application QoS Management in Clouds

Virtualization provides increased control and flexibility in how resources are allocated to applications. However, common resource provisioning mechanisms do not fully use these advantages; either they provide limited support for applications demanding quality of service, or the resource allocation complexity is high. To address this problem we propose a novel resource management architecture for virtualized infrastructures based on a virtual economy. By limiting the coupling between the applications and the resource management, this architecture can support diverse types of applications and performance goals while ensuring an efficient resource usage. We validate its use through simple policies that scale the resource allocations of the applications vertically and horizontally to meet application performance goals.

Stefania Costache, Nikos Parlavantzas, Christine Morin, Samuel Kortas
Evaluation of the HPC Challenge Benchmarks in Virtualized Environments

This paper evaluates the performance of the HPC Challenge benchmarks in several virtual environments, including VMware, KVM and VirtualBox. The HPC Challenge benchmarks consist of a suite of tests that examine the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance LINPACK (HPL) benchmark used in the TOP500 list. The tests include four local (matrix-matrix multiply, STREAM, RandomAccess and FFT) and four global (High Performance Linpack – HPL, parallel matrix transpose – PTRANS, RandomAccess and FFT) kernel benchmarks.

The purpose of our experiments is to evaluate the overheads of the different virtual environments and investigate how different aspects of the system are affected by virtualization. We ran the benchmarks on an 8-core system with Core i7 processors using Open MPI. We did runs on the bare hardware and in each of the virtual environments for a range of problem sizes. As expected, the HPL results had some overhead in all the virtual environments, with the overhead becoming less significant with larger problem sizes. The RandomAccess results show drastically different behavior and we attempt to explain it with pertinent experiments. We show the cause of variability of performance results as well as major causes of measurement error.

Piotr Luszczek, Eric Meek, Shirley Moore, Dan Terpstra, Vincent M. Weaver, Jack Dongarra
DISCOVERY, Beyond the Clouds
DIStributed and COoperative Framework to Manage Virtual EnviRonments autonomicallY: A Prospective Study

Although the use of virtual environments provided by cloud computing infrastructures is gaining consensus in the scientific community, running applications in these environments is still far from reaching the maturity of more usual computing facilities such as clusters or grids. Indeed, current solutions for managing virtual environments are mostly based on centralized approaches that barter large-scale concerns such as scalability, reliability and reactivity for simplicity. However, considering current trends in cloud infrastructures in terms of size (larger and larger) and usage (cross-federation), these large-scale concerns must be addressed as soon as possible to efficiently manage the next generation of cloud computing platforms.

In this work, we propose to investigate an alternative approach leveraging DIStributed and COoperative mechanisms to manage Virtual EnviRonments autonomicallY (DISCOVERY). This initiative aims at overcoming the main limitations of traditional server-centric solutions while integrating all mandatory mechanisms into a unified distributed framework. The system we propose to implement relies on a peer-to-peer model where each agent can efficiently deploy, dynamically schedule and periodically checkpoint the virtual environments it manages. The article introduces the global design of the DISCOVERY proposal and gives a preliminary description of its internals.

Adrien Lèbre, Paolo Anedda, Massimo Gaggero, Flavien Quesnel
Cooperative Dynamic Scheduling of Virtual Machines in Distributed Systems

Cloud Computing aims at outsourcing data and application hosting and at charging clients on a per-usage basis. These data and applications may be packaged in virtual machines (VMs), which are themselves hosted by nodes, i.e. physical machines.

Consequently, several frameworks have been designed to manage VMs on pools of nodes. Unfortunately, most of them do not efficiently address a common objective of cloud providers: maximizing system utilization while ensuring the quality of service (QoS). The main reason is that these frameworks schedule VMs in a static way and/or have a centralized design.

In this article, we introduce a framework that enables VMs to be scheduled cooperatively and dynamically in distributed systems. We evaluated our prototype through simulations to compare our approach with the centralized one. Preliminary results showed that our scheduler is more reactive. As future work, we plan to investigate the scalability of our framework further, and to improve its reactivity and fault-tolerance aspects.

Flavien Quesnel, Adrien Lèbre
Large-Scale DNA Sequence Analysis in the Cloud: A Stream-Based Approach

Cloud computing technologies have made it possible to analyze big data sets in scalable and cost-effective ways. DNA sequence analysis, where very large data sets are now generated at reduced cost using Next-Generation Sequencing (NGS) methods, is an area which can greatly benefit from cloud-based infrastructures. Although existing solutions show nearly linear scalability, they pose significant limitations in terms of data transfer latencies and cloud storage costs. In this paper, we propose a streaming data management architecture to tackle the performance problems that arise from having to transfer large amounts of data between clients and the cloud. Our approach provides an incremental data processing model which can hide data transfer latencies while maintaining linear scalability. We present an initial implementation and evaluation of this approach for SHRiMP, a well-known software package for NGS read alignment, based on the IBM InfoSphere Streams computing platform deployed on Amazon EC2.
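The incremental model reduces to a producer-consumer pipeline (generic sketch, not InfoSphere Streams code): alignment begins on the first chunks while later chunks are still in transit, hiding transfer latency behind computation.

```cpp
// Transfer thread enqueues read chunks as they arrive; the alignment
// thread consumes them immediately instead of waiting for the upload.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

std::queue<std::string> chunks;
std::mutex m;
std::condition_variable cv;
bool done = false;

void transfer() {                         // stands in for client upload
    for (int i = 0; i < 100; ++i) {
        { std::lock_guard<std::mutex> g(m); chunks.push("reads-batch"); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> g(m); done = true; }
    cv.notify_one();
}

void align() {                            // starts before transfer ends
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !chunks.empty() || done; });
        if (chunks.empty()) return;       // upload finished and drained
        std::string batch = std::move(chunks.front()); chunks.pop();
        lk.unlock();
        /* run the aligner (e.g., SHRiMP) on this batch here */
    }
}

int main() {
    std::thread t(transfer), a(align);
    t.join(); a.join();
}
```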

Romeo Kienzler, Rémy Bruggmann, Anand Ranganathan, Nesime Tatbul
Backmatter
Metadata
Title
Euro-Par 2011: Parallel Processing Workshops
Editors
Michael Alexander
Pasqua D’Ambra
Adam Belloum
George Bosilca
Mario Cannataro
Marco Danelutto
Beniamino Di Martino
Michael Gerndt
Emmanuel Jeannot
Raymond Namyst
Jean Roman
Stephen L. Scott
Jesper Larsson Träff
Geoffroy Vallée
Josef Weidendorfer
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-29740-3
Print ISBN
978-3-642-29739-7
DOI
https://doi.org/10.1007/978-3-642-29740-3