
2012 | Book

Euro-Par 2011: Parallel Processing Workshops

CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, VHPC, Bordeaux, France, August 29 – September 2, 2011, Revised Selected Papers, Part I

Edited by: Michael Alexander, Pasqua D’Ambra, Adam Belloum, George Bosilca, Mario Cannataro, Marco Danelutto, Beniamino Di Martino, Michael Gerndt, Emmanuel Jeannot, Raymond Namyst, Jean Roman, Stephen L. Scott, Jesper Larsson Träff, Geoffroy Vallée, Josef Weidendorfer

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this Book

This book constitutes the thoroughly refereed post-conference proceedings of the workshops of the 17th International Conference on Parallel Computing, Euro-Par 2011, held in Bordeaux, France, in August 2011. The papers of the 12 workshops (CCPI, CGWS, HeteroPar, HiBB, HPCVirt, HPPC, HPSS, MDGS, ProPer, Resilience, UCHPC, and VHPC) focus on the promotion and advancement of all aspects of parallel and distributed computing.

Table of Contents

Frontmatter

CCPI 2011: Workshop on Cloud Computing Projects and Initiatives

CCPI 2011: Workshop on Cloud Computing Projects and Initiatives

Cloud computing is a recent computing paradigm for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. Clouds are currently used mainly in commercial settings and focus on on-demand provisioning of IT infrastructure. Cloud computing can play a significant role in a variety of areas including innovation, virtual worlds, e-business, social networks, and search engines. But it is still in its early stages, with much experimentation to come.

Beniamino Di Martino, Dana Petcu
Towards Cross-Platform Cloud Computing

Cloud computing is becoming increasingly popular and prevalent in many domains. However, there is high variability in the programming models, access methods, and operational aspects of different clouds, diminishing the viability of cloud computing as a true utility. Our ADAPAS project attempts to analyze the commonalities and differences between cloud offerings with a view to determining the extent to which they may be unified. We propose the concept of dynamic adapters, supported by runtime systems for environment preconditioning, that help facilitate cross-platform deployment of cloud applications. This vision paper outlines the issues involved and presents preliminary ideas for enhancing the executability of applications on different cloud platforms.

Magdalena Slawinska, Jaroslaw Slawinski, Vaidy Sunderam
QoS Monitoring in a Cloud Services Environment: The SRT-15 Approach

The evolution of Cloud Computing environments has given new impetus to service-oriented computing, with hardware resources, whole applications and entire business processes provided as services in the so-called “as a service” paradigm. In such a paradigm the resulting interactions involve actors (users and providers of services) belonging to different entities and possibly to different companies; hence the success of this new vision of the IT world is strictly tied to the ability to guarantee high quality levels in the provisioning of resources and services. In this paper we present QoSMONaaS (Quality of Service MONitoring as a Service), a QoS monitoring facility built on top of SRT-15, a Cloud-oriented, CEP-based platform being developed in the context of the homonymous EU-funded project. In particular, we present the main components of QoSMONaaS and illustrate its operation and internals with respect to a substantial case study of an Internet of Things (IoT) application.

Giuseppe Cicotti, Luigi Coppolino, Rosario Cristaldi, Salvatore D’Antonio, Luigi Romano
Enabling e-Science Applications on the Cloud with COMPSs

COMP Superscalar (COMPSs) is a programming framework that provides an easy-to-use programming model and a runtime to ease the development of applications for distributed environments. Thanks to its modular architecture, COMPSs can use a wide range of computational infrastructures, providing a uniform interface for job submission and file transfer operations through adapters for different middleware. In the context of the VENUS-C project, the COMPSs framework has been extended through the development of a programming model enactment service that allows researchers to transparently port and execute scientific applications in the Cloud.

This paper presents the implementation of a bioinformatics workflow (using BLAST as core program), the porting to the COMPSs framework and its deployment on the VENUS-C platform. The proposed approach has been evaluated on a Cloud testbed using virtual machines managed by EMOTIVE Cloud and compared to a similar approach on the Azure platform and to other implementations on HPC infrastructures.

Daniele Lezzi, Roger Rafanell, Abel Carrión, Ignacio Blanquer Espert, Vicente Hernández, Rosa M. Badia
OPTIMIS and VISION Cloud: How to Manage Data in Clouds

In the rapidly evolving Cloud market, the amount of data being generated is growing continuously and as a consequence storage as a service plays an increasingly important role. In this paper, we describe and compare two new approaches, deriving from the EU funded FP7 projects OPTIMIS and VISION Cloud respectively, to filling existing gaps in Cloud storage offerings. We portray the key value-add characteristics of their designs that improve the state of the art for Cloud computing towards providing more advanced features for Cloud-based storage services.

Spyridon V. Gogouvitis, George Kousiouris, George Vafiadis, Elliot K. Kolodner, Dimosthenis Kyriazis
Integrated Monitoring of Infrastructures and Applications in Cloud Environments

One approach to fully exploiting the potential of Cloud technologies consists in leveraging the Autonomic Computing paradigm. It can be exploited to put in place reconfiguration strategies spanning the whole protocol stack, starting from the infrastructure and going up to platform/application-level protocols. On the other hand, the very basis for the design and development of Cloud-oriented Autonomic Managers is represented by monitoring sub-systems able to provide audit data related to any layer within the stack. In this article we present the approach taken in designing and implementing the monitoring sub-system for the Cloud-TM FP7 project, which aims at realizing a self-adapting, Cloud-based middleware platform providing transactional data access to generic customer applications.

Roberto Palmieri, Pierangelo di Sanzo, Francesco Quaglia, Paolo Romano, Sebastiano Peluso, Diego Didona
Towards Collaborative Data Management in the VPH-Share Project

The goal of the Virtual Physiological Human Initiative is to provide a systematic framework for understanding physiological processes in the human body in terms of anatomical structure and biophysical mechanisms across multiple length and time scales. In the long term it will transform the delivery of European healthcare into a more personalised, predictive, and integrative process, with significant impact on healthcare and on disease prevention. This paper outlines how the recently funded project VPH-Share contributes to this vision. The project is motivated by the needs of the whole VPH community to harness ICT technology to improve health services for the individual. VPH-Share will provide the organisational fabric (the infostructure), realised as a series of services, offered in an integrated framework, to expose and to manage data, information and tools, to enable the composition and operation of new VPH workflows and to facilitate collaborations between the members of the VPH community.

Siegfried Benkner, Jesus Bisbal, Gerhard Engelbrecht, Rod D. Hose, Yuriy Kaniovskyi, Martin Koehler, Carlos Pedrinaci, Steven Wood
SLM and SDM Challenges in Federated Infrastructures

Federation of computing resources imposes challenges in service management not seen in simple customer-supplier relationships. Federation is common in e-Infrastructure and growing in clouds through the rise of hybrid and multi-clouds. Relationships in federated environments are complex at present, and must be simplified before structured service management can be improved. Input can be taken from commercial service management techniques such as ITIL and ISO/IEC 20000, but special features of federated environments, such as complications in inducement and enforcement, must be considered.

Matti Heikkurinen, Owen Appleton
Rapid Prototyping of Architectures on the Cloud Using Semantic Resource Description

We present in this paper a way of prototyping architectures based on the generation of service representations of resources. This generated “infrastructure” can be used to rapidly build on-demand settings for application/scenario requirements in a Cloud Computing context, where such requirements can be as diverse as the applications running on the Cloud. The resources used to build the infrastructure are semantically described to capture their properties and capabilities. We have also developed a framework called the Managed Resource Framework (MRF) to automatically generate service descriptions with an added manageability interface from these semantic descriptions. These services are then ready for deployment. Our work has materialized in the SEROM software.

Houssam Haitof
Cloud Patterns for mOSAIC-Enabled Scientific Applications

Cloud computing has huge potential to change the way data- and computing-intensive applications perform computations. These specific categories of applications raise different concerns and issues that can be overcome by identifying relevant reusable cloud computing patterns on top of specific cloud computing use cases. The development of new cloud patterns will help offer better support for the development and deployment of scientific distributed applications over a cloud infrastructure.

Teodor-Florin Fortiş, Gorka Esnal Lopez, Imanol Padillo Cruz, Gábor Ferschl, Tamás Máhr
Enhancing an Autonomic Cloud Architecture with Mobile Agents

In cloud environments, application scheduling, i.e., the matching of applications with the resources they need to execute, is a hot research topic. Autonomic computing provides viable solutions for implementing robust architectures that are flexible enough to tackle scheduling problems. CHASE is a framework based on an autonomic engine, designed to optimize resource management in clouds, grids or hybrid cloud-grid environments. Its optimizations are based on real-time knowledge of the status of managed resources. This requires continuous monitoring, which is difficult to carry out in distributed and rapidly changing environments such as clouds. This paper presents a monitoring system to support autonomicity based on the mobile-agent computing paradigm.

A. Cuomo, M. Rak, S. Venticinque, U. Villano
Mapping Application Requirements to Cloud Resources

Cloud Computing has created a paradigm shift in software development. Many developers now use the Cloud as an affordable platform on which to deploy business solutions. One outstanding challenge is the integration of different Cloud services (or resources), offered by different Cloud providers, when building a Cloud-oriented business solution. Typically each provider has a different means of describing Cloud resources and uses a different application programming interface to acquire Cloud resources. Developers need to make complex decisions involving multiple Cloud products, different Cloud implementations, different deployment options, and different programming approaches. In this paper, we propose a model for discovering Cloud resources in a multi-provider environment. We study a financial use case scenario and suggest the use of a provider-agnostic approach which hides the complex implementation details for mapping the application requirements to Cloud resources.

Yih Leong Sun, Terence Harmer, Alan Stewart, Peter Wright
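The provider-agnostic discovery model summarized in the abstract above can be illustrated with a toy sketch (all names, fields, and data below are hypothetical, not the paper's actual model): requirements are matched against a catalogue that presents resources from several providers in one uniform description.

```python
# Hypothetical sketch of provider-agnostic resource discovery: a single
# catalogue format hides each provider's own description and API, and a
# requirement set is matched against it. Illustrative only.

catalogue = [  # uniform, provider-agnostic view of offerings (made-up data)
    {"provider": "A", "cpus": 4,  "mem_gb": 8,  "region": "eu"},
    {"provider": "B", "cpus": 16, "mem_gb": 64, "region": "us"},
    {"provider": "B", "cpus": 8,  "mem_gb": 32, "region": "eu"},
]

def discover(requirements, catalogue):
    """Return all resources satisfying every minimum/equality requirement."""
    def ok(res):
        return (res["cpus"] >= requirements["min_cpus"]
                and res["mem_gb"] >= requirements["min_mem_gb"]
                and res["region"] == requirements["region"])
    return [r for r in catalogue if ok(r)]

matches = discover({"min_cpus": 8, "min_mem_gb": 16, "region": "eu"}, catalogue)
print(matches)  # only provider B's 8-CPU eu offering qualifies
```

The point of such a layer, as the abstract argues, is that the developer's decision ranges over requirements, not over per-provider APIs.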

CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing – CGWS2011

CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing – CGWS2011

CoreGRID is a European research Network of Excellence (NoE) that was initiated in 2004 as part of the EU FP6 research framework and ran until 2008. CoreGRID partners, from 44 different countries, developed theoretical foundations and software infrastructures for large-scale, distributed Grid and P2P applications. An ERCIM-sponsored CoreGRID Working Group was established to ensure the continuity of the CoreGRID programme after the official end of the NoE. The working group extended its interests to include the emerging field of (service-based) cloud computing, which is of great importance to the European software industry. Its main goals are (i) to sustain the operation of the CoreGRID Network, (ii) to establish a forum that encourages collaboration between the Grid and P2P Computing research communities, and (iii) to encourage research on the role of cloud computing as a new paradigm for distributed computing in e-Science.

Marco Danelutto, Frédéric Desprez, Vladimir Getov, Wolfgang Ziegler
A Perspective on the CoreGRID Grid Component Model

The Grid Component Model is a software component model designed partly in the context of the CoreGRID European Network of Excellence, as an extension of the Fractal model, to target the programming of large-scale distributed infrastructures such as computing grids [3]. These distributed-memory infrastructures, characterized by high latency, heterogeneity and sharing of resources, call for the efficient use of several CPUs at once to obtain high performance.

Françoise Baude
Towards Scheduling Evolving Applications

Most high-performance computing resource managers only allow applications to request a static allocation of resources. However, evolving applications have resource requirements which change (evolve) during their execution. Currently, such applications are forced to make an allocation based on their peak resource requirements, which leads to an inefficient resource usage. This paper studies whether it makes sense for resource managers to support evolving applications. It focuses on scheduling fully-predictably evolving applications on homogeneous resources, for which it proposes several algorithms and evaluates them based on simulations. Results show that resource usage and application response time can be significantly improved with short scheduling times.

Cristian Klein, Christian Pérez
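The inefficiency of static peak allocation that motivates the paper above can be shown with a toy calculation (the phase durations and node counts are hypothetical, not results from the paper): a static allocation reserves the peak node count for the whole run, while an evolving allocation tracks each phase's actual need.

```python
# Illustrative comparison of static (peak) vs. evolving resource
# allocation for an application whose requirements change per phase.
# All numbers are made up for the sketch.

phases = [  # (duration in hours, nodes actually needed)
    (2, 16),   # setup phase
    (6, 128),  # main computation (peak requirement)
    (4, 32),   # post-processing
]

total_hours = sum(d for d, _ in phases)
peak_nodes = max(n for _, n in phases)

static_node_hours = peak_nodes * total_hours          # allocate peak throughout
evolving_node_hours = sum(d * n for d, n in phases)   # allocate per phase

waste = 1 - evolving_node_hours / static_node_hours
print(f"static:   {static_node_hours} node-hours")    # 1536
print(f"evolving: {evolving_node_hours} node-hours")  # 928
print(f"wasted by static peak allocation: {waste:.0%}")
```

Under these made-up numbers roughly 40% of the static reservation sits idle, which is the kind of gap an evolving-application-aware resource manager can reclaim.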
Model Checking Support for Conflict Resolution in Multiple Non-functional Concern Management

When implementing autonomic management of multiple non-functional concerns, a trade-off must be found between the ability to develop management of the individual concerns independently (following the separation-of-concerns principle) and the detection and resolution of conflicts that may arise when the independently developed management code is combined. Here we discuss strategies to establish this trade-off and introduce a model-checking-based methodology aimed at simplifying the discovery and handling of conflicts arising from the deployment, within the same parallel application, of independently developed management policies. Preliminary results demonstrating the feasibility of the approach are shown.

Marco Danelutto, P. Kilpatrick, C. Montangero, L. Semini
Consistent Rollback Protocols for Autonomic ASSISTANT Applications

Nowadays, a central issue for applications executed on heterogeneous distributed platforms is ensuring that certain performance and reliability parameters are respected throughout the system's execution. A typical solution is to support application components with adaptation strategies able to select at run-time the best component version to execute. It is worth noting that the efficacy of a reconfiguration may depend on the time spent applying it: although a reconfiguration may lead to better steady-state behavior, its application could induce a transient violation of a QoS constraint. In this paper we show how consistent reconfiguration protocols can be derived for stream-based ASSISTANT applications, and we characterize their costs in terms of proper performance models.

Carlo Bertolli, Gabriele Mencagli, Marco Vanneschi
A Dynamic Resource Management System for Real-Time Online Applications on Clouds

We consider a challenging class of highly interactive virtual environments, also known as Real-Time Online Interactive Applications (ROIA). Popular examples of ROIA include multi-player online computer games, e-learning and training applications based on real-time simulations, etc. ROIA combine high demands on the scalability and real-time user interactivity with the problem of efficient and economic utilization of resources, which is difficult to achieve due to the changing number of users. We address these challenges by developing the dynamic resource management system RTF-RMS which implements load balancing for ROIA on Clouds. We illustrate how RTF-RMS chooses between three different load-balancing actions and implements Cloud resource allocation. We report experimental results on the load balancing of a multi-player online game in a Cloud environment using RTF-RMS.

Dominik Meiländer, Alexander Ploss, Frank Glinka, Sergei Gorlatch
Cloud Federations in Contrail

Cloud computing infrastructures support dynamic and flexible access to computational, network and storage resources. To date, several disjoint industrial and academic technologies provide infrastructure-level access to Clouds. Especially for industrial platforms, the evolution of de-facto standards goes together with worries about user lock-in to a platform. The Contrail project [6] proposes a federated and integrated approach to Clouds. In this work we present and motivate the architecture of Contrail federations. Contrail's goal is to minimize the burden on the user and increase the efficiency of using Cloud platforms by performing both vertical and horizontal integration. To this end, Contrail federations play a key role, allowing users to exploit resources belonging to different cloud providers, regardless of the providers' technology, through a homogeneous, secure interface. Vertical integration is achieved by developing both the Infrastructure- and Platform-as-a-Service levels within the project. A third key point is the adoption of a fully open-source approach toward technology and standards. Besides supporting user authentication and application deployment, Contrail federations aim at providing extended SLA management functionality by integrating the SLA management approach of the SLA@SOI project into the federation architecture.

Emanuele Carlini, Massimo Coppola, Patrizio Dazzi, Laura Ricci, Giacomo Righetti
Semi-automatic Composition of Ontologies for ASKALON Grid Workflows

Automatic workflow composition with the help of ontologies has been addressed by numerous researchers in the past. While ontologies are very useful for automatic and semi-automatic workflow composition, ontology creation itself remains a very important and complex task.

In this paper we present a novel tool to synthesize ontologies for the Abstract Grid Workflow Language (AGWL), which has been used for years to successfully create Grid workflow applications at a high level of abstraction. In order to semi-automatically generate ontologies, we use an AGWL Ontology (AGWO, an ontological description of the AGWL language), structural information from one or several input workflows of a given application domain, and semantic enrichment of the structural information with the help of the user. We present experiments in two separate application domains (movie rendering and meteorology) that demonstrate the effectiveness of our approach by semi-automatically generating ontologies which are then used to automatically create workflow applications.

Muhammad Junaid Malik, Thomas Fahringer, Radu Prodan
The Chemical Machine: An Interpreter for the Higher Order Chemical Language

The notion of chemical computing has evolved for more than two decades. From the seminal idea several models, calculi and languages have been developed and there are various proposals for applying chemical models in distributed problem solving where some sort of autonomy, self-evolving nature and adaptation is sought. While there are some experimental chemical implementations, most of these proposals remained at the paper-and-pencil stage. This paper presents a general purpose interpreter for the Higher Order Chemical Language. The design follows that of logic/functional languages and bridges the gap between the highly abstract chemical model and the physical machine by an abstract interpreter engine. As a novel approach the engine is based on a modified hierarchical production system and turns away from imperative languages.

Vilmos Rajcsányi, Zsolt Németh
Design and Performance of the OP2 Library for Unstructured Mesh Applications

OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance by re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the OP2 code generation and compiler framework which, given an application written using the OP2 API, generates efficient code for state-of-the-art hardware (e.g. GPUs and multi-core CPUs). Through a representative unstructured mesh application we demonstrate the capability of the compiler framework to utilize the same OP2 hardware-specific run-time support functionalities. Performance results show that the impact of this sharing of basic functionality is negligible.

Carlo Bertolli, Adam Betts, Gihan Mudalige, Mike Giles, Paul Kelly
Mining Association Rules on Grid Platforms

In this paper we propose a dynamic load-balancing strategy to enhance the performance of parallel association rule mining algorithms in the context of a Grid computing environment. This strategy is built upon a distributed model which incurs only small overheads in communication costs for load updates and for both data and work transfers. It also supports the heterogeneity of the system and is fault-tolerant.

Raja Tlili, Yahya Slimani

5th Workshop on System-Level Virtualization for High-Performance Computing (HPCVirt 2011)

5th Workshop on System-Level Virtualization for High Performance Computing (HPCVirt 2011)

The emergence of virtualization-enabled hardware, such as the latest generations of AMD and Intel processors, has raised significant interest in the High Performance Computing (HPC) community. In particular, system-level virtualization provides an opportunity to advance the design and development of operating systems, programming environments, administration practices, and resource management tools. This leads to several potential research topics for HPC, such as failure tolerance, system management, and solutions for porting applications to new HPC platforms.

The workshop on System-level Virtualization for HPC (HPCVirt 2011) is intended to be a forum for the exchange of ideas and experiences on the use of virtualization technologies for HPC, the challenges and opportunities offered by the development of system-level virtualization solutions themselves, as well as case studies in the application of system-level virtualization in HPC.

Stephen L. Scott, Geoffroy Vallée, Thomas Naughton
Performance Evaluation of HPC Benchmarks on VMware’s ESXi Server

A major obstacle to virtualizing HPC workloads is a concern about the performance loss due to virtualization. We will demonstrate that new features significantly enhance the performance and scalability of virtualized HPC workloads on VMware's virtualization platform. Specifically, we will discuss VMware ESXi Server performance for virtual machines with up to 64 virtual CPUs, as well as support for exposing virtual NUMA topology to guest operating systems, enabling the operating system and applications to make intelligent NUMA-aware decisions about memory allocation and process/thread placement. NUMA support is especially important for large VMs, which necessarily span host NUMA nodes on all modern hardware. We will show how the virtual NUMA topology is chosen to closely match the physical host topology, while preserving the now expected virtualization benefits of portability and load balancing. We show that exposing the virtual NUMA topology can lead to performance gains of up to 167%. Overall, we will show close-to-native performance on applications from the SPEC MPI V2.0 and SPEC OMP V3.2 benchmarks virtualized on our prototype VMware ESXi Server.

Qasim Ali, Vladimir Kiriansky, Josh Simons, Puneet Zaroo
Virtualizing Performance Counters

Virtual machines are becoming commonplace as a stable and flexible platform for running many workloads. As developers continue to move more workloads into virtual environments, they need ways to analyze the performance characteristics of those workloads. However, performance work can be hindered because standard profiling tools like VTune and the Linux Performance Counter Subsystem do not work in most modern hypervisors. These tools rely on the CPU's hardware performance counters, which are not currently exposed to guests by most hypervisors. This work discusses the challenges of virtualizing performance counters, which stem from the trap-and-emulate method of virtualization and the time-sharing of physical CPUs among multiple virtual CPUs. We propose an approach that addresses these issues to provide useful and intuitive information about guest performance and the relative costs of virtualization overheads.

Benjamin Serebrin, Daniel Hecht
A Case for Virtual Machine Based Fault Injection in a High-Performance Computing Environment

Large-scale computing platforms provide tremendous capabilities for scientific discovery. As applications and system software scale up to multi-petaflop and beyond to exascale platforms, failures will become much more common. This has given rise to a push in fault-tolerance and resilience research for high-performance computing (HPC) systems. This includes work on log analysis to identify types of failures, enhancements to the Message Passing Interface (MPI) to incorporate fault awareness, and a variety of fault-tolerance mechanisms that span redundant computation, algorithm-based fault tolerance, and advanced checkpoint/restart techniques.

While there is much work to be done on the FT/Resilience mechanisms for such large-scale systems, there is also a profound gap in the tools for experimentation. This gap is compounded by the fact that HPC environments have stringent performance requirements and are often highly customized. The tool chains for these systems are often tailored for the platform, and the operating environments typically contain many site- or machine-specific enhancements. Therefore, it is desirable to maintain a consistent execution environment to minimize end-user (scientist) interruption.

The work on system-level virtualization for HPC systems offers a unique opportunity to maintain a consistent execution environment via a virtual machine (VM). Recent work on virtualization for HPC has shown that low-overhead, high-performance systems can be realized [7, 15]. Virtualization also provides a clean abstraction for building experimental tools for investigating the effects of failures in HPC and for the related research on FT/Resilience mechanisms and policies. In this paper we discuss the motivation for tools to perform fault injection in an HPC context. We also present the design of a new fault injection framework that can leverage virtualization.

Thomas Naughton, Geoffroy Vallée, Christian Engelmann, Stephen L. Scott

HPPC 2011: 5th Workshop on Highly Parallel Processing on a Chip

HPPC 2011: 5th Workshop on Highly Parallel Processing on a Chip

Although the processor industry has spent the past ten years, more or less successfully, developing better and increasingly parallel multicore architectures, both the software community and educational institutions still appear to rely on the sequential computing paradigm as the primary mechanism for expressing (very often inherently parallel) functionality, especially in the arena of general-purpose computing. In that respect, parallel programming has remained a hobby of highly educated specialists and is still too often considered too difficult for the average programmer. The excuses are various: lack of education, lack of suitable easy-to-use tools, overly architecture-dependent mechanisms, a huge base of sequential legacy code, steep learning curves, and inefficient architectures. It is important for the scientific community to analyze the situation and understand whether the problem lies with hardware architectures, with software development tools and practices, or with both. Although we would be tempted to answer this question (and indeed try to do so elsewhere), there is a strong need for wider academic discussion of these topics and for the presentation of research results in scientific workshops and conferences.

Martti Forsell, Jesper Larsson Träff
Thermal Management of a Many-Core Processor under Fine-Grained Parallelism

In this paper, we present work in progress that studies the run-time impact of various dynamic thermal management (DTM) techniques on a proposed 1024-core XMT chip. XMT aims to improve single-task performance using fine-grained parallelism. Via simulations, we show that, relative to a general global scheme, speedups of up to 46% with a dedicated interconnection controller and 22% with distributed control of computing clusters are possible. Our findings lead to several high-level insights that can impact the design of a broader family of shared-memory many-core systems.

Fuat Keceli, Tali Moreshet, Uzi Vishkin
Mainstream Parallel Array Programming on Cell

We present the E♯ compiler and runtime library for the ‘F’ subset of the Fortran 95 programming language. ‘F’ provides first-class support for arrays, allowing E♯ to implicitly evaluate array expressions in parallel using the SPU co-processors of the Cell Broadband Engine. We present performance results from four benchmarks that all demonstrate absolute speedups over equivalent ‘C’ or Fortran versions running on the PPU host processor. A significant benefit of this straightforward approach is that a serial implementation of any code is always available, providing code longevity and a familiar development paradigm.

Paul Keir, Paul W. Cockshott, Andrew Richards
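The implicit data parallelism over whole-array expressions that the abstract above describes can be sketched generically (this illustrates the technique, not the E♯ runtime, which targets Cell SPUs rather than Python workers): a compiler splits an elementwise expression such as a*x + b into independent chunks.

```python
# Sketch of implicit parallel evaluation of a whole-array expression,
# in the spirit of 'F'-style array programming: y = a*x + b is split
# into chunks that independent workers evaluate. Illustration only;
# the real compiler offloads chunks to SPU co-processors.
from concurrent.futures import ThreadPoolExecutor

def eval_chunk(x, lo, hi, a, b):
    # Elementwise evaluation of a*x + b over one slice of the array.
    return [a * v + b for v in x[lo:hi]]

def parallel_expr(x, a, b, workers=4):
    n = len(x)
    bounds = [(i * n // workers, (i + 1) * n // workers)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda lh: eval_chunk(x, lh[0], lh[1], a, b), bounds)
    out = []
    for p in parts:  # reassemble chunks in order
        out.extend(p)
    return out

print(parallel_expr(list(range(8)), a=2, b=1))  # → [1, 3, 5, 7, 9, 11, 13, 15]
```

Because the array expression itself carries no loop-order constraints, the same source also admits the trivial serial evaluation, which is the "serial implementation is always available" property the abstract highlights.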
Generating GPU Code from a High-Level Representation for Image Processing Kernels

We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent optimizations such as global memory padding for memory coalescing and optimal memory bandwidth utilization. We evaluate the framework on several image filters, comparing generated code against highly-optimized CPU and GPU versions in the popular OpenCV library.

Richard Membarth, Anton Lokhmotov, Jürgen Teich
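The decoupled access/execute idea summarized above can be sketched in miniature (this is an illustration of the general concept, not the paper's actual C++ classes or CUDA/OpenCL output): the "execute" part is a pure function over neighborhood values, while the "access" part is stencil-offset metadata that a code generator could use to plan memory layout and coalescing.

```python
# Sketch of a decoupled access/execute kernel description (hypothetical
# API): access metadata = relative (dy, dx) offsets; execute = weighted
# sum over the fetched neighborhood. A plain interpreter applies it here.

def make_kernel(offsets, weights):
    """offsets: stencil access pattern; weights: the execute stage."""
    def run(img, y, x):
        h, w = len(img), len(img[0])
        acc = 0.0
        for (dy, dx), wgt in zip(offsets, weights):
            yy = min(max(y + dy, 0), h - 1)   # clamp reads at the border
            xx = min(max(x + dx, 0), w - 1)
            acc += wgt * img[yy][xx]
        return acc
    return run

# A 3x3 box blur described purely by its access pattern and weights.
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
blur = make_kernel(offsets, [1 / 9.0] * 9)

img = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
out = [[blur(img, y, x) for x in range(3)] for y in range(3)]
```

Separating the two parts is what lets a translator emit device-specific code: the offset metadata determines which memory the GPU version must stage, independently of the arithmetic performed per pixel.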
A Greedy Heuristic Approximation Scheduling Algorithm for 3D Multicore Processors

In this paper, we propose a greedy heuristic approximation scheduling algorithm for future multicore processors. It is expected that hundreds of cores will be integrated on a single chip, known as a Chip Multiprocessor (CMP). To reduce on-chip communication delay, 3D integration with Through-Silicon Vias (TSVs) has been introduced to replace its 2D counterpart, and multiple functional layers can be stacked in a 3D CMP. However, operating system process scheduling, one of the most important design issues for CMP systems, has not been well addressed for such systems. We define a model for future 3D CMPs, based on which a scheduling algorithm is proposed to reduce cache access latencies and the delay of inter-process communication (IPC). We explore different scheduling possibilities and discuss the advantages and disadvantages of our algorithm. We present benchmark results using a cycle-accurate full-system simulator with realistic workloads. Experiments show that under two workloads, the execution times of our scheduling in two configurations (2 and 4 threads) are reduced by 15.58% and 8.13% respectively, compared with other scheduling schemes. Our study provides a guideline for designing scheduling algorithms for 3D multicore processors.

Thomas Canhao Xu, Pasi Liljeberg, Hannu Tenhunen
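The kind of placement heuristic discussed above can be illustrated by a greedy assignment of communicating threads to nearby cores in a 3D mesh. This is a toy sketch of the general idea, not the authors' algorithm; `greedy_place`, the traffic map `comm`, and the Manhattan-distance cost are all assumptions of this illustration.

```python
# Toy greedy placement of communicating threads onto a 3D mesh of cores
# (illustrative sketch only; not the paper's scheduling algorithm).
from itertools import product

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def greedy_place(threads, comm, dims):
    """threads: thread ids; comm[(i, j)]: traffic volume between i and j;
    dims: (x, y, z) extent of the 3D core mesh. Returns thread -> core."""
    free = set(product(*(range(d) for d in dims)))
    placement = {}
    # Place the most communication-heavy threads first.
    order = sorted(threads,
                   key=lambda t: -sum(v for (i, j), v in comm.items()
                                      if t in (i, j)))
    for t in order:
        def cost(core):
            # Weighted distance to already-placed communication partners.
            total = 0
            for (i, j), v in comm.items():
                if t == i and j in placement:
                    total += v * manhattan(core, placement[j])
                elif t == j and i in placement:
                    total += v * manhattan(core, placement[i])
            return total
        best = min(sorted(free), key=cost)  # sorted() for determinism
        placement[t] = best
        free.remove(best)
    return placement
```

Heavily communicating threads end up on adjacent cores, which is the intuition behind reducing IPC delay in a stacked 3D topology.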

Algorithms and Programming Tools for Next-Generation High-Performance Scientific Software HPSS 2011

Algorithms and Programming Tools for Next-Generation High-Performance Scientific Software HPSS 2011

The workshop Algorithms and Programming Tools for Next-Generation High-Performance Scientific Software (HPSS) focuses on recent advances in algorithms and programming tools development for next-generation high-performance scientific software as enabling technologies for new insights into Computational Science.

Stefania Corsaro, Pasqua D’Ambra, Francesca Perla
European Exascale Software Initiative: Numerical Libraries, Solvers and Algorithms

Computers with sustained Petascale performance are now available and it is expected that hardware will be developed with a peak capability in the Exascale range by around 2018. However, the complexity, hierarchical nature, and probable heterogeneity of these machines pose great challenges for the development of software to exploit these architectures.

This was recognized some years ago by the IESP (International Exascale Software Project) initiative and the European response to this has been a collaborative project called EESI (European Exascale Software Initiative). This initiative began in 2010 and has submitted its final report to the European Commission with a final conference in Barcelona in October 2011. The main goals of EESI are to build a European vision and roadmap to address the international outstanding challenge of performing scientific computing on the new generation of computers.

The main activity of the EESI is in eight working groups, four on applications and four on supporting technologies. We first briefly review these eight chapters before discussing in more detail the work of Working Group 4.3 on Numerical Libraries, Solvers and Algorithms. Here we will look at the principal areas, the challenges of Exascale and possible ways to address these, and the resources that will be needed.

Iain S. Duff
On Reducing I/O Overheads in Large-Scale Invariant Subspace Projections

Obtaining highly accurate predictions on properties of light atomic nuclei using the Configuration Interaction (CI) method requires computing the lowest eigenvalues and associated eigenvectors of a large many-body nuclear Hamiltonian, H. One particular approach, the J-scheme, requires the projection of the H matrix into an invariant subspace. Since the matrices can be very large, enormous computing power is needed while significant stresses are put on the memory and I/O sub-systems. By exploiting the inherent localities in the problem and making use of the MPI one-sided communication routines backed by RDMA operations available in the new parallel architectures, we show that it is possible to reduce the I/O overheads drastically for large problems. This is demonstrated in the subspace projection phase of J-scheme calculations on the 6Li nucleus, where our new implementation based on one-sided MPI communications outperforms the previous I/O based implementation by almost a factor of 10.

Hasan Metin Aktulga, Chao Yang, Ümit V. Çatalyürek, Pieter Maris, James P. Vary, Esmond G. Ng
Enabling Next-Generation Parallel Circuit Simulation with Trilinos

The Xyce Parallel Circuit Simulator, which has demonstrated scalable circuit simulation on hundreds of processors, heavily leverages the high-performance scientific libraries provided by Trilinos. With the move towards multi-core CPUs and GPU technology, retaining this scalability on future parallel architectures will be a challenge. This paper will discuss how Trilinos is an enabling technology that will optimize the trade-off between effort and impact for application codes, like Xyce, in their transition to becoming next-generation simulation tools.

Chris Baker, Erik Boman, Mike Heroux, Eric Keiter, Siva Rajamanickam, Rich Schiek, Heidi Thornquist
DAG-Based Software Frameworks for PDEs

The task-based approach to software and parallelism is well-known and has been proposed as a potential candidate, named the silver model, for exascale software. This approach is not yet widely used in the large-scale multi-core parallel computing of complex systems of partial differential equations. After surveying task-based approaches we investigate how well the Uintah software and an extension named Wasatch fit in the task-based paradigm and how well they perform on large scale parallel computers. The conclusion is that these approaches show great promise for petascale but that considerable algorithmic challenges remain.

Martin Berzins, Qingyu Meng, John Schmidt, James C. Sutherland
On Partitioning Problems with Complex Objectives

Hypergraph and graph partitioning tools are used to partition work for the efficient parallelization of many sparse matrix computations. Most of the time, the objective function minimized by these tools relates to reducing the communication requirements, and the balancing constraints they satisfy relate to balancing the work or memory requirements. Sometimes, however, the balance objective is a complex function of the partition. We mention some important classes of parallel sparse matrix computations that have such balance objectives. For these cases, the current state-of-the-art partitioning tools fall short of being adequate. To the best of our knowledge, there is only a single algorithmic framework in the literature that addresses such balance objectives. We propose another algorithmic framework to tackle complex objectives and experimentally investigate the proposed framework.

Kamer Kaya, François-Henry Rouet, Bora Uçar
A Communication-Avoiding Thick-Restart Lanczos Method on a Distributed-Memory System

The Thick-Restart Lanczos (TRLan) method is an effective method for solving large-scale Hermitian eigenvalue problems. On a modern computer, communication can dominate the solution time of TRLan. To enhance the performance of TRLan, we develop CA-TRLan that integrates communication-avoiding techniques into TRLan. To study the numerical stability and solution time of CA-TRLan, we conduct numerical experiments using both synthetic diagonal matrices and matrices from the University of Florida sparse matrix collection. Our experimental results on up to 1,024 processors of a distributed-memory system demonstrate that CA-TRLan can achieve speedups of up to three over TRLan while maintaining numerical stability.

Ichitaro Yamazaki, Kesheng Wu
Spherical Harmonic Transform with GPUs

We describe an algorithm for computing an inverse spherical harmonic transform suitable for graphics processing units (GPUs). We use CUDA and base our implementation on a Fortran90 routine included in a publicly available parallel package, s2hat. We focus our attention on two major sequential steps involved in the transforms' computation, retaining the efficient parallel framework of the original code. We detail optimization techniques used to enhance the performance of the CUDA-based code and contrast them with those implemented in the Fortran90 version. We present performance comparisons of a single CPU plus GPU unit with the s2hat code running on either a single processor or 4 processors. In particular, we find that the latest generation of GPUs, such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 times with respect to s2hat executed on one core, and by as much as 5.5 times with respect to s2hat on 4 cores, with the overall performance being limited by the Fast Fourier transforms. The work presented here has been performed in the context of Cosmic Microwave Background simulations and analysis. However, we expect that the developed software will be of more general interest and applicability.

Ioan Ovidiu Hupca, Joel Falcou, Laura Grigori, Radek Stompor
Design Patterns for Scientific Computations on Sparse Matrices

We discuss object-oriented software design patterns in the context of scientific computations on sparse matrices. Design patterns arise when multiple independent development efforts produce very similar designs, yielding an evolutionary convergence onto a good solution: a flexible, maintainable, high-performance design. We demonstrate how to engender these traits by implementing an interface for sparse matrix computations on NVIDIA GPUs starting from an existing sparse matrix library. We also present initial performance results.

Davide Barbieri, Valeria Cardellini, Salvatore Filippone, Damian Rouson
High-Performance Matrix-Vector Multiplication on the GPU

In this paper, we develop a high-performance GPU kernel for one of the most popular dense linear algebra operations, the matrix-vector multiplication. The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for scientific computing. We show that achieving high performance for dense matrix-vector multiplication is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We also show that auto-tuning can be successfully applied to the GPU kernel so that it performs well for all matrix shapes and sizes.

Hans Henrik Brandenborg Sørensen
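As a loose illustration of the auto-tuning idea mentioned above (not the paper's CUDA kernel), one can benchmark a kernel over several candidate block sizes and keep the fastest; in real GPU code the candidates would be launch configurations rather than Python loop chunks. All names below are hypothetical.

```python
# Toy auto-tuning loop: time a row-blocked matrix-vector product for several
# block sizes and pick the fastest (illustrative only; a GPU auto-tuner would
# sweep thread-block and grid configurations instead).
import time

def blocked_gemv(A, x, block):
    """Dense matrix-vector product, processing `block` rows per chunk."""
    y = []
    for i in range(0, len(A), block):
        for row in A[i:i + block]:
            y.append(sum(a * b for a, b in zip(row, x)))
    return y

def autotune(A, x, candidates=(8, 32, 128)):
    """Return the candidate block size with the lowest measured runtime."""
    best, best_t = None, float("inf")
    for b in candidates:
        t0 = time.perf_counter()
        blocked_gemv(A, x, b)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = b, dt
    return best
```

The point is the tuning loop itself: the best configuration depends on the matrix shape and the hardware, so it is measured rather than guessed.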
Relaxed Synchronization with Ordered Read-Write Locks

This paper promotes the first stand-alone implementation of our adaptive tool for synchronization, ordered read-write locks (ORWL). It provides new synchronization methods for resource-oriented parallel or distributed algorithms, for which it allows an implicit deadlock-free and equitable control of a protected resource and provides means to couple lock objects and data tightly. A typical application that uses this framework will run a number of loosely coupled tasks that are exclusively regulated by the data flow. We conducted experiments to prove the validity, efficiency and scalability of our implementation.

Jens Gustedt, Emmanuel Jeanvoine
The Parallel C++ Statistical Library ‘QUESO’: Quantification of Uncertainty for Estimation, Simulation and Optimization

QUESO is a collection of statistical algorithms and programming constructs supporting research into the uncertainty quantification (UQ) of models and their predictions. It has been designed with three objectives: it should (a) be sufficiently abstract in order to handle a large spectrum of models, (b) be algorithmically extensible, allowing an easy insertion of new and improved algorithms, and (c) take advantage of parallel computing, in order to handle realistic models. Such objectives demand a combination of an object-oriented design with robust software engineering practices. QUESO is written in C++, uses MPI, and leverages libraries already available to the scientific community. We describe some UQ concepts, present QUESO, and list planned enhancements.

Ernesto E. Prudencio, Karl W. Schulz
Use of HPC-Techniques for Large-Scale Data Migration

Any re-design of a distributed legacy system requires a migration which involves numerous complex data replication and transformation steps. Migration procedures can become quite difficult and time-consuming, especially when the setup (i.e., the employed databases, encodings, formats, etc.) of the legacy and the target system fundamentally differ, which is often the case with finance data grown over decades. We report on experiences from a real-world project: the recent migration of a customer loyalty system from a COBOL-operated mainframe to a modern service-oriented architecture. In this context, we present our easy-to-adopt solution for running most replication steps in a high-performance manner: the QuickApply HPC software, which helps minimize the replication time and, thereby, the overall downtime of the migration. Business processes can be kept up and running most of the time, while pre-extracted data already pass through a variety of platforms and representations toward the target system. We combine the advantages of traditional migration approaches: transformations which require the interruption of business processes are performed with static data only, can be undone in case of a failure, and terminate quickly due to the use of parallel processing.

Jan Dünnweber, Valentin Mihaylov, René Glettler, Volker Maiborn, Holger Wolff

Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar 2011)

Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar 2011)

Heterogeneity is emerging as one of the most profound and challenging characteristics of today’s parallel environments. From the macro level, where networks of distributed computers composed of diverse node architectures are interconnected with potentially heterogeneous networks, to the micro level, where deeper memory hierarchies and various accelerator architectures are increasingly common, the impact of heterogeneity on all computing tasks is increasing rapidly. Traditional parallel algorithms, programming environments and tools, designed for legacy homogeneous multiprocessors, can at best achieve a small fraction of the efficiency and potential performance we should expect from parallel computing in tomorrow’s highly diversified and mixed environments. New ideas, innovative algorithms, and specialized programming environments and tools are needed to efficiently use these new and multifarious parallel architectures. The workshop is intended to be a forum for researchers working on algorithms, programming languages, tools, and theoretical models aimed at efficiently solving problems on heterogeneous networks.

This volume contains the papers presented at HeteroPar’11: Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms, held on August 28, 2011 in Bordeaux.

George Bosilca
A Genetic Algorithm with Communication Costs to Schedule Workflows on a SOA-Grid

In this paper we study the problem of scheduling a collection of workflows, identical or not, on a SOA (Service Oriented Architecture) grid. A workflow (job) is represented by a directed acyclic graph (DAG) with typed tasks. All of the grid hosts are able to process a set of typed tasks with unrelated processing costs and to transmit files through communication links for which the communication times are not negligible. The goal of our study is to minimize the maximum completion time (makespan) of the workflows. To solve this problem we propose a genetic approach. The contributions of this paper are both the design of a Genetic Algorithm that takes communication costs into account and its performance analysis.

Jean-Marc Nicod, Laurent Philippe, Lamiel Toch
An Extension of XcalableMP PGAS Language for Multi-node GPU Clusters

A GPU is a promising device for further increasing computing performance in the high performance computing field. Currently, many programming languages have been proposed for GPUs offloaded from the host, such as CUDA. However, parallel programming with a multi-node GPU cluster, where each node has one or more GPUs, is hard work. Users have to describe multi-level parallelism, both between nodes and within the GPU, using MPI and a GPGPU language like CUDA. In this paper, we propose a parallel programming language targeting multi-node GPU clusters. We extend XcalableMP, a parallel PGAS (Partitioned Global Address Space) programming language for PC clusters, to provide a productive parallel programming model for multi-node GPU clusters. Our performance evaluation with the N-body problem demonstrated that not only does our model achieve scalable performance, but it also increases productivity, since it requires only small modifications to the serial code.

Jinpil Lee, Minh Tuan Tran, Tetsuya Odajima, Taisuke Boku, Mitsuhisa Sato
Performance Evaluation of List Based Scheduling on Heterogeneous Systems

This paper addresses the problem of evaluating the schedules produced by list based scheduling algorithms against those produced by metaheuristic algorithms. Task scheduling in heterogeneous systems is an NP-hard problem; therefore, several heuristic approaches have been proposed to solve it. These heuristics are categorized into several classes, such as list based, clustering and task duplication scheduling. Here we consider the list scheduling approach. The objective of this study is to assess the solutions obtained by list based algorithms in order to verify the room for improvement that new heuristics may have, considering the solutions obtained with metaheuristics, which are higher time complexity approaches. We conclude that for a low Communication to Computation Ratio (CCR) of 0.1, the schedules given by the list scheduling approach are on average close to the metaheuristic solutions. For CCRs up to 1, the solutions are less than 11% worse than the metaheuristic solutions, showing that it may not be worth using higher complexity approaches and that the room for improvement is narrow.

Hamid Arabnejad, Jorge G. Barbosa
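A minimal member of the list-scheduling class evaluated above is an earliest-finish-time rule over unrelated processors. This HEFT-style sketch ignores communication costs and is not the paper's implementation; `list_schedule` and its inputs are hypothetical.

```python
# Compact HEFT-style list scheduler for unrelated heterogeneous processors
# (illustrative; communication costs between tasks are deliberately omitted).
def list_schedule(tasks, deps, cost):
    """tasks: ids in a valid topological order; deps[t]: predecessors of t;
    cost[t][p]: execution time of task t on processor p.
    Returns (finish times, task -> processor placement)."""
    nproc = len(next(iter(cost.values())))
    free_at = [0.0] * nproc          # when each processor becomes idle
    finish, place = {}, {}
    for t in tasks:
        ready = max((finish[d] for d in deps.get(t, ())), default=0.0)
        # Earliest-finish-time rule: try every processor, keep the best.
        best_p = min(range(nproc),
                     key=lambda p: max(free_at[p], ready) + cost[t][p])
        start = max(free_at[best_p], ready)
        finish[t] = start + cost[t][best_p]
        free_at[best_p] = finish[t]
        place[t] = best_p
    return finish, place
```

Heuristics of this shape run in low polynomial time, which is why the paper asks how far their schedules sit from the (much more expensive) metaheuristic solutions.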
Column-Based Matrix Partitioning for Parallel Matrix Multiplication on Heterogeneous Processors Based on Functional Performance Models

In this paper we present a new data partitioning algorithm to improve the performance of parallel matrix multiplication of dense square matrices on heterogeneous clusters. Existing algorithms either use single-speed performance models, which are too simplistic, or they do not attempt to minimise the total volume of communication. The functional performance model (FPM) is more realistic than single-speed models because it integrates many important features of heterogeneous processors such as processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. To load balance the computations, the new algorithm uses FPMs to compute the area of the rectangle that is assigned to each processor. The total volume of communication is then minimised by choosing a shape and ordering so that the sum of the half-perimeters is minimised. Experimental results demonstrate that this new algorithm can reduce the total execution time of parallel matrix multiplication in comparison to existing algorithms.

David Clarke, Alexey Lastovetsky, Vladimir Rychkov
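The load-balancing step can be illustrated in miniature: given size-dependent speed functions, find per-processor areas that equalize execution time. The bisection scheme and the speed functions below are assumptions of this sketch, not the authors' algorithm.

```python
# Sketch of functional-performance-model load balancing: choose areas a_i so
# that the times a_i / s_i(a_i) are equal, where s_i(a) is a size-dependent
# speed function (illustrative only; not the paper's partitioning algorithm).
def area_for_time(s, t, a_max=1e9, iters=60):
    """Bisect for the area a with a / s(a) == t (time is increasing in a)."""
    lo, hi = 0.0, a_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid / s(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def balance(speeds, total, iters=60):
    """Bisect on the common execution time t until the areas sum to total."""
    lo, hi = 0.0, 1e9
    for _ in range(iters):
        t = (lo + hi) / 2
        if sum(area_for_time(s, t) for s in speeds) < total:
            lo = t
        else:
            hi = t
    t = (lo + hi) / 2
    return [area_for_time(s, t) for s in speeds]
```

With constant speeds this degenerates to proportional partitioning; the FPM case simply lets `s(a)` bend with problem size (e.g. to model paging), which the same bisection handles.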
A Framework for Distributing Agent-Based Simulations

Agent-based simulation models are an increasingly popular tool for research and management in many different and diverse fields. In executing such simulations, "speed" is one of the most general and important issues. The traditional answer to this issue is to invest resources in deploying a dedicated installation of dedicated computers. In this paper we present a framework that is a parallel version of MASON, a library for writing and running agent-based simulations.

Gennaro Cordasco, Rosario De Chiara, Ada Mancuso, Dario Mazzeo, Vittorio Scarano, Carmine Spagnuolo
Parallel Sparse Linear Solver GMRES for GPU Clusters with Compression of Exchanged Data

GPU clusters have become attractive parallel platforms for high performance computing due to their ability to compute faster than CPU clusters. We use this architecture to accelerate the mathematical operations of the GMRES method for solving large sparse linear systems. However, the parallel sparse matrix-vector product of GMRES causes overheads in CPU/CPU and GPU/CPU communications when exchanging large shared vectors of unknowns between the GPUs of the cluster. Since a sparse matrix-vector product does not often need all the unknowns of the vector, we propose to use data compression and decompression operations on the shared vectors, in order to exchange only the needed unknowns. In this paper we present a new parallel GMRES algorithm for GPU clusters that uses compressed vectors. Our experimental results show that the GMRES solver is more efficient when using the data compression technique on large shared vectors.

Jacques M. Bahi, Raphaël Couturier, Lilia Ziane Khodja
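The compression idea above lends itself to a small sketch: derive from the local sparsity pattern which remote unknowns are actually referenced, and pack only those entries for exchange. The function names and the row/column encoding here are illustrative, not from the paper.

```python
# Illustrative compression of a shared-vector exchange: each process sends
# only the remote unknowns its local rows actually reference (a sketch of the
# idea, not the paper's CUDA/MPI implementation).
def needed_indices(local_rows, owned):
    """Column indices referenced by local rows but owned by another process."""
    need = set()
    for cols in local_rows.values():
        need.update(c for c in cols if c not in owned)
    return sorted(need)

def compress(x, indices):
    """Pack only the needed entries of the shared vector."""
    return [x[i] for i in indices]

def decompress(halo, values, indices):
    """Scatter received entries into the local halo."""
    for i, v in zip(indices, values):
        halo[i] = v

# Example: this process owns unknowns 0-3; its rows reference 5 and 7 remotely.
rows = {0: [0, 1, 5], 1: [1, 2], 2: [2, 3, 7], 3: [3]}
idx = needed_indices(rows, owned={0, 1, 2, 3})
print(idx)  # [5, 7]
```

Only two entries cross the network instead of the full remote vector, which is exactly the saving the paper exploits for large shared vectors.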
Two-Dimensional Discrete Wavelet Transform on Large Images for Hybrid Computing Architectures: GPU and CELL

The Discrete Wavelet Transform (DWT) has gained momentum in signal processing and image compression over the last decade, bringing the concept up to the level of the new image coding standard JPEG2000. Thanks to the many added values of the DWT, in particular its inherent multi-resolution nature, wavelet coding schemes are suitable for various applications where scalability and tolerable degradation are relevant. Moreover, as we demonstrate in this paper, it can be used as a perfect benchmarking procedure for more sophisticated data compression and multimedia applications using General Purpose Graphical Processor Units (GPGPUs). Thus, in this paper we show and compare experiments performed on reference implementations of the DWT on the Cell Broadband Engine Architecture (Cell B.E.) and NVIDIA Graphical Processing Units (GPUs). The achieved results show clearly that although both the GPU and the Cell B.E. are considered representatives of the same class of hybrid architecture devices, they differ greatly in the programming style and optimization techniques that need to be taken into account during development. In order to show the speedup, the parallel algorithm has been compared to sequential computation performed on the x86 architecture.

Marek Błażewicz, Miłosz Ciżnicki, Piotr Kopta, Krzysztof Kurowski, Paweł Lichocki
Scheduling Divisible Loads on Heterogeneous Desktop Systems with Limited Memory

This paper addresses the problem of scheduling discretely divisible applications on heterogeneous desktop systems with limited memory, relying on realistic performance models for computation and communication through bidirectional asymmetric full-duplex buses. We propose an algorithm for multi-installment processing with multi-distributions that allows computation and communication to be efficiently overlapped at the device level with respect to the supported concurrency. The presented approach was experimentally evaluated for a real application, a batch of 2D FFTs collaboratively executed on a Graphics Processing Unit and a multi-core CPU. The experimental results obtained show the ability of the proposed approach to outperform the optimal implementation by about 4 times, whereas it is not possible with the current state-of-the-art approaches to determine a load-balanced distribution.

Aleksandar Ilic, Leonel Sousa
Peer Group and Fuzzy Metric to Remove Noise in Images Using Heterogeneous Computing

In this paper, we report a study on the parallelization of an algorithm for removing impulsive noise from images. The algorithm is based on the concepts of peer group and fuzzy metric. Many sequential algorithms have been proposed to remove noise, but their computational cost is excessive for the real-time processing of large images. We have developed implementations using Open Multi-Processing (OpenMP) and the Compute Unified Device Architecture (CUDA) for Graphics Processing Units (GPUs): for a multi-core CPU, for a multi-GPU system (several GPUs), and for a combination of both. These implementations were also compared across different image sizes in order to find the settings with the best performance. A study was made of using shared memory and texture memory to minimize access time to data in GPU global memory. The results show that a greater number of Mpixels/second is processed when the image is distributed across the multi-core CPU and multiple GPUs.

Ma. Guadalupe Sánchez, Vicente Vidal, Jordi Bataller
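The peer-group concept can be sketched for a grayscale image: a pixel with too few neighbors within a tolerance is treated as impulsive noise and replaced by the neighborhood median. This omits the fuzzy metric and color handling of the actual algorithm, and the thresholds `d` and `k` are illustrative.

```python
# Minimal sequential sketch of peer-group impulse-noise filtering on a
# grayscale image stored as a list of rows (illustrative; the paper's
# versions use fuzzy metrics, color images, and OpenMP/CUDA parallelism).
def peer_group_filter(img, d=20, k=3):
    """A pixel with fewer than k neighbors within distance d is replaced
    by the median of its 8-neighborhood; borders are left untouched."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nbrs = [img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0)]
            peers = [v for v in nbrs if abs(v - img[y][x]) <= d]
            if len(peers) < k:                        # too few peers: impulse
                out[y][x] = sorted(nbrs)[len(nbrs) // 2]   # median replace
    return out
```

Because each pixel is processed independently from the read-only input image, the loop parallelizes naturally across CPU threads or GPU threads, which is what the paper's implementations exploit.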
Estimation of MPI Application Performance on Volunteer Environments

Emerging MPI libraries, such as VolpexMPI and P2P MPI, allow message passing parallel programs to execute effectively in heterogeneous volunteer environments despite frequent failures. However, the performance of message passing codes varies widely in a volunteer environment, depending on the application characteristics and the computation and communication characteristics of the nodes and the interconnection network. This paper has the dual goal of developing and validating a tool chain to estimate performance of MPI codes in a volunteer environment and analyzing the suitability of the class of computations represented by NAS benchmarks for volunteer computing. The framework is deployed to estimate performance in a variety of possible volunteer configurations, including some based on the measured parameters of a campus volunteer pool. The results show slowdowns by factors between 2 and 10 for different NAS benchmark codes for execution on a realistic volunteer campus pool as compared to dedicated clusters.

Girish Nandagudi, Jaspal Subhlok, Edgar Gabriel, Judit Gimenez
Backmatter
Metadata
Title
Euro-Par 2011: Parallel Processing Workshops
Edited by
Michael Alexander
Pasqua D’Ambra
Adam Belloum
George Bosilca
Mario Cannataro
Marco Danelutto
Beniamino Di Martino
Michael Gerndt
Emmanuel Jeannot
Raymond Namyst
Jean Roman
Stephen L. Scott
Jesper Larsson Traff
Geoffroy Vallée
Josef Weidendorfer
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-29737-3
Print ISBN
978-3-642-29736-6
DOI
https://doi.org/10.1007/978-3-642-29737-3
