
2012 | Book

Parallel Processing and Applied Mathematics

9th International Conference, PPAM 2011, Torun, Poland, September 11-14, 2011. Revised Selected Papers, Part I

Editors: Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, Jerzy Waśniewski

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This two-volume-set (LNCS 7203 and 7204) constitutes the refereed proceedings of the 9th International Conference on Parallel Processing and Applied Mathematics, PPAM 2011, held in Torun, Poland, in September 2011. The 130 revised full papers presented in both volumes were carefully reviewed and selected from numerous submissions. The papers address issues such as parallel/distributed architectures and mobile computing; numerical algorithms and parallel numerics; parallel non-numerical algorithms; tools and environments for parallel/distributed/grid computing; applications of parallel/distributed computing; applied mathematics, neural networks and evolutionary computing; history of computing.

Table of Contents

Frontmatter

A Look Back: 57 Years of Scientific Computing

A Look Back: 57 Years of Scientific Computing

This document outlines my 57-year career in computational mathematics, a career that took me from Poland to Canada and finally to Denmark. It of course spans a period in which both hardware and software developed enormously. Along the way I was fortunate to be faced with fascinating technical challenges and privileged to be able to share them with inspiring colleagues. From the beginning, my work to a great extent was concerned, directly or indirectly, with computational linear algebra, an interest I maintain even today.

Jerzy Waśniewski

Parallel/Distributed Architectures and Mobile Computing

Modeling a Leadership-Scale Storage System

Exascale supercomputers will have the potential for billion-way parallelism. While physical implementations of these systems are currently not available, HPC system designers can develop models of exascale systems to evaluate system design points. Modeling these systems and associated subsystems is a significant challenge. In this paper, we present the Co-design of Exascale Storage System (CODES) framework for evaluating exascale storage system design points. As part of our early work with CODES, we discuss the use of the CODES framework to simulate leadership-scale storage systems in a tractable amount of time using parallel discrete-event simulation. We describe the current storage system models and protocols included with the CODES framework and demonstrate the use of CODES through simulations of an existing petascale storage system.

Ning Liu, Christopher Carothers, Jason Cope, Philip Carns, Robert Ross, Adam Crume, Carlos Maltzahn
Combining Optimistic and Pessimistic Replication

This paper presents a concept of combining the pessimistic and optimistic approaches to replication. Optimistic replication allows for tentative system states, which increases availability and efficiency, but makes the behaviour of the system less predictable, even if some operations seem completed. To enable more stable results, pessimistic and optimistic modes of operation are distinguished. Operations issued in the optimistic mode accept or produce tentative states, while operations issued in the pessimistic mode appear as completed in a stable state, termed committed. Orthogonally, to refine expectations of the results, modifications are specified as either synchronous or asynchronous, and reads as either synchronised or immediate.

Marcin Bazydło, Szymon Francuzik, Cezary Sobaniec, Dariusz Wawrzyniak
K-Resilient Session Guarantees Synchronization Protocol for Mobile Ad-Hoc Networks

Session guarantees define data consistency in a distributed system as required by a single, mobile client. Consistency protocols of session guarantees are built from two components: one is aimed at providing safety (the guarantees) and the other at providing liveness (data synchronization). This paper presents a new k-resilient data synchronization protocol designed for mobile ad-hoc networks and other systems, where crash-stop failures occur. The application of the protocol is targeted at, but not limited to, various rescue and emergency operations.

Jerzy Brzeziński, Dariusz Dwornikowski, Łukasz Piątkowski, Grzegorz Sobański
On Time Constraints of Reliable Broadcast Protocols for Ad Hoc Networks with the Liveness Property

In this paper we consider a formal model of ad hoc systems and its liveness property, defined using the concept of dynamic sets. In this context we analyse reliable broadcast protocols dedicated to this kind of network. Solutions proposed so far assume that the minimum time of direct connectivity between any neighbouring nodes is much longer than the maximum message transmission time. This assumption, however, conceals how the required minimum time of direct communication depends on some system parameters. Therefore, in this paper we show precisely how the minimum time of direct connectivity depends on the total number of hosts in a network and on the total number of messages that each node can disseminate concurrently.

Jerzy Brzeziński, Michał Kalewski, Dariusz Wawrzyniak
Data Transfers on the Fly for Hierarchical Systems of Chip Multi-Processors

Local and global communication between computing cores is an essential problem for efficient parallel computation in massively parallel many-core systems built from Chip Multi-Processor (CMP) modules interconnected by global networks. The paper presents new methods for data communication inside and between CMP modules. At the level of communication between CMP modules, a special network implements communication between the external shared memories of CMP modules, with simultaneous reads on the fly into the L2 data caches and main memories of CMP modules. A similar mechanism improves local communication between shared memory modules and data caches inside CMPs.

Marek Tudruj, Łukasz Maśko

Numerical Algorithms

New Level-3 BLAS Kernels for Cholesky Factorization

Some linear algebra libraries use Level-2 routines during the factorization part of any Level-3 block factorization algorithm. We discuss four Level-3 routines called DPOTF3, a new type of BLAS, for the factorization part of a block Cholesky factorization algorithm for use by the LAPACK routine DPOTRF or for BPF (Blocked Packed Format) Cholesky factorization. The four DPOTF3 routines are Fortran routines. Our main result is that the performance of the DPOTF3 routines is still increasing when the performance of the Level-2 LAPACK routine DPOTF2 starts to decrease. This means that the performance of DGEMM, DSYRK, and DTRSM will increase due to their use of larger block sizes and to making fewer passes over the matrix elements. We present corroborating performance results for DPOTF3 versus DPOTF2 on a variety of common platforms. The four DPOTF3 routines are based on simple register blocking; different platforms have different numbers of registers, and so our four routines have different register blockings. Blocked Packed Format (BPF) is discussed. LAPACK routines for _POTRF and _PPTRF using BPF instead of full and packed format are shown to be trivial modifications of the LAPACK _POTRF source codes. Upper BPF is shown to be identical to square block packed format. Performance results for DBPTRF and DPOTRF for large n show that the DPOTF3 routines do increase performance for large n.

Fred G. Gustavson, Jerzy Waśniewski, José R. Herrero
Parallel Preconditioner for Nonconforming Adini Discretization of a Plate Problem on Nonconforming Meshes

In this paper we present a domain decomposition parallel preconditioner for a discretization of a plate problem on nonconforming meshes in 2D. The local discretizations are Adini nonconforming plate finite elements. On the interfaces between adjacent subdomains two mortar conditions are imposed. The condition number of the preconditioned problem is almost optimal, i.e., it is bounded poly-logarithmically with respect to the mesh parameters.

Leszek Marcinkowski
Incomplete Cyclic Reduction of Banded and Strictly Diagonally Dominant Linear Systems

The ScaLAPACK library contains a pair of routines for solving banded linear systems which are strictly diagonally dominant by rows. Mathematically, the algorithm is complete block cyclic reduction corresponding to a particular block partitioning of the system. In this paper we extend Heller’s analysis of incomplete cyclic reduction for block tridiagonal systems to the ScaLAPACK case. We obtain a tight estimate on the significance of the off diagonal blocks of the tridiagonal linear systems generated by the cyclic reduction algorithm. Numerical experiments illustrate the advantage of omitting all but the first reduction step for a class of matrices related to high order approximations of the Laplace operator.

Carl Christian Kjelgaard Mikkelsen, Bo Kågström
Fast and Small Nonlinear Pseudorandom Number Generators for Computer Simulation

In this paper we present Tyche, a nonlinear pseudorandom number generator designed for computer simulation. Tyche has a small 128-bit state and an expected period length of 2^127. Unlike most nonlinear generators, Tyche is consistently fast across architectures, due to its very simple iteration function derived from ChaCha, one of today’s fastest stream ciphers.

Tyche is especially amenable to the highly parallel environments we find today, in particular Graphics Processing Units (GPUs), where it enables a very large number of uncorrelated parallel streams running independently. For example, 2^16 parallel independent streams are expected to generate about 2^96 pseudorandom numbers each, without overlaps.

Additionally, we determine bounds for the period length and parallelism of our generators, and evaluate their statistical quality and performance. We compare Tyche and the variant Tyche-i to the XORWOW and TEA8 generators on CPUs and GPUs. Our comparisons show that Tyche and Tyche-i simultaneously achieve high performance and excellent statistical properties, particularly when compared to other nonlinear generators.

Samuel Neves, Filipe Araujo
Parallel Quantum Algorithm for Finding the Consistency of Saaty’s Matrices

Features of quantum systems enable simple calculation of eigenvalues in a uniquely parallel way. We propose the use of quantum algorithms to implement an iterative method that searches for consistency in Saaty’s matrix of relative judgments, closing in step by step on a consistent matrix structure. Typically, the matrix of relative judgments is prepared on the basis of the opinion of an expert or a group of experts. In practice it is necessary to obtain a consistent set of opinions, and to reach the desired level of consistency we must correct them, even if only to a minimal extent. The criteria of correction are: the minimum number of places corrected (the fastest convergence to consistency) or the minimum total value of the alterations. In our proposition we can choose among several variants of the iterative corrections as a way of consolidation. The method most often chosen by experts is based on the minimal correction in every iteration. Sometimes we want to make a minimal number of iteration steps. Details of this classical approach are presented in [9].

In this paper we want to support the classical algorithm with quantum conventions and parameters. The measurement realization will be connected with the state transition and the reading of the eigenvalue. The superposition procedure will be activated after the transition, which changes the probability of the choice of the location of the next correction(s). For the quantum calculations we use quantum vectors, qubits, inner and tensor products, linear operators, projectors, gates, etc. The resulting computational complexity (for simulation of the quantum calculations) is comparable to that of the classical algorithm.

Henryk Piech, Olga Siedlecka-Lamch
A Numerical Approach to the Determination of 3D Stokes Flow in Polygonal Domains Using PIES

This paper discusses a generalization of the intensively developed parametric integral equation system (PIES) to solve 3D steady Stokes flow. Given the preliminary nature of this research, the effectiveness of the generalized PIES has been verified for the solutions of Stokes problems defined on polygonal domains. The boundary of such domains has been modeled directly in PIES by joining rectangular Coons parametric patches of the first degree. With them it is possible to model relatively large linear segments of the boundary with a small number of corner points of the considered polygonal domain. Two numerical examples were used to validate the PIES solutions against analytical and numerical results available in the literature.

Eugeniusz Zieniuk, Krzysztof Szerszen, Marta Kapturczak

Parallel Numerics

Cache Blocking for Linear Algebra Algorithms

We briefly describe Cache Blocking for Dense Linear Algebra Algorithms on computer architectures since about 1985. Before that, one had uniform memory architectures; the Cray I machine was the last holdout. We cover the where, when, what, how and why of Cache Blocking. Almost all computer manufacturers have recently (about seven years ago) dramatically changed their computer architectures to produce Multicore (MC) processors. It will be seen that the arrangement in memory of the submatrices A_ij of A is a critical factor for obtaining high performance. From a practical point of view, this work is very important, as it will allow existing codes using LAPACK and ScaLAPACK to remain usable by new versions of LAPACK and ScaLAPACK.
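The payoff of cache blocking is easiest to see in a toy kernel. Below is a minimal, illustrative sketch in pure Python (my own naming, not the paper's code, whose kernels are real BLAS/LAPACK routines): the matrices are multiplied tile by tile so that each submatrix A_ij is reused while it is resident in fast memory.

```python
def blocked_matmul(A, B, nb):
    """Multiply square matrices A and B tile by tile with tile size nb.

    Each nb-by-nb tile of A and B is reused while it is "hot", which is
    the essence of cache blocking; in pure Python the benefit is only
    didactic, but the loop structure is the one real kernels use.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):            # tile row of C
        for jj in range(0, n, nb):        # tile column of C
            for kk in range(0, n, nb):    # C_ij += A_ik * B_kj
                for i in range(ii, min(ii + nb, n)):
                    for k in range(kk, min(kk + nb, n)):
                        a_ik = A[i][k]    # scalar held across the j loop
                        for j in range(jj, min(jj + nb, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```

The three outer loops walk over tiles C_ij, A_ik, B_kj; how those tiles are laid out in memory is exactly the arrangement question the abstract raises.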

Fred G. Gustavson
Reducing the Amount of Pivoting in Symmetric Indefinite Systems

This paper illustrates how the communication due to pivoting in the solution of symmetric indefinite linear systems can be reduced by considering innovative approaches that are different from pivoting strategies implemented in current linear algebra libraries. First a tiled algorithm where pivoting is performed within a tile is described and then an alternative to pivoting is proposed. The latter considers a symmetric randomization of the original matrix using the so-called recursive butterfly matrices. In numerical experiments, the accuracy of tile-wise pivoting and of the randomization approach is compared with the accuracy of the Bunch-Kaufman algorithm.

Dulceneia Becker, Marc Baboulin, Jack Dongarra
A High Performance Dual Revised Simplex Solver

When solving families of related linear programming (LP) problems and many classes of single LP problems, the simplex method is the preferred computational technique. Hitherto there has been no efficient parallel implementation of the simplex method that gives good speed-up on general, large sparse LP problems. This paper presents a variant of the dual simplex method and a prototype parallelisation scheme. The resulting implementation, ParISS, is efficient when run in serial and offers modest speed-up for a range of LP test problems.

Julian Hall, Qi Huangfu
TFETI Coarse Space Projectors Parallelization Strategies

This paper deals with an analysis of various parallelization strategies for the TFETI algorithm. The data distributions and the most relevant actions are discussed, especially those concerning the coarse problem. Being part of the coarse space projector, it couples all the subdomain computations and accelerates convergence. Its direct solution is more robust but complicates the massively parallel implementation. The presented numerical results confirm the high parallel scalability of the coarse problem solution if the dual constraint matrix is distributed and then orthonormalized in parallel. Benchmarks were implemented using the PETSc parallelization library and run on the HECToR service at EPCC in Edinburgh.

Vaclav Hapla, David Horak
FFTs and Multiple Collective Communication on Multiprocessor-Node Architectures

We consider FFTs for networks with multiprocessor nodes using 2D data decomposition. In this application, processors perform collective all-to-all communication in different groups independently at the same time. Thus the individual processors of the nodes might be involved in independent collective communication. The underlying communication algorithm should account for that fact. For short messages, we propose a sparse version of Bruck’s algorithm which handles such multiple collectives. The distribution of the FFT data to the nodes is discussed for the local and global application of Bruck’s original algorithm, as well as the suggested sparse version. The performance of the different approaches is compared.

Andreas Jocksch
Performance Analysis of Parallel Alternating Directions Algorithm for Time Dependent Problems

We consider the time dependent Stokes equation on a finite time interval and on a uniform rectangular mesh, written in terms of velocity and pressure. In a parallel algorithm, based on a new direction splitting approach, the pressure equation is derived from a perturbed form of the continuity equation, in which the incompressibility constraint is penalized in a negative norm induced by the direction splitting. The scheme used in the algorithm is composed of: pressure prediction, velocity update, penalty step, and pressure correction. In order to achieve good parallel performance, the solution of the Poisson problem for the pressure correction is replaced by solving a sequence of one-dimensional second order elliptic boundary value problems in each spatial direction. The parallel code was developed using MPI and tested on modern computer systems. The performed numerical tests illustrate the parallel efficiency and scalability of the direction-splitting based algorithm.
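After discretization, each of the one-dimensional second order elliptic problems mentioned above reduces to a tridiagonal linear system. As an illustration only (the paper does not specify its one-dimensional solver), here is a minimal sketch of the classic Thomas algorithm for such systems:

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a (n-1 entries),
    diagonal b (n entries), super-diagonal c (n-1 entries) and
    right-hand side d.  Assumes no pivoting is needed, e.g. for a
    diagonally dominant matrix such as a 1D elliptic discretization.
    """
    n = len(b)
    cp = [0.0] * n          # modified super-diagonal
    dp = [0.0] * n          # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):   # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Each such solve costs O(n), which is why replacing one multi-dimensional Poisson solve by a sequence of one-dimensional solves parallelizes so well: the systems along different lines are independent.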

Ivan Lirkov, Marcin Paprzycki, Maria Ganzha
A Novel Parallel Algorithm for Gaussian Elimination of Sparse Unsymmetric Matrices

We describe a new algorithm for Gaussian Elimination suitable for general (unsymmetric and possibly singular) sparse matrices of any entry type, which has a natural parallel and distributed-memory formulation but degrades gracefully to sequential execution.

We present a sample MPI implementation of a program computing the rank of a sparse integer matrix using the proposed algorithm. Some preliminary performance measurements are presented and discussed, and the performance of the algorithm is compared to corresponding state-of-the-art algorithms for floating-point and integer matrices.

Riccardo Murri
Parallel FEM Adaptation on Hierarchical Architectures

The parallel FEM package NuscaS allows us to solve adaptive FEM problems with 3D unstructured meshes on distributed-memory parallel computers such as PC-clusters. In our previous works, a new method for parallelizing the FEM adaptation was presented, based on using the 8-tetrahedra longest-edge partition. This method relies on a decentralized approach, and is more scalable in comparison to previous implementations requiring a centralized synchronizing node.

At present, nodes of clusters contain more and more processing cores, and their efficient utilization is crucial for the high performance of numerical codes. In this paper, different schemes of mapping the mesh adaptation algorithm onto such hierarchical architectures are presented and compared. These schemes use either the pure message-passing model or a hybrid approach which combines the shared-memory and message-passing models. We also investigate an approach for adapting the pure MPI model to the hierarchical topology of clusters with multi-core nodes.

Tomasz Olas, Roman Wyrzykowski, Pawel Gepner
Solving Systems of Interval Linear Equations in Parallel Using Multithreaded Model and “Interval Extended Zero” Method

In this paper, an approach to the solution of systems of interval linear equations on parallel machines is presented, based on a parallel multithreaded model and the “interval extended zero” method. This approach not only allows us to decrease the undesirable excess-width effect, but also makes it possible to avoid inverted interval solutions. The efficiency of this method has already been proved for non-parallel systems. Here it is shown that it can also be used to perform efficient calculations on parallel machines using the multithreaded model.

Mariusz Pilarek, Roman Wyrzykowski
GPU-Based Parallel Algorithms for Transformations of Quantum States Expressed as Vectors and Density Matrices

In this paper, parallel algorithms to simulate measurements and unitary operations in the quantum computing model are presented. The proposed solutions are implemented using the CUDA technology. Moreover, a more effective routine for processing density matrices is presented. The main advantages of this approach are the reduced memory requirement and a good use of the computing power of the GPU hardware. Additionally, the proposed solution can be used to simulate circuits built from qubits or qudits.

Marek Sawerwain
Generalizing Matrix Multiplication for Efficient Computations on Modern Computers

Recent advances in computing allow taking a new look at matrix multiplication, where the key ideas are: decreasing interest in recursion, the development of processors with thousands (potentially millions) of processing units, and influences from the Algebraic Path Problem. In this context, we propose a generalized matrix-matrix multiply-add (MMA) operation and illustrate its usability. Furthermore, we elaborate on the interrelation between this generalization and the BLAS standard.
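As a hedged illustration of what such a generalization can look like (the signature below is my own sketch, not the paper's interface), an MMA parameterized by the semiring's scalar operations covers both ordinary GEMM and the min-plus "distance product" of the Algebraic Path Problem:

```python
def mma(add, mul, A, B, C):
    """Generalized matrix multiply-add C <- (A (x) B) (+) C over a
    semiring given by the scalar operations `add` and `mul`.

    add=+, mul=* gives the ordinary GEMM update; add=min, mul=+
    gives the distance product used in shortest-path computations.
    """
    for i in range(len(A)):
        for j in range(len(B[0])):
            acc = C[i][j]
            for k in range(len(B)):
                acc = add(acc, mul(A[i][k], B[k][j]))
            C[i][j] = acc
    return C

# Min-plus example: squaring a one-step distance matrix yields the
# shortest paths of up to two edges (0 -> 1 -> 2 costs 1 + 2 = 3,
# beating the direct edge 0 -> 2 of cost 10).
INF = float("inf")
D = [[0, 1, 10], [INF, 0, 2], [INF, INF, 0]]
D2 = mma(min, lambda x, y: x + y, D, D, [[INF] * 3 for _ in range(3)])
```

The inner loop nest is identical for every semiring, which is why a single highly tuned MMA kernel can serve many such problems.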

Stanislav G. Sedukhin, Marcin Paprzycki
Distributed QR Factorization Based on Randomized Algorithms

Most parallel algorithms for matrix computations assume a static network with reliable communication and thus use fixed communication schedules. However, in situations where computer systems may change dynamically, in particular, when they have unreliable components, algorithms with randomized communication schedule may be an interesting alternative.

We investigate randomized algorithms based on gossiping for the distributed computation of the QR factorization. The analyses of numerical accuracy showed that known numerical properties of classical sequential and parallel QR decomposition algorithms are preserved. Moreover, we illustrate that the randomized approaches are well suited for distributed systems with arbitrary topology and potentially unreliable communication, where approaches with fixed communication schedules have major drawbacks. The communication overhead compared to the optimal parallel QR decomposition algorithm (CAQR) is analyzed. The randomized algorithms have a much higher potential for trading off numerical accuracy against performance because their accuracy is proportional to the amount of communication invested.

Hana Straková, Wilfried N. Gansterer, Thomas Zemen
Static Load Balancing for Multi-level Monte Carlo Finite Volume Solvers

The Multi-Level Monte Carlo finite volumes (MLMC-FVM) algorithm was shown to be a robust and fast solver for uncertainty quantification in the solutions of multi-dimensional systems of stochastic conservation laws. A novel load balancing procedure is used to ensure scalability of the MLMC algorithm on massively parallel hardware. We describe this procedure together with other arising challenges in great detail. Finally, numerical experiments in multi-dimensions showing strong and weak scaling of our implementation are presented.

Jonas Šukys, Siddhartha Mishra, Christoph Schwab

Parallel Non-numerical Algorithms

A Parallel Algorithm for Minimizing the Number of Routes in the Vehicle Routing Problem with Time Windows

A parallel algorithm for minimizing the number of routes in the vehicle routing problem with time windows (VRPTW) is presented. The algorithm components cooperate periodically by exchanging their best solutions with the lowest number of routes found to date. The objective of the work is to analyze the speedup, achieved accuracy of solutions and scalability of the MPI implementation. Selected VRPTW tests are used for comparisons. The derived results justify the proposed parallelization concept. By making use of the parallel algorithm, twelve new best-known solutions for Gehring and Homberger’s benchmark tests were found.

Mirosław Błocho, Zbigniew J. Czech
Towards Parallel Direct SAT-Based Cryptanalysis

In this paper we show a new approach to parallelised and optimised direct SAT-based cryptanalysis of symmetric block ciphers. It is shown how one can code directly in SAT each bit of the plaintext together with its ’route’ through the steps of the enciphering algorithm, code the round key schedule and S-boxes, and eliminate all simple Boolean equivalences and redundancies. We show Boolean coding directly from the analysed cipher’s source code, with no intermediate step of generating any auxiliary system of multivariate low-degree equations, as was the case in the SAT-enhanced algebraic cryptanalysis of [4]. This results in much shorter formulae. A further speed-up is obtained by parallelising the cryptanalytic effort across some 2^n available processing cores. We report some experimental results on two basic, well-known symmetric ciphers.

Paweł Dudek, Mirosław Kurkowski, Marian Srebrny
Parallel Version of Image Segmentation Algorithm Using Polygonal Markov Fields

In this paper we present an application of the parallel simulated annealing method to a segmentation algorithm using polygonal Markov fields. After a brief presentation of the algorithm and a general scheme of parallelization methods using the simulated annealing technique, a parallel approach to the segmentation algorithm with different synchronization scenarios is presented.

We also present results of the parallelization of the segmentation algorithm, and discuss a comparison between simulations with different synchronization scenarios applied to the multiple-trial approach of the simulated annealing technique. Simulations varying the number of threads are presented as well.

Rafał Kluszczyński, Piotr Bała
Parallel Community Detection for Massive Graphs

Tackling the current volume of graph-structured data requires parallel tools. We extend our work on analyzing such massive graph data with the first massively parallel algorithm for community detection that scales to current data sizes, handling graphs of over 122 million vertices and nearly 2 billion edges in under 7300 seconds on a massively multithreaded Cray XMT. Our algorithm achieves moderate parallel scalability without sacrificing sequential operational complexity. Community detection partitions a graph into subgraphs more densely connected within the subgraph than to the rest of the graph. We take an agglomerative approach similar to Clauset, Newman, and Moore’s sequential algorithm, merging pairs of connected intermediate subgraphs to optimize different graph properties. Working in parallel opens new approaches to high performance. On smaller data sets, we find the output’s modularity compares well with the standard sequential algorithms.

E. Jason Riedy, Henning Meyerhenke, David Ediger, David A. Bader
Is Your Permutation Algorithm Unbiased for n ≠ 2^m?

Many papers on parallel random permutation algorithms assume the input size n to be a power of two and imply that these algorithms can easily be generalized to arbitrary n. We show that this simplifying assumption is not necessarily correct, since it may result in a bias. Many of these algorithms are, however, consistent, i.e., iterating them ultimately converges to an unbiased permutation. We prove this convergence along with exponential convergence speed. Furthermore, we present an analysis of iterating applied to a butterfly permutation network, which works in-place and is well-suited for implementation on many-core systems such as GPUs. We also show a method that improves the convergence speed even further and yields a practical implementation of the permutation network on current GPUs.
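The source of such bias is a counting argument: an algorithm driven by uniform random choices has some number of equally likely execution paths, and unless n! divides that number, the n! permutations cannot all be equally probable. The sketch below demonstrates this with the classic naive swap shuffle rather than the paper's butterfly network (the same argument applies): for n = 3 there are 3^3 = 27 paths, and 6 = 3! does not divide 27.

```python
from itertools import product

def naive_shuffle_counts(n):
    """Enumerate all n**n execution paths of the biased 'swap each
    position with a uniformly random position' shuffle and count how
    often each permutation is produced."""
    counts = {}
    for choices in product(range(n), repeat=n):
        a = list(range(n))
        for i, j in enumerate(choices):
            a[i], a[j] = a[j], a[i]   # swap position i with random j
        key = tuple(a)
        counts[key] = counts.get(key, 0) + 1
    return counts

# 27 equally likely paths spread over 6 permutations: since 6 does not
# divide 27, some permutations must occur more often than others.
counts = naive_shuffle_counts(3)
```

The unbiased Fisher-Yates shuffle avoids this by drawing position i's partner only from {i, ..., n-1}, giving exactly n! equally likely paths.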

Michael Waechter, Kay Hamacher, Franziska Hoffgaard, Sven Widmer, Michael Goesele

Tools and Environments for Parallel/Distributed/Grid Computing

Extracting Coarse–Grained Parallelism for Affine Perfectly Nested Quasi–uniform Loops

This paper presents a new approach for the extraction of coarse–grained parallelism available in program loops. The approach permits extracting parallelism for both uniform and quasi–uniform perfectly nested parameterized loops, where the loop bounds and data accesses are affine functions of loop indices and symbolic parameters. It extracts a set of synchronization–free code fragments. The procedure has a polynomial time complexity except for one step of the calculations. The effectiveness and time complexity of the approach are evaluated by means of loops of the NAS Parallel Benchmark suite.

Włodzimierz Bielecki, Krzysztof Kraska
Polish Computational Research Space for International Scientific Collaborations

The Polish Grid Initiative commenced in 2009 as part of the PL-Grid project funded under the framework of the Innovative Economy Operational Programme [1]. The main purpose of this Project is to provide the Polish scientific community with a basic IT platform making use of Grid computer clusters, enabling e-science research in various fields. The Project is establishing a country-wide computing platform which supports scientific research through the integration of experimental data and results of advanced computer simulations carried out by geographically dispersed teams. The solutions applied in setting up this e-infrastructure will ensure integration with other, similar platforms around the world. In the paper some basic facts concerning the Project history are given, PL-Grid goals are described, and several examples of innovative grid services and software, as well as support procedures developed to date, are presented.

Jacek Kitowski, Michał Turała, Kazimierz Wiatr, Łukasz Dutka, Marian Bubak, Tomasz Szepieniec, Marcin Radecki, Mariusz Sterzel, Zofia Mosurska, Robert Pająk, Renata Słota, Krzysztof Kurowski, Bartek Palak, Bartłomiej Balcerek, Piotr Bała, Maciej Filocha, Rafał Tylman
Request Distribution Toolkit for Virtual Resources Allocation

The Service Oriented Architecture (SOA) concept facilitates building flexible services that can be deployed in a distributed environment and executed on different hardware and software platforms. On the other hand, the SOA paradigm raises many challenges in the areas of Quality of Service and resource utilization. In the paper, a Resources Distribution Manager (RDM) that manages resources and service delivery in order to satisfy user requirements and the service provider’s needs is presented. Requests for services are distributed to selected virtualized instances of services, or optionally new instances are created for handling the requests. The allocation decisions are based on knowledge about the communication resources and the current load of computational resources, which are dynamically monitored by a special RDM module. The decision on the resource allocation is then compared with its utilization during service execution and causes changes in the allocation strategy.

Jan Kwiatkowski, Mariusz Fras
Vitrall: Web-Based Distributed Visualization System for Creation of Collaborative Working Environments

Advanced parallel computing solutions using GPUs for general-purpose processing are becoming more and more popular. Applications like CFD or weather modeling achieve extensive speed-ups using GPU-based clusters. Still, the original purpose of graphics processing units, visualization, is not exploited as much as it could be in such powerful processing centres. The main reason for this situation is their fundamental difference from classic desktop configurations: being a structure remote and hidden from the actual viewer. Existing GPU-based architectures consisting of many processing units may be used for real-time visualization of complex issues from many points of view and in resolutions that are not accessible to single GPUs. Visualization is a very efficient way of collaborating, especially when collaborators can interact with the presented content in a natural way, for example using multi-touch devices. Vitrall embraces these methods by introducing the possibility of linking modern interfaces with complex visualizations. This paper begins with a summary of several studies that were conducted to establish the concept of the Vitrall system. Next, the proposed architecture of Vitrall is introduced, followed by its main and secondary usage scenarios. Finally, some of Vitrall’s specific configurations are shown, like real-time stereoscopic visualization.

Piotr Śniegowski, Marek Błażewicz, Grzegorz Grzelachowski, Tomasz Kuczyński, Krzysztof Kurowski, Bogdan Ludwiczak

Applications of Parallel/Distributed Computing

CUDA Accelerated Blobby Molecular Surface Generation

A proper and efficient representation of molecular surfaces is an important issue in biophysics from several viewpoints. Molecular surfaces are indeed used for different aims, in particular for visualization, as support tools for biologists, and for computation, in electrostatics problems involving implicit solvents (e.g. when solving the Poisson-Boltzmann equation) or in molecular dynamics simulations. This problem has been recognized in the literature, resulting in a multitude of algorithms that differ in the adopted representation and the approach/technology used. Among the several molecular surface definitions, the Blobby surface is particularly appealing from both the computational and the graphics point of view. In this paper we describe an efficient software component able to produce high-resolution Blobby surfaces for very large molecules using the CUDA architecture. Experimental results show a speedup of 35.4 for a molecule of 90,898 atoms and a resulting mesh of 168 million triangles.

Daniele D’Agostino, Sergio Decherchi, Antonella Galizia, José Colmenares, Alfonso Quarati, Walter Rocchia, Andrea Clematis
GPU Accelerated Image Processing for Lip Segmentation

This paper addresses the problem of lip segmentation in a parallel environment using the computational capabilities of GPUs and CUDA. The presented implementation of lip segmentation is based on image processing methods using the most popular transformations, such as morphological operations and convolution filters. The experimental results obtained for the parallel implementation on a GPU indicate a significant speedup in comparison to its sequential counterpart. Consequently, the use of popular graphics cards provides a very promising means of quick lip segmentation.

Lukasz Adrjanowicz, Mariusz Kubanek, Adam Tomas
Material Parameter Identification with Parallel Processing and Geo-applications

The paper describes the numerical solution of material parameter identification problems, which arise in geo-applications and many other fields. We describe an approach based on nonlinear least squares minimization using different optimization techniques (Nelder-Mead, gradient methods, genetic algorithms), as well as our experience with OpenMP+MPI parallelization of the solution methods.

Radim Blaheta, Rostislav Hrtus, Roman Kohut, Owe Axelsson, Ondřej Jakl
Hierarchical Parallel Approach in Vascular Network Modeling – Hybrid MPI+OpenMP Implementation

This paper presents a two-level parallel algorithm of vascular network development. At the outer level, tasks (newly appeared parts of tissue) are spread over processing nodes. Each node attempts to connect/disconnect its assigned parts of tissue in all vascular trees. Communication between nodes is accomplished by a message passing paradigm. At the inner level, subtasks concerning different vascular trees (e.g. arterial and venous) within each task are parallelized by a shared address space paradigm. The solution was implemented on a computing cluster of multi-core nodes with mixed MPI+OpenMP support. The experimental results show that the algorithm provides a significant improvement in computational performance compared with a pure MPI implementation.

Krzysztof Jurczuk, Marek Kretowski, Johanne Bezy-Wendling
Runtime Optimisation Approaches for a Real-Time Evacuation Assistant

This paper presents runtime optimisation approaches for a real-time evacuation assistant. The pedestrian model used for the forecast is a modification of the centrifugal force model, which operates in continuous space. It is combined with an event-driven route choice algorithm which encompasses the local shortest path, the global shortest path and a combination with the quickest path. A naive implementation of this model has a complexity of O(N²), N being the number of pedestrians. In the first step of the optimisation the complexity is reduced to O(N) using special neighbourhood lists, such as the Verlet list or linked cells, commonly used in molecular dynamics. The next step in this optimisation process is parallelisation on a multicore system. The Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) application programming interfaces are used to this end. The simulation is performed on the Juropa cluster installed at the Jülich Supercomputing Centre. The speedup factors obtained are ~10 for the linked cells, ~4 for 8 threads and ~3 for the parallelisation on 5 nodes using a static domain decomposition.

Armel Ulrich Kemloh Wagoum, Bernhard Steffen, Armin Seyfried
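The linked-cell reduction from O(N²) to O(N) mentioned in the abstract above can be sketched as follows. This is a minimal illustrative implementation (names and structure are ours, not the authors' code): agents are binned into grid cells whose side equals the interaction radius, so neighbour candidates come only from the surrounding 3×3 block of cells instead of all N agents.

```python
from collections import defaultdict

def build_cells(positions, cell_size):
    """Hash each agent index into a grid cell of side cell_size."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)
    return cells

def neighbors(i, positions, cells, cell_size):
    """Agents within cell_size of agent i, gathered from the 3x3 cell block.

    Here the cell side doubles as the interaction cutoff radius."""
    x, y = positions[i]
    cx, cy = int(x // cell_size), int(y // cell_size)
    result = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in cells.get((cx + dx, cy + dy), ()):
                if j != i:
                    px, py = positions[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= cell_size ** 2:
                        result.append(j)
    return result
```

Rebuilding the cell hash each step costs O(N), and each neighbourhood query touches only a constant number of cells, which is the source of the overall O(N) complexity.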
A Parallel Genetic Algorithm Based on Global Program State Monitoring

A new approach to the design of parallel genetic algorithms (GA) for execution in distributed systems is presented. It is based on the use of global parallel program control functions and asynchronous process/thread internal execution control based on global application state monitoring. A graphical control design infrastructure is provided to the programmer, based on generalized synchronization processes called synchronizers. They collect local states of program elements, compute global control predicates and send control signals to the program’s computational elements. This enables easy construction and management of global program states for the purpose of program execution control at both the thread and process level. At each level we create a hierarchical control/synchronization infrastructure which is used to optimize the control of computations in programs. As an example we present the design of a parallel genetic algorithm used to partition a macro data flow graph for FDTD (Finite Difference Time Domain method) computations.

Adam Smyk, Marek Tudruj

Applied Mathematics, Neural Networks and Evolutionary Computing

Parallel Approach to the Functional Decomposition of Logical Functions Using Developmental Genetic Programming

Functional decomposition is the main step in the FPGA-oriented logic synthesis, where a function is decomposed into a set of functions, each of which must be simple enough to be implementable in one logic cell. This paper presents a method of searching for the best decomposition strategy for logical functions specified by cubes. The strategy is represented by a decision tree, where each node corresponds to a single decomposition step. In that way the multistage decomposition of complex logical functions may be specified. The tree evolves using the parallel developmental genetic programming. The goal of the evolution is to find a decomposition strategy for which the cost of FPGA implementation of a given function is minimal. Experimental results show that our approach gives significantly better results than other existing methods.

Stanislaw Deniziak, Karol Wieczorek
The Nine Neighbor Extrapolated Diffusion Method for Weighted Torus Graphs

The convergence analysis of the Extrapolated Diffusion (EDF) method was developed in [7] and [8] for the weighted torus and mesh graphs, respectively, using the set $\mathcal{N}_1(i)$ of nearest neighbors of a node $i$ in the graph. In the present work we propose a Diffusion scheme which employs the set $\mathcal{N}_1(i)\cup \mathcal{N}_2(i)$, where $\mathcal{N}_2(i)$ denotes the four neighbors of node $i$ with path length two (see Figure 1), in order to increase the convergence rate. We study the convergence analysis of the new Diffusion scheme with nine neighbors (NEDF) for weighted torus graphs. In particular, we find closed-form formulae for the optimum values of the edge weights and the extrapolation parameter. A 60% increase in the convergence rate of NEDF compared to the conventional EDF method is shown analytically and numerically.

Katerina A. Dimitrakopoulou, Michail N. Misyrlis
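The basic first-order diffusion iteration that EDF-type load-balancing schemes extrapolate can be sketched as follows. This is a generic illustration with hand-picked uniform weights, not the paper's optimised NEDF scheme: each node exchanges a weighted share of its load imbalance with its neighbours, and with symmetric weights the total load is conserved while local imbalances decay.

```python
def diffusion_step(load, neighbors, weight):
    """One synchronous diffusion sweep over a load vector.

    load      -- list of per-node workloads
    neighbors -- dict mapping node index to its neighbour indices
    weight    -- symmetric edge weight (illustrative, not optimised)
    """
    new = list(load)
    for i, nbrs in neighbors.items():
        for j in nbrs:
            new[i] += weight * (load[j] - load[i])
    return new
```

Repeated sweeps drive all nodes toward the average load; EDF/NEDF choose the edge weights and an extrapolation parameter to maximise how fast that happens.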
On the Weak Convergence of the Recursive Orthogonal Series-Type Kernel Probabilistic Neural Networks in a Time-Varying Environment

In the paper a recursive version of general regression neural networks, based on orthogonal series-type kernels, is presented. Sufficient conditions for convergence in probability are given assuming time-varying noise. Experimental results are provided and discussed.

Piotr Duda, Yoichi Hayashi
On the Cesaro Orthogonal Series-Type Kernel Probabilistic Neural Networks Handling Non-stationary Noise

The Cesaro means of orthogonal series are applied to construct general regression neural networks. Sufficient conditions for convergence in probability are given assuming non-stationary noise. An experiment with synthetic data is described.

Piotr Duda, Jacek M. Zurada
On the Weak Convergence of the Orthogonal Series-Type Kernel Regression Neural Networks in a Non-stationary Environment

In the paper general regression neural networks, based on the orthogonal series-type kernel, are studied. Convergence in probability is proved assuming non-stationary noise. The performance is investigated using synthetic data.

Meng Joo Er, Piotr Duda
A Graph-Based Generation of Virtual Grids

This paper presents a unified formal model for grid representation based on hierarchical graph structures. It aims to help in a better understanding of both grid generation and grid simulation problems. A new graph structure called layered graphs is proposed. This approach enables one to use attributed graph grammars as a tool to generate both a grid structure and its parameters at the same time. To illustrate the method, an example of a grid generated by means of graph grammar rules is presented.

Ewa Grabska, Wojciech Palacz, Barbara Strug, Grażyna Ślusarczyk
On General Regression Neural Network in a Nonstationary Environment

In this paper we present a method for the estimation of an unknown function in a time-varying environment. We study the probabilistic neural network based on Parzen kernels combined with the recursive least squares method. We present conditions for convergence in probability and discuss the experimental results.

Yoichi Hayashi, Lena Pietruczuk
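In its simplest non-recursive form, a general regression (probabilistic) neural network with Parzen kernels reduces to the Nadaraya-Watson estimator. The following sketch (the function name and bandwidth value are ours, for illustration only) shows the kernel-weighted averaging at its core; the papers in this section study recursive, time-varying variants of this basic construction.

```python
import math

def grnn_predict(x, train_x, train_y, bandwidth=0.5):
    """Nadaraya-Watson regression estimate with a Gaussian Parzen kernel.

    The prediction is a kernel-weighted average of the training targets,
    with weights decaying in the distance between x and each training input.
    """
    weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2)) for xi in train_x]
    return sum(w * yi for w, yi in zip(weights, train_y)) / sum(weights)
```

The recursive versions studied in these papers update such estimates sample by sample instead of recomputing the sums, which is what makes them usable under non-stationary noise.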
Determination of the Heat Transfer Coefficient by Using the Ant Colony Optimization Algorithm

The paper presents a numerical method for solving the inverse heat conduction problem based on a relatively new tool for combinatorial optimization named Ant Colony Optimization (ACO). ACO belongs to the group of swarm intelligence algorithms and is inspired by the way ants search for the shortest path connecting the ant-hill with a source of food. In the proposed approach we use this algorithm to minimize the functional appearing in the procedure of determining the value of the heat transfer coefficient in the successive cooling zones.

Edyta Hetmaniok, Damian Słota, Adam Zielonka
Learning in a Non-stationary Environment Using the Recursive Least Squares Method and Orthogonal-Series Type Regression Neural Network

In the paper the recursive least squares method, in combination with a general regression neural network, is applied for learning in a non-stationary environment. The orthogonal series-type kernel is applied to design the general regression neural networks. Sufficient conditions for convergence in probability are given and simulation results are presented.

Maciej Jaworski, Meng Joo Er
On the Application of the Parzen-Type Kernel Probabilistic Neural Network and Recursive Least Squares Method for Learning in a Time-Varying Environment

This paper presents the Parzen kernel-type regression neural network in combination with the recursive least squares method to solve the problem of learning in a time-varying environment. Sufficient conditions for convergence in probability are given. Simulation experiments are presented and discussed.

Maciej Jaworski, Yoichi Hayashi
Learning in Rough-Neuro-Fuzzy System for Data with Missing Values

Rough-neuro-fuzzy systems offer a suitable way of classifying data with missing values. The paper presents a new implementation of gradient learning in the case of missing input data, adapted for rough-neuro-fuzzy classifiers. We consider a system with singleton fuzzification, Mamdani-type reasoning and center average defuzzification. Several experiments based on common benchmarks, illustrating the performance of the trained systems, are shown. The learning and testing of the systems has been performed with various numbers of missing values.

Bartosz A. Nowak, Robert K. Nowicki
Diameter of the Spike-Flow Graphs of Geometrical Neural Networks

Average path length is recognised as one of the vital characteristics of random graphs and complex networks. Despite a rather sparse structure, some networks were reported to have relatively short path lengths between every pair of nodes, making the whole network reachable in just several hops. This small-worldliness was reported in metabolic, social and linguistic networks, and recently in the Internet. In this paper we present results concerning the path length distribution and the diameter of the spike-flow graph obtained from the dynamics of geometrically embedded neural networks. Numerical results confirm both the short diameter and the short average path length of the resulting activity graph. In addition to numerical results, we also discuss means of running the simulations in a concurrent environment.

Jaroslaw Piersa
Weak Convergence of the Recursive Parzen-Type Probabilistic Neural Network in a Non-stationary Environment

A recursive version of general regression neural networks is presented. Weak convergence is proved in a case of nonstationary noise. Experimental results are given.

Lena Pietruczuk, Jacek M. Zurada
Strong Convergence of the Parzen-Type Probabilistic Neural Network in a Time-Varying Environment

In this paper general regression neural networks are applied to handle nonstationary noise. Strong convergence is established. Experiments conducted on synthetic data show good performance in the case of finite length of data samples.

Lena Pietruczuk, Meng Joo Er
Learning in a Time-Varying Environment by Making Use of the Stochastic Approximation and Orthogonal Series-Type Kernel Probabilistic Neural Network

In the paper stochastic approximation, in combination with a general regression neural network, is applied for learning in a time-varying environment. The orthogonal-type kernel is applied to design the general regression neural networks. Sufficient conditions for weak convergence are given and simulation results are presented.

Jacek M. Zurada, Maciej Jaworski

Minisymposium on GPU Computing

Accelerating BST Methods for Model Reduction with Graphics Processors

Model order reduction of dynamical linear time-invariant systems appears in many scientific and engineering applications. Numerically reliable SVD-based methods for this task require $\mathcal{O}(n^3)$ floating-point arithmetic operations, with $n$ being in the range $10^3$–$10^5$ for many practical applications. In this paper we investigate the use of graphics processors (GPUs) to accelerate model reduction of large-scale linear systems via Balanced Stochastic Truncation, by off-loading the computationally intensive tasks to this device. Experiments on a hybrid platform consisting of state-of-the-art general-purpose multi-core processors and a GPU illustrate the potential of this approach.

Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí, Alfredo Remón
Reducing Thread Divergence in GPU-Based B&B Applied to the Flow-Shop Problem

In this paper, we propose pioneering work on designing and programming B&B algorithms on GPUs. To the best of our knowledge, no contribution has previously been proposed to address this challenge. We focus on the parallel evaluation of the bounds for the Flow-shop scheduling problem. To deal with the thread divergence caused by the bounding operation, we investigate two software-based approaches called thread data reordering and branch refactoring. Experiments show that the parallel evaluation of bounds speeds up execution by up to 54.5 times compared to a CPU version.

Imen Chakroun, Ahcène Bendjoudi, Nouredine Melab
A GPU-Based Approximate SVD Algorithm

Approximation of matrices using the Singular Value Decomposition (SVD) plays a central role in many science and engineering applications. However, the computation cost of an exact SVD is prohibitively high for very large matrices. In this paper, we describe a GPU-based approximate SVD algorithm for large matrices. Our method is based on the QUIC-SVD introduced by [6], which exploits a tree-based structure to efficiently discover a subset of rows that spans the matrix space. We describe how to map QUIC-SVD onto the GPU, and improve its speed and stability using a blocked Gram-Schmidt orthogonalization method. Using a simple matrix partitioning scheme, we have extended our algorithm to out-of-core computation, suitable for very large matrices that exceed the main memory size. Results show that our GPU algorithm achieves 6~7 times speedup over an optimized CPU version of QUIC-SVD, which itself is orders of magnitude faster than exact SVD methods.

Blake Foster, Sridhar Mahadevan, Rui Wang
Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU Architectures

Recently, general-purpose GPU (GPGPU) programming has spread rapidly since CUDA was first introduced for writing parallel programs in high-level languages for NVIDIA GPUs. While a GPU exploits data parallelism very effectively, task-level parallelism is exploited as a multi-threaded program on a multicore CPU. For such a heterogeneous platform consisting of a multicore CPU and a GPU, we propose an automatic code synthesis framework that takes a process network model specification as input and generates multithreaded CUDA code. With the model-based specification, one can explicitly specify both function-level and loop-level parallelism in an application and explore the wide design space of mapping function blocks and selecting the communication methods between CPU and GPU. The proposed technique is complementary to other high-level methods of CUDA programming.

Hanwoong Jung, Youngmin Yi, Soonhoi Ha
Accelerating the Red/Black SOR Method Using GPUs with CUDA

This work presents our strategy, the applied optimizations and the results of our effort to exploit the computational capabilities of GPUs under the CUDA environment in solving the Laplacian PDE. The parallelizable red/black SOR method was used. Additionally, a program for the CPU, featuring OpenMP, was developed as a performance reference. Significant performance improvements were achieved by using optimization methods which proved to provide substantial speedups. Finally, a direct comparison of the performance of both versions was carried out. A 51x speedup was measured for the CUDA version over the CPU version, exceeding 134 GB/sec memory bandwidth. Memory access patterns prove to be a critical factor in efficient program execution on GPUs, and it is therefore appropriate to pursue data reorganization in order to achieve the highest feasible memory throughput.

Elias Konstantinidis, Yiannis Cotronis
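A serial reference version of the red/black SOR scheme named above can be sketched as follows (a minimal CPU illustration, not the paper's CUDA kernel). Because every point of one colour depends only on points of the other colour, each half-sweep updates its points independently, which is exactly the property that makes the method GPU-friendly.

```python
import numpy as np

def red_black_sor(u, omega=1.8, iters=200):
    """Red/black SOR sweeps for the Laplace equation on a 2D grid.

    Boundary values of u are held fixed; interior points are relaxed.
    Points are coloured like a checkerboard by the parity of i + j, and
    all points of one colour can be updated in parallel.
    """
    for _ in range(iters):
        for parity in (0, 1):  # red half-sweep, then black half-sweep
            for i in range(1, u.shape[0] - 1):
                for j in range(1, u.shape[1] - 1):
                    if (i + j) % 2 == parity:
                        gs = 0.25 * (u[i-1, j] + u[i+1, j]
                                     + u[i, j-1] + u[i, j+1])
                        u[i, j] += omega * (gs - u[i, j])
    return u
```

A vectorised variant would update each colour with strided slice operations, which maps directly onto the coalesced memory accesses the abstract identifies as critical on GPUs.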
Dense Affinity Propagation on Clusters of GPUs

This article focuses on the implementation of Affinity Propagation, a state-of-the-art method for finding exemplars in sets of patterns, on clusters of Graphical Processing Units. When finding exemplars in dense, non-metric data, Affinity Propagation has O(n²) memory complexity. This limits the size of problems that can fit in the Graphical Processing Unit memory. We show, however, that dense Affinity Propagation can be distributed over multiple Graphical Processing Units with a low communication-to-computation ratio. By exploiting this favorable communication pattern we propose an implementation which can find exemplars in large, dense data sets efficiently, even when run over a slow interconnect.

Marcin Kurdziel, Krzysztof Boryczko
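For reference, the dense message-passing updates of Affinity Propagation (Frey and Dueck's responsibility/availability rules) can be sketched as follows. This single-node NumPy illustration, not the paper's multi-GPU code, makes the O(n²) memory footprint visible: both message matrices are n × n dense arrays (damping factor and iteration count here are illustrative choices).

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=100):
    """Dense Affinity Propagation message passing.

    S is the n x n similarity matrix (diagonal holds the preferences).
    Returns, for each point, the index of its chosen exemplar.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities  a(i, k)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} (a(i,k') + s(i,k'))
        M = A + S
        idx = np.argmax(M, axis=1)
        first = M[np.arange(n), idx]
        M[np.arange(n), idx] = -np.inf
        second = np.max(M, axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        col = Rp.sum(axis=0)
        A_new = np.minimum(0, col[None, :] - Rp)
        np.fill_diagonal(A_new, col - R.diagonal())
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)
```

Both updates are row- or column-wise reductions over dense matrices, which is why the communication-to-computation ratio stays low when the matrices are partitioned across devices.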
High-Performance Pseudo-Random Number Generation on Graphics Processing Units

This work considers the deployment of pseudo-random number generators (PRNGs) on graphics processing units (GPUs), developing an approach based on the xorgens generator to rapidly produce pseudo-random numbers of high statistical quality. The chosen algorithm has configurable state size and period, making it ideal for tuning to the GPU architecture. We present a comparison of both speed and statistical quality with other common GPU-based PRNGs, demonstrating favourable performance of the xorgens-based approach.

Nimalan Nandapalan, Richard P. Brent, Lawrence M. Murray, Alistair P. Rendell
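xorgens itself has a larger, configurable state and period, but the flavour of xor-shift-based generation can be conveyed with Marsaglia's minimal xorshift32 recurrence, a much simpler relative shown here only to illustrate the principle (not the generator evaluated in the paper):

```python
def xorshift32(seed):
    """Yield 32-bit pseudo-random integers via Marsaglia's xorshift32.

    The state is updated by three xor-with-shift steps; the recurrence is a
    bijection on nonzero 32-bit states, giving period 2**32 - 1.
    """
    x = seed & 0xFFFFFFFF
    assert x != 0, "xorshift state must be nonzero"
    while True:
        x ^= (x << 13) & 0xFFFFFFFF
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        yield x
```

Because each stream needs only a few words of state and a handful of bitwise operations per output, such generators map naturally onto the many lightweight threads of a GPU; xorgens adds a larger state and careful parameter choices for better statistical quality.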
Auto-tuning Dense Vector and Matrix-Vector Operations for Fermi GPUs

In this paper, we consider the automatic performance tuning of dense vector and matrix-vector operations on GPUs. Such operations form the backbone of level 1 and level 2 routines in the Basic Linear Algebra Subroutines (BLAS) library and are therefore of great importance in many scientific applications. As examples, we develop single-precision CUDA kernels for the Euclidean norm (SNRM2) and the matrix-vector multiplication (SGEMV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture). We show that auto-tuning can be successfully applied to achieve high performance for dense vector and matrix-vector operations by appropriately utilizing the fine-grained parallelism of the GPU. Our tuned kernels display between 25% and 100% better performance than the current CUBLAS 3.2 library.

Hans Henrik Brandenborg Sørensen
GPGPU Implementation of Cellular Automata Model of Water Flow

In this paper we present how a Cellular Automata model can be implemented for processing on a Graphics Processing Unit (GPU). Recently, graphics processors have gained a lot of interest as an efficient architecture for general-purpose computation. Cellular Automata algorithms, which are inherently parallel, give the opportunity to achieve very high efficiency when implemented on GPUs. We demonstrate how an existing model of water flow can be ported to a GPU environment with the OpenCL programming framework. Sample simulation results and performance evaluations are included.

Paweł Topa, Paweł Młocek

Workshop on Memory and Data Parallelism on Multi- and Manycore Platforms

A Multi-GPU Implementation of a D2Q37 Lattice Boltzmann Code

We describe a parallel implementation of a compressible Lattice Boltzmann code on a multi-GPU cluster based on Nvidia Fermi processors. We analyze how to optimize the algorithm for GPGPU architectures, describe the implementation choices that we have adopted and compare our performance results with an implementation optimized for latest-generation multi-core CPUs. Our program runs at ≈ 30% of the double-precision peak performance of one GPU and shows almost linear scaling when run on the multi-GPU cluster.

Luca Biferale, Filippo Mantovani, Marcello Pivanti, Fabio Pozzati, Mauro Sbragaglia, Andrea Scagliarini, Sebastiano Fabio Schifano, Federico Toschi, Raffaele Tripiccione
Combining Smoother and Residual Calculation in v-cycle AMG for Symmetric Problems

We examine a modified implementation of the v-cycle multigrid scheme which takes into account the memory traffic, i.e. the movement of data between the memory and the processing units. It is known that the relatively slow data transfer is responsible for the poor parallel performance of multigrid algorithms on certain shared memory architectures e.g. those with a front-side bus memory controller. The modification is simple but it speeds up computations by up to 15%.

Maximilian Emans
Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures Using Tree Reduction

The objective of this paper is to enhance the parallelism of the tile bidiagonal transformation using tree reduction on multicore architectures. First introduced by Ltaief et al. [LAPACK Working Note #247, 2011], the bidiagonal transformation using tile algorithms with a two-stage approach has shown very promising results on square matrices. However, for tall and skinny matrices, the inherent problem of processing the panel in a domino-like fashion generates unnecessary sequential tasks. By using tree reduction, the panel is split horizontally, which creates another dimension of parallelism and engenders many concurrent tasks to be dynamically scheduled on the available cores. The results reported in this paper are very encouraging. The new tile bidiagonal transformation, targeting tall and skinny matrices, outperforms the state-of-the-art numerical linear algebra libraries LAPACK V3.2 and Intel MKL ver. 10.3 by up to a 29-fold speedup, and the standard two-stage PLASMA BRD by up to a 20-fold speedup, on an eight-socket hexa-core AMD Opteron multicore shared-memory system.

Hatem Ltaief, Piotr Luszczek, Jack Dongarra
Autotuning of Adaptive Mesh Refinement PDE Solvers on Shared Memory Architectures

Many multithreaded, grid-based, dynamically adaptive solvers for partial differential equations permanently have to traverse subgrids (patches) of different and changing sizes. The parallel efficiency of this traversal depends on the interplay of the patch size, the architecture used, the operations triggered throughout the traversal, and the grain size, i.e. the size of the subtasks the patch is broken into. We propose an oracle mechanism delivering grain sizes on the fly. It takes into account historical runtime measurements for different patch and grain sizes as well as the traversal’s operations, and it yields reasonable speedups. Neither magic configuration settings nor an expensive pre-tuning phase are necessary. It is an autotuning approach.

Svetlana Nogina, Kristof Unterweger, Tobias Weinzierl
GPU Acceleration of the Matrix-Free Interior Point Method

The matrix-free technique is an iterative approach to interior point methods (IPM), so named because both the solution procedure and the computation of an appropriate preconditioner require only the results of the operations Ax and A^T y, where A is the matrix of constraint coefficients. This paper demonstrates its overwhelmingly superior performance on two classes of linear programming (LP) problems relative to both the simplex method and to IPM with equations solved directly. It is shown that the reliance of this technique on sparse matrix-vector operations enables further, significant performance gains from the use of a GPU, and from multi-core processors.

Edmund Smith, Jacek Gondzio, Julian Hall
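The operator-only access pattern described above can be illustrated with a generic conjugate gradient sketch that touches the matrix solely through a matrix-vector product callback. This is a simplified, unpreconditioned illustration under our own naming, not the authors' solver; the point is that the linear algebra never needs A in explicit factored form, only the ability to apply it.

```python
import numpy as np

def cg_matvec(apply_A, b, tol=1e-10, max_iter=500):
    """Conjugate gradient solve of A x = b for symmetric positive
    definite A, accessed only through the callback apply_A(v) -> A @ v."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol ** 2:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def normal_op(A):
    """Normal-equations operator built purely from Ax and A^T y products,
    the two operations the matrix-free technique relies on."""
    return lambda v: A.T @ (A @ v)
```

Since every step is a sparse matrix-vector product plus vector updates, the whole iteration maps directly onto GPU and multi-core BLAS-like kernels, which is the source of the performance gains reported in the abstract.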

Workshop on Models, Algorithms and Methodologies for Hierarchical Parallelism in New HPC Systems

Deconvolution of 3D Fluorescence Microscopy Images Using Graphics Processing Units

We consider the deconvolution of 3D Fluorescence Microscopy RGB images, describing the benefits arising from tackling medical imaging problems on modern graphics processing units (GPUs), which are inexpensive parallel processing devices available on many up-to-date personal computers. We found that the execution time of the CUDA version is about two orders of magnitude less than that of the sequential algorithm. Nevertheless, the experiments prompt some reflections upon the best setting for the CUDA-based algorithm; that is, we note the need to model GPU architectures and their characteristics in order to better describe the performance of GPU algorithms and what can be expected of them.

Luisa D’Amore, Livia Marcellino, Valeria Mele, Diego Romano
HADAB: Enabling Fault Tolerance in Parallel Applications Running in Distributed Environments

The development of reliable and efficient scientific software in distributed computing environments requires the identification and analysis of issues related to the design and deployment of algorithms for high-performance computing architectures and their integration in distributed contexts. In these environments, indeed, resource efficiency and availability can change unexpectedly because of overloading or failure of both computing nodes and the interconnection network. The scenario described above requires the design of mechanisms enabling the software to “survive” such unexpected events while ensuring, at the same time, an effective use of the computing resources. Although many researchers have been working on these problems for years, fault tolerance for some classes of applications is still an open matter today. Here we focus on the design and deployment of a checkpointing/migration system to enable fault tolerance in parallel applications running in distributed environments. In particular we describe details of HADAB, a new hybrid checkpointing strategy, and its deployment in a meaningful case study: the PETSc Conjugate Gradient algorithm implementation. The related testing phase has been performed on the University of Naples distributed infrastructure (S.Co.P.E. infrastructure).

Vania Boccia, Luisa Carracciuolo, Giuliano Laccetti, Marco Lapegna, Valeria Mele
Increasing the Efficiency of the DaCS Programming Model for Heterogeneous Systems

Efficient programming of hybrid systems is usually done with the use of new programming models. This creates a unique opportunity to increase the performance of scientific applications and is also especially interesting in the context of future exascale application development, where extreme numbers of MPI processes tend to become a limitation. Future scientific codes will make use of hierarchical parallel programming models, with message passing techniques used between nodes and optimized computational kernels used within multicore, multithreaded or accelerator nodes. In this article we consider the x86 and PowerXCell8i heterogeneous environment found in High Performance Computing (HPC) sites like Roadrunner [6] or Nautilus [5]. Programming techniques for this environment are usually based on the IBM Data Communication and Synchronization library (DaCS). We describe our effort to increase the hybrid efficiency of the DaCS library and show how it affects the performance of scientific computations based on FFT kernels. The results are very promising, especially for computational models that involve large three-dimensional Fourier transformations.

Maciej Cytowski, Marek Niezgódka
A Software Architecture for Parallel List Processing on Grids

The Data List Management Library (DLML) processes data lists in parallel, balancing the workload transparently to programmers. Programmers only need to organise data into a list, use DLML functions to insert and get data items, and specify the sequential function(s) to process each data item according to the application logic. The first design of DLML was targeted for use on a single cluster.

This paper presents DLML-Grid, a software architecture for DLML to run in Grid environments composed of multiple distributed clusters. The architecture is hierarchical and tends to localise communication within clusters, thus reducing communication overhead. Using OpenVPN, we implemented a prototype version of DLML-Grid to gather empirical results on its performance using two clusters and two applications whose workloads are static and dynamically generated, respectively. DLML-Grid performs much better than DLML overall.

Apolo H. Hernández, Graciela Román-Alonso, Miguel A. Castro-García, Manuel Aguilar-Cornejo, Santiago Domínguez-Domínguez, Jorge Buenabad-Chávez
Reducing the Time to Tune Parallel Dense Linear Algebra Routines with Partial Execution and Performance Modeling

We present a modeling framework to accurately predict the time required to run dense linear algebra calculations. We report the framework’s accuracy in a number of varied computational environments, such as shared memory multicore systems, clusters, and large supercomputing installations with tens of thousands of cores. We also test the accuracy for various algorithms, each of which has different scaling properties and tolerance to low-bandwidth/high-latency interconnects. The predictive accuracy is very good, on the order of the measurement accuracy, which makes the method suitable for both dedicated and non-dedicated environments. We also present a practical application of our model to reduce the time required to tune and optimize large parallel runs whose time is dominated by linear algebra computations. We show practical examples of how to apply the methodology to avoid common pitfalls and reduce the influence of measurement errors and the inherent performance variability.

Piotr Luszczek, Jack Dongarra
A General-Purpose Virtualization Service for HPC on Cloud Computing: An Application to GPUs

This paper describes the generic virtualization service GVirtuS (Generic Virtualization Service), a framework for development of split-drivers for cloud virtualization solutions. The main goal of GVirtuS is to provide tools for developing elastic computing abstractions for high-performance private and public computing clouds. In this paper we focus our attention on GPU virtualization. However, GVirtuS is not limited to accelerator-based architectures: a virtual high performance parallel file system and a MPI channel are ongoing projects based on our split driver virtualization technology.

Raffaele Montella, Giuseppe Coviello, Giulio Giunta, Giuliano Laccetti, Florin Isaila, Javier Garcia Blas
A Simulated Annealing Algorithm for GPU Clusters

Simulated Annealing (SA) is a powerful global optimization technique that is frequently used for solving many practical problems from various scientific and technical fields. In this article we present a novel approach to the parallelization of SA and propose an algorithm optimized for execution on GPU clusters. Our technique exploits the basic characteristics of such environments by using hierarchical problem decomposition. The proposed algorithm performs especially well for complex problems with a large number of variables. We compare our approach with traditional parallel Simulated Annealing, both in terms of speed and result accuracy.

Maciej Zbierski
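A minimal sequential SA loop, the building block that such parallelizations decompose and replicate, might look as follows (all parameters and the structure are illustrative, not the paper's algorithm): random perturbations are accepted with the Metropolis criterion under a geometrically cooled temperature.

```python
import math
import random

def simulated_annealing(energy, x0, step=1.0, t0=10.0, cooling=0.95,
                        iters=2000, seed=0):
    """Basic 1D simulated annealing.

    Uphill moves are accepted with probability exp(-(delta E)/T), so the
    search can escape local minima early on and turns greedy as T shrinks.
    """
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        ce = energy(cand)
        if ce < e or rng.random() < math.exp((e - ce) / t):
            x, e = cand, ce
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling  # geometric cooling schedule
    return best_x, best_e
```

Hierarchical parallel SA, as in the article, runs many such chains across nodes and GPUs and coordinates their states, rather than changing this inner loop.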
Backmatter
Metadata
Title
Parallel Processing and Applied Mathematics
Editors
Roman Wyrzykowski
Jack Dongarra
Konrad Karczewski
Jerzy Waśniewski
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-31464-3
Print ISBN
978-3-642-31463-6
DOI
https://doi.org/10.1007/978-3-642-31464-3
