Simulations of Atomic Scale Systems

Free Energy Monte Carlo Simulations on a Distributed Network

While the use of enhanced sampling techniques and parallel computing to determine potentials of mean force is in widespread use in modern Molecular Dynamics and Monte Carlo simulation studies, there have been few methods that efficiently combine heterogeneous computer resources of varying quality and speeds in realizing a single simulation result on a distributed network. Here, we apply an algorithm based on the Monte Carlo method of Wang and Landau within a client-server framework, in which individual computing nodes report a histogram of regions of phase space visited and corresponding updates to a centralized server at regular intervals entirely asynchronously. The server combines the data and reports the sum to all nodes so that the overall free energy determination scales linearly with the total amount of resources allocated. We discuss our development of this technique and present results for molecular simulations of DNA.

Luke Czapla, Alexey Siretskiy, John Grime, Malek O. Khan

Numerical Investigation of the Cumulant Expansion for Fourier Path Integrals

Recent developments associated with the cumulant expansion of the Fourier path integral Monte Carlo method are illustrated numerically using a simple one-dimensional model of a quantum fluid. By calculating the Helmholtz free energy of the model we demonstrate that 1) recently derived approximate asymptotic expressions for the cumulants requiring only one-dimensional quadrature are both accurate and viable, 2) expressions through third-cumulant order are significantly more rapidly convergent than either the primitive Fourier method or the partial average method, and 3) the derived cumulant convergence orders can be verified numerically.

Nuria Plattner, Sharif Kunikeev, David L. Freeman, Jimmie D. Doll

Optimization of Functionals of Orthonormal Functions in the Absence of Unitary Invariance

We discuss the optimization of a functional with respect to sets of orthonormal functions where unitary invariance does not apply. This problem arises, for example, when density functionals with explicit self-interaction correction are used for systems of electrons. There, unitary invariance cannot be used to reformulate the minimization of the energy with respect to each of the functions as an eigenvalue problem as can be done for the commonly used GGA-DFT and Hartree-Fock theory. By including optimization with respect to unitary transformations as an explicit step in the iterative minimization procedure, fast convergence can, nevertheless, be obtained. Furthermore, by working with two sets of orthonormal functions, the optimal functions and a set of eigenfunctions, the implementation of the extended functional form in existing software becomes easier. The additional computations arising from the lack of unitary invariance can largely be carried out in parallel.

Peter Klüpfel, Simon Klüpfel, Kiril Tsemekhman, Hannes Jónsson

Simulated Annealing with Coarse Graining and Distributed Computing

EON is a software package that uses distributed computing, systematic coarse graining and bookkeeping of minima and first order saddle points too speed up adaptive kinetic Monte Carlo simulations. It can be used to optimize continuously differentiable functions of a large number of variables. The approach is based on finding minima of the cost function by traversing low-lying, first-order saddle points from one minimum to another. A sequence of minima is thus generated in a path through regions of low values of the cost function with the possibility of ‘temperature’ controlled acceptance of higher lying saddle points. Searches of first order saddle points are carried out using distributed computing and the minimum-mode following method. Coarse graining which involves merging local minima into composite states and the recognition of previous search paths and saddle points are used to accelerate the exploration of the cost function. In addition to obtaining an estimate of the global minimum, a simulation using this approach gives information about the shape of the cost function in the regions explored. Example applications to the simulated annealing of a cluster of water molecules on a platinum metal surface and grain boundary in copper are presented.

Andreas Pedersen, Jean-Claude Berthet, Hannes Jónsson

Path Optimization with Application to Tunneling

A method is presented for optimizing paths on high dimensional surfaces, i.e. scalar functions of many variables. The method involves optimizing simultaneously the end points and several intermediate points along the path and thus lends itself well to parallel computing. This is an extension of the nudged elastic band method (NEB) which is frequently used to find minimum energy paths on energy surfaces of atomic scale systems, often with several thousand variables. The method is illustrated using 2-dimensional systems and various choices of the object function, in particular (1) path length, (2) iso-contour and (3) quantum mechanical tunneling rate. The use of the tunneling paths to estimate tunneling rates within the instanton approximation is also sketched and illustrated with an application to associative desorption of hydrogen molecule from a copper surface, a system involving several hundred degrees of freedom.

Dóróthea M. Einarsdóttir, Andri Arnaldsson, Finnbogi Óskarsson, Hannes Jónsson

Tools and Environments for Accelerator Based Computational Biomedicine

Shallow Water Simulations on Multiple GPUs

We present a state-of-the-art shallow water simulator running on multiple GPUs. Our implementation is based on an explicit high-resolution finite volume scheme suitable for modeling dam breaks and flooding. We use row domain decomposition to enable multi-GPU computations, and perform traditional CUDA block decomposition within each GPU for further parallelism. Our implementation shows near perfect weak and strong scaling, and enables simulation of domains consisting of up-to 235 million cells at a rate of over 1.2 gigacells per second using four Fermi-generation GPUs. The code is thoroughly benchmarked using three different systems, both high-performance and commodity-level systems.

Martin Lilleeng Sætra, André Rigland Brodtkorb

High Performance Computing Techniques for Scaling Image Analysis Workflows

Biomedical images are intrinsically complex with each domain and modality often requiring specialized knowledge to accurately render diagnosis and plan treatment. A general software framework that provides access to high-performance resources can make possible high-throughput investigations of micro-scale features as well as algorithm design, development and evaluation. In this paper we describe the requirements and challenges of supporting microscopy analyses of large datasets of high-resolution biomedical images. We present high-performance computing approaches for storage and retrieval of image data, image processing, and management of analysis results for additional explorations. Lastly, we describe issues surrounding the use of high performance computing for scaling image analysis workflows.

Patrick M. Widener, Tahsin Kurc, Wenjin Chen, Fusheng Wang, Lin Yang, Jun Hu, Vijay Kumar, Vicky Chu, Lee Cooper, Jun Kong, Ashish Sharma, Tony Pan, Joel H. Saltz, David J. Foran

GPU Computing

Parallel Computation of Bivariate Polynomial Resultants on Graphics Processing Units

Polynomial resultants are of fundamental importance in symbolic computations, especially in the field of quantifier elimination. In this paper we show how to compute the resultant

$\ensuremath\operatorname{res}_y(f,g)$

of two bivariate polynomials

$f,g\in\ensuremath\mathbb{Z}[x,y]$

on a CUDA-capable graphics processing unit (GPU). We achieve parallelization by mapping the bivariate integer resultant onto a sufficiently large number of univariate resultants over finite fields, which are then lifted back to the original domain. We point out, that the commonly proposed special treatment for so called unlucky homomorphisms is unnecessary and how this simplifies the parallel resultant algorithm. All steps of the algorithm are executed entirely on the GPU. Data transfer is only used for the input polynomials and the resultant. Experimental results show the considerable speedup of our implementation compared to host-based algorithms.

Christian Stussak, Peter Schenzel

Accelerating Model Reduction of Large Linear Systems with Graphics Processors

Model order reduction of a dynamical linear time-invariant system appears in many applications from science and engineering. Numerically reliable SVD-based methods for this task require in general

$\mathcal{O}(n^3)$

floating-point arithmetic operations, with

n

being in the range 10

3

− 10

5

for many practical applications. In this paper we investigate the use of graphics processors (GPUs) to accelerate model reduction of large-scale linear systems by off-loading the computationally intensive tasks to this device. Experiments on a hybrid platform consisting of state-of-the-art general-purpose multi-core processors and a GPU illustrate the potential of this approach.

Peter Benner, Pablo Ezzatti, Daniel Kressner, Enrique S. Quintana-Ortí, Alfredo Remón

Fast GPU-Based Fluid Simulations Using SPH

Graphical Processing Units (GPUs) are massive floating-point stream processors, and through the recent development of tools such as CUDA and OpenCL it has become possible to fully utilize them for scientific computing. We have developed an open-source CUDA-based acceleration framework for 3D Computational Fluid Dynamics (CFD) using Smoothed Particle Hydrodynamics (SPH). This paper describes the methods used in our framework and compares the performance of the implementation to previous SPH implementations. We implement two different SPH models, a simplified model for Newtonian fluids, and a complex model for Non-Newtonian fluids, which we use for simulation of snow avalanches. Having implemented two different models, we investigate the performance characteristics of SPH simulations on the GPU and find that despite the larger bandwidth-requirements of the complex model the GPU scales well. Our simulations are rendered interactively and in “real-time”. Using an NVIDIA GeForce GTX 470 Fermi-based card we achieve 215.4, 122.2 and 64.9 FPS for the simple model and 69.6, 37.4 and 19.1 FPS for 64K, 128K and 256K particles respectively.

Øystein E. Krog, Anne C. Elster

Toward Techniques for Auto-tuning GPU Algorithms

We introduce a variety of techniques toward autotuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independent of hardware architecture, and attempt to select near-optimum parameters. We work towards a general framework for creating auto-tuned data-parallel algorithms, using these techniques for common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal in mind of building a general framework from these results. Our tuning strategy focuses first on identifying the computational patterns an algorithm shows, and then reducing our tuning model based on these observed patterns.

Andrew Davidson, John Owens

High Performance Computing Interval Methods

An Interval Version of the Crank-Nicolson Method – The First Approach

To study the heat or diffusion equation, the Crank-Nicolson method is often used. This method is unconditionally stable and has the order of convergence

O

(

k

2

+

h

2

), where

k

and

h

are mesh constants. Using this method in conventional floating-point arithmetic, we get solutions including not only the method error, but also representation and rounding errors. Therefore, we propose an interval version of the Crank-Nicolson method from which we would like to obtain solutions including the discretization error. Applying such a method in interval floating-point arithmetic allows one to obtain solutions including all possible numerical errors. Unfortunately, for the proposed interval version of Crank-Nicolson method, we are not able to prove that the exact solution belongs to the interval solutions obtained. Thus, the presented method should be modified in the nearest future to fulfil this necessary condition. A numerical example is presented. Although in this example the exact solution belongs to the interval solutions obtained, but the so-called wrapping effect significantly increases the widths of these intervals.

Andrzej Marciniak

Parallel Detection of Interval Overlapping

In this paper we define the interval overlapping relation and develop a parallel hardware unit for its realization. As one application we consider the interval comparisons. It is shown that a detailed classification of the interval overlapping relation leads to a reduction of floating-point comparisons in common applications.

Marco Nehmeier, Stefan Siegel, Jürgen Wolff von Gudenberg

Using the Second-Order Information in Pareto-set Computations of a Multi-criteria Problem

The paper presents an extension of a previously developed interval method for solving multi-criteria problems [13]. The idea is to use second order information (i.e., Hesse matrices of criteria and constraints) in a way analogous to global optimization (see e.g. [6], [9]). Preliminary numerical results are presented and parallelization of the algorithm is considered.

Bartłomiej Jacek Kubica, Adam Woźniak

Comments on Fast and Exact Accumulation of Products

A new IEEE arithmetic standard 1788 is currently being worked out. It will specify interval arithmetic and an exact dot product (EDP). In an EDP, an arbitrary finite number of products is accumulated without rounding errors.

These are essential tools for computations with reliable and accurate results. In high performance computing, it is necessary that implementations of interval arithmetic and the EDP must be as efficient as the ordinary floating-point arithmetic. In this paper, fast and accurate solutions for the EDP are presented.

Gerd Bohlender, Ulrich Kulisch

An Interval Finite Difference Method of Crank-Nicolson Type for Solving the One-Dimensional Heat Conduction Equation with Mixed Boundary Conditions

In the paper an interval method for solving the one-dimensio-nal heat conduction equation with mixed boundary conditions is considered. The idea of the interval method is based on the finite difference scheme of the conventional Crank-Nicolson method adapted to the mixed boundary conditions. The interval method given in the form presented in the paper includes the error term of the conventional method.

Malgorzata A. Jankowska

Using C-XSC for High Performance Verified Computing

C-XSC is a C++ class library for scientific computing, with its main focus on reliable interval computations. Recently, several changes and new features have been implemented, making C-XSC much more suitable for tasks in high performance computing. However, these changes require that users take several factors into consideration when writing and compiling programs using C-XSC to get the best possible performance while still maintaining a sufficient level of numerical accuracy. This paper gives an overview of the most important points concerning these factors and tries to give background information and recommendations to the end user for the implementation of efficient C-XSC programs.

Remark: An accompanying extended version of this paper is available, see [10].

Walter Krämer, Michael Zimmer, Werner Hofschuster

Efficient Implementation of Interval Matrix Multiplication

The straightforward implementation of interval matrix product suffers from poor efficiency, far from the performances of highly optimized floating-point matrix products. In this paper, we show how to reduce the interval matrix multiplication to 9 floating-point matrix products - for performance issues - without sacrificing the quality of the result. Indeed, we show that, compared to the straightforward implementation, the overestimation factor is at most 1.18.

Nguyen Hong Diep

Real-Time Access and Processing of Large Data Sets

The Computing Framework for Physics Analysis at LHCb

The analysis of high energy physics experiment data is a good example for the need of real-time access and processing of large data sets. In this contribution we introduce the basic concepts and implementations of high energy data analysis on the example of the LHCb experiment at the LHC collider. Already now, but even more once running under nominal conditions it will produce unprecedented amounts of data for years. The contribution will also discuss the potential of parts of the existing implementations to be used in the analysis middleware for future dedicated projects on fast distributed analysis frameworks for LHC data.

Markward Britsch

Taming the Raven – Testing the Random Access, Visualization and Exploration Network RAVEN

The

Random Access, Visualization and Exploration Network

(RAVEN) aims to allow for the storage, analysis and visualisation of peta-bytes of scientific data in (near) real-time. In essence, RAVEN is a huge distributed and parallel system.

While testing of distributed systems, such as huge telecommunication systems, is well understood and performed systematically, testing of parallel systems, in particular high-performance computing, is currently lagging behind and is mainly based on ad-hoc approaches.

This paper surveys the state of the art of software testing and investigates challenges of testing a distributed and parallel high-performance RAVEN system. While using the standardised

Testing and Test Control Notation

(TTCN-3) looks promising for testing networking and communication aspects of RAVEN, testing the visualisation and analysis aspects of RAVEN may open new frontiers.

Helmut Neukirchen

RAVEN – Boosting Data Analysis for the LHC Experiments

The analysis and visualization of the LHC data is a good example of human interaction with petabytes of inhomogeneous data. After outlining the computational requirements for an efficient analysis of such data sets, a proposal, RAVEN – a Random Access, Visualization and Exploration Network for petabyte sized data sets, for a scalable architecture meeting these demands is presented. The proposed hardware basis is a network of ”CSR”-units based on off-the-shelf components, which combine Computing, data Storage and Routing functionalities. At the software level efficient protocols for broadcasting information, data distribution and information collection are required, together with a middleware layer for data processing.

Michael Schmelling, Markward Britsch, Nikolai Gagunashvili, Hans Kristjan Gudmundsson, Helmut Neukirchen, Nicola Whitehead

Bridging HPC and Grid File I/O with IOFSL

Traditionally, little interaction has taken place between the Grid and high-performance computing (HPC) storage research communities. Grid research often focused on optimizing data accesses for high-latency, wide-area networks, while HPC research focused on optimizing data accesses for local, high-performance storage systems. Recent software and hardware trends are blurring the distinction between Grids and HPC. In this paper, we investigate the use of I/O forwarding — a well established technique in leadership-class HPC machines— in a Grid context. We show that the problems that triggered the introduction of I/O forwarding for HPC systems also apply to contemporary Grid computing environments. We present the design of our I/O forwarding infrastructure for Grid computing environments. Moreover, we discuss the advantages our infrastructure provides for Grids, such as simplified application data management in heterogeneous computing environments and support for multiple application I/O interfaces.

Jason Cope, Kamil Iskra, Dries Kimpe, Robert Ross

Linear Algebra Algorithms and Software for Multicore and Hybrid Architectures in Honor of Fred Gustavson on His 75th Birthday

Fine Granularity Sparse QR Factorization for Multicore Based Systems

The advent of multicore processors represents a disruptive event in the history of computer science as conventional parallel programming paradigms are proving incapable of fully exploiting their potential for concurrent computations. The need for different or new programming models clearly arises from recent studies which identify fine-granularity and dynamic execution as the keys to achieve high efficiency on multicore systems. This work presents an implementation of the sparse, multifrontal

QR

factorization capable of achieving high efficiency on multicore systems through using a fine-grained, dataflow parallel programming model.

Alfredo Buttari

Mixed Precision Iterative Refinement Methods for Linear Systems: Convergence Analysis Based on Krylov Subspace Methods

The convergence analysis of Krylov subspace solvers usually provides an estimation for the computational cost. Exact knowledge about the convergence theory of error correction methods using different floating point precision formats would enable to determine a priori whether the implementation of a mixed precision iterative refinement solver using a certain Krylov subspace method as error correction solver outperforms the plain solver in high precision. This paper reveals characteristics of mixed precision iterative refinement methods using Krylov subspace methods as inner solver.

Hartwig Anzt, Vincent Heuveline, Björn Rocker

An Implementation of the Tile QR Factorization for a GPU and Multiple CPUs

The tile QR factorization provides an efficient and scalable way for factoring a dense matrix in parallel on multicore processors. This article presents a way of efficiently implementing the algorithm on a system with a powerful GPU and many multicore CPUs.

Jakub Kurzak, Rajib Nath, Peng Du, Jack Dongarra

Efficient Reduction from Block Hessenberg Form to Hessenberg Form Using Shared Memory

A new cache-efficient algorithm for reduction from block Hessenberg form to Hessenberg form is presented and evaluated. The algorithm targets parallel computers with shared memory. One level of look-ahead in combination with a dynamic load-balancing scheme significantly reduces the idle time and allows the use of coarse-grained tasks. The coarse tasks lead to high-performance computations on each processor/core. Speedups close to 13 over the sequential unblocked algorithm have been observed on a dual quad-core machine using one thread per core.

Lars Karlsson, Bo Kågström

Cache-Oblivious Algorithms and Matrix Formats for Computations on Interval Matrices

The paper considers the use of cache-oblivious algorithms and matrix formats for computations on interval matrices. We show how the efficient use of cache is of less importance in interval computations than in traditional floating-point ones. For interval matrices there are more important factors, like the number of rounding modes switches or the number of times we have to check if an interval contains zero or not. Yet the use of cache still plays some role.

Rafał Dabrowski, Bartłomiej Jacek Kubica

Parallel Solution of Narrow Banded Diagonally Dominant Linear Systems

ScaLAPACK contains a pair of routines for solving systems which are narrow banded and diagonally dominant by rows. Mathematically, the algorithm is block cyclic reduction. The ScaLAPACK implementation can be improved using incomplete, rather than complete block cyclic reduction. If the matrix is strictly dominant by rows, then the truncation error can be bounded directly in terms of the dominance factor and the size of the partitions. Our analysis includes new results applicable in our ongoing work of developing an efficient parallel solver.

Carl Christian Kjelgaard Mikkelsen, Bo Kågström

Memory and Multicore Issues in Scientific Computing - Theory and Practice

An Approach for Semiautomatic Locality Optimizations Using OpenMP

The processing power of multicore CPUs increases at a high rate, whereas memory bandwidth is falling behind. Almost all modern processors use multiple cache levels to overcome the penalty of slow main memory; however cache efficiency is directly bound to data locality. This paper studies a possible way to incorporate data locality exposure into the syntax of the parallel programming system OpenMP. We study data locality optimizations on two applications: matrix multiplication and Gauß-Seidel stencil. We show that only small changes to OpenMP are required to expose data locality so a compiler can transform the code. Our notion of tiled loops allows developers to easily describe data locality even at scenarios with non-trivial data dependencies. Furthermore, we describe two optimization techniques. One explicitly uses a form of local memory to prevent conflict cache misses, whereas the second one modifies the wavefront parallel programming pattern with dynamically sized blocks to increase the number of parallel tasks. As an additional contribution we explore the benefit of using multiple levels of tiling.

Jens Breitbart

Memory-Efficient Sierpinski-Order Traversals on Dynamically Adaptive, Recursively Structured Triangular Grids

Adaptive mesh refinement and iterative traversals of unknowns on such adaptive grids are fundamental building blocks for PDE solvers. We discuss a respective integrated approach for grid refinement and processing of unknowns that is based on recursively structured triangular grids and space-filling element orders. In earlier work, the approach was demonstrated to be highly memory- and cache-efficient. In this paper, we analyse the cache efficiency of the traversal algorithms using the I/O model. Further, we discuss how the nested recursive traversal algorithms can be efficiently implemented. For that purpose, we compare the memory throughput of respective implementations with simple stream benchmarks, and study the dependence of memory throughput and floating point performance from the computational load per element.

Michael Bader, Kaveh Rahnema, Csaba Vigh

Fast Wavelet Transform Utilizing a Multicore-Aware Framework

The move to multicore processors creates new demands on software development in order to profit from the improved capabilities. Most important, algorithm and code must be parallelized wherever possible, but also the growing memory wall must be considered. Additionally, high computational performance can only be reached if architecture-specific features are made use of. To address this complexity, we developed a C++ framework that simplifies the development of performance-optimized, parallel, memory-efficient, stencil-based codes on standard multicore processors and the heterogeneous Cell processor developed jointly by Sony, Toshiba, and IBM. We illustrate the implementation and optimization of the Fast Wavelet Transform and its inverse for Haar wavelets within our hybrid framework, using OpenMP, and using the Open Compute Language, and analyze performance results for different platforms.

Markus Stürmer, Harald Köstler, Ulrich Rüde

Multicore Algorithms and Implementations for Application Problems

Direct Sparse Factorization of Blocked Saddle Point Matrices

We present a parallel algorithm for the direct factorization of sparse saddle-point matrices of moderate size coming from real-time multibody dynamics simulations. We used the specific structure of these problems both for

a priori

construction of supernodes and to avoid all dynamic permutations during factorization. For the latter, we present a technique we call “leaf swapping” which performs permutations of the supernodes in the elimination tree without any reference to numerical values. The results compare favorably with currently available high performance codes on our problem sets because of the high overhead necessary to process very large problems on increasingly complex supercomputers.

Claude Lacoursière, Mattias Linde, Olof Sabelström

Multi-Target Vectorization with MTPS C++ Generic Library

This article introduces a C++ template library dedicated at vectorizing algorithms for different target architectures: Multi-Target Parallel Skeleton (MTPS). Skeletons describing the data structures and algorithms are provided and allow MTPS to generate a code with optimized memory access patterns for the choosen architecture. MTPS currently supports x86-64 multicore CPUs and CUDA enabled GPUs. On these architectures, performances close to hardware limits are observed.

Wilfried Kirschenmann, Laurent Plagne, Stéphane Vialle

Analysis of Gravitational Wave Signals on Heterogeneous Architectures

Heterogeneous architectures and programming techniques will be very important in the development of exascale HPC applications. Adapting heterogeneous programming techniques to scientific programming is not always straightforward. Here we present an in-depth analysis of an astrophysical application used for performing an all-sky coherent search for periodic signals of gravitational waves in narrowband detector data. The application was first ported to the PowerXCell8i architecture and then on the basis of achieved performance it was again redesigned and programmed in a heterogeneous model. Moreover presented heterogeneous techniques could be easily adopted for other scientific computational problems involving FFT computations.

Maciej Cytowski

Towards Efficient Execution of Erasure Codes on Multicore Architectures

Erasure codes can improve the availability of distributed storage in comparison with replication systems. In this paper, we focus on investigating how to map systematically the Reed-Solomon and Cauchy Reed-Solomon erasure codes onto the Cell/B.E. and GPU multicore architecture. A method for the systematic mapping of computation kernels of encoding/decoding algorithms onto the Cell/B.E. architecture is proposed. This method takes into account properties of the architecture on all three levels of its parallel processing hierarchy. The performance results are shown to be very promising. The possibility of using GPUs is studied as well, based on the Cauchy version of Reed-Solomon codes.

Roman Wyrzykowski, Lukasz Kuczynski, Marcin Wozniak

Communication-Efficient Algorithms for Numerical Quantum Dynamics

The time-dependent Schrödinger equation (TDSE) describes the quantum dynamical nature of molecular processes. However, numerical simulations of this linear, high-dimensional partial differential equation (PDE) rapidly become computationally very demanding and massive-scale parallel computing is needed to tackle many interesting problems. We present recent improvements to our MPI and OpenMP parallelized code framework HAParaNDA for solving high-dimensional PDE problems like the TDSE. By using communication-efficient high-order finite difference methods and Lanczos time propagators, we are able to accurately and efficiently solve TDSE problems in up to five dimensions on medium-sized clusters. We report numerical experiments which show that the solver scales well up to at least 4096 computing cores, also on computer systems with commodity communication networks.

Magnus Gustafsson, Katharina Kormann, Sverker Holmgren

Efficiently Implementing Monte Carlo Electrostatics Simulations on Multicore Accelerators

The field of high-performance computing is highly dependent on increasingly complex computer architectures. Parallel computing has been the norm for decades, but hardware architectures like the Cell Broadband Engine (Cell/BE) and General Purpose GPUs (GPGPUs) introduce additional complexities and are difficult to program efficiently even for well-suited problems. Efficiency is taken to include both maximizing the performance of the software and minimizing the programming effort required. With the goal of exposing the challenges facing a domain scientist using these types of hardware, in this paper we discuss the implementation of a Monte Carlo simulation of a system of charged particles on the Cell/BE and for GPUs. We focus on Coulomb interactions because their long-range nature prohibits using cut-offs to reduce the number of calculations, making simulations very expensive. The goal was to encapsulate the computationally expensive component of the program in a way so as to be useful to domain researchers with legacy codes. Generality and flexibility were therefore just as important as performance. Using the GPU and Cell/BE library requires only small changes in the simulation codes we’ve seen and yields programs that run at or near the theoretical peak performance of the hardware.

Marcus Holm, Sverker Holmgren

Fast PDE Solvers and a Posteriori Error Estimates

Algebraic Multigrid Solver on Clusters of CPUs and GPUs

Solvers for elliptic partial differential equations are needed in a wide area of scientific applications. We will present a highly parallel CPU and GPU implementation of a conjugate gradient solver with an algebraic multigrid preconditioner in a package called

Parallel Toolbox

. The solvers operates on fully unstructured discretizations of the PDE. The algorithmic specialities are investigated with respect to many-core architectures and the code is applied to one current application. Benchmark results of computations on clusters of CPUs and GPUs will be presented. They will show that a linear equation system with 25 million unknowns can be solved in about 1 second.

Aurel Neic, Manfred Liebmann, Gundolf Haase, Gernot Plank

Solution of Identification Problems in Computational Mechanics – Parallel Processing Aspects

Problems of identification of material parameters (mostly parameters appearing in constitutive relations) have applications in many fields of engineering including investigation of processes in a rock mass. This paper outlines the structure of parameter identification problems, methods for their solution and describes an identification (calibration) problem from geotechnics, which will serve as a realistic benchmark problem for illustration of the behaviour of selected parameter identification methods.

Radim Blaheta, Roman Kohut, Ondřej Jakl

Scalable Tools for High Performance Computing

ScalaTrace: Tracing, Analysis and Modeling of HPC Codes at Scale

Characterizing the communication behavior of large-scale applications is a difficult and costly task due to code/system complexity and their long execution times. An alternative to running actual codes is to gather their communication traces and then replay them, which facilitates application tuning and future procurements. While past approaches lacked lossless scalable trace collection, we contribute an approach that provides orders of magnitude smaller, if not near constant-size, communication traces regardless of the number of nodes while preserving structural information. We introduce intra- and inter-node compression techniques of MPI events, we develop a scheme to preserve time and causality of communication events, and we present results of our implementation for BlueGene/L. Given this novel capability, we discuss its impact on communication tuning and on trace extrapolation. To the best of our knowledge, such a concise representation of MPI traces in a scalable manner combined with time-preserving deterministic MPI call replay are without any precedence.

Frank Mueller, Xing Wu, Martin Schulz, Bronis R. de Supinski, Todd Gamblin

A Lightweight Library for Building Scalable Tools

MRNet is a software-based multicast reduction network for building scalable tools. Tools face communication and computation issues when used on large systems; MRNet alleviates these issues by providing multicast communication and data aggregation functionalities. Until now, the MRNet API has been entirely in C++. We present a new, lightweight library that provides a C interface for MRNet back-ends, making MRNet accessible to a wide range of new tools. Further, this library is single threaded to accommodate even more platforms and tools where this is a limitation.This new library provides the same abstractions as the C++ library, using an API that can be derived by applying a standard translation template to the C++ API.

Emily R. Jacobson, Michael J. Brim, Barton P. Miller

MATE: Toward Scalable Automated and Dynamic Performance Tuning Environment

The use of parallel/distributed programming increases as it enables high performance computing. There are many tools that help a user in the performance analysis of the application, and that allow to improve the application execution. As there is a high demand of computational power, new systems, such as large scale computer clusters, have become more common and accessible to everyone to solve complex problems. However, these systems generate a new set of problems related to the scalability of current analysis and tuning tools. Our automatic and dynamic tuning environment MATE does not scale well because it has a set of common bottlenecks in its architecture, and hence we have decided to improve the tool for providing dynamic tuning on large scale systems too. For this purpose, we are designing a new tool that introduces a tree-based overlay network infrastructure for scalable metrics collection, and to substitutes the current centralized performance analysis by a distributed one, in order to provide better scalability.

Anna Morajko, Andrea Martínez, Eduardo César, Tomàs Margalef, Joan Sorribes

Improving the Scalability of Performance Evaluation Tools

Performance evaluation tools play an important role in helping understand application performance, diagnose performance problems and guide tuning decisions on modern HPC systems. Tools to observe parallel performance must evolve to keep pace with the ever-increasing complexity of these systems. In this paper, we describe our experience in building novel tools and techniques in the TAU Performance System® to observe application performance effectively and efficiently at scale. It describes the extensions to TAU to contend with large data volumes associated with increasing core counts. These changes include new instrumentation choices, efficient handling of disk I/O operations in the measurement layer, and strategies for visualization of performance data at scale in TAU’s analysis layer, among others. We also describe some techniques that allow us to fully characterize the performance of applications running on hundreds of thousands of cores.

Sameer Suresh Shende, Allen D. Malony, Alan Morris

Automatic Performance Analysis of OpenMP Codes on a Scalable Shared Memory System Using Periscope

OpenMP is a successful interface for programming parallel applications on shared memory systems. It is widely applied on small scale shared memory systems such as multicore processors, but also in hybrid programming on large supercomputers. This paper presents performance properties for OpenMP and their automatic detection by Periscope. We evaluate Periscope’s OpenMP analysis strategy in the context of the Altix 4700 supercomputer at Leibniz Computing Center (LRZ) in Garching. On this unique machine OpenMP scales up to 500 cores, one partition of in total 19 partitions. We present results for the NAS parallel benchmarks and for a large hybrid scientific application.

Shajulin Benedict, Michael Gerndt

Further Improving the Scalability of the Scalasca Toolset

Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. Target applications include simulation codes from science and engineering based on the parallel programming interfaces MPI and/or OpenMP. Scalasca, which has been specifically designed for use on large-scale machines such as IBM Blue Gene and Cray XT, integrates runtime summaries suitable to obtain a performance overview with in-depth studies of concurrent behavior via event tracing. Although Scalasca was already successfully used with codes running with 294,912 cores on a 72-rack Blue Gene/P system, the current software design shows scalability limitations that adversely affect user experience and that will present a serious obstacle on the way to mastering larger scales in the future. In this paper, we outline how to address the two most important ones, namely the unification of local identifiers at measurement finalization as well as collating and displaying analysis reports.

Markus Geimer, Pavel Saviankou, Alexandre Strube, Zoltán Szebenyi, Felix Wolf, Brian J. N. Wylie

Springer Professional

Über dieses Buch

Inhaltsverzeichnis

Frontmatter

Part II – Minisymposium Papers