
2016 | Book

Software for Exascale Computing - SPPEXA 2013-2015


About this book

The research and its outcomes presented in this collection focus on various aspects of high-performance computing (HPC) software and its development, which is confronted with diverse challenges as today's supercomputer technology heads towards exascale computing. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The collection thereby highlights pioneering research findings as well as innovative concepts in exascale software development that have been pursued under the umbrella of the priority programme "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) and that have been presented at the SPPEXA Symposium, January 25-27, 2016, in Munich. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

Table of Contents

Frontmatter

EXA-DUNE: Flexible PDE Solvers, Numerical Methods, and Applications

Frontmatter
Hardware-Based Efficiency Advances in the EXA-DUNE Project
Abstract
We present advances concerning efficient finite element assembly and linear solvers on current and upcoming HPC architectures obtained within the Exa-Dune project, part of the DFG priority program 1648 Software for Exascale Computing (SPPEXA). In this project, we aim at the development of both flexible and efficient hardware-aware software components for the solution of PDEs based on the DUNE platform and the FEAST library. In this contribution, we focus on node-level performance and accelerator integration, which will complement the proven MPI-level scalability of the framework. The higher-level aspects of the Exa-Dune project, in particular multiscale methods and uncertainty quantification, are detailed in the companion paper (Bastian et al., Advances concerning multiscale methods and uncertainty quantification in Exa-Dune. In: Proceedings of the SPPEXA Symposium, 2016).
Peter Bastian, Christian Engwer, Jorrit Fahlke, Markus Geveler, Dominik Göddeke, Oleg Iliev, Olaf Ippisch, René Milk, Jan Mohring, Steffen Müthing, Mario Ohlberger, Dirk Ribbrock, Stefan Turek
Advances Concerning Multiscale Methods and Uncertainty Quantification in EXA-DUNE
Abstract
In this contribution we present advances concerning efficient parallel multiscale methods and uncertainty quantification that have been obtained within the DFG priority program 1648 Software for Exascale Computing (SPPEXA), in the funded project Exa-Dune. This project aims at the development of flexible yet hardware-specific software components and scalable high-level algorithms for the solution of partial differential equations based on the DUNE platform. While the development of hardware-based concepts and software components is detailed in the companion paper (Bastian et al., Hardware-based efficiency advances in the Exa-Dune project. In: Proceedings of the SPPEXA Symposium 2016, Munich, 25–27 Jan 2016), we focus here on the development of scalable multiscale methods in the context of uncertainty quantification. Such problems add additional layers of coarse-grained parallelism, as the underlying problems require the solution of many local or global partial differential equations in parallel that are only weakly coupled.
Peter Bastian, Christian Engwer, Jorrit Fahlke, Markus Geveler, Dominik Göddeke, Oleg Iliev, Olaf Ippisch, René Milk, Jan Mohring, Steffen Müthing, Mario Ohlberger, Dirk Ribbrock, Stefan Turek
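As a minimal illustration of the coarse-grained parallelism described above (a generic sketch, not the specific estimator used in Exa-Dune), consider a plain Monte Carlo approximation of a quantity of interest Q under an uncertain parameter field: each sample requires one independent PDE solve, and the samples are coupled only through the final average,

  \mathbb{E}[Q] \;\approx\; \frac{1}{M} \sum_{i=1}^{M} Q\bigl(u_h(\omega_i)\bigr),

where u_h(\omega_i) denotes the discrete solution of the PDE for the i-th realization \omega_i of the uncertain data. The M solves can be distributed across the machine with almost no communication, on top of the parallelism inside each individual solve.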

ExaStencils: Advanced Stencil-Code Engineering

Frontmatter
Systems of Partial Differential Equations in ExaSlang
Abstract
As HPC systems are becoming increasingly heterogeneous and diverse, writing software that attains maximum performance and scalability while remaining portable as well as easily composable is getting more and more challenging. Additionally, code that has been aggressively optimized for certain execution platforms is usually not easily portable to others without either losing a great share of performance or investing many hours in re-applying optimizations. One possible remedy is to exploit the potential of technologies such as domain-specific languages (DSLs) that provide appropriate abstractions and allow the application of techniques like automatic code generation and auto-tuning. In the domain of geometric multigrid solvers, project ExaStencils follows this road by aiming at providing highly optimized and scalable numerical solvers, specifically tuned for a given application and target platform. Here, we introduce its DSL ExaSlang with data types for local vectors to support computations that use point-local vectors and matrices. These data types allow an intuitive modeling of many physical problems represented by systems of partial differential equations (PDEs), e.g., the simulation of flows that include vector-valued velocities.
Christian Schmitt, Sebastian Kuckuk, Frank Hannig, Jürgen Teich, Harald Köstler, Ulrich Rüde, Christian Lengauer
Performance Prediction of Multigrid-Solver Configurations
Abstract
Geometric multigrid solvers are among the most efficient methods for solving partial differential equations. To optimize performance, developers have to select an appropriate combination of algorithms for the hardware and problem at hand. Since a manual configuration of a multigrid solver is tedious and does not scale to a large number of different hardware platforms, we have been developing a code generator that automatically generates a multigrid-solver configuration tailored to a given problem. However, identifying a performance-optimal solver configuration is typically a non-trivial task, because there is a large number of configuration options from which developers can choose. As a solution, we present a machine-learning approach that allows developers to predict the performance of solver configurations based on quantifying the influence of individual configuration options and the interactions between them. As our preliminary results on three configurable multigrid solvers were encouraging, we focus on a larger, non-trivial case study in this work. Furthermore, we discuss and demonstrate how to integrate domain knowledge into our machine-learning approach to improve accuracy and scalability, and we explore how the performance models we learn can help developers and domain experts understand their system.
Alexander Grebhahn, Norbert Siegmund, Harald Köstler, Sven Apel
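The influence quantification mentioned in the abstract can be written, in the generic form of a performance-influence model (the notation here is illustrative, not necessarily the chapter's), as

  \Pi(c) \;=\; \beta_0 \;+\; \sum_{i} \beta_i\, x_i(c) \;+\; \sum_{i<j} \beta_{ij}\, x_i(c)\, x_j(c),

where x_i(c) indicates whether configuration option i is selected in configuration c, \beta_i captures that option's individual influence on performance, \beta_{ij} captures pairwise interactions, and the coefficients are learned from measured solver configurations.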

EXASTEEL: Bridging Scales for Multiphase Steels

Frontmatter
One-Way and Fully-Coupled FE2 Methods for Heterogeneous Elasticity and Plasticity Problems: Parallel Scalability and an Application to Thermo-Elastoplasticity of Dual-Phase Steels
Abstract
In this paper, aspects of the two-scale simulation of dual-phase steels are considered. First, we present two-scale simulations applying a top-down one-way coupling to a full thermo-elastoplastic model in order to study the emerging temperature field. We find that, for our purposes, the consideration of thermo-mechanics at the microscale is not necessary. Second, we present highly parallel fully-coupled two-scale FE2 simulations, now neglecting temperature, using up to 458,752 cores of the JUQUEEN supercomputer at Forschungszentrum Jülich. The strong and weak parallel scalability results obtained for heterogeneous nonlinear hyperelasticity exemplify the massively parallel potential of the FE2 multiscale method.
Daniel Balzani, Ashutosh Gandhi, Axel Klawonn, Martin Lanser, Oliver Rheinbach, Jörg Schröder
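For orientation, the FE2 scheme attaches a microscopic boundary value problem on a representative volume element (RVE) to every macroscopic integration point; in its standard textbook form (notation may differ from the chapter), the macroscopic stress is obtained by volume averaging over the RVE domain \mathcal{B}_\mu,

  \bar{\boldsymbol{\sigma}} \;=\; \frac{1}{|\mathcal{B}_\mu|} \int_{\mathcal{B}_\mu} \boldsymbol{\sigma}\; \mathrm{d}V,

so that the averaged microscopic response replaces a phenomenological macroscopic material law. The nested RVE solves at all integration points are independent of each other, which is the source of the massive parallelism exploited in these simulations.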
Scalability of Classical Algebraic Multigrid for Elasticity to Half a Million Parallel Tasks
Abstract
The parallel performance of several classical Algebraic Multigrid (AMG) methods applied to linear elasticity problems is investigated. These methods include standard AMG approaches for systems of partial differential equations such as the unknown and hybrid approaches, as well as the more recent global matrix (GM) and local neighborhood (LN) approaches, which incorporate rigid body modes (RBMs) into the AMG interpolation operator. Numerical experiments are presented for both two- and three-dimensional elasticity problems on up to 131,072 cores (and 262,144 MPI processes) on the Vulcan supercomputer (LLNL, USA) and up to 262,144 cores (and 524,288 MPI processes) on the JUQUEEN supercomputer (JSC, Jülich, Germany). It is demonstrated that incorporating all RBMs into the interpolation leads generally to faster convergence and improved scalability.
Allison H. Baker, Axel Klawonn, Tzanio Kolev, Martin Lanser, Oliver Rheinbach, Ulrike Meier Yang

EXAHD: An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond

Frontmatter
Recent Developments in the Theory and Application of the Sparse Grid Combination Technique
Abstract
Substantial modifications of the choice of grids, the combination coefficients, the parallel data structures, and the algorithms used for the combination technique lead to scalable numerical methods. This is demonstrated by error and complexity bounds and by performance studies based on a state-of-the-art code for the solution of the gyrokinetic equations of plasma physics. The key ideas for a new fault-tolerant combination technique are mentioned. New algorithms for both initial- and eigenvalue problems have been developed and are shown to have good performance.
Markus Hegland, Brendan Harding, Christoph Kowitz, Dirk Pflüger, Peter Strazdins
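For reference, the classical combination technique approximates the sparse grid solution of a d-dimensional problem of level n by a weighted sum of solutions u_\ell computed on coarse, anisotropic full grids (standard form, up to the indexing convention; the modified grid choices and coefficients discussed above generalize this):

  u_n^{(c)} \;=\; \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{|\boldsymbol{\ell}|_1 = n - q} u_{\boldsymbol{\ell}} .

Each component solution u_{\boldsymbol{\ell}} can be computed independently with an existing solver, which is what makes the technique attractive for extreme-scale and fault-tolerant computing.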
Scalable Algorithms for the Solution of Higher-Dimensional PDEs
Abstract
The solution of higher-dimensional problems, such as the simulation of plasma turbulence in a fusion device as described by the five-dimensional gyrokinetic equations, is a grand challenge for current and future high-performance computing. The sparse grid combination technique is a promising approach to the solution of these problems on large-scale distributed memory systems. The combination technique numerically decomposes a single large problem into multiple moderately-sized partial problems that can be computed in parallel, independently and asynchronously of each other. The ability to efficiently combine the individual partial solutions to a common sparse grid solution is a key to the overall performance of such large-scale computations. In this work, we present new algorithms for the recombination of distributed component grids and demonstrate their scalability to 180,225 cores on the supercomputer Hazel Hen.
Mario Heene, Dirk Pflüger
Handling Silent Data Corruption with the Sparse Grid Combination Technique
Abstract
We describe two algorithms to detect and filter silent data corruption (SDC) when solving time-dependent PDEs with the Sparse Grid Combination Technique (SGCT). The SGCT solves a PDE on many regular full grids of different resolutions, which are then combined to obtain a high quality solution. The algorithm can be parallelized and run on large HPC systems. We investigate silent data corruption and show that the SGCT can be used with minor modifications to filter corrupted data and obtain good results. We apply sanity checks before combining the solution fields to make sure that the data is not corrupted. These sanity checks are derived from well-known error bounds of the classical theory of the SGCT and do not rely on checksums or data replication. We apply our algorithms on a 2D advection equation and discuss the main advantages and drawbacks.
Alfredo Parra Hinojosa, Brendan Harding, Markus Hegland, Hans-Joachim Bungartz
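The sanity checks can be pictured with the following hypothetical C++ sketch (illustrative only, not the chapter's exact algorithm): before combining, each component grid's values at shared sparse-grid points are compared against a robust reference, and grids whose deviation exceeds a level-dependent tolerance derived from the error bounds are flagged and excluded from the combination.

  // Minimal sketch (hypothetical): flag component grids whose values at shared
  // sparse-grid points deviate from the median of all components by more than a
  // level-dependent tolerance derived from the expected discretization error.
  #include <algorithm>
  #include <cmath>
  #include <cstddef>
  #include <vector>

  // values[i][k] = value of component grid i at shared check point k,
  // tol[i]       = error-bound-based tolerance for component grid i.
  std::vector<bool> flag_suspect_components(
      const std::vector<std::vector<double>>& values,
      const std::vector<double>& tol) {
    const std::size_t n_grids = values.size();
    const std::size_t n_points = values.empty() ? 0 : values.front().size();
    std::vector<bool> suspect(n_grids, false);
    for (std::size_t k = 0; k < n_points; ++k) {
      // Robust reference: median of all component values at this point.
      std::vector<double> column(n_grids);
      for (std::size_t i = 0; i < n_grids; ++i) column[i] = values[i][k];
      std::nth_element(column.begin(), column.begin() + n_grids / 2, column.end());
      const double reference = column[n_grids / 2];
      for (std::size_t i = 0; i < n_grids; ++i)
        if (std::abs(values[i][k] - reference) > tol[i]) suspect[i] = true;
    }
    return suspect;  // suspect grids are excluded or recomputed before combining
  }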

TERRA-NEO: Integrated Co-Design of an Exascale Earth Mantle Modeling Framework

Frontmatter
Hybrid Parallel Multigrid Methods for Geodynamical Simulations
Abstract
Even on modern supercomputer architectures, Earth mantle simulations are so compute intensive that they are considered grand challenge applications. The dominating roadblocks in this branch of Geophysics are model complexity and uncertainty in parameters and data, e.g., rheology and seismically imaged mantle heterogeneity, as well as the enormous space and time scales that must be resolved in the computational models. This article reports on a massively parallel all-at-once multigrid solver for the Stokes system as it arises in mantle convection models. The solver employs the hierarchical hybrid grids framework and demonstrates that a system with coupled velocity components and with more than a trillion (1.7 ⋅ 10¹²) degrees of freedom can be solved in about 1,000 s using 40,960 compute cores of JUQUEEN. The simulation framework is used to investigate the influence of asthenosphere thickness and viscosity on upper mantle velocities in a static scenario. Additionally, results for a time-dependent simulation with a time-variable temperature-dependent viscosity model are presented.
Simon Bauer, Hans-Peter Bunge, Daniel Drzisga, Björn Gmeiner, Markus Huber, Lorenz John, Marcus Mohr, Ulrich Rüde, Holger Stengel, Christian Waluga, Jens Weismüller, Gerhard Wellein, Markus Wittmann, Barbara Wohlmuth
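The coupled velocity-pressure system referred to above is the (generalized) Stokes system of mantle convection under the Boussinesq approximation; in a common form (notation may differ from the chapter) it reads

  -\nabla \cdot \bigl( 2\,\mu\, \boldsymbol{\varepsilon}(\mathbf{u}) \bigr) \;+\; \nabla p \;=\; \rho(T)\,\mathbf{g}, \qquad \nabla \cdot \mathbf{u} \;=\; 0,

with velocity u, pressure p, strain rate tensor \varepsilon(u), (possibly temperature-dependent) viscosity \mu, and buoyancy forcing \rho(T) g. The all-at-once multigrid solver treats velocity and pressure unknowns simultaneously rather than iterating between them.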

ExaFSA: Exascale Simulation of Fluid--Structure--Acoustics Interactions

Frontmatter
Partitioned Fluid–Structure–Acoustics Interaction on Distributed Data: Coupling via preCICE
Abstract
One of the great prospects of exascale computing is to simulate challenging, highly complex multi-physics scenarios with different length and time scales. A modular approach re-using existing software for the single-physics model parts has great advantages regarding flexibility and software development costs. At the same time, it poses challenges in terms of numerical stability and parallel scalability. The coupling library preCICE provides communication, data mapping, and coupling numerics for surface-coupled multi-physics applications in a highly modular way. We recapitulate the numerical methods but focus particularly on their parallel implementation. Numerical results for an artificial coupling interface show that the runtime of the coupling is very small compared to typical solver runtimes, and that the coupling scales well on core counts corresponding to a massively parallel, actual coupled simulation. Further results for actual application scenarios from the field of fluid–structure–acoustic interactions are presented in the next chapter.
Hans-Joachim Bungartz, Florian Lindner, Miriam Mehl, Klaudius Scheufele, Alexander Shukaev, Benjamin Uekermann
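The role of the coupling library in a partitioned simulation can be sketched as follows; this is a schematic, compilable C++ illustration with placeholder names, not the actual preCICE API, whose calls and configuration differ.

  #include <vector>

  // Hypothetical stand-in for the coupling library interface (NOT the preCICE API);
  // a real library performs mesh setup, data mapping and communication here.
  struct CouplingInterface {
    int steps_left = 3;
    double initialize() { return 0.01; }                   // returns first time step size
    bool isCouplingOngoing() const { return steps_left > 0; }
    void writeData(const std::vector<double>&) {}          // send coupling data (e.g. forces)
    void readData(std::vector<double>&) {}                 // receive coupling data (e.g. displacements)
    double advance(double dt) { --steps_left; return dt; } // map, communicate, apply coupling numerics
    void finalize() {}
  };

  int main() {
    CouplingInterface coupling;
    std::vector<double> forces(3, 0.0), displacements(3, 0.0);
    double dt = coupling.initialize();
    while (coupling.isCouplingOngoing()) {
      // 1. advance the single-physics solver (fluid, structure, or acoustics) by dt
      // 2. exchange interface data through the coupling library
      coupling.writeData(forces);
      dt = coupling.advance(dt);
      coupling.readData(displacements);
    }
    coupling.finalize();
  }

Each participant (fluid, structure, or acoustics solver) runs such a loop on its own processes; the library hides data mapping, communication, and any implicit coupling iterations behind the advance step.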
Partitioned Fluid–Structure–Acoustics Interaction on Distributed Data: Numerical Results and Visualization
Abstract
We present a coupled simulation approach for fluid–structure–acoustic interactions (FSAI) as an example for strongly surface coupled multi-physics problems. In addition to the multi-physics character, FSAI feature multi-scale properties as a further challenge. In our partitioned approach, the problem is split into spatially separated subdomains interacting via coupling surfaces. Within each subdomain, scalable, single-physics solvers are used to solve the respective equation systems. The surface coupling between them is realized with the scalable open-source coupling tool preCICE described in the chapter “Partitioned Fluid–Structure–Acoustics Interaction on Distributed Data: Coupling via preCICE”. We show how this approach enables the use of existing solvers and present the overall scaling behavior for a three-dimensional test case with a bending tower generating acoustic waves. We run this simulation with different solvers, demonstrating the performance of the various solvers and the flexibility of the partitioned approach with the coupling tool preCICE. An efficient and scalable in-situ visualization, which reduces the amount of data in place on the simulation processors before sending it over the network or to a file system, completes the simulation environment.
David Blom, Thomas Ertl, Oliver Fernandes, Steffen Frey, Harald Klimach, Verena Krupp, Miriam Mehl, Sabine Roller, Dörte C. Sternel, Benjamin Uekermann, Tilo Winter, Alexander van Zuijlen

ESSEX: Equipping Sparse Solvers for Exascale

Frontmatter
Towards an Exascale Enabled Sparse Solver Repository
Abstract
As we approach the exascale computing era, disruptive changes in the software landscape are required to tackle the challenges posed by manycore CPUs and accelerators. We discuss the development of a new ‘exascale enabled’ sparse solver repository (the ESSR) that addresses these challenges—from fundamental design considerations and development processes to actual implementations of some prototypical iterative schemes for computing eigenvalues of sparse matrices. Key features of the ESSR include holistic performance engineering, tight integration between software layers and mechanisms to mitigate hardware failures.
Jonas Thies, Martin Galgon, Faisal Shahzad, Andreas Alvermann, Moritz Kreutzer, Andreas Pieper, Melven Röhrig-Zöllner, Achim Basermann, Holger Fehske, Georg Hager, Bruno Lang, Gerhard Wellein
Performance Engineering and Energy Efficiency of Building Blocks for Large, Sparse Eigenvalue Computations on Heterogeneous Supercomputers
Abstract
Numerous challenges have to be mastered as applications in scientific computing are being developed for post-petascale parallel systems. While ample parallelism is usually available in the numerical problems at hand, the efficient use of supercomputer resources requires not only good scalability but also a verifiably effective use of resources on the core, the processor, and the accelerator level. Furthermore, power dissipation and energy consumption are becoming further optimization targets besides time-to-solution. Performance Engineering (PE) is the pivotal strategy for developing effective parallel code on all levels of modern architectures. In this paper we report on the development and use of low-level parallel building blocks in the GHOST library (“General, Hybrid, and Optimized Sparse Toolkit”). We demonstrate the use of PE in optimizing a density of states computation using the Kernel Polynomial Method, and show that reduction of runtime and reduction of energy are literally the same goal in this case. We also give a brief overview of the capabilities of GHOST and the applications in which it is being used successfully.
Moritz Kreutzer, Jonas Thies, Andreas Pieper, Andreas Alvermann, Martin Galgon, Melven Röhrig-Zöllner, Faisal Shahzad, Achim Basermann, Alan R. Bishop, Holger Fehske, Georg Hager, Bruno Lang, Gerhard Wellein
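For context, the Kernel Polynomial Method expands the density of states of a rescaled sparse Hamiltonian \tilde{H} (spectrum mapped into [-1,1]) in Chebyshev polynomials T_n; in its standard form,

  \rho(E) \;\approx\; \frac{1}{\pi\sqrt{1-E^2}} \Bigl( g_0\,\mu_0 + 2 \sum_{n=1}^{N-1} g_n\,\mu_n\, T_n(E) \Bigr),
  \qquad
  \mu_n \;=\; \frac{1}{D}\,\mathrm{Tr}\, T_n(\tilde{H}) \;\approx\; \frac{1}{R\,D} \sum_{r=1}^{R} \langle v_r | T_n(\tilde{H}) | v_r \rangle,

where the g_n are kernel damping factors and the moments \mu_n are estimated with R random vectors v_r. Evaluating the moments reduces to chains of sparse matrix-(multiple-)vector products, exactly the building blocks whose performance and energy behavior are engineered in GHOST.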

DASH: Hierarchical Arrays for Efficient and Productive Data-Intensive Exascale Computing

Frontmatter
Expressing and Exploiting Multi-Dimensional Locality in DASH
Abstract
DASH is a realization of the PGAS (partitioned global address space) programming model in the form of a C++ template library. It provides a multi-dimensional array abstraction which is typically used as an underlying container for stencil- and dense matrix operations. Efficiency of operations on a distributed multi-dimensional array highly depends on the distribution of its elements to processes and the communication strategy used to propagate values between them. Locality can only be improved by employing an optimal distribution that is specific to the implementation of the algorithm, run-time parameters such as node topology, and numerous additional aspects. Application developers are typically not aware of these implications, which might also change in future releases of DASH. In the following, we identify fundamental properties of distribution patterns that are prevalent in existing HPC applications. We describe a classification scheme of multi-dimensional distributions based on these properties and demonstrate how distribution patterns can be optimized for locality and communication avoidance automatically and, to a great extent, at compile-time.
Tobias Fuchs, Karl Fürlinger
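What a distribution pattern decides can be illustrated with a deliberately simplified 2-D block pattern (illustrative C++ only; the actual DASH pattern interface is considerably richer and named differently):

  // Minimal sketch of a 2-D block distribution pattern (illustrative only).
  #include <array>
  #include <cstddef>
  #include <iostream>

  struct BlockPattern2D {
    std::array<std::size_t, 2> extents;    // global array extents (rows, cols)
    std::array<std::size_t, 2> team_grid;  // process grid (rows, cols)

    // Owning process (rank in a row-major process grid) of a global index.
    std::size_t unit_at(std::size_t i, std::size_t j) const {
      std::size_t block_rows = (extents[0] + team_grid[0] - 1) / team_grid[0];
      std::size_t block_cols = (extents[1] + team_grid[1] - 1) / team_grid[1];
      return (i / block_rows) * team_grid[1] + (j / block_cols);
    }
  };

  int main() {
    BlockPattern2D pattern{{1000, 1000}, {4, 2}};   // 8 processes in a 4x2 grid
    std::cout << pattern.unit_at(10, 999) << "\n";  // -> 1: first block row, second block column
  }

The pattern maps a global index to its owning unit; choosing block, cyclic, or tiled variants of this mapping is precisely what determines locality and communication volume for a given algorithm.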
Tool Support for Developing DASH Applications
Abstract
DASH is a new parallel programming model for HPC, implemented as a C++ template library on top of a runtime library that provides various PGAS (Partitioned Global Address Space) substrates. DASH’s goal is to offer an easy-to-use and efficient approach to parallel programming with C++. Software tool support is an important part of the DASH project, especially for debugging and performance monitoring. Debugging is particularly necessary when adopting a new parallelization model, while performance assessment is by nature crucial for High Performance Computing applications. Tools are fundamental for a programming ecosystem, and we are convinced that providing tools early brings multiple advantages, benefiting application developers using DASH as well as developers of the DASH library itself. This work first briefly introduces DASH and the underlying runtime system, as well as existing debuggers and performance analysis tools. We then demonstrate the specific debugging and performance monitoring extensions for DASH in exemplary use cases and discuss an early assessment of the results.
Denis Hünich, Andreas Knüpfer, Sebastian Oeste, Karl Fürlinger, Tobias Fuchs

EXAMAG: Exascale Simulations of the Evolution of the Universe Including Magnetic Fields

Frontmatter
Simulating Turbulence Using the Astrophysical Discontinuous Galerkin Code TENET
Abstract
In astrophysics, the two main methods traditionally in use for solving the Euler equations of ideal fluid dynamics are smoothed particle hydrodynamics and finite volume discretization on a stationary mesh. However, the goal to efficiently make use of future exascale machines with their ever higher degree of parallel concurrency motivates the search for more efficient and more accurate techniques for computing hydrodynamics. Discontinuous Galerkin (DG) methods represent a promising class of methods in this regard, as they can be straightforwardly extended to arbitrarily high order while requiring only small stencils. Especially for applications involving comparatively smooth problems, higher-order approaches promise significant gains in computational speed for reaching a desired target accuracy. Here, we introduce our new astrophysical DG code TENET designed for applications in cosmology, and discuss our first results for 3D simulations of subsonic turbulence. We show that our new DG implementation provides accurate results for subsonic turbulence, at considerably reduced computational cost compared with traditional finite volume methods. In particular, we find that DG needs about 1.8 times fewer degrees of freedom to achieve the same accuracy and at the same time is more than 1.5 times faster, confirming its substantial promise for astrophysical applications.
Andreas Bauer, Kevin Schaal, Volker Springel, Praveen Chandrashekar, Rüdiger Pakmor, Christian Klingenberg
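In a DG discretization, the Euler equations \partial_t \mathbf{U} + \nabla \cdot \mathbf{F}(\mathbf{U}) = 0 are tested against polynomial basis functions \phi on each cell K and integrated by parts, giving the standard semi-discrete weak form (details such as limiting vary between codes):

  \frac{\mathrm{d}}{\mathrm{d}t} \int_K \mathbf{U}\,\phi\, \mathrm{d}V \;=\; \int_K \mathbf{F}(\mathbf{U}) \cdot \nabla\phi\, \mathrm{d}V \;-\; \oint_{\partial K} \hat{\mathbf{F}}(\mathbf{U}^-,\mathbf{U}^+) \cdot \mathbf{n}\,\phi\, \mathrm{d}S,

where \hat{\mathbf{F}} is a numerical flux evaluated from the two cell-local states at a face. Only direct face neighbors enter, so the stencil stays small even at high polynomial order, which is the property exploited for parallel efficiency.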

FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing

Frontmatter
FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing
Abstract
In this paper we describe the hardware and application-inherent challenges that future exascale systems pose to high-performance computing (HPC) and propose a system architecture that addresses them. This architecture is based on proven building blocks and a few principles: (1) a fast light-weight kernel that is supported by a virtualized Linux for tasks that are not performance critical, (2) decentralized load and health management using fault-tolerant gossip-based information dissemination, (3) a maximally-parallel checkpoint store for cheap checkpoint/restart in the presence of frequent component failures, and (4) a runtime that enables applications to interact with the underlying system platform through new interfaces. The paper discusses the vision behind FFMK and the current state of a prototype implementation of the system, which is based on a microkernel and an adapted MPI runtime.
Carsten Weinhold, Adam Lackorzynski, Jan Bierbaum, Martin Küttler, Maksym Planeta, Hermann Härtig, Amnon Shiloh, Ely Levy, Tal Ben-Nun, Amnon Barak, Thomas Steinke, Thorsten Schütt, Jan Fajerski, Alexander Reinefeld, Matthias Lieber, Wolfgang E. Nagel
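Gossip-based dissemination, the second building block above, can be illustrated with a toy, self-contained C++ simulation (illustrative of the general technique only, not of FFMK's actual protocol or record layout): each node periodically pushes its local view to one randomly chosen peer, and the receiver keeps the newer record per node, so information spreads in roughly O(log N) rounds without any central component.

  #include <cstddef>
  #include <random>
  #include <vector>

  struct Record { double load = 0.0; long version = 0; };   // per-node load info
  using View = std::vector<Record>;                          // one entry per node

  void merge(View& mine, const View& theirs) {
    for (std::size_t n = 0; n < mine.size(); ++n)
      if (theirs[n].version > mine[n].version) mine[n] = theirs[n];
  }

  void gossip_round(std::vector<View>& views, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, views.size() - 1);
    for (std::size_t sender = 0; sender < views.size(); ++sender) {
      std::size_t receiver = pick(rng);
      if (receiver != sender) merge(views[receiver], views[sender]);
    }
  }

  int main() {
    std::mt19937 rng(42);
    std::vector<View> views(64, View(64));     // 64 nodes, each with a 64-entry view
    views[0][0].load = 0.9;                    // node 0 publishes a new load value
    views[0][0].version = 1;
    for (int round = 0; round < 10; ++round)   // information spreads epidemically
      gossip_round(views, rng);
  }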
Fast In-Memory Checkpointing with POSIX API for Legacy Exascale-Applications
Abstract
Exascale systems will be much more vulnerable to failures than today’s high-performance computers. We present a scheme that writes erasure-encoded checkpoints to other nodes’ memory. The rationale is twofold: first, writing to memory over the interconnect is several orders of magnitude faster than traditional disk-based checkpointing, and second, erasure-encoded data is able to survive component failures. We use a distributed file system with a tmpfs back end and intercept file accesses with LD_PRELOAD. Using a POSIX file system API, legacy applications that are prepared for application-level checkpoint/restart can quickly materialize their checkpoints via the supercomputer’s interconnect without the need to change the source code. Experimental results show that the LD_PRELOAD client yields 69 % better sequential bandwidth (with striping) than FUSE while still being transparent to the application. With erasure encoding, the performance is 17 % to 49 % worse than with striping because of the additional data handling and encoding effort. Even so, our results indicate that erasure-encoded memory checkpoint/restart is an effective means to improve resilience for exascale computing.
Jan Fajerski, Matthias Noack, Alexander Reinefeld, Florian Schintke, Torsten Schütt, Thomas Steinke
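The LD_PRELOAD mechanism works by placing an interception library ahead of libc in symbol resolution, so unmodified binaries call the library's wrappers instead of the real POSIX functions. A minimal C++ sketch for a single call is shown below; the "/checkpoint/" prefix and the logging are hypothetical, and the real client intercepts the full set of I/O calls and redirects them to the in-memory file system.

  // Minimal LD_PRELOAD interception sketch (illustrative only).
  // Build:  g++ -shared -fPIC -o libintercept.so intercept.cpp -ldl
  // Run:    LD_PRELOAD=./libintercept.so ./legacy_application
  #include <cstdarg>
  #include <cstdio>
  #include <cstring>
  #include <dlfcn.h>
  #include <fcntl.h>

  extern "C" int open(const char* path, int flags, ...) {
    // Look up the real libc open() once.
    using open_fn = int (*)(const char*, int, ...);
    static open_fn real_open = reinterpret_cast<open_fn>(dlsym(RTLD_NEXT, "open"));

    // A real client would rewrite checkpoint paths to the memory-backed mount here.
    if (std::strncmp(path, "/checkpoint/", 12) == 0)
      std::fprintf(stderr, "intercepted checkpoint open: %s\n", path);

    if (flags & O_CREAT) {
      va_list args;
      va_start(args, flags);
      mode_t mode = static_cast<mode_t>(va_arg(args, int));
      va_end(args);
      return real_open(path, flags, mode);
    }
    return real_open(path, flags);
  }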

CATWALK: A Quick Development Path for Performance Models

Frontmatter
Automatic Performance Modeling of HPC Applications
Abstract
Many existing applications suffer from inherent scalability limitations that will prevent them from running at exascale. Current tuning practices, which rely on diagnostic experiments, have drawbacks because (i) they detect scalability problems relatively late in the development process when major effort has already been invested into an inadequate solution and (ii) they incur the extra cost of potentially numerous full-scale experiments. Analytical performance models, in contrast, allow application developers to address performance issues already during the design or prototyping phase. Unfortunately, the difficulties of creating such models combined with the lack of appropriate tool support still render performance modeling an esoteric discipline mastered only by a relatively small community of experts. This article summarizes the results of the Catwalk project, which aimed to create tools that automate key activities of the performance modeling process, making this powerful methodology accessible to a wider audience of HPC application developers.
Felix Wolf, Christian Bischof, Alexandru Calotoiu, Torsten Hoefler, Christian Iwainsky, Grzegorz Kwasniewski, Bernd Mohr, Sergei Shudler, Alexandre Strube, Andreas Vogel, Gabriel Wittum
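The models produced by this line of work typically follow a performance model normal form, in which a metric such as runtime is expressed as a function of the number of processes p,

  f(p) \;=\; \sum_{k=1}^{n} c_k \cdot p^{\,i_k} \cdot \log_2^{\,j_k}(p),

where the exponents i_k and j_k are drawn from small predefined sets and the coefficients c_k are fitted to measurements at modest scale; extrapolating f(p) then exposes scalability problems long before full-scale runs.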
Automated Performance Modeling of the UG4 Simulation Framework
Abstract
Many scientific research questions, such as the drug diffusion through the upper part of the human skin, are formulated in terms of partial differential equations, and their solution is numerically addressed using grid-based finite element methods. For detailed and more realistic physical models this computational task becomes challenging and thus complex numerical codes with good scaling properties up to millions of computing cores are required. Employing empirical tests, we presented very good scaling properties for the geometric multigrid solver in Reiter et al. (Comput Vis Sci 16(4):151–164, 2013) using the UG4 framework that is used to address such problems. In order to further validate the scalability of the code we applied automated performance modeling to UG4 simulations and presented how performance bottlenecks can be detected and resolved in Vogel et al. (10,000 performance models per minute—scalability of the UG4 simulation framework. In: Träff JL, Hunold S, Versaci F (eds) Euro-Par 2015: Parallel processing, theoretical computer science and general issues, vol 9233. Springer, Heidelberg, pp 519–531, 2015). In this paper we provide an overview of the obtained results, present a more detailed analysis via performance models for the components of the geometric multigrid solver, and comment on how the performance models coincide with our expectations.
Andreas Vogel, Alexandru Calotoiu, Arne Nägel, Sebastian Reiter, Alexandre Strube, Gabriel Wittum, Felix Wolf

GROMEX: Unified Long-Range Electrostatics and Dynamic Protonation for Realistic Biomolecular Simulations on the Exascale

Frontmatter
Accelerating an FMM-Based Coulomb Solver with GPUs
Abstract
The simulation of long-range electrostatic interactions in huge particle ensembles is a vital issue in current scientific research. The Fast Multipole Method (FMM) is able to compute those Coulomb interactions with extraordinary speed and controlled precision. A key part of this method is its shifting operators, which usually exhibit O(p⁴) complexity. Some special rotation-based operators with O(p³) complexity can be used instead. However, they are still computationally expensive. Here we report on the parallelization of those operators, which have been implemented for a GPU cluster to speed up the FMM calculations.
Alberto Garcia Garcia, Andreas Beckmann, Ivo Kabadshow
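The complexity figures above refer to the FMM translation operators: a multipole expansion truncated at order p has on the order of (p+1)² coefficients, so a direct translation couples every pair of coefficients, whereas the rotation-based variant first aligns the shift direction with the z-axis:

  \underbrace{(p+1)^2 \times (p+1)^2 \ \text{couplings}}_{\text{direct shift: } \mathcal{O}(p^4)}
  \;\longrightarrow\;
  \underbrace{\mathcal{O}(p^3)}_{\text{rotate}} \;+\; \underbrace{\mathcal{O}(p^3)}_{\text{shift along } z} \;+\; \underbrace{\mathcal{O}(p^3)}_{\text{rotate back}} .

The asymptotic gain comes at the price of additional, still sizable dense operations, which is what motivates offloading them to GPUs.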

ExaSolvers: Extreme Scale Solvers for Coupled Problems

Frontmatter
Space and Time Parallel Multigrid for Optimization and Uncertainty Quantification in PDE Simulations
Abstract
In this article we present a complete parallelization approach for simulations of PDEs with applications in optimization and uncertainty quantification. The method of choice for linear or nonlinear elliptic or parabolic problems is the geometric multigrid method since it can achieve optimal (linear) complexity in terms of degrees of freedom, and it can be combined with adaptive refinement strategies in order to find the minimal number of degrees of freedom. This optimal solver is parallelized such that weak and strong scaling is possible for extreme scale HPC architectures. For the space parallelization of the multigrid method we use a tree based approach that allows for an adaptive grid refinement and online load balancing. Parallelization in time is achieved by SDC/ISDC or a space-time formulation. As an example we consider the permeation through human skin which serves as a diffusion model problem where aspects of shape optimization, uncertainty quantification as well as sensitivity to geometry and material parameters are studied. All methods are developed and tested in the UG4 library.
Lars Grasedyck, Christian Löbbert, Gabriel Wittum, Arne Nägel, Volker Schulz, Martin Siebenborn, Rolf Krause, Pietro Benedusi, Uwe Küster, Björn Dick

Further Contributions

Frontmatter
Domain Overlap for Iterative Sparse Triangular Solves on GPUs
Abstract
Iterative methods for solving sparse triangular systems are an attractive alternative to exact forward and backward substitution if an approximation of the solution is acceptable. On modern hardware, performance benefits are available as iterative methods allow for better parallelization. In this paper, we investigate how block-iterative triangular solves can benefit from using overlap. Because the matrices are triangular, we use “directed” overlap, depending on whether the matrix is upper or lower triangular. We enhance a GPU implementation of the block-asynchronous Jacobi method with directed overlap. For GPUs and other cases where the problem must be overdecomposed, i.e., with more subdomains and threads than cores, it is preferable to process or schedule the subdomains in a specific order, following the dependencies specified by the sparse triangular matrix. For sparse triangular factors from incomplete factorizations, we demonstrate that moderate directed overlap with subdomain scheduling can improve convergence and time-to-solution.
Hartwig Anzt, Edmond Chow, Daniel B. Szyld, Jack Dongarra
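For a lower triangular system L x = b split as L = D + E, with D the diagonal and E the strictly lower triangular part, the Jacobi iteration underlying the method reads (standard form; the chapter's contribution lies in the overlapped, block-asynchronous GPU variant):

  x^{(k+1)} \;=\; D^{-1}\bigl( b - E\, x^{(k)} \bigr).

Because D^{-1}E is strictly triangular and hence nilpotent, the iteration terminates exactly after at most n steps, and an acceptable approximation is often reached much earlier, with every component update exposing fine-grained parallelism.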
Asynchronous OpenCL/MPI Numerical Simulations of Conservation Laws
Abstract
Hyperbolic conservation laws are important mathematical models for describing many phenomena in physics or engineering. The Finite Volume (FV) method and the Discontinuous Galerkin (DG) method are two popular methods for solving conservation laws on computers. In this paper, we present several FV and DG numerical simulations that we have realized with the OpenCL and MPI paradigms. First, we compare two optimized implementations of the FV method on a regular grid: an OpenCL implementation and a more traditional OpenMP implementation. We compare the efficiency of the approach on several CPU and GPU architectures of different brands. Then we present how we have implemented the DG method in the OpenCL/MPI framework in order to achieve high efficiency. The implementation relies on a splitting of the DG mesh into subdomains and subzones. Different kernels are compiled according to the zone properties. In addition, we rely on the OpenCL asynchronous task graph in order to overlap OpenCL computations, memory transfers and MPI communications.
Philippe Helluy, Thomas Strub, Michel Massaro, Malcolm Roberts
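For orientation, the explicit finite volume update for a conservation law \partial_t w + \nabla \cdot F(w) = 0 advances the cell average w_i with a numerical flux \hat{F} between neighboring cells (generic form; the paper's schemes add further ingredients):

  w_i^{\,n+1} \;=\; w_i^{\,n} \;-\; \frac{\Delta t}{|\Omega_i|} \sum_{j \in \mathcal{N}(i)} |\Gamma_{ij}|\; \hat{F}\bigl( w_i^{\,n}, w_j^{\,n}, \mathbf{n}_{ij} \bigr),

where \Gamma_{ij} is the face between cells i and j with outward normal \mathbf{n}_{ij}. Each cell update depends only on neighbor data, which maps naturally to OpenCL work-items within a subdomain and to MPI halo exchange between subdomains.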
Backmatter
Metadata
Title
Software for Exascale Computing - SPPEXA 2013-2015
Editors
Hans-Joachim Bungartz
Philipp Neumann
Wolfgang E. Nagel
Copyright Year
2016
Electronic ISBN
978-3-319-40528-5
Print ISBN
978-3-319-40526-1
DOI
https://doi.org/10.1007/978-3-319-40528-5
