
2015 | Book

High Performance Computing for Computational Science -- VECPAR 2014

11th International Conference, Eugene, OR, USA, June 30 -- July 3, 2014, Revised Selected Papers

About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 11th International Conference on High Performance Computing for Computational Science, VECPAR 2014, held in Eugene, OR, USA, in June/July 2014.

The 25 papers presented were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on algorithms for GPU and manycores, large-scale applications, numerical algorithms, direct/hybrid methods for solving sparse matrices, and performance tuning. The volume also contains the papers presented at the 9th International Workshop on Automatic Performance Tuning.

Table of Contents

Frontmatter

Algorithms for GPU and Manycores

Frontmatter
A Communication Optimization Scheme for Basis Computation of Krylov Subspace Methods on Multi-GPUs
Abstract
Krylov Subspace Methods (KSMs) are widely used for solving large-scale linear systems and eigenproblems. However, the computation of Krylov subspace bases suffers from the overhead of performing global reduction operations when computing the inner vector products in the orthogonalization steps. In this paper, a hypergraph-based communication optimization scheme is applied to the Arnoldi and incomplete Arnoldi methods for forming a Krylov subspace basis from a sparse matrix, and the features of these methods are compared analytically. Finally, experiments on a CPU-GPU heterogeneous cluster show that our optimization improves the Arnoldi implementations for a generic matrix and yields up to a 10x speedup for some matrices with a special diagonal structure. The performance advantage also varies with subspace size and matrix format, which calls for further integration of an auto-tuning strategy.
Langshi Chen, Serge G. Petiton, Leroy A. Drummond, Maxime Hugues
Mixed-Precision Orthogonalization Scheme and Adaptive Step Size for Improving the Stability and Performance of CA-GMRES on GPUs
Abstract
The Generalized Minimum Residual (GMRES) method is a popular Krylov subspace projection method for solving a nonsymmetric linear system of equations. On modern computers, communication is becoming increasingly expensive compared to arithmetic operations, and a communication-avoiding variant (CA-GMRES) may improve the performance of GMRES. To further enhance the performance of CA-GMRES, in this paper, we propose two techniques, focusing on the two main computational kernels of CA-GMRES, tall-skinny QR (TSQR) and matrix powers kernel (MPK). First, to improve the numerical stability of TSQR based on the Cholesky QR (CholQR) factorization, we use higher-precision arithmetic at carefully-selected steps of the factorization. In particular, our mixed-precision CholQR takes the input matrix in the standard \(64\)-bit double precision, but accumulates some of its intermediate results in a software-emulated double-double precision. Compared with the standard CholQR, this mixed-precision CholQR requires about \(8.5\times \) more computation but a much smaller increase in communication. Since the computation is becoming less expensive compared to the communication on a newer computer, the relative overhead of the mixed-precision CholQR is decreasing. Our case studies on a GPU demonstrate that using higher-precision arithmetic for this small but critical segment of the algorithm can improve not only the overall numerical stability of CA-GMRES but also, in some cases, its performance. We then study an adaptive scheme to dynamically adjust the step size of MPK based on the static inputs and the performance measurements gathered during the first restart loop of CA-GMRES. Since the optimal step size of MPK is often much smaller than that of the orthogonalization kernel, the overall performance of CA-GMRES can be improved using different step sizes for these two kernels. 
Our performance results on multiple GPUs show that our adaptive scheme can choose a near optimal step size for MPK, reducing the total solution time of CA-GMRES.
Ichitaro Yamazaki, Stanimire Tomov, Tingxing Dong, Jack Dongarra
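For readers unfamiliar with the kernel named in this abstract, here is a minimal sketch of standard Cholesky QR (CholQR) in plain Python: form the Gram matrix, take its Cholesky factor R, then recover Q by a triangular solve. The function name `cholqr` and the test matrix are illustrative assumptions. Everything is computed in ordinary double precision, so this is the baseline algorithm and not the authors' mixed-precision variant; the abstract does not say which intermediate results are promoted to double-double.

```python
import math


def cholqr(A):
    """Standard CholQR of a tall-skinny matrix A (list of rows): A = Q R."""
    m, n = len(A), len(A[0])

    # Step 1: Gram matrix G = A^T A (one reduction over the long dimension).
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]

    # Step 2: upper-triangular Cholesky factor R with G = R^T R.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))
            if i == j:
                if s <= 0.0:
                    # Happens when A is (numerically) rank deficient.
                    raise ValueError("Gram matrix is not positive definite")
                R[i][i] = math.sqrt(s)
            else:
                R[i][j] = s / R[i][i]

    # Step 3: Q = A R^{-1}, solved row by row with a triangular substitution.
    Q = [[0.0] * n for _ in range(m)]
    for r in range(m):
        for j in range(n):
            s = A[r][j] - sum(Q[r][k] * R[k][j] for k in range(j))
            Q[r][j] = s / R[j][j]
    return Q, R


# Illustrative 3x2 input (hypothetical data).
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Q, R = cholqr(A)
```

Because the Gram matrix squares the condition number of the input, this baseline loses orthogonality for ill-conditioned matrices, which is the stability issue the abstract sets out to address.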
Heterogenous Acceleration for Linear Algebra in Multi-coprocessor Environments
Abstract
We present an efficient and scalable programming model for the development of linear algebra in heterogeneous multi-coprocessor environments. The model incorporates some of the current best design and implementation practices for the heterogeneous acceleration of dense linear algebra (DLA). Examples are given for the factorizations that underlie algorithms for solving linear systems: the LU, QR, and Cholesky factorizations. To generate the extreme level of parallelism needed for the efficient use of coprocessors, algorithms of interest are redesigned and then split into well-chosen computational tasks. Task execution is scheduled over the computational components of a hybrid system of multi-core CPUs and coprocessors using a light-weight runtime system. The use of light-weight runtime systems keeps scheduling overhead low, while enabling the expression of parallelism through otherwise sequential code. This simplifies the development effort and allows the exploration of the unique strengths of the various hardware components.
Azzam Haidar, Piotr Luszczek, Stanimire Tomov, Jack Dongarra
A Study of SpMV Implementation Using MPI and OpenMP on Intel Many-Core Architecture
Abstract
The Sparse Matrix-Vector Multiplication (SpMV) is fundamental to a broad spectrum of scientific and engineering applications, such as many iterative numerical methods. The widely used Compressed Sparse Row (CSR) sparse matrix storage format was chosen to carry out this study for sustainability and reusability reasons.
We parallelized a vectorized SpMV kernel for the Intel Many Integrated Core (MIC) architecture using MPI and OpenMP, in both pure and hybrid versions. In comparison to the pure models and vendor-supplied BLAS libraries across different mainstream architectures (CPU, GPU), the hybrid model exhibits a substantial improvement.
To further assess the behavior of the hybrid model, we attribute its performance shortfalls to the vectorization rate, the irregularity of non-zeros, and load balancing issues. A mathematical relationship between the first two factors and the performance is then proposed based on the experimental data.
Fan Ye, Christophe Calvin, Serge G. Petiton
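As a reference point for the CSR format mentioned in this abstract, the following is a minimal serial sketch of a CSR sparse matrix-vector product in plain Python. The names `csr_spmv`, `values`, `col_idx` and `row_ptr` are conventional but are assumptions here, as is the 3x3 example matrix; the paper's vectorized MPI/OpenMP kernel is not reproduced.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Return y = A @ x for a matrix A stored in Compressed Sparse Row form.

    values  : non-zero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i owns entries row_ptr[i] .. row_ptr[i + 1] - 1
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # The indirect access x[col_idx[k]] is what makes SpMV irregular.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical example matrix:
#   [[4, 0, 1],
#    [0, 3, 0],
#    [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Rows are independent, so the outer loop is the natural unit for thread or process parallelism; rows with very different non-zero counts are one source of the load imbalance the abstract mentions.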
SIMD Implementation of a Multiplicative Schwarz Smoother for a Multigrid Poisson Solver on an Intel Xeon Phi Coprocessor
Abstract
In this paper, we discuss an efficient implementation of the three-dimensional multigrid Poisson solver on a many-core coprocessor, Intel Xeon Phi. We have used the modified block red-black (mBRB) Gauss-Seidel (GS) smoother to achieve a sufficient degree of parallelism and a high cache hit ratio. We have vectorized (SIMDized) the GS steps in the smoother by introducing a partial SIMDization technique based on loop splitting. Our numerical tests demonstrate that our implementation performs 35.5 % better than the conventional mBRB-GS smoother implementation on Xeon Phi.
Masatoshi Kawai, Takeshi Iwashita, Hiroshi Nakashima
Performance Optimization of the 3D FDM Simulation of Seismic Wave Propagation on the Intel Xeon Phi Coprocessor Using the ppOpen-APPL/FDM Library
Abstract
We evaluate the performance of a parallel 3D finite-difference method (FDM) simulation of seismic wave propagation using the Intel Xeon Phi coprocessor. Since a continued decrease in the byte/flop ratio of future machines is forecast, program optimization with a decreased byte/flop ratio was applied by fusing the original major kernel and omitting the storing and loading of intermediate variables. We confirm that 1) MPI/OpenMP hybrid parallel computing with hyper-threading is more efficient than pure MPI parallel computing and 2) the FDM simulation with a splitting of triple DO loops is 1.3 times faster than the modified code with triple DO loops, while no performance acceleration is achieved with a fused double DO-loop calculation. We consider that loop distribution optimization is effective for prefetching and for the thread parallelization of each loop through its use and reuse of cached data.
Futoshi Mori, Masaharu Matsumoto, Takashi Furumura

Large-Scale Applications

Frontmatter
Machine-Learning-Based Load Balancing for Community Ice Code Component in CESM
Abstract
Load balancing scientific codes on massively parallel architectures is becoming an increasingly challenging task. In this paper, we focus on the Community Earth System Model, a widely used climate modeling code. It comprises six components, each of which exhibits different scalability patterns. Previously, an analytical performance model has been used to find optimal load-balancing parameter configurations for each component. Nevertheless, for the Community Ice Code component, the analytical performance model is too restrictive to capture its scalability patterns. We therefore developed a machine-learning-based load-balancing algorithm. It involves fitting a surrogate model to a small number of load-balancing configurations and their corresponding runtimes. This model is then used to find high-quality parameter configurations. Compared with the current practice of expert-knowledge-based enumeration over feasible configurations, the machine-learning-based load-balancing algorithm requires six times fewer evaluations to find the optimal configuration.
Prasanna Balaprakash, Yuri Alexeev, Sheri A. Mickelson, Sven Leyffer, Robert Jacob, Anthony Craig
Domain Decomposition for Heterojunction Problems in Semiconductors
Abstract
We present a domain decomposition approach for the simulation of charge transport in heterojunction semiconductors. The problem is characterized by a large variation of primary variables across an interface region of a size much smaller than the device scale, and requires a multiscale approach in which that region is modeled as an internal boundary. The model combines drift diffusion equations on subdomains coupled by a thermionic emission heterojunction model on the interface, which involves a nonhomogeneous jump computed at fine scale with Density Functional Theory. Our full domain decomposition approach extends our previous work for the potential equation only, and we present perspectives on its HPC implementation. The model can be used, e.g., for the design of higher efficiency solar cells for which experimental results are not available. More generally, our algorithm is naturally parallelizable and is a new domain decomposition paradigm for problems with multiscale phenomena associated with internal interfaces and/or boundary layers.
Timothy Costa, David Foster, Malgorzata Peszynska
A Hybrid Approach for Parallel Transistor-Level Full-Chip Circuit Simulation
Abstract
The computer-aided design (CAD) applications that are fundamental to the electronic design automation industry need to harness the available hardware resources to be able to perform full-chip simulation for modern technology nodes (45 nm and below). We will present a hybrid (MPI+threads) approach for parallel transistor-level transient circuit simulation that achieves scalable performance for some challenging large-scale integrated circuits. This approach focuses on the computationally expensive part of the simulator: the linear system solve. Hybrid versions of two iterative linear solver strategies are presented: one takes advantage of block triangular form structure, while the other uses a Schur complement technique. Results indicate up to a 27x improvement in total simulation time on 256 cores.
Heidi K. Thornquist, Sivasankaran Rajamanickam

Numerical Algorithms

Frontmatter
Self-adaptive Multiprecision Preconditioners on Multicore and Manycore Architectures
Abstract
Based on the premise that preconditioners needed for scientific computing are not only required to be robust in the numerical sense, but also scalable for up to thousands of light-weight cores, we argue that this twofold goal is achieved for the recently developed self-adaptive multi-elimination preconditioner. For this purpose, we revise the underlying idea and analyze the performance of implementations realized in the PARALUTION and MAGMA open-source software libraries on GPU architectures (using either CUDA or OpenCL), Intel’s Many Integrated Core Architecture, and Intel’s Sandy Bridge processor. The comparison with other well-established preconditioners like multi-colored Gauss-Seidel, ILU(0) and multi-colored ILU(0) shows that the twofold goal of a numerically stable cross-platform performant algorithm is achieved.
Hartwig Anzt, Dimitar Lukarski, Stanimire Tomov, Jack Dongarra
Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study
Abstract
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
Ziming Zheng, Andrew A. Chien, Keita Teranishi

Direct/Hybrid Methods for Solving Sparse Matrices

Frontmatter
Using Random Butterfly Transformations to Avoid Pivoting in Sparse Direct Methods
Abstract
We consider the solution of sparse linear systems using direct methods via LU factorization. Unless the matrix is positive definite, numerical pivoting is usually needed to ensure stability, which is costly to implement especially in the sparse case. The Random Butterfly Transformations (RBT) technique provides an alternative to pivoting and is easily parallelizable. The RBT transforms the original matrix into another one that can be factorized without pivoting with probability one. This approach has been successful for dense matrices; in this work, we investigate the sparse case. In particular, we address the issue of fill-in in the transformed system.
Marc Baboulin, Xiaoye S. Li, François-Henry Rouet
Hybrid Sparse Linear Solutions with Substituted Factorization
Abstract
We develop a computationally less expensive alternative to the direct solution of a large sparse symmetric positive definite system arising from the numerical solution of elliptic partial differential equation models. Our method, substituted factorization, replaces the computationally expensive factorization of certain dense submatrices that arise in the course of direct solution by sparse Cholesky factorization with one or more solutions of triangular systems using substitution. These substitutions fit into the tree structure commonly used by parallel sparse Cholesky, and reduce the initial factorization cost at the expense of a slightly increased cost in solving for a right-hand side vector. Our analysis shows that substituted factorization reduces the number of floating-point operations for the model \(k \times k\) 5-point finite-difference problem by \(10\,\%\) and empirical tests show an average execution time reduction of \(24.4\,\%\). On a test suite of three-dimensional problems we observe execution time reductions as high as \(51.7\,\%\) and of \(43.1\,\%\) on average.
Joshua Dennis Booth, Padma Raghavan
Modeling 1D Distributed-Memory Dense Kernels for an Asynchronous Multifrontal Sparse Solver
Abstract
To solve sparse systems of linear equations, multifrontal methods rely on dense partial \(LU\) decompositions of so-called frontal matrices; we consider a parallel asynchronous setting in which several frontal matrices can be factored simultaneously. In this context, to address performance and scalability issues of acyclic pipelined asynchronous factorization kernels, we study models to revisit properties of left and right-looking variants of partial \(LU\) decompositions, study the use of several levels of blocking, before focusing on communication issues. The general purpose sparse solver MUMPS has been modified to implement the proposed algorithms and confirm the properties demonstrated by the models.
Patrick R. Amestoy, Jean-Yves L’Excellent, François-Henry Rouet, Wissam M. Sid-Lakhdar

Performance Tuning

Frontmatter
Performance Characteristics of HYDRA – A Multi-physics Simulation Code from LLNL
Abstract
HYDRA simulates a variety of experiments carried out at the National Ignition Facility and other high energy density physics facilities. It has packages to simulate radiation transfer, atomic physics, hydrodynamics, laser propagation, and a number of other physics effects. HYDRA has over one million lines of code, includes MPI and thread-level (OpenMP and pthreads) parallelism, has run on a variety of platforms for two decades, and is undergoing active development.
In this paper, we demonstrate that HYDRA’s thread-based load balancing approach is very effective. Hardware counters from IBM Blue Gene/Q runs show that none of HYDRA’s packages are memory bandwidth limited, a few come close to the maximum integer instruction issue rate, and all are well below the maximum floating point issue rate.
Steven H. Langer, Ian Karlin, Michael M. Marinak
Accelerating Computation of Eigenvectors in the Dense Nonsymmetric Eigenvalue Problem
Abstract
In the dense nonsymmetric eigenvalue problem, work has focused on the Hessenberg reduction and QR iteration, using efficient algorithms and fast, Level 3 BLAS. Comparatively, computation of eigenvectors performs poorly, limited to slow, Level 2 BLAS performance with little speedup on multi-core systems. It has thus become a dominant cost in the solution of the eigenvalue problem. To address this, we present improvements for the eigenvector computation to use Level 3 BLAS and parallelize the triangular solves, achieving good parallel scaling and accelerating the overall eigenvalue problem more than three-fold.
Mark Gates, Azzam Haidar, Jack Dongarra
Low Byte/Flop Implementation of Iterative Solver for Sparse Matrices Derived from Stencil Computations
Abstract
Practical simulators require high-performance iterative methods and efficient boundary conditions, especially in the field of computational fluid dynamics. In this paper, we propose a novel bit-representation technique to enhance the performance of such simulators. The technique is applied to an iterative kernel implementation that treats various boundary conditions in a stencil computation on a structured grid system. This approach reduces traffic from the main memory to CPU, and effectively utilizes Single Instruction–Multiple Data (SIMD) stream units with cache because of the bit-representation and compression of matrix elements. The proposed implementation also replaces if-branch statements with mask operations using the bit expression. This promotes the optimization of code during compilation and runtime. To evaluate the performance of the proposed implementation, we employ the Red–Black SOR and BiCGstab algorithms. Experimental results show that the proposed approach is up to 3.5 times faster than a naïve implementation on both the Intel and Fujitsu Sparc architectures.
Kenji Ono, Shuichi Chiba, Shunsuke Inoue, Kazuo Minami
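The idea of replacing if-branches with bit-mask operations, mentioned in this abstract, can be illustrated with a small sketch. It assumes a hypothetical encoding that is not taken from the paper: a 1D grid, one bit per cell (1 = update the cell, 0 = keep its fixed boundary value), and a Jacobi-style averaging sweep in place of the Red-Black SOR and BiCGstab kernels the authors evaluate. All function names are illustrative.

```python
def pack_update_mask(fixed):
    """Pack per-cell flags into one integer: bit i is 1 if cell i is updated."""
    bits = 0
    for i, is_fixed in enumerate(fixed):
        if not is_fixed:
            bits |= 1 << i
    return bits


def sweep_branch(u, fixed):
    """Reference sweep: an if-branch skips cells that hold a fixed value."""
    new = u[:]
    for i in range(1, len(u) - 1):
        if not fixed[i]:
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def sweep_masked(u, bits):
    """Branch-free sweep: the mask bit selects between update and old value."""
    new = u[:]
    for i in range(1, len(u) - 1):
        m = (bits >> i) & 1  # 1 = interior cell, 0 = fixed cell
        avg = 0.5 * (u[i - 1] + u[i + 1])
        new[i] = m * avg + (1 - m) * u[i]
    return new


# Hypothetical grid with fixed values at both ends and in the middle.
u = [1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 2.0]
fixed = [True, False, False, True, False, False, True]
bits = pack_update_mask(fixed)
a = sweep_branch(u, fixed)
b = sweep_masked(u, bits)
```

In interpreted Python the masked form brings no speedup; the point is only that the loop body becomes uniform arithmetic, which is the property a compiler needs in order to vectorize it, and that the flags shrink to one bit per cell.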

The Ninth International Workshop on Automatic Performance Tuning

Frontmatter
Environment-Sensitive Performance Tuning for Distributed Service Orchestration
Abstract
Modern distributed systems are designed to tolerate unreliable environments, i.e., they aim to provide services even when some failures happen in the underlying hardware or network. However, unreliable environments can have a significant impact on the performance of distributed systems, which should be considered when deploying the services. In this paper, we present an approach to optimize the performance of distributed systems in unreliable deployment environments by searching for optimal configuration parameters. To simulate an unreliable environment, we inject several failures into the environment of a service application, such as a node crash in the cluster, network failures between nodes, resource contention in nodes, etc. Then, we use a search algorithm to find the optimal parameters automatically in the user-selected parameter space, under the unreliable environment we created. We have implemented our approach in a testing-based framework and applied it to several well-known distributed service systems.
Yu Lin, Franjo Ivančić, Pallavi Joshi, Gogul Balakrishnan, Malay Ganai, Aarti Gupta
Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications
Abstract
The performance optimization of scientific applications usually requires an in-depth knowledge of the hardware and software. A performance tuning mechanism is suggested to automatically tune OpenACC parameters to adapt to the execution environment on a given system. A historic-learning-based methodology is suggested to prune the parameter search space for a more efficient auto-tuning process. This approach is applied to tune the OpenACC gang and vector clauses for a better mapping of the compute kernels onto the underlying architecture. Our experiments show a significant performance improvement over the default compiler parameters and a drastic reduction in tuning time compared to a brute-force search-based approach.
Shahzeb Siddiqui, Fatemah AlZayer, Saber Feki
Capturing the Expert: Generating Fast Matrix-Multiply Kernels with Spiral
Abstract
Matrix-Matrix Multiplication (MMM) is a fundamental operation in scientific computing. Achieving the floating point peak with this operation requires expert knowledge of linear algebra and computer architecture to craft a tuned implementation for a given microarchitecture. To do this, an expert follows a mechanical process for implementing MMM, by deriving an algorithm from models found in the literature. Then, the expert applies optimizations which are well suited for the target architecture. Lastly, the expert expresses that implementation in assembly code. In this paper, we argue that this process is mechanical and can be captured in a rule-based program generation system such as Spiral. We then show that given this machinery, Spiral can produce code for large-size MMM implementations that is competitive with hand-tuned code.
Richard Veras, Franz Franchetti
A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels
Abstract
It is universally known that caching is critical to attain high-performance implementations: In many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating point operations. In this paper, we show evidence that at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction.
Elmar Peise, Paolo Bientinesi
Toward Restarting Strategies Tuning for a Krylov Eigenvalue Solver
Abstract
Krylov eigensolvers are used in many scientific fields, such as nuclear physics, page ranking, oil and gas exploration, etc. In this paper, we focus on the ERAM Krylov eigensolver whose convergence is strongly correlated to the Krylov subspace size and the restarting vector \(v_0\), a unit norm vector. We focus on computing the restarting vector \(v_0\) to accelerate the ERAM convergence. First, we study different restarting strategies and compare their efficiency. Then, we mix these restarting strategies and show the considerable ERAM convergence improvement. Mixing the restarting strategies optimizes the “numerical efficiency” versus “execution time” ratio as we introduce neither additional computation nor communication.
France Boillod-Cerneux, Serge G. Petiton, Christophe Calvin, Leroy A. Drummond
Performance Analysis of the Householder-Type Parallel Tall-Skinny QR Factorizations Toward Automatic Algorithm Selection
Abstract
We consider computing tall-skinny QR factorizations on a large-scale parallel machine. We present a realistic performance model and analyze the difference in parallel execution time between Householder QR and TSQR. Our analysis indicates that TSQR can become slower than Householder QR as the number of columns of the target matrix increases. We aim to estimate the difference and select the faster algorithm by using models, which falls under auto-tuning. Numerical experiments on the K computer support our analysis and show that we succeed in determining the faster algorithm.
Takeshi Fukaya, Toshiyuki Imamura, Yusaku Yamamoto
Automatic Parameter Tuning of Three-Dimensional Tiled FDTD Kernel
Abstract
This paper introduces an automatic tuning method for the tiling parameters required in an implementation of the three-dimensional FDTD method based on time-space tiling. In this tuning process, an appropriate range for the tile size is first determined by trial experiments using cubic tiles. The tile shape is then optimized by using the Monte Carlo method. The tiled FDTD kernel was multi-threaded and its performance with the tuned parameters was evaluated on multi-core processors. When compared with a naively implemented kernel, the performance of the tuned FDTD kernel was improved by more than a factor of two.
Takeshi Minami, Motoharu Hibino, Tasuku Hiraishi, Takeshi Iwashita, Hiroshi Nakashima
Automatic Parameter Tuning of Hierarchical Incremental Checkpointing
Abstract
As future HPC systems become larger, the failure rates and the cost of checkpointing to the global file system are expected to increase. Hierarchical incremental CPR is a promising approach to solve this problem. It utilizes a hierarchical storage system of local and global storage and performs incremental checkpointing by writing only the memory pages updated between two consecutive checkpoints. In this paper, we respond to an open question: how to optimize the checkpoint interval when the checkpoint overheads change with time, as in hierarchical incremental CPR. We propose a runtime checkpoint interval autotuning technique to optimize the efficiency of hierarchical incremental CPR. Evaluation results show that the efficiency can be significantly increased if the storage hierarchy can be exploited with appropriate checkpoint intervals.
Alfian Amrizal, Shoichi Hirasawa, Hiroyuki Takizawa, Hiroaki Kobayashi
Backmatter
Metadata
Title
High Performance Computing for Computational Science -- VECPAR 2014
Editors
Michel Daydé
Osni Marques
Kengo Nakajima
Copyright Year
2015
Electronic ISBN
978-3-319-17353-5
Print ISBN
978-3-319-17352-8
DOI
https://doi.org/10.1007/978-3-319-17353-5