
2013 | Book

GPU Solutions to Multi-scale Problems in Science and Engineering

Edited by: David A. Yuen, Long Wang, Xuebin Chi, Lennart Johnsson, Wei Ge, Yaolin Shi

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Earth System Sciences


About this book

This book covers the new topic of GPU computing with many applications, drawn from diverse fields such as networking, seismology, fluid mechanics, nano-materials, data mining, earthquakes, mantle convection, and visualization. It will show the public why GPU computing is important and easy to use. It will offer reasons why GPU computing is useful and how to implement codes in everyday situations.

Table of Contents

Frontmatter

Introductory Material

Frontmatter
Chapter 1. Why Do Scientists and Engineers Need GPU’s Today?

Recently, a paradigm shift in computer architecture has offered computational science the prospect of a vast increase in capability at relatively little cost. The tremendous computational power of graphics processors (GPUs) provides a great opportunity for those willing to rethink algorithms and rewrite existing simulation codes. In this introduction, we give a brief survey of GPU computing and its potential capabilities, intended for the general scientific and engineering audience. We also review some challenges facing users in adapting the large toolbox of scientific computing to these changes in computer architecture, and what the community can expect in the near future.

Matthew G. Knepley, David A. Yuen
Chapter 2. Happenings at the GPU Conference

The pages in this chapter should convey the lively, convivial atmosphere of the GPU conference in the pleasant surroundings of Harbin, in China’s Dongbei region.

Xian-yu Lang, Long Wang, David A. Yuen

Hardware and Installations

Frontmatter
Chapter 3. Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies

During the last few years, the convergence in architecture for High-Performance Computing systems that took place for over a decade has been replaced by a divergence. The divergence is driven by the quest for performance, cost-performance, and, in the last few years, energy consumption, which over the lifetime of a system has in many cases come to exceed the cost of the HPC system itself. Mass-market specialized processors, such as the Cell Broadband Engine (CBE) and graphics processors, have received particular attention, the latter especially after hardware support for double-precision floating-point arithmetic was introduced about three years ago. The recent support of Error Correcting Code (ECC) memory and significantly enhanced performance for double-precision arithmetic in the current generation of Graphics Processing Units (GPUs) have further solidified the interest in GPUs for HPC. In order to assess the issues involved in potentially deploying clusters whose nodes consist of commodity microprocessors with some type of specialized processor for enhanced performance, enhanced energy efficiency, or both for science and engineering workloads, PRACE, the Partnership for Advanced Computing in Europe, undertook a study that covered three types of accelerators, the CBE, GPUs and ClearSpeed, and tools for their programming. The study focused on assessing performance, efficiency, power efficiency for double-precision arithmetic, and programmer productivity. Four kernels (matrix multiplication, sparse matrix-vector multiplication, FFT, and random number generation) were used for the assessment, together with High-Performance Linpack (HPL) and a few application codes. We report here on the results from the kernels and HPL for GPU- and ClearSpeed-accelerated systems. The GPU performed significantly better than the CPU on sparse matrix-vector multiplication, on which the ClearSpeed performed surprisingly poorly. For matrix multiplication, HPL and FFT, the ClearSpeed accelerator was by far the most energy-efficient device.

Lennart Johnsson
Chapter 4. GRAPE and GRAPE-DR

We describe the architecture and performance of GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction). It operates as an accelerator attached to general-purpose PCs or x86-based servers. The processor chip of a GRAPE-DR board has 512 cores operating at a clock frequency of 400 MHz. The peak speed of a processor chip is 410 Gflops (single precision) or 205 Gflops (double precision). A GRAPE-DR board consists of four GRAPE-DR chips, each with its own local memory of 256 MB. Thus, a GRAPE-DR board has a theoretical peak speed of 1.64 SP and 0.82 DP Tflops. Its power consumption is around 200 W. The application area of GRAPE-DR covers particle-based simulations such as astrophysical many-body simulations and molecular-dynamics simulations, quantum chemistry calculations, various applications that require dense matrix operations, and many other compute-intensive applications. The architecture of GRAPE-DR is in many ways similar to those of modern GPUs, since their evolutionary tracks are rather similar. GPUs have evolved from specialized hardwired logic for specific operations to a more general-purpose computing engine, in order to perform complex shading and other operations. The predecessor of GRAPE-DR is GRAPE (GRAvity PipE), a specialized pipeline processor for gravitational N-body simulations. We have changed the architecture to extend the range of applications. There are two main differences between GRAPE-DR and GPGPU. One is transistor and power efficiency. With 90 nm technology and 400M transistors, we have integrated 512 processor cores and achieved a speed of 400 Gflops at a 400 MHz clock and 50 W. A Fermi processor from NVIDIA integrates 448 processors with 3B transistors and achieves a speed of 1.03 Tflops at 1.15 GHz and 247 W. Thus, Fermi achieves 2.5 times higher speed than GRAPE-DR, with a 2.9 times higher clock, 8 times more transistors, and 5 times more power consumption. The other is external memory bandwidth. GPUs typically have a memory bandwidth of around 100 GB/s, while our GRAPE-DR card, with 4 chips, has only 16 GB/s. Thus, the range of applications is somewhat limited, but for suitable applications the performance and performance per watt of GRAPE-DR are quite good. The single-card performance on the HPL benchmark is 480 Gflops for a matrix size of 48 k, and 37 Tflops for 81 cards.

Junichiro Makino

Software libraries

Chapter 5. Parray: A Unifying Array Representation for Heterogeneous Parallelism

This paper introduces a programming interface called Parray (or Parallelizing ARRAYs) that supports system-level succinct programming for heterogeneous parallel systems like GPU clusters. The current practice of software development requires combining several low-level libraries like Pthread, OpenMP, CUDA and MPI. Achieving productivity and portability is hard with different numbers and models of GPUs. Parray extends mainstream C programming with novel array types with the following features: (1) the dimensions of an array type are nested in a tree structure, conceptually reflecting the memory hierarchy; (2) the definition of an array type may contain references to other array types, allowing sophisticated array types to be created for parallelization; (3) threads also form arrays that allow programming in a Single-Program-Multiple-Codeblock (SPMC) style to unify various sophisticated communication patterns. This leads to shorter, more portable and maintainable parallel codes, while the programmer still has control over the performance-related features necessary for deep manual optimization. Although the source-to-source code generator only faithfully generates low-level library calls according to the type information, higher-level programming and automatic performance optimization are still possible through building libraries of sub-programs on top of Parray. The case study on cluster FFT illustrates a simple 30-line code that outperforms Intel Cluster MKL by 2× on the Tianhe-1A system with 7168 Fermi GPUs and 14336 CPUs.

Yifeng Chen, Xiang Cui, Hong Mei
Chapter 6. Practical Random Linear Network Coding on GPUs

Recently, random linear network coding has been widely applied in peer-to-peer network applications. Instead of sharing the raw data with each other, peers in the network produce and send encoded data to each other. As a result, the communication protocols have been greatly simplified, and the applications experience higher end-to-end throughput and better robustness to network churn. Since it is difficult to verify the integrity of the encoded data, such systems can suffer from the famous pollution attack, in which a malicious node can send bad encoded blocks that consist of bogus data. Consequently, the bogus data will be propagated into the whole network at an exponential rate. Homomorphic hash functions (HHFs) have been designed to defend systems from such pollution attacks, but with a new challenge: HHFs require that network coding be performed in GF(q), where q is a very large prime number. This greatly increases the computational cost of network coding, in addition to the already computationally expensive HHFs. This chapter exploits the huge computing power of Graphics Processing Units (GPUs) to reduce the computational cost of network coding and homomorphic hashing. With our network coding and HHF implementation on the GPU, we observed significant computational speedup in comparison with the best CPU implementation. This implementation can lead to a practical solution for defending against pollution attacks in distributed systems.
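
As a rough illustration of the coding step described above, here is a minimal CUDA sketch of producing one encoded block as a linear combination of source blocks over GF(q). All names are illustrative, and a 32-bit q is assumed so a 64-bit accumulator keeps products exact; HHF-compatible schemes use far larger primes and multiword arithmetic.

```cuda
#include <cstdint>

// Each thread computes one symbol of the encoded block as a random
// linear combination of the source blocks, reduced mod the prime q.
__global__ void encode_gf_q(const uint32_t* __restrict__ src,  // n_blocks x block_len symbols
                            const uint32_t* __restrict__ coef, // n_blocks random coefficients
                            uint32_t* __restrict__ out,        // block_len encoded symbols
                            int n_blocks, int block_len, uint32_t q)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= block_len) return;
    uint64_t acc = 0;
    for (int b = 0; b < n_blocks; ++b)
        // 32x32 -> 64-bit product keeps the modular reduction exact.
        acc = (acc + (uint64_t)coef[b] * src[b * block_len + i]) % q;
    out[i] = (uint32_t)acc;
}
```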

Xiaowen Chu, Kaiyong Zhao
Chapter 7. Preliminary Implementation of PETSc Using GPUs

PETSc is a scalable solver library for the solution of algebraic equations arising from the discretization of partial differential equations and related problems. PETSc is organized as a class library with classes for vectors, matrices, Krylov methods, preconditioners, nonlinear solvers, and differential equation integrators. A new subclass of the vector class has been introduced that performs its operations on NVIDIA GPU processors. In addition, a new sparse matrix subclass that performs matrix-vector products on the GPU was introduced. The Krylov methods, nonlinear solvers, and integrators in PETSc run unchanged in parallel using these new subclasses. These can be used transparently from existing PETSc application codes in C, C++, Fortran, or Python. The implementation is done with the Thrust and Cusp C++ packages from NVIDIA.

Victor Minden, Barry Smith, Matthew G. Knepley

Industrial Applications

Frontmatter
Chapter 8. Multi-scale Continuum-Particle Simulation on CPU–GPU Hybrid Supercomputer

This chapter serves as an introduction to the supercomputing work carried out at CAS-IPE following the strategy of structural consistency among the physics in the simulated systems, the mathematical model, the computational software expressing the numerical methods and algorithms, and finally the architecture of the computer hardware (Li et al., From multiscale modeling to Meso-science—a chemical engineering perspective, 2012; Li et al., Meso-scale phenomena from compromise—a common challenge, not only for chemical engineering, 2009; Ge et al., Chem Eng Sci 66:4426–4458, 2011). Multi-scale simulation of gas-solid flow in continuum-discrete approaches and molecular dynamics simulation of crystalline silicon are taken as examples, both making full use of CPU-GPU hybrid supercomputers. This strategy is demonstrated to be effective and critical for achieving good scalability and efficiency in such simulations. The software and hardware systems thus designed have found wide applications in process engineering.

Wei Ge, Ji Xu, Qingang Xiong, Xiaowei Wang, Feiguo Chen, Limin Wang, Chaofeng Hou, Ming Xu, Jinghai Li
Chapter 9. GPU Best Practices for HPC Applications at Industry Scale

Current trends in high performance computing (HPC) are moving towards the availability of several cores on the same chip of contemporary processors in order to achieve speed-up through exploiting the potential of fine-grain parallelism in applications. The trend is led by graphics processing units (GPUs) which have recently been developed exclusively for computational tasks as massively-parallel co-processors to conventional x86 CPUs. Since the introduction in 2006 of the NVIDIA Tesla GPU and CUDA programming environment, the HPC community has achieved noted performance gains across a broad range of application software. In particular, various scientific research disciplines within computational physics and chemistry have reported performance levels as high as two orders of magnitude over current quad-core CPUs. During 2010 an extensive set of new HPC architectural features were offered in the third generation Tesla and CUDA (codenamed Fermi), giving engineering disciplines a similar opportunity to expand use of GPUs for applications relevant to industry modeling and simulation. Similar to the scientific research community, practical applications in industry observe constant growth in model fidelity, but parallel efficiency of commercial software and job completion times also become important factors behind decisions on model size and scale, and level of physics features to include. This work examines algorithmic development best practices, and performance results of application software for the Tesla Fermi architecture in modelling and simulation examples relevant to industry-scale HPC practice. Included are GPU implementations of computational structural mechanics (CSM) and computational fluid dynamics (CFD) software that support mechanical product design in manufacturing industries. Specifically, the critical requirements of memory optimization and storage formats are discussed for grid-based direct solvers that appear in CSM and for highly irregular sparse matrices that require iterative solver schemes in CFD.

Peng Wang, Stan Posey
Chapter 10. Simulation of 1D Condensing Flows with CESE Method on GPU Cluster

We realized the space-time Conservation Element and Solution Element (CESE) method on the GPU and applied it to a condensation problem in a 1D infinite-length shock tube. In the present work, the CESE method was implemented successfully on a 9800GT graphics card with an overlapping scheme. The condensation problem in the 1D infinite shock tube was then investigated using this scheme. The speedup of the condensation problem with the overlapping scheme is 71× (9800GT versus E7300). The influence of different meshes on the asymptotic solution in an infinite shock tube with condensation was studied using a single GPU and a GPU cluster. It is found that the asymptotic solution is trustworthy and mesh-insensitive when the grid is fine enough to resolve the condensation process. It is worth mentioning that peak computing performance reaches 0.88 TFLOPS when a GPU cluster with 8 GPUs is employed.

Wei Ran, Wan Cheng, Fenghua Qin, Xisheng Luo
Chapter 11. Two-Way Coupled Sprays and Liquid Surface: A GPU-Based Multi-Scale Fluid Animation Method

GPU-based fluid animation is a hot topic in many applications such as films, cartoons and games. As flow phenomena contain highly complex behaviors and rich visual details, it is necessary to explore the intrinsic multi-scale property in fluid animation. In this paper, we present a multi-scale fluid animation method on the GPU. Our method is designed to animate fluid details at grid and sub-grid scale with high efficiency. In our method, the motion of the liquid surface is obtained by solving the Navier-Stokes equations and the Level Set equation, while the dynamics of fluid sprays are governed by an SPH solver. The interaction between the liquid surface and the sprays is modeled by a two-way coupling algorithm which can be executed efficiently on the GPU. From the results of the experiments, we conclude that the proposed GPU-based acceleration method can significantly improve the processing speed of multi-scale fluid animation while capturing interesting details.

Guijuan Zhang, Gaojin Wen, Shengzhong Feng
Chapter 12. High Performance Implementation of Binomial Option Pricing Using CUDA

The binomial tree model is often used for option pricing in the financial market. With this method, it is rather expensive to obtain highly accurate option prices. Although existing methods running on CPU clusters have improved efficiency significantly, there is still a great gap between the real performance and the desired one. In this paper, we parallelize this model on CUDA to further improve efficiency. We optimize our method according to the principles of the memory hierarchy and extend it to support multiple GPUs. Experiments on a single Tesla C1060 GPU chip show an average of 285× speedup compared to the result on a single CPU node. Furthermore, for a data size of 64 K, GPU performance reaches 315 Gflops, which outperforms an earlier version on the Sun station by a factor of about 100×. The maximum performance reached with 108 GPU nodes is 30 Tflops.
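
To make the parallelization pattern concrete, here is a hedged CUDA sketch of binomial backward induction for a European call, one option per thread block, with tree levels double-buffered in shared memory. The names and the flat per-step rate are illustrative, not the chapter's actual kernel.

```cuda
#include <cmath>

// One option per block; v holds two (steps+1)-entry buffers so one level
// can be read while the next is written. r is the per-step risk-free rate
// (r*dt folded in); u and d are the up/down move factors.
__global__ void binomial_call(const float* S0, const float* K, float r,
                              float u, float d, int steps, float* price)
{
    extern __shared__ float v[];            // size: 2 * (steps + 1) floats
    float *cur = v, *nxt = v + steps + 1;
    int opt = blockIdx.x;
    float p    = (expf(r) - d) / (u - d);   // risk-neutral up probability
    float disc = expf(-r);                  // one-step discount factor

    for (int i = threadIdx.x; i <= steps; i += blockDim.x) {
        float s = S0[opt] * powf(u, (float)i) * powf(d, (float)(steps - i));
        cur[i] = fmaxf(s - K[opt], 0.0f);   // terminal payoff at leaf i
    }
    __syncthreads();

    for (int n = steps - 1; n >= 0; --n) {  // roll back one level at a time
        for (int i = threadIdx.x; i <= n; i += blockDim.x)
            nxt[i] = disc * (p * cur[i + 1] + (1.0f - p) * cur[i]);
        __syncthreads();
        float* t = cur; cur = nxt; nxt = t; // swap read/write buffers
    }
    if (threadIdx.x == 0) price[opt] = cur[0];
}
// Launch: binomial_call<<<nOptions, 128, 2*(steps+1)*sizeof(float)>>>(...);
```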

Yechen Gui, Shenzhong Feng, Gaojin Wen, Guijuan Zhang, Yanyi Wan, Tao Liu
Chapter 13. Research of Acceleration MS-Alignment Identifying Post-Translational Modifications on GPU

MS-Alignment is an unrestrictive post-translational modification (PTM) search algorithm with the advantage of searching for all types of PTMs at once in a blind mode. However, it is time-consuming and thus cannot readily meet the challenge of large-scale protein databases and spectra. We use the Graphics Processing Unit (GPU) to accelerate MS-Alignment and reduce identification time. The work mainly includes two parts. (1) The step of Database search and Candidate generation (DC) consumes most of the time in MS-Alignment. We propose an algorithm for DC on the GPU based on CUDA (DCGPU), with data parallelism obtained by partitioning protein sequences. We adopt several methods to optimize the DCGPU implementation. (2) For further acceleration, we propose an algorithm for MS-Alignment on a GPU cluster based on MPI and CUDA (MC_MS-A). Comparison experiments show that the average speedup ratio is above 26 in the model of at most one modification and above 41 in the model of at most two modifications. The experimental results show that MC_MS-A on a GPU cluster can reduce the time for identifying 31173 spectra from a predicted 2.853 months to 0.606 h. Accelerating MS-Alignment on the GPU is thus applicable to large-scale data requiring high-speed processing.

Zhai Yantang, Tu Qiang, Lang Xianyu, Lu Zhonghua, Chi Xuebin

Chemical Physical Applications

Frontmatter
Chapter 14. GPU Tuning for First-Principle Electronic Structure Simulations

With increasing demands on hardware in quantum chemistry calculations, modern Graphical Processing Units (GPUs) have great potential to meet the resource requirements of high performance computing. In this paper we investigate the possibility of accelerating the planewave pseudopotential code PEtot on the CUDA architecture. In particular, we execute the two most time-consuming steps, i.e., the nonlocal projections and FFT transformations, on the GPU with careful implementations to reduce data exchange between the CPU and the GPU. Our experience with a molecule of as many as 512 atoms is also reported.

Yue Wu, Weile Jia, Lin-Wang Wang, Weiguo Gao, Long Wang, Xuebin Chi
Chapter 15. Nucleation and Reaction of Dislocations in Some Metals and Intermetallic Compound TiAl

The shear deformation in selected metals and the intermetallic compound TiAl under different conditions was investigated using molecular dynamics (MD) simulation with many-body interatomic potentials. The atomic-scale details of dislocation nucleation were simulated with the GPU implementation of our Para MD program. For homogeneous nucleation in a perfect lattice, as the lattice strain increases, strain localization occurs, during which the strain condenses gradually onto a few lattice planes where nucleation is finally achieved. Dislocations on different slip planes in the same slip system can react with each other if the slip planes are only a few interplanar spacings apart, forming a variety of defects including vacancies, interstitial atoms and their clusters, small dislocation loops, etc., depending on the distance between the slip planes and the characteristics of the reacting dislocations. It was shown that the C1060 GPU provides substantial acceleration of MD simulation compared with a conventional CPU, indicating a way to reduce the total cost of MD simulations; however, for ultra-large atomic systems, the relatively small memory capacity of the C1060 hinders further increase of the simulation size beyond a few million atoms. Expanding GPU memory, which also reduces the communication burden for a given simulation size, is therefore more important for such simulations than increasing the number of cores.

D. S. Xu, H. Wang, R. Yang

Geophysical and Fluid Dynamical Application

Frontmatter
Chapter 16. Large-Scale Numerical Weather Prediction on GPU Supercomputer
145.0 TFlops with 3990 GPUs on TSUBAME 2.0

In order to drastically shorten the runtime of ASUCA, a weather prediction code developed by the JMA (Japan Meteorological Agency) for its next-generation weather forecasting service, the entire huge code was rewritten for GPU computing from scratch. By introducing many optimization techniques and several new algorithms, a very high performance of 145 TFlops has been achieved with 3990 GPUs on the TSUBAME 2.0 supercomputer. It is quite meaningful to show that GPU supercomputing is truly viable for one of the major applications in the HPC field.

Takayuki Aoki, Takashi Shimokawabe
Chapter 17. Targeting Atmospheric Simulation Algorithms for Large, Distributed-Memory, GPU-Accelerated Computers

Computing platforms are increasingly moving to accelerated architectures, and here we deal particularly with GPUs. In Norman et al. (2011), a method was developed for atmospheric simulation to improve efficiency on large, distributed-memory machines by reducing communication demand and increasing the time step. Here, we improve upon this method to further target GPU-accelerated platforms by reducing GPU memory accesses, removing a synchronization point, and clustering computations. The modified code ran more than two times faster than the original in some cases even though more computations were required, demonstrating the importance of improving memory handling on the GPU. Furthermore, we discovered that the modification also has a near 100 % hit rate in the fast, on-chip L1 cache and discuss the reasons for this. Finally, we remark on further potential improvements to GPU efficiency.

Matthew R. Norman
Chapter 18. Investigation of Solving 3D Navier–Stokes Equations with Hybrid Spectral Scheme Using GPU

The approach of accelerating applications with GPUs already delivers impressive computational performance compared to the traditional CPU. The hardware architecture of the GPU is a significant departure from CPUs, hence redesign and validation of the numerical algorithm are necessary. The spectral-finite-difference schemes usually used in direct numerical simulation (DNS) of turbulent channel flows are studied here. In order to validate the numerical accuracy, the scalar diffusion equation is first solved with this scheme, and the results from GPU and CPU are validated against the analytical solution. The major computational kernels of the scheme are the fast Fourier transform (FFT) and the linear equation solver, which are both implemented on the GPU. The performance study of the scalar diffusion equation shows at least a 20× speedup. For the 3D Navier-Stokes equations, the performance on a single Nvidia S2050 card shows a 25 times speedup.

Ying Xu, Lei Xu, D. D. Zhang, J. F. Yao
Chapter 19. Correlation of Reservoir and Earthquake by Multi Temporal-Spatial Scale Flow Driven Pore-Network Crack Model in Parallel CPU and GPU Platform

The Coulomb failure assumption (Jaeger and Cook, Fundamentals of Rock Mechanics, Methuen, New York, 1969) is used to evaluate the earthquake trigger, and the pore pressure terms (Biot, J Appl Phys 12:155, 1941; 26:182, 1955; 78:91, 1956; J Geophys Res 78:4924, 1973) reflect the effect of the reservoir close to the earthquake slip. A fluid-flow-driven pore-network crack model (Zhu and Shi, Theor Appl Fract Mech 53:9, 2010) is used to study the reservoir and earthquake. Based on parallel CPU computation and GPU visualization technology, the relationships between the water-drainage sluice process of the Zipingpu reservoir, the stress triggers and shadows of the 2008 Wenchuan Ms 8.0 earthquake, and the porosity variability of the Longmenshan slip zone have been analyzed, and the fluid-solid coupled fracture mechanism of the Longmenshan coseismic fault slip is obtained.
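
For readers unfamiliar with the triggering criterion named above, a standard textbook form of the Coulomb failure stress change with a pore-pressure term (not necessarily the chapter's exact notation) is:

```latex
% Coulomb failure stress change with a pore-pressure contribution:
\Delta \mathrm{CFS} = \Delta\tau + \mu \, (\Delta\sigma_n + \Delta P)
% \Delta\tau: shear stress change in the slip direction; \mu: friction
% coefficient; \Delta\sigma_n: normal stress change (tension positive);
% \Delta P: pore-pressure change, which carries the reservoir's effect.
% \Delta CFS > 0 brings the fault closer to failure.
```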

B. J. Zhu, C. Liu, Y. L. Shi, D. A. Yuen
Chapter 20. A Full GPU Simulation of Evolving Fracture Networks in a Heterogeneous Poro-Elasto-Plastic Medium with Effective-Stress-Dependent Permeability

The wide range of timescales and underlying physics associated with simulating poro-elasto-plastic media presents significant computational challenges. GPU technology is particularly advantageous for overcoming these problems because, even though the physics is the same, computational times are orders of magnitude faster. Poro-elasticity can be implemented on the GPU; however, GPU implementation of plastic stresses poses problems because branching is introduced into the program, which incurs efficiency penalties. In general, any element-by-element evaluation to deal with branching on the GPU is very inefficient. In this paper, we describe fracture evolution in a poro-elasto-plastic medium and use a switch-on/switch-off function to avoid branching, allowing efficient computation of plasticity on the GPU. We benchmark the elasto-plastic part by investigating the angles of developed shear bands, and benchmark the non-linear diffusion part of the code using the method of manufactured solutions. Model results are presented for fluid pressure propagation through an elasto-plastic matrix subjected to compression, and another for extension. The results demonstrate how fluid flow is restricted in the compression case because of the load-induced low permeability, while fluid flow is encouraged in the extensional case because of the extension-induced high permeability. Code performance is excellent on the GPU, and we are able to run months of simulation using time steps of a few seconds within a few hours. With this new algorithm, many problems of coupled fluid flow and mechanical response can be efficiently simulated at very high resolution.
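
A minimal sketch of the branch-avoidance idea (not the chapter's actual yield function): an arithmetic 0/1 switch selects between the elastic and plastic results, so every thread in a warp executes the same instructions.

```cuda
// Branch-free "switch-on/switch-off": s is 1 where the trial stress
// exceeds the yield stress and 0 elsewhere; a blend replaces if/else.
// tau and tau_yield are illustrative names for a generic return mapping.
__device__ float plastic_switch(float tau, float tau_yield)
{
    float s = (float)(tau > tau_yield);       // 0.0f or 1.0f, no divergence
    return (1.0f - s) * tau + s * tau_yield;  // elastic value or capped value
}
```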

Boris Galvan, Stephen Miller
Chapter 21. GPU Implementation of Multigrid Solver for Stokes Equation with Strongly Variable Viscosity

Solving the Stokes flow problem is commonplace in numerical modeling of geodynamic processes, because the lithosphere and mantle can always be regarded as incompressible flow on long geological time scales. For Stokes flow, the Reynolds number is effectively zero, so one can ignore the advective term of the momentum equation, resulting in slowly creeping flow. Because of the ill-conditioned matrix arising from the saddle-point problem that couples the mass and momentum partial differential equations, it is still extremely difficult to solve this elliptic PDE system efficiently, especially with the strongly variable coefficients due to the rheological structure of the earth. However, since NVIDIA issued the CUDA programming framework in 2007, scientists can use commodity CPU-GPU systems to do such geodynamic simulations efficiently, exploiting the respective advantages of the CPU and GPU. In this paper, we implement a GPU solver for the Stokes equations with variable viscosity based on CUDA, using geometric multigrid methods on staggered grids. For the 2D version, we used a mixture of Jacobi and Gauss-Seidel iteration with conservative finite differences as the smoother. For the 3D version, we call from Matlab 2010b a GPU smoother rewritten with the red-black Gauss-Seidel updating method to avoid the problem of disordered threads.
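
A minimal sketch of a red-black Gauss-Seidel sweep for a constant-coefficient 2D Poisson problem is shown below; the chapter's staggered-grid, variable-viscosity Stokes smoother is considerably more involved, so this only illustrates why the coloring makes the update safe for unordered GPU threads.

```cuda
// One red-black Gauss-Seidel sweep on a uniform nx-by-ny grid.
// Points of one color have no neighbors of the same color, so all updates
// within a single kernel launch are independent regardless of thread order.
__global__ void rb_gs_sweep(float* u, const float* f,
                            int nx, int ny, float h2, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;      // red pass: color 0; black: 1
    int k = j * nx + i;
    u[k] = 0.25f * (u[k-1] + u[k+1] + u[k-nx] + u[k+nx] - h2 * f[k]);
}
// A full smoothing step launches the kernel twice: color 0, then color 1.
```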

Liang Zheng, Taras Gerya, Matthew Knepley, David A. Yuen, Huai Zhang, Yaolin Shi
Chapter 22. High Rayleigh Number Mantle Convection on GPU

We implemented two- and three-dimensional Rayleigh–Bénard convection on Nvidia GPUs by utilizing a 2nd-order finite difference method. By exploiting the massive parallelism of the GPU using both CUDA for C and optimized CUBLAS routines, we have run, on a single Fermi GPU, simulations at Rayleigh numbers of up to 6×10^10 (on a mesh of 2000 × 4000 uniform grid points) in two dimensions and up to 10^7 (on a mesh of 450 × 450 × 225 uniform grid points) in three dimensions. On Nvidia Tesla C2070 GPUs, these implementations enjoy single-precision performance of 535 GFLOP/s and 100 GFLOP/s respectively, and double-precision performance of 230 GFLOP/s and 70 GFLOP/s respectively.

David A. Sanchez, Christopher Gonzalez, David A. Yuen, Grady B. Wright, Gregory A. Barnett
Chapter 23. High-Order Discontinuous Galerkin Methods by GPU Metaprogramming

Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. In a recent publication, we have shown that DG methods also adapt readily to execution on modern, massively parallel graphics processors (GPUs). A number of qualities of the method contribute to this suitability, reaching from locality of reference, through regularity of access patterns, to high arithmetic intensity. In this article, we illuminate a few of the more practical aspects of bringing DG onto a GPU, including the use of a Python-based metaprogramming infrastructure that was created specifically to support DG, but has found many uses across all disciplines of computational science.

Andreas Klöckner, Timothy Warburton, Jan S. Hesthaven
Chapter 24. Accelerating Large-Scale Simulation of Seismic Wave Propagation by Multi-GPUs and Three-Dimensional Domain Decomposition

We adopted the GPU (graphics processing unit) to accelerate the large-scale finite-difference simulation of seismic wave propagation. We describe the main parts of our implementation: the memory optimization, the three-dimensional domain decomposition, and the overlapping of communication and computation. With our GPU program, we achieved a very high single-precision performance of about 61 TFlops by using 1,200 GPUs and 1.5 TB of total memory, and scalability nearly proportional to the number of GPUs, on TSUBAME-2.0, the recently installed GPU supercomputer at the Tokyo Institute of Technology, Japan. In a realistic application using 400 GPUs, a wall clock time of only 2,068 s (including the overhead of snapshot output) was required for a complex structure model with more than 13 billion unit cells and 20,000 time steps. We therefore conclude that GPU computing for large-scale simulation of seismic wave propagation is a promising approach.
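
A schematic of the overlap pattern referred to above, using two CUDA streams; the kernel names, buffers, and neighbor ranks are placeholders, not the authors' code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Illustrative kernels, assumed defined elsewhere.
__global__ void update_boundary(float* field);
__global__ void update_interior(float* field);

void timestep(float* d_field, float* d_send, float* d_recv,
              float* h_send, float* h_recv, size_t halo_bytes,
              int up, int down, dim3 gb, dim3 tb, dim3 gi, dim3 ti)
{
    cudaStream_t halo, interior;
    cudaStreamCreate(&halo);
    cudaStreamCreate(&interior);

    // Update boundary cells first, then begin staging them to the host.
    update_boundary<<<gb, tb, 0, halo>>>(d_field);
    cudaMemcpyAsync(h_send, d_send, halo_bytes, cudaMemcpyDeviceToHost, halo);

    // The interior update overlaps with the halo traffic below.
    update_interior<<<gi, ti, 0, interior>>>(d_field);

    cudaStreamSynchronize(halo);             // host buffer ready for MPI
    int n = (int)(halo_bytes / sizeof(float));
    MPI_Sendrecv(h_send, n, MPI_FLOAT, up, 0,
                 h_recv, n, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(d_recv, h_recv, halo_bytes, cudaMemcpyHostToDevice, halo);

    cudaDeviceSynchronize();                 // both streams complete
    cudaStreamDestroy(halo);
    cudaStreamDestroy(interior);
}
```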

Taro Okamoto, Hiroshi Takenaka, Takeshi Nakamura, Takayuki Aoki
Chapter 25. Support Operator Rupture Dynamics on GPU

The Support Operator Method (SOM) is a numerical method to simulate seismic wave propagation by solving the three-dimensional viscoelastic equations. Its implementation, the Support Operator Rupture Dynamics (SORD) code, has been proved to be highly scalable in large-scale multi-processor calculations. This paper discusses accelerating SORD on the GPU using NVIDIA CUDA C. Compared to its original version on the CPU, we have achieved a maximum 12.8X speed-up.

Shenyi Song, Yichen Zhou, Tingxing Dong, David A. Yuen

Algorithms and Solvers

Frontmatter
Chapter 26. A Geometric Multigrid Solver on GPU Clusters

Recently, more and more GPU HPC clusters are being installed, and thus there is a need to adapt existing software design concepts to multi-GPU environments. We have developed a modular and easily extensible software framework called WaLBerla that covers a wide range of applications, ranging from particulate flows and free-surface flows to nanofluids coupled with temperature simulations. In this article we report on our experiences extending WaLBerla to support geometric multigrid algorithms for the numerical solution of partial differential equations (PDEs) on multi-GPU clusters. We discuss the object-oriented software and performance engineering concepts necessary to integrate efficient compute kernels into our WaLBerla framework, and show that a large fraction of the high computational performance offered by current heterogeneous HPC clusters can be sustained for geometric multigrid algorithms.

Harald Koestler, Daniel Ritter, Christian Feichtinger
Chapter 27. Accelerating 2-Dimensional CFD on Multi-GPU Supercomputer

In this paper, we describe the finite-difference domain decomposition strategy used to implement and optimize GPU codes for solving the 2-D N-S equations. To suit the GPU architecture, our algorithms emphasize the decomposition strategy and maximal exploitation of the GPU memory hierarchy, so that a high rate of speedup can be expected. Two CFD cases, cavity flow and the RAE 2822 aerofoil, are used as tests. For cavity flow, we ran our simulation on both the CUDA and OpenCL platforms and witnessed 30–60x speedup. For the aerofoil, we used 6–60 GPU devices and obtained speedups of 5–29 times depending on the grid size and the number of devices used.

Sen Li, Xinliang Li, Long Wang, Zhonghua Lu, Xuebin Chi
Chapter 28. Efficient Rendering of Order Independent Transparency on the GPUs

Order independent transparency refers to the problem of rendering scenes using alpha blending equations, which requires the primitives in the scenes to be rendered according to their distances to the viewer. It is one of the key rendering effects in many graphics applications and has thus been extensively studied. Various techniques and systems have been proposed to render order independent transparency. These techniques can be classified into three categories based on their underlying methodologies: the primitive-level methods, the fragment-level methods, and the screen-door methods. This article provides a comprehensive review of these existing methods, with an emphasis on the advanced techniques that have been developed recently. The background of order independent transparency is introduced at the beginning of this review. Key contributions, advantages, and limitations of each method are summarized in the three following parts, respectively. The first part focuses on the primitive-level methods, which try to solve the problem by pre-sorting primitives, then rendering them from back to front using alpha blending equations. The second part reviews the fragment-level methods, which perform fragment sorting and blending on the fly, or capture all the fragments per pixel and sort them in post-processing before blending. A performance and memory-consumption analysis is presented as a comparison between these methods. The third part introduces another category of methods, which approximate the rendering results using screen-door techniques; these are quite practical for rendering scenes with high depth complexity, such as grass and hair. Finally, a brief conclusion is given at the end of the review, indicating the direction of future development of order independent transparency.

Fang Liu
Chapter 29. Performance Evaluation of Fast Fourier Transform Application on Heterogeneous Platforms

Heterogeneous platforms, integrating SMPs, clusters, GPUs, FPGAs, etc., are becoming the most popular architectures of supercomputers. Achieving high performance on CPUs or GPUs requires careful consideration of their different architectures, which challenges the capability and skills of programmers. In order to overcome the portability problem, OpenCL, a free cross-platform programming standard, has been proposed by the Khronos Compute Working Group. However, the performance of OpenCL-based programs has not been thoroughly studied yet. Therefore, in this paper, we first design OpenFFT-Bench, an FFT application with OpenCL-based FFT and OpenGL-based real-time spectrum visualization, as the benchmark. We evaluate its performance on four OpenCL programming platforms: NVIDIA CUDA, ATI Stream (GPU), ATI Stream (CPU), and Intel OpenCL. Characteristics of OpenFFT-Bench are investigated with varied FFT sizes. Experimental results show that OpenCL- and OpenGL-based applications can not only run on multiple heterogeneous platforms, but also achieve relatively high performance on GPU-based platforms.

Xiaojun Li, Yang Gao, Xinyu Ma, Ying Liu
Chapter 30. Accurate Evaluation of Local Averages on GPGPUs

We discuss fast and accurate evaluation of local averages on GPGPUs. This work was motivated by the need to calculate reference fluid densities in the classical density functional theory (DFT) of electrolytes proposed in Gillespie et al. (2002). In Knepley et al. (2010) we developed efficient algorithms for the minimization of three-dimensional DFT models of biological ion channel permeation and selectivity. One of the essential bottlenecks of 3D DFT models is the evaluation of local screening averages of the chemical species’ densities. But the problem has wider applicability, and fast evaluations of averages over the local spherical screening neighborhood of every grid point are typically inaccurate due to the use of collocation approximations of densities on Cartesian grids. Accurate evaluations based on spectral quadrature were proposed and used in Knepley et al. (2010), but they are significantly more computationally expensive because of their nonlocal nature in Fourier space. Here we show that the passage to Fourier space can, in fact, make the averaging calculation much more amenable to efficient implementation on GPGPU architectures. This allows us to take advantage of both improved accuracy and hardware acceleration to arrive at fast and accurate screening calculations.
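
To illustrate why the Fourier route maps well to GPUs: an average over a spherical neighborhood is a convolution with a normalized ball indicator, which in Fourier space becomes a single pointwise multiply between transforms. Below is a cuFFT-based sketch of that general pattern (not the chapter's spectral-quadrature scheme), assuming the transformed kernel d_kernel_hat is precomputed; all names are illustrative.

```cuda
#include <cufft.h>

// Pointwise complex multiply, with cuFFT's 1/n inverse normalization folded in.
__global__ void pointwise_mul(cufftComplex* a, const cufftComplex* b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float inv = 1.0f / (float)n;
    cufftComplex x = a[i], y = b[i];
    a[i].x = (x.x * y.x - x.y * y.y) * inv;
    a[i].y = (x.x * y.y + x.y * y.x) * inv;
}

// rho_hat <- F(rho); rho_hat *= kernel_hat; rho <- F^{-1}(rho_hat).
void screening_average(cufftComplex* d_rho, const cufftComplex* d_kernel_hat,
                       int nx, int ny, int nz)
{
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_C2C);
    cufftExecC2C(plan, d_rho, d_rho, CUFFT_FORWARD);
    int n = nx * ny * nz;
    pointwise_mul<<<(n + 255) / 256, 256>>>(d_rho, d_kernel_hat, n);
    cufftExecC2C(plan, d_rho, d_rho, CUFFT_INVERSE);
    cufftDestroy(plan);
}
```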

Dmitry A. Karpeev, Matthew G. Knepley, Peter R. Brune
Chapter 31. Accelerating Swarm Intelligence Algorithms with GPU-Computing

Swarm intelligence describes the ability of groups of social animals and insects to exhibit highly organized and complex problem-solving behaviors that allow the group as a whole to accomplish tasks which are beyond the capabilities of any individual. This phenomenon found in nature is the inspiration for swarm intelligence algorithms—systems that utilize the emergent patterns found in natural swarms to solve computational problems. In this paper, we will show that due to their implicitly parallel structure, swarm intelligence algorithms of all sorts can benefit from GPU-based implementations. To this end, we present the ClusterFlockGPU algorithm, a swarm intelligence data mining algorithm for partitional cluster analysis based on the flocking behaviors of birds and implemented with CUDA. Our results indicate that ClusterFlockGPU is competitive with other swarm intelligence and traditional clustering methods. Furthermore, the algorithm exhibits a nearly linear time complexity with respect to the number of data points being analyzed, and running time is not affected by the dimensionality of the data being clustered, thus making it well-suited for high-dimensional data sets. With the GPU-based implementation adopted here, we find that ClusterFlockGPU is up to 55× faster than a sequential implementation and its time complexity is significantly reduced to nearly O(n).

Robin M. Weiss
Chapter 32. Asynchronous Parallel Logic Simulation on Modern Graphics Processors

Logic simulation has become the bottleneck of today’s integrated circuit (IC) design projects. For instance, over 80 % of NVIDIA’s IC design turn-around time is spent on logic simulation, even with NVIDIA’s proprietary supercomputing facility. It is thus essential to develop parallel simulation solutions to maintain the momentum of increasing IC integration capacity. Inspired by the supreme parallel computing power of modern GPUs, in this chapter we report our recent work on using the GPU to accelerate the time-consuming IC verification process by developing a massively parallel gate-level logic simulator. To the best of the authors’ knowledge, this work is the first to leverage the power of modern GPUs to successfully unleash the massive parallelism of a conservative discrete event-driven algorithm, the CMB algorithm. Based on a novel data-parallel algorithmic mapping strategy, both the data structure and the processing flow of the CMB protocol are re-designed to better exploit the potential of modern GPUs. A dynamic memory management mechanism is developed to efficiently utilize the relatively limited GPU memory resource. Experimental results prove that our GPU-based simulator outperforms a CPU baseline event-driven simulator by a factor of 47.4X on average. This work demonstrates that the CMB algorithm can be efficiently and effectively deployed on GPUs without the performance overhead that had hindered its successful application on previous parallel architectures.

Yangdong Deng, Yuhao Zhu, Wang Bo
Chapter 33. Implementations of Main Algorithms for Generalized Symmetric Eigenproblem on GPU Accelerator

To solve a generalized eigensystem problem, we first need to transform the generalized eigenproblem into a standard eigenproblem, and then reduce the matrix to tridiagonal form. These steps are based on blocked Cholesky decomposition and the blocked Householder tridiagonalization method. We present parallel implementations of the standard transformation, which folds the Cholesky decomposition into the transformation from generalized to standard form, and of the reduction of a dense matrix to tridiagonal form, on a GPU accelerator using CUBLAS. Experimental results clearly demonstrate the potential of data-parallel coprocessors for scientific computations. Compared against the CPU implementation, the GPU implementations achieve above 16-fold and 20-fold speedups in double precision, respectively.
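
For reference, the two-step transformation described above, in standard notation (not the chapter's own):

```latex
% Generalized symmetric eigenproblem with B symmetric positive definite:
%   A x = \lambda B x,  B = L L^{T}  (blocked Cholesky).
% Substituting y = L^{T} x gives a standard symmetric eigenproblem
%   C y = \lambda y,   C = L^{-1} A L^{-T},
% after which C is reduced to tridiagonal form T = Q^{T} C Q by blocked
% Householder transformations; eigenvectors are recovered via x = L^{-T} y.
```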

Yonghua Zhao, Fang Liu, Yangang Wang, Xuebin Chi
Chapter 34. Using Mixed Precision Algorithm for LINPACK Benchmark on AMD GPU

LINPACK is a de facto benchmark for supercomputers. Nowadays, CPU-GPU heterogeneous clusters are becoming an important trend in supercomputers. Because of the high performance of the mixed precision algorithm, we previously developed GHPL, a mixed precision high performance LINPACK software package, on an NVIDIA GPU cluster. In this paper, we introduce our recent work on porting and optimizing GHPL on AMD GPUs. On the AMD GPU platform, we implemented a hybrid CPU-GPU GEMM function with the ACML-GPU and GotoBLAS libraries. According to our results, the speedup of GHPL over HPL was 3.21. In addition, we point out the limitations of the ACML-GPU library.
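
The mixed precision idea behind such a package, sketched with placeholder helpers (not the ACML-GPU or GotoBLAS interfaces): do the O(n^3) factorization in fast single precision, then recover double-precision accuracy with a few cheap iterative-refinement steps.

```cuda
#include <stdlib.h>

// Placeholder helpers, assumed implemented elsewhere:
//   sp_lu_solve : single-precision GPU LU factor/solve of A z = rhs
//   residual    : double-precision r = b - A x, returning ||r||_inf
void   sp_lu_solve(int n, const double* A, const double* rhs, double* z);
double residual(int n, const double* A, const double* x, const double* b, double* r);

void mixed_precision_solve(int n, const double* A, const double* b,
                           double* x, double tol, int max_iter)
{
    double* r = (double*)malloc(n * sizeof(double));
    double* d = (double*)malloc(n * sizeof(double));

    sp_lu_solve(n, A, b, x);                  // bulk of the flops, in single
    for (int it = 0; it < max_iter; ++it) {
        if (residual(n, A, x, b, r) < tol)    // r = b - A x, in double
            break;
        sp_lu_solve(n, A, r, d);              // correction: A d = r (reuse factors)
        for (int i = 0; i < n; ++i) x[i] += d[i];
    }
    free(r);
    free(d);
}
```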

Xianyi Zhang, Yunquan Zhang, Lei Wang
Chapter 35. Parallel Lattice Boltzmann Method on CUDA Architecture

In this article, an implementation of the 2D Lattice Boltzmann Method in CUDA is presented. The simulation ran very well on a GPU with the latest Tesla C1060 core. From the results, the speedup of the simulation on the GPU is more than 30 times that on a CPU, with a peak speedup of 41 times.

Weibing Feng, Wu Zhang, Bing He, Kai Wang

Visualization

Frontmatter
Chapter 36. Iterative Deblurring of Large 3D Datasets from Cryomicrotome Imaging Using an Array of GPUs

The aim was to enhance vessel-like features of large 3D datasets (4000 × 4000 × 4000 pixels) resulting from cryomicrotome images using a system-specific point spread function (PSF). An iterative (Gauss-Seidel) spatial convolution strategy for GPU arrays was developed to enhance the vessels. The PSF is small and spatially invariant and resides in the fast constant memory of the GPU, while the unfiltered data reside in slower global memory but are prefetched by blocks of threads into shared GPU memory. Filtering is achieved by a series of unrolled loops in shared memory. Between iterations the filtered data are stored to disk using asynchronous MPI-IO, effectively hiding the IO overhead behind the kernel execution time. Our implementation reduces computational time by up to 350 times on four GPUs in parallel compared to a single-core CPU implementation and outperforms FFT-based filtering strategies on GPUs. Although developed for filtering the complete arterial system of the heart, the method is generally applicable.
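
A 1D toy version of the memory layout described above, with the PSF in __constant__ memory and a tile plus halos staged into shared memory; sizes and names are illustrative (launch with 256 threads per block), and the chapter's filter is 3D.

```cuda
#define PSF_R 3                              // PSF radius (7-tap filter)
__constant__ float c_psf[2 * PSF_R + 1];     // small, spatially invariant PSF

__global__ void convolve1d(const float* in, float* out, int n)
{
    __shared__ float tile[256 + 2 * PSF_R];  // block's data plus halos
    int g = blockIdx.x * blockDim.x + threadIdx.x;

    // Prefetch this block's span of global memory into shared memory.
    tile[threadIdx.x + PSF_R] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < PSF_R) {
        int l = g - PSF_R, r = g + blockDim.x;
        tile[threadIdx.x] = (l >= 0) ? in[l] : 0.0f;
        tile[threadIdx.x + blockDim.x + PSF_R] = (r < n) ? in[r] : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float acc = 0.0f;
        for (int k = -PSF_R; k <= PSF_R; ++k)   // small fixed loop; unrolls
            acc += c_psf[k + PSF_R] * tile[threadIdx.x + PSF_R + k];
        out[g] = acc;
    }
}
```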

Thomas Geenen, Pepijn van Horssen, Jos A.E. Spaan, Maria Siebes, Jeroen P.H.M. van den Wijngaard
Chapter 37. WebViz: A Web-Based Collaborative Interactive Visualization System for Large-Scale Data Sets

We have created a web-based system for multi-user collaborative interactive visualization of large data sets (on the order of terabytes) that allows users in different locations to simultaneously and collectively perform visualizations over the Internet. By leveraging an asynchronous JavaScript and XML (AJAX) web development pattern via the Google Web Toolkit (http://code.google.com/webtoolkit/), we are able to provide remote users a web portal to the University of Minnesota’s Laboratory for Computational Sciences and Engineering’s large-scale interactive visualization system, which provides high-resolution visualizations on the order of 15 million pixels. Our web application, known as WebViz (for Web-based Visualization System), provides visualization services “in the cloud” and is accessible via a range of devices including netbooks, smartphones, and other web- and JavaScript-enabled mobile devices. This paper details the history of the project as well as the current version of WebViz, including a discussion of its implementation and system architecture. We also discuss features, future goals, and our plans for increasing the scalability of the system, which include a discussion of the benefits potentially afforded us by a migration of server-side components to the Google Application Engine (http://code.google.com/appengine/).

Yichen Zhou, Robin M. Weiss, Elizabeth McArthur, David Sanchez, Xiang Yao, Dave Yuen, Mike R. Knox, W. Walter Czech
Chapter 38. Interactive Visualization Tool for Planning Cancer Treatment

We discuss the components and main requirements of an interactive visualization and simulation system intended for better understanding the dynamics of solid tumor proliferation. The heterogeneous Complex Automata, discrete-continuum model is used as the simulation engine. It combines the Cellular Automata paradigm, particle dynamics, and continuum approaches to model mechanical interactions of the tumor with the rest of the tissue. We show that to provide interactivity, the system has to be efficiently implemented on workstations with multi-core CPUs controlled by the OpenMP interface and/or empowered by GPGPU accelerators. Currently, the computational power of modern CPU and GPU processors enables simulation of tumors of a few millimeters in diameter in both their avascular and angiogenic phases. To validate the results of simulation against real tumors, we plan to integrate the tumor modeling simulator with the Graph Investigator tool. One can then validate the simulation results on the basis of topological similarity between the tumor vascular networks obtained from direct observation and from simulation. The interactive visualization system can have both educational and research aspects. It can be used as a tool for clinicians and oncologists for educational purposes and, in the near future, in medical in silico labs doing research in anticancer drug design and/or in planning cancer treatment.

R. Wcisło, W. Dzwinel, P. Gosztyla, D. A. Yuen, W. Czech
Chapter 39. High Throughput Heterogeneous Computing and Interactive Visualization on a Desktop Supercomputer

At a cost below $2500, a desktop supercomputer was built from scratch by assembling the basic parts, including a Tesla C1060 card and a GeForce GTX 295 card. This commodity desktop runs a Linux operating system together with CUDA, MPI and other needed software. MPI is used not only for distributing and/or transferring the computing loads among the GPU devices, but also for controlling the process of visualization. Several heterogeneous computing applications have been successfully run on this desktop. Calculation of long-ranged forces in the n-body problem with the fast multipole method can consume more than 85 % of the cycles and generate 480 GFLOPS of throughput. Mixed programming of CUDA-based C and Matlab has facilitated interactive visualization during simulations. One such MIMD application is the simulation of an idealized Belousov-Zhabotinsky Reaction (BZR), which is distributed evenly on three GPU devices (two on the GTX 295 and one on the Tesla) through the message passing interface (MPI) and visualized at a given frequency, displaying the evolution of the simulated reaction. One additional MPI process is over-subscribed onto one GPU device for monitoring the thermal status and memory usage of all the GPU devices as the BZR simulation progresses, further enhancing the throughput. (Submitted as a part of the paper is a movie capturing the self-organization process of cellular spirals resembling the Belousov-Zhabotinsky Reaction.) Our test runs have shown that running multiple applications on one GPU device or running one application across multiple GPU devices can be done as conveniently as on traditional CPUs.

S. Zhang, R. Weiss, S. Wang, G. A. Barnett Jr., D. A. Yuen
Chapter 40. Applications of Microtomography to Multiscale System Dynamics: Visualisation, Characterisation and High Performance Computation

We characterise microstructure over multiple spatial scales for different samples utilising a workflow that combines microtomography with computational analysis. High-resolution microtomographic data are acquired by desktop and synchrotron X-ray tomography. In some recent 4-dimensional experiments, microstructures that are evolving with time are produced and documented in situ. The microstructures in our materials are characterised by a numerical routine based on percolation theory. In a pre-processing step, the material of interest is segmented from the tomographic data. The analytical approach can be applied to any feature that can be segmented. We characterise a microstructure by its volume fraction, the specific surface area, the connectivity (percolation) and the anisotropy of the microstructure. Furthermore, properties such as permeability and elastic parameters can be calculated. By using the moving window method, scale-dependent properties are obtained and the size of the representative volume element (RVE) is determined. The fractal dimension of particular microstructural configurations is estimated by relating the number of particular features to their normalized size. The critical exponent of correlation length can be derived from the probability of percolation of the microstructure. With these two independent parameters, all other critical exponents are determined, leading to scaling laws for the specific microstructure. These are used to upscale the microstructural model and properties. Visualisation is one of the essential tools when performing characterisation. The high performance computations behind these characterisations include: (1) the Hoshen-Kopelman algorithm for labelling materials in large datasets; (2) the OpenMP parallelisation of the moving window method and the performance of stochastic analysis (up to 640^3 voxels); (3) the MPI parallelisation of the moving window method and the performance of stochastic analysis, which enables the computation to be run on distributed-memory machines and employ massive parallelism; (4) the parallelised MPI version of the Hoshen-Kopelman algorithm and the moving window method, which allows datasets of theoretically unlimited size to be analysed.

Jie Liu, Klaus Regenauer-Lieb, Chris Hines, Shuxia Zhang, Paul Bourke, Florian Fusseis, David A. Yuen
Chapter 41. Three-Dimensional Reconstruction of Electron Tomography Using Graphic Processing Units (GPUs)

Three-dimensional (3D) reconstruction in electron tomography (ET) has emerged as a leading technique to elucidate the molecular structures of complex biological specimens. Iterative methods using blob basis functions are advantageous reconstruction methods due to their good performance, especially under noisy and limited-angle conditions. However, iterative reconstruction algorithms for ET pose tremendous computational challenges. Graphics processing units (GPUs) offer an affordable platform to meet these demands. Nevertheless, due to the limited memory of GPUs, the weighting matrix used by iterative methods cannot be kept in GPU memory, especially for large images. To meet the high computational demands, we propose a multilevel parallel scheme to perform iterative blob-based reconstruction on GPUs. To address the large memory requirements of the weighting matrix, we also present a matrix storage technique, called blobELL-R, suitable for GPUs. In this storage technique, several geometrically related symmetry relationships are exploited to significantly reduce the storage space. Experimental results indicate that the multilevel parallel reconstruction scheme on GPUs achieves high and stable speedups. The blobELL-R data structure needs only about 1/16 of the storage space of the ELLPACK-R (ELL-R) storage structure and yields significant acceleration compared to the standard and CRS-based matrix implementations on the CPU.

Xiaohua Wan, Fa Zhang, Qi Chu, Zhiyong Liu
Backmatter
Metadata
Title
GPU Solutions to Multi-scale Problems in Science and Engineering
Edited by
David A. Yuen
Long Wang
Xuebin Chi
Lennart Johnsson
Wei Ge
Yaolin Shi
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-16405-7
Print ISBN
978-3-642-16404-0
DOI
https://doi.org/10.1007/978-3-642-16405-7
