Implementing molecular dynamics on hybrid high performance computers – Particle–particle particle-mesh

https://doi.org/10.1016/j.cpc.2011.10.012

Abstract

The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are becoming more prevalent due to these advantages. In this paper, we present a continuation of previous work implementing algorithms in the LAMMPS molecular dynamics software that use accelerators on distributed-memory parallel hybrid machines. In our previous work, we focused on acceleration of short-range models with an approach intended to harness the processing power of both the accelerator and the (multi-core) CPU. To augment the existing implementations, we present an efficient implementation of long-range electrostatic force calculation for molecular dynamics. Specifically, we present an implementation of the particle–particle particle-mesh method based on the work by Harvey and De Fabritiis. We present benchmark results on the Keeneland InfiniBand GPU cluster, provide a performance comparison of the same kernels compiled with both CUDA and OpenCL, and discuss limitations to parallel efficiency and future directions for improving performance on hybrid or heterogeneous computers.

Introduction

Graphics processing units (GPUs) have become popular as accelerators for scientific computing applications due to their low cost, impressive floating-point capabilities, and high memory bandwidth. The use of accelerators such as GPUs is an important consideration for high-performance computing (HPC) platforms due to potential benefits including lower cost, electrical power, space, and cooling demands, as well as a reduced number of operating system images [1]. A number of the highest-performing supercomputers on the Top500 list [2] already utilize GPUs. In our previous work [3], we described an approach for accelerating molecular dynamics on hybrid high-performance computers containing accelerators in addition to CPUs. The work was performed using the LAMMPS software package, and the implementation focused on acceleration of neighbor list builds and non-bonded short-range force calculation. The implementation allows for CPU/GPU concurrency and can be compiled with either CUDA or OpenCL. This approach is sufficient for many LAMMPS applications where electronic screening limits the range of interatomic forces. For simulations requiring long-range electrostatics, however, there is significant potential for performance improvement from accelerating the long-range calculations as well. Although short-range force calculation typically dominates the computational workload on many parallel machines, GPU acceleration of the short-range routines can result in the overall simulation time being dominated by the long-range calculations.

The most common methods for calculating long-range electrostatic forces in systems with periodic boundary conditions are the standard Ewald summation [4] and related particle-mesh methods that utilize fast Fourier transforms (FFTs) to speed up the long-range calculation. Particle mesh Ewald (PME) [5], smooth particle mesh Ewald (SPME) [6], and particle–particle particle-mesh (P3M) [7] all use a grid representation of the charge density so that the reciprocal-space contribution can be computed with FFTs at a more favorable time complexity; a minimal sketch of this shared pipeline is given below. Although there are many reports of GPU acceleration for short-range forces in the literature, relatively few publications report speedups from acceleration of long-range calculations. Harvey and De Fabritiis [8] published an implementation of SPME for CUDA, and an alternative approach, also for SPME, was recently presented [9]. Alternatives to the traditional Ewald and FFT-based particle-mesh methods have also been published, including GPU acceleration of the orientation-averaged Ewald sum [10] and GPU acceleration of multilevel summation [11].
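As a concrete, deliberately simplified illustration of that pipeline, the following CUDA/cuFFT sketch assigns a charge to a mesh, performs a forward FFT, scales each mode by a Green's (influence) function, and performs an inverse FFT to recover the mesh potential. The nearest-grid-point charge assignment, the bare Gaussian-screened Coulomb influence function, and all parameter values are assumptions made for illustration only; they are not the production P3M stencils, optimized influence function, or differentiation scheme used in LAMMPS.

```cuda
// Minimal sketch of the FFT-based reciprocal-space solve shared by PME, SPME,
// and P3M: assign charge to a mesh, forward FFT, multiply by a Green's
// (influence) function in k-space, inverse FFT to recover the mesh potential.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX 32
#define NY 32
#define NZ 32
#define PI_F 3.14159265f

// Scale each Fourier mode by a simplified influence function
// g(k) = 4*pi*exp(-k^2/(4*alpha^2)) / (V*k^2), folding in the 1/(NX*NY*NZ)
// normalization required by cuFFT's unnormalized inverse transform.
__global__ void apply_green(cufftComplex *rho_k, float lx, float ly, float lz,
                            float alpha) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= NX * NY * NZ) return;
  int x = idx / (NY * NZ), y = (idx / NZ) % NY, z = idx % NZ;
  // Map mesh indices to signed wave numbers.
  int mx = (x <= NX / 2) ? x : x - NX;
  int my = (y <= NY / 2) ? y : y - NY;
  int mz = (z <= NZ / 2) ? z : z - NZ;
  float kx = 2.0f * PI_F * mx / lx;
  float ky = 2.0f * PI_F * my / ly;
  float kz = 2.0f * PI_F * mz / lz;
  float k2 = kx * kx + ky * ky + kz * kz;
  float g = 0.0f;  // the k = 0 mode is dropped (charge-neutral cell assumed)
  if (k2 > 0.0f)
    g = 4.0f * PI_F * expf(-k2 / (4.0f * alpha * alpha)) /
        (lx * ly * lz * k2 * NX * NY * NZ);
  rho_k[idx].x *= g;
  rho_k[idx].y *= g;
}

int main() {
  const float lx = 32.0f, ly = 32.0f, lz = 32.0f, alpha = 0.3f;
  const int nmesh = NX * NY * NZ;
  std::vector<cufftComplex> mesh(nmesh);  // zero-initialized charge mesh

  // Step 1: charge assignment -- here a single unit charge is placed on its
  // nearest grid point; P3M spreads each charge over a spline stencil.
  mesh[(16 * NY + 16) * NZ + 16].x = 1.0f;

  cufftComplex *d_mesh;
  cudaMalloc(&d_mesh, sizeof(cufftComplex) * nmesh);
  cudaMemcpy(d_mesh, mesh.data(), sizeof(cufftComplex) * nmesh,
             cudaMemcpyHostToDevice);

  cufftHandle plan;
  cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);

  // Step 2: forward FFT of the charge mesh.
  cufftExecC2C(plan, d_mesh, d_mesh, CUFFT_FORWARD);
  // Step 3: the convolution with the Green's function is a multiply in k-space.
  apply_green<<<(nmesh + 255) / 256, 256>>>(d_mesh, lx, ly, lz, alpha);
  // Step 4: the inverse FFT yields the smoothed potential on the mesh; forces
  // would then be interpolated from the mesh back to the particles.
  cufftExecC2C(plan, d_mesh, d_mesh, CUFFT_INVERSE);

  cudaMemcpy(mesh.data(), d_mesh, sizeof(cufftComplex) * nmesh,
             cudaMemcpyDeviceToHost);
  printf("mesh potential at the charge site: %g\n",
         mesh[(16 * NY + 16) * NZ + 16].x);

  cufftDestroy(plan);
  cudaFree(d_mesh);
  return 0;
}
```

In the full P3M method, charges are spread with higher-order spline weights, the influence function is optimized to reduce aliasing errors introduced by the mesh discretization, and per-particle forces are interpolated back from the mesh after the inverse transform.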

Continued investigation into alternative algorithms for long-range force calculation on accelerators will likely be necessary in order to best utilize the floating-point capabilities of accelerators in a distributed computing environment; this is an area of active research for our group. In this paper, however, we have chosen to focus on acceleration of the P3M method because 1) a CPU implementation of P3M has been available in LAMMPS for some time, which allows for a comparison between CPU and GPU calculation times, 2) the particle-mesh methods are already accepted by most physicists as accurate and efficient methods for long-range force calculation, 3) an implementation of P3M for accelerators gives a baseline for comparison of alternative accelerated algorithms, and 4) we believe that the porting of existing algorithms for use on accelerators is of general interest to the scientific computing community.

The algorithms we have used for P3M acceleration follow those proposed by Harvey and De Fabritiis [8] for SPME. In this paper, however, we focus on parallel long-range force calculation on distributed systems with accelerators. We present several improvements that address limitations to achieving good parallel efficiency. We compare results with the pre-existing CPU implementation in LAMMPS in order to assess the benefit of GPU acceleration. We describe an implementation that compiles with both the CUDA and OpenCL APIs to allow for acceleration on a variety of platforms, and we discuss approaches that minimize the amount of code that must be written for the accelerator while still allowing concurrent calculations on the CPU and the accelerator. We present benchmarks on an InfiniBand GPU cluster and discuss several important issues that limit strong scaling. Finally, we discuss future directions for improving performance on hybrid or heterogeneous clusters.
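The paper does not reproduce the portability layer itself, but the general single-source technique for compiling one kernel body under both CUDA and OpenCL can be sketched as follows. The macro names and the mesh_scale kernel below are hypothetical placeholders, not the actual LAMMPS/Geryon definitions: a thin layer of preprocessor aliases maps OpenCL kernel syntax onto CUDA equivalents so that each kernel body is written only once.

```cuda
// Illustrative (hypothetical) single-source portability header: the same kernel
// body compiles with nvcc for CUDA, or is passed to the OpenCL runtime compiler
// with -DUSE_OPENCL.
#ifdef USE_OPENCL

#define KERNEL      __kernel
#define GLOBAL      __global
#define LOCAL_ID    get_local_id(0)
#define GROUP_ID    get_group_id(0)
#define LOCAL_SIZE  get_local_size(0)

#else  /* CUDA */

#define KERNEL      extern "C" __global__
#define GLOBAL
#define LOCAL_ID    threadIdx.x
#define GROUP_ID    blockIdx.x
#define LOCAL_SIZE  blockDim.x

#endif

// One kernel body for both back ends: scale every mesh value by a constant
// (e.g. an FFT normalization factor).
KERNEL void mesh_scale(GLOBAL float *mesh, const float scale, const int n) {
  int i = GROUP_ID * LOCAL_SIZE + LOCAL_ID;
  if (i < n) mesh[i] *= scale;
}
```

Confining such aliases to a small header is one way to limit the amount of accelerator-specific code that must be maintained, since the CUDA path is compiled ahead of time while the OpenCL path is typically compiled at run time from the same source string.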

Ewald summation and P3M

Let $\mathbf{L} = \mathrm{diag}\{l_x, l_y, l_z\}$ be the diagonal matrix specifying the size of a periodic box with cell side lengths $l_x$, $l_y$, and $l_z$. The total electrostatic energy resulting from pairwise interactions of $N$ Coulomb point charges $q_i$ within the box is given by
$$E = \frac{1}{2} \sum_{\mathbf{n} \in \mathbb{Z}^3}^{\prime} \sum_{i,j=1}^{N} \frac{q_i q_j}{\lvert \mathbf{r}_{ij} + \mathbf{n}\mathbf{L} \rvert},$$
where $\mathbf{r}_{ij}$ is the vector between point charges $i$ and $j$, $\mathbf{n}$ indexes all periodic images of the simulation box, and the prime indicates that for $i = j$ the $\mathbf{n} = \mathbf{0}$ term is omitted from the summation.
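The excerpt is cut off here; for context, the standard Ewald splitting of this conditionally convergent sum (stated in the same Gaussian-unit convention, assuming a charge-neutral cell and conducting boundary conditions) separates the energy into a rapidly converging real-space term, a smooth reciprocal-space term, and a self-energy correction:
$$E = E_{\mathrm{real}} + E_{\mathrm{recip}} + E_{\mathrm{self}},$$
$$E_{\mathrm{real}} = \frac{1}{2} \sum_{\mathbf{n}}^{\prime} \sum_{i,j=1}^{N} q_i q_j\, \frac{\operatorname{erfc}\!\left(\alpha \lvert \mathbf{r}_{ij} + \mathbf{n}\mathbf{L} \rvert\right)}{\lvert \mathbf{r}_{ij} + \mathbf{n}\mathbf{L} \rvert},$$
$$E_{\mathrm{recip}} = \frac{2\pi}{V} \sum_{\mathbf{k} \neq \mathbf{0}} \frac{e^{-k^2/4\alpha^2}}{k^2} \left\lvert \sum_{j=1}^{N} q_j\, e^{i\mathbf{k}\cdot\mathbf{r}_j} \right\rvert^{2}, \qquad E_{\mathrm{self}} = -\frac{\alpha}{\sqrt{\pi}} \sum_{j=1}^{N} q_j^{2},$$
where $\alpha$ is the Ewald splitting parameter, $V = l_x l_y l_z$ is the cell volume, and $\mathbf{k}$ runs over the reciprocal lattice vectors of the periodic box. The real-space sum is truncated at a cutoff and handled by the short-range pair routines; in P3M the structure factor $\sum_j q_j e^{i\mathbf{k}\cdot\mathbf{r}_j}$ is approximated by assigning the charges to a regular mesh, so that the reciprocal-space term becomes a discrete convolution that can be evaluated with forward and inverse FFTs.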

Results

For our initial analysis of acceleration for simulations requiring long-range electrostatics, we have used the rhodopsin benchmark with a 1:1 ratio of CPU cores to GPUs. Although most hardware platforms currently have a higher ratio of CPU cores to GPUs, the 1:1 ratio is convenient in that it allows for 1) a comparison where the times for non-accelerated routines are similar and 2) accurate timings for individual device kernels and host–device data transfers (not currently possible with GPU sharing).

Discussion

Algorithms for efficient acceleration of P3M are not straightforward on current hardware due to the high latencies for atomic floating-point operations and non-contiguous memory access, the relatively small mesh sizes for P3M, and the difficulty in achieving fine-grained parallelism for spline computations without redundant computations and work-item divergence. The relative performance of the charge-assignment and force-interpolation kernels on NVIDIA GPUs is therefore much less impressive when compared with the speedups achieved for the short-range force kernels. A simplified sketch of the charge-assignment step is shown below.
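To make the atomic-update issue concrete, the following minimal CUDA kernel spreads charges onto the mesh with simple trilinear (cloud-in-cell) weights and one thread per particle. The kernel name, argument layout, and low-order weights are assumptions for illustration; the actual implementation uses higher-order spline stencils and a different work decomposition.

```cuda
// Illustrative charge-assignment (charge-spreading) kernel using trilinear
// (cloud-in-cell) weights and one thread per particle.  Positions are assumed
// to be wrapped into [0, box) and converted to mesh units by inv_h*.  This is
// a simplified stand-in for the higher-order spline stencils used in P3M,
// intended only to make the atomic-update bottleneck concrete.
#include <cuda_runtime.h>

__global__ void assign_charge_cic(const float4 *pos_q,  // x, y, z, charge
                                  int nlocal,
                                  float inv_hx, float inv_hy, float inv_hz,
                                  int nx, int ny, int nz,
                                  float *mesh) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nlocal) return;

  float4 p = pos_q[i];
  // Fractional mesh coordinates of the particle.
  float gx = p.x * inv_hx, gy = p.y * inv_hy, gz = p.z * inv_hz;
  int ix = (int)gx, iy = (int)gy, iz = (int)gz;
  float fx = gx - ix, fy = gy - iy, fz = gz - iz;

  // Spread the charge over the 2x2x2 surrounding mesh points.
  for (int dz = 0; dz < 2; ++dz) {
    float wz = dz ? fz : 1.0f - fz;
    int kz = (iz + dz) % nz;
    for (int dy = 0; dy < 2; ++dy) {
      float wy = dy ? fy : 1.0f - fy;
      int ky = (iy + dy) % ny;
      for (int dx = 0; dx < 2; ++dx) {
        float wx = dx ? fx : 1.0f - fx;
        int kx = (ix + dx) % nx;
        // Neighboring particles write to overlapping mesh points, so every
        // update must be atomic; the latency and serialization of these
        // floating-point atomics limit charge-assignment throughput.
        atomicAdd(&mesh[(kx * ny + ky) * nz + kz], p.w * wx * wy * wz);
      }
    }
  }
}
```

Because nearby particles touch overlapping mesh points, the throughput of a kernel of this form is bounded by contention on the atomic updates; mitigations such as spatially sorting particles or staging mesh tiles in shared memory before committing them to global memory trade extra work for fewer conflicting atomics.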

Acknowledgements

This research was conducted in part under the auspices of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. This research used resources of the Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

References (20)

  • W.M. Brown et al., Computer Physics Communications (2011)
  • D.J. Hardy et al., Parallel Computing (2009)
  • S. Plimpton, Journal of Computational Physics (1995)
  • V.V. Kindratenko, J.J. Enos, G.C. Shi, M.T. Showerman, G.W. Arnold, J.E. Stone, J.C. Phillips, W.M. Hwu, in: IEEE...
  • Top500, Top500 Supercomputer Sites, http://www.top500.org,...
  • P. Ewald, Ann. Phys. (Leipzig) (1921)
  • T. Darden et al., Journal of Chemical Physics (1993)
  • U. Essmann et al., Journal of Chemical Physics (1995)
  • R.W. Hockney et al., Computer Simulation Using Particles (1988)
  • M.J. Harvey et al., Journal of Chemical Theory and Computation (2009)