A massively parallel, multi-disciplinary Barnes–Hut tree code for extreme-scale N-body simulations

https://doi.org/10.1016/j.cpc.2011.12.013

Abstract

The efficient parallelization of fast multipole-based algorithms for the N-body problem is one of the most challenging topics in high performance scientific computing. The emergence of non-local, irregular communication patterns generated by these algorithms can easily create an insurmountable bottleneck on supercomputers with hundreds of thousands of cores. To overcome this obstacle we have developed an innovative parallelization strategy for Barnes–Hut tree codes on present and upcoming HPC multicore architectures. This scheme, based on a combined MPI–Pthreads approach, permits an efficient overlap of computation and data exchange. We highlight the capabilities of this method on the full IBM Blue Gene/P system JUGENE at Jülich Supercomputing Centre and demonstrate scaling across 294,912 cores with up to 2,048,000,000 particles. Applying our implementation pepc to laser–plasma interaction and vortex particle methods close to the continuum limit, we demonstrate its potential for ground-breaking advances in large-scale particle simulations.

Introduction

Despite rapid development in supercomputer performance over the last three decades, the N-body problem still cannot be solved in a direct way, nor will it be in the foreseeable future. What makes this problem so challenging – and so important as a driver for supercomputing – is the uncompromising nature of a 1/r force law coupled with the need to compute O(N²) pairwise interactions for a system of N particles. Since hardware acceleration and/or parallelism can postpone, but not remove, the bottleneck implied by such an algorithmic complexity, much attention has been devoted to rapid summation techniques which reduce the computational effort to O(N) or O(N log N). In molecular dynamics simulations of soft matter and plasmas, the particle–particle/particle–mesh (P³M) method and variations thereof have traditionally been deployed to model periodic systems dominated by Coulomb interactions; most recently, the ddcMD code was successfully scaled across the entire 294,912-core Blue Gene/P at Jülich Supercomputing Centre (JSC) [1].
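For orientation, the cost that these fast summation methods avoid is that of the brute-force evaluation, in which every particle interacts with every other through the 1/r potential. A minimal sketch in C is given below; the particle layout and names are illustrative only and do not reflect pepc's data structures.

    #include <math.h>
    #include <stddef.h>

    typedef struct {
        double x, y, z, q;     /* position and charge (or mass)     */
        double ex, ey, ez;     /* accumulated field / acceleration  */
    } particle_t;

    /* Direct O(N^2) evaluation of the 1/r pair interaction, with a small
     * Plummer-type softening eps to avoid the singularity at r = 0. */
    void direct_sum(particle_t *p, size_t n, double eps)
    {
        for (size_t i = 0; i < n; ++i) {
            p[i].ex = p[i].ey = p[i].ez = 0.0;
            for (size_t j = 0; j < n; ++j) {
                if (i == j) continue;
                double dx = p[i].x - p[j].x;
                double dy = p[i].y - p[j].y;
                double dz = p[i].z - p[j].z;
                double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
                double rinv3 = 1.0 / (r2 * sqrt(r2));   /* 1/r^3 */
                p[i].ex += p[j].q * dx * rinv3;
                p[i].ey += p[j].q * dy * rinv3;
                p[i].ez += p[j].q * dz * rinv3;
            }
        }
    }

Doubling N quadruples the work of this loop nest, which is precisely the scaling that the tree and multipole methods discussed below break.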

While a mesh-based approach is suited to broadly homogeneous systems with periodic boundaries, it is less appropriate for geometrically complex N-body problems exhibiting a high dynamic range in their density distribution, as found, for example, in astrophysics or plasma physics. For these applications, mesh-free multipole techniques based on either the Barnes–Hut (BH) tree algorithm [2] or the Fast Multipole Method (FMM) [3] are preferable. Both of these methods rely on a hierarchical oct-tree data structure to store local multipole expansions of the particle charge or mass distribution. Since this additional information typically consumes more memory than the particles themselves, sharing it among many processors makes parallel scaling of the complete force summation a notoriously difficult task.
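The hierarchical bookkeeping shared by both methods can be pictured as an oct-tree whose nodes carry low-order multipole moments of the charge or mass they enclose, which also makes it clear why this metadata soon outweighs the particle data itself. The node layout below is a hypothetical illustration, not the structure used in pepc.

    #include <stdint.h>
    #include <stddef.h>

    /* One oct-tree node: cell geometry, low-order multipole moments of the
     * enclosed charge/mass distribution, and links to up to eight children. */
    typedef struct tree_node {
        uint64_t key;             /* space-filling-curve cell key                  */
        double   center[3];       /* geometric center of the cell                  */
        double   size;            /* cell edge length                              */
        double   monopole;        /* total charge or mass                          */
        double   dipole[3];       /* first moments                                 */
        double   quadrupole[6];   /* symmetric second moments (xx,xy,xz,yy,yz,zz)  */
        size_t   first_particle;  /* index range of particles held by a leaf       */
        size_t   n_particles;
        struct tree_node *child[8];
    } tree_node_t;

Already at quadrupole order a node of this illustrative layout occupies roughly 200 bytes, a few times the storage of a single particle, which is why distributing the tree itself across processors becomes the central difficulty.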

In this paper we present a new, highly scalable implementation of the Barnes–Hut tree algorithm: we note in passing that parallel versions of the FMM have also been reported recently, scaling to 64k [4] and 288k [5] cores respectively. An important caveat of these works is that the benchmarks were performed for static systems in which the domain decomposition could be adjusted a priori to ensure load balancing. Despite theoretical work which suggests load balancing can be generally achieved [6], this has yet to be demonstrated in a highly dynamic, inhomogeneous simulation using FMM.

By contrast, several groups including JSC staff have demonstrated load-balanced BH tree codes for electrostatics [7], astrophysics [8], and fluid dynamics applications [9]. Our own persistence with the BH algorithm owes more to the fact that the code can easily be adapted to new problems by exchanging the interaction kernel, a feature which has led to the rapid establishment of a whole family of codes at JSC.
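Because the interaction kernel enters only at the innermost particle–multipole evaluation, switching the physics largely amounts to switching one routine. The sketch below shows this plug-in idea in generic C; the function names, signatures, and the smoothed vortex-style kernel are illustrative assumptions and do not describe pepc's actual interface.

    #include <math.h>

    /* Generic pairwise kernel: given the separation vector d = target - source
     * and the source strength, accumulate the contribution to the field. */
    typedef void (*interaction_kernel)(const double d[3], double strength,
                                       double field[3]);

    /* Coulomb / gravitational monopole kernel: strength * d / |d|^3. */
    static void coulomb_kernel(const double d[3], double strength, double field[3])
    {
        double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
        double rinv3 = 1.0 / (r2 * sqrt(r2));
        for (int k = 0; k < 3; ++k)
            field[k] += strength * d[k] * rinv3;
    }

    /* Algebraically smoothed kernel of the kind used in regularized vortex
     * particle methods, with core radius sigma (form chosen for illustration). */
    static void smoothed_kernel(const double d[3], double strength, double field[3])
    {
        const double sigma = 0.05;
        double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2] + sigma*sigma;
        double rinv3 = 1.0 / (r2 * sqrt(r2));
        for (int k = 0; k < 3; ++k)
            field[k] += strength * d[k] * rinv3;
    }

A traversal routine written against the interaction_kernel type can then serve electrostatics, gravity, or vortex dynamics without touching the tree machinery.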

Since the seminal work by Warren and Salmon [10], many efforts have been made to parallelize the BH tree algorithm with varying degrees of success. A flurry of early attempts [11], [12], [13], [14] reported scaling to at most a few hundred processors, limited mainly by the machine sizes available during the late 1990s. With the explosion in core numbers seen in Petascale machines over the last five years, the demands on tree algorithms in terms of memory footprint and load balance have been stretched to the point where a complete redesign is necessary to scale beyond a few thousand cores.
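A common ingredient of Warren–Salmon-style parallel tree codes is a domain decomposition along a space-filling curve: particle coordinates are interleaved bit by bit into a single key, the keys are sorted globally, and contiguous key ranges (weighted by the work recorded in the previous step) are assigned to processors so that each domain stays spatially compact. The Morton (Z-order) key construction below, assuming coordinates already scaled to [0,1) and 21 bits per axis, illustrates the general technique rather than pepc's specific key scheme.

    #include <stdint.h>

    /* Spread the lowest 21 bits of v so that they occupy every third bit. */
    static uint64_t spread_bits(uint64_t v)
    {
        v &= 0x1fffffULL;                            /* keep 21 bits */
        v = (v | (v << 32)) & 0x001f00000000ffffULL;
        v = (v | (v << 16)) & 0x001f0000ff0000ffULL;
        v = (v | (v <<  8)) & 0x100f00f00f00f00fULL;
        v = (v | (v <<  4)) & 0x10c30c30c30c30c3ULL;
        v = (v | (v <<  2)) & 0x1249249249249249ULL;
        return v;
    }

    /* 63-bit Morton key from coordinates normalized to [0,1). */
    static uint64_t morton_key(double x, double y, double z)
    {
        const double scale = (double)(1ULL << 21);
        uint64_t ix = (uint64_t)(x * scale);
        uint64_t iy = (uint64_t)(y * scale);
        uint64_t iz = (uint64_t)(z * scale);
        return spread_bits(ix) | (spread_bits(iy) << 1) | (spread_bits(iz) << 2);
    }

Because nearby particles receive nearby keys, splitting the sorted key array at work-balanced positions simultaneously balances load and keeps communication local.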

One way around this bottleneck is to use GPGPU accelerators, as demonstrated to great effect by Hamada et al. [15]. While this approach offers an attractive short-term solution, it still leaves open the question of whether tree codes are in principle suited to future exascale architectures with millions of cores. The best-scaling BH tree code prior to the present work stands at around 8k cores [7], [8], [16]. Here we present a new load-balanced, hybrid MPI–Pthread algorithm which makes effective use of all 288k cores of JUGENE, and which is also well placed to take advantage of multicore architectures coming online within the next few years.

The paper is organized as follows: in Section 2, the hybrid tree algorithm is described with particular attention to the tree traversal routine, where bottlenecks are typically found in other implementations. Section 3 contains detailed performance and scalability analyses of the communication-intensive parts of the code. Finally, examples of two applications currently being pursued in the Jülich and Lugano groups are discussed in which major qualitative progress can be expected by exploiting the huge improvements in statistical fidelity afforded by our new algorithm.

Section snippets

The hybrid parallel Barnes–Hut tree code pepc

Over the past few years at JSC, we have developed a parallel tree code named pepc – ‘Pretty Efficient Parallel Coulomb-solver’ [17]. This highly portable code was originally conceived for mesh-free modeling of complex plasma systems [18], [19], [20], but has since been greatly extended to cover gravitational problems [21], smoothed particle hydrodynamics [22], fusion plasmas [23] and vortex particle methods for fluid dynamics, the latter utilizing a broad class of kernels differing from the …

Performance analysis

The most important high performance supercomputer currently installed at Jülich Supercomputing Centre is the IBM Blue Gene/P system JUGENE, a 72-rack installation with P = 73,728 four-way SMP nodes (PowerPC 450 processors), each having 2 GB of main memory. With a total number of C = 294,912 = 288k compute cores it is one of the fastest supercomputers worldwide and still has the largest core count at the time of writing [29]. In this section, we will demonstrate the scaling behavior of pepc with its …
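The headline figure follows directly from the machine layout (72 racks × 1024 nodes × 4 cores = 294,912 = 288 × 1024 cores), and scaling runs of this kind are conventionally summarized as parallel efficiency relative to a smaller baseline. The snippet below only restates that arithmetic and the standard strong-scaling formula; the timing values are placeholders, not measurements from pepc.

    #include <stdio.h>

    int main(void)
    {
        /* JUGENE layout: 72 racks x 1024 nodes x 4 cores per node. */
        const long cores_total = 72L * 1024L * 4L;
        printf("total cores: %ld (= 288 * 1024)\n", cores_total);

        /* Strong-scaling efficiency E(P) = T(P_ref) * P_ref / (T(P) * P),
         * evaluated here with purely hypothetical wall-clock times. */
        const long   p_ref = 4096,  p = 294912;
        const double t_ref = 120.0, t = 2.1;
        printf("parallel efficiency vs. %ld cores: %.2f\n",
               p_ref, (t_ref * (double)p_ref) / (t * (double)p));
        return 0;
    }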

Fields of applications

The present code is already being used for a wide range of applications [23], including stellar accretion discs, creation and transport of laser-accelerated ion beams, plasma–wall interactions in tokamaks, strongly coupled plasmas and vortex fluids. We show two specific examples here which demonstrate the transformative potential of this technique: transport and optical properties of Coulomb clusters and interacting fluid vortices.

Summary and outlook

In this paper we have demonstrated the outstanding capabilities of the parallel Barnes–Hut tree code pepc at extreme scales. For the first time, a pure particle-based Barnes–Hut ansatz is able to utilize 288k cores, simulating up to 2 billion particles. The key innovation presented here is a novel, combined MPI–Pthreads traversal algorithm. A rigorous algorithmic distinction between communication- and worker-threads allows for a highly efficient overlap of the irregular communication and the …
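The communicator/worker split named above can be pictured as one dedicated thread per MPI rank servicing requests for remote tree nodes while the remaining threads traverse the tree and accumulate forces. The skeleton below shows only this threading pattern; all names are illustrative, and the MPI request handling performed by pepc is reduced to a stub.

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NUM_WORKERS 3

    static atomic_int traversal_done;

    /* Dedicated communicator thread: in the real code this would poll MPI for
     * incoming requests for locally owned tree nodes and ship multipole data
     * back to the requesters; here it only spins until the workers finish. */
    static void *communicator(void *arg)
    {
        (void)arg;
        while (!atomic_load(&traversal_done))
            sched_yield();              /* e.g. MPI_Iprobe / MPI_Test here */
        return NULL;
    }

    /* Worker thread: traverses the tree for its share of local particles,
     * posting requests for missing remote nodes to the communicator. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld: traversing its chunk of local particles\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t comm, workers[NUM_WORKERS];
        pthread_create(&comm, NULL, communicator, NULL);
        for (long i = 0; i < NUM_WORKERS; ++i)
            pthread_create(&workers[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_WORKERS; ++i)
            pthread_join(workers[i], NULL);
        atomic_store(&traversal_done, 1);
        pthread_join(comm, NULL);
        return 0;
    }

Keeping communication in its own thread is what lets data exchange proceed while the workers are busy with force evaluation, rather than only at synchronization points between traversal phases.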

Acknowledgements

The authors gratefully acknowledge the helpful support by Jülich Supercomputing Centre and the JSC staff, especially M. Stephan and J. Docter. This work was supported in part by the Alliance Program of the Helmholtz Association (HA216/EMMI), the BMBF project ScaFaCoS and the EU TEXT project, as well as additional computing time via the VSR project JZAM04. R.S. and R.K. would like to thank the Swiss Platform for High-Performance and High-Productivity Computing (HP2C) for funding and support.

References (36)

  • L. Greengard et al., Journal of Computational Physics (1987)
  • Y.M. Marzouk et al., Journal of Computational Physics (2005)
  • J. Dubinski, New Astronomy (1996)
  • U. Becciani, Journal of Computational Physics (1996)
  • M.S. Warren et al., Computer Physics Communications (1995)
  • G. Winckelmans et al., Journal of Computational Physics (1993)
  • D.F. Richards
  • J. Barnes et al., Nature (1986)
  • I. Lashuk
  • I. Kabadshow, H. Dachsel, in: SIAM Conference on Computational Science and Engineering (CSE 11), Reno, February...
  • S.-H. Teng, SIAM Journal on Scientific and Statistical Computing (1998)
  • R. Speck, Parallel Computing: From Multicores and GPU's to Petascale (2010)
  • P. Jetley, et al., in: IEEE International Symposium on Parallel and Distributed Processing, 2008, pp....
  • M.S. Warren, J.K. Salmon, in: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, Portland, Oregon, 1993,...
  • G. Winckelmans, et al., in: European Series in Applied and Industrial Mathematics: Proceedings, vol. 1,...
  • A.Y. Grama
  • T. Hamada et al.
  • V. Springel, Monthly Notices of the Royal Astronomical Society (2005)