A massively parallel, multi-disciplinary Barnes–Hut tree code for extreme-scale N-body simulations
Introduction
Despite rapid development in supercomputer performance over the last three decades, the N-body problem still cannot – and for the foreseeable future will not – be solved by direct summation for large N. What makes this problem so challenging – and important as a driver for supercomputing – is the uncompromising nature of a long-range force law, coupled with the need to compute pairwise interactions for a system of N particles, an effort which grows as O(N²). Since hardware acceleration and/or parallelism can postpone, but not remove, the bottleneck implied by such an algorithmic complexity, much attention has been devoted to rapid summation techniques which reduce the computational effort to O(N log N) or O(N). In molecular dynamics simulations of soft matter and plasmas, the particle–particle/particle–mesh (P3M) method and variations thereof have traditionally been deployed to model periodic systems dominated by Coulomb interactions: most recently the ddcMD code was successfully scaled across the entire 294,912-core Blue Gene/P at Jülich Supercomputing Centre (JSC) [1].
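For orientation, the brute-force baseline can be written down in a few lines. The following sketch is ours, not taken from any of the cited codes; the function name and the softening parameter eps are illustrative. It computes all pairwise Coulomb-like forces and makes the O(N²) cost explicit:

```python
import math

def direct_forces(pos, charge, eps=1e-3):
    """All-pairs Coulomb-like forces with Plummer softening eps.

    Cost is N(N-1)/2 interactions, i.e. O(N^2) -- the bottleneck that
    tree codes and P3M-type methods are designed to avoid.
    """
    n = len(pos)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):              # each pair visited once
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps * eps
            f = charge[i] * charge[j] / (r2 * math.sqrt(r2))
            for k in range(3):
                forces[i][k] -= f * dx[k]      # force on i due to j
                forces[j][k] += f * dx[k]      # Newton's third law
    return forces
```

For two like charges the resulting forces are equal and opposite, and doubling N quadruples the work, which is precisely the scaling the methods discussed below circumvent.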
While a mesh-based approach is suited to broadly homogeneous systems with periodic boundaries, it is less appropriate for geometrically complex N-body problems exhibiting a high dynamic range in their density distribution, as found, for example, in astrophysics or plasma physics. For these applications, mesh-free multipole techniques based on either the Barnes–Hut (BH) tree algorithm [2] or the Fast Multipole Method (FMM) [3] are preferable. Both of these methods rely on a hierarchical oct-tree data structure to store local multipole expansions of the particle charge or mass distribution. Since this additional information typically consumes more memory than the particles themselves, sharing it among many processors makes parallel scaling of the complete force summation a notoriously difficult task.
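To illustrate the traversal principle shared by such codes, the following is a minimal, monopole-only Barnes–Hut evaluation with the classic s/d &lt; θ acceptance criterion. All names are ours, and the sketch omits tree construction and higher multipole moments:

```python
import math

class Node:
    """One cell of the oct-tree, reduced to its monopole moment."""
    def __init__(self, size, com, mass, children=()):
        self.size = size          # edge length s of the cell
        self.com = com            # center of mass of the cell
        self.mass = mass          # total mass (monopole moment)
        self.children = children  # empty tuple => leaf particle

def accel(node, p, theta=0.5, eps=1e-3):
    """Gravitational acceleration at point p due to the subtree `node`."""
    dx = [node.com[k] - p[k] for k in range(3)]
    d = math.sqrt(sum(c * c for c in dx)) + eps
    if not node.children or node.size / d < theta:
        # Cell is far enough away (or a leaf): accept its multipole,
        # here just the monopole term M / d^3 * dx.
        f = node.mass / d**3
        return [f * c for c in dx]
    # Otherwise open the cell and recurse into its daughter cells.
    a = [0.0, 0.0, 0.0]
    for child in node.children:
        ca = accel(child, p, theta, eps)
        for k in range(3):
            a[k] += ca[k]
    return a
```

Because distant particles are absorbed into ever larger accepted cells, each traversal touches only O(log N) nodes, giving the O(N log N) total cost quoted above.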
In this paper we present a new, highly scalable implementation of the Barnes–Hut tree algorithm: we note in passing that parallel versions of the FMM have also been reported recently, scaling to 64k [4] and 288k [5] cores respectively. An important caveat of these works is that the benchmarks were performed for static systems in which the domain decomposition could be adjusted a priori to ensure load balancing. Despite theoretical work which suggests load balancing can be generally achieved [6], this has yet to be demonstrated in a highly dynamic, inhomogeneous simulation using FMM.
By contrast, several groups including JSC staff have demonstrated load-balanced BH tree codes for electrostatics [7], astrophysics [8], and fluid dynamics applications [9]. Our own persistence with the BH algorithm owes more to the fact that the code can easily be adapted to new problems by exchanging the interaction kernel, a feature which has led to the rapid establishment of a whole family of codes at JSC.
Since the seminal work by Warren and Salmon [10], many efforts have been made to parallelize the BH tree algorithm, with varying degrees of success. A flurry of early attempts [11], [12], [13], [14] reported scaling to at most a few hundred processors, limited mainly by the machine sizes available during the late 1990s. With the explosion in core numbers seen in Petascale machines over the last five years, the demands on tree algorithms in terms of memory footprint and load balance have been stretched to the point where a complete redesign is necessary to scale beyond a few thousand cores.
One way around this bottleneck is to use GPGPU accelerators, as demonstrated to great effect by Hamada et al. [15]. While this approach offers an attractive short-term solution, it still leaves open the question of whether tree codes are in principle suited to future exascale architectures with millions of cores. The best-scaling BH tree code prior to the present work stands at around 8k cores [7], [8], [16]. Here we present a new load-balanced, hybrid MPI–Pthreads algorithm which makes effective use of all 288k cores of JUGENE, and which is also well placed to take advantage of multicore architectures coming online within the next few years.
The paper is organized as follows: in Section 2, the hybrid tree algorithm is described with particular attention to the tree traversal routine, where bottlenecks are typically found in other implementations. Section 3 contains detailed performance and scalability analyses of the communication-intensive parts of the code. Finally, examples of two applications currently being pursued in the Jülich and Lugano groups are discussed in which major qualitative progress can be expected by exploiting the huge improvements in statistical fidelity afforded by our new algorithm.
The hybrid parallel Barnes–Hut tree code pepc
Over the past few years at JSC, we have developed a parallel tree code named pepc – ‘Pretty Efficient Parallel Coulomb-solver’ [17]. This highly portable code was originally conceived for mesh-free modeling of complex plasma systems [18], [19], [20], but has since been greatly extended to cover gravitational problems [21], smoothed particle hydrodynamics [22], fusion plasmas [23] and vortex particle methods for fluid dynamics, the latter utilizing a broad class of kernels differing from the
Performance analysis
The most important high-performance supercomputer currently installed at Jülich Supercomputing Centre is the IBM Blue Gene/P system JUGENE, a 72-rack installation with four-way SMP nodes (PowerPC 450 processors), each having 2 GB of main memory. With a total of 294,912 compute cores, it is one of the fastest supercomputers worldwide and still has the largest core count at the time of writing [29]. In this section, we will demonstrate the scaling behavior of pepc with its
Fields of applications
The present code is already being used for a wide range of applications [23], including stellar accretion discs, creation and transport of laser-accelerated ion beams, plasma–wall interactions in tokamaks, strongly coupled plasmas and vortex fluids. We show two specific examples here which demonstrate the transformative potential of this technique: transport and optical properties of Coulomb clusters and interacting fluid vortices.
Summary and outlook
In this paper we have demonstrated the outstanding capabilities of the parallel Barnes–Hut tree code pepc at extreme scales. For the first time, a pure particle-based Barnes–Hut ansatz is able to utilize 288k cores, simulating up to 2 billion particles. The key innovation presented here is a novel, combined MPI–Pthreads traversal algorithm. A rigorous algorithmic distinction between communication- and worker-threads allows for a highly efficient overlap of the irregular communication and the
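The division of labor between communicator- and worker-threads can be mimicked, purely for illustration, with Python threads and a queue standing in for MPI: a dedicated communicator thread services requests for remote tree nodes while workers continue traversing. This is a schematic sketch under our own naming, not pepc's actual implementation:

```python
import queue
import threading

requests = queue.Queue()   # node keys the workers still need
results = {}               # key -> fetched node data
done = threading.Event()

def communicator():
    """Services remote-node requests; stands in for the MPI thread."""
    while not done.is_set() or not requests.empty():
        try:
            key = requests.get(timeout=0.01)
        except queue.Empty:
            continue
        results[key] = f"node-{key}"   # pretend this arrived over MPI
        requests.task_done()

def worker(keys):
    """A traversal thread: defers missing nodes instead of blocking."""
    for k in keys:
        requests.put(k)    # request the remote node, keep working
    requests.join()        # here the real code would continue traversal

comm = threading.Thread(target=communicator)
comm.start()
worker(range(4))
done.set()
comm.join()
```

The point of the separation is that request servicing never blocks the compute threads, so the irregular communication overlaps with useful work.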
Acknowledgements
The authors gratefully acknowledge the helpful support by Jülich Supercomputing Centre and the JSC staff, especially M. Stephan and J. Docter. This work was supported in part by the Alliance Program of the Helmholtz Association (HA216/EMMI), the BMBF project ScaFaCoS and the EU TEXT project, as well as additional computing time via the VSR project JZAM04. R.S. and R.K. would like to thank the Swiss Platform for High-Performance and High-Productivity Computing (HP2C) for funding and support.
References (36)
- et al., Journal of Computational Physics (1987)
- et al., Journal of Computational Physics (2005)
- New Astronomy (1996)
- Journal of Computational Physics (1996)
- et al., Computer Physics Communications (1995)
- et al., Journal of Computational Physics (1993)
- et al., Nature (1986)
- I. Kabadshow, H. Dachsel, in: SIAM Conference on Computational Science and Engineering (CSE 11), Reno, February...