1 Background
A collection of straightforward implementations can be found on the Nbabel.org website. This site contains examples of a simple N-body simulation code implemented in a wide range of programming languages. In practice, however, there are many variations of the algorithm in use, with up to eighth order integrators (Nitadori and Makino 2008) and algorithmic extensions such as block time-stepping (McMillan 1986) and neighbor schemes (Ahmad and Cohen 1973); see Bédorf and Portegies Zwart (2012) and references therein for more examples. These variations turn the simple \(O(N^{2})\) shared time-step implementation into a complex method in which the amount of parallelism can differ per time-step. Especially the dynamic block time-stepping method adds complexity to the algorithm, since the number of particles that participate in the computations changes with each integration step. This variable number of particles involved in computing forces requires different parallelization strategies. In the worst case only one particle is integrated, which eliminates most of the standard parallelization methods for \(N^{2}\) algorithms.

There is extensive literature on high performance direct N-body methods, with the first being described in 1963 (Aarseth 1963). The method has been efficiently implemented on parallel machines (McMillan 1986), vector machines (Hertz and McMillan 1988) and dedicated hardware such as the GRAPEs (Makino and Taiji 1998). For an overview we refer the interested reader to the reviews of Bédorf and Portegies Zwart (2012), Heggie and Hut (2003) and Dehnen and Read (2011).

Furthermore, there has been extensive work on accelerating N-body methods using GPUs, and several libraries have been developed to ease the writing of N-body integrators that use the GPU. The first library that offered support for the GRAPE API was Kirin (Belleman et al. 2008); this library, however, only supports single precision and is therefore less accurate than the GRAPE. With the introduction of the Yebisu library (Nitadori 2009) there was support for double-single precision, which achieved accuracy comparable to the GRAPE. The library also featured support for fourth and sixth order Hermite integrators in combination with minimized data transfer by performing the prediction on the GPU. This library, however, is not compatible with the GRAPE API and only supports a single GPU. In our previous work, Sapporo1 (Gaburov and Harfst 2009), we added support for multiple GPUs in combination with the GRAPE API and double-single precision.

Apart from libraries there are also N-body integrators that come with built-in support for GPU hardware. For example, Berczik et al. (2011) combine Yebisu and phiGRAPE (Harfst et al. 2007) in the new phiGPU code. This code is able to run on multiple GPUs and supports up to eighth order accuracy. Capuzzo-Dolcetta et al. (2013) and Capuzzo-Dolcetta and Spera (2013) introduce the HiGPUs N-body code. This standalone code contains a sixth order integrator and supports CUDA, OpenCL and IEEE-754 double precision accuracy. Finally, there is NBODY6, which uses GPU acceleration together with an Ahmad-Cohen neighbor scheme (Ahmad and Cohen 1973; Nitadori and Aarseth 2012).

In this work we present Sapporo2; since we focus on the library we will not make a full comparison with the standalone software packages mentioned above. The library contains built-in support for second order leap-frog (GRAPE-5), fourth order Hermite (GRAPE-6) and sixth order Hermite integrators. The numerical precision can be specified at run time and depends on the requirements for performance and accuracy. Furthermore, the library can keep track of the nearest neighbors by returning a list containing all particles within a certain radius. Depending on the available hardware the library operates with CUDA or OpenCL, and it has the option to run on multiple GPUs installed in the same compute node. The library computes the gravitational force on particles that are integrated with block time-step algorithms, but it can trivially be applied to any other \(O(N^{2})\) particle method by replacing the force equations. For example, methods that compute Coulomb interactions (Gorp et al. 2011) or molecular dynamics (van Meel et al. 2008) use methods similar to the ones presented in this work.

2 Methods
2.1 Parallelization method
A common way to parallelize direct N-body codes on GPUs is to assign one thread to each sink particle. This differs from Sapporo1 (Gaburov and Harfst 2009), which takes a slightly different approach. In Sapporo1 we parallelize over the source particles and keep the number of sink particles that are integrated concurrently fixed. The source particles are split into subsets, each of which forms the input against which a set of sink particles is integrated. The smaller the number of sink particles, the more subsets of source particles we can make. With enough subsets it is possible to saturate the GPU, so if the product of the number of sink and source particles is large enough, high performance can be reached even if the number of sinks or sources is small. This makes the Sapporo1 method more suitable for block time-stepping algorithms, as commonly used in high precision gravitational N-body simulations, even though it requires an extra step to combine the partial results from the different subsets. The Sapporo1 method is therefore also applied in this work. With Sapporo1 being around for 5 years we completely rewrote it and renamed it Sapporo2, which is compatible with current hardware and is easy to tune for future generation accelerator devices and algorithms using the supplied test scripts. The following paragraphs describe the implementation and the choices we made.
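To illustrate this parallelization strategy, the sketch below shows a partial-force kernel in which each thread-block handles one subset of the source particles for the same fixed block of sink particles. It is a simplified single precision example with invented names (partial_forces, nSubsets); the actual Sapporo2 kernels additionally handle prediction, jerk terms, neighbor lists and multiple precisions.

```cuda
// One thread per sink particle, one thread-block per subset of the sources.
// Each block writes a partial acceleration that is combined afterwards.
#include <cuda_runtime.h>

__global__ void partial_forces(const float4 *sinks,    // x, y, z, softening^2
                               const float4 *sources,  // x, y, z, mass
                               float4       *partial,  // nSubsets * nSinks entries
                               int nSinks, int nSources, int nSubsets)
{
    const int i      = threadIdx.x;   // sink particle handled by this thread
    const int subset = blockIdx.x;    // source subset handled by this block
    if (i >= nSinks) return;

    const float4 pi = sinks[i];
    float ax = 0.f, ay = 0.f, az = 0.f, pot = 0.f;

    // Loop only over the sources that belong to this subset (strided split).
    for (int j = subset; j < nSources; j += nSubsets) {
        const float4 pj = sources[j];
        const float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        const float r2 = dx*dx + dy*dy + dz*dz + pi.w;
        const float rinv   = rsqrtf(r2);
        const float mrinv3 = pj.w * rinv * rinv * rinv;
        ax += mrinv3 * dx;  ay += mrinv3 * dy;  az += mrinv3 * dz;
        pot -= pj.w * rinv;
    }
    // Partial result of this subset; the partial results of all subsets are
    // combined in a second step (or with atomics, see Section 2.2.4).
    partial[subset * nSinks + i] = make_float4(ax, ay, az, pot);
}
```

Because each subset is an independent thread-block, the number of subsets can be increased until the device is saturated, even when only a handful of sink particles is active.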
2.2 Implementation

2.2.1 CUDA and OpenCL
CUDA was originally presented as a set of extensions to the C programming language and as such came with language changes. These extensions are part of the device code and, more importantly, of the host code. The use of these extensions requires that the host code is compiled with the compiler supplied by NVIDIA. With the introduction of the 'driver API' this was no longer required. The 'driver API' does not require modifications to the C language for the host code. However, writing CUDA programs with the 'driver API' is more involved than with the 'run time API', since actions that were previously done by the NVIDIA compiler now have to be performed by the programmer.

OpenCL takes a comparable approach and specifies a C-based language for the device code. There are no changes to the language used for writing the host code; instead OpenCL comes with a specification of functions to interact with the device. This specification is very similar to the one used in the CUDA driver API and follows the same program flow.

In Sapporo2 we exploited the similarity between the CUDA driver API and the OpenCL API. We developed a set of C++ classes on top of these APIs which offer a unified interface to the host code. The classes encapsulate a subset of the OpenCL and CUDA functions for creating device contexts, memory buffers (including functions to copy data) and kernel operations (loading, compiling, launching). Depending on which class is included at compile time, the code is executed using OpenCL or CUDA. The classes have no support for the more advanced CUDA features such as OpenGL and Direct3D interoperability.
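As an illustration of this design, the sketch below shows a compile-time selected buffer class with the same interface for both back-ends. The class and variable names (DeviceBuffer, USE_CUDA_BACKEND, the global context and queue) are invented for this example and do not correspond to the actual Sapporo2 classes.

```cpp
// Unified device-buffer interface; the back-end is chosen at compile time.
#include <cstddef>
#include <vector>

#ifdef USE_CUDA_BACKEND
  #include <cuda.h>                        // CUDA driver API
  template <typename T>
  class DeviceBuffer {
    CUdeviceptr ptr_; std::size_t n_;
  public:
    // Assumes a CUDA context has been created and made current elsewhere.
    explicit DeviceBuffer(std::size_t n) : n_(n) { cuMemAlloc(&ptr_, n * sizeof(T)); }
    ~DeviceBuffer()                              { cuMemFree(ptr_); }
    void h2d(const std::vector<T> &h) { cuMemcpyHtoD(ptr_, h.data(), n_ * sizeof(T)); }
    void d2h(std::vector<T> &h)       { cuMemcpyDtoH(h.data(), ptr_, n_ * sizeof(T)); }
  };
#else
  #include <CL/cl.h>                       // OpenCL API, same program flow
  extern cl_context       g_context;       // created during device initialization
  extern cl_command_queue g_queue;         // (not shown here)
  template <typename T>
  class DeviceBuffer {
    cl_mem ptr_; std::size_t n_;
  public:
    explicit DeviceBuffer(std::size_t n) : n_(n) {
      ptr_ = clCreateBuffer(g_context, CL_MEM_READ_WRITE, n * sizeof(T), NULL, NULL);
    }
    ~DeviceBuffer() { clReleaseMemObject(ptr_); }
    void h2d(const std::vector<T> &h) {
      clEnqueueWriteBuffer(g_queue, ptr_, CL_TRUE, 0, n_ * sizeof(T), h.data(), 0, NULL, NULL);
    }
    void d2h(std::vector<T> &h) {
      clEnqueueReadBuffer(g_queue, ptr_, CL_TRUE, 0, n_ * sizeof(T), h.data(), 0, NULL, NULL);
    }
  };
#endif
```

The host code only sees the DeviceBuffer interface, so switching between CUDA and OpenCL is a matter of selecting which implementation is compiled in.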
The support for C++ templates, printing and debugging functionality in CUDA makes it much more convenient to develop in pure CUDA than in OpenCL. Once we have a working CUDA version we convert it to OpenCL. The use of templates in particular reduces the amount of code: in the CUDA version all possible kernel combinations are implemented in a single file using templates, whereas for OpenCL a separate file has to be written for each combination of integrator and numerical precision. The compute kernels themselves are largely taken over from Sapporo1, with only minor changes to allow double precision data loads/stores and more efficient loop execution.
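The kind of kernel templating meant here can be sketched as follows; the kernel name, parameters and integrator tags are hypothetical and the body is reduced to a plain shared time-step loop for brevity.

```cuda
// One templated kernel source covers all precision/integrator combinations;
// the OpenCL version needs a separately written kernel per combination.
enum Integrator { LEAPFROG_2ND, HERMITE_4TH, HERMITE_6TH };

template <typename real, Integrator order>
__global__ void evaluate_forces(const real *px, const real *py, const real *pz,
                                const real *mass, real *ax, real *ay, real *az, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    real fx = 0, fy = 0, fz = 0;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        const real dx = px[j] - px[i], dy = py[j] - py[i], dz = pz[j] - pz[i];
        const real r2 = dx*dx + dy*dy + dz*dz;
        const real rinv = real(1) / sqrt(r2);
        const real mr3  = mass[j] * rinv * rinv * rinv;
        fx += mr3 * dx;  fy += mr3 * dy;  fz += mr3 * dz;
    }
    ax[i] = fx;  ay[i] = fy;  az[i] = fz;
    // A full kernel would also accumulate the jerk (and snap for order ==
    // HERMITE_6TH); the compiler removes the unused branches per instantiation.
}

// Explicit instantiations for the combinations used by the host code:
template __global__ void evaluate_forces<float,  HERMITE_4TH>(
    const float*,  const float*,  const float*,  const float*,  float*,  float*,  float*,  int);
template __global__ void evaluate_forces<double, HERMITE_6TH>(
    const double*, const double*, const double*, const double*, double*, double*, double*, int);
```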
2.2.2 Numerical accuracy

When Sapporo1 was developed (before the GT200 chips), GPUs lacked support for IEEE-754 double precision computations and therefore all the compute work was done in either single or double-single precision. The resulting force computation had a precision similar to that of the, at that time, commonly used GRAPE hardware (Makino and Taiji 1998; Gaburov and Harfst 2009). This level of accuracy is sufficient for the fourth order Hermite integration scheme (Makino and Aarseth 1992; Portegies Zwart and Boekholt 2014). Currently, however, there are integrators that accurately solve the equations of motion of stars around black holes, planets around stars and similar systems with high mass ratios. For these kinds of simulations one often prefers IEEE-754 double precision to solve the equations of motion. The current generation of GPUs supports IEEE-754, which enables computations that require this high level of accuracy.

The data in Sapporo2 is therefore always stored in double precision. The advantage is that we can easily add higher order integrators that require double precision computations, without having to rewrite major parts of the host code. Examples of such integrators are the sixth and eighth order Hermite integrators (Nitadori and Makino 2008). The performance impact of double precision storage on algorithms that do not require double precision computations is limited: before the actual computations are executed the particle properties are converted to either float or double-single, so the storage precision does not influence the computational performance. The penalty for loading and storing twice the amount of data is relatively small, as can be seen in the results section where Sapporo1 is compared to Sapporo2.
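Double-single (also called float-float) arithmetic stores one value as the unevaluated sum of two single precision numbers. The sketch below shows the representation, the conversion from double, and the error-compensated addition (the classic two-sum construction) on which such schemes are built; it is illustrative and not taken from the Sapporo2 kernels.

```cuda
// A double-single value keeps a number as hi + lo, with hi and lo floats.
struct dsfloat { float hi, lo; };

// Split a double into a double-single value, e.g. when a kernel loads the
// double precision particle data.
__host__ __device__ inline dsfloat to_ds(double a)
{
    dsfloat r;
    r.hi = (float)a;
    r.lo = (float)(a - (double)r.hi);
    return r;
}

// Error-compensated addition of two double-single values (Dekker/Knuth
// two-sum). The accuracy approaches double precision while only single
// precision hardware operations are used. The compiler must not be allowed
// to re-associate these floating point operations.
__host__ __device__ inline dsfloat ds_add(dsfloat a, dsfloat b)
{
    const float t1 = a.hi + b.hi;
    const float e  = t1 - a.hi;
    const float t2 = ((b.hi - e) + (a.hi - (t1 - e))) + a.lo + b.lo;

    dsfloat r;
    r.hi = t1 + t2;              // renormalize so that |lo| stays small
    r.lo = t2 - (r.hi - t1);
    return r;
}
```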
2.2.3 Multiple GPUs
The library supports the use of multiple GPUs installed in the same compute node. In Sapporo1 this was implemented with the boost threading library; it is now handled using OpenMP. The multi-GPU parallelization is achieved by parallelizing over the source particles. In Sapporo1 each GPU contained a copy of all source particles (as in Harfst et al. (2007)), but in Sapporo2 the source particles are distributed over the devices using a round-robin method. Each GPU now only holds a subset of the source particles (similar to phiGPU, HiGPUs and NBODY6), which reduces the memory requirements, the transfer time and the time to execute the prediction step on the source particles. However, the order of the particle distribution, and therefore the order in which the additions are executed, changes between Sapporo1 and Sapporo2. This can lead to differences in the least significant digit when comparing the forces computed by Sapporo1 with those of Sapporo2.
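A minimal host-side sketch of this scheme is given below: the sources are dealt out round-robin over the devices and each device is driven by its own OpenMP thread. The function and structure names are invented for this illustration, and the per-device prediction kernel launch is omitted.

```cpp
// Round-robin distribution of the source particles over nGPU devices,
// one OpenMP thread per device.
#include <cstddef>
#include <omp.h>
#include <vector>
#include <cuda_runtime.h>

struct Source { double x, y, z, mass; };

void distribute_sources(const std::vector<Source> &sources, int nGPU)
{
    // Particle j is assigned to device j % nGPU, so every device only
    // stores roughly N / nGPU source particles.
    std::vector<std::vector<Source>> perDevice(nGPU);
    for (std::size_t j = 0; j < sources.size(); ++j)
        perDevice[j % nGPU].push_back(sources[j]);

    #pragma omp parallel num_threads(nGPU)
    {
        const int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        const std::vector<Source> &mine = perDevice[dev];
        Source *d_src = nullptr;
        cudaMalloc(&d_src, mine.size() * sizeof(Source));
        cudaMemcpy(d_src, mine.data(), mine.size() * sizeof(Source),
                   cudaMemcpyHostToDevice);
        // predict_kernel<<<...>>>(d_src, mine.size(), t_next);  // per-device work
        cudaFree(d_src);
    }
}
```

Because every device accumulates the partial force of its own subset, the final summation order differs from the single-GPU case, which explains the least-significant-digit differences mentioned above.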
2.2.4 Other differences
Another difference between Sapporo1 and Sapporo2 is the way the partial results of the parallelization blocks are combined. Sapporo1 contains two computational kernels to solve the gravitational forces: the first computes the partial forces for the individual blocks of source particles, and the second sums these partial results. With the use of atomic operations these two kernels can be combined, which reduces the complexity of maintaining two compute kernels when adding new functionality, at a minimal performance impact. The expectation is that future devices require more active threads to saturate the GPU, but at the same time offer improved atomic performance. The single kernel method introduced here will automatically scale to future devices and has less overhead than launching a separate reduction kernel. This reduced overhead results in slightly better performance (a few per cent) on current architectures compared to the original two kernel method. In total we now require three GPU kernels to compute gravity: a copy kernel to move particles from CPU buffers to GPU buffers, a kernel to predict the particles to the new time-step and, finally, the gravity kernel that computes the forces.
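The combination of the partial-force and reduction kernels can be sketched as follows: each block adds its partial result directly to the global acceleration buffer with an atomic addition. The helper shown here is the well-known compare-and-swap emulation of a double precision atomicAdd (native double atomics only appeared on later hardware); it illustrates the approach and is not the literal Sapporo2 code.

```cuda
// Double precision atomic addition emulated with atomicCAS, as used on
// hardware without native double atomicAdd support.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// At the end of the partial-force kernel each block then accumulates its
// partial sums for sink particle i straight into the global result, so no
// separate reduction kernel is required:
//
//   atomicAddDouble(&acc_x[i], ax);
//   atomicAddDouble(&acc_y[i], ay);
//   atomicAddDouble(&acc_z[i], az);
```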
3 Results

Production N-body integrators such as NBODY4 (Aarseth 1999) and Kira (Portegies Zwart et al. 2001) require a list of particles within a given radius of each particle to determine the perturber list. This is exactly what Sapporo1 computes and how the GRAPE hardware operates (Makino and Taiji 1998); the numerical precision used in this method is the double-single variant. In order to compare the new implementation with the results of Sapporo1, all results in this section, unless indicated otherwise, refer to the double-single fourth order Hermite integrator. Furthermore, we have enabled the computation of the nearest neighbor and the list of nearby particles, as in Sapporo1. However, if the user does not require this information it can be disabled by changing a template parameter in the code.

The machines with NVIDIA GPUs have the CUDA 5.5 toolkit and drivers installed. For the machine with the AMD card we used version 2.8.1.0 of the APP-SDK toolkit and driver version 13.4.

The properties of the GPUs used in this work are listed in Table 1; the theoretical peak performance (TPP) and practical peak performance (PPP) are given relative to the GTX480. Since theoretical performance is not always reachable, we also show the relative practical performance as computed with a simple single precision N-body kernel that is designed for shared time-steps, similar to the N-body example in the CUDA SDK (Nyland et al. 2007).
Table 1 Properties of the GPUs used in this work. The theoretical peak performance (TPP) and practical peak performance (PPP) are given relative to the GTX480.

| GPU | Memory clock (MHz) | Bus (bit) | Bandwidth (GB/s) | Cores (#) | Core clock (MHz) | SP TFlop/s | DP TFlop/s | TPP | PPP |
|---|---|---|---|---|---|---|---|---|---|
| GTX480 | 3,696 | 384 | 133.9 | 480 | 1,401 | 1.35 | 0.17 | 1 | 1 |
| GTX680 | 6,008 | 256 | 192.2 | 1,536 | 1,006 | 3.09 | 0.13 | 2.3 | 1.7 |
| K20m | 5,200 | 320 | 208 | 2,496 | 706 | 3.5 | 1.17 | 2.6 | 1.8 |
| GTX Titan | 6,144 | 384 | 288.4 | 2,688 | 837 | 4.5 | 1.5 | 3.35 | 2.2 |
| HD7970 | 5,500 | 384 | 264 | 2,048 | 925 | 3.8 | 0.94 | 2.8 | 2.3 |
3.1 Thread-block configuration
Sapporo2 is designed around the concept of processing a fixed number of sink particles for a block time-step algorithm (see Section 2.1). Therefore, the first thing to determine is the smallest number of sink particles that gives full GPU performance. To achieve full performance the computation units on the GPU have to be saturated with work. The GPU consists of a number of multiprocessors over which the computation units are spread. When the host code sends work to the GPU this is done in sets of thread-blocks, and each thread-block is executed on a multiprocessor. The blocks contain a (configurable) number of threads that can work together, while the blocks themselves are treated as independent units of work. In this section we determine the optimal number of blocks and the number of threads per block to saturate the GPU when performing the gravity computations. We test a range of configurations in which we vary the number of blocks per multiprocessor and the number of threads per block. The results for four different GPU architectures are presented in Figure 1. In this figure each line represents a certain number of blocks per multiprocessor, \(N_{\mathrm{blocks}}\). The x-axis indicates the number of threads in a thread-block, \(N_{\mathrm{threads}}\); the range of this axis depends on the hardware. For the HD7970 we cannot launch more than \(N_{\mathrm{threads}}=256\), and for the GTX480 the limit is \(N_{\mathrm{threads}}=576\). For the two Kepler devices, the GTX680 and the K20m, we can launch up to \(N_{\mathrm{threads}}=1{,}024\), giving these last two devices the largest set of configuration options. The y-axis shows the wall-clock time required to compute the forces using the indicated configuration; the bottom line indicates the optimal configuration.
For the GTX680 and the K20m the different \(N_{\mathrm{blocks}}\) configurations reach similar performance when \(N_{\mathrm{threads}} > 512\). This indicates that at that point there are so many active threads per multiprocessor that there are not enough resources (registers and/or shared memory) to accommodate multiple thread-blocks per multiprocessor at the same time. To make the code suitable for block time-steps, the configuration with the smallest number of threads that still gives the highest performance is the most suitable. For the HD7970 this is \(N_{\mathrm{threads}}=256\), while for the Kepler architectures \(N_{\mathrm{threads}}=512\) gives a slightly lower execution time than \(N_{\mathrm{threads}}=256\) and \(N_{\mathrm{threads}}=1{,}024\). However, we chose to use \(N_{\mathrm{threads}}=256\) for all configurations and use 2D thread-blocks on the Kepler devices to launch 512 or 1,024 threads. With 2D thread-blocks we mean that we launch multiple threads per sink (i) particle, whereby each thread computes a part of the source (j) particles. This increases the total number of threads that the hardware can schedule in order to hide memory latencies. Especially when the number of active i-particles is ≤128 this helps to improve the performance, as discussed in more detail in the next section. For each architecture the default configuration is indicated with the circles in Figure 1.
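The 2D thread-block idea can be sketched as follows: threadIdx.x selects the sink particle, threadIdx.y splits the loop over the source particles, and the partial sums of the y-threads are reduced in shared memory. The kernel below is a simplified single precision illustration with invented names; the launch configuration in the trailing comment is an example, not the library's actual setup.

```cuda
// 2D thread-block gravity kernel: blockDim.x sink particles per block,
// blockDim.y threads cooperating per sink (blockDim.y must be a power of two).
__global__ void forces_2d(const float4 *sinks, const float4 *sources,
                          float4 *acc, int nSinks, int nSources)
{
    extern __shared__ float4 partial[];        // blockDim.x * blockDim.y entries
    const int i  = blockIdx.x * blockDim.x + threadIdx.x;   // sink index
    const int ty = threadIdx.y;

    float ax = 0.f, ay = 0.f, az = 0.f, pot = 0.f;
    if (i < nSinks) {
        const float4 pi = sinks[i];
        for (int j = ty; j < nSources; j += blockDim.y) {    // strided j-loop
            const float4 pj = sources[j];
            const float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            const float r2 = dx*dx + dy*dy + dz*dz + pi.w;   // pi.w = softening^2
            const float rinv = rsqrtf(r2);
            const float mr3  = pj.w * rinv * rinv * rinv;
            ax += mr3 * dx;  ay += mr3 * dy;  az += mr3 * dz;  pot -= pj.w * rinv;
        }
    }
    partial[ty * blockDim.x + threadIdx.x] = make_float4(ax, ay, az, pot);
    __syncthreads();

    // Reduce the blockDim.y partial results per sink particle.
    for (int s = blockDim.y / 2; s > 0; s >>= 1) {
        if (ty < s) {
            float4 a = partial[ty * blockDim.x + threadIdx.x];
            const float4 b = partial[(ty + s) * blockDim.x + threadIdx.x];
            a.x += b.x;  a.y += b.y;  a.z += b.z;  a.w += b.w;
            partial[ty * blockDim.x + threadIdx.x] = a;
        }
        __syncthreads();
    }
    if (ty == 0 && i < nSinks) acc[i] = partial[threadIdx.x];
}

// Example launch with 256 sinks per block and 2 threads per sink (512 threads):
//   dim3 block(256, 2);
//   forces_2d<<<nBlocks, block, 256 * 2 * sizeof(float4)>>>(sinks, sources,
//                                                           acc, nSinks, nSources);
```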
3.2 Block-size/active-particles

Next we measure the performance of Sapporo2 in combination with a block time-step algorithm. We measured the time to compute the gravitational forces using either the NVIDIA GPU profiler or the built-in event timings of OpenCL. The number of active sink particles, \(N_{\mathrm{active}}\), is varied between 1 and the optimal \(N_{\mathrm{threads}}\) as determined in the previous paragraph. The results are averaged over 100 runs and presented in Figure 2. We used 131,072 source particles, which is enough to saturate the GPU and is currently the average number of particles used in direct N-body simulations that employ a block time-step integration method.
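For reference, the OpenCL event timing used for these measurements works roughly as shown below; the command queue must be created with profiling enabled. This is a generic snippet rather than Sapporo2 code.

```cpp
// Timing a kernel through OpenCL event profiling (queue created with
// the CL_QUEUE_PROFILING_ENABLE property).
#include <CL/cl.h>

double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                      size_t global_size, size_t local_size)
{
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);
    clReleaseEvent(ev);

    return (t_end - t_start) * 1.0e-6;   // nanoseconds to milliseconds
}
```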
The difference between the CUDA and OpenCL versions is minimal, which indicates that the compute part of both implementations exhibits similar behavior. For most values of \(N_{\mathrm{active}}\) the timings of Sapporo1 and Sapporo2 are comparable. Only for \(N_{\mathrm{active}} < 64\) do we see a slight advantage for Sapporo1, where the larger data loads of Sapporo2 result in a slightly longer execution time. However, the improvements made in Sapporo2 result in higher performance and a more responsive execution time compared to Sapporo1 when \(128 \leq N_{\mathrm{active}} < 160\). For the HD7970 there is barely any improvement when \(N_{\mathrm{active}}\) decreases from 256 to 128. There is a slight drop in the execution time at \(N_{\mathrm{active}}=192\), which coincides with one less active wavefront compared to \(N_{\mathrm{active}}=256\). When \(N_{\mathrm{active}} \leq 128\) we can launch 2D blocks and the performance improves again and approaches that of the NVIDIA hardware, but the larger wavefront size compared to the warp size causes the execution times to be less responsive to changes of \(N_{\mathrm{active}}\).
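For context, the variation of \(N_{\mathrm{active}}\) originates in the block time-step scheme itself: each step only the particles whose individual update time coincides with the global minimum time are integrated. A schematic host-side selection is shown below; it is illustrative and not part of the library, which only receives the resulting list of active particles.

```cpp
// Schematic active-set selection in a block time-step scheme. Because the
// block time-steps are quantized to powers of two, the time comparison is
// exact and N_active can range from 1 up to N.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int> select_active(const std::vector<double> &t_next)
{
    const double t_min = *std::min_element(t_next.begin(), t_next.end());

    std::vector<int> active;
    for (std::size_t i = 0; i < t_next.size(); ++i)
        if (t_next[i] == t_min)
            active.push_back(static_cast<int>(i));
    return active;
}
```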
3.3 Range of N

To measure the performance over a range of N we use the wall-clock time required to perform the \(N^{2}\) force computations of a single shared time-step, for the CUDA, OpenCL, Sapporo1 and Sapporo2 versions (left panel of Figure 3). The execution time includes the time required to send the input data and retrieve the results from the device.

The differences between Sapporo1 and Sapporo2 (both the CUDA and OpenCL versions) on the K20m GPU are negligible. Sapporo1 is slightly faster for \(N < 10^{4}\), because of the increased data-transfer sizes in Sapporo2, which influence the performance more when the number of computations is relatively small. Sapporo2 is slightly faster than Sapporo1 when \(N \geq 10^{4}\), because of the various optimizations added to the new version. The difference between the GTX680, K20m and HD7970 configurations is relatively small, while the GTX Titan is almost 1.5× faster and the GTX480 almost 2× slower than these three cards. These numbers are not unexpected when inspecting their theoretical performance (see Table 1). For \(N < 10^{5}\) we further see that the performance of the HD7970 is lower than that of the NVIDIA cards. This difference is caused by slower data transfer rates between the host and the device for the HD7970. Something similar can be seen when we compare the OpenCL version on the K20m with the CUDA version. Close inspection of the timings indicates that this difference is caused by longer CPU-GPU transfer times in the OpenCL version when transferring small amounts of data (\({<}100~\mbox{KB}\)), which, for small N, form a larger part of the total execution time.

3.4 Double precision vs double-single precision
Here we compare the double precision and double-single precision performance of the GTX680, K20m and HD7970. The theoretical peak performance when using double precision computations is lower than when using single precision computations: the double precision performance of the K20m is one third of its single precision performance, for the GTX680 it is \(\frac{1}{24}\)th and for the HD7970 one fourth. As in the previous section we use the wall-clock time required to perform \(N^{2}\) force computations (including the data send and receive time) to compare the devices. The results are presented in the right panel of Figure 3; the double precision timings are indicated with open symbols and the double-single timings with filled symbols.

In full double precision the GTX680 is slower than the other two devices, while the performance of the K20m and the HD7970 is comparable for \(N > 10^{4}\). For smaller N the performance is influenced more by the transfer rates between the host and the device than by the actual compute speed.

The performance of the GTX680 in full double precision is roughly \(10\times\) lower than when using double-single precision. For the other two cards the double precision performance is roughly \(2.8\times\) lower. For all the devices this is about a factor of 2 away from what can be expected based on the specifications. This difference can be explained by the fact that the number of operations is not exactly the same for the two versions, and that even in the double-single method we use the special function units to compute the rsqrt. Another reason for the discrepancy between the practical and theoretical numbers is that we keep track of the nearest neighbors, which requires the same operations in the double-single and the double precision implementations. Combining this with the fact that we already execute a number of double precision operations to perform atomic additions and data reads explains the observed difference between the theoretical and empirically found performance numbers.
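The special-function use mentioned above can be illustrated with the following reciprocal square root helper: the single precision rsqrtf of the special function unit provides the initial estimate, which Newton-Raphson iterations then refine. This is a common construction in GPU N-body kernels and is shown here for illustration only; it is not the literal Sapporo2 implementation.

```cuda
// Reciprocal square root built on the single precision special function unit.
__device__ inline double fast_rsqrt(double r2)
{
    double x = (double)rsqrtf((float)r2);   // ~23-bit estimate from the SFU
    x *= 1.5 - 0.5 * r2 * x * x;            // one Newton-Raphson step (~46 bits),
                                            // roughly double-single accuracy
    x *= 1.5 - 0.5 * r2 * x * x;            // a second step for full double precision
    return x;
}
```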
3.5 Sixth order performance

3.6 Multi-GPU
Sapporo2 supports the use of multiple GPUs in parallel. The parallelized parts are the force computation, the data transfer and the prediction of the source particles. The transfer of particle properties to the device and the transfer of the force computation results back from the device are serial operations; these have a small but constant overhead, independent of the number of GPUs. For the measurements in this section we use the total wall-clock time required to compute the forces on N particles (as in Section 3.3). The speed-up compared to 1 GPU is presented in Figure 5. The timings are from the K20m GPUs, which have enough memory to store up to \(8\times10^{6}\) particles; we use shared time-steps for these timings. For \(N > 10^{4}\) it is efficient to use all available GPUs in the system, and for \(N \leq 10^{4}\) all multi-GPU configurations show similar performance. The only exception is \(N = 10^{3}\), at which point the overhead of using 4 GPUs is larger than the gain in compute power. For large enough N the scaling (\(T_{\mathrm{single}\mbox{-}\mathrm{GPU}}/ T_{\mathrm{multi}\mbox{-}\mathrm{GPU}}\)) is near perfect, since the execution time is dominated by the computation of the gravitational interactions. Note that for these experiments we have to transfer the full data-sets to the GPU, which is why the scaling for small N is less than perfect: it takes time to transfer the data over the PCI-Express bus. For block time-step simulations the number of particles transferred per time-step will be smaller, but the compute time is also smaller as fewer particles have to be integrated. Therefore the scaling for small N will remain less than perfect in all situations.
3.7 Block time-step simulations
To test Sapporo2 in a production-like setting we use a sixth order Hermite integrator with block time-steps (Fujii et al. 2012; Nitadori and Makino 2008). We perform simulations of Plummer (Plummer 1915) spheres using 1 and 4 GPUs with double-single (DS) and full double precision (DP) accuracy. The number of particles ranges from 16k up to 512k. For each simulation we record the execution time, the energy error, the average number of active particles per block-step and the speed-up of using 4 GPUs over 1 GPU.
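The energy error recorded here is the usual relative error of the total energy; a direct \(O(N^{2})\) diagnostic can be written as below. This helper is illustrative and not part of the library.

```cpp
// Relative energy error |(E(t) - E(0)) / E(0)| with E = kinetic + potential,
// computed by direct summation (G = 1 in N-body units).
#include <cmath>
#include <cstddef>
#include <vector>

struct Body { double m, x, y, z, vx, vy, vz; };

double total_energy(const std::vector<Body> &b)
{
    double ekin = 0.0, epot = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        ekin += 0.5 * b[i].m * (b[i].vx*b[i].vx + b[i].vy*b[i].vy + b[i].vz*b[i].vz);
        for (std::size_t j = i + 1; j < b.size(); ++j) {
            const double dx = b[j].x - b[i].x;
            const double dy = b[j].y - b[i].y;
            const double dz = b[j].z - b[i].z;
            epot -= b[i].m * b[j].m / std::sqrt(dx*dx + dy*dy + dz*dz);
        }
    }
    return ekin + epot;
}

double energy_error(double e_start, double e_now)
{
    return std::fabs((e_now - e_start) / e_start);
}
```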
4 Discussion and CPU support

4.1 CPU
Apart from the GPU implementations we also added a CPU implementation to Sapporo2. This CPU version uses SSE2 vector instructions and OpenMP parallelization and can be run in single or double precision. The only kernel implemented is the fourth order integrator, including support for neighbor lists and nearest neighbors (particle-ID and distance). Because the performance of the GPU depends on the combination of sink and source particles, we test a grid of combinations of the number of sink and source particles when measuring the time to compute the gravitational forces. The results for the CPU (a Xeon E5620 @ 2.4 GHz), using a single core, are presented in Figure 7(a). In this figure (and all the following figures) the x-axis indicates the number of sinks and the y-axis the number of sources. The execution time is indicated by the color, from blue (fastest) to red (slowest). The smooth transition from blue to red from the bottom left corner to the top right indicates that the performance does not preferentially depend on either the source or the sink particles, but rather on the combined number of interactions. This matches our expectations, because the parallelization granularity on the CPU is as small as the vector width, which is 4. On the GPU this granularity is much coarser, as presented in Figure 7(b): here we see bands of a different color every 256 particles, which corresponds to the number of threads used in a thread-block (\(N_{\mathrm{threads}}\)). With 256 sink particles a block is used at its optimal performance; with 257 sink particles, however, the first 256 sinks are processed using optimal settings while the 257th sink particle is processed relatively inefficiently. This granularity becomes less obvious when we increase the number of interactions, as presented in Figure 7(c). Here the same effect appears as for the CPU (Figure 7(a)): the granularity becomes less visible once we saturate the device and use completely filled thread-blocks for most of the particles. The final panel, Figure 7(d), indicates per combination of source and sink particles which CPU or GPU configuration is the fastest. For the CPU we measured the execution time when using 1, 2, 4 or 8 cores. In this panel the colors indicate the method that gives the shortest execution time; it furthermore indicates if, and by how much, the GPU is faster than the 8 cores of the CPU.
These results indicate that using the Sapporo2 library on the GPU is an efficient method for realistic data-set sizes. Although our CPU implementation uses SSE2 instructions, it is not as advanced as the implementation of Tanikawa et al. (2012); for example, we use intrinsic functions while they use the assembly operations directly. This is also visible when we compare their performance with our implementation: the implementation we tested here reaches about 60% of their performance. They, however, do not compute the nearest neighbor particle and do not keep track of the neighbor list, both of which have a significant impact on the performance as they cause divergence in the execution stream.
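To give an idea of the CPU implementation, the sketch below evaluates four sink-source interactions at once with SSE intrinsics in single precision. It is a simplified illustration (no neighbor list, no nearest-neighbor tracking, no OpenMP loop over sinks) with invented names, not the actual Sapporo2 CPU kernel; it assumes a non-zero softening and source arrays padded to a multiple of four.

```cpp
// Force on a single sink particle, processing the sources four at a time.
#include <emmintrin.h>   // SSE2 (includes the SSE single precision intrinsics)

void force_on_sink(float xi, float yi, float zi, float eps2,
                   const float *xj, const float *yj, const float *zj,
                   const float *mj, int n, float acc[3])
{
    const __m128 px = _mm_set1_ps(xi), py = _mm_set1_ps(yi), pz = _mm_set1_ps(zi);
    const __m128 e2 = _mm_set1_ps(eps2);
    __m128 ax = _mm_setzero_ps(), ay = _mm_setzero_ps(), az = _mm_setzero_ps();

    for (int j = 0; j < n; j += 4) {                   // 4 interactions per iteration
        const __m128 dx = _mm_sub_ps(_mm_loadu_ps(xj + j), px);
        const __m128 dy = _mm_sub_ps(_mm_loadu_ps(yj + j), py);
        const __m128 dz = _mm_sub_ps(_mm_loadu_ps(zj + j), pz);

        __m128 r2 = _mm_add_ps(e2, _mm_mul_ps(dx, dx));
        r2 = _mm_add_ps(r2, _mm_mul_ps(dy, dy));
        r2 = _mm_add_ps(r2, _mm_mul_ps(dz, dz));

        const __m128 rinv = _mm_rsqrt_ps(r2);          // approximate 1/sqrt; a Newton
                                                        // refinement step is usually added
        const __m128 mr3  = _mm_mul_ps(_mm_loadu_ps(mj + j),
                                       _mm_mul_ps(rinv, _mm_mul_ps(rinv, rinv)));
        ax = _mm_add_ps(ax, _mm_mul_ps(mr3, dx));
        ay = _mm_add_ps(ay, _mm_mul_ps(mr3, dy));
        az = _mm_add_ps(az, _mm_mul_ps(mr3, dz));
    }

    float t[4];
    _mm_storeu_ps(t, ax);  acc[0] = t[0] + t[1] + t[2] + t[3];
    _mm_storeu_ps(t, ay);  acc[1] = t[0] + t[1] + t[2] + t[3];
    _mm_storeu_ps(t, az);  acc[2] = t[0] + t[1] + t[2] + t[3];
}
```

In the full implementation an OpenMP parallel loop over the sink particles distributes this work over the available cores.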
4.2 XeonPhi

Because the Sapporo2 library can be built with OpenCL, it should, theoretically, be possible to run it on any device that supports OpenCL. To put this to the test we compiled the library with the Intel OpenCL implementation. Although the code compiled without problems, it did not produce correct results. We tested the library both on an Intel CPU and on the Intel XeonPhi accelerator; neither produced correct results. Furthermore, the performance on the XeonPhi was about 100× lower than what can be expected from its theoretical peak performance. We made some changes to the configuration parameters such as \(N_{\mathrm{threads}}\) and \(N_{\mathrm{blocks}}\), but this did not result in any presentable performance. We suspect that the Intel OpenCL implementation, especially for the XeonPhi, contains a number of limitations that cause it to generate badly performing and/or incorrect code. Therefore the Sapporo2 library is not portable to Intel architectures with their current OpenCL implementation. This does not imply that the XeonPhi has bad performance in general, since it is possible to achieve performance on N-body codes that is comparable to GPUs. However, this requires code that is specifically tuned to the XeonPhi architecture (K. Nitadori, private communication).
5 Conclusion

The Sapporo2 library makes it easy to enable GPU acceleration for direct N-body codes. We have seen that the difference between the CUDA and OpenCL implementations is minimal when there are enough particles to make the simulation compute limited. However, if many small data transfers are required, for example when the integrator takes very small time-steps with few active particles, the CUDA implementation will be faster. Apart from the fourth and sixth order integrators presented here, the library also contains a second order implementation, and because the data is stored in double precision it can trivially be extended with an eighth order integrator. The performance gain when using multiple GPUs implies that it is efficient to configure GPU machines with more than 1 GPU; this will improve the time to solution for simulations with more than \(10^{4}\) particles.

The OpenCL support and built-in tuning methods allow easy extension to other OpenCL supported devices. However, this requires a mature OpenCL library and matching hardware that supports atomic operations and double precision data types. For the CUDA devices this is not a problem, since the current CUDA libraries already have mature support for the required operations, and we expect that the library automatically scales to future architectures. The only property that has to be set is the number of thread-blocks per multiprocessor, and this can easily be identified using figures such as those presented in Section 3.1.