1 Introduction
- We introduce EPSILOD's architecture and describe the programming tools used to implement it. We also show how to use EPSILOD to characterize stencils and automatically generate efficient kernels.
- We present a new skeleton version that supports clusters with devices of mixed types and vendors, such as NVIDIA and AMD GPUs. It can apply load-balancing techniques to transparently distribute the computation across devices with different computational power. The skeleton includes a generic GPU kernel that can be used for any stencil. EPSILOD also includes a tool that generates more optimized kernels for each stencil pattern, and it accepts user-provided optimized kernels as well.
- We include extensive experimentation to show the benefits of our approach, comparing it with a state-of-the-art solution and taking into account different architectures. This evaluation includes:
  - An experimental study on a BSC (Barcelona Supercomputing Center) cluster of up to 48 NVIDIA V100 GPUs, distributed among 12 nodes, which shows that our proposal achieves very good strong and weak scalability for several types of 2D stencils.
  - An experimental study on a cluster of up to 32 NVIDIA V100 GPUs, distributed among eight nodes, which shows that our proposal outperforms implementations of the same stencils using Celerity [12], a state-of-the-art programming tool for distributed GPUs built on top of MPI and SYCL.
  - An experimental study on a heterogeneous node comprising different kinds of GPUs (NVIDIA and AMD) that shows the impact of the load-balancing technique integrated in the skeleton on achieving good strong and weak scalability on this kind of heterogeneous platform.
-
2 Related work
| Framework | Multi-GPU | Multinode | Heterogeneous portability | Mixed GPUs |
|---|---|---|---|---|
| SkelCL \(*\) | X | | | |
| Muesli \(*\) | X | X | | |
| DeVito \(\ddagger\) | X | X | | |
| YASK \(\ddagger\) | X | X | | |
| Celerity | X | X | X | |
| EPSILOD | X | X | X | X |
3 EPSILOD building blocks: Hitmap and Controllers
3.1 Hitmap
- Tiling functions: Hitmap provides the HitTile structure, a kind of fat pointer. Alongside the pointer to the memory space, it stores metadata related to the size, dimensions, and partitions of an array. The index domains of the tiles are declared and queried using the interface associated with the HitShape structure. Elements of these abstract arrays are accessed using the generic function hit, which takes a variable number of index parameters.
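To make the fat-pointer idea concrete, the following self-contained C sketch mimics the concepts described above: a data pointer packaged with domain metadata and a generic indexed accessor. The names (FatTile2D, tile_alloc, tile_at) are illustrative analogues, not the actual Hitmap HitTile/HitShape/hit interface.

```c
#include <stdio.h>
#include <stdlib.h>

/* Conceptual analogue of a fat pointer: the data pointer plus metadata
 * about the index domain it covers (a 2D block given by inclusive
 * begin/end coordinates per dimension). */
typedef struct {
    double *data;
    int begin[2];   /* first logical index in each dimension */
    int end[2];     /* last logical index in each dimension  */
} FatTile2D;

static FatTile2D tile_alloc(int r0, int r1, int c0, int c1) {
    FatTile2D t = { NULL, { r0, c0 }, { r1, c1 } };
    size_t rows = (size_t)(r1 - r0 + 1), cols = (size_t)(c1 - c0 + 1);
    t.data = calloc(rows * cols, sizeof(double));
    return t;
}

/* Generic element accessor in terms of logical (global) indices,
 * translated to local storage using the tile metadata. */
static double *tile_at(FatTile2D *t, int i, int j) {
    int cols = t->end[1] - t->begin[1] + 1;
    return &t->data[(size_t)(i - t->begin[0]) * cols + (j - t->begin[1])];
}

int main(void) {
    FatTile2D a = tile_alloc(0, 9, 0, 19);   /* rows 0..9, cols 0..19 */
    *tile_at(&a, 3, 5) = 1.0;
    printf("a(3,5) = %f\n", *tile_at(&a, 3, 5));
    free(a.data);
    return 0;
}
```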
- Mapping functions: The HitTopology and HitLayout structures provide abstractions for modular partitioning and mapping of data structures across virtual processes. The library includes policies ranging from common n-dimensional partitions to heterogeneous partitions based on per-process weights, which can be used for load balancing in heterogeneous environments.
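A minimal sketch of the weight-based partitioning idea is shown below. It splits a 1D range of rows among processes proportionally to relative weights, the kind of policy the text attributes to HitLayout for heterogeneous load balancing. The function name and the cumulative-rounding scheme are assumptions for illustration, not the library's implementation.

```c
#include <stdio.h>

/* Split rows [0, nrows) among nprocs processes proportionally to
 * weights[]; fills begin[p]/end[p] with the inclusive row range of
 * process p. Cumulative rounding absorbs the remainders. */
static void weighted_partition(int nrows, int nprocs, const double *weights,
                               int *begin, int *end) {
    double total = 0.0;
    for (int p = 0; p < nprocs; p++) total += weights[p];

    double acc = 0.0;
    int prev_cut = 0;
    for (int p = 0; p < nprocs; p++) {
        acc += weights[p];
        int cut = (int)(nrows * (acc / total) + 0.5);  /* cumulative cut point */
        begin[p] = prev_cut;
        end[p]   = cut - 1;
        prev_cut = cut;
    }
    end[nprocs - 1] = nrows - 1;   /* ensure the last process reaches the end */
}

int main(void) {
    /* Example: two fast devices and one slower one (weights 1.0, 1.0, 0.5) */
    double w[3] = { 1.0, 1.0, 0.5 };
    int b[3], e[3];
    weighted_partition(1000, 3, w, b, e);
    for (int p = 0; p < 3; p++)
        printf("process %d: rows %d..%d\n", p, b[p], e[p]);
    return 0;
}
```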
- Communication functions: They allow the creation of reusable communication patterns (HitPattern) based on a given HitLayout distribution. They provide a message-passing abstraction to communicate tiles between virtual processes. Each new pattern is built in terms of the runtime information found in a layout object; thus, it transparently adapts the communication structure to any change in the distribution policy or the number of processes.
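The sketch below illustrates the build-once, execute-every-iteration idea behind such patterns using plain MPI: the neighbor ranks and buffer locations are recorded when the pattern is created from the distribution information, and each iteration only replays the recorded exchanges. The CommPattern structure and its functions are illustrative; they are not the HitPattern API.

```c
#include <mpi.h>

/* One recorded exchange: send `count` doubles to `peer` and receive
 * `count` doubles from the same peer. */
typedef struct { int peer; double *sendbuf, *recvbuf; int count; } Exchange;
typedef struct { Exchange ex[8]; int nex; } CommPattern;

/* Built once from the layout/neighbor information. */
static void pattern_add(CommPattern *p, int peer,
                        double *sendbuf, double *recvbuf, int count) {
    p->ex[p->nex++] = (Exchange){ peer, sendbuf, recvbuf, count };
}

/* Replayed on every iteration; MPI_PROC_NULL peers are no-ops. */
static void pattern_do(const CommPattern *p) {
    for (int i = 0; i < p->nex; i++) {
        const Exchange *e = &p->ex[i];
        MPI_Sendrecv(e->sendbuf, e->count, MPI_DOUBLE, e->peer, 0,
                     e->recvbuf, e->count, MPI_DOUBLE, e->peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double top_border[4], bottom_border[4], top_halo[4], bottom_halo[4];
    for (int k = 0; k < 4; k++) top_border[k] = bottom_border[k] = rank;

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange borders/halos with the upper and lower neighbors. */
    CommPattern p = { .nex = 0 };
    pattern_add(&p, up,   top_border,    top_halo,    4);
    pattern_add(&p, down, bottom_border, bottom_halo, 4);

    for (int iter = 0; iter < 10; iter++)   /* reused across iterations */
        pattern_do(&p);

    MPI_Finalize();
    return 0;
}
```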
3.2 Controllers programming model
4 Parallel stencil skeleton
4.1 Interface and usage
4.2 Implementation details
- Halos: Elements of the local arrays not owned by the process, but needed to compute the local part. These elements are not assigned to the process by the partition function. They are included in the local part during the expansion of the shape determined by the stencil radii, and they must be received from the neighbor processes on every iteration. They are the outermost elements of the local arrays, and they are not updated by the local process.
- Borders: Elements owned by the local process that are duplicated as halos in other processes. They are updated locally and must be sent to the corresponding neighbors on every iteration.
- Inner: The innermost elements of the local arrays, excluding halos and borders. These elements are updated on every iteration, and they are not communicated to any neighbor (see the sketch after this list).
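The following self-contained sketch shows how these three regions can be derived for one dimension of a local block, given the rows owned by the process and a stencil radius r. The symmetric radius and the variable names are illustrative assumptions; as described later, the skeleton also handles per-direction radii such as those of the asymmetric stencil.

```c
#include <stdio.h>

/* For a process owning rows [own_lo, own_hi] of the global array and a
 * stencil radius r, print the index ranges of the expanded local block:
 * halos (received, never updated), borders (updated and sent), and the
 * inner part (updated, never communicated). */
static void print_regions(int own_lo, int own_hi, int r) {
    printf("expanded local block: rows %d..%d\n", own_lo - r, own_hi + r);
    printf("halo   (top/bottom) : %d..%d and %d..%d\n",
           own_lo - r, own_lo - 1, own_hi + 1, own_hi + r);
    printf("border (top/bottom) : %d..%d and %d..%d\n",
           own_lo, own_lo + r - 1, own_hi - r + 1, own_hi);
    printf("inner               : %d..%d\n", own_lo + r, own_hi - r);
}

int main(void) {
    /* Example: the process owns rows 100..199 and the stencil radius is 2 */
    print_regions(100, 199, 2);
    return 0;
}
```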
5 Experimental studies
5.1 Experimental study 1: strong and weak scaling in CTE-POWER
- 2D 4-point compact stencil (Jacobi).
- 2D 9-point compact stencil.
- 2D 9-point non-compact, second-order stencil.
- 2D 5-point non-compact, asymmetric stencil.
| 2D 9-point stencil (GFLOPS) | 1 GPU, 1 node | 2 GPUs, 1 node | 4 GPUs, 1 node | 8 GPUs, 2 nodes | 16 GPUs, 4 nodes | 32 GPUs, 8 nodes | 48 GPUs, 12 nodes |
|---|---|---|---|---|---|---|---|
| EPSILOD weak scalability, generic kernel | 821.59 | 1640.67 | 3142.45 | 6219.52 | 12333.38 | 24800.17 | 36968.81 |
| EPSILOD weak scalability, KGT-generated kernel | 1282.60 | 2557.95 | 4684.85 | 9302.92 | 18478.60 | 37114.85 | 55087.86 |
In the weak-scaling experiments, we observe smaller execution times when using one or two GPUs than when using four GPUs or more. This is explained by the architecture of the cluster nodes. The four GPUs of each node are grouped in two zones, attached to different CPUs (NUMA nodes) and located in different power and ventilation areas. With four GPUs executing in the node, each zone hosts two executing devices, and both experience a slight performance degradation.
We also observe very good weak scalability, especially for the optimized kernel. With up to four GPUs, inter-process communication is fast, as it is done within the node. With eight or more GPUs, some inter-process communications also use the network. We observe no significant increase in execution time when using more than four GPUs, indicating that all data transfers and communications are fully overlapped with the computation.
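As a rough quantitative check of this claim, computed from the weak-scalability values in the table above and taking the 4-GPU single-node run as the per-GPU baseline (to factor out the intra-node effect discussed earlier), the efficiency at 48 GPUs (12 nodes) is approximately

\[
\frac{36968.81}{12 \times 3142.45} \approx 0.98 \;\; \text{(generic kernel)}, \qquad
\frac{55087.86}{12 \times 4684.85} \approx 0.98 \;\; \text{(KGT-generated kernel)}.
\]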
The strong-scaling results also show how effective the overlapping is, with a parallel efficiency of more than 90% with up to 48 GPUs.
Regarding performance, the execution time is dominated by the computation of the inner part. The kernels of these stencil computations are memory-bound. The performance differences between stencils mainly originate from the lower or higher memory bandwidth needed to access the neighbor elements. In general, non-compact stencils perform worse than compact ones. For stencils accessing a similar number of memory lines, slight performance differences can be observed depending on the number of arithmetic operations per cell. These differences are more noticeable when using the less optimized generic kernel. For example, the 2d9 non-compact stencil presents the highest execution times: it accesses more memory lines than the others due to its larger stencil radius (\(r=2\) in each of the four directions). The 2d4 stencil is faster because it performs fewer floating-point operations per cell. In the case of the asymmetric 2d5 stencil, the effects of the larger radius in two directions, the zero radius in the other two, and the number of floating-point operations compensate each other, yielding a performance similar to that of the 2d9 compact stencil. Nevertheless, the communication pattern of the asymmetric 2d5 stencil leads to a pipeline structure, simpler than the full neighbor-synchronization structure of the other stencils. Thus, its performance suffers less from occasional small delays in remote synchronizations. This is noticeable in the strong-scaling tests when using the automatically generated kernel with 32 or 48 GPUs: the load of the inner part becomes so small that inter-process synchronization effects become relevant. In the same situations, the higher communication volume of the 2d9 non-compact stencil leads to slightly lower performance for a high number of GPUs.
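The difference in memory lines touched per cell can be illustrated by comparing the per-cell updates of the 2d4 compact stencil (radius 1, three rows touched) and the 2d9 non-compact, second-order stencil (radius 2, five rows touched). The coefficients in this sketch are illustrative placeholders, not the exact formulas used in the experiments.

```c
#include <stdio.h>

/* 2d4 compact (Jacobi): radius 1, touches rows i-1, i, i+1 (3 memory lines). */
static double update_2d4(const double *in, int cols, int i, int j) {
    return 0.25 * (in[(i - 1) * cols + j] + in[(i + 1) * cols + j] +
                   in[i * cols + (j - 1)] + in[i * cols + (j + 1)]);
}

/* 2d9 non-compact, second order: radius 2 along both axes, touches rows
 * i-2 .. i+2 (5 memory lines) plus four extra columns of the central row. */
static double update_2d9nc(const double *in, int cols, int i, int j) {
    return (in[(i - 2) * cols + j] + in[(i - 1) * cols + j] +
            in[(i + 1) * cols + j] + in[(i + 2) * cols + j] +
            in[i * cols + (j - 2)] + in[i * cols + (j - 1)] +
            in[i * cols + (j + 1)] + in[i * cols + (j + 2)] +
            in[i * cols + j]) / 9.0;
}

int main(void) {
    enum { N = 8 };
    double a[N * N];
    for (int k = 0; k < N * N; k++) a[k] = (double)k;
    printf("2d4   at (4,4): %f\n", update_2d4(a, N, 4, 4));
    printf("2d9nc at (4,4): %f\n", update_2d9nc(a, N, 4, 4));
    return 0;
}
```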
5.2 Experimental study 2: comparison with Celerity implementations
- 2D 4-point compact stencil (Jacobi).
- 2D 9-point compact stencil.
- 2D 9-point non-compact, second-order stencil.
- 2D 5-point non-compact, asymmetric stencil.
| 2D 9-point stencil (GFLOPS) | 1 GPU, 1 node | 2 GPUs, 1 node | 4 GPUs, 1 node | 8 GPUs, 2 nodes | 16 GPUs, 4 nodes | 32 GPUs, 8 nodes |
|---|---|---|---|---|---|---|
| Celerity implementation, weak scalability | 1238.37 | 2341.60 | 4238.52 | 8417.58 | 15777.26 | 30940.34 |
| EPSILOD, KGT-generated kernel, weak scalability | 1426.74 | 2830.06 | 4953.20 | 9785.37 | 17965.59 | 35145.91 |

| 2D 4-point stencil (GFLOPS) | 1 GPU, 1 node | 2 GPUs, 1 node | 4 GPUs, 1 node | 8 GPUs, 2 nodes | 16 GPUs, 4 nodes | 32 GPUs, 8 nodes |
|---|---|---|---|---|---|---|
| Celerity implementation, weak scalability | 817.36 | 1520.13 | 2712.66 | 5361.13 | 10020.10 | 18011.26 |
| EPSILOD, KGT-generated kernel, weak scalability | 774.52 | 1536.50 | 2681.57 | 5287.03 | 9773.10 | 19390.49 |
5.3 Experimental study 3: using different GPU types
| 2D 9-point stencil (GFLOPS) | 1 Tesla V100 | 2 Tesla V100 | 2 Tesla V100 + 1 Radeon PRO WX 9100 | 2 Tesla V100 + 2 Radeon PRO WX 9100 |
|---|---|---|---|---|
| EPSILOD, unbalanced distribution | 1239.86 | 2478.34 | 1762.18 | 2216.63 |
| EPSILOD, balanced distribution | 1239.97 | 2478.48 | 3044.86 | 3657.75 |
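A back-of-the-envelope reading of these figures explains the unbalanced row, under the assumption that with equal partitions the per-iteration synchronization lets the slowest device dominate the execution time. With two V100 plus one WX 9100, the unbalanced aggregate throughput is then roughly three times the WX 9100 rate, while the balanced distribution approaches the sum of the individual device rates:

\[
r_{\mathrm{WX}} \approx \frac{1762.18}{3} \approx 587\ \text{GFLOPS}, \qquad
2 \times 1239.86 + 587 \approx 3067\ \text{GFLOPS},
\]

which is close to the 3044.86 GFLOPS measured with the balanced distribution.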
5.4 Summary of results
- Regarding strong and weak scaling in CTE-POWER, we observe very good weak scalability, especially for the optimized kernel. We also observe that the execution time does not increase with more than four GPUs, indicating that all data transfers and communications are fully overlapped with the computation. The strong-scaling results also demonstrate how effective the overlapping is, with a parallel efficiency of more than 90% with up to 48 GPUs.
- Regarding the comparison with Celerity, our proposal outperforms the Celerity implementations for every type of stencil tested, except for the simplest stencil when using fewer than 32 GPUs, where performance is similar. This advantage stems from our better communication management.
- Regarding the simultaneous use of GPU types with different computational power, the results show that simple, static load-balancing techniques make it possible to achieve good weak scalability with the proposed skeleton.