1 Introduction
2 Mathematical description of the Eikonal equation
3 Local solver
3.1 Fast iterative method
3.2 Precomputing the inner products
3.2.1 Gray code like indexing
3.2.2 Accessing the precomputed scalar products via Gray indexing
3.2.3 Rotating elements into the reference configuration
\(\phi _k\) | \(\phi _4\) | \(\phi _3\) | \(\phi _2\) | \(\phi _1\) |
---|---|---|---|---|
Edges | 5,1,2 | 2,4,0 | 0,1,3 | 3,4,5 |
Signs | +,+,+ | +,–,+ | +,–,– | –,+,+ |
3.3 Further memory footprint reduction
4 Global solution algorithms
5 Task based parallel Eikonal solver
5.1 Multithreading
5.2 CUDA offloading kernels
5.2.1 Using SCAN for compaction
5.2.2 Using SCAN for data management
6 Domain decomposition parallel Eikonal solver
6.1 Parallel sweep
6.2 Local solves
6.3 CUDA implementation
6.3.1 First DD approach
- Work load management strategy. During each iteration we keep track of the active domains; their number changes with each synchronization step. Data preparation by the other kernels and the solution computations are then performed only for the active domains, which avoids the unnecessary work of preparing data for, or solving on, inactive domains. The work load of one domain is distributed over several blocks, where each block computes a subset of the tetrahedral elements of that domain, determined in advance from the local active list as shown in Fig. 6. One thread in a block computes one tetrahedral element, and the number of threads per block is chosen such that occupancy is not reduced. The number of blocks solving for one domain is the total number of tetrahedral elements divided by the number of threads per block (see the first sketch after this list). At the beginning and at the end of the execution the number of active domains is very small, so many blocks or processing units would otherwise remain idle. This approach increases utilization because it assigns several blocks to one domain, reducing the number of inactive streaming multiprocessors (SMs). Another way to overcome this is to use several smaller sub-domains, which may increase the chance of a processor having a non-empty active set.
- Synchronization. At the end of the loop, local solution values are exchanged based on a precomputed local-to-local mapping, which reduces accesses to global memory. The domain \(\Omega \) is statically decomposed into a number of non-overlapping sub-domains \(\Omega _i\), so that local solution values need to be exchanged only on the nodes shared by the boundary between two sub-domains. In our case the Eikonal equation can be solved efficiently on \(\Omega _i\) as long as its boundary data on \(\partial \Omega _i\) is correct. This data may belong to other sub-domains processed by other blocks, which requires inter-process communication. On one GPU, communication between blocks is realized by host synchronization; for this reason the synchronization kernel is placed at the end of the loop. To achieve a quick mapping, a precomputed local-to-local interface of each domain with all of its neighboring domains is used. It contains the local indices of the boundary nodes and the indices of the two domains that share the boundary containing these nodes. With this information a single kernel can exchange solution values for all nodes in the interface (see the second sketch after this list). The exchange happens only if the value of a boundary node in one domain is smaller than the value of the same node in the other domain. In that case the larger value is replaced by the better solution, the node is added to the local active list of the sub-domain whose value changed, and that sub-domain becomes active.
- Termination condition. A separate kernel checks whether the total number of nodes over all local active lists is zero, which is the termination criterion (see the third sketch after this list). It must be a separate kernel because it has to wait for the synchronization kernel, in which nodes are added to the local active lists, to finish in order to produce a correct result.
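The block-to-domain work distribution of the first approach can be sketched as follows. This is a minimal illustration under our own assumptions, not the actual implementation: the names (`BlockWork`, `buildBlockWork`, `solveActiveTets`, `localSolveTet`) and the flattened array layout are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int THREADS_PER_BLOCK = 256;   // chosen so that occupancy is not reduced

struct BlockWork { int domain; int firstTet; };  // which chunk of which domain a block handles

// Host side: build the block -> (domain, tet offset) mapping for the active domains only.
std::vector<BlockWork> buildBlockWork(const std::vector<int>& activeDomains,
                                      const std::vector<int>& activeTetCount)
{
    std::vector<BlockWork> work;
    for (int d : activeDomains) {
        int nTets   = activeTetCount[d];
        int nBlocks = (nTets + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;  // ceil division
        for (int b = 0; b < nBlocks; ++b)
            work.push_back({d, b * THREADS_PER_BLOCK});
    }
    return work;   // work.size() blocks are launched; inactive domains get none
}

// Device side: one thread updates one tetrahedron taken from the domain's local active list.
__global__ void solveActiveTets(const BlockWork* work,
                                const int*  activeTetList,   // concatenated per-domain lists
                                const int*  domainTetOffset, // start of each domain's list
                                const int*  activeTetCount,
                                double*     solution)
{
    BlockWork w   = work[blockIdx.x];
    int       lid = w.firstTet + threadIdx.x;        // index inside the domain's active list
    if (lid >= activeTetCount[w.domain]) return;     // last block of a domain may be partial
    int tet = activeTetList[domainTetOffset[w.domain] + lid];
    // localSolveTet(tet, solution);                 // local Eikonal update (placeholder)
}
```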
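The interface exchange can be sketched in the same spirit; `InterfaceEntry` and the per-domain array layout are again illustrative assumptions, and the actual kernel may differ.

```cuda
struct InterfaceEntry {
    int domainA, domainB;       // the two sub-domains sharing this boundary node
    int localIdxA, localIdxB;   // local node index inside each sub-domain
};

__global__ void exchangeInterface(const InterfaceEntry* iface, int nEntries,
                                  double* localSolution,        // per-domain solutions, flattened
                                  const int* nodeOffset,        // start of each domain's nodes
                                  int* localActiveList,         // per-domain active lists, flattened
                                  const int* activeListOffset,  // start of each domain's active list
                                  int* localActiveCount,        // current size of each active list
                                  int* domainIsActive)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nEntries) return;

    InterfaceEntry e = iface[i];
    double* a = &localSolution[nodeOffset[e.domainA] + e.localIdxA];
    double* b = &localSolution[nodeOffset[e.domainB] + e.localIdxB];

    // Copy the smaller value over the larger one and re-activate the updated node/domain.
    if (*a < *b) {
        *b = *a;
        int slot = atomicAdd(&localActiveCount[e.domainB], 1);
        localActiveList[activeListOffset[e.domainB] + slot] = e.localIdxB;
        domainIsActive[e.domainB] = 1;
    } else if (*b < *a) {
        *a = *b;
        int slot = atomicAdd(&localActiveCount[e.domainA], 1);
        localActiveList[activeListOffset[e.domainA] + slot] = e.localIdxA;
        domainIsActive[e.domainA] = 1;
    }
}
```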
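A possible form of the termination-condition kernel, assuming the per-domain active-list sizes are stored in an array `localActiveCount` (a hypothetical name); it is launched only after the exchange kernel has completed.

```cuda
// Sum the sizes of all local active lists; the host loop stops once the total is zero.
__global__ void countActiveNodes(const int* localActiveCount, int nDomains, int* totalActive)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < nDomains && localActiveCount[d] > 0)
        atomicAdd(totalActive, localActiveCount[d]);
}
// Host side: zero *totalActive, launch after the synchronization kernel has finished,
// copy the result back, and terminate the outer loop when it is zero.
```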
6.3.2 Second DD approach
- Work load management strategy. We keep track of the active domains in order to avoid unnecessary computations for the inactive domains and to allow for better load balancing of the work across the blocks. Since one block now computes a single sub-domain, one thread has to do more work. This can become a very efficient approach if done correctly, by preserving coalesced memory access and by increasing the granularity as in Fig. 7 (see the kernel sketch after this list). Several smaller sub-domains increase the chance of a block having a non-empty active set.
- Host synchronization reduction. The domain decomposition in CUDA allows for kernel optimizations and a reduction of host interaction: because the computations for each sub-domain are independent within a block, block synchronization can be used instead of host synchronization. This reduces the number of kernels and simplifies the code. Memory is allocated dynamically on the device, which eliminates the corresponding memory transfers from the device to the host. Together, these changes improve the performance of the domain decomposition approach.
- Synchronization. Fusing the kernels into a single kernel is not straightforward because the synchronization kernel and the termination condition kernel have to be incorporated. The safest way to guarantee correct results is a synchronization with the host, which ensures that all blocks have finished execution before the termination condition kernel runs. We finally managed to reduce all computations to one large synchronization kernel and a second termination condition kernel. A change in the local-to-local interface model described in Sect. 6.3.1 allowed us to incorporate the synchronization kernel. The first domain decomposition approach used a local-to-local mapping for all boundaries between two sub-domains. This can no longer be applied in the new approach, where one block solves for one sub-domain, because the interface information is not grouped by sub-domain: the mapping contains the complete boundary information for all domains, and a block does not know which part of it belongs to the sub-domain it is solving. Therefore, we changed the precomputation of the interface such that every sub-domain has its own separate interface information with its neighboring sub-domains. As a consequence, a block knows exactly which interface maps to which neighboring sub-domain and updates accordingly. This change allowed us to incorporate all computations into one main kernel, but it does not by itself solve the synchronization problem: when the synchronization step inside the main kernel updates a boundary solution value of another domain, the block computing on that other domain might update the same value at the same time. This update is protected with atomic operations (see the atomic-minimum sketch after this list), but atomicity alone is not enough; fortunately, the order of these updates is not relevant to the overall solution, so we still obtain a correct result.
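A minimal sketch of the one-block-per-sub-domain kernel of the second approach is given below. It illustrates both the strided (coalesced) element loop and the replacement of host synchronization by `__syncthreads()`; `localSolveTet` and the array names are assumptions, not the actual implementation.

```cuda
__global__ void solveSubdomain(const int* domainTetOffset, const int* domainTetCount,
                               const int* activeDomainIds, int* domainIsActive,
                               double* solution)
{
    int dom    = activeDomainIds[blockIdx.x];     // one block <-> one active sub-domain
    int nTets  = domainTetCount[dom];
    int offset = domainTetOffset[dom];

    __shared__ int converged;
    int done;
    do {
        if (threadIdx.x == 0) converged = 1;
        __syncthreads();

        // One thread handles several tetrahedra: consecutive threads read consecutive
        // elements so that global memory accesses stay coalesced.
        for (int t = threadIdx.x; t < nTets; t += blockDim.x) {
            int tet = offset + t;
            // if (localSolveTet(tet, solution)) converged = 0;  // local Eikonal update
        }
        __syncthreads();      // block-wide synchronization replaces a host synchronization
        done = converged;     // take a private copy before the flag is reset again
        __syncthreads();
    } while (!done);

    if (threadIdx.x == 0) domainIsActive[dom] = 0;  // converged locally; the exchange may
                                                    // re-activate this sub-domain later
}
```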
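The atomic protection of the boundary update can be realized, for example, with an `atomicCAS`-based minimum on `double` values. The text only states that atomic operations are used, so the following helper is an assumption about one possible realization, not the actual code.

```cuda
// Thread-safe "take the minimum" on a double: whichever block writes last, the stored
// value is never larger than the smaller of the two candidates, so the update order
// does not change the final solution.
__device__ double atomicMinDouble(double* addr, double val)
{
    unsigned long long* p   = reinterpret_cast<unsigned long long*>(addr);
    unsigned long long  old = *p;
    while (val < __longlong_as_double(static_cast<long long>(old))) {
        unsigned long long assumed = old;
        old = atomicCAS(p, assumed,
                        static_cast<unsigned long long>(__double_as_longlong(val)));
        if (old == assumed) break;   // our smaller value was stored
    }
    return __longlong_as_double(static_cast<long long>(old));
}

// Usage inside the fused synchronization step (indices are illustrative):
// atomicMinDouble(&solution[neighborNode], solution[ownNode]);
```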
7 Numerical tests and performance analysis
7.1 Gray code
Implementations | # Tets | CUDA | OpenMP 8 threads |
---|---|---|---|
Without Gray-code | 3,073,529 | 1.49 | 5.66 |
With Gray-code | 3,073,529 | 0.73 | 3.65 |
Without Gray-code | 24,400,999 | 11.48 | 56.63 |
With Gray-code | 24,400,999 | 5.16 | 36.43 |
7.2 Domain decomposition
# Sub-domains | First DD approach | Second DD approach |
---|---|---|
74 | 0.48 | 0.69 |
160 | 0.52 | 0.60 |
320 | 0.58 | 0.51 |
# Sub-domains | First DD approach | Second DD approach |
---|---|---|
2700 | 5.96 | 6.89 |
3000 | 6.40 | 7.55 |
4000 | 7.55 | 9.46 |
8000 | 14.00 | 14.74 |
8 Conclusions and future work