1 Introduction
Using UM involves allocation with `cudaMallocManaged(...)`, invocation of a kernel and synchronization upon kernel termination. DP, on the other hand, is a mechanism introduced in CUDA 5 (for devices with compute capability 3.5+) that allows launching kernels from within kernels [21]. Recursive calls may continue up to 24 levels deep. This solution is well suited to divide-and-conquer applications [4], as no explicit synchronization through the host is needed before subsequent kernel launches. Specifically, this allows recursive deepening in certain algorithms to increase the resolution of computations in geometric SPMD applications [20], in numerical applications such as adaptive integration [6], and in others. It should also be noted that OpenCL, another popular API for programming GPUs (and also CPUs+GPUs), offers device-side enqueuing (a parent kernel enqueues a child kernel) and shared virtual memory (SVM), in which a virtual address space is accessible both from the host and a device, with three types depending on the granularity of synchronization: coarse-grained buffer SVM, fine-grained buffer SVM and fine-grained system SVM [13].
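As an illustration (a minimal sketch, not code from the paper), the following program combines both mechanisms: a buffer allocated with `cudaMallocManaged` is initialized directly on the host, a parent kernel launches a child kernel on the device side, and the host reads the results after `cudaDeviceSynchronize`:

```cuda
// Minimal UM + DP sketch. Compile with: nvcc -arch=sm_35 -rdc=true um_dp.cu
#include <cstdio>

__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                  // example work on the shared buffer
}

__global__ void parentKernel(float *data, int n) {
    // DP: a kernel launched from within a kernel, no host round trip needed
    if (threadIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}

int main() {
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float)); // UM: one pointer for host and device
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // host writes directly

    parentKernel<<<1, 32>>>(data, n);
    cudaDeviceSynchronize();                     // synchronize before the host reads

    printf("data[0] = %f\n", data[0]);           // host reads directly
    cudaFree(data);
    return 0;
}
```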
2 Related work

2.1 Unified memory
2.2 Dynamic parallelism
3 Evaluation of unified memory and dynamic parallelism
3.1 Methodology
| | Application 1 | Application 2 | Application 3 |
|---|---|---|---|
| Purpose | Heat distribution SPMD-type application | Numerical integration | Goldbach conjecture |
| DP | Moving the time simulation loop to the GPU | Used in adaptive integration for subranges that need more accuracy | Finding pairs of prime numbers |
| UM | Transferring results for visualization on the CPU side | Transferring partial results for integrated subranges | Transferring tested numbers to the device |
| Tested optimizations | Various ways of updating heaters' temperatures; considering only areas with changes over a threshold | Static, iterative and recursive versions of the integration algorithm for different functions | Different sizes of thread pools; helper boolean vector for the primality test |
3.1.1 Test platforms
3.2 Parallel heat distribution application: experiments
3.2.1 Dynamic parallelism
One approach is to invoke a kernel from the CPU in a `for` loop, once per heat source. Another approach can be a single kernel that uses DP to call child kernels, one for every heat source. We tested the following versions:
- `std-cpConst` — a kernel works on the whole simulation grid and, for every cell that is part of a heater, renews its temperature.
- `std-for` — the application uses the standard CUDA API and invokes a kernel in a `for` loop from the CPU. Each kernel invocation renews the temperature of only one heater.
- `dp-cpConst` — a kernel uses a similar approach to `std-cpConst`. The difference is that the kernel invocation is shifted to the device side.
- `dp-for` — on the device side, in a `for` loop, we call a kernel for each heater. It is similar to `std-for` but the loop is on the device side (see the sketch after this list).
- `dp-cpHeaters` — we call a kernel that calls nested kernels for each heater on the device side.
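As a hypothetical sketch of the `dp-for` variant (the `Heater` struct and kernel names are our illustration, not the authors' code), a single device-side kernel loops over the heat sources and launches one child kernel per heater, so no host round trip is needed between launches:

```cuda
// Hypothetical dp-for sketch: the per-heater loop runs on the device.
struct Heater { int x0, y0, x1, y1; float temperature; };

__global__ void renewHeater(float *grid, int width, Heater h) {
    int x = h.x0 + blockIdx.x * blockDim.x + threadIdx.x;
    int y = h.y0 + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < h.x1 && y < h.y1)
        grid[y * width + x] = h.temperature;  // keep heater cells at their set temperature
}

// Launched from the host as renewAllHeaters<<<1, 1>>>(...); each loop
// iteration performs a device-side (DP) launch of a child kernel.
__global__ void renewAllHeaters(float *grid, int width,
                                const Heater *heaters, int numHeaters) {
    for (int i = 0; i < numHeaters; ++i) {
        Heater h = heaters[i];
        dim3 block(16, 16);
        dim3 blocks((h.x1 - h.x0 + 15) / 16, (h.y1 - h.y0 + 15) / 16);
        renewHeater<<<blocks, block>>>(grid, width, h);
    }
}
```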
| Number of heaters | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 20 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Area (%) | 0.48 | 0.72 | 0.95 | 1.19 | 1.43 | 1.67 | 1.91 | 2.15 | 2.38 | 4.77 | 7.15 |
The best results for a small number of heat sources were obtained with the `dp-for` version (up to 10 heaters) or with `dp-cpHeaters` (30 heaters). However, if the number of heaters exceeded 20, the versions working on the whole simulation area performed better. Both such solutions (`std-cpConst` and `dp-cpConst`) exhibited similar results. The most universal approach was the `std-for` version: it worked well both with a small and with a larger number of heat sources. In the following experiments, we used the `dp-for` approach.
We compared the `dp-for` approach to `dp-cpConst` in a separate chart to better show the performance gains. Detailed results are presented in Fig. 3. It can be seen that for two heaters the `dp-for` approach is almost 40% faster.

To exploit the fact that only areas with changes above a threshold need to be updated, we prepared two versions. The first checks the tiles in every loop iteration (marked as `dp`). The second checks each tile once per visualization, e.g., once per 90 iterations (marked as `dp-heur`).

The `dp-heur` version improved performance twofold because only a part of the grid needs to be updated. Over time, as the simulation covers a greater area of the grid, its performance becomes similar to that of the standard version. On the other hand, the `dp` approach caused a significant decrease in performance, due to the additional operations performed in every loop iteration.
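A minimal sketch of the tile-skipping idea follows (kernel names, the tile size and the bookkeeping that marks tiles as active are our assumptions): a parent kernel, launched once per step, starts the stencil computation only for tiles whose temperature changed above the threshold:

```cuda
// Hypothetical dp-heur sketch: only tiles flagged as active are updated.
#define TILE 16

__global__ void stepTile(const float *grid, float *next, int width, int height,
                         int tx, int ty) {
    int x = tx * TILE + threadIdx.x;
    int y = ty * TILE + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;
    // simple 4-point stencil: new value is the average of the neighbors
    next[y * width + x] = 0.25f * (grid[y * width + x - 1] + grid[y * width + x + 1] +
                                   grid[(y - 1) * width + x] + grid[(y + 1) * width + x]);
}

// Launched as stepActiveTiles<<<1, 1>>>(...); inactive tiles are skipped.
__global__ void stepActiveTiles(const float *grid, float *next, int width, int height,
                                const bool *active, int tilesX, int tilesY) {
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
            if (active[ty * tilesX + tx])   // changed above threshold since last check
                stepTile<<<1, dim3(TILE, TILE)>>>(grid, next, width, height, tx, ty);
}
```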
3.2.2 Unified memory
3.2.3 Summary
3.3 Parallel adaptive numerical integration: experiments
The simplest version, called `static`, splits the initial range into subranges, each of width \(\delta \). Then, in every subrange, a trapezoid area is calculated. There are no nested procedure calls nor testing of the accuracy of the calculations for a subrange. It is intuitively clear that in some cases (e.g., for a linear function) this method may result in significant redundancy of calculations compared to what could be done with knowledge of the function.

For the `static` version, the most important parameter is \(\delta \). It should be as small as the smallest subrange in an adaptive version.

The `double` type is used, which requires 8 bytes for every subrange. Therefore, for large ranges, additional transfers between the device and the host may occur.
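A minimal sketch of the `static` scheme (the integrand `f(x) = x * x` and all sizes are stand-ins, not the paper's test functions) shows the fixed-\(\delta \) split, one trapezoid per subrange and the `double` partial-results buffer kept in unified memory:

```cuda
// Static variant sketch: fixed-width subranges, no accuracy test, no recursion.
#include <cstdio>

__device__ double f(double x) { return x * x; }   // stand-in integrand

__global__ void trapezoids(double a, double delta, double *partial, long n) {
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) {
        double x0 = a + i * delta;
        partial[i] = 0.5 * (f(x0) + f(x0 + delta)) * delta;  // one trapezoid per subrange
    }
}

int main() {
    const double a = 0.0, b = 1.0, delta = 1e-6;
    const long n = (long)((b - a) / delta + 0.5);     // number of subranges
    double *partial;
    cudaMallocManaged(&partial, n * sizeof(double));  // UM: 8 bytes per subrange
    trapezoids<<<(int)((n + 255) / 256), 256>>>(a, delta, partial, n);
    cudaDeviceSynchronize();
    double sum = 0.0;
    for (long i = 0; i < n; ++i) sum += partial[i];   // host-side reduction for clarity
    printf("integral ~= %f\n", sum);
    cudaFree(partial);
    return 0;
}
```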
3.3.1 Test functions

- `static` — a static version of the application,
- `adaptive32` — an adaptive version without DP, splitting into 32 parts,
- `adaptive2` — an adaptive version without DP, splitting into 2 parts,
- `dp32` — an adaptive version using DP, splitting into 32 parts,
- `dp2` — an adaptive version using DP, splitting into 2 parts; see the sketch after this list.
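To make the `dp2` scheme concrete, here is a hypothetical sketch (names and the accumulation strategy are our assumptions; note that `atomicAdd` on `double` requires compute capability 6.0+, so on cc 3.5 devices it would have to be emulated with a CAS loop). Each kernel compares one trapezoid over its range with the sum of two half-width trapezoids and, if they differ by more than the tolerance, launches two child kernels for the halves:

```cuda
// Hypothetical dp2 sketch: recursive refinement via device-side launches.
__device__ double f(double x) { return x * x; }       // stand-in integrand

__device__ double trap(double a, double b) {
    return 0.5 * (f(a) + f(b)) * (b - a);             // single trapezoid over [a, b]
}

__global__ void integrate(double a, double b, double eps, int depth, double *result) {
    double mid    = 0.5 * (a + b);
    double whole  = trap(a, b);
    double halves = trap(a, mid) + trap(mid, b);
    if (fabs(whole - halves) < eps || depth >= 23) {  // respect the 24-level limit
        atomicAdd(result, halves);                    // accurate enough: accumulate
    } else {                                          // refine: two DP child launches
        integrate<<<1, 1>>>(a, mid, 0.5 * eps, depth + 1, result);
        integrate<<<1, 1>>>(mid, b, 0.5 * eps, depth + 1, result);
    }
}
```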
3.3.2 Dynamic parallelism
The `adaptive32` version was significantly worse compared to the other versions. The `static` version was slower than the DP versions but at the same time had slightly better accuracy, as mentioned above. However, `adaptive2` was the fastest version.
In the following experiments, we used the `adaptive2` version.

3.3.3 Unified memory
Results for the `32` versions (`adaptive32` and `dp32`) are somewhat more interesting. On the GTX 970, the approach utilizing DP was better. In turn, on the Tesla K20m, the iterative version was faster. Additionally, we measured the time of a kernel call: it turned out to be almost four times longer on the Tesla device (Table 3). As a result, the time saved by avoiding memory transfers is offset by the kernel call times.

| Device | Time (ms) |
|---|---|
| Tesla K20m | 30.95 |
| GTX 970 | 8.39 |
3.3.4 Summary
3.4 Parallel Goldbach conjecture application: experiments
3.4.1 Dynamic parallelism
3.4.2 Unified memory
- `std-constant` — a version with standard memory management; the vector is stored in constant memory,
- `std-global` — a version with standard memory management; the vector is stored in global memory,
- `um-constant` — a version with UM; the vector is stored in constant memory,
- `um-global` — a version with UM; the vector is stored in global memory (marked as `__managed__`); see the sketch after this list.
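As an illustration of the `um-global` approach, the following is a minimal sketch (sizes and names are our assumptions, not the authors' code): a `__managed__` primality vector is filled by a host-side sieve and read directly by the kernel that counts Goldbach pairs:

```cuda
// Hypothetical um-global sketch: the helper boolean vector for the primality
// test lives in managed (unified) memory, so the host fills it with a sieve
// and the kernel reads it without any explicit cudaMemcpy.
#include <cstdio>

#define MAX_N (1 << 20)
__managed__ bool isPrime[MAX_N];            // UM: visible to host and device

__global__ void goldbachPairs(int n, int *count) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    // count decompositions n = p + q with p <= q and both prime
    if (p >= 2 && p <= n / 2 && isPrime[p] && isPrime[n - p])
        atomicAdd(count, 1);
}

int main() {
    // host-side sieve of Eratosthenes fills the managed vector directly
    for (int i = 2; i < MAX_N; ++i) isPrime[i] = true;
    for (int i = 2; (long)i * i < MAX_N; ++i)
        if (isPrime[i])
            for (int j = i * i; j < MAX_N; j += i) isPrime[j] = false;

    int *count;
    cudaMallocManaged(&count, sizeof(int)); // result also in unified memory
    *count = 0;

    int n = 100000;                         // even number under test
    goldbachPairs<<<(n / 2 + 255) / 256, 256>>>(n, count);
    cudaDeviceSynchronize();                // required before the host reads count
    printf("%d = p + q for %d prime pairs\n", n, *count);
    cudaFree(count);
    return 0;
}
```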