1 Introduction
2 Related Work
3 The Muenster Skeleton Library Muesli
a and b (in a slightly simplified syntax). First, a distributed array is created in line 6 of Listing ??. A skeleton typically takes a user function as an argument, which can be either a C++ function or a C++ functor. Muesli supports currying, i.e. the arguments of a user function can be supplied one by one rather than all at once. For instance, the MapIndex skeleton in line 7 of Listing 1 automatically passes the current array element and its index as additional parameters to the sum functor.
4 Data Distribution and Data Structures in Heterogeneous Computing Environments
4.1 Distributed Cubes
a and b, filled with the default values 0 and 1, respectively. The mapIndexInPlace skeleton in line 11 adds to each element its row-index, column-index, and its index of the third dimension. In line 12, each value of b is added to the corresponding value of a.
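The two steps described above can be sketched in plain C++ as follows. This is a minimal, sequential illustration and not the Muesli API: the class Cube, the method zipInPlace, the functor IndexSum, and the helper buildExampleCube are hypothetical names introduced only for this sketch, and all data is held in a single host-side vector instead of being distributed across nodes and GPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a distributed cube (NOT the Muesli class).
// All elements are stored in one host-side vector in row-major order.
template <typename T>
class Cube {
public:
    Cube(std::size_t rows, std::size_t cols, std::size_t depth, T init)
        : rows_(rows), cols_(cols), depth_(depth),
          data_(rows * cols * depth, init) {}

    // Calls f(row, col, depth, value) for every element and stores the
    // result back into the same position.
    template <typename F>
    void mapIndexInPlace(F f) {
        for (std::size_t r = 0; r < rows_; ++r)
            for (std::size_t c = 0; c < cols_; ++c)
                for (std::size_t d = 0; d < depth_; ++d) {
                    T& x = data_[(r * cols_ + c) * depth_ + d];
                    x = f(r, c, d, x);
                }
    }

    // Element-wise combination: this[i] = f(this[i], other[i]).
    // The name zipInPlace is an assumption made for this sketch.
    template <typename F>
    void zipInPlace(const Cube& other, F f) {
        assert(data_.size() == other.data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i)
            data_[i] = f(data_[i], other.data_[i]);
    }

    T get(std::size_t r, std::size_t c, std::size_t d) const {
        return data_[(r * cols_ + c) * depth_ + d];
    }

private:
    std::size_t rows_, cols_, depth_;
    std::vector<T> data_;
};

// User functor: adds the three indices to the element value.
struct IndexSum {
    int operator()(std::size_t r, std::size_t c, std::size_t d, int x) const {
        return static_cast<int>(r + c + d) + x;
    }
};

// Builds cubes a (filled with 0) and b (filled with 1), adds the indices
// to every element of a, then adds b to a element by element.
Cube<int> buildExampleCube(std::size_t rows, std::size_t cols, std::size_t depth) {
    Cube<int> a(rows, cols, depth, 0);
    Cube<int> b(rows, cols, depth, 1);
    a.mapIndexInPlace(IndexSum{});
    a.zipInPlace(b, [](int x, int y) { return x + y; });
    return a;
}
```

Under these assumptions, the element of a at position (r, c, d) ends up holding r + c + d + 1. Which of the two cubes the index-adding step is applied to is also an assumption of the sketch.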
4.2 Segmentation of Data Structures
4.3 Work-Load Partitioning
DeviceProperties which, amongst others, state the number of multiprocessors available. To calculate the number of cores, the expression _ConvertSMVer2Cores(props.major, props.minor) * props.multiProcessorCount has to be used, as the number of cores per multiprocessor depends on the version of the GPU. Nevertheless, this reference number yields a good approximation of the maximum possible parallelism. In the future, this number might additionally be weighted by the version of the GPU in order to prefer newer GPUs. Besides splitting the workload between multiple GPUs, the fraction that is calculated by the CPU has to be chosen automatically by the library. The experimental results section evaluates which partition is reasonable for different skeletons, determining good default values for different calculation patterns.
5 Experimental Results
Type | Number Nodes | GPU-type (per node) | GPUs (per node) | CPU-type (per node) | CPUs (per node) |
---|---|---|---|---|---|
Local | 1 | Quadro K620 | 1 | Intel(R) Core(TM) i7-4790 CPU 8 cores | 1 |
 | | GeForce GTX 750 Ti | 1 | | |
Cluster | 2 | GeForce RTX 2080 Ti | 4 | Zen3 (EPYC 7513) 24 cores | 1 |
5.1 CPU Usage on the PC
size (\(^3\)) | Run-time Seq. | Run-time OpenMP | Run-time GPU | Run-time Opt. Mix | CPU % of Opt. Mix | Speedup Seq. | Speedup OpenMP | Speedup GPU |
---|---|---|---|---|---|---|---|---|
50 | 13.09 | 7.47 | 1.09 | 0.94 | 0.12 | 13.98 | 7.98 | 1.16 |
60 | 22.60 | 12.87 | 1.64 | 0.75 | 0.04 | 30.12 | 17.15 | 2.19 |
70 | 35.88 | 20.38 | 2.47 | 0.83 | 0.04 | 43.42 | 24.66 | 2.99 |
80 | 53.65 | 30.71 | 3.44 | 0.76 | 0.02 | 70.14 | 40.15 | 4.50 |
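The speedup columns appear to relate each variant's run-time to the run-time of the optimal CPU/GPU mix. This reading is an assumption inferred from the tabulated values, and the symbols \(S\) and \(T\) are introduced here only for illustration. For the row with size 60:

```latex
\[
S_{\text{variant}} = \frac{T_{\text{variant}}}{T_{\text{Opt. Mix}}}, \qquad
S_{\text{Seq.}} = \frac{22.60}{0.75} \approx 30.13, \quad
S_{\text{OpenMP}} = \frac{12.87}{0.75} = 17.16, \quad
S_{\text{GPU}} = \frac{1.64}{0.75} \approx 2.19
\]
```

The small deviations from the tabulated 30.12 and 17.15 are presumably due to the run-times being rounded to two decimals before they were printed.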
5.2 CPU Usage on HPC Machine
5.3 Multi-Node and Multi-GPU on the PC
size (\(^3\)) | 1 Node Seq. | 1 Node OpenMP | 1 Node 1 GPU | 1 Node 2 GPUs | 2 Nodes 2 GPUs | Optimal | Speedup |
---|---|---|---|---|---|---|---|
45 | 9.24 | 5.50 | 0.11 | 0.12 | 0.10 | 0.10 | 93.94 |
55 | 16.88 | 10.02 | 0.63 | 0.36 | 0.33 | 0.33 | 51.89 |
65 | 27.96 | 16.55 | 0.67 | 0.67 | 0.34 | 0.34 | 82.75 |