1 Introduction
- We introduce workload partitioning implementations for all data parallel skeletons in SkePU 2, capable of dividing the work between an arbitrary number of CPU cores and accelerators.
- We present a way to automatically tune the workload partitioning of the skeletons, by using performance benchmarking.
- We show that our hybrid execution implementation can improve the execution time compared to the already existing backends of SkePU, without any changes to the application code.
- We show that the performance of our solution is faster and more robust than the experimental StarPU-based hybrid execution implementation from SkePU 1, by porting its implementation to SkePU 2.
2 SkePU 2
- Sequential CPU backend, mainly used as a reference implementation and baseline.
- OpenMP backend, for multi-core CPUs.
- CUDA backend, for NVIDIA GPUs, both single and multiple.
- OpenCL backend, for single and multiple GPUs of any OpenCL supported model, including other accelerators such as Intel Xeon Phis.
- Map, a generic element-wise data parallel skeleton working on vectors and matrices. It has support for element-wise arguments of arbitrary number, as well as random access and uniform user function arguments.
- Reduce, generic reduction operation with a single binary, associative operator. Available for vectors and row-wise or column-wise matrix reduction.
- MapReduce, an efficient combination of a Map skeleton followed by Reduce.
- MapOverlap, stencil operation in one or two dimensions with support for different edge handling schemes.
- Scan, a generalize prefix sum with a binary, associative operator.
- Call, a generic, multi-variant component.
3 StarPU
4 Workload partitioning and implementation
4.1 StarPU backend implementation
5 Auto-tuning
6 Performance evaluation
6.1 Single skeleton evaluation
6.2 Generic application evaluation
Application | Algorithm | Skeletons |
---|---|---|
CMA | Cumulative moving average | Map\({<}{1}{>}\), Scan |
Coulombic | Coulombic potential | Map\({<}{1}{>}\) |
Dotproduct | Dot product | MapReduce\({<}{2}{>}\) |
Mandelbrot | Mandelbrot fractal | Map\({<}{0}{>}\) |
PPMCC | Pearson product-moment correlation coeff. | Reduce, MapReduce\({<}{1}{>}\), MapReduce\({<}{2}{>}\) |
PSNR | Peak signal to noise ratio | Map\({<}{2}{>}\), MapReduce\({<} {2}{>}\) |
Taylor | Taylor series expansion of \(\log (1 + x)\) | MapReduce\({<} {0}{>}\) |