1 Introduction
2 Background
2.1 Skeletons
-
The Compose (\(\circ \)) skeleton represents sequential function composition applied to a sequence of inputs, where \(f1 \circ f2\) denotes a sequential composition of two functions, f1 and f2.
-
The Order (;) skeleton represents the execution of two functions on a sequence of inputs, where the execution of the first function needs to be completed for all input values before the execution of the second one can start. f; g, therefore, requires synchronisation between f and g. This skeleton can be used, for example, in the map-reduce like computations, such as the one described in Sect. 5.2, to synchronise between the map and reduce phase.
-
A Farm (\(\varDelta \)) skeleton, \(\varDelta ( nwCPU , nwGPU ,f,x)\), represents the application of a single function, f, to the sequence of independent inputs,\(x_1,x_2,x_3,\ldots ,x_n\), in parallel. In the farm implementation that we consider, a specific number of worker threads is created, and the inputs are assigned to these worker threads in a round-robin fashion. Here nwCPU/nwGPU are, respectively, the number of worker threads executed on CPUs/GPUs.
-
The Pipeline (\(\parallel \)) skeleton applies the composition of the functions\(f_1, f_2, \ldots , f_n\), in parallel to a sequence of independent inputs \(x_1,x_2,\ldots ,x_m\), where the output of \(f_i\) is the input to \(f_{i+1}\). Parallelism arises from the fact that \(f_i(x_j)\) can be computed in parallel with \(f_{i+1}(f_i(x_{j-1}))\). In the implementation that we consider, a separate thread is assigned to each pipeline stage (function \(f_i\)). We denote the pipeline skeleton by \((f_1 \parallel f_2 \parallel \dots \parallel f_n)(x) \). Note that the pipeline skeleton does not accept the number of workers as an input, because a pipeline stage is always executed in one thread. If we want multiple threads to execute a single pipeline stage, to parallelise processing of items from its inputs, we compose it with the farm skeleton.
2.2 Refactoring
3 Programming Heterogeneous Parallel Machines
read
_image
and rprocess
_image
, on a stream of input files, imageFiles
. Components are the functions read
_image
and process
_image
, and the tasks are applications of these functions to the elements of the array, imageFiles
. We might only have a CPU implementation of the read
_image
function, and both CPU and GPU implementations (kernels) of the process
_image
function. Using the notation from Sect. 2, we can denote this by \(r \circ p\), where r is read
_image
function, p is process
_image
function, and \(\circ \) is the sequential composition.read
_image
on one image takes 0.2 ms, running CPU version of process
_image
on one image takes 6.6 ms and running GPU version of process
_image
on one image takes 0.08 ms.ff
_farm
and ff
_pipeline
) including the number of CPU and GPU workers for the farm skeletons, readFarm
and processFarm
. These worker parameters are taken directly from the output of Stage 4.4 Deriving Mappings Using Monte Carlo Tree Search
4.1 Adaptation of the MCTS Technique to the Static Mapping Problem
CPU cores | GPUs | App components | Sol. size | Time for eval. (s) |
---|---|---|---|---|
16 | 1 | 4 | 1240 | > 86,400 |
24 | 2 | 4 | 6624 | > 604,800 |
64 | 2 | 4 | 129,204 | > 3,628,8000 |
4.2 MCTS parameters
5 Case Studies
5.1 Image Convolution
5.1.1 Configurations and Cost-Model Filtering
Configuration | Est. runtime |
---|---|
\( r \circ p \) | 5.60 |
\(r \parallel p\) | 3.88 |
\( {\varvec{\varDelta }}\mathbf {(r)} {\varvec{\parallel }} \mathbf {p}\) | 1.60 |
\( r \parallel \varDelta (p)\) | 4.00 |
\({\varvec{\varDelta }}\mathbf {(r)} {\varvec{\parallel }} {\varvec{\varDelta }}\mathbf {(p)}\) | 0.40 |
\({\varvec{\varDelta }}\mathbf {(r} {\varvec{\parallel }} \mathbf {p)}\) | 0.56 |
\(\varDelta (r \circ p)\) | 5.60 |
\(\varDelta (r) \circ \varDelta (p)\) | 2.00 |
\(\varDelta (r) \circ p\) | 2.00 |
\(r \circ \varDelta (p)\) | 5.60 |
5.1.2 Optimal Static Mappings Determined by MCTS
\( \Delta (r)||\Delta (p)\) | \( \Delta (r)||p \) | \( \Delta (r||p) \) | |
---|---|---|---|
Mapping (C,G) | \( (6,0)||(0,3) \) | \( (4,0)||(0,1) \) | (5, 5) |
5.1.3 Evaluation of Skeleton Configurations
5.2 Ant Colony Optimisation
5.2.1 Configurations and Cost-Model Filtering
5.2.2 Optimal Static Mapping Determined by MCTS
\(\varDelta (s) ; p ; u\) | |
---|---|
Mapping (C,G) | (9, 5) |
5.2.3 Evaluation of Skeleton Configurations
5.3 Molecular Dynamics
5.3.1 Configurations and Cost-Model Filtering
5.3.2 Optimal Static Mapping Using MCTS
\( \varDelta (r \circ h)\) | |
---|---|
Mapping (CPU, GPU) | (22, 1) |