1 Introduction
2 Related work
3 Stencil properties
4 Problem formulation
5 Performance and energy models
Platform | CPU | GPU |
---|---|---|
Xeon E5-2670@2.60GHz (pJ) | Kepler K20m (pJ) | |
\(e^{op}\)
| 327 | 54 |
\(e^{byte}\)
| 1700 | 324 |
6 Optimal model
6.1 Multiplicity
6.2 Incidence
6.3 Maximum degree and maximum multiplicity
6.4 Chromatic index
6.5 ILP solution
-
(Edge to slot \(x_{e,p,k}\)) The binary variable for all \(e \in E\) and \((p,k) \in \mathcal {P} \times \mathcal {P}\) equals 1 if and only if edge e is mapped to slot (p, k), and 0 otherwise,
-
(Edge to colour \(y_{e,c}\)) For all \(e \in K_{n}\) and \(c \in C\), where \(C = \{0,\ldots ,\Delta (G) + \mu (G)-1\}\), the binary variable equals 1 if edge e receives colour c in \(M_{p}\), and 0 otherwise. The most simple choice for the number of potential colours to colour multigraph is |E|. However, we can choose a smaller set based on [30, 31] that for any multigraph \(G=(V,E)\) the chromatic index is \(\chi '(G) \le \Delta (G) + \mu (G)\),
-
(Colour is used \(z_{c}\)) For all \(c \in C\) the binary variable equals to 1 if a colour c is used in the edge colouring of \(M_{p}\) and 0 otherwise,
-
(Number of grid cells \(c_{p}\)) This integer variable depicts for each processor p the number of allocated grid cells,
-
(Processor idle time \(t_{p}^{idle}\)) This variable for each processor p with the state l represents the idle time.
-
(Total execution time \(t^{t}\)) This variable indicates how much time it takes to finish the whole workload,
-
(Energy used for communication \(e^{e}\)) This variable represents the total energy used for the inter-processor communication.
-
(Map edge to single slot) Each edge \(e \in E\) must be mapped to exactly one slot,$$\begin{aligned} \underset{(p,k) \in \mathcal {P} \times \mathcal {P}}{\sum }x_{e,p,k} = 1 \end{aligned}$$(9)
-
(Restrict slots) Mapping edge uv to slot (p, k) restricts the slots to which edges in \(\delta (uv)\) can be mapped. Edges in \(\delta ^{+}(u)\) must start in p and edges in \(\delta ^{-}(u)\) must end there. Likewise, edges in \(\delta ^{+}(v)\) must start in k and edges in \(\delta ^{-}(v)\) must end there:$$\begin{aligned} \underset{k \in \mathcal {P}}{\sum }x_{uv,p,k} - \underset{k \in \mathcal {P}}{\sum }x_{f,p,k}= 0 \end{aligned}$$(10)$$\begin{aligned} \underset{k \in \mathcal {P}}{\sum }x_{uv,p,k} - \underset{k \in \mathcal {P}}{\sum }x_{f,k,p}= 0 \end{aligned}$$(11)$$\begin{aligned} \underset{k \in \mathcal {P}}{\sum }x_{uv,k,p} - \underset{k \in \mathcal {P}}{\sum }x_{f,p,k}= 0 \end{aligned}$$(12)These constraints are for all \(p \in \mathcal {P}\) and \(uv \in E\), where \(f \in \delta ^{+}(u)\) is for (10), \(f \in \delta ^{-}(u)\) is for (11), \(f \in \delta ^{+}(v)\) is for (12), \(f \in \delta ^{-}(v)\) is for (13),$$\begin{aligned} \underset{k \in \mathcal {P}}{\sum }x_{uv,k,p} - \underset{k \in \mathcal {P}}{\sum }x_{f,k,p}= 0 \end{aligned}$$(13)
-
(Control the number of grid cells) This constraint controls the number of the grid cells allocated for each \(p \in P\). The sum of grid cells mapped to processor p is given byfor all \(p \in P\),$$\begin{aligned} \underset{uv \in E}{\sum }\underset{k \in \mathcal {P}}{\sum } (w_{u} / deg_{u} * x_{uv,p,k} + w_{v} / deg_{v} * x_{uv,k,p}) \le c_{p} \end{aligned}$$(14)
-
(Number of colours not less than multiplicity) This constraint requires that each edge in \(M_{p}\) receives at least as many colours as its multiplicity demands. Each edge models time required to exchange single grid cell between processors p and k:$$\begin{aligned} \underset{uv \in E}{\sum } \left( x_{uv,p,k} * t_{p,k}^{e} * d_{uv} + x_{uv,k,p} * t_{k,p}^{e} * d_{uv}\right) \le \underset{c \in C}{y_{p,k,c}} \end{aligned}$$(15)
-
(Incident edges receive different colours) This requires that incident edges do not receive the same colour in the edge colouring of \(M_{p}\) and that an edge can only receive a colour that is used:$$\begin{aligned} \underset{k \ne p, k \in \mathcal {P}}{\sum } y_{p,k,c} \le z_{c} \end{aligned}$$(16)
-
(Restrict memory capacity for each processor) This constraint restricts the number of grid cells allocated for each processor p:$$\begin{aligned} c_{p} \le m_{p} \end{aligned}$$(17)
-
(Control energy used for communication) The sum of energy used for the inter-processor communication is depicted as$$\begin{aligned} \underset{uv \in E}{\sum } \underset{p \ne k, (p,k) \in \mathcal {P} \times \mathcal {P}}{\sum } x_{uv,p,k} * d_{uv} * e_{p,k}^{e} \le e^{e} \end{aligned}$$(18)
-
(Control execution time) These two constraints calculate the total execution time \(t^{t}\) using the maximum value from the computation and the communication time. As described in Sect. 4 the computation and the communication are done in parallel:$$\begin{aligned} t_{u,p,l}^{c} * c_{p} \le t^{t} \end{aligned}$$(19)$$\begin{aligned} \underset{c \in C}{\sum } z_{c} \le t^{t} \end{aligned}$$(20)
-
(Control processor’s idle time) This constraint controls the idle time for all \(p \in \mathcal {P}\):$$\begin{aligned} t^{t} - t_{u,p,l}^{c} * c_{p} \le t_{p}^{idle} \end{aligned}$$(21)
-
(Deadline) This inequality restricts the execution time:$$\begin{aligned} t^{t} \le t^{d} \end{aligned}$$(22)
ILP | |
---|---|
x variables |
\(|E \times \mathcal {P} \times \mathcal {P}|\)
|
y variables |
\(|C \times \mathcal {P} \times \mathcal {P}|\)
|
z variables | |C| |
x constraints | |E| |
y constraints |
\(|\mathcal {P} \times \mathcal {P}|\)
|
z constraints |
\(|\mathcal {P}|\)
|
7 Heuristics
-
minimize the energy usage,
-
load balance of the tasks to meet the deadline.
7.1 Simple
7.1.1 Balancing load
7.2 Advanced
7.2.1 Minimize degree
7.2.2 Minimize multicut
7.2.3 Accumulate neighbours
8 Experimental studies
8.1 Simulation setup
Name | #Blocks | #Edges | Block size | Grid size |
---|---|---|---|---|
Cuboid | 128 | 608 | 262144 | 512x256x256 |
Sphere | 128 | 704 | 262144 | 512x256x256 |
Parameter | CPU | GPU |
---|---|---|
\(t_{p,k}^{e}\)
|
\(1 \times 10^{-9} s\)
|
\(1 \times 10^{-9} s\)
|
\(e_{p,k}^{e}\)
|
\(1.36 \times 10^{-7} J\)
|
\(1.36 \times 10^{-7} J\)
|
\(t_{u,p,l}^{c}\)
|
\(8.33 \times 10^{-10} s\)
|
\(1.06 \times 10^{-10} s\)
|
\(e_{u,p,l}^{c}\)
|
\(2.9 \times 10^{-8} J\)
|
\(5.5 \times 10^{-9} J\)
|
\(P_{p}^{idle}\)
| 10 W | 30 W |
\(P0_{u,p,l}\)
| 90 W | 74 W |
Arch. |
\(t^{d}\)[ms] | #Col. | E [J] |
\(e^{c}\) (%) |
\(e^{e}\) (%) |
\(t^{t}\) [ms] |
---|---|---|---|---|---|---|
CPU–CPU | 27.90 | 0 | 3.76 | 100 | 0 | 27.90 |
26.97 | 20 | 4.86 | 77 | 23 | 26.81 | |
14.00 | 32 | 5.27 | 66 | 34 | 13.95 | |
CPU-GPU | 3.58 | 0 | 0.48 | 100 | 0 | 3.58 |
3.46 | 20 | 1.70 | 35 | 65 | 3.44 | |
3.34 | 28 | 2.23 | 30 | 70 | 3.33 | |
3.22 | 34 | 2.65 | 29 | 71 | 3.22 | |
2xCPU–2xGPU | 3.58 | 0 | 0.62 | 100 | 0 | 3.58 |
3.46 | 20 | 1.73 | 36 | 64 | 3.44 | |
3.34 | 28 | 2.16 | 28 | 72 | 3.33 | |
3.22 | 32 | 2.26 | 21 | 79 | 1.79 |
Arch. |
\(t^{d}\)[ms] | #Col. | E [J] |
\(e^{c}\) (%) |
\(e^{e}\) (%) |
\(t^{t}\) [ms] |
---|---|---|---|---|---|---|
CPU–CPU | 26.97 | 30 | 5.41 | 69 | 31 | 26.81 |
25.92 | 40 | 5.94 | 63 | 37 | 25.28 | |
25.02 | 48 | 6.37 | 58 | 42 | 24.41 | |
24.13 | 56 | 6.78 | 54 | 46 | 22.67 | |
14,00 | 64 | 7.05 | 49 | 51 | 13.95 | |
CPU–GPU | 3.60 | 0 | 0.48 | 100 | 0 | 3.58 |
3.46 | 30 | 2.26 | 26 | 74 | 3.44 | |
3.35 | 38 | 2.79 | 24 | 76 | 3.33 | |
3.23 | 46 | 3.32 | 23 | 77 | 3.22 | |
2xCPU-2xGPU | 3.58 | 0 | 0.62 | 100 | 0 | 3.58 |
3.46 | 30 | 2.28 | 27 | 73 | 3.44 | |
3.34 | 38 | 2.72 | 22 | 78 | 3.33 | |
3.22 | 46 | 3.16 | 19 | 81 | 3.22 | |
3.11 | 54 | 3.60 | 16 | 84 | 3.11 | |
2.99 | 56 | 3.69 | 16 | 84 | 2.91 | |
2.66 | 64 | 4.05 | 12 | 88 | 1.79 |
8.2 Simulation results
Algorithm | #Edg. | #Col. | E [J]/gap (%) |
\(t^{t}\) [ms]/gap (%) |
---|---|---|---|---|
Alg_1-RD | 1224 | 306 | 20.53/289.49 | 13.95/0.00 |
Alg_1-IJK | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_1-JIK | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_1-KIJ | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_2 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_3 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.1 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.2 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.3 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.4 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.5 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.6 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.7 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.8 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-0.9 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Alg_4-1.0 | 256 | 64 |
\(\mathbf{7.05/33.81}\)
| 13.95 / 0.00 |
Algorithm | #Edg. | #Col. | E [J]/gap (%) |
\(t^{t}\) [ms]/gap (%) |
---|---|---|---|---|
Alg_1-RD | 504 | 126 | 7.85/195.85 | 3.48/8.32 |
Alg_1-IJK | 192 | 48 | 3.51/32.29 | 3.48/8.32 |
Alg_1-JIK | 160 | 40 |
3.06/15.52
| 3.48/8.32 |
Alg_1-KIJ | 160 | 40 |
3.06/15.52
| 3.48/8.32 |
Alg_2 | 192 | 48 | 3.51/32.29 | 3.48/8.32 |
Alg_3 | 192 | 48 | 3.51/32.29 | 3.48/8.32 |
Alg_4-0.1 | 224 | 56 | 3.96/49.07 | 3.48/8.32 |
Alg_4-0.2 | 200 | 50 | 3.62/36.49 | 3.48/8.32 |
Alg_4-0.3 | 200 | 50 | 3.62/36.49 | 3.48/8.32 |
Alg_4-0.4 | 200 | 50 | 3.62/36.49 | 3.48/8.32 |
Alg_4-0.5 | 200 | 50 | 3.62/36.49 | 3.48/8.32 |
Alg_4-0.6 | 200 | 50 | 3.62/36.49 | 3.48/8.32 |
Alg_4-0.7 | 192 | 48 | 3.51/32.29 | 3.48/8.32 |
Alg_4-0.8 | 192 | 48 | 3.51/32.29 | 3.48/8.32 |
Alg_4-0.9 | 176 | 44 | 3.29/23.90 | 3.48/8.32 |
Alg_4-1.0 | 160 | 40 |
3.06/15.52
| 3.48/8.32 |
Algorithm | #Edg. | #Col. | E [J]/gap (%) |
\(t^{t}\) [ms]/gap (%) |
---|---|---|---|---|
Alg_1-RD | 520 | 318 | 22.00/870.23 | 1.74/\(-2.68\)
|
Alg_1-IJK | 576 | 128 | 8.86/290.69 | 1.74/\(-2.68\)
|
Alg_1-JIK | 480 | 112 | 7.52/231.75 | 1.74/\(-2.68\)
|
Alg_1-KIJ | 480 | 112 | 7.52/231.75 | 1.74/\(-2.68\)
|
Alg_2 | 576 | 128 | 8.86/290.69 | 1.74/\(-2.68\)
|
Alg_3 | 576 | 128 | 8.86/290.69 | 1.74/\(-2.68\)
|
Alg_4-0.1 | 568 | 116 | 8.75/285.77 | 1.74/\(-2.68\)
|
Alg_4-0.2 | 504 | 92 | 7.85/246.48 | 1.74/\(-2.68\)
|
Alg_4-0.3 | 584 | 124 | 8.97/295.60 | 1.74/\(-2.68\)
|
Alg_4-0.4 | 480 | 100 | 7.52/231.75 | 1,74/\(-2.68\)
|
Alg_4-0.5 | 480 | 100 | 7.52/231.75 | 1.74/\(-2.68\)
|
Alg_4-0.6 | 480 | 100 | 7.52/231.75 | 1.74/\(-2.68\)
|
Alg_4-0.7 | 472 | 98 | 7.41/226.84 | 1.74/\(-2.68\)
|
Alg_4-0.8 | 456 | 92 | 7.19/217.01 | 1.74/\(-2.68\)
|
Alg_4-0.9 | 432 | 84 |
6.85/202.28
| 1.74/\(-2.68\)
|
Alg_4-1.0 | 432 | 84 |
6.85/202.28
| 1.74/\(-2.68\)
|
Algorithm | #Edg. | #Col. | E [J]/gap(%) |
\(t^{t}\) [ms]/gap (%) |
---|---|---|---|---|
Alg_1-RD | 1408 | 352 | 23.09/227.39 | 13.95/0.00 |
Alg_1-IJK | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_1-JIK | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_1-KIJ | 512 | 128 | 10.62/50.53 | 13.95/0.00 |
Alg_2 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_3 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.1 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.2 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.3 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.4 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.5 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.6 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.7 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.8 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-0.9 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Alg_4-1.0 | 256 | 64 |
7.05/0.00
| 13.95/0.00 |
Algorithm | #Edg. | #Col. | E [J]/gap (%) |
\(t^{t}\) [ms]/gap (%) |
---|---|---|---|---|
Alg_1-RD | 576 | 144 | 9.53/186.63 | 3.48/8.32 |
Alg_1-IJK | 256 | 64 | 4.40/32.50 | 3.48/8.32 |
Alg_1-JIK | 192 | 48 |
3.51/5.70
| 3.48/8.32 |
Alg_1-KIJ | 320 | 80 | 5.29/59.31 | 3.48/8.32 |
Alg_2 | 256 | 64 | 4.40/32.50 | 3.48/8.32 |
Alg_3 | 256 | 64 | 4.40/32.50 | 3.48/8.32 |
Alg_4-0.1 | 296 | 74 | 4.96/49.25 | 3.48/8.32 |
Alg_4-0.2 | 304 | 76 | 5.07/52.61 | 3.48/8.32 |
Alg_4-0.3 | 304 | 76 | 5.07/52.61 | 3.48/8.32 |
Alg_4-0.4 | 312 | 78 | 5.18/55.96 | 3.48/8.32 |
Alg_4-0.5 | 320 | 80 | 5.29/59.31 | 3.48/8.32 |
Alg_4-0.6 | 328 | 82 | 5.40/62.66 | 3.48/8.32 |
Alg_4-0.7 | 336 | 84 | 5.51/66.01 | 3.48/8.32 |
Alg_4-0.8 | 336 | 84 | 5.51/66.01 | 3.48/8.32 |
Alg_4-0.9 | 320 | 80 | 5.29/59.31 | 3.48/8.32 |
Alg_4-1.0 | 320 | 80 | 5.29/59.31 | 3.48/8.32 |
Algorithm | #Edg. | #Col. | E [J]/gap (%) |
\(t^{t}\) [ms]/gap (%) |
---|---|---|---|---|
Alg_1-RD | 1664 | 350 | 24.01/492.79 | 1.74/\(-2.68\)
|
Alg_1-IJK | 704 | 160 | 10.64/162.77 | 1.74/\(-2.68\)
|
Alg_1-JIK | 704 | 160 | 10.64/162.77 | 1.74/\(-2.68\)
|
Alg_1-KIJ | 800 | 760 | 11.98/195.77 | 1.74/\(-2.68\)
|
Alg_2 | 704 | 160 | 10.64/162.77 | 1.74/\(-2.68\)
|
Alg_3 | 704 | 160 | 10.64/162.77 | 1.74/\(-2.68\)
|
Alg_4-0.1 | 744 | 164 | 11.20/176.52 | 1.74/\(-2.68\)
|
Alg_4-0.2 | 728 | 158 | 10.97/171.02 | 1.74/\(-2.68\)
|
Alg_4-0.3 | 712 | 152 | 10.75/165.52 | 1.74/\(-2.68\)
|
Alg_4-0.4 | 712 | 156 | 10.75/165.52 | 1.74/\(-2.68\)
|
Alg_4-0.5 | 720 | 158 | 10.86/168.27 | 1.74/\(-2.68\)
|
Alg_4-0.6 | 720 | 158 | 10.86/168.27 | 1.74/\(-2.68\)
|
Alg_4-0.7 | 704 | 158 | 10.64/162.77 | 1.74/\(-2.68\)
|
Alg_4-0.8 | 704 | 158 | 10.64/162.77 | 1.74/\(-2.68\)
|
Alg_4-0.9 | 672 | 144 |
10.19/151.77
| 1.74/\(-2.68\)
|
Alg_4-1.0 | 672 | 144 |
10.19/151.77
| 1.74/\(-2.68\)
|
ILP | Alg1 | Alg2 | Alg3 | Alg4 | |
---|---|---|---|---|---|
Cuboid | 3118\({\times }0^{6}\)
| 32 | 78 | 52 | 55 |
Sphere | 281836\({\times }10^{6}\)
| 33 | 98 | 57 | 65 |
8.3 Verification of energy model
ILP | Alg1 | Alg2 | Alg3 | Alg4 | |
---|---|---|---|---|---|
\(E_{real}\)
| 2.29 | 7.23 | 8.56 | 6.56 | 8.56 |
\(E_{model}\)
| 2.27 | 7.53 | 8.86 | 6.86 | 8.86 |
ILP | Alg1 | Alg2 | Alg3 | Alg4 | |
---|---|---|---|---|---|
\(E_{model}/E_{real}\)
| 99 | 104 | 103 | 105 | 103 |