1 Introduction
The main contributions of this paper are:

- the design and implementation of the CHAOS parallelization scheme for training CNNs on the Intel Xeon Phi,
- performance modeling of our parallel solution for training CNNs on the Intel Xeon Phi,
- a measurement-based empirical evaluation of the CHAOS parallelization scheme,
- a model-based performance evaluation for future architectures of the Intel Xeon Phi.
2 Related work
2.1 Machine learning targeting Intel Xeon Phi
2.2 Related work targeting CNNs
3 Background
3.1 Neural networks
3.1.1 Deep neural networks
3.1.2 Forward propagation
3.1.3 Back-propagation
3.1.4 Convolutional neural networks
3.2 Parallel systems accelerated with Intel® Xeon Phi™
4 Our parallelization scheme for training convolutional neural networks on Intel Xeon Phi
4.1 Design aspects
- Thread parallelism: An overview of our parallelization scheme is depicted in Fig. 3; a code sketch follows this list. Initially, one network instance is created per available thread; the instances share the weight parameters, whereas the variables needed for concurrent processing of images are private to each thread. After the CNNs and the images have been initialized, training starts. An epoch comprises three major steps: Training, Validation, and Testing. In the Training step, each worker picks an image, forward propagates it through the network, calculates the error, and back-propagates the partial derivatives, adjusting the weight parameters. Because each worker simply picks the next image from the shared set, fast workers do not have to wait for significantly slower ones. After Training, each worker participates in Validation and Testing, evaluating the prediction accuracy of the network on the validation and test sets, respectively. The adoption of data parallelism was inspired by Krizhevsky [24], who promotes data parallelism for convolutional layers because they are computationally intensive.
- Controlled HogWild: During back-propagation, the shared weights are updated after each layer's computations (inspired by [27]), whereas the local weight parameters are updated instantly (inspired by [41]); that is, the gradients are calculated locally first and then shared with the other workers. The update of the shared weights can be performed at any time, so no worker needs to wait for other workers to finish their updates. This technique of non-instant, yet not significantly delayed, updates of the weight parameters allows us to avoid unnecessary cache-line invalidations and memory writes.
- Arbitrary order of synchronization: No explicit synchronization is needed, because all workers share the weight parameters. However, an implicit synchronization is performed in an arbitrary order, since writes are governed by a first-come, first-served schedule and reads are performed on demand.
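The following minimal C++/OpenMP sketch illustrates how the three aspects fit together: each worker holds a private copy of the layer objects but aliases one shared weight vector (thread parallelism), computes gradients locally and pushes them per layer into the shared weights without locking (controlled HogWild), so writes land in whatever order the workers arrive (arbitrary order of synchronization). All identifiers are ours and the propagation bodies are elided; this illustrates the scheme, not the actual CHAOS source code.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

struct Image { std::vector<float> pixels; int label; };

// One layer: every worker aliases the same shared weight storage, but each
// worker first accumulates its gradients in a private buffer.
struct Layer {
    std::vector<float>* shared_w = nullptr; // weights shared by all workers
    std::size_t offset = 0, size = 0;       // this layer's slice of the weights
    std::vector<float> local_g;             // per-worker gradient buffer

    void backward_local() {
        // Compute this layer's partial derivatives into local_g (omitted).
    }
    void push_update(float lr) {
        // Controlled HogWild: lock-free, non-instant update of the shared
        // weights, one layer at a time, in first-come, first-served order.
        for (std::size_t j = 0; j < size && j < local_g.size(); ++j)
            (*shared_w)[offset + j] -= lr * local_g[j];
    }
};

void train_epoch(std::vector<Layer> net,            // copied per thread below
                 const std::vector<Image>& train_set, float lr) {
    std::atomic<std::size_t> next{0};               // shared image cursor
#pragma omp parallel firstprivate(net)              // private network instance,
    {                                               // shared weight storage
        std::size_t i;
        while ((i = next.fetch_add(1)) < train_set.size()) {
            // Forward propagation and error computation omitted:
            // net.forward(train_set[i]); net.compute_error(train_set[i]);
            for (auto it = net.rbegin(); it != net.rend(); ++it) {
                it->backward_local();   // local gradients first ...
                it->push_update(lr);    // ... then shared, without waiting
            }
        }
    } // implicit barrier; Validation and Testing proceed analogously
}
```

Note that the atomic counter serializes only the image selection, never the weight updates, which is what allows slow workers to lag behind without blocking fast ones.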
4.2 Implementation aspects
Layer type | Forward propagation (s) | Back-propagation (s) | % of total
---|---|---|---
Fully connected | 40.9 | 30.9 | 1.4
Convolutional | 3,241 | 1,438 | 93.7
Max pooling | 188.3 | 8.2 | 3.9
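The profile above shows that the convolutional layers account for 93.7% of the training time, which is why CHAOS focuses its parallelization effort there. To see why, consider the nested-loop structure of a naive forward pass through one convolutional layer (a generic sketch with our own naming, not the tuned implementation; for the first layer of the small network in Sect. 5.1, inMaps = 1, outMaps = 5, k = 4):

```cpp
// Naive forward propagation of one convolutional layer, illustrating the
// nested-loop structure that dominates training time. Layouts and names
// are illustrative assumptions: 'in' holds inMaps maps of inSize x inSize,
// 'out' holds outMaps maps of outSize x outSize (outSize = inSize - k + 1),
// and each output map has inMaps k x k kernels plus one bias in 'w'.
void conv_forward(const float* in,  int inMaps,  int inSize,
                  float*       out, int outMaps, int outSize,
                  const float* w,   int k)
{
    for (int m = 0; m < outMaps; ++m)             // output feature maps
        for (int y = 0; y < outSize; ++y)
            for (int x = 0; x < outSize; ++x) {
                const float* wm = w + m * (inMaps * k * k + 1);
                float sum = wm[inMaps * k * k];   // bias term
                for (int c = 0; c < inMaps; ++c)  // input maps
                    for (int ky = 0; ky < k; ++ky)
                        for (int kx = 0; kx < k; ++kx)
                            sum += wm[(c * k + ky) * k + kx] *
                                   in[(c * inSize + y + ky) * inSize + (x + kx)];
                out[(m * outSize + y) * outSize + x] = sum; // activation applied later
            }
}
```

Six nested loops per image, repeated for every image in every epoch, make this the dominant cost and the natural target for vectorization across the innermost dimensions.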
5 Evaluation
5.1 Experimental setup
Network | Layer type | Maps | Map size | Neurons | Kernel size | Weights
---|---|---|---|---|---|---
Large | Input | – | \(29\times 29\) | 841 | – | –
 | Convolutional | 20 | \(26\times 26\) | 13,520 | \(4\times 4\) | 340
 | Max-pooling | 20 | \(26\times 26\) | 13,520 | \(1\times 1\) | –
 | Convolutional | 60 | \(22\times 22\) | 29,040 | \(5\times 5\) | 30,060
 | Max-pooling | 60 | \(11\times 11\) | 7,260 | \(2\times 2\) | –
 | Convolutional | 100 | \(6\times 6\) | 3,600 | \(6\times 6\) | 216,100
 | Max-pooling | 100 | \(2\times 2\) | 900 | \(3\times 3\) | –
 | Fully connected | – | 150 | 150 | – | 135,150
 | Output | – | 10 | 10 | – | 1,510
Medium | Input | – | \(29\times 29\) | 841 | – | –
 | Convolutional | 20 | \(26\times 26\) | 13,520 | \(4\times 4\) | 340
 | Max-pooling | 20 | \(13\times 13\) | 3,380 | \(2\times 2\) | –
 | Convolutional | 40 | \(9\times 9\) | 3,240 | \(5\times 5\) | 20,040
 | Max-pooling | 40 | \(3\times 3\) | 360 | \(3\times 3\) | –
 | Fully connected | – | 150 | 150 | – | 54,150
 | Output | – | 10 | 10 | – | 1,510
Small | Input | – | \(29\times 29\) | 841 | – | –
 | Convolutional | 5 | \(26\times 26\) | 3,380 | \(4\times 4\) | 85
 | Max-pooling | 5 | \(13\times 13\) | 845 | \(2\times 2\) | –
 | Convolutional | 10 | \(9\times 9\) | 810 | \(5\times 5\) | 1,260
 | Max-pooling | 10 | \(3\times 3\) | 90 | \(3\times 3\) | –
 | Fully connected | – | 50 | 50 | – | 4,550
 | Output | – | 10 | 10 | – | 510
5.2 Performance model
Variable | Values | Explanation
---|---|---
Parameters | |
\(p\) | 1–3,840 | Number of processing units/threads
\(i\) | 60,000 | Number of training/validation images
\(it\) | 10,000 | Number of test images
\(ep\) | 70 (small, medium), 15 (large) | Number of epochs
Constants (hardware dependent) | |
CPI | 1–2 threads: 1; 3 threads: 1.5; 4 threads: 2 | Best theoretical CPI/thread
\(s\) | 1.238 GHz | Clock speed of a processing unit
OperationFactor | 15 | Operation factor
Measured (hardware dependent) | |
MemoryContention | See Table 4 | Memory contention
\(T_{\mathrm{Fprop}}^+\) | Small: 1.45; Medium: 12.55; Large: 148.88 | Forward propagation / image (ms)
\(T_{\mathrm{Bprop}}^+\) | Small: 5.3; Medium: 69.73; Large: 859.19 | Back-propagation / image (ms)
\(T_{\mathrm{Prep}}^+\) | Small: 12.56; Medium: 12.7; Large: 13.5 | Time for preparations (s)
Calculated (hardware independent) | |
FProp\(^*\) | Small: 58,000; Medium: 559,000; Large: 5,349,000 | # FProp operations / image
BProp\(^*\) | Small: 524,000; Medium: 6,119,000; Large: 73,178,000 | # BProp operations / image
Prep\(^*\) | Small: \(10^9\); Medium: \(10^{10}\); Large: \(10^{11}\) | # Operations carried out for preparations
# Threads | Small | Medium | Large
---|---|---|---
1 | \(7.10\times 10^{-6}\) | \(1.56\times 10^{-4}\) | \(8.83\times 10^{-4}\)
15 | \(6.40\times 10^{-4}\) | \(2.00\times 10^{-3}\) | \(8.75\times 10^{-3}\)
30 | \(1.36\times 10^{-3}\) | \(3.97\times 10^{-3}\) | \(1.67\times 10^{-2}\)
60 | \(3.07\times 10^{-3}\) | \(8.03\times 10^{-3}\) | \(3.22\times 10^{-2}\)
120 | \(6.76\times 10^{-3}\) | \(1.65\times 10^{-2}\) | \(6.74\times 10^{-2}\)
180 | \(9.95\times 10^{-3}\) | \(2.50\times 10^{-2}\) | \(1.00\times 10^{-1}\)
240 | \(1.40\times 10^{-2}\) | \(3.83\times 10^{-2}\) | \(1.38\times 10^{-1}\)
480\(^*\) | \(2.78\times 10^{-2}\) | \(7.31\times 10^{-2}\) | \(2.73\times 10^{-1}\)
960\(^*\) | \(5.60\times 10^{-2}\) | \(1.47\times 10^{-1}\) | \(5.46\times 10^{-1}\)
1,920\(^*\) | \(1.12\times 10^{-1}\) | \(2.95\times 10^{-1}\) | 1.09
3,840\(^*\) | \(2.25\times 10^{-1}\) | \(5.91\times 10^{-1}\) | 2.19

\(^*\) Extrapolated: thread counts beyond the 240 hardware threads of the evaluated Xeon Phi
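The paper defines the exact model equations; as an illustration of how the parameters in the two tables above might combine, the following sketch shows one plausible reading (our assumption, not the authors' formula): operation counts are converted to cycles via the per-thread CPI, to seconds via the 1.238 GHz clock, scaled by the OperationFactor, with memory contention added per image and the image workload divided among the \(p\) threads.

```cpp
#include <cstdio>

// Hedged sketch of a performance model assembled from the parameters in
// the tables above; the structure is our illustrative reading, not the
// paper's exact formula.
struct ModelParams {
    double fpropOps, bpropOps; // operations per image (calculated)
    double cpi;                // best theoretical CPI per thread
    double contention;         // memory contention per image (measured)
};

double epoch_seconds(const ModelParams& m, double images, double p) {
    const double clock = 1.238e9;  // cycles per second (1.238 GHz)
    const double opFactor = 15.0;  // OperationFactor from the table
    double perImage = (m.fpropOps + m.bpropOps) * m.cpi / clock * opFactor
                      + m.contention;
    return images * perImage / p;  // images are spread across p threads
}

int main() {
    // Large network on a single thread (CPI 1, contention 8.83e-4):
    ModelParams large{5.349e6, 73.178e6, 1.0, 8.83e-4};
    // The per-image cost evaluates to roughly 0.95 s, in the range of the
    // measured T_Fprop + T_Bprop of about 1.01 s per image, so the
    // parameter set is at least internally consistent.
    std::printf("predicted epoch time: %.0f s\n",
                epoch_seconds(large, 60000.0, 1.0));
    return 0;
}
```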
5.3 Results
 | BPF\(^{\mathrm{a}}\) (s) | BPF (%) | BPC\(^{\mathrm{b}}\) (s) | BPC (%) | FPC\(^{\mathrm{c}}\) (s) | FPC (%) | FPF\(^{\mathrm{d}}\) (s) | FPF (%)
---|---|---|---|---|---|---|---|---
Phi Par. 244 T | 7.8 | 1.36 | 506.2 | 88.48 | 54.7 | 9.56 | 0.23 | 0.04
Phi Par. 240 T | 8.1 | 1.34 | 532.2 | 88.45 | 87.8 | 9.61 | 0.24 | 0.04
Phi Par. 180 T | 9.0 | 1.41 | 557.9 | 87.78 | 64.8 | 10.20 | 0.26 | 0.04
Phi Par. 120 T | 11.3 | 1.63 | 598.4 | 86.82 | 75.4 | 10.94 | 0.28 | 0.04
Phi Par. 60 T | 19.5 | 1.91 | 877.7 | 86.19 | 114.4 | 11.23 | 0.47 | 0.05
Phi Par. 30 T | 34.7 | 1.71 | 1,749 | 86.36 | 228.3 | 11.27 | 0.94 | 0.05
Phi Par. 15 T | 60.8 | 1.50 | 3,495 | 86.52 | 456.9 | 11.31 | 1.90 | 0.05
Phi Par. 1 T | 836.7 | 1.38 | 52,387 | 86.60 | 6,859 | 11.34 | 29.75 | 0.05
Xeon E5 Seq. | 30.2 | 0.19 | 7,097 | 44.51 | 8,714 | 54.66 | 17.04 | 0.11

\(^{\mathrm{a}}\) Back-propagation, fully connected layers; \(^{\mathrm{b}}\) back-propagation, convolutional layers; \(^{\mathrm{c}}\) forward propagation, convolutional layers; \(^{\mathrm{d}}\) forward propagation, fully connected layers
 | BPC-S\(^{\mathrm{a}}\) | BPC-M\(^{\mathrm{a}}\) | BPC-L\(^{\mathrm{a}}\) | FPC-S\(^{\mathrm{b}}\) | FPC-M\(^{\mathrm{b}}\) | FPC-L\(^{\mathrm{b}}\)
---|---|---|---|---|---|---
Phi Par. 244 T | 102.0 | 99.3 | 103.5 | 122.3 | 124.2 | 125.4
Phi Par. 240 T | 96.5 | 94.1 | 98.4 | 114.3 | 117.3 | 118.7
Phi Par. 180 T | 91.8 | 89.5 | 93.9 | 106.3 | 107.0 | 105.8
Phi Par. 120 T | 82.7 | 82.4 | 87.5 | 91.0 | 91.0 | 91.0
Phi Par. 60 T | 56.9 | 58.9 | 59.7 | 58.6 | 60.1 | 60.0
Phi Par. 30 T | 29.2 | 29.6 | 29.9 | 29.8 | 30.2 | 30.1
Phi Par. 15 T | 14.7 | 14.8 | 15.0 | 14.9 | 15.1 | 15.0

\(^{\mathrm{a}}\) Speedup of back-propagation in the convolutional layers over one Xeon Phi thread (S/M/L = small/medium/large CNN); \(^{\mathrm{b}}\) speedup of forward propagation in the convolutional layers
# of Phi threads | Validation S (Tot/Diff) | Validation M (Tot/Diff) | Validation L (Tot/Diff) | Test S (Tot/Diff) | Test M (Tot/Diff) | Test L (Tot/Diff)
---|---|---|---|---|---|---
244 | 616 / 4 | 85 / 1 | 12 / 2 | 155 / 2 | 98 / 3 | 95 / 1
240 | 610 / −2 | 86 / 2 | 11 / 1 | 154 / 1 | 95 / 0 | 91 / −3
180 | 595 / \(\underline{-17}\) | 87 / 3 | 12 / 2 | 158 / 5 | 98 / 3 | 95 / 1
120 | 607 / −5 | 83 / −1 | 11 / 1 | 159 / 6 | 95 / 0 | 94 / 0
60 | 615 / 3 | 81 / \(\underline{-3}\) | 11 / 1 | 156 / 3 | 98 / 3 | 91 / −3
30 | 612 / 0 | 83 / −1 | 10 / 0 | 156 / 3 | 98 / 3 | 90 / −5
15 | 617 / 5 | 84 / 0 | 10 / 0 | 153 / 0 | 100 / 5 | 84 / \(\underline{-10}\)

Tot = number of incorrectly predicted images; Diff = difference relative to the sequential execution (negative values mean fewer errors; the largest decreases are underlined); S/M/L = small/medium/large CNN
# Threads | 480 | 960 | 1,920 | 3,840
---|---|---|---|---
Small CNN | 6.6 | 5.4 | 4.9 | 4.6
Medium CNN | 36.8 | 23.9 | 17.4 | 14.2
Large CNN | 92.9 | 60.8 | 44.8 | 36.8
\(i^{\mathrm{a}}\) | \(it^{\mathrm{b}}\) | 240 T, 70 ep | 240 T, 140 ep | 240 T, 280 ep | 240 T, 560 ep | 480 T, 70 ep | 480 T, 140 ep | 480 T, 280 ep | 480 T, 560 ep
---|---|---|---|---|---|---|---|---|---
60k | 10k | 8.9 | 17.6 | 35.0 | 69.7 | 6.6 | 12.9 | 25.6 | 51.1
120k | 20k | 17.6 | 35.0 | 69.7 | 139.3 | 12.9 | 25.6 | 51.1 | 101.9
240k | 40k | 35.0 | 69.7 | 139.3 | 278.3 | 25.6 | 51.1 | 101.9 | 203.6

\(^{\mathrm{a}}\) Number of training/validation images; \(^{\mathrm{b}}\) number of test images
6 Summary and future work
- the CHAOS parallel implementation scales well as the number of threads increases;
- the convolutional layers are the most computationally expensive part of CNN training; for instance, with 240 threads, 88% of the time is spent on the back-propagation of the convolutional layers;
- using CHAOS for training CNNs on the Intel Xeon Phi, we achieved speedups of up to \(103\times\), \(14\times\), and \(58\times\) compared to the single-thread performance on the Intel Xeon Phi, the Intel Xeon E5 CPU, and the Intel Core i5, respectively;
- the image classification accuracy of the CHAOS parallel implementation is comparable to that of the sequential execution;
- the predicted execution times obtained from our performance model match the measured execution times well;
- the results of the performance model indicate that CHAOS scales well beyond the 240 hardware threads of the Intel Xeon Phi used for experimentation in this paper.