31.03.2017  Ausgabe 10/2017 Open Access
Modelling parallel overhead from simple runtime records
 Zeitschrift:
 The Journal of Supercomputing > Ausgabe 10/2017
Wichtige Hinweise
This study is based on work supported by the VSC Research Center funded by the Austrian Federal Ministry of Science, Research and Economy (bmwfw). The computational results presented have been achieved using the Vienna Scientific Cluster (VSC).
1 Introduction
In times of ever increasing computing power the number of critical components affecting scalability and efficiency of a particular application is also growing rapidly. Of particular importance are the increasing dynamic nature of highperformance computing (HPC) systems [
1], the necessity for a scalable yet robust implementation of the target problem using modern parallelization paradigms [
2], the achievement of cacheoptimal performance at the single node level [
3] and a straightforward way to accurately monitor and analyse the extent to which individual system/software components do condition the overall system. It is important to raise awareness for these critical issues also at the nonspecialist user level, where a great number of people nowadays is making routine use of HPC resources in order to gain new insights and drive forward exciting activities.
Assessment of parallel performance and overhead has been extensively studied in the past [
4–
12]. Starting with the logP model [
4,
5], four parameters, i.e. latency, overhead, gap (reciprocal communication bandwidth) and number of processors, could be accounted for and a given algorithm be analysed, respectively. A key design goal was to find a balance between overly simplistic and overly specific models. Application to MPI [
6] and several extensions respecting large messages [
6,
7] and contention effects [
8] have been described. A more abstract framework with tuneable complexity but still practical timing requirements has been provided with PERC [
9]. More recent trends in hybrid MPI/OpenMP programming were taken care of by a combination of application signature with system profiles [
10]. Along similar lines applicationcentric performance modelling [
11,
13] was described based on characteristics of the application and the target computing platform with the objective of successful largescale extrapolation. Similar predictions could also be made with the help of runtime functions within the SUIF infrastructure [
12].
Anzeige
Recently, the cost of computation has become cheap in relation to communication [
14]; thus, in order to make an algorithm scalable, the overhead due to communication must be reduced to a minimum [
15]. While several powerful tools for quantifying communication overhead have been developed in the past [
16–
20], their routine use by the general HPC practitioner must still be considered far from standard practice. Consequently, it would be nice to have available a quick and simple method to estimate the extent of communication overhead without the need for additional interference with the software/system layer (e.g. without recompiling, switching on profiling flags, linking to additional libraries). Ideally, such a method should be easy to adopt by any HPC user interested in the subject. In the following we aim to outline the basics and practical details of exactly such an approach.
Table 1
HPL: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\pm \Delta \tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)


(s)

(s)

(s)


\(^*1\)

1092139.0

0.0

0.0

0.000

4

243000.0

14550.0

2215.3

0.060

8

115000.0

5293.8

1557.3

0.046

16

75300.0

6988.1

2260.3

0.093

32

35200.0

4111.6

1184.4

0.117

48

27500.0

6202.5

1735.2

0.226

64

19300.0

3571.2

928.4

0.185

96

13725.0

3525.9

2313.6

0.257

128

11300.0

3400.8

668.8

0.301

160

9210.0

2963.2

434.2

0.322

192

6970.0

1749.6

360.2

0.251

256

5670.0

1738.2

309.8

0.307

320

4680.0

1455.4

294.4

0.311

384

3920.0

1204.7

231.6

0.307

640

2900.0

1213.8

185.0

0.419

768

2450.0

995.3

170.0

0.406

896

2390.0

1112.4

172.0

0.465

1024

2090.0

960.0

145.2

0.459

1296

1670.0

707.5

140.3

0.424

1520

1870.3

1084.4

122.5

0.580

Table 2
GROMACS: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\pm \Delta \tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)


(s)

(s)

(s)


1

5317.0

0.0

0.0

0.000

2

2690.0

39.5

15.3

0.015

4

1380.0

50.7

19.1

0.037

8

730.0

43.5

10.4

0.060

16

388.0

35.6

1.4

0.092

32

251.0

68.0

55.3

0.271

48

146.0

26.6

17.6

0.182

64

134.0

42.9

28.3

0.320

80

105.0

30.4

15.9

0.290

96

121.0

52.9

12.1

0.437

112

99.1

44.6

16.5

0.451

128

78.7

30.8

14.3

0.391

160

66.6

27.3

8.3

0.410

192

98.4

61.3

6.0

0.623

224

54.0

25.4

8.3

0.470

256

53.7

27.9

7.5

0.519

288

116.0

76.0

10.5

0.655

320

63.1

40.2

5.1

0.638

336

45.6

23.8

3.5

0.522

384

47.0

27.6

3.3

0.588

416

99.4

80.6

4.1

0.811

448

44.6

27.0

5.1

0.606

512

88.1

73.5

3.5

0.834

592

43.7

29.7

1.7

0.679

688

42.0

29.4

1.7

0.699

Table 3
AMBER: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\pm \Delta \tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)



(s)

(s)

(s)


1

4602.0

[4623.0]

0.0

[ 0.0]

0.0

[0.0]

0.000

[0.000]

2

2350.0

[2362.3]

40.6

[ 33.1]

9.0

[–]

0.017

[0.014]

4

1230.0

[1245.5]

51.8

[ 59.8]

23.3

[–]

0.042

[0.048]

8

633.1

[ 634.4]

35.5

[ 34.3]

17.7

[–]

0.056

[0.054]

12

484.1

[ 410.9]

46.4

[ 41.5]

16.6

[–]

0.096

[0.101]

16

383.0

[ 378.3]

51.1

[ 43.5]

9.0

[–]

0.133

[0.115]

24

296.0

[ 244.6]

69.9

[ 50.1]

31.4

[–]

0.236

[0.205]

32

280.0

[ 281.2]

101.6

[100.4]

28.9

[–]

0.363

[0.357]

40

207.0

[ 183.5]

66.1

[ 59.3]

22.8

[–]

0.319

[0.323]

48

233.0

[ 202.9]

104.7

[ 79.5]

30.8

[–]

0.449

[0.392]

64

200.0

[ 168.0]

99.0

[ 71.6]

28.0

[–]

0.495

[0.426]

80

167.0

[ 149.1]

84.2

[ 69.0]

21.6

[–]

0.504

[0.463]

96

139.0

[ 136.7]

69.9

[ 67.1]

23.9

[–]

0.503

[0.491]

112

142.0

[ 123.4]

79.4

[ 62.9]

21.8

[–]

0.559

[0.510]

128

125.0

[ 116.4]

70.5

[ 62.2]

20.7

[–]

0.564

[0.534]

160

116.0

[ 103.7]

69.4

[ 58.2]

17.3

[–]

0.598

[0.561]

192

113.0

[ 94.9]

72.4

[ 55.6]

17.1

[–]

0.641

[0.586]

224

112.0

[ 91.4]

75.6

[ 55.9]

16.5

[–]

0.675

[0.612]

256

127.0

[ 92.1]

90.9

[ 59.6]

14.1

[–]

0.716

[0.647]

288

132.0

[ 90.6]

98.2

[ 60.7]

12.8

[–]

0.744

[0.670]

320

128.0

[ 96.5]

95.2

[ 66.9]

10.5

[–]

0.744

[0.693]

352

146.0

[ 111.3]

112.7

[ 82.8]

7.9

[–]

0.772

[0.744]

416

131.0

[ 117.0]

101.9

[ 89.6]

7.1

[–]

0.778

[0.766]

512

124.0

[ 102.1]

99.3

[ 79.7]

6.8

[–]

0.801

[0.781]

Table 4
InHouseDev: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\pm \Delta \tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)


(s)

(s)

(s)


1

2124.0

0.0

0.0

0.000

2

1290.0

172.0

0.0

0.133

4

554.0

181.7

5.8

0.328

8

356.0

196.6

8.2

0.552

16

287.0

212.9

6.3

0.742

32

239.0

203.2

3.7

0.850

64

220.0

201.9

2.7

0.918

128

215.0

206.1

1.6

0.959

Table 5
VASP: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)


(s)

(s)


1

5375.4

0.0

0.000

2

2664.9

48.0

0.018

4

1424.9

42.8

0.030

8

889.8

42.7

0.048

16

452.9

30.8

0.068

32

261.1

46.2

0.177

48

208.0

52.4

0.252

64

159.7

48.7

0.305

80

145.0

51.5

0.355

96

126.3

44.7

0.354

112

164.0

88.1

0.537

128

107.4

47.3

0.440

160

111.1

55.4

0.499

192

109.3

60.1

0.550

224

103.4

60.6

0.586

256

104.8

66.5

0.635

512

82.3

59.2

0.719

768

86.3

69.6

0.806

1024

79.2

65.0

0.821

Table 6
QUANTUM ESPRESSO: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)


(s)

(s)


1

5531.3

0.0

0.000

2

3085.6

216.0

0.070

4

1784.2

269.4

0.151

8

1281.7

362.7

0.283

16

793.6

388.9

0.490

32

356.3

170.0

0.477

48

294.3

165.4

0.562

64

229.2

132.7

0.579

80

203.7

123.0

0.604

96

194.5

125.1

0.643

112

186.3

126.7

0.680

128

156.7

100.8

0.643

160

155.9

110.2

0.707

192

157.1

119.1

0.758

224

153.5

120.0

0.782

256

166.8

136.4

0.818

288

151.6

123.5

0.815

320

155.8

129.5

0.831

352

154.3

129.8

0.841

416

164.8

142.2

0.863

512

161.6

141.1

0.873

768

185.8

168.5

0.907

1024

228.3

212.3

0.930

Table 7
LAMMPS: ExeTimes,
\(t_n\), and MPIOverhead,
\(\tau _n^{MPI}\)
n

\(t_n\)

\(\tau _n^{MPI}\)

\(\frac{\tau _n^{MPI}}{t_n}\)


(s)

(s)


1

4501.0

0.0

0.000

2

2432.6

73.0

0.030

4

1298.9

85.7

0.066

8

702.7

97.0

0.138

16

409.7

100.8

0.246

32

290.9

128.9

0.443

48

261.8

147.9

0.565

64

246.4

157.4

0.639

80

332.9

257.0

0.772

96

294.9

226.8

0.769

112

319.0

260.3

0.816

128

300.8

247.9

0.824

160

356.2

309.2

0.868

192

360.8

319.3

0.885

224

335.5

295.2

0.880

256

308.5

276.1

0.895

288

318.7

288.1

0.904

320

330.9

303.8

0.918

352

389.0

362.9

0.933

416

362.8

338.5

0.933

512

355.6

333.6

0.938

768

396.9

378.2

0.953

1024

423.5

402.3

0.950

2 Basic model
We begin our investigation with the selection of a set of scientific applications frequently used on HPC platforms. They are,
Realistic problems are defined and computed in parallel on increasing numbers of cores using MPI [
2] as the communication protocol. Only strong scaling is considered, i.e. constant problem size computed in shorter times with increasing numbers of processing elements (cores). Times to solution,
\(t_n\), are recorded as a function of numbers of involved cores,
n, and results are summarized in Tables
1,
2,
3,
4,
5,
6 and
7 (columns 1, 2). In addition, the time spent in MPI calls,
\(\tau _n^{MPI}\), is also recorded and included in Tables
1,
2,
3,
4,
5,
6 and
7 (column 3). Two different tools are used to measure MPI times, in particular
mpiP [
17] and
allinea/MAP [
18]. The time records obtained from both tools are largely identical as demonstrated by the example of AMBER (see Table
3).
\(\tau _n^{MPI}\) assessment based on
mpiP analysis (Tables
1,
2,
3,
4) yields individual MPI timings on a pertask basis; hence, averages need to be formed for the
n different tasks of a particular sample run. Because individual MPI times do vary considerably, it was also of interest to compute the variance of
\(\tau _n^{MPI}\) and its corresponding standard deviation,
\(\pm \Delta \tau _n^{MPI}\) (see Tables
1,
2,
3,
4, column 4). Given the diversity of the applications and their markedly different characteristics in terms of parallel scalability, it is not obvious to identify common trends in the introduced parallel overhead. However, what appears to be a rather general signature of all applications is the rather smooth development of the quotient between parallel overhead and total runtime,
\(\tau _n^{MPI}/t_n\), which is graphically illustrated in Fig.
1 (also see final column in Tables
1,
2,
3,
4,
5,
6,
7). All data sets can be approximated by the following simple expression in two adjustable parameters,
b and
c,
$$\begin{aligned} \frac{\tau _n^{MPI}}{t_n} = \frac{b}{c + 1}  \frac{b}{c + n} \end{aligned}$$
(1)
2.1 Generalization
So far we have been very imprecise in the use of the term “parallel overhead” and frequently replaced it with “communication overhead” or
\(\tau _n^{MPI}\), etc. In general, we consider every incremental time fragment emerging within a parallel algorithm
parallel overhead if it is in excess of the serial algorithm required to solve exactly the same type of problem. Typically this will include [
15],
Measuring parallel overhead is not a trivial matter [
29–
31]. A conventional view is that to a large extent it is all covered by communication overhead. In fact, if we review the above list we see that tasklevel recording of individual MPI times (as done here) will either explicitly or implicitly include almost all of the incurred parallel overhead. Moreover, since our primary interest is in providing an approximate estimate, we shall consider
\(\tau _n^{MPI}\) to be a sufficiently accurate measure of the total parallel overhead and adopt the notation
\(\tau _n\) for the latter throughout the remainder of this article. Estimating
\(\tau _n\) will now help to (i) raise awareness that a particular application may be significantly affected by parallel overhead, (ii) facilitate
a posteriori assessment of various applications reporting times to solution,
\(t_n\), as a function of numbers of cores,
n, (iii) identify optimal runtime conditions on a given parallel architecture.

time to interchange data

time to synchronize individual parallel tasks

extra computing time due to code sections arising only in the parallel algorithm

computing time penalties due to load balancing issues

computing time penalties due to inhomogeneous conditions between individual components of the parallel machine [ 1]
Anzeige
2.2 Solving for \(\tau _n\)
In the following we shall build on the model established in Eq. (
1) and try to isolate a closed form for the parallel overhead,
\(\tau _n\), thereof. Starting with
we can formally decompose the time to solution,
into an ideal time to solution,
\(t_n^{\hbox {Amdahl}}\), and an associated parallel overhead,
\(\tau _n\). As already implied by the superscript, the initial term is given from the classic Amdahl relation [
32–
35],
where
\(t_1\) denotes the single core execution time, and
\(f_s\) its serial fraction that cannot be run in parallel. It follows from Eqs. (
2) and (
3) that we can isolate an expression for the parallel overhead, in particular,
and thus again using Eq. (
3) describe the total time to solution,
as a multiplicative extension to the original proposal of Amdahl [
32–
35].
$$\begin{aligned} \tau _n = t_n \left( \frac{b}{c + 1}  \frac{b}{c + n} \right) \end{aligned}$$
(2)
$$\begin{aligned} t_n = t_n^{\hbox {Amdahl}} + \tau _n \end{aligned}$$
(3)
$$\begin{aligned} t_n^{\hbox {Amdahl}} = f_s t_1 + \frac{(1  f_s)t_1}{n} \end{aligned}$$
(4)
$$\begin{aligned} \tau _n = \frac{t_n^{\hbox {Amdahl}} b(n  1)}{(1 + c b)n + (b + c +c^2)} \end{aligned}$$
(5)
$$\begin{aligned} t_n = t_n^{\hbox {Amdahl}} \left[ 1 + \frac{b(n  1)}{(1 + c b)n + (b + c +c^2)} \right] \end{aligned}$$
(6)
×
3 Results
3.1 HPC systems used
All test applications examined here—except the inhouse developed code—were run on the
Vienna Scientific Cluster, version 3 (VSC3) [
36]. VSC3 consists of 2020 compute nodes, all of them equipped with dual socket 8 core Intel Xeon CPUs (E52650v2, 2.6 GHz, Ivy Bridge) and interconnected by a dualrail Infiniband QDR80 network. Standard node memory is 64 GB; optionally available are nodes with 128 or 256 GB. The system features a rather unconventional cooling infrastructure, i.e.
Liquid Immersion Cooling [
37], where hardware components are fully immersed in mineral oil.
The inhouse developed code was run on VSC2, another HPC installation consisting of 1314 compute nodes with 2 CPUs of type AMD Opteron 6132 HE (2.2 GHz, 8 core) and again interconnected via an Infiniband QDR fabric. Standard nodes on VSC2 provide 32 GB RAM.
3.2 Parallel overhead determined from runtime records
The simplest type of performance analysis for a particular application is to record execution times for increasing numbers of cores operating in parallel. This is also the most relevant type of analysis because it is based on exactly that type of executable that will be used later on in the production stage. Thus, no alterations to the binary have to be made for the purpose of analysing the code, for example introduction of internal timers, instrumentation due to profiling/debugging, inclusion of event counters, library wrappers; and all observed execution times do directly reflect the most natural runtime behaviour of the application taken into account.
×
×
×
×
×
×
×
Applying Eq. (
6) to exactly such a simple record of just execution times,
\(t_n\), for varying numbers of cores,
n, should result in the derivation of parameters,
b and
c, which in turn can be plugged into Eq. (
5) to yield approximate estimates for the corresponding parallel overhead,
\(\tau _n\). The latter is of fundamental interest, for both additional development and practical deployment at optimal runtime conditions. An example of such an approach is given in Fig.
2. The application considered was HPL [
21], and the underlying data are collected in columns 1–2 of Table
1. Experimental runtimes (brown triangles) are reproduced fairly well from a fit using Eq. (
6). The obtained curve is shown as the cyan line in Fig.
2. Parameters
b and
c obtained from the fit are then applied in Eq. (
5) to determine an estimate for the parallel overhead, and the corresponding graph is shown as the orange line in Fig.
2. Since in this particular case we have available experimentally derived values for
\(\tau _n\) (Table
1, columns 3–4), a direct comparison can be made between calculated and measured results (compare brown triangles with error bars to the orange curve in Fig.
2). Apart from an initial region of general uncertainty (see dimension of the error bars at small numbers of cores), the agreement between predicted and experimental values is rather good. A consequence of all of this is a significant deviation from Amdahl’s Law [
32–
35] starting already at modest numbers of cores (compare grey line with cyan curve in Fig.
2).
Additional tests were carried out for the rest of the applications, and corresponding results are graphically summarized in Figs.
3,
4,
5,
6,
7, and
8. It should be noted that the scale on both axes had to be changed considerably between different applications in order to emphasize their specific characteristics in terms of scaling and overhead times. From this it also becomes clear that the approach is rather general and can be applied to a wide range of diverse applications in identical fashion. As can be seen from Figs.
3,
4,
5,
6,
7, and
8, general results remain the same for all applications considered. However, remarkable specific differences arise upon closer examination. For example, GROMACS [
22] exhibits a strange pattern of zigzaglike execution times that is closely paralleled by the overhead times (see Fig.
3). This may indicate restricted ability to split the problem into parallel tasks of arbitrary size. Obviously, fitting such a data set can only deliver a best compromise for
\(\tau _n\). In contrast, the general evolution of sample AMBER [
23] appears to be smooth (Fig.
4). Similar to all the other examples, it is interesting to see how quickly
\(\tau _n\) is becoming the dominant factor and how steadily standard deviations to
\(\tau _n\) do decrease with increasing numbers of cores.
The immediate impression of the inhouse developed code [
27,
28] is that here we certainly face the least optimized application (Fig.
5). However, it is still interesting to observe that the proposed method for predicting parallel overhead remains applicable even in such cases. Here a saturation domain is reached quickly because of a strongly rising parallel overhead. Standard deviations of
\(\tau _n\) are remarkably small. Owing to the implemented communication model of primary/secondary tasks, standard deviations will start to make sense only for runs involving more than two tasks (see Table
4, fourth column). Moreover, averages and related properties will comprise only the group of secondary tasks of dimension
\(n1\).
Smooth trends are seen in sample VASP [
24] with again
\(\tau _n\) quickly becoming the determining factor (Fig.
6). In contrast, a rather pronounced inversion in
\(t_n\) evolution is observed in both of the final two samples, QUANTUM ESPRESSO [
25] and LAMMPS [
26] (Figs.
7,
8). Interestingly, fitted curves do still lead to reasonably good approximations of
\(\tau _n\) demonstrating the versatility and broad applicability of the approach.
3.3 Fitting with GNUPLOT
Care must be taken when fitting the data set and the following remarks may prove useful when reproducing our results. All our fits have been obtained with the help of package GNUPLOT [
38]. Two cases may be distinguished depending on whether or not the serial fraction,
\(f_s\), is known in detail. In the majority of cases
\(f_s\) in not known and cannot be determined accurately in a quick and straightforward way. However, treating it as another entirely free parameter is also discouraged because it may rapidly turn into a negatively signed number due to overfitting. A working procedure includes the following steps,
In so doing, the formerly unknown value of
\(f_s\) can be obtained as a byproduct. It should be pointed out that occasionally dropping a couple of very large initial data points for small values of
n was necessary to obtain a reasonable approximation in the limit of large
n. Graphical control was the most important guiding principle all throughout the fitting process.

define an explicit value for \(f_s\) (either known or guessed)

graphically check the quality of the fit and aim at asymptotic standard errors in the range of 10–30%

make sure that \(c>b\) and try to have \(b+\Delta b \approx c\), where \(\Delta b\) is the reported asymptotic standard error

incrementally decrease \(f_s\) and repeat the above steps until an optimal fit is obtained
4 Conclusion
A simple procedure is presented that allows an approximate estimation of parallel overhead solely based on runtime records. The method exhibits a broad range of applicability including welloptimized applications as well as less advanced implementations where code optimization is still in progress (compare for example Fig.
2 with Fig.
5). Asymptotic limits show a rather smooth trend and thus facilitate reasonable approximations in the limit of large
n. Specifics of a particular HPC installation do not seem to play a significant role since two entirely different systems were employed here and led to similar conclusions.
Acknowledgements
Open access funding provided by TU Wien (TUW).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.