Published in: Topics in Parallel and Distributed Computing
Parallel execution time is expected to decrease as the number of processors increases. We show in this chapter that this is not as easy as it seems, even for perfectly parallel applications. In particular, processors are subject to faults. The more processors are available, the more likely faults will strike during execution. The main strategy to cope with faults in High Performance Computing is checkpointing. We introduce the reader to this approach, and explain how to determine the optimal checkpointing period through scheduling techniques. We also detail how to combine checkpointing with prediction and with replication.
It is interesting to point out why the value of \(T_{\text{FO}}\) given by Eq. (8) is a first-order approximation, even for large jobs. Indeed, there are several restrictions for the approach to be valid:
We have stated that the expected number of faults during execution is \(N_{\text{faults}} = \frac{\textsc{Time}_{\text{final}}}{\mu}\), and that the expected time lost due to a fault is \(T_{\text{lost}} = \frac{T}{2} + D + R\). Both statements are true individually, but the expectation of a product is the product of the expectations only if the random variables are independent, which is not the case here because \(\textsc{Time}_{\text{final}}\) depends upon the fault inter-arrival times.
In Eq. (4), we have to enforce \(C \leq T\) in order to have \(\textsc{Waste}_{\text{FF}} \leq 1\).
In Eq. (5), we have to enforce \(D + R + \frac{T}{2} \leq \mu\) in order to have \(\textsc{Waste}_{\text{fault}} \leq 1\). We must cap the period to enforce this latter constraint. Intuitively, we need \(\mu\) to be large enough for Eq. (5) to make sense. However, for large-scale platforms, regardless of the value of the individual MTBF \(\mu_{\text{ind}}\), there is always a threshold in the number of components \(p\) above which the platform MTBF, \(\mu = \frac{\mu_{\text{ind}}}{p}\), becomes too small for Eq. (5) to be valid.
Equation (5) is accurate only when two or more faults do not take place within the same period. Although this is unlikely when \(\mu\) is large in front of \(T\), the possible occurrence of many faults during the same period cannot be eliminated.
To ensure that the condition of having at most a single fault per period is met with high probability, we cap the length of the period: we enforce the condition \(T \leq \alpha\mu\), where \(\alpha\) is some tuning parameter chosen as follows. The number of faults during a period of length \(T\) follows a Poisson distribution of parameter \(\beta = \frac{T}{\mu}\). The probability of having \(k \geq 0\) faults is \(P(X=k) = \frac{\beta^{k}}{k!} e^{-\beta}\), where \(X\) is the random variable counting the number of faults. Hence the probability of having two or more faults is \(\pi = P(X \geq 2) = 1 - (P(X=0) + P(X=1)) = 1 - (1+\beta)e^{-\beta}\). If we choose \(\alpha = 0.27\), then \(\pi \leq 0.03\), hence a valid approximation when bounding the period range accordingly. Indeed, with such a conservative value for \(\alpha\), we have overlapping faults for only 3% of the checkpointing segments on average, so the model is quite reliable. For consistency, we also enforce the same type of bound on the checkpoint time, and on the downtime and recovery: \(C \leq \alpha\mu\) and \(D + R \leq \alpha\mu\). However, enforcing these constraints may lead to using a suboptimal period: it may well be the case that the optimal period \(\sqrt{2(\mu - (D+R))C}\) of Eq. (8) does not belong to the admissible interval \([C, \alpha\mu]\). In that case, the waste is minimized for one of the bounds of the admissible interval. This is because, as seen from Eq. (7), the waste is a convex function of the period.
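The choice \(\alpha = 0.27\) is easy to check numerically; here is a one-line sketch (the function name is ours):

```python
import math

def overlap_probability(alpha: float) -> float:
    """Probability of two or more faults in a period of length T = alpha * mu:
    the fault count follows a Poisson distribution of parameter beta = T / mu,
    hence pi = 1 - (1 + beta) * exp(-beta)."""
    beta = alpha
    return 1.0 - (1.0 + beta) * math.exp(-beta)

print(overlap_probability(0.27))   # about 0.0305: at most 3% of the periods see two or more faults
```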
We conclude this discussion on a positive note. While capping the period, and enforcing a lower bound on the MTBF, is mandatory for mathematical rigor, simulations in Aupy et al. [2] show that actual job executions can always use the value from Eq. (8), accounting for multiple faults whenever they occur by re-executing the work until success. The first-order model turns out to be surprisingly robust!
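This robustness can be illustrated with a small Monte Carlo sketch. The simulator below is our own simplification of the model (one global exponential fault clock, re-execution of the whole period after each fault, faults allowed during checkpoints and recoveries but not during downtimes), and all parameter values are illustrative. It compares the average makespan obtained with the first-order period of Eq. (8) against a much shorter and a much longer period:

```python
import math
import random

def makespan(work, T, C, D, R, lam, rng):
    """Simulate one run: chunks of T - C work, each followed by a checkpoint C
    (the last, possibly shorter, chunk is also checkpointed). A fault costs the
    work in progress, a downtime D, and a recovery R (which may itself be hit
    by faults); the interrupted period is then re-executed."""
    time, done = 0.0, 0.0
    next_fault = rng.expovariate(lam)        # memoryless, so one global clock is fine
    while done < work:
        chunk = min(T - C, work - done)
        needed = chunk + C                   # work plus checkpoint, must be fault-free
        if next_fault >= needed:
            time += needed
            next_fault -= needed
            done += chunk
        else:                                # fault: lose the work in progress
            time += next_fault + D
            next_fault = rng.expovariate(lam)
            while next_fault < R:            # faults can strike during recovery too
                time += next_fault + D
                next_fault = rng.expovariate(lam)
            time += R
            next_fault -= R
    return time

rng = random.Random(42)
mu, C, D, R, work = 1000.0, 10.0, 1.0, 5.0, 100000.0
lam = 1.0 / mu
T_fo = math.sqrt(2 * (mu - (D + R)) * C)     # the first-order period of Eq. (8)
for T in (T_fo / 2, T_fo, 2 * T_fo):
    avg = sum(makespan(work, T, C, D, R, lam, rng) for _ in range(20)) / 20
    print(f"T = {T:7.1f}  average makespan = {avg:9.0f}")
```

Running it, the period \(T_{\text{FO}}\) yields the smallest of the three average makespans, even though every run experiences many faults.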
There is a beautiful method to compute the optimal value of \(T_{\text{FO}}\) accurately. First we show how to compute the expected time \(\mathbb{E}(\textsc{Time}(T-C, C, D, R, \lambda))\) to execute a work of duration \(T - C\) followed by a checkpoint of duration \(C\), given the values of \(C\), \(D\), and \(R\), and a fault distribution \(\mathrm{Exp}(\lambda)\). If a fault interrupts a given trial before success, there is a downtime of duration \(D\) followed by a recovery of length \(R\). We assume that faults can strike during checkpoint and recovery, but not during downtime. The result is the following (Proposition 3):
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}(T-C, C, D, R, \lambda)) = e^{\lambda R} \left(\frac{1}{\lambda} + D\right) (e^{\lambda T} - 1). \end{aligned}$$
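As a sanity check, this closed form can be compared against a direct Monte Carlo simulation of the protocol (the parameter values are illustrative and the helper names are ours):

```python
import math
import random

def expected_time(T, C, D, R, lam):
    """Closed form of Proposition 3; T is the full period length,
    i.e. T - C units of work followed by a checkpoint of length C."""
    return math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

def simulate_once(T, D, R, lam, rng):
    """One execution: retry until T fault-free units (work plus checkpoint)
    elapse. Faults may strike during checkpoint and recovery, never during
    downtime, matching the model's assumptions."""
    time = 0.0
    while True:
        x = rng.expovariate(lam)
        if x >= T:                      # the whole period completes fault-free
            return time + T
        time += x + D                   # time lost to the fault, then downtime
        while True:                     # recovery of length R, which may fail too
            y = rng.expovariate(lam)
            if y >= R:
                time += R
                break
            time += y + D

rng = random.Random(0)
T, C, D, R, lam = 100.0, 10.0, 1.0, 5.0, 1.0 / 300.0
trials = 100000
empirical = sum(simulate_once(T, D, R, lam, rng) for _ in range(trials)) / trials
print(empirical, expected_time(T, C, D, R, lam))   # both close to 121
```

Note that the simulation depends on the split of the period into work and checkpoint only through the total length \(T\), exactly as the closed form does.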
For simplification, we write \(\textsc{Time}\) instead of \(\textsc{Time}(T-C, C, D, R, \lambda)\) in the proof below. Consider the following two cases:

(i) Either there is no fault during the execution of the period, and the time needed is exactly \(T\);

(ii) Or there is one fault before successfully completing the period, and some additional delays are incurred. More specifically, as seen for the first-order approximation, there are two sources of delays: the time spent computing by the processors before the fault (accounted for by the variable \(\textsc{Time}_{\text{lost}}\)), and the time spent for downtime and recovery (accounted for by the variable \(\textsc{Time}_{\text{rec}}\)). Once a successful recovery has been completed, there still remain \(T - C\) units of work to execute.

Thus \(\textsc{Time}\) obeys the following recursive equation:
$$\displaystyle \begin{aligned} \textsc{Time} =\left\{ \begin{array}{ll} T & \text{if there is no fault} \\ \textsc{Time}_{\text{lost}} + \textsc{Time}_{\text{rec}} + \textsc{Time} & \text{otherwise} \end{array}\right. {} \end{aligned} $$
(15)

Here, \(\textsc{Time}_{\text{lost}}\) denotes the amount of time spent by the processors before the first fault, knowing that this fault occurs within the next \(T\) units of time. In other terms, it is the time that is wasted because computation and checkpoint were not successfully completed (the corresponding value in Fig. 2 is \(T_{\text{lost}} - D - R\)). \(\textsc{Time}_{\text{rec}}\) represents the amount of time needed by the system to recover from the fault (the corresponding value in Fig. 2 is \(D + R\)).

The expectation of \(\textsc{Time}\) can be computed from Eq. (15) by weighting each case by its probability of occurring:
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}) &= \mathbb{P}\left(\text{no fault}\right) \cdot T + \mathbb{P}\left(\text{a fault strikes}\right) \cdot \mathbb{E}\left(\textsc{Time}_{\text{lost}} + \textsc{Time}_{\text{rec}} + \textsc{Time}\right) \\ &= e^{-\lambda T}\, T + (1 - e^{-\lambda T}) \left( \mathbb{E}(\textsc{Time}_{\text{lost}}) + \mathbb{E}(\textsc{Time}_{\text{rec}}) + \mathbb{E}(\textsc{Time}) \right), \end{aligned} $$
which simplifies into:
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}) = T + (e^{\lambda T} - 1)\left(\mathbb{E}(\textsc{Time}_{\text{lost}}) + \mathbb{E}(\textsc{Time}_{\text{rec}})\right). {} \end{aligned} $$
(16)
We have \(\mathbb{E}(\textsc{Time}_{\text{lost}}) = \int_0^{\infty} x\, \mathbb{P}(X = x \mid X < T)\, dx = \frac{1}{\mathbb{P}(X < T)} \int_0^{T} x\, \lambda e^{-\lambda x}\, dx\), and \(\mathbb{P}(X < T) = 1 - e^{-\lambda T}\). Integrating by parts, we derive that
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}_{\text{lost}}) = \frac{1}{\lambda} - \frac{T}{e^{\lambda T} - 1}. {} \end{aligned} $$
(17)
Next, the reasoning to compute \(\mathbb{E}(\textsc{Time}_{\text{rec}})\) is very similar to that for \(\mathbb{E}(\textsc{Time})\) (note that there can be no fault during \(D\), but there can be some during \(R\)):
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}_{\text{rec}}) = e^{-\lambda R}(D + R) + (1 - e^{-\lambda R})\left(D + \mathbb{E}(R_{\text{lost}}) + \mathbb{E}(\textsc{Time}_{\text{rec}})\right). \end{aligned}$$
Here, \(R_{\text{lost}}\) is the amount of time lost to executing the recovery before a fault happens, knowing that this fault occurs within the next \(R\) units of time. Replacing \(T\) by \(R\) in Eq. (17), we obtain \(\mathbb{E}(R_{\text{lost}}) = \frac{1}{\lambda} - \frac{R}{e^{\lambda R} - 1}\). The expression for \(\mathbb{E}(\textsc{Time}_{\text{rec}})\) simplifies to
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}_{\text{rec}}) = D e^{\lambda R} + \frac{1}{\lambda}(e^{\lambda R} - 1). \end{aligned}$$
Plugging the values of \(\mathbb{E}(\textsc{Time}_{\text{lost}})\) and \(\mathbb{E}(\textsc{Time}_{\text{rec}})\) into Eq. (16) leads to the desired value:
$$\displaystyle \begin{aligned} \mathbb{E}(\textsc{Time}(T-C, C, D, R, \lambda)) = e^{\lambda R} \left(\frac{1}{\lambda} + D\right)(e^{\lambda T} - 1). \end{aligned}$$
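The final simplification can be verified numerically: plugging Eq. (17) and the simplified \(\mathbb{E}(\textsc{Time}_{\text{rec}})\) into Eq. (16) reproduces the closed form exactly (the parameter values below are illustrative):

```python
import math

# Illustrative parameters: period T (work T - C plus checkpoint C),
# downtime D, recovery R, fault rate lam = 1 / mu.
T, D, R, lam = 100.0, 1.0, 5.0, 1.0 / 300.0

e_lost = 1.0 / lam - T / (math.exp(lam * T) - 1.0)                 # Eq. (17)
e_rec = D * math.exp(lam * R) + (math.exp(lam * R) - 1.0) / lam    # simplified E(Time_rec)
lhs = T + (math.exp(lam * T) - 1.0) * (e_lost + e_rec)             # Eq. (16)
rhs = math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)
print(lhs, rhs)                                                    # both close to 121.08
assert abs(lhs - rhs) < 1e-9 * rhs
```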
Proposition 3 is the key to proving that the optimal checkpointing strategy is periodic. Indeed, consider an application of duration \(\textsc{Time}_{\text{base}}\), and divide the execution into periods of different lengths \(T_i\), each with a checkpoint at the end. The expectation of the total execution time is the sum of the expectations of the time needed for each period. Proposition 3 shows that the expected time for a period is a convex function of its length, hence all periods must be equal and \(T_i = T\) for all \(i\).
It remains to find the best number of periods, or equivalently, the size of each work chunk before checkpointing. With \(k\) periods of length \(T = \frac{\textsc{Time}_{\text{base}}}{k}\), we have to minimize a function that depends on \(k\). This is easy for a skilled mathematician who knows the Lambert function \(\mathbb{L}\) (defined by \(\mathbb{L}(z)e^{\mathbb{L}(z)} = z\)). She would find the optimal rational value \(k_{\text{opt}}\) of \(k\) by differentiation, prove that the objective function is convex, and conclude that the optimal value is either \(\lfloor k_{\text{opt}} \rfloor\) or \(\lceil k_{\text{opt}} \rceil\), thereby determining the optimal period \(T_{\text{opt}}\). What if you are not a skilled mathematician? No problem: simply use \(T_{\text{FO}}\) as a first-order approximation, and be comforted that the first-order term in the Taylor expansion of \(T_{\text{opt}}\) is … \(T_{\text{FO}}\)! See Bougeret et al. [3] for all details.
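If you would rather let a computer do the search, here is a sketch (our own, with illustrative parameter values): since the expected makespan is convex in \(k\), a plain scan over integer values of \(k\) finds the optimum, which indeed lands close to \(T_{\text{FO}}\).

```python
import math

def expected_makespan(k, W, C, D, R, lam):
    """Total work W split into k periods; each period executes W/k units of
    work plus a checkpoint C, and its expected duration is given by
    Proposition 3 with T = W/k + C."""
    T = W / k + C
    return k * math.exp(lam * R) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

W, C, D, R, mu = 10000.0, 10.0, 1.0, 5.0, 1000.0
lam = 1.0 / mu
# convex in k, so a scan over a generous range finds the integer optimum
k_opt = min(range(1, 1000), key=lambda k: expected_makespan(k, W, C, D, R, lam))
T_opt = W / k_opt + C
T_fo = math.sqrt(2 * (mu - (D + R)) * C)   # the first-order period of Eq. (8)
print(k_opt, T_opt, T_fo)                  # T_opt and T_fo agree within a few percent
```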
In this section we give another proof of Proposition
1. Interestingly, it applies to any continuous probability distribution with bounded (nonzero) expectation, not just Exponential laws.
First we prove that Eq. (1) does hold true. Consider a single processor, say processor \(p_q\). Let \(X_i\), \(i \geq 0\), denote the IID (independent and identically distributed) random variables for the fault inter-arrival times on \(p_q\), and assume that \(X_i \sim D_X\), where \(D_X\) is a continuous probability distribution with bounded (nonzero) expectation \(\mu_{\text{ind}}\). In particular, \(\mathbb{E}(X_i) = \mu_{\text{ind}}\) for all \(i\). Consider a fixed time bound \(F\). Let \(n_q(F)\) be the number of faults on \(p_q\) until time \(F\). More precisely, the \((n_q(F))\)-th fault is the last one to happen before time \(F\) or at time \(F\), and the \((n_q(F)+1)\)-st fault is the first to happen after time \(F\). By definition of \(n_q(F)\), we have
$$\displaystyle \begin{aligned}\sum_{i=1}^{n_{q}(F)} X_{i} \leq F < \sum_{i=1}^{n_{q}(F)+1} X_{i}.\end{aligned}$$
Using Wald's equation (see the textbook of Ross [16, p. 420]), with \(n_q(F)\) as a stopping criterion, we derive:
$$\displaystyle \begin{aligned}\mathbb{E}\left(n_{q}(F)\right) \mu_{\text{ind}} \leq F \leq \left(\mathbb{E}\left(n_{q}(F)\right) + 1\right) \mu_{\text{ind}},\end{aligned}$$
and we obtain:
$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb{E}\left(n_{q}(F)\right)}{F} = \frac{1}{\mu_{\text{ind}}}. {} \end{aligned} $$
(18)
As promised, Eq. (18) is exactly Eq. (1).
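Eq. (18) is easy to check empirically for a non-Exponential law. The sketch below (our own; the uniform law and the horizon are illustrative) counts the faults of a single processor up to a time bound \(F\):

```python
import random

def count_faults(F, draw, rng):
    """Number of faults in [0, F] when inter-arrival times are IID samples
    returned by draw(rng)."""
    t, n = draw(rng), 0
    while t <= F:
        n += 1
        t += draw(rng)
    return n

rng = random.Random(1)
mu_ind = 10.0
# any continuous law with expectation mu_ind works; here a uniform law on [0, 20]
draw = lambda r: r.uniform(0.0, 2.0 * mu_ind)
F, trials = 10000.0, 200
avg = sum(count_faults(F, draw, rng) for _ in range(trials)) / trials
print(avg / F, 1.0 / mu_ind)   # both close to 0.1, as Eq. (18) predicts
```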
Now consider a platform with \(p\) identical processors, whose fault inter-arrival times are IID random variables that follow the distribution \(D_X\). Unfortunately, if \(D_X\) is not an Exponential law, then the inter-arrival times of the faults of the whole platform, i.e., of the super-processor of section “Checkpointing on a Parallel Platform”, are no longer IID. The minimum trick used in the proof of Proposition 1 works only for the first fault. For the following ones, we need to remember the history of the previous faults, and things get too complicated. However, we could still define the MTBF \(\mu\) of the super-processor. Using Eq. (18), \(\mu\) must satisfy:
$$\displaystyle \begin{aligned}\lim_{F \rightarrow +\infty} \frac{\mathbb{E}\left(n(F)\right)}{F} = \frac{1}{\mu},\end{aligned}$$
where \(n(F)\) is the number of faults on the super-processor until time \(F\). But does the limit always exist? And if yes, what is its value?
The answer to both questions is not difficult. Consider a fixed time bound \(F\) as before. Let \(n(F)\) be the number of faults on the whole platform until time \(F\), and let \(m_q(F)\) be the number of these faults that strike component number \(q\). Of course we have \(n(F) = \sum_{q=1}^{p} m_{q}(F)\). By definition, \(m_q(F)\) is the number of faults on component \(q\) until time \(F\). From Eq. (18) again, we have for each component \(q\):
$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb{E}\left(m_{q}(F)\right)}{F} = \frac{1}{\mu_{\text{ind}}}. \end{aligned}$$
Since \(n(F) = \sum_{q=1}^{p} m_{q}(F)\), we also have:
$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb{E}\left(n(F)\right)}{F} = \frac{p}{\mu_{\text{ind}}}, {} \end{aligned} $$
(19)
which answers both questions at the same time!
The curious reader may ask how to extend Eq. (19) when processors have different fault rates. Let \(X_i^{(q)}\), \(i \geq 0\), denote the IID random variables for the fault inter-arrival times on \(p_q\), and assume that \(X_i^{(q)} \sim D_X^{(q)}\), where \(D_X^{(q)}\) is a continuous probability distribution with bounded (nonzero) expectation \(\mu^{(q)}\). For instance, if \(\mu^{(2)} = 3\mu^{(1)}\), then (in expectation) processor 1 experiences three times more failures than processor 2. As before, consider a fixed time bound \(F\), and let \(m_q(F)\) be the number of faults on \(p_q\) until time \(F\). Equation (18) now writes \(\lim_{F \rightarrow +\infty} \frac{\mathbb{E}(m_{q}(F))}{F} = \frac{1}{\mu^{(q)}}\). Now let \(n(F)\) be the total number of faults on the whole platform until time \(F\). The same proof as above leads to
$$\displaystyle \begin{aligned} \lim_{F \rightarrow +\infty} \frac{\mathbb{E}\left(n(F)\right)}{F} = \sum_{q=1}^{p} \frac{1}{\mu^{(q)}}. {} \end{aligned} $$
(20)
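The additivity of fault rates in Eq. (20) can be checked by simulation. The sketch below (our own; the laws and expectations are illustrative, and none is Exponential) superposes three processors and compares the platform fault rate with the sum of the individual rates:

```python
import random

def count_faults(F, draw, rng):
    """Number of faults in [0, F] for IID inter-arrival times drawn by draw(rng)."""
    t, n = draw(rng), 0
    while t <= F:
        n += 1
        t += draw(rng)
    return n

rng = random.Random(2)
mus = [5.0, 10.0, 20.0]                      # expectations mu^(1), mu^(2), mu^(3)
laws = [
    lambda r: r.uniform(0.0, 10.0),          # expectation 5
    lambda r: r.gammavariate(2.0, 5.0),      # expectation 10
    lambda r: r.uniform(0.0, 40.0),          # expectation 20
]
F, trials = 5000.0, 100
total = sum(count_faults(F, law, rng) for _ in range(trials) for law in laws)
rate = total / (trials * F)
predicted = sum(1.0 / mu for mu in mus)      # Eq. (20): 1/5 + 1/10 + 1/20 = 0.35
print(rate, predicted)                       # the empirical rate is close to 0.35
```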
Kella and Stadje [
14] provide more results on the superposition of renewal processes (which is the actual mathematical name of the topic discussed here!).
The discussion on predictions in section “Fault Prediction” has been kept overly simple. For instance, when a fault is predicted, sometimes there is not enough time to take proactive actions, because we are already checkpointing. In this case, there is no other choice than ignoring the prediction.

Furthermore, a better strategy should take into account at what point in the period the prediction occurs. After all, there is no reason to always trust the predictor, in particular if it has a bad precision. Intuitively, the later the prediction takes place in the period, the more inclined we are to trust the predictor and take proactive actions, because the amount of work that we could lose gets larger and larger. On the contrary, if the prediction happens at the beginning of the period, we have to trade off the probability that the proactive checkpoint may be useless (if we take a proactive action) against the small amount of work that may be lost in the case where a fault would actually happen (if we do not trust the predictor). The optimal approach is to never trust the predictor in the beginning of a period, and to always trust it in the end; the crossover point \(\frac{C_{pr}}{pr}\) depends on the time \(C_{pr}\) to take a proactive checkpoint and on the precision \(pr\) of the predictor. Details are provided by Aupy et al. [2].
Finally, it is more realistic to assume that the predictor cannot give the exact moment when the fault is going to strike, but rather provides an interval of time, a.k.a. a prediction window. Aupy et al. [1] provide more information.
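The crossover rule above can be written down in a few lines. This is a sketch of the decision rule only, not of the full strategy of Aupy et al. [2], and the names are ours:

```python
def trust_prediction(elapsed_work: float, C_pr: float, precision: float) -> bool:
    """Act on a prediction only if the work at stake since the last checkpoint
    exceeds the crossover point C_pr / precision, where C_pr is the cost of a
    proactive checkpoint and precision is the fraction of predictions that
    correspond to actual faults."""
    return elapsed_work >= C_pr / precision

# With a proactive checkpoint of 60 s and a predictor of precision 0.8,
# the crossover point is 75 s into the period:
assert trust_prediction(100.0, 60.0, 0.8)       # late in the period: act
assert not trust_prediction(50.0, 60.0, 0.8)    # early in the period: ignore
```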
In the context of replication, there are two natural options for “counting” faults. The option chosen in section “Replication” is to allow new faults to hit processors that have already been hit. This is the option chosen by Ferreira et al. [8], who introduced the problem. Another option is to count only faults that hit running processors, and thus effectively kill replica pairs and interrupt the application. This second option may seem more natural, as the running processors are the only ones that matter for the application execution. It turns out that both options are almost equivalent: their values of MNFTI only differ by one, as shown by Casanova et al. [5].
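A quick Monte Carlo check of that claim (our own sketch; the number of pairs and the trial count are arbitrary). Faults hit a processor chosen uniformly at random, either among all \(2p\) processors (first option) or among the running ones only (second option), until both replicas of some pair are dead:

```python
import random

def mnfti(p, count_all, rng, trials=40000):
    """Average number of faults until some replica pair is entirely dead.
    count_all=True: faults may also strike already-dead processors and are
    still counted (first option); count_all=False: faults strike running
    processors only (second option)."""
    total = 0
    for _ in range(trials):
        alive = [True] * (2 * p)            # pair i = processors 2i and 2i+1
        faults = 0
        while True:
            if count_all:
                target = rng.randrange(2 * p)
            else:
                running = [i for i, a in enumerate(alive) if a]
                target = running[rng.randrange(len(running))]
            faults += 1
            alive[target] = False
            mate = target ^ 1               # the buddy replica of the same pair
            if not alive[mate]:             # both replicas dead: application stops
                break
        total += faults
    return total / trials

rng = random.Random(3)
a = mnfti(4, True, rng)    # faults counted on dead processors too
b = mnfti(4, False, rng)   # faults on running processors only
print(a, b)                # the two averages differ by about one
```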
Speaking of faults, an important question is: why don't we repair (or rejuvenate) processors on the fly, instead of doing so only when the whole application is forced to stop, recover from the last checkpoint, and restart its execution? The answer is technical: current HPC resource management systems assign the user a fixed set of resources for the execution, and do not allow new resources (such as spare nodes) to be added dynamically during the execution. In fact, a new configuration is frequently assigned to the user at restart time. But nothing prevents us from enhancing the tools! It should then be possible to reserve a few spare nodes in addition to the computing nodes. These nodes would be used to migrate the system image of a replica node as soon as its buddy fails, in order to re-create the failed node on the fly. Of course the surviving node must be isolated from the application while the migration is taking place, in order to maintain a coherent view of both nodes, and this induces some overhead. It would be quite interesting to explore such strategies.
Here are a few bibliographical notes about replication. Replication has long been used as a fault-tolerance mechanism in distributed systems (see the survey of Gärtner [12]) and in the context of volunteer computing (see the work of Kondo et al. [15]). Replication has recently received attention in the context of HPC (High Performance Computing) applications. Representative papers are those by Schroeder and Gibson [17], Zheng and Lan [20], Engelmann, Ong, and Scott [7], and Ferreira et al. [8]. While replicating all processors is very expensive, replicating only critical processes, or only a fraction of all processes, is a direction currently being explored under the name partial replication.
Speaking of critical processes, we make a final digression. The de facto standard to enforce fault-tolerance in critical or embedded systems is Triple Modular Redundancy, or TMR. Computations are triplicated on three different processors, and if their results differ, a voting mechanism is called. TMR is not used to protect from fail-stop faults, but rather to detect and correct errors in the execution of the application. While we all like, say, safe planes protected by TMR, the cost is tremendous: by definition, two thirds of the resources are wasted (and this does not include the overhead of voting when an error is identified).
In this exercise you are asked to help Alice (again). She is still writing her thesis, but she does not want to checkpoint at fixed intervals of time. She hates being interrupted in the middle of something because she loses concentration. She now wants to checkpoint only at the end of a chapter. She still has to decide after which chapters it is best to checkpoint.
The difference with the original problem is that the checkpoints can only be taken at given time-steps. If we formulate the problem in an abstract way, we have a linear chain of \(n\) tasks (the \(n\) chapters in Alice's thesis), \(T_1, T_2, \ldots, T_n\). Each task \(T_i\) has weight \(w_i\) (the time it takes to write that chapter). The cost to checkpoint after \(T_i\) is \(C_i\). The time to recover from a fault depends upon where the last checkpoint was taken. For example, assume that \(T_i\) was checkpointed, and that \(T_{i+1}\) and \(T_{i+2}\) were not. If a fault strikes during the execution of \(T_{i+3}\), we need to roll back and read the checkpoint of \(T_i\) from stable storage, which costs \(R_i\). Then we start re-executing \(T_{i+1}\) and the following tasks. Note that the costs \(C_i\) and \(R_i\) are likely proportional to the chapter length.
As before, the inter-arrival times of the faults are IID random variables following the Exponential law \(\mathrm{Exp}(\lambda)\). We must decide after which tasks to checkpoint, in order to minimize the expectation of the total time. Figure 12 gives you a hint.
\(\textsc{Time}_C(i)\) is the optimal solution for the execution of tasks \(T_1, T_2, \ldots, T_i\). The solution to the problem is \(\textsc{Time}_C(n)\), and we use a dynamic programming algorithm to compute it. In the algorithm, we need to know \(\textsc{Time}_Z(i+1, j)\), the expected time to execute a segment of tasks \([T_{i+1} \ldots T_j]\) and to checkpoint the last one, \(T_j\), knowing that there is a checkpoint before the first one (hence after \(T_i\)) and that no intermediate checkpoint is taken.
\(\textsc{Time}_Z\) stands for Zero intermediate checkpoint. It turns out that we already know the value of \(\textsc{Time}_Z(i+1, j)\): check that we have
$$\displaystyle \begin{aligned}\textsc{Time}_{\text{Z}}(i+1, j) = \mathbb{E}\left(\textsc{Time}\left(\sum_{k=i+1}^{j} w_{k},\, C_{j},\, D,\, R_{i},\, \lambda\right)\right)\end{aligned}$$
and use Proposition 3.
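A solution sketch in code (ours, not the official solution): Proposition 3 gives \(\textsc{Time}_Z\) in closed form, and a quadratic dynamic program then computes \(\textsc{Time}_C(n)\). We assume that recovering before the first checkpoint (a blank thesis) costs \(R_0 = 0\), and all numeric values below are illustrative.

```python
import math

def schedule_checkpoints(w, C, R, D, lam):
    """time_c[j]: optimal expected time to complete tasks T_1..T_j and
    checkpoint T_j. We try every position i < j of the previous checkpoint;
    time_z(i, j) is the closed form of Proposition 3 for the segment
    T_{i+1}..T_j (work sum(w[i:j]) plus checkpoint C_j, recovery R_i)."""
    n = len(w)
    prefix = [0.0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + w[i]

    def time_z(i, j):
        rec = R[i - 1] if i > 0 else 0.0        # assumption: R_0 = 0
        T = prefix[j] - prefix[i] + C[j - 1]    # segment work plus checkpoint C_j
        return math.exp(lam * rec) * (1.0 / lam + D) * (math.exp(lam * T) - 1.0)

    time_c = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cand = time_c[i] + time_z(i, j)
            if cand < time_c[j]:
                time_c[j], prev[j] = cand, i
    cps, j = [], n                              # recover the checkpoint positions
    while j > 0:
        cps.append(j)
        j = prev[j]
    return time_c[n], sorted(cps)

w = [30.0, 10.0, 25.0, 40.0, 15.0]   # chapter lengths (days, say)
C = [3.0, 1.0, 2.5, 4.0, 1.5]        # checkpoint costs, proportional to length
R = [3.0, 1.0, 2.5, 4.0, 1.5]        # recovery costs
total, cps = schedule_checkpoints(w, C, R, D=1.0, lam=1.0 / 100.0)
print(total, cps)                    # cps always contains the last task
```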

As a side note, one needs only 23 persons for the probability of a common birthday to reach 0.5 (a question often asked at geek evenings).
By the way, there is a nice little exercise in “Appendix 6: Scheduling a Linear Chain of Tasks” if you are motivated to help.
1. G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni. Checkpointing strategies with prediction windows. In Dependable Computing (PRDC), 2013 IEEE 19th Pacific Rim International Symposium on, pages 1–10. IEEE, 2013.
2. G. Aupy, Y. Robert, F. Vivien, and D. Zaidouni. Checkpointing algorithms and fault prediction. Journal of Parallel and Distributed Computing, 74(2):2048–2064, 2014.
3. M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. Checkpointing strategies for parallel jobs. In Proceedings of SC'11, 2011.
4. F. Cappello, A. Geist, B. Gropp, L. V. Kalé, B. Kramer, and M. Snir. Toward Exascale Resilience. Int. Journal of High Performance Computing Applications, 23(4):374–388, 2009.
5. H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Combining process replication and checkpointing for resilience on exascale systems. Research report RR-7951, INRIA, May 2012.
6. J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. FGCS, 22(3):303–312, 2004.
7. C. Engelmann, H. H. Ong, and S. L. Scott. The case for modular redundancy in large-scale high performance computing systems. In Proc. of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN), pages 189–194, 2009.
8. K. Ferreira, J. Stearley, J. H. I. Laros, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the Viability of Process Replication Reliability for Exascale Systems. In Proc. of the ACM/IEEE SC Conf., 2011.
9. P. Flajolet, P. J. Grabner, P. Kirschenhofer, and H. Prodinger. On Ramanujan's Q-Function. J. Computational and Applied Mathematics, 58:103–116, 1995.
10. A. Gainaru, F. Cappello, and W. Kramer. Taming of the shrew: Modeling the normal and faulty behavior of large-scale HPC systems. In Proc. IPDPS'12, 2012.
11. A. Gainaru, F. Cappello, M. Snir, and W. Kramer. Failure prediction for HPC systems and applications: Current situation and open issues. Int. J. High Perform. Comput. Appl., 27(3):273–282, 2013.
12. F. Gärtner. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1), 1999.
13. T. Hérault and Y. Robert, editors. Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks. Springer Verlag, 2015.
14. O. Kella and W. Stadje. Superposition of renewal processes and an application to multi-server queues. Statistics & Probability Letters, 76(17):1914–1924, 2006.
15. D. Kondo, A. Chien, and H. Casanova. Scheduling Task Parallel Applications for Rapid Application Turnaround on Enterprise Desktop Grids. J. Grid Computing, 5(4):379–405, 2007.
16. S. M. Ross. Introduction to Probability Models, Eleventh Edition. Academic Press, 2009.
17. B. Schroeder and G. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1), 2007.
18. J. W. Young. A first order approximation to the optimum checkpoint interval. Comm. of the ACM, 17(9):530–531, 1974.
19. L. Yu, Z. Zheng, Z. Lan, and S. Coghlan. Practical online failure prediction for Blue Gene/P: Period-based vs event-driven. In Dependable Systems and Networks Workshops (DSN-W), pages 259–264, 2011.
20. Z. Zheng and Z. Lan. Reliability-aware scalability models for high performance computing. In Proc. of the IEEE Conference on Cluster Computing, 2009.
21. Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman. A practical failure prediction with location and lead time for Blue Gene/P. In Dependable Systems and Networks Workshops (DSN-W), pages 15–22, 2010.
Title: Scheduling for Fault-Tolerance: An Introduction
DOI: https://doi.org/10.1007/978-3-319-93109-8_6
Authors: Guillaume Aupy, Yves Robert
Sequence number: 6