Abstract
For some computing systems, failure is rare enough that it can be ignored. In other systems, failure is so common that the way it is handled can have a significant impact on system performance. There are many different recovery schemes for tasks; however, they can be classified into three broad categories: 1) Resume: when a task fails, it knows exactly where it stopped and can continue from that point when allowed to resume (i.e., preemptive resume - prs); 2) Replace: when a task fails, the processor later continues with a brand new task (i.e., preemptive repeat different - prd); and 3) Restart: when a task fails, it loses all work done to that point and must start anew upon continuing later (i.e., preemptive repeat identical - pri). In this paper, assuming a computing system is unreliable, we discuss how heavy-tail (hereafter referred to as power-tail - PT) distributions can appear in a job's task stream given the Restart recovery procedure. This is an important consideration since it is known that power-tails can lead to unstable systems [4]. We then demonstrate how to obtain performance and dependability measures for a class of computing systems composed of P unreliable processors and a finite number of tasks N given the above recovery procedures.
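The three recovery categories can be made concrete with a small simulation. The following is a minimal sketch, not the paper's method: it assumes failures arrive as a Poisson process with rate `failure_rate` while a task is running and that repair time is zero, and the names `completion_time` and `draw_task` are hypothetical. Under Restart the same task length is retried after every failure, so long tasks are retried many times, which is the kind of mechanism the abstract associates with power-tails.

```python
import random


def completion_time(draw_task, failure_rate, scheme, rng):
    """Busy time until one task completes under a given recovery scheme.

    Illustrative assumptions (not taken from the paper): failures form a
    Poisson process with rate `failure_rate` while the task is running,
    and repair time is zero.
    """
    if scheme not in ("resume", "replace", "restart"):
        raise ValueError(f"unknown scheme: {scheme}")

    task = draw_task(rng)
    if scheme == "resume":
        # prs: no work is lost, so the busy time equals the task length.
        return task

    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= task:
            # The task finishes before the next failure.
            return elapsed + task
        # Work done in this attempt is lost.
        elapsed += time_to_failure
        if scheme == "replace":
            # prd: the next attempt uses a freshly drawn task.
            task = draw_task(rng)
        # pri (restart): keep the same task length and try again.


if __name__ == "__main__":
    rng = random.Random(1)

    def exp_task(r):
        # Hypothetical workload: exponential task lengths with mean 1.
        return r.expovariate(1.0)

    for scheme in ("resume", "replace", "restart"):
        samples = [completion_time(exp_task, 0.5, scheme, rng) for _ in range(5000)]
        print(scheme, "mean:", sum(samples) / len(samples), "max:", max(samples))
```

Comparing the largest sampled values across the three schemes gives a rough feel for how much heavier the Restart tail is than the others under these assumptions; it is not a substitute for the analytic treatment.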
- A. Bobbio and K. Trivedi, "Computation of the Distribution of the Completion Time When the Work Requirement is a PH Random Variable," Communications in Statistics - Stochastic Models, 1990.
- M. Greiner, M. Jobmann, and L. Lipsky, "The Importance of Power-Tail Distributions for Modeling Queueing Systems," Operations Research, 47(2), 1999.
- V. Kulkarni, V. Nicola, and K. Trivedi, "The Completion Time of a Job on a Multimode System," Advances in Applied Probability, 19:932-954, 1987.
- L. Lipsky, Queueing Theory: A Linear Algebraic Approach, MacMillan and Company, New York, 1992.
Index Terms
- On unreliable computing systems when heavy-tails appear as a result of the recovery procedure