Skip to main content
main-content
Top

About this book

This book addresses the question of how system software should be designed to account for faults, and which fault tolerance features it should provide for highest reliability. With this second edition of Software Design for Resilient Computer Systems the book is thoroughly updated to contain the newest advice regarding software resilience. With additional chapters on computer system performance and system resilience, as well as online resources, the new edition is ideal for researchers and industry professionals.

The authors first show how the system software interacts with the hardware to tolerate faults. They analyze and further develop the theory of fault tolerance to understand the different ways to increase the reliability of a system, with special attention on the role of system software in this process. They further develop the general algorithm of fault tolerance (GAFT) with its three main processes: hardware checking, preparation for recovery, and the recovery procedure. For each of the three processes, they analyze the requirements and properties theoretically and give possible implementation scenarios and system software support required. Based on the theoretical results, the authors derive an Oberon-based programming language with direct support of the three processes of GAFT. In the last part of this book, they introduce a simulator, using it as a proof of concept implementation of a novel fault tolerant processor architecture (ERRIC) and its newly developed runtime system feature-wise and performance-wise.

Due to the wide reaching nature of the content, this book applies to a host of industries and research areas, including military, aviation, intensive health care, industrial control, and space exploration.

Table of Contents

Frontmatter

1. Introduction

Abstract
The chapter introduces what design principles and application requirements we need to pursue achieving resilient functioning of computer system. Principle of simplicity, reliability, reconfigurability, scalability, and redundancy are briefly discussed. Shown that implementation of principles with architecture might include system software modification as well as redevelopment of basis hardware blocks—processing area, storage area, and interfacing area of computer system.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

2. Hardware Faults

Abstract
This chapter explains hardware faults, their origins, dependency on technology used, and some known solutions of fault toleration using error correction codes and redundancy hardware schemes. Shown that with growing density of hardware there is a risk of multiple temporary fault which grows at order of magnitude prime concern for designers of new computer systems for safety-critical application. Hardware faults occur due to natural phenomena such as ionized radiation, variations in the manufacturing process, vibrations, etc. We present in this chapter a short introduction to hardware faults, show the typical fault types and patterns, and also give examples of how to deal with these faults.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

3. Fault Tolerance: Theory and Concepts

Abstract
This chapter briefly introduces how reliability of the system might be considered in combination with fault tolerance. Having introduced hardware faults in the previous chapter, we present in this chapter the elements of theory of fault tolerance and reliability and show how the hardware components of a computing system can be made more resilient to hardware faults. We then introduce the mathematical definition of reliability and show how to calculate the reliability of a system according to the topology of its components. Then we describe the connection between reliability and fault tolerance, i.e., we show how applying different types of redundancy, implemented in software and hardware, increases the reliability of a system. Also, some design advices are given.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

4. Generalized Algorithm of Fault Tolerance (GAFT)

Abstract
Fault tolerance so far was considered as a property of a system. In fact and instead, we introduce a Generalized Algorithm of Fault Tolerance (GAFT) that considers property of fault tolerance as a system process. GAFT implementation analysis—if we want to make it rigorous—should be using classification of redundancy types. Various redundancy types have different “power” of use at various steps of GAFT. Properties of GAFT implementation impact on overall performance of the system, coverage of faults, and ability of reconfiguration. Clear that separation of malfunctions from permanent fault simply must be implemented and reliability gain is analyzed. A ratio of malfunctions to permanent faults is achieving 105−7 and simple exclusion from working configuration a malfunctioned element is no longer feasible. Further, we have to consider GAFT extension in terms of generalization and application for support of system safety of complex systems. Our algorithms of searching correct state, “guilty” element, and analysis of potential damages become powerful extension of GAFT for challenging applications like avionic systems, aircraft as a whole. In Chap. 3, we showed that fault tolerance should be treated as a process. In this chapter, we elaborate further this process into a clearly defined algorithm and develop a framework to the design of fault-tolerant systems, the generalized algorithm of fault tolerance—GAFT.We also introduce a theoretical model to quantify the impact of the additional redundancy to the reliability of the whole system and derive an answer to the question of how much added redundancy leads to the system with highest reliability. A question that GAFT cannot answer is how the real source of a detected fault can be identified, as the fault manifestation might have occurred in another hardware element and spread in the system due to nonexistent fault containment. We will show an algorithm that based on the dependencies of the elements of a system can identify the possible fault sources and also predict which elements an identified fault might have affected. We now start in a first step by further elaborating the process of fault tolerance.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

5. GAFT Generalization: A Principle and Model of Active System Safety

Abstract
Fault tolerance so far was considered as a property of a system. In fact and instead, we introduce a Generalized Algorithm of Fault Tolerance (GAFT) that considers fault tolerance as a system process. There is no doubt we have to analyze GAFT implementation using redundancy types of computer systems. Properties of GAFT implementation impact on overall performance of the system, coverage of faults, and ability of reconfiguration.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

6. System Software Support for Hardware Deficiency: Functions and Features

Abstract
This chapter explains what is required to be changed or modified at the level of system software (runtime system, languages) to address new properties of computer system fault tolerance, resilience, and reliability. From the first principles, we think through the system design attempting to reflect new system properties “as early as possible”.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

7. Testing, Checking, and Hardware Syndrome

Abstract
In previous chapters, we introduced the processes of checking and testing, the first of the three main processes of Generalized Algorithm of Fault Tolerance—GAFT. In this chapter, we further discuss the process of checking hardware, at first software-based hardware-checking and at second hardware-based checking. For the software-based hardware checking, we show what a software-based test should include, when they are the preferred choice over hardware-based checking schemes, and especially how such tests can be scheduled in the system without interfering with ongoing real-time tasks. Further to support handling of hardware-based checking, we introduce a new system condition descriptor—so-called a syndrome, and illustrate how it can be used as a mechanism to signal to the operating system the hardware condition, including manifestation of detected error. We then show the steps the runtime system performs to eliminate the fault and in case of permanent errors how the software can reconfigure the hardware to exclude the faulty element. We also explain in which cases software has to adapt to the new hardware topology. We start by explaining how software-based checks can be used to detect hardware faults. Runtime systems use online or offline scheduling mechanisms for task management of programs—own—system software ones and user application ones. Since [14] it is expected that runtime system provides a special session of tasks scheduling (offline or online during execution) for the purposes of diagnostic of hardware conditions—recall Apple and Microsoft system starting delays. Later for some systems that operate in domain of real-time monitoring scheduling of tasks, critical in time of execution especially criticality of hardware availability and efficiency of process scheduling become crucial. In turn, testing itself becomes “hot” in terms of required time and coverage of hardware. Thus in this chapter, we initially analyze simple sequences of testing of hardware elements of computer systems. Further, we introduce a concept of transparent for user application procedure of hardware testing. This enables to prove integrity of computer system hardware, and guarantee it within a reasonable time, without delay of service of execution of user tasks.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

8. Recovery Preparation

Abstract
In the last section, we showed how hardware integrity of a computing system can be efficiently ensured using hardware-checking schemes and system software testing procedures and their sequences. However, to recover from faults, it is necessary to eliminate the effects the error had on the computation, i.e., the software code and data space. In GAFT, this corresponds to preparation for recovery. We now want to show how software has to be organized to be able own recovery or in other words, we want to revise different strategies how software can, after the detection of an error, ensure that the error did not affect the software state, or if this cannot be ensured, what precautions software has to conduct to be able to re-establish a correct software state. First, we revise the state of the art and then introduce a new technology and show its power and limitations. In the next step, we will show how hardware can assist software in the process of recovery preparation. For all generic approaches to recovery preparation, so-called stable storage, a nonvolatile, reliable, and fast storage is needed. If no direct hardware support is available, stable storage must be implemented in software. We will present a possible software implementation of such a stable storage.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

9. Recovery: Searching and Monitoring of Correct Software States

Abstract
The last of the three GAFT processes is called recovery and recovery monitoring. After the detection of an error and possible reconfiguration, the last step is recovering the software, which means that the effect of the error on the software must be eliminated. In line with the previous chapters and [16], the recovery consists of restoring the last recovery point and continuing the processing. But is this really sufficient? What if latent faults exist in the system and manifest themselves in the system but trigger some detection schemes an arbitrary time later? Assuming this reasonable and unpleasant sequence of events, it becomes clear that just restoring data and program from the last stored recovery point is not enough. We have to admit that we do not have any guarantee that fault is now eliminated: even when hardware is restored or even reconfigured—we have erroneous states of software recorded in recovery points. Thus, we have to consider the recovery process itself and analyze which classic algorithms are applicable and fit the purpose of efficient recovery. We introduce and analyze three recovery algorithms that are able to ensure successful recovery by iteratively go through all stored recovery points.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

10. Recovery Algorithms: An Analysis

Abstract
Discovered algorithm of modified linear recovery seems to be effective in terms of power of detection of fault and correct state of the system. At the same time, classic algorithms well presented in literature binary search and linear search can also be applied for the same purpose. Thus, we have to consider the recovery process itself and analyze which classic algorithms are applicable and fit the purpose of efficient recovery. We introduce and analyze three recovery algorithms that are able to ensure successful recovery by iteratively go through all stored recovery points.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

11. Programming Languages for Safety-Critical Systems

Abstract
In previous chapters, we introduced the three main processes required to implement generalized algorithm of fault tolerance (GAFT ), namely—testing and checking , second recovery preparation , and third and finally recovery and recovery monitoring. We described what every of these steps incorporates, gave possible solutions, and analyzed them. In the Chap. 7, we introduced syndrome for testing and checking ; here we introduce programming language models for the two other mentioned processes. What we now want to do is to synthesize the introduced concepts into system software tools—programming languages and their compilers. We will discuss possible project solutions related to the overall architecture of software tools and introduce the major components of the architecture.
Eugene Zouev

12. Proposed Runtime System Structure with Support of Resilient Concurrency

Abstract
This chapter is some kind of summary of Chaps. 6, 7 in terms of implementation of principles described there at the level of the runtime system. Here, we summarize new functions and features of runtime system that simply must be implemented during the design of fault-tolerant and resilient systems. Special attention was given to concurrency monitoring in the presence of potential threats and degradation. While it is clear which part of system software has to deal with this, the whole problem of concurrency never was addressed proper—here we present our attempt.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

13. Proposed Runtime System Versus Existing Approaches

Abstract
In this chapter, we briefly compare our approach of a fault-tolerant operating system with existing approaches. We use our own definition of fault tolerance as a process that is required to support the implementation of all steps of GAFT.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

14. Hardware: The ERRIC Architecture

Abstract
There is no doubt, system software support of computer resilience is incomplete when hardware does not have essential properties required to implement it. Luckily, hardware with required support of recoverability and fault detection is available and described in full details in [13]. Here, we briefly describe available hardware with attention to implementation support of PRE properties (performance reliability and energy) requirements.
Igor Schagaev, Eugene Zouev, Kaegi Thomas

15. Architecture Comparison and Evaluation

Abstract
Currently, there are several processor architectures on the market, dominated by the x86 architecture on the desktop and the ARM in embedded devices and should therefore be included in an architectural comparison. Thus without even brief analysis, how resilient computing “fit” these available architectures and why we need to develop our own ERRIC , our implementation and application would be incomplete. The core of the analysis of instruction set and impact on computer architecture was presented in Schagaev et al. (Instruction set and their impact on modern computer architecture, 1989, [1]). Since then nothing much has been changed in architecture, few system architectures were lost in the market competition (not the worst one, to be honest), leaving SPARC , Intel, and ARM with lion shares of the market. The SPARC processor is included in this comparison, as it is heavily used in space, especially the fault-tolerant version LEON3 (LEON3-FT SPARC V8-RTAX—Data Sheet and User’s manual, 2009, [2]). In order to have a consistent comparison, we only compare the 32-bit versions of the respective processors, although 64-bit versions are available for the x86 and SPARC architecture. Both the SPARC and the ARM processors are RISC based and are thus quite similar in the instruction set , whereas the x86 architecture is CISC based and provides a multitude of instructions in comparison to the other architectures. The x86 is also the only architecture that allows using memory locations directly in instructions (register memory architecture), whereas RISC machines (load and store architecture) must first load the argument into a register. This enormous number of instructions leads to the curious situation that the instruction decoder of the Intel Atom CPU needs as much space as a complete ARM Cortex-A5 (Nachwuchs fr die die cortex-a-familie, 2009, [3]).
Igor Schagaev, Eugene Zouev, Kaegi Thomas

16. ERRIC Reliability

Abstract
The key property of our design is resilience. So far we mostly covered system software methods and schemes to achieve or support it. Previous Chaps. 14 and 15 explained briefly what hardware (processor) should possess (including reduction on functions and limited in architecture options) to be able to implement resilience in the most efficient way. Here, we intend to analyze what we have achieved in hardware design in terms of malfunction tolerance—attempting to use heavy artillery of system software as less as possible, making dirty work of fault detection and determination (malfunction or permanent for hardware).
Igor Schagaev, Eugene Zouev, Kaegi Thomas

17. On Performance: From Hardware up to Distributed Systems

Abstract
Nothing is easy nowadays: frequency of processors increased thousand times, system performance as a whole sometimes tripled. Complexity of the system became uncontrollable with zillions of processes and elements to juggle increased unconsciously, leaving for us some comfort but at an astronomical cost. What it means? We are doing something seriously wrong and doing it consistently and persistently. Thus, authors of this work have decided to put together our own discussions and estimations we did since 2002 up to now. We show that system performance depends on user, hardware, and software, structure or architecture of a system and its topology. We propose to see performance analysis a bit wider, thinking systematically what various zones of computer or distributed system can bring or contribute, including the role of processor, structure of system software and overvalued parallelization (try to eat and dance at the same time—it might be fun). We have introduced a kind of virtual architecture through which see instruction execution considering what is in there for us and what system requires for itself. The observation is rather pessimistic. We have briefly demonstrated what simplest architecture if carefully designed can give regarding performance, reliability and energy efficiency AT THE SAME TIME! Regarding distributed systems, we show that Amdahl Law is also very overoptimistic mostly serves to promote parallel architectures and distributed systems. Simple model that we have explained for kids from British primary school and even did field study with them so-called “fence model” made clear that the limit of performance or simply overall reasonably good design is unachievable until we start rethinking the whole architecture and its main element interaction—human, hardware and system software together, pursuing three nonfunctional requirements, performance, reliability, and energy efficiency in concert.
Igor Schagaev, Hao Cai, Simon Monkman

18. Distributed Systems: Maximizing Resilience

Abstract
We claim that system redundancy (natural or artificial, deliberately introduced) should be applied for the purposes of Performance, Reliability and Energy efficiency or “PRE-smartness”. A process of an algorithm of application of available redundancy for PRE-smartness is proposed showing that it is further extension generalized algorithm of fault tolerance, when property of fault detection is replaced as property of PRE-smartness. We present a system level implementation steps of PRE-smartness. Using ITACS LTD forward and backward tracing algorithms applied for distributed computer system we demonstrate that efficiency (reliability as a whole, especially availability; performance of detection and recovery) grow up to the level when real-time applications of distributed system grows substantially. Ans estimation of reliability gain is modeled and demonstrated suggesting a policy of implementation of PRE-smartness as a permanent on-going process required for the system successful functioning.
Igor Schagaev

19. Distributed Systems: Resilience, Desperation

Abstract
Major drawbacks of existing distributed systems, including networking are discussed. To cope with them we introduce a new design concept of performance-, reliability-, energy- smart system, applied for distributed system and networking. Our approach of distributed system handling was called a desperation control, which introduce and explain. It assumes introduction of new routing algorithms for routing and for package handling inside routers. We claim that state of a system as a whole as well as state of a package should be considered and treated in concert varying ways of package handlings. Shown that proposed approach at order of magnitude boosts performance and reliability of network making it suitable for real-time packages and widening applications for really important applications: existing technologies and billions of devices should serve and make people better. Using elements of graph theory, we analyze and compare proposed algorithms with known, explaining our advantages. Weighted costs of each segment of package path, combined with vector heuristic introduced and enriched by graph-logic extension for each routing table makes proposed concept the best so far from all known. Resilience of distributed computer systems as well as networking therefore becomes improved, extending application domain to the really important applications like safety critical, military, banking, health real-time monitoring, transport and similar.
Igor Schagaev, Stephen Farrell

20. ERRIC Machine Code Simulator

Abstract
The architecture and implementation of the ERRIC instruction code software simulator is presented. The main property of the simulator is providing a tool for developing real ERRIC -based software even before a real ERRIC processor appears, or to develop software without a processor at hand. The overall structure of the software tool is described, and the key data structures and algorithms of the simulator are explained and discussed.
Aleksey Gospodchikov, Sergey Koziakov, Danila Romanov, Eugene Zouev

Backmatter

Additional information