Skip to main content
main-content

Über dieses Buch

Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems describes coding approaches for designing fault-tolerant systems, i.e., systems that exhibit structured redundancy that enables them to distinguish between correct and incorrect results or between valid and invalid states. Since redundancy is expensive and counter-intuitive to the traditional notion of system design, the book focuses on resource-efficient methodologies that avoid excessive use of redundancy by exploiting the algorithmic/dynamic structure of a particular combinational or dynamic system.
The first part of Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems focuses on fault-tolerant combinational systems providing a review of von Neumann's classical work on Probabilistic Logics (including some more recent work on noisy gates) and describing the use of arithmetic coding and algorithm-based fault-tolerant schemes in algebraic settings. The second part of the book focuses on fault tolerance in dynamic systems. Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems also discusses how, in a dynamic system setting, one can relax the traditional assumption that the error-correcting mechanism is fault-free by using distributed error correcting mechanisms. The final chapter presents a methodology for fault diagnosis in discrete event systems that are described by Petri net models; coding techniques are used to quickly detect and identify failures.
From the Foreword: "Hadjicostis has significantly expanded the setting to processes occurring in more general algebraic and dynamic systems... The book responds to the growing need to handle faults in complex digital chips and complex networked systems, and to consider the effects of faults at the design stage rather than afterwards."
George Verghese, Massachusetts Institute of Technology
Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems will be of interest to both researchers and practitioners in the area of fault tolerance, systems design and control.

Inhaltsverzeichnis

Frontmatter

Introduction

Chapter 1. Introduction

Abstract
Modern digital systems are subject to a variety of potential faults that can corrupt their output and degrade their performance [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz, 1998]. In this context, a fault is a deviation of a given system from its required or expected behavior. The more complex a computational system is or the longer an algorithm runs for, the higher is the risk of a hardware malfunction that renders the overall functionality of the system useless. Depending on the duration of faults, two broad classes are defined [Johnson, 1989]:(i) Permanent faults manifest themselves in a consistent manner and include design or software errors, manufacturing defects, or irreversible physical damage. (ii) Transient faults do not appear on a consistent basis and only manifest themselves in a certain portion of system invocations; transient faults could be due to noise, such as absorption of alpha particles and electromagnetic interference, or environmental factors, such as overheating.
Christoforos N. Hadjicostis

Fault-Tolerant Combinational Systems

Frontmatter

Chapter 2. Reliable Combinational Systems Out of Unreliable Components

Abstract
In one of his most influential papers, von Neumann considered the construction of reliable combinational systems out of unreliable components [von Neumann, 1956]. He focused on a class of digital systems that performed computation by using appropriately interconnected voting mechanisms. More specifically, von Neumann constructed reliable systems out of unreliable 3-bit voters, some of which were used to perform computation and some of which were used as “restoring organs” to achieve error correction. The voters used for computational purposes received inputs that were either primary inputs, constants or outputs from other voters; the voters that functioned as “restoring organs” ideally (i.e., under fault-free conditions) received identical inputs.
Christoforos N. Hadjicostis

Chapter 3. Algorithm-Based Fault Tolerance for Combinational Systems

Abstract
Modular redundancy schemes are attractive because they are universally applicable and can be implemented without having to develop explicit fault models. Their major drawback is that they can be prohibitively expensive due to the overhead of replicating the hardware. Arithmetic coding and algorithm-based fault tolerance (ABFT) schemes partially overcome this problem by offering sufficient fault coverage while making more efficient use of redundancy. This comes at the cost of narrower applicability and harder design. In fact, the main task in arithmetic coding and ABFT schemes is the development of appropriate fault models and the recognition of the structural features that make a particular computation or algorithm amenable to efficient utilization of redundancy. A variety of useful results and constructive procedures for systematically achieving this goal have been obtained for computations that take place in an abelian group or in a semigroup. This chapter reviews work on arithmetic codes and ABFT, and describes a systematic approach for protecting combinational systems whose functionality possesses certain algebraic structure.
Christoforos N. Hadjicostis

Fault-Tolerant Dynamic Systems

Frontmatter

Chapter 4. Redundant Implementations of Algebraic Machines

Abstract
This chapter extends the algebraic approach of Chapter 3 in order to provide fault tolerance to group and semigroup machines. The discussion characterizes redundant implementations using algebraic homomorphisms and demonstrates that for a particular error-correcting scheme there exist many possible redundant implementations, each potentially offering different fault coverage [Had-jicostis, 1999]. The fault model assumes that the error detecting/correcting mechanism is fault-free and considers faults that cause the redundant machine to transition to an incorrect state. Explicit connections to hardware implementations and hardware faults are addressed in Chapter 5 for linear time-invariant dynamic systems (implemented using delay, adder and gain elements) and in Chapter 6 for linear finite-state machines (implemented using XOR gates and flip-flops). The issue of faults in the error corrector is studied in Chapter 7. Related work appeared in the context of providing fault tolerance to arbitrary finite-state machines via external monitoring mechanisms [Iyengar and Kinney, 1985; Leveugle and Saucier, 1990; Parekhji et al., 1991; Robinson and Shen, 1992; Leveugle et al., 1994; Parekhji et al., 1995]. This work, however, was not formulated in an algebraic setting and does not make use of algebraic properties and structure.
Christoforos N. Hadjicostis

Chapter 5. Redundant Implementations of Discrete-Time Linear Time-Invariant Dynamic Systems

Abstract
This chapter discusses fault tolerance in discrete-time linear time-invariant (LTI) dynamic systems [Hadjicostis and Verghese, 1997; Hadjicostis and Verghese, 1999; Hadjicostis, 1999]. It focuses on redundant implementations that reflect the state of the original system into a larger LTI dynamic system in a linearly encoded form. In essence, this chapter restricts attention to discrete-time LTI dynamic systems and linear coding techniques, both of which are rather standard and well-developed topics in system and coding theory respectively. Interestingly enough, the combination of linear dynamics and coding reveals some novel aspects of the problem, as summarized by the characterization of the class of appropriate redundant implementations given in Theorem 5.1. In most of the fault-tolerance schemes discussed, error detection and correction is performed at the end of each time step, although examples of non-concurrent schemes are also presented [Hadjicostis, 2000; Hadjicostis, 2001].
Christoforos N. Hadjicostis

Chapter 6. Redundant Implementations of Linear Finite-State Machines

Abstract
This chapter applies techniques similar to those of Chapter 5 to provide fault tolerance to linear finite-state machines (LFSM’s) [Hadjicostis, 1999]. The discussion focuses on linear encoding techniques and, as in Chapter 5, results in a complete characterization of the class of appropriate redundant implementations. It is shown that, for a given LFSM and a given linear encoding, there exists a variety of possible implementations and that different criteria can be used to choose the most desirable one [Hadjicostis, 2000; Hadjicostis and Verghese, 2002]. The implications of this approach are demonstrated by studying hardware implementations that use interconnections of 2-input XOR gates and single-bit memory elements (flip-flops). The redundancy in the state representation (which essentially appears as a linearly encoded binary vector) is used by an external, fault-free mechanism to perform concurrent error detection and correction at the end of each time step. The assumption of a fault-free error corrector is relaxed in Chapter 7.
Christoforos N. Hadjicostis

Chapter 7. Unreliable Error Correction in Dynamic Systems

Abstract
This chapter focuses on constructing reliable dynamic systems exclusively out of unreliable components, including unreliable components in the error-correcting mechanism. At each time step, a particular component can suffer a transient fault with a probability that is bounded by a constant. Faults between different components and between different time steps are treated as independent. Essentially, the chapter considers an extension of the techniques described in Chapter 2 to a dynamic system setting. Since dynamic systems evolve in time according to their internal state, the major task is to effectively deal with the effects of error propagation, i.e., the effects of errors that corrupt the system state.
Christoforos N. Hadjicostis

Chapter 8. Coding Approaches for Fault Detection and Identification in Discrete Event Systems

Abstract
This chapter applies coding techniques in the context of detecting and identifying faults in complex discrete event systems (DES’s) that can be modeled as Petri nets [Hadjicostis, 1999; Hadjicostis and Verghese, 1999]. The approach is based on replacing the Petri net model of a given DES with a redundant Petri net model in a way that preserves the state, evolution and properties of the original system in some encoded form. This redundant Petri net model enables straightforward fault detection and identification based on simple parity checks that are used to verify the validity of artificially-imposed invariant conditions. Criteria and methods for designing redundant Petri net models that achieve the desired objective while minimizing the cost associated with them (e.g., by minimizing the number of sensors or communication links) are not pursued here, but several examples illustrate how such problems can be approached.
Christoforos N. Hadjicostis

Chapter 9. Concluding Remarks

Summary
This book presented a unifying approach for constructing fault-tolerant combinational and dynamic systems. The underlying motive was to develop resource-efficient alternatives to modular redundancy by constructing appropriate redundant system embeddings. These embeddings preserve the functionality of the original system and are designed in a way that imposes constraints on the set of outputs/states that are reachable under fault-free conditions. Violations of these constraints can then be used by an external mechanism to detect and correct errors. The faults that cause the errors could be due to hardware malfunctions, communication faults, incorrect initialization, and so forth.
Christoforos N. Hadjicostis

Backmatter

Weitere Informationen