Skip to main content
main-content

Über dieses Buch

A research project to investigate the design and construction of reliable computing systems was initiated by B. Randell at the University of Newcastle upon Tyne in 1972. In over ten years of research on system reliability, a substantial number of papers have been produced by the members of this project. These papers have appeared in a variety of journals and conference proceedings and it is hoped that this book will prove to be a convenient reference volume for research workers active in this important area. In selecting papers published by past and present members of this project, I have used the following criteria: a paper is selected if it is concerned with fault tolerance and is not a review paper and was published before 1983. I have used these criteria (with only one or two exceptions!) in order to present a collection of papers with a common theme and, at the same time, to limit the size of the book to a reasonable length. The papers have been grouped into seven chapters. The first chapter introduces fundamental concepts of fault tolerance and ends with the earliest Newcastle paper on reliability. The project perhaps became well known after the invention of recovery blocks - a simple yet effective means of incorporating fault tolerance in software. The second chapter contains papers on recovery blocks, starting with the paper which first introduced the concept.

Inhaltsverzeichnis

Frontmatter

Introduction

Introduction

It was in January 1972 that a long term programme of research into the problems of designing highly reliable computing systems was initiated at the University of Newcastle upon Tyne. However the origins of the project can be readily traced back some years earlier.

B. Randell

System Reliability

Introduction

In our every day conversations we tend to use the terms ‘fault’, ‘error’ and ‘failure’ (often interchangeably) to indicate the fact that something is ‘wrong’ with a system. However, in any discussion on reliability and fault tolerance, more precision is called for to avoid confusion. The definitions for these terms presented here in the first paper by Anderson and Lee owe their origins to the unpublished work of Melliar-Smith and to his collaborative work with Randell (see their joint paper in Chap. 3). In true computer science fashion, Melliar-Smith and Randell defined a system recursively as composed out of ‘smaller’ systems and defined the occurrence of a failure to be the event when the behaviour of a system does not agree with that required by the specification. Why does a system fail? To answer this is it necessary to examine the internal state of the system, which then leads us to the notions of ‘errors’ and ‘faults’.

Santosh Kumar Shrivastava

Fault Tolerance Terminology Proposals

At present, the fault tolerance community is hampered by using a set of conflicting terms to refer to closely related fault tolerance concepts. This paper presents informal, but precise, definitions and terminology for these concepts. In particular, the terms fault, error and failure are carefully defined and distinguished. The aim is to promote discussion in the hope that an agreed terminology will emerge.

T. Anderson, P. A. Lee

System Structure for Software Fault Tolerance

This paper presents and dicusses the rationale behind a method for structuring complex computing systems by the use of what we term “recovery blocks,” “conversations,” and “fault-tolerant interfaces.” The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

B. Randell

Operating Systems: The Problems of Performance and Reliability

The problems of achieving satisfactory levels of system performance and reliability are amongst the most difficult that operating system designers and implementors have to face. This is particularly the case with generic operating systems, i.e., systems intended for use in many different versions, in a wide variety of different environments. The present paper attempts to explore the reasons for these difficulties, and to discuss the interplay between performance and reliability, and, in particular, the problems of achieving high reliability in the presence of hardware failures and software errors.

B. Randell

Recovery Blocks

Introduction

Recovery blocks provide a means of introducing a measure of tolerance against software faults. The four aspects of fault tolerance–error detection, damage assessment, error recovery and fault treatment — are embodied in a recovery block in a disciplined fashion. Acceptance tests are used for detecting errors and damage assessment for a single sequential process is particularly simple — the program in execution is assumed to be effected. (Damage assessment for concurrent processes poses complications and is the subject of Chapt. 4.) Error recovery involves restoring all modified non-local variables to their values at the beginning of the recovery block. Finally, fault treatment consists of executing an alternative module in an attempt to avoid the fault(s) that resulted in the failure of the previous module.

Santosh Kumar Shrivastava

A Program Structure for Error Detection and Recovery

The paper describes a method of structuring programs which aids the design and validation of facilities for the detection of and recovery from software errors. Associated with the method is a mechanism for the automatic preservation of restart information at a level of overhead which is believed to be tolerable.

J. J. Horning, H. C. Lauer, P. M. Melliar-Smith, B. Randell

A Reconsideration of the Recovery Block Scheme

The recovery block scheme has been introduced as a method of providing fault tolerance at the software level in a computing system. From the widespread interest that has been expressed in the scheme there appears to be a standard set of questions which are posed about its implementation and utility. This paper presents a brief overview of the recovery block scheme and then examines in detail the issues that these questions raise.

P. A. Lee

Recovery Blocks in Action: A System Supporting High Reliability

The need for reliable complex systems motivates the development of techniques by which acceptable service can be maintained, even in the presence of residual errors. Recovery blocks allow a software designer to include tests on the acceptability of the various phases of a system’s operation, and to specify alternative actions should the acceptance tests fail. This approach relies on certain architectural features, ideally implemented in hardware, by which control and data structures can be retrieved after errors.A brief account is presented of the recovery block scheme, together with a description of a new implementation of the underlying cache mechanism. The salient features of a proposed computer architecture are described, which incorporates this implementation and also provides a high level of detection for errors such as the corruption of code and data. A prototype system has been constructed to test the viability of these techniques by executing programs containing recovery blocks on an emulator for the proposed architecture. Experience in running this system are recounted with respect to the execution of programs based on erroneous algorithms and also with respect to errors introduced by deliberate attempts to corrupt the system.

T. Anderson, R. Kerr

Sequential Pascal with Recovery Blocks

The programming language Sequential Pascal has been extended to include recovery blocks. This paper describes the modifications made to the kernel and interpreter of Brinch Hansen’s Pascal system to support recovery blocks and the associated recovery caches needed for state restoration.

S. K. Shrivastava

Fault-Tolerant Sequential Programming Using Recovery Blocks

When using recovery blocks [1], it is desirable to structure a program such that no unrecoverable operations (e.g. I/O) appear within a recovery block — thus ensuring that a recovery action will generate a consistent prior state. The figure below shows one case where only assignments are recoverable and a large file is to be processed (the merge sort example of the next section illustrates this approach): There can be two ways of designing the different algorithms for the primary and the alternatives of a recovery block: algorithms that are different but produce identical results (see the median example below) or algorithms for alternatives that are designed to provide a degraded service (producing different but nevertheless acceptable results, see the stable marriage example). In the latter situation the acceptance test can only be as strong as the test needed to check the adequacy of the ‘weakest’ alternative. Sometimes, this may prove unacceptable where a stronger test is needed for the primary (or even some of the alternatives). The following figure suggests a simple way of including both of these tests. ‘I’ represents the acceptance test and ‘Q’ represents a stronger test for the primary. It is assumed that if ‘Q’ is false, the primary will fail. ensure I by begin.. ; assert Q end else by...

S. K. Shrivastava, A. A. Akinpelu

A Recovery Cache for the PDP-11

Backward error recovery is an integral part of the recovery block scheme that has been advanced as a method for providing tolerance against faults in software; the recovery cache has been proposed as a mechanism for providing this error recovery capability. This paper describes a recovery cache that has been built for the PDP-11 family of machines. This recovery cache has been designed to be an “add-on” unit which requires no hardware alterations to the host CPU but which intersects the bus between the CPU and the memory modules. Specially designed hardware enables concurrent operation of the recovery cache and the host system, and aims to minimize the overheads imposed on the host.

P. A. Lee, N. Ghani, K. Heron

Recovery and Crash Resistance in a Filing System

This paper describes mechanisms that provide the user of a filing system the dynamic facility for defining a scope within which backing out can be done on request.Check points (defining the beginning of a new scope) can dynamically be established and procedures for ‘acceptance’ (at the end of the scope) or ‘undoing’ (within or at the end of the scope) can be invoked. These scopes can be nested.It is also shown that these mechanisms can be used to provide crash resistance. After a crash the system will be left in the state it was in before it entered the current scope (or outermost scope if scopes are nested).

J. S. M. Verhofstad

Exception Handling

Introduction

The recovery block approach discussed earlier is a means of coping with residual software faults in a system. Manifestation of such a fault can cause the program module under execution to produce an unexpected and undesired response. The possible responses obtainable from a module can be classified as: (i) expected and desired; (ii) expected and undesired; and (iii) unexpected and undesired.

Santosh Kumar Shrivastava

Software Reliability: The Role of Programmed Exception Handling

The paper discusses the basic concepts underlying the issue of software reliability, and argues that programmed exception handling is inappropriate for dealing with suspected software errors. Instead it is shown, using an example program, how exception handling can be combined with the recovery block structure. The result is to improve the effectiveness with which problems due to anticipated faulty input data, hardware components, etc., are dealt with, while continuing to provide means for recovering from unanticipated faults, including ones due to residual software design errors.

P. M. Melliar-Smith, B. Randell

Exception Handling and Software Fault Tolerance

Some basic concepts underlying the issue of fault-tolerant software design are investigated. Relying on these concepts, a unified point of view on programmed exception handling and default exception handling based on automatic backward recovery is constructed. The cause—effect relationship between software design faults and failure occurrences is explored and a class of faults for which default exception handling can provide effective fault tolerance is characterized. It is also shown that there exists a second class of design faults which cannot be tolerated by using default exception handling. The role that software verification methods can play in avoiding the production of such faults is discussed.

F. Cristian

Robust Data Types

Data types with total operations and exceptions are proposed as basic building blocks for the construction of modular robust software. A notation for specifying such data types is presented and the issues underlying their correct implementation in a programming language supporting data abstraction and exception handling are discussed and illustrated by examples. New light is shed on some important aspects of exception handling such as the identification and specification of exceptions, the precise detection of exception occurrences, recovery of consistent states after exception detections and verification of the correct implementation of specified exceptional effects for operations.

F. Cristian

Systematic Detection of Exception Occurrences

In proving the correctness of a program, a common strategem is to consider only initial states in which certain properties are satisfied. For example, in the knowledge that a given array contains at least one positive element, one might prove a program for finding, say, the first positive element in that array, even though the program may otherwise (i.e. if the array does not contain any positive elements) lead to unpredictable results.

E. Best, F. Cristian

Safe Programming

Safe specifications and programs are advocated as a simple way of enhancing the reliability of software. The behaviour of a safe program can be more easily certified as being correct with respect to its safe specification, which implies guaranteed termination. This paper describes the theory of safe programming, demonstrates the building of a safe program and summarises the experience gained from practical applications of safe programming.

T. Anderson, R. W. Witty

Concurrent Systems

Introduction

Atomic actions are now regarded as an important structuring concept for concurrent systems. Given a system of interacting processes (e.g. processes that share data), we would like execution of a program by a process to be ‘free of interference’ from (‘indivisible’ with respect to) other such executions. The so-called serializability property of atomic actions guarantees just that. Atomic actions can also be embellished with the ‘failure atomicity’ property: they can be provided with backward recovery capability such that an action in execution can be terminated without producing any side effects. The Newcastle papers collected in this chapter are concerned with the development of atomic action concept and discuss programming language features and recovery facilities for the construction reliable concurrent systems.

Santosh Kumar Shrivastava

Process Structuring, Synchronization, and Recovery Using Atomic Actions

This paper explores the notion of an atomic action as a method of process structuring. This notion, first introduced explicitly by Eswaren et al. [6] in the context of data base systems, reduces the problem of coping with many processes to that of coping with a single process within the atomic action. A form of process synchronization, the await statement, is adapted to work naturally with atomic actions. System recovery is also considered and we show how atomic actions can be used to isolate recovery action to a single process. Explicit control of recovery is provided by a reset procedure that permits information from rejected control paths to be passed to subsequent alternative paths.

D. B. Lomet

A Formal Model of Atomicity in Asynchronous Systems

We propose a generalisation of occurrence graphs as a formal model of computational structure. The model is used to define the “atomic occurrence” of a program, to characterise “interference freeness” between programs, and to model error recovery in a decentralised system.

E. Best, B. Randell

Reliable Resource Allocation Between Unreliable Processes

Basic error recovery problems between interacting processes are first discussed and the desirability of having separate recovery mechanisms for cooperation and competition is demonstrated. The paper then concentrates on recovery mechanisms for processes competing for the use of the shared resources of a computer system. Appropriate programming language features are developed based on the class and inner features of SIMULA, and on the structuring concepts of recovery blocks and monitors.

S. K. Shrivastava, J.-P. Banâtre

Concurrent Pascal with Backward Error Recovery: Language Features and Examples

The programming language Concurrent Pascal has been extended to include some language features that facilitate the writing of fault-tolerant software. As a result, it is possible now to (1) write operating systems witn a measure of fault-tolerance, and (2) for such an operating system to support fault-tolerant user programs. The paper describes these language features and illustrates their use with the help of a few working examples.

S. K. Shrivastava

Concurrent Pascal with Backward Error Recovery: Implementation

The implementation of backward error recovery features requires the support of a run time subsystem (called the recovery system) that is responsible for performing the task of state restoration. The recovery system implemented to support the recovery features of Concurrent Pascal includes, for each process, a recovery cache for recording appropriate recovery data. This paper describes the details of the recovery system that was implemented as a part of the interpreter of Brinch Hansen’s Concurrent Pascal system.

S. K. Shrivastava

A Framework for Software Fault Tolerance in Real-Time Systems

Real-time systems often have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be expensive and the cost is exacerbated for systems which utilize concurrent processes. The concurrency present in most real-time systems and the further difficulties introduced by timing constraints suggest that providing tolerance for software faults may be inordinately expensive or complex. We believe that this need not be case, and propose a straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time systems. The approach takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced. Responses to each type of error are proposed which allow service to be maintained.

T. Anderson, J. C. Knight

Multilevel Systems

Introduction

A well known principle advocated for the construction of complex software systems is to structure the system into a hierarchy of interfaces (levels of abstraction). An interface may be characterized by the objects made available on that interface together with operations for manipulating them. In the literature, one frequently encounters diagrams such as that depicted in Fig. 1 (a), which is intended to indicate that a program such as Pj makes use of objects provided by the underlying interface Li. The function of Pj is to maintain the interface Li; that is, the (abstract) objects available on Li have their concrete implementations in Pj. Since we are concerned with the provision of fault tolerance in computer systems, it is instructive at this point to re-examine the recovery block scheme of Chap. 2 as a two level system (see Fig. 1 b). The program P0 implements the recovery cache scheme, thereby providing recoverable store words on interface L0. An ‘update a word’ operation available on interface L0 has its concrete implementation in program P0, and essentially involves P0 recording (if necessary) the prior value of the word in the cache before performing the update. Similarly, the operation ‘recover’ available on L0 is implemented by P0 by restoring the prior values of the relevant words. Thus the abstraction of backward recovery available to programs running over L0 is supported by normal ‘forward’ actions of program. P0. We can take this idea further, and construct another interface L1 that provides new sets of recoverable objects — this naturally leads us to the concept of multilevel recovery systems.

Santosh Kumar Shrivastava

A Model of Recoverability in Multilevel Systems

Backward error recovery (that is, resetting an erroneous state of a system to a previous error-free state) is an important general technique for recovery from errors in a system, especially those errors which were not foreseen. However, the provision of backward error recovery can be complex, particularly if the implementation of the system is “multilevel” and recovery is to be provided at a number of these levels. This paper discusses two distinct categories of multilevel system, and then examines in detail the issues involved in providing backward error recovery in both types of system.

T. Anderson, P. A. Lee, S. K. Shrivastava

The Provision of Recoverable Interfaces

The recovery block scheme has been proposed as one method of providing fault tolerant software, and is dependent on the availability of recoverable interfaces so that any damage caused by an erroneous program can be repaired by backward error recovery. However, it is clear that the interface provided by the hardware in any practical system will contain unrecoverable objects. This paper investigates a method of structuring a system into multiple levels so that a level of software can “hide” the unrecoverable features of an interface and provide a new interface with recoverable objects to programs needing facilities for backward error recovery. The paper discusses this organisation of recovery in such a system.

T. Anderson, P. A. Lee

Structuring Distributed Systems for Recoverability and Crash Resistance

An object-oriented multilevel model of computation is used to discuss recoverability and crash resistance issues in distributed systems. Of particular importance are the issues that are raised when recoverability and crash resistance properties are desired from objects whose concrete representations are distributed over several nodes. The execution of a program at a node of the system can give rise to a hierarchy of processes executing various parts of the program at different nodes. Recoverability and crash resistance properties are needed to ensure that such a group of processes leave the system state consistent despite faults in the system.

S. K. Shrivastava

Distributed Systems

Introduction

Distributed systems provide new possibilities for the construction of high performance computer systems; at the same time, however, they present reliability problems not normally encountered in centralised systems. The eight papers of this chapter report both theoretical and experimental work in the area of reliability in distributed systems.

Santosh Kumar Shrivastava

State Restoration in Distributed Systems

This paper concerns an important aspect of the problem of designing fault-tolerant distributed computing systems. The concepts involved in “backward error recovery”, i.e. restoring a system, or some part of a system, to a previous state which it is hoped or believed preceded the occurrence of any existing errors are formalised, and generalised so as to apply to concurrent, e.g. distributed, systems. Since in distributed systems there may exist a great deal of independence between activities, the system can be restored to a state that could have existed rather than to a state that actually existed.The formalisation is based on the use of what we term “Occurrence Graphs” to represent the cause-effect relationships that exist between the events that occur when a system is operational, and to indicate existing possibilities for state restoration. A protocol is presented which could be used in each of the nodes in a distributed computing system in order to provide system recoverability in the face even of multiple faults.

P. M. Merlin, B. Randell

Recovery Control of Communicating Processes in a Distributed System

The backward recovery of a computation to a previously existing state is a well-known method for attaining a degree of fault tolerance in digital systems.In this paper a protocol is developed for the purpose of providing “unplanned” recovery control in a distributed system of communicating processes. The protocol has the property of ensuring that the whole system reverts to a consistent state in the event of one or more processes initiating recovery action and it supports the determination of recovery point safety; that is when a recovery point cannot possibly be recovered to. It provides recovery control that is “unplanned” in the sense that the consistent state to which the system reverts after the initiation of recovery action is not predetermined. It is determined dynamically when recovery action is initiated and is based on the recorded information flow between the processes.The protocol is first developed for a model of computation in which each process independently implements a succession of single level, i.e. non-nested, recovery regions and where no restrictions are placed on inter-process message passing. The model is then extended to cover the case where processes may implement nested recovery regions. A development of the basic protocol which covers this case is presented and is shown to be significantly more complicated.

W. G. Wood

A Dependency, Commitment and Recovery Model for Atomic Actions

Some ideas on the construction of user applications as atomic actions are developed. Atomic actions that last a long time pose several problems if conventional ideas on concurrency control and recovery are applied. What is required is some means of delaying commitment without sacrificing performance. A model is proposed in which it is possible for an action to release and process as yet uncommitable objects. The impact of this on recovery is also discussed.

S. K. Shrivastava

Fail-Safe Extrema-Finding in a Circular Distributed System

In [2] an algorithm is given and analysed for finding the highest numbered node in a circular decentralised arrangement of nodes. We extend this algorithm to cover also the case in which nodes may enter a certain well-defined failure mode.

E. Best, F. Panzieri

The Design of a Reliable Remote Procedure Call Mechanism

In this contribution we describe the design of a reliable Remote Procedure Call mechanism intended for use in local area networks. Starting from the hardware level that provides primitive facilities for data transmission, we describe how such a mechanism can be constructed. We discuss various design issues involved, including the choice of a message passing system over which the remote call mechanism is to be constructed and the treatment of various abnormal situations such as lost messages and node crashes. We also investigate what the reliability requirements of the Remote Procedure Call mechanism should be with respect to both the application programs using it and the message passing system on which it itself is based.

S. K. Shrivastava, F. Panzieri

Reliable Remote Calls for Distributed UNIX: An Implementation Study

An implementation of a reliable remote procedure call mechanism for obtaining remote services is described. The reliability issues are discussed together with how they have been dealt with. The performance of the remote call mechanism is compared with that of local calls. The remote call mechanism is shown to be an efficient tool for distributed programming.

F. Panzieri, S. K. Shrivastava

The Newcastle Connection or UNIXes of the World Unite!

In this paper we describe a software subsystem that can be added to each of a set of physically interconnected UNIX or UNIX look-alike systems, so as to construct a distributed system which is functionally indistinguishable at both the user and the program level from a conventional single-processor UNIX system. The techniques used are applicable to a variety and multiplicity of both local and wide area networks, and enable all issues of inter-processor communication, network protocols, etc., to be hidden. A brief account is given of experience with such a distributed system, which is currently operational on a set of PDP11s connected by a Cambridge Ring. The final sections compare our scheme to various precursor schemes and discuss its potential relevance to other operating systems.

D. R. Brownbridge, L. F. Marshall, B. Randell

Recoverability Aspects of a Distributed File System

The paper presents the recoverability features of a distributed file system which has been built as an extension of the UNIX1 file system. The algorithm to achieve recoverability for the distributed files is discussed. The techniques implemented include: backward and forward error recovery, global and incremental recovery data recording. The paper outlines how these techniques can coexist to provide the abstraction of recoverability to a user of the distributed file system.

M. Jegado

Fault Tolerance and System Structuring

Fault Tolerance and System Structuring

We discuss a general approach to the design of fault-tolerant computing systems, concentrating on issues of system structuring rather than on the design of particular algorithms. Three forms of structuring are described. The first is based on the use of what we term “idealized fault-tolerant components”. Such components provide a means of system structuring which makes it easy to identify what parts of a system have what responsibilities for trying to cope with what sorts of faults. The second is a “recursive structuring” scheme. It involves using complete computers as the basic idealized fault-tolerant components of a distributed computing system whose functionality matches that of its component computers. Finally we discuss a generalization of the usual concept of an “atomic action”, which provides a means of structuring both forward and backward error recovery in distributed systems. These discussions are given in general terms, and also illustrated by brief accounts of recent and current work at Newcastle on the construction of UNIX-based fault-tolerant and distributed systems.

B. Randell
Weitere Informationen