Skip to main content

Über dieses Buch

In the last decade of Computer Science development, we can observe a growing interest in fault-tolerant computing. This interest is the result of a rising number of appl'ications where reliable operation of computing systems is an essential requirement. Besides basic research in the field of fault-tolerant computing, there is an increasing num­ ber of systems especially designed to achieve fault-tolerance. It is the objective of this conference to offer a survey of present research and development activities in these areas. The second GI/NTG/GM~ Conference on Fault-Tolerant Computing Systems has had a preparatory time of about two years. In March 1982, the first GI conference concerning fault-tolerant computing systems was held in Munich. One of the results of the conference was to bring an organiza­ tional framework to the FTC community in Germany. This led to the founding of the common interest group "Fault-Tolerant Computing Systems" of the Gesellschaft fur Informatik (GI), the Nachrichtentechnische Gesellschaft (NTG), and the Gesellschaft fur MeB- und Regelungstechnik (VDI/VDE-GMR) in November 1982. At that time, it was also decided to schedule a biannual conference on fault-tolerant computing systems. One of the goals of this second conference is to strengthen the relations with the international FTC community; thus, the call for papers was extended not only to German-speaking countries, but to other countries as well.



Verteilte Systeme I / Distributed systems I

The Architecture and the Fault-Treatment of MuTEAM

A multimicroprocessor prototype, called MuTEAM, is presented. Its architecture is oriented to the exploitation of decentralized control policies for both error treatment and resource management.The hardware architecture is composed by a set of clusters, each cluster comprising a tightly coupled multiprocessor. Within a cluster, specialized hardware is provided for memory protection and management, and for interprocessor communication. The kernel implements the run-time support for a CSP-like, message-based concurrent programming language.The prototype includes a set of fault tolerance mechanisms, inserted in the system programming language and run-time support, and in the hardware configuration. These are being used to develop a fault treatment policy, consisting of separate phases of diagnosis, reconfiguration and recovery. All of these are executed by sets of parallel processes, without a supervisor entity. In particular, diagnosis is provided by a set of dedicated processes, one running in each computer element, while reconfiguration and recovery are obtained through proper organization of the application code.

P. Corsini, L. Simoncini, L. Strigini

Anwenderwerkzeuge für das fehlertolerante Multimikrocomputersystem FUTURE

Für die Anwenderprogrammierung besitzt das FUTURE-System neben der Programmiersprache Modula-2 Echtzeitbetriebssystemfunktionen. Die Aufrufe für Intertaskkommunikation und Prozeßein-/ausgabe aktivieren für den Anwender transparent die Fehlertoleranzmechanismen. Nur bei der Konfiguration eines Anwendertasksystems muß als einziges direktes Fehlertoleranzkennzeichen die Zuverlässigkeitsklasse (Taskklasse) jeder Task angegeben werden. Sowohl die Benutzung der Systemdienste, welche Fehlertoleranzmechanismen aufrufen, als auch die Angabe der Taskklassen werden durch ein Initialisierungsprogramm überwacht. Für den Test steht ein Monitor zur Verfügung, der auch fehlertoleranzspezifische Kommandos besitzt.

Franz Demmelmeier

Implementing Fault-Tolerance in a Distributed System Architecture

Atomic actions are widely accepted as a powerful general concept for providing consistency and error recovery. However, so far the necessary mechanisms to implement these properties severely confine interprocess communication and the degree of internal concurrency and thereby interfering with the main advantages of distributed systems. Based on the concept of an object oriented architecture an efficient error recovery strategy is presented which largely overcomes these constraints. To do so, we consider the premature release of objects, i.e. allowing access to them before the associated atomic action has committed. The atomic action concept is extended by separating the termination of such an action from the commitment procedure. This avoids possible delay of the affected participants. The redundancy, inherent in a distributed system, is exploited to run computations and necessary recovery activity concurrently.

E. Nett, K.-E. Grosspietsch, J. Kaiser, R. Kröger

Extending Resilient Objects Efficiently

Resilient objects are instances of distributed abstract data types that are tolerant to failures. Due to the distributed nature of resilient objects and the replication of data, potential for a high degree of concurrency exists within them. This paper introduces a new concurrency control algorithm, which achieves higher concurrency than conventional methods like two-phase locking. Objects are specified in a high level language, and the algorithm uses the specification to take advantage of the structure of resilient objects and to exploit semantic information about operations. This information is given in a high level specification language.

Kenneth P. Birman, Thomas A. Joseph, Thomas Räuchle

Fehlertolerante Betriebssysteme / Fault-tolerant operating systems

A Fault Tolerant UNIX(TM) Process System Based On Reliability Classes

A fault tolerance model for the dynamic process system of the UNIX operating system is described which at minimum garantees existing programs to be executable without any changes. The model is based on a concept which was designed for the FUTURE fault tolerant multicomputer system and offers taskspecific, static hardware redundancy using 4 reliability classes. The report starts with a discussion of the fundamental problems concerning the adaptation of that concept to the UNIX process environment. Then two models for a process oriented class specification are derived. Within the first one the class is defined once a UNIX command is invoked by the user. All processes which are generated within the execution of one command automatically inherit the same reliability class. The second model gives the possibility to specify different classes within such a process subset. It is based on an external structure which contains an exact specification of process relationships and interprocess communication according to predefined rules.

T. Brand, H. Kästner, F. Demmelmeier, G. Koller

Design and Structure of an Operating System Layer Implementing Fault Tolerance

In this paper, design and structure of an operating system layer which is to implement fault tolerance in a multi-microprocessor system is described. This layer is installed on top of an off-the-shelf operating system. It is transparent to the user and realizes the “software implemented fault tolerance (on operating system level)” of the work station ATTEMPTO.

Th. Risse, R. Brause, M. Dal Cin, E. Dilger, J. Lutz

The Introduction of Fault-Tolerance in a Hierarchical Operating System

A general method for introducing fault-tolerance in a hierarchical operating system is presented here. First, a hierarchically structured conventional (non-fault-tolerant) operating system is described. In order to transform it into a fault-tolerant system, each conventional machine is augmented with an Error Detection and Recovery (EDR) mechanism, thus obtaining a corresponding fault-tolerant machine. It is determined that, from the standpoint of fault-tolerance, three types of machines can be identified: physical, kernel, and process type. The EDR mechanism makes a conventional machine fault-tolerant by transforming its conventional operations into fault-tolerant operations. To provide this transformation, a set of operations are defined for the EDR mechanism. A model for fault-tolerant operations is developed, such that known techniques for fault-tolerance (e.g. recovery block, N-version programming, etc.) can be represented as particular cases. The general fault-tolerant operating system obtained is a hierarchy of fault-tolerant machines, with the physical type machines at the bottom, followed by the kernel type machines above them, and the process type as the upper machines.

Marius D. Soneru, Wing H. Huen

Verifikations- und Laufzeitvalidationsmethoden / Verification and run-time validation techniques

Efficient Online Error Detection Techniques for Trees in Database Systems

B- and B*-trees are often used for the implementation of access paths in database systems. Certain faults may possibly create inconsistencies in those trees which seriously affect database processing. This paper presents error detection algorithms which may be executed as part of normal tree handling in database operation. Consistency checking requires only a small amount of redundant information. To keep DB processing efficient, fast algorithms are proposed for checking purposes. Moreover, they do not require additional I/O operations.

Klaus Küspert

CADAS: A Tool for Designing Reliable Embedded Software and Supporting Testing “in the Large”

The development of embedded software systems has been plagued by problems of reliability, availability and high costs. The development of ADA has recognised the contribution of the implementation language to this budget deficit. Unfortunately, that of the support environment including verification and validation and the interplay of its individual components has still to be generally recognised and understood.In order to define an ideal tool for verification and validation, its interplay with other programming support environment components must first be recognised. The design and part realisation of such a tool has been achieved through the development of CADAS (Computer Aided Design And Verification System). In this paper its development, implementation and use todate are discussed. A concept of how it may be used to support the development and verification of exception handlers is presented.

M. A. J. Burford, F. Belli

Automatische Verifikation logischer Entwürfe

For hierarchically structured logic designs our program VERENA verifies if a transformation from a register-transfer description to a logic level description has been correctly performed. Fach of the two descriptions is considered as a composition of macro units and it is assumed that there exists a one-to-one mapping between the sets of macro units at the register transfer and the logic level. The transformation is correct if related macro units are equivalent. The equivalence proof for two macro units is reduced to an equivalence proof for two combinational circuits. We describe a new method for solving this problem which has been programmed as an experimental version.

W. Grass, N. Schielow

Allgemeine Modelle / General models

Modeling and Validating Broadcasting-Free Concurrent Systems

We consider the systems composed of a finite set of asynchronously cooperating subsystems. The subsystems may communicate with each other by sending or receiving synchronization messages. Each particular synchronization message has a unique sender and a unique receiver. Such set of subsystems is called “broadcasting-free”. We show how to generate a global description of the system from the descriptions of its subsystems provided they cooperate by means of the synchronization mechanism called rendez-vous. We show also how to validate the system behaviours. The susbystems and the whole system will be modelled by abstract machines defined in the paper.

Stanislaw Budkowski

A Theory for the Analysis and Construction of Fault-Tolerant Systems

In this paper, we are going to introduce mathematical concepts to define the notion of a „Module“ for hardware and software systems and to take into account methods of fault-tolerant computation from the very beginning of system development. Moduls are formalized by means of Regular Expressions and their derivates. With the aid of these mathematical concepts a theory of the analysis and construction of fault-tolerant systems will be presented.The general fault-tolerant problem will be solved in three steps: The predication of fault-tolerence of a system, the diagnosis of faults while the system is in operation and, last but not least, a construction of a fault-tolerant system.The methods applied are syntactical although the class of faults under consideration are by no means restricted to the calls of syntax errors in programming.

B. Eggers, F. Belli

An Uniform Approach to the Operating System in Fault-Tolerant Distributed Computer Systems

In the paper a formal model of the fault-tolerant distributed computer system with the possibility of reconfiguration is considered. A fault-tolerant distributed computer system model from hardware viewpoint forms a fault-tolerant net, in which concurrent algorithms are performed. Due to uniform criteria of reconfigurations and interactions, an uniform analysis, synthesis and optimization is possible. Next the main idea of the operating system for the above systems has been described.

Eugeniusz Eberbach, Jan R. Just

Fehlertolerante Hardwarekomponenten / Fault-tolerant hardware components

Modular Design of Totally Self-Checking Checkers for 1-Out-of-N Codes

This paper presents modular design methods of totally self-checking checkers (TSCCs) for 1-out-of-n (1/n) codes. In a modular method a number of 1-out-of-ni TSCCs of lower rank are used to build a larger 1/n TSCC (ni<n, n>6). Two modular design procedures are developed. The elementary modular design procedure in which the 1/4, 1/5, 1/6 TSCCs are used as modules to build any 1/n TSCC (n≥7), and the optimal modular design procedure, in which the problem of the most efficient decomposition of the 1/n TSCCs into arbitrary 1/ni TSCCs of lower rank is investigated.

C. Efstathiou, C. Halatsis

Systematic t-Error Correcting All Unidirectional Error Detecting Codes

In this paper we give two new methods for the construction of systematic t-random error correcting and all unidirectional error detecting codes. Also we give the encoding/decoding algorithms and discuss their implementation.Beginning from a k-error correcting systematic parity check code we append check symbols in order to construct t-EC/AUED codes, with t≤k. This code construction technique takes into account the Hamming distance of the parity check code in order to reduce the number of bits for the check symbols. Thus the codes derived are significantly more efficient than the known codes with the same capabilities. Moreover these codes have a very useful property. Discarding some check symbols we get codes with weaker capabilities against random errors. This property offers the ability of uniform coding over system units which have different demands on random errors.

D. Nikolos, N. Gaitanis, G. Philokyprou

Concurrent Error-Detection/-Correction of Logical Operations

As a consequence of studies concerning a fault-tolerant microprogrammed microprocessor an error-detection and error-correction scheme for logical operations has been developed. Implementing inverse residue coding for error-detection and inverse biresidue coding for error-correction the hardware is considered. The theoretical basis of residue coding is briefly summarized and the error-detection/error-correction of logical operations is presented. Concurrent error-detection capability is discussed as well as the estimation of hardware overhead.

W. Michael Trautwein

Zuverlässigkeitsmodelle / Reliability models

Dependability Modeling and Evaluation of Software-and-Hardware Systems

This paper is an attempt to provide a conceptual framework for modeling and evaluating the reliability and availability of computing sytems, with respect to both physical and design faults.

J. C. Laprie

Qantitative Bewertung der Zuverlässigkeit von Echtzeitprogrammsystemen

Improvement of programming methods, perfecting of programming languages and increase of expenditure for verification and testing do not lead to programm systems which are perfect and absolutely reliable as experience has shown.

D. Balzer, B. Böhme

Issues in Reliability Modeling of Fault-Tolerant Computers

Four key issues in reliability modeling of fault-tolerant systems (imperfect coverage, large state space, non-exponential behavior and error estimation) are discussed. Some general methods as well as HARP (The Hybrid Automated Reliability Predictor) methods of dealing with these problems are outlined. An annotated bibliography provides pointers to the literature for further information.

Kishor Trivedi, Joanne Bechta Dugan, Robert Geist, Mark Smotherman

Einzelfehler Tolerierende Kommunikationsnetze

Es werden ausgewählte Eigenschaften und Parameter von Einzelfehler tolerierenden Kommunikationsnetzen diskutiert. Neben den Zuverlässigkeitsparametern, die hier jedoch nur bis zur Indikatorfunktion für den Netzzusammenhalt entwickelt werden, sind dies hauptsächlich Selbstdiagnose-Eigenschaften.Die Arbeit soll bei der Analyse bestehender Netze helfen, die Eigenschaft der Einzelfehlertoleranz verknüpft mit rascher Fehlerortung nötigenfalls durch gezielte Zusätze zu erreichen, und sie soll Hilfen zur systematischen Synthese derartiger Systeme geben.

Winfrid Schneeweiss

Verteilte Systeme II / Distributed systems II

Implementation of a Fault-Tolerant File Management System

In our paper we present the concept of a file management system which is designed to meet the requirements of file consistency and data integrity in a fault-tolerant multimicrocomputer system and which has been implemented in an experimental model. Based on redundant file server hardware the system guarantees access to all background data even in case of a disk failure. The locking and backup mechanism used to preserve file consistency with restarting processes and the protocol synchronizing the two file server modules are explained in detail. Finally the reinsertion of a repaired storage device into the running system is described.

D. Bernhardt, A. Klein

A Fault-Tolerant Loop-Structured Local Area Network on the Base of Fibre-Optical Transmission

The paper describes a technique for increasing the reliability of a local area network whose stations are interconnected by optical fibres forming a double ring with anti-parallel transmission in both rings. One of the rings is used for transmitting and the other for receiving messages. Logically the network is operated as a loop system in which one link of the double ring between two (first of all arbitrary) stations is not used by feeding the messages from the transmitting ring into the receiving ring in the station before the unused (redundant) link (the so-called reversal station). If a station fails or a link breaks the function of the reversal station is transferred into the station before the failure location not using the failing link now. Thus the full network functionality is re-established. The reconfiguration protocol is described and illustrated by an example.

W. Schröck

Synchronization Tools and a Restart Method in the Fault-Tolerant Distributed Automation System FIPS

The experimental microcomputer-system FIPS represents a part of a distributed processautomation system. It was designed to investigate the possibilities of implementing fault-tolerance with respect to the specific requirements of the industrial processautomation (fairly independent and flexible extensible local subsystems). The present paper describes the synchronization tools of the system supporting parallel running jobs on different subsystems that compare or vote on processoutput-signals. Further it shows a method to inform the application programmer about a job restart allowing him to react according to the requirements of the technical process and of his control-algorithm.

Hubert Maencher

Diagnose auf Systembene / Diagnosis at system level

Distributed Self-Diagnosis and Fault-Tolerant Communication in Parallel Multiprocessor Networks

In this paper concepts are presented which provide fault-tolerant communication and distributed system-level self-diagnosis in networks. Information of the state of the system is not obtained by performing a normal diagnostic algorithm /6,9,10/, but is provided by the interconnection structure of the network itself. Maximum diagnosability and maximum degree of fault-tolerant communication can be achieved in merely three rounds. This result covers not only processing element faults but also faults of the interconnections.

Elmar Dilger

System Level Fault-Diagnosis in Distributed Systems

This paper presents a graph model for diagnosing faulty and repaired units and their interconnections in distributed systems. Some requirements for the application of the model to computer systems are treated. The system level fault diagnosis can be computed by an algorithm which runs on the modules of the system. Such an algorithm is discussed in the paper and included in the appendix.

Klaus Moritzen

Selbstdiagnoseverfahren auf der Grundlage der mathematischen Logik

Es wird gezeigt, wie Probleme der Fehlerentdeckung bei sich selbst testenden Systemen mit Mitteln der mathematischen Logik behandelt werden können. Die dargestellte Methode beruht auf der vollständigen Beschreibung von Testbedingungen und Testaussagen durch Mengen geeigneter logischer Formeln. Die Untersuchung und Auswertung solcher (logischen) Syndrome kann, wie gezeigt wird, mit Mitteln der mathematischen Logik geschehen. Auf ihrer Grundlage werden, in Form von Kalkülen, Verfahren angegeben, mit denen aus (logischen) Syndromen Aussagen über Fehlerzustände der Systemeinheiten gewonnen werden können. Die Allgemeinheit der Beschreibungsmethode erlaubt die Darstellung verschiedenartiger Testsituationen in einheitlicher Weise und damit ihre Behandlung mit den selben Verfahren.

Horst Federmutz

Tests with fault-localizing capabilities improve system level diagnosis

A fundamental work in the system level diagnosis is the paper of F. Preparata, G. Metze and R. Chien [1] published in 1967. Since then digital systems have become larger and more complex, systems reliability requirements have increased also. It is then desirable to be able to determine quickly and precisely faulty units in a system, in a case of a system fail. Due to it, system level diagnosis becomes more and more important.

Maciej Modrzejewski

Protokolle / Protocols

Bestimmung der Protokoll-Menge für Verteilte Fehlermaskierungs-Systeme

In static redundant multi-computers voters can be implemented as distributed systems. The nodes of a voter-system execute a protocol, which masks faults and evaluates sender-receiver-pairs for redundant communication between user-processes with minimum overhead. This paper gives a description method and a correctness criterion for the set of all voting-protocols. It is based on a so-called Voting-Protocol-Graph, whose arcs model special protocol properties, such as availibility of signature information. The semantics of the graph is defined by corresponding attributed petri-nets. The attributes of a protocol-unit’s last action have to meet a fault-tolerance criterion, which depends on a given fault-model.

Klaus Echtlel

Modelling and Verification of a Checkpoint-Restart-Protocol

The paper presents a protocol for the treatment of temporary local crashes and communication failures in systems of communicating processes. Assuming the existence of a global partitioning of the processes’ interactions into a sequence of segments as well as the availability of a reliable memory to each participant, the recovery strategy lies in the provident permanent recording and discarding of checkpoint data and, in cases of failure, common restarts from some recent global checkpoint (boundary between two segments). Apart from the inherent limitations imposed by data flow and resources the processes may run totally asynchronously. The protocol is modelled by a PrT-net, which permits to formally state and verify the protocol’s service, namely guaranteeing liveness and the eventual progress of the system’s actions. The final part of this paper is devoted to the structure and basic ideas of this proof.

Bernd Baumgarten, Peter Ochsenschläger

Hardware-Testmethoden I / Hardware testing methods I

VLSI Functional Testing Using Critical Path Traces at a Hardware Description Language Level

A new approach for testing VLSI circuits is presented. Through backward critical path tracing, a test and all faults detectable by the test are generated simultaneously. Therefore, the expensive fault simulation is completely eliminated. We present a critical path test generation procedure for digital systems described by hardware description language (HDL). A multiplication circuit described by a HDL is utilized for demonstrating the test generation method.

Li Shen, Stephen Y. H. Su

Random Testing of LSI Self-Checking Circuits

The principle of random testing is to apply a sequence of random input patterns simultaneously to both a circuit under test and a reference circuit. The outputs are compared. The research aim is to determine the test length (number of input patterns to apply) to obtain a given test quality. In the case of the microprocessor one applies a sequence of random instructions with random data, and a method of evaluating an upper bound for a detecting sequence has been defined in previous works. This paper recalls the method and concerns extensions to use it for LSI circuits which are not microprocessors. One gives results (test lengths) for three LSI self-checking circuits and more general comments about random testing of self-checking circuits. Some experimental results are presented in a summary form.

Hugues Deneux, Pascale Thevenod-Fosse

Functional Testing Vs. Structural Testing of Rams

The object of this paper is to synthetize results that have been published for RAM testing since 1977. These results are concerned with functional and semi-functional testing. It is noted that although the used vocabulary is different, fault hypotheses are very similar. These types of algorithms reached the availability of quasi-optimal algorithms. Structural testing is then introduced, the goal being to relate functional fault hypotheses and structural fault hypotheses, and to investigate the interest of structural type test algorithms compared to functional type test algorithms.

H. Sahami, B. Courtois

Hardware Testmethoden II / Hardware testing methods II

Compression of Multiple-Valued Data Serial Streams by Means of Parallel LFSR Signature Analyzer

It is characteristic of microprocessor systems that, apart from states 1 and 0, also states of high impedance HZ occur in them, among other things, on data and address buses. With regard to this one should take into account the possibility of applying such a state to a signature analyzer. The standard solution here is the application of an appropriate three-state data probe with the unit decoder 3/2 — JK flip-flop [1] which, at the moment of the HZ state detection, prolongs the previously existing state. Thus, the third state is interrpreted as the state 1 or 0, according to the state which had a tested point before its occurence. Such a solution has been approved by the Hewlett Packard Company and then by other companies in their signature analyzers.

Andrzej Hławiczka

The Detection of Small Size Multiple Faults by Single Fault Test Sets in Programmable Logic Arrays

In this paper we present a method to quantitively predict the multiple fault coverage capability of a single fault detection test set in a PLA. The method enables to determine the coverage ratio defined as the ratio of the number of multiple contact faults detected by a single fault test Tc to the total number of all multiple faults. The analysis is divided into two parts. First we consider multiple faults which do not contain any four-way masking cycle. Next, the masking relations are studied in detail and it is shown that Tc detects significant percentage of faults with four-way masking cycle. Based on these results the bounds of coverage capability of Tc are determined. It is shown that the multiple fault coverage ratio dr increases with increasing number of rows of a PLA and the ratio drops with increasing size of faults.Index Terms — PLA testing, contact faults, fault masking, multiple fault detection, fault coverage, worst case coverage.

Janusz Rajski, Jerzy Tyszer

Can Design Faults be Tolerated?

The short answer to the question posed by the title is “Yes”. A more cautious, and less simplistic, response would be that in certain circumstances, with appropriate provision of redundancy and allied supporting mechanisms, it is certainly possible to provide a measure of tolerance to faults of design. However, although this question may serve as an appropriate title, and starting point for discussion, it does not adequately address the significant issues concerning the application of fault tolerance techniques to deficiencies of design. As is usually the case, the first, and perhaps most important, step is to ask the right questions. In this paper, I propose to substitute five further questions in place of my title and, in answering those questions, will argue the case for the use of design fault tolerance in the development of reliable computing systems. In so doing I hope to justify the short and cautious answers already given above.

T. Anderson
Weitere Informationen