
1996 | Book

Dependable Computing — EDCC-2

Second European Dependable Computing Conference Taormina, Italy, October 2–4, 1996 Proceedings

Edited by: Andrzej Hlawiczka, João Gabriel Silva, Luca Simoncini

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the Second European Dependable Computing Conference, EDCC-2, held in Taormina, Italy, in October 1996.
The book presents 26 revised full papers selected from a total of 66 submissions based on the reviews of 146 referees. The papers are organized in sections on distributed fault tolerance, fault injection, modelling and evaluation, fault-tolerant design, basic hardware models, testing, verification, replication and distribution, and system level diagnosis.

Table of Contents

Frontmatter
Friends: A flexible architecture for implementing fault tolerant and secure distributed applications
Abstract
FRIENDS is a software-based architecture for implementing fault-tolerant and, to some extent, secure applications. This architecture is composed of sub-systems and libraries of metaobjects. Transparency and separation of concerns are provided not only to the application programmer but also to the programmers implementing metaobjects for fault tolerance, secure communication and distribution. Common services required for implementing metaobjects are provided by the sub-systems. Metaobjects are implemented using object-oriented techniques and can be reused and customised according to the application needs, the operational environment and its related fault assumptions. Flexibility is increased by a recursive use of metaobjects. Examples and experiments are also described.
Jean-Charles Fabre, Tanguy Pérennou
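For illustration, a minimal Python sketch of the metaobject idea follows (hypothetical names, not the FRIENDS API): a metaobject intercepts calls on an application object and adds a fault-tolerance behaviour, here a bounded retry, without changing the functional code.

# Minimal sketch of the metaobject idea (hypothetical names, not the FRIENDS API):
# a metaobject intercepts method calls on an application object and adds a
# fault-tolerance behaviour (here: bounded retry) without changing the object itself.

class RetryMetaobject:
    def __init__(self, base, retries=3):
        self._base = base         # the functional application object
        self._retries = retries   # fault-tolerance policy parameter

    def __getattr__(self, name):
        method = getattr(self._base, name)
        def wrapped(*args, **kwargs):
            last_exc = None
            for _ in range(self._retries):
                try:
                    return method(*args, **kwargs)   # transparent forwarding
                except IOError as exc:               # assumed transient fault
                    last_exc = exc
            raise last_exc
        return wrapped

class Sensor:                     # purely functional application class
    def read(self):
        return 42

sensor = RetryMetaobject(Sensor())  # fault tolerance added by composition
print(sensor.read())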
Adaptable fault tolerance for distributed process control using exclusively standard components
Abstract
This paper describes an adaptable fault tolerance architecture for distributed process control that uses exclusively standard hardware, standard system software and standard protocols. It offers a quick, low-cost way to provide non-safety-critical technical facilities and plants with continuous service, achieving a maximum of practicability for the application engineers. The architecture is composed of well-known fault tolerance methods under the constraints of real-time requirements. The latitude afforded by non-safety-critical applications is carefully exploited to minimize the fault tolerance overhead. Because the fault tolerance is transparent, each functional part of the process control, represented by an application task, can be implemented without regard to non-determinism or to the hosts on which it executes. A control system is configured simply by naming hosts, tasks and groups in a file, in which every task is declared together with its selected fault tolerance strategy.
A fault-tolerant system can be expected to reconfigure automatically after a fault. The present system does more: it reintegrates repaired hosts automatically and re-establishes redundant operation while the entire system keeps working.
Jürgen Bohne, Reny Grönberg
On stratified sampling for high coverage estimations
Abstract
This paper addresses the problem of estimating the coverage of a fault tolerance mechanism through statistical processing of observations collected in fault-injection experiments. In an earlier paper, several techniques for sampling the fault/activity input space of a fault tolerance mechanism were presented. Various estimators based on simple sampling in the whole space and stratified sampling in a partitioned space were studied; confidence limits were derived based on a normal approximation. In this paper, the validity of this approximation is analyzed, especially for high-coverage systems. The theory of confidence regions is then introduced to estimate the coverage without approximation when, for practical reasons, stratification is used. Three statistics are considered for defining confidence regions. It is shown that one of these statistics — a vectorial statistic — is often more conservative than the other two. However, only the vectorial statistic is computationally tractable. The results obtained are compared with those based on approximation by means of three hypothetical example systems.
David Powell, Michel Cukier, Jean Arlat
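As a simple illustration of the underlying point estimate (not the confidence-region construction of the paper), the following Python sketch computes a stratified coverage estimate; the stratum weights and counts are hypothetical.

# Illustrative stratified coverage estimate: the fault/activity space is split
# into strata with known occurrence probabilities p_i; n_i faults are injected
# per stratum and d_i of them are covered (detected/tolerated).

strata = [
    # (p_i, n_i, d_i) -- hypothetical figures
    (0.70, 1000, 998),
    (0.25,  400, 392),
    (0.05,  100,  97),
]

coverage = sum(p * d / n for p, n, d in strata)
print(f"stratified coverage estimate: {coverage:.4f}")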
Fault injection evaluation of assigned signatures in a RISC processor
Abstract
This paper proposes a new assigned signature monitoring technique called VASC (Versatile Assigned Signature Checking) and presents a fault injection evaluation of this technique in a modern reduced instruction set processor (RISC). VASC is applied at the machine instruction level and is the first assigned signature monitoring technique that allows the user to choose block sizes and checking intervals with complete freedom. This feature allows the application to be tuned for the lowest overhead and makes it possible to identify and analyse the relationship between overhead and coverage for different choices of block sizes and checking intervals. Previous works presented assigned signature techniques that were either specific to a Very Long Instruction Word (VLIW) processor or had very high memory and execution time overhead. VASC can be applied to any system with small (and adjustable) execution time and memory overhead. We have measured those overheads in a PowerPC processor and in a Transputer T805 processor, showing that they are completely adjustable. One interesting conclusion is that the best error coverage of the technique does not correspond to the highest amount of control flow checking. The evaluation of the error coverage and latency of VASC, implemented in a PowerPC RISC processor, has been carried out using the Xception software fault injection tool. The technique accounted for 10 to 16% of the detected errors, but the overall error coverage improvement was relatively small (less than 4%). One reason for this is that the percentage of control flow errors (the main type of errors detected by VASC) in the PowerPC is relatively low compared to the same percentage in processors with a more complex instruction set, which suggests that assigned signatures are much better suited for systems having variable-sized instructions and less efficient memory access checking.
Pedro Furtado, Henrique Madeira
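The following Python sketch illustrates the generic assigned-signature idea only; the block sizes, signature values and checking interval are hypothetical and do not reflect VASC's actual encoding.

# Generic assigned-signature sketch: each basic block has an assigned signature;
# an embedded run-time variable is updated on block entry and compared with the
# compile-time reference at a chosen checking interval.

block_sig = {0: 0x11, 1: 0x22, 2: 0x33, 3: 0x44}   # signature assigned per block
expected_trace = [0, 1, 2, 3]                       # compile-time control flow

def run(trace, check_every=2):
    runtime_sig = 0
    expected_sig = 0
    for i, blk in enumerate(trace):
        runtime_sig ^= block_sig[blk]               # updated by embedded code
        expected_sig ^= block_sig[expected_trace[i]]
        if (i + 1) % check_every == 0:              # checking interval
            if runtime_sig != expected_sig:
                return f"control-flow error detected after block {blk}"
    return "no error detected"

print(run([0, 1, 2, 3]))        # correct flow
print(run([0, 2, 1, 3]))        # illegal jump: blocks executed out of order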
An evaluation of the error detection mechanisms in MARS using software-implemented fault injection
Abstract
The concept of fail-silent nodes greatly simplifies the design and safety proof of highly dependable fault-tolerant computer systems. The MAintainable Real-Time System (MARS) is a computer system where the hardware, operating system, and application level error detection mechanisms are designed to ensure the fail silence of nodes with a high probability.
The goal of this paper is two-fold: First, the error detection capabilities of the different mechanisms are evaluated in software-implemented fault injection experiments using the well-known bit-flip fault model. The results show that a fail silence coverage of at least 85% is achievable by the combination of hardware and system level software error detection mechanisms. With the additional use of application level error detection mechanisms a fail silence coverage of 100% was achieved.
Second, the limits of the application level error detection mechanisms are evaluated. In these experiments, the fault model consists of highly improbable residual faults to deliberately force the occurrence of fail silence violations. Despite this worst-case scenario, more than 50% of the presumed undetectable errors were detected by other mechanisms and hence did not lead to fail silence violations.
Emmerich Fuchs
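The bit-flip fault model used in such software-implemented fault injection experiments can be summarised in a few lines of Python; the word width and target value below are arbitrary.

# Minimal sketch of the bit-flip fault model: a single bit of a memory word or
# register image is inverted at a randomly chosen position.
import random

def inject_bit_flip(word, width=32, rng=random):
    bit = rng.randrange(width)          # fault location chosen at random
    return word ^ (1 << bit), bit       # corrupted word and flipped position

value = 0x0000_00FF
corrupted, bit = inject_bit_flip(value)
print(f"original={value:#010x} corrupted={corrupted:#010x} (bit {bit} flipped)")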
Dependability modeling and analysis of complex control systems: An application to railway interlocking
Abstract
This paper describes the dependability modelling and evaluation of a real complex system, made of redundant replicated hardware and redundant diverse software. It takes into account all aspects of their interactions (including correlation between the diverse software variants) and of the criticality of the several components. Our approach has been to build the system model in a structured way. This allows us to cope with complexity and to focus, where of interest, on specific behaviour for a more detailed analysis. Furthermore, each level may be modelled using different methodologies and its evaluation performed with different tools, without modifying the general structure of the model. In order to validate the most complex sub-models, we built alternatives using different tools and methodologies; this proved very useful, since it allowed us to find small bugs and imperfections and to gain more confidence that the models represented the real system behaviour. With respect to the real system taken as the example, our analyses, which cannot be reported here, allowed us to identify the dependability bottlenecks of the current version and to set targets for the several subcomponents such that the system targets could be reached, thus providing hints for subsequent releases or modifications of the system and information for assigning targets to the various components of the system.
Manuela Nelli, Andrea Bondavalli, Luca Simoncini
The effect of interfailure time variability on the software reliability growth modelling
Abstract
This paper deals with the effect of interfailure time variability on the modelling of software reliability growth. Some primarily mathematical arguments are outlined that cast doubt on the commonly adopted view of failure occurrence as a Poisson (either homogeneous or non-homogeneous) process. On a contrived (but plausible in software practice) example, the consequences of this assumption, in some cases inappropriate for software reliability predictions, are illustrated. The idea of "tuning" reliability growth models based on measurements of the distribution of a single execution time is advocated.
Peter Popov
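A hedged illustration of the variability argument (not the paper's example): a homogeneous Poisson process implies exponentially distributed interfailure times with coefficient of variation 1, so an empirical stream with a much larger coefficient of variation is poorly described by it. The distributions and parameters below are arbitrary.

# Compare interfailure-time variability: exponential (Poisson process) versus a
# heavier-tailed lognormal stream with a similar mean.
import random, statistics

random.seed(1)
mean = 100.0
exp_times = [random.expovariate(1 / mean) for _ in range(10_000)]
ln_times = [random.lognormvariate(3.5, 1.5) for _ in range(10_000)]   # similar mean, larger CV

for name, xs in [("exponential", exp_times), ("lognormal", ln_times)]:
    cv = statistics.pstdev(xs) / statistics.fmean(xs)
    print(f"{name:11s} mean={statistics.fmean(xs):8.1f}  CV={cv:.2f}")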
Dependability evaluation of a computing system for traction control of electrical locomotives
Abstract
This article presents the dependability analysis of a computing system controlling the traction system, and especially the semiconductor current converters, of a modern electric locomotive. The following special aspects of this application are taken into account: different degrees of performance reduction, lurking errors, periodic tests, and short repair times. The dependability evaluation started from a simple "symptomatic model" showing the stages of performance degradation. It turned out to be a very helpful visualisation aid when discussing the failure modes with the experts for the components concerned. The detailed knowledge about hardware failures and their effects was collected in one large FMEA table. For the subsequent mathematical analysis, an elaborate "FMEA-oriented Markov model" was automatically constructed from the FMEA table. This approach proved to be efficient and straightforward, giving clear results and hints on which components have the most influence on MTTF and may therefore need to be redesigned or made redundant. The customer's special requirements could be taken into account by arbitrarily varying the sets of up- and down-states. The approach is expected to be applicable to many similar dependability evaluation problems.
Silke Draber, Bernhard Eschermann
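The following Python sketch shows, under strong simplifying assumptions, how an absorbing Markov model derived from FMEA-style failure rows yields an MTTF estimate; the states and rates are hypothetical, not those of the locomotive study.

# Hedged sketch: absorbing CTMC with states 0 = full service, 1 = degraded,
# 2 = down (absorbing). MTTF is obtained by solving Q_T t = -1 over the
# transient states.
import numpy as np

LAMBDA_DEGRADE = 1e-4   # per hour, failure causing performance reduction
LAMBDA_FAIL    = 2e-5   # per hour, failure leading directly to the down state
MU_REPAIR      = 1e-2   # per hour, on-line repair of the degraded state

# Generator restricted to the transient (up) states {0, 1}.
Q_T = np.array([
    [-(LAMBDA_DEGRADE + LAMBDA_FAIL),  LAMBDA_DEGRADE],
    [ MU_REPAIR,                      -(MU_REPAIR + LAMBDA_FAIL)],
])

t = np.linalg.solve(Q_T, -np.ones(2))   # expected time to absorption
print(f"MTTF from full service: {t[0]:.0f} h")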
Dependability models of RAID using stochastic activity networks
Abstract
The development of CTMCs to model RAIDs with a general failure-repair process is a complex task. Introducing a higher-level modelling technique that facilitates the description of the failure-repair processes is essential to deepen the dependability analysis of these systems.
In this paper, different dependability models of RAID level 5 are presented. The chosen technique is an extension of timed Petri nets known as Stochastic Activity Networks (SAN). As will be shown, it is relatively simple to describe a SAN equivalent to the developed CTMC. This model will be called the basic model. Furthermore, the flexibility introduced by these nets, through the cases associated with their timed activities and the input and output gates, allows the basic SAN model to be extended in order to obtain more accurate results or to adapt it to a new specification of the system operation. Thus, the basic model will be modified to represent the repair process in detail, differentiating its two phases: replacement and reconstruction. The influence of a limited number of on-line spares on the reliability of the system will also be considered. Finally, using the structure of the basic model, a SAN of a RAID with a higher protection level (RAID level 6) is presented.
Vicente Santonja, Marina Alonso, Javier Molero, Juan J. Serrano, Pedro Gil, Rafael Ors
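For orientation, a back-of-the-envelope CTMC result for RAID level 5 can be computed directly in Python (this is the classical closed-form approximation, not the SAN models of the paper; disk counts and rates are hypothetical).

# Three-state RAID-5 model: "all disks working" -> "one disk failed" -> "data
# loss" (absorbing). With N disks, disk failure rate lam and repair rate mu,
# the classical mean time to data loss is ((2N-1)*lam + mu) / (N*(N-1)*lam**2).

def raid5_mttdl(n_disks, lam, mu):
    return ((2 * n_disks - 1) * lam + mu) / (n_disks * (n_disks - 1) * lam ** 2)

lam = 1 / 500_000          # disk failures per hour (disk MTTF = 500,000 h)
mu = 1 / 24                # repairs per hour (24 h replacement + rebuild)
print(f"RAID-5, 8 disks: MTTDL ≈ {raid5_mttdl(8, lam, mu):.3e} h")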
Compiler assisted self-checking of structural integrity using return address hashing
Abstract
A software-based approach to control-flow checking is presented. The method uses the control flow graph of a program to construct a state machine which is embedded into the program using a modified GNU C compiler. Since the return address register is used as the state variable of the FSM, no data overhead occurs. Employing a compiler for embedding the redundant code into the program permits the exploitation of delay slots and jump optimizations for modern RISC processors. The method is evaluated on a SPARC processor using software-implemented control-flow error injection and the SPECint92 benchmark suite. The average temporal overhead is below 20%, and the errors violating the fail-silent model can be reduced by a factor of 6, down to 0.3%.
Uwe Wildner
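A strongly simplified Python model of the underlying check follows: the control flow graph defines the legal FSM transitions, and any observed block-to-block transition outside those edges is flagged. The paper's actual mechanism, embedding the state in the return address register, is not modelled here; the graph is hypothetical.

# Simplified CFG-based control-flow check: report any transition not in the CFG.

cfg_edges = {("A", "B"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")}

def check_trace(trace):
    for src, dst in zip(trace, trace[1:]):
        if (src, dst) not in cfg_edges:
            return f"illegal transition {src} -> {dst}"
    return "trace consistent with control flow graph"

print(check_trace(["A", "B", "C", "E"]))   # legal path
print(check_trace(["A", "B", "E"]))        # erroneous jump skipping C/D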
Single source fault-tolerant broadcasting for two-dimensional meshes without virtual channels
Abstract
In this paper, the authors propose a fault-tolerant single source broadcast algorithm for wormhole routed two-dimensional meshes that utilizes neighbor status information to dynamically construct a broadcast spanning tree when up to N−1 faults are present in an N×N two-dimensional mesh. Correctness proofs for the proposed broadcasting algorithm are presented and the algorithm is also proven to be livelock- and deadlock-free. The proposed Virtual Source Broadcast (VSB) algorithm can be implemented alone without requiring any virtual channels. However, supporting simultaneous unicast and broadcast messages will require two additional virtual channels per physical link in addition to the virtual channels required for the unicast algorithm. The paper also compares the proposed VSB algorithm with broadcasting techniques that have been proposed by other authors.
D. R. Avresky, Chris M. Cunningham
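To illustrate the notion of building a broadcast spanning tree around faulty nodes using neighbour status, here is a plain breadth-first construction in Python; it is not the VSB algorithm itself, and the mesh size and fault positions are hypothetical.

# Build a broadcast tree on an N x N mesh, forwarding only to fault-free,
# not-yet-reached neighbours.
from collections import deque

def broadcast_tree(n, source, faulty):
    parent = {source: None}
    queue = deque([source])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
            if 0 <= nx < n and 0 <= ny < n and (nx, ny) not in faulty \
                    and (nx, ny) not in parent:
                parent[(nx, ny)] = (x, y)   # message forwarded along this edge
                queue.append((nx, ny))
    return parent                            # reached nodes and their parents

tree = broadcast_tree(4, (0, 0), faulty={(1, 1), (2, 2), (0, 3)})
print(f"nodes reached: {len(tree)} of {4*4 - 3} fault-free nodes")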
On-line testing of an off-the-shelf microprocessor board for safety-critical applications
Abstract
The paper describes the strategy adopted to implement on-line test procedures for a commercial microprocessor board used in an automated light-metro control system. Special care has been devoted to choosing the most effective test strategy for memory elements, processors, and caches, while guaranteeing a minimum impact on the normal behavior of the whole system. Implementation of the described techniques will significantly improve the system's ability to react safely to possible faults. This will be quantitatively determined in the subsequent dependability evaluation phase.
F. Corno, M. Damiani, L. Impagliazzo, P. Prinetto, M. Rebaudengo, G. Sartore, M. Sonza Reorda
The logic threshold based voting: A model for local feedback bridging fault
Abstract
In order to simulate the effects of a bridging fault it is necessary to accurately determine the intermediate voltage of the shorted nodes and compare it to the logic threshold voltage of the driven gates. This paper presents a model called "the Logic Threshold Based Voting model" which can be used to determine whether a bridging fault with local feedback gives an intermediate voltage higher or lower than a given threshold voltage. The approach is considerably faster than previous ones since no SPICE simulation is required.
M. Renovell, P. Huc, Y. Bertrand
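As a strongly simplified illustration of threshold-based voting (not the paper's model, which additionally handles local feedback), the shorted-node voltage can be approximated by a resistive divider between the gate pulling high and the gate pulling low, and then compared with the driven gate's logic threshold; all values below are hypothetical.

# Toy bridging-fault voting: interpret the intermediate voltage against the
# driven gate's logic threshold.

VDD = 5.0

def bridged_voltage(g_pull_up, g_pull_down):
    # conductances (arbitrary units) of the two drivers fighting across the short
    return VDD * g_pull_up / (g_pull_up + g_pull_down)

v = bridged_voltage(g_pull_up=1.0, g_pull_down=2.5)
v_threshold = 2.2                      # logic threshold of the driven gate
print(f"intermediate voltage {v:.2f} V -> interpreted as "
      f"{'1' if v > v_threshold else '0'}")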
On the yield of VLSI processors with on-chip CPU cache
Abstract
Yield enhancement through the acceptance of partially good chips is a well-known technique [1–3]. In this paper we derive a yield model for single-chip VLSI processors with a partially good on-chip cache. We also investigate how the yield enhancement of VLSI processors with on-chip CPU cache relates to the number of acceptable faulty cache blocks, the percentage of the cache area with respect to the whole chip area, and various manufacturing process parameters such as defect densities and the fault clustering parameter.
D. Nikolos, H. T. Vergos
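A simplified yield calculation under an independent Poisson defect model can illustrate the effect of accepting partially good caches; note that the paper's model additionally accounts for defect clustering, and the areas, defect density and block count below are hypothetical.

# A chip is accepted if its core is defect-free and at most k of its B cache
# blocks are faulty (independent Poisson defects).
from math import comb, exp

def yield_partial_cache(d, a_core, a_block, n_blocks, k):
    p_block_faulty = 1 - exp(-d * a_block)        # a given cache block is faulty
    y_core = exp(-d * a_core)                     # core area is defect-free
    y_cache = sum(comb(n_blocks, i)
                  * p_block_faulty ** i
                  * (1 - p_block_faulty) ** (n_blocks - i)
                  for i in range(k + 1))
    return y_core * y_cache

# hypothetical numbers: defect density per cm^2, areas in cm^2, 64 cache blocks
for k in (0, 2, 4):
    y = yield_partial_cache(d=0.8, a_core=0.6, a_block=0.4 / 64, n_blocks=64, k=k)
    print(f"up to {k} faulty cache blocks accepted: yield = {y:.3f}")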
Design of dependable hardware: What BIST is most efficient?
Abstract
We show that self-testability is an essential feature of VLSI circuits used as components of dependable systems. We discuss one of the key problems associated with designing self-testable circuits — the selection of the BIST (built-in self-test) technique. Specifically, we present the advantages and disadvantages of two basic strategies that can be applied when designing a VLSI chip: using one universal BIST technique versus using several dedicated BIST techniques for a single chip design. As a good candidate for a universal BIST technique, with the potential to effectively and efficiently support the design of a large class of VLSI circuits, we recommend the CSTP (Circular Self-Test Path) technique. The applicability of CSTP is illustrated with an example of a self-testable design of the 2901 bit-slice processor, a circuit of diversified internal structure comprising various types of functional blocks (ALU, two-port RAM, registers, latches, random logic, multiplexer-based shifters). The design characteristics and the results of simulation experiments carried out for this circuit show that the proposed solution offers high quality testing at acceptable BIST implementation costs.
Andrzej Kraśniewski
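A toy simulation of the Circular Self-Test Path idea follows (illustrative only; the cell count, initial state and combinational function are hypothetical): the register cells form a ring, and in test mode each cell captures the XOR of a circuit output and the value rotated in from its neighbour, so the ring both applies stimuli and compacts responses into a signature.

def cstp_signature(logic, n_cycles=64):
    cells = [1, 0, 0, 0, 0, 0, 0, 0]                    # initial seed state
    n = len(cells)
    for _ in range(n_cycles):
        outputs = logic(cells)                          # circuit under test
        cells = [outputs[i] ^ cells[(i - 1) % n]        # capture XOR rotate
                 for i in range(n)]
    return cells

good = lambda s: [s[i] ^ s[(i + 1) % len(s)] for i in range(len(s))]
def faulty(s):                                          # output bit 3 stuck-at-1
    out = good(s)
    out[3] = 1
    return out

print("fault-free signature:", cstp_signature(good))
print("faulty signature:    ", cstp_signature(faulty))

Comparing the two printed signatures shows how a stuck-at output typically perturbs the compacted response, although aliasing remains possible.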
Pseudorandom testing of microprocessors at instruction/data flow level
Abstract
A universal approach to testing microprocessors is described. It is based on a test program with a pseudorandom stream of instructions and data. An original software generator of such programs is presented. Theoretical studies and many simulation experiments show that the proposed approach has properties similar to pseudorandom testing at the circuit level and covers a wide range of fault models.
Janusz Sosnowski, A. Kuśmierczyk
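A sketch of a pseudorandom instruction/data stream generator in the spirit of the described approach is given below; the instruction set, operand ranges and format are hypothetical, and a seeded PRNG keeps the generated test program reproducible.

# Generate a reproducible pseudorandom instruction/data stream.
import random

OPCODES = ["add", "sub", "and", "or", "xor", "load", "store", "beq"]

def generate_test_program(length, seed=0xC0FFEE):
    rng = random.Random(seed)
    program = []
    for _ in range(length):
        op = rng.choice(OPCODES)
        rd, rs1, rs2 = (rng.randrange(32) for _ in range(3))
        imm = rng.randrange(-2048, 2048)        # pseudorandom data/immediate
        program.append(f"{op} r{rd}, r{rs1}, r{rs2}, {imm}")
    return program

for line in generate_test_program(5):
    print(line)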
Multi-level test generation and fault diagnosis for finite state machines
Abstract
In this paper, a new multi-level technique, based on alternative graphs, with uniform procedures at each level for test generation, fault simulation and fault diagnosis in finite state machines (FSM) is presented. For the description of function (behavior), structure and faults in FSMs, three levels are used: functional (state transition diagrams), logical (signal paths) and gate level. In test generation, all levels are used simultaneously. Faults from different classes are inserted and activated at different levels by uniform procedures. State initialization and fault propagation are carried out only at the functional level. Backtracking does not cross level borders; hence, high test generation efficiency can be achieved. Fault diagnosis is carried out using a top-down technique, keeping the complexity of the candidate fault sets at each level as low as possible.
R. Ubar, M. Brik
Dynamic testing from bounded data type specifications
Abstract
Due to practical limitations in software and hardware, data type implementations are always bounded. Such limitations are a frequent source of faults which are difficult to detect. As soon as boundaries are clearly identified in a specification, functional testing should be able to address any boundary fault.
We propose to enrich a data type specification formalism, namely algebraic specifications, to allow a natural description of data type boundaries. This enhancement allows us to adapt the existing testing theory, method and tool, initially dedicated to functional testing from unbounded data type specifications.
Several examples of test data selection with the LOFT tool, from two bounded data type specifications, illustrate the benefit of our approach: an assisted test selection process, formally defined within a functional testing theory, providing adequate coverage of both the data type bounds and the definition domain of the specified operations.
Agnès Arnould, Pascale Le Gall, Bruno Marre
A theory of specification-based testing for object-oriented software
Abstract
The current strategies for testing object-oriented software all lack the formal basis necessary to perform this task efficiently. We propose adapting an existing theory of testing for stateless ADTs to object-oriented software, to find errors in a class by checking that its implementation meets its specification. We briefly and informally present an object-oriented language, CO-OPN/2, in which we write the specification. We introduce a notion of test that takes into account the possible and impossible sequences of calls of class methods. We examine the black-box test procedure and give techniques to select a finite and pertinent test set from an exhaustive test set covering all possible behaviors of the class under test, by applying test reduction hypotheses. We also study the construction of an oracle, the procedure that analyses the results of the tests, adapted to object-oriented software.
Stéphane Barbey, Didier Buchs, Cécile Péraire
Proving safety properties for embedded control systems
Abstract
It is well-known that a fundamental problem in embedded control systems is the verification of the safety requirements. Formal methods and related support tools can successfully be applied in the formal proof that a system is safe. However, the complexity of real systems is such that automated tools often fail to formally validate such systems. A typical case is when “state explosion” problems arise.
In this paper, we show some "abstraction techniques" to make the problem of safety requirements validation tractable by current tools. These abstraction techniques have been defined inside a verification methodology that has been tested on the specification of a railway computer-based interlocking signalling control system. The conditions under which this methodology can be applied to systems in different application areas are finally discussed.
Cinzia Bernardeschi, Alessandro Fantechi, Stefania Gnesi, Giorgio Mongardi
Enhancing dependability of cooperative applications in partitionable environments
Abstract
This paper presents a pragmatic approach to providing partition processing system support for cooperative applications. A method for specifying and programming application-level partition processing strategies is described. The system support is based on a partition typing mechanism which allows the application programmer to model the relative importance of partitions. This is combined with a split/merge rules configuration table through which the partition processing strategy is defined. In the context of cooperative application semantics, our approach combines the correctness of the pessimistic, and the availability of the optimistic approaches for data management in partitionable environments. The paper focuses on the practical issues linked with, firstly, the specification, and secondly, the support at runtime, of the partition processing strategies. This approach is relevant in the context of large-scale asynchronous distributed systems such as the Internet, which, as a result of current technology and topology, are inevitably prone to partitions. Examples are given, illustrating how the partition support is used and combined with new feedback techniques in order to implement more robust cooperative environments.
François J. N. Cosquer, Pedro Antunes, Paulo Veríssimo
Efficient message logging for uncoordinated checkpointing protocols
Abstract
A message is in-transit with respect to a global state if its sending is recorded in this global state while its receipt is not. Checkpointing algorithms have to log such in-transit messages in order to restore the state of channels when a computation has to be resumed from a consistent global state after a failure has occurred. Coordinated checkpointing algorithms log exactly these in-transit messages on stable storage. Because of their lack of synchronization, uncoordinated checkpointing algorithms conservatively log more messages.
This paper presents an uncoordinated checkpointing protocol that logs all in-transit messages and the smallest possible number of non in-transit messages. As a consequence, the protocol saves stable storage space and enables quicker recoveries. An appropriate tracking of message causal dependencies constitutes the core of the protocol.
Achour Mostefaoui, Michel Raynal
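The definition of in-transit messages translates directly into set operations; the Python lines below restate it on illustrative data and are not the protocol itself.

# A message is in-transit with respect to a global state if its send is
# recorded in the state but its receipt is not; exactly these messages must be
# logged to restore channel state.

sent_before_checkpoint     = {"m1", "m2", "m3", "m4"}   # recorded send events
received_before_checkpoint = {"m1", "m3"}               # recorded receive events

in_transit = sent_before_checkpoint - received_before_checkpoint
print("messages to log:", sorted(in_transit))           # m2 and m4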
Atomic updates of replicated data
Abstract
Although several replication strategies have been described and compared in the literature, very little work has been published on the underlying protocols needed to support these strategies. In fact, the classical transactional protocols that are usually assumed are not fault-tolerant, and thus create a window of vulnerability for the "fault-tolerant" strategies they are intended to support. In this paper, we point out this issue for "quorum voting" replication strategies. We describe a fault-tolerant protocol that adequately supports these strategies. We present some performance figures showing that, in addition to its higher resilience, our protocol provides better performance than the possible alternatives.
Rachid Guerraoui, Rui Oliveira, André Schiper
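As background on the quorum-voting setting (generic, not the paper's protocol), the standard quorum intersection conditions can be checked in a few lines of Python.

# With n replicas, read quorum r and write quorum w, any read must intersect
# the latest write (r + w > n) and two writes must not proceed concurrently
# (2 * w > n).

def quorums_valid(n, r, w):
    return r + w > n and 2 * w > n

print(quorums_valid(5, r=2, w=4))   # True:  reads see the latest write
print(quorums_valid(5, r=2, w=3))   # False: a read may miss the latest write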
Removal of all faulty nodes from a fault-tolerant service by means of distributed diagnosis with imperfect fault coverage
Abstract
In general, offering a fault-tolerant service boils down to the execution of replicas of a service process on different nodes in a distributed system. The service is fault-tolerant in the sense that, even if some of the nodes on which a replica of the service resides behave maliciously, the service is still performed correctly. To be able to guarantee the correctness of a fault-tolerant service despite the presence of maliciously functioning nodes, it is of key importance that all faulty nodes are removed from this service in a timely manner. Faulty nodes are detected by tests performed by the nodes offering the service. In practice, tests always have imperfect fault coverage. In this paper, a distributed diagnosis algorithm with imperfect tests is described, by means of which all detectably faulty nodes are removed from a fault-tolerant service. This may, however, inevitably imply the removal of a number of correctly functioning nodes from the service as well. The maximum number of correctly functioning nodes removed from the service by the algorithm is calculated. Finally, the minimum number of nodes needed in a fault-tolerant service to perform this diagnosis algorithm is given.
André Postma, Gerie Hartman, Thijs Krol
Constraint based system-level diagnosis of multiprocessors
Abstract
The paper presents a novel modelling technique for system-level fault diagnosis in massively parallel multiprocessors, based on a re-formulation of the problem of syndrome decoding as a constraint satisfaction problem (CSP). The CSP-based approach is able to handle detailed and inhomogeneous functional fault models on a similar level as the Russel-Kime model [18]. Multiple-valued logic is used to describe system components having multiple fault modes. The granularity of the models can be adjusted to the diagnostic resolution of the target without altering the methodology. Two algorithms for the Parsytec GCel massively parallel system are used as illustrations in the paper: the centralized method uses a detailed system model and provides a fine-granular diagnostic image for off-line evaluation; the distributed method makes fast decisions for reconfiguration control, using a simplified model.
J. Altmann, T. Bartha, A. Pataricza, A. Petri, P. Urbán
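The core idea of syndrome decoding as constraint satisfaction can be sketched with a brute-force Python search under a simple PMC-style invalidation rule; the test graph, syndrome and fault bound are hypothetical, and the paper's multiple-valued, inhomogeneous models are not represented.

# Enumerate fault sets of bounded size and keep those consistent with the
# observed syndrome: a fault-free tester never lies about the unit it tests.
from itertools import product

units = ["A", "B", "C", "D"]
# (tester, tested) -> verdict reported by the tester: 0 = fault-free, 1 = faulty
syndrome = {("A", "B"): 0, ("B", "C"): 1, ("C", "D"): 0, ("D", "A"): 0,
            ("A", "C"): 1}
t = 1                                    # assumed upper bound on faulty units

def consistent(faulty):
    return all(tester in faulty or verdict == (tested in faulty)
               for (tester, tested), verdict in syndrome.items())

solutions = []
for assignment in product([False, True], repeat=len(units)):
    faulty = {u for u, is_faulty in zip(units, assignment) if is_faulty}
    if len(faulty) <= t and consistent(faulty):
        solutions.append(sorted(faulty))
print("fault sets consistent with the syndrome:", solutions)   # [['C']]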
A unified theory for f1/f2-diagnosable communication networks
Abstract
We propose a new diagnosis model that allows a continuum between the two classical invalidation models in system-level diagnosis, namely the PMC and BGM models. This new invalidation model assumes that there is no maximal set of connected faulty units of cardinality greater than f2 (1 ≤ f2) in which all the faulty units can declare another faulty unit of this set to be fault-free, out of a total of up to f1 faulty units in a network with n units (f1 < n). This f1/f2-invalidation model provides a realistic representation of the behavior of faulty units in the diagnosis of heterogeneous communication networks. A complete characterization theorem provides the necessary and sufficient conditions for a network to be one-step f1/f2-diagnosable. We propose an adaptive algorithm for the diagnosis of communication networks of arbitrary topologies using the f1/f2-invalidation model.
Guy G. Berthet, Henri J. Nussbaumer
Backmatter
Metadata
Title
Dependable Computing — EDCC-2
Edited by
Andrzej Hlawiczka
João Gabriel Silva
Luca Simoncini
Copyright Year
1996
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-70677-9
Print ISBN
978-3-540-61772-3
DOI
https://doi.org/10.1007/3-540-61772-8