Published in: International Journal on Software Tools for Technology Transfer 5/2022

Open Access 04-11-2022 | General

Analysis of non-Markovian repairable fault trees through rare event simulation

Authors: Carlos E. Budde, Pedro R. D’Argenio, Raúl E. Monti, Mariëlle Stoelinga


Abstract

Dynamic fault trees (DFTs) are widely adopted in industry to assess the dependability of safety-critical equipment. Since many systems are too large to be studied numerically, DFT dependability is often analysed using Monte Carlo simulation. A bottleneck here is that many simulation samples are required in the case of rare events, e.g. in highly reliable systems where components seldom fail. Rare event simulation (RES) provides techniques to reduce the number of samples in the case of rare events. In this article, we present a RES technique based on importance splitting to study failures in highly reliable DFTs, more precisely, on a variant of repairable fault trees (RFT). Whereas RES usually requires meta-information from an expert, our method is fully automatic. For this, we propose two different methods to derive the so-called importance function. On the one hand, we propose to cleverly exploit the RFT structure to compositionally construct such a function. On the other hand, we explore different importance functions derived in different ways from the minimal cut sets of the tree, i.e., the minimal units that determine its failure. We handle RFTs with Markovian and non-Markovian failure and repair distributions (for which no numerical methods exist) and implement the techniques on a toolchain that includes the RES engine FIG, for which we also present improvements. We finally show the efficiency of our approach in several case studies.
Notes
The authors are listed in alphabetical order. This work was partially supported by the EU Grant Agreement 101008233 (MISSION), ANPCyT PICT-2017-3894 (RAFTSys), and SeCyT project 33620180100354CB (ARES). Funded also by the EU Grant Agreement 101067199 (ProSVED). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or The European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
The original online version of this article was revised: Missing Open Access funding information has been added in the Funding Note.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Reliability engineering is an important field that provides methods and tools to assess and mitigate the risks related to complex systems. Fault tree analysis (FTA) is a prominent technique here. Its application encompasses a large number of industrial domains that range from automotive and aerospace system engineering to energy and telecommunication systems and protocols.
Fault trees. A fault tree (FT) describes how component failures occur and propagate through the system, eventually generating system-wide failures. Technically, an FT is a directed acyclic graph whose leaves model component failures and whose other nodes (called gates) model failure propagation. Using fault trees, one can compute dependability metrics to quantify how a system fares w.r.t. certain performance indicators. Two common metrics are system reliability (the probability that there are no system failures during a given mission time) and system availability (the average percentage of time that a system is operational).
Static fault trees (also known as standard FTs) contain a few basic gates, like \(\textsf {\MakeUppercase {AND}}\) and \(\textsf {\MakeUppercase {OR}}\) gates. This makes them easy to design and analyse but also limits their expressivity. Dynamic fault trees (DFTs [26, 57]) are a common and widely applied extension of standard FTs, catering for more complex dependability patterns, like spare management and causal dependencies. Such gates make DFTs more difficult to analyse. In static FTs, it only matters whether or not a component has failed, so they can be analysed with Boolean methods, such as binary decision diagrams [38]. Dynamic fault trees, on the other hand, crucially depend on the failure order, so Boolean methods are insufficient.
Moreover, and on top of these two classes, repairable fault trees (RFT [7]) permit components to be repaired after they have failed. Repairs are not only crucial in fault-tolerant and resilient systems, they are also an important cost driver. Hence, repairable fault trees allow one to compare different repair strategies with respect to various dependability metrics. In this article we consider repairable fault trees.
Fault tree analysis. The reliability and availability of a fault tree can be computed via numerical methods, such as probabilistic model checking. This involves exhaustive explorations of state-based models such as interactive Markov chains [54]. Since the number of states (i.e. system configurations) is exponential in the number of tree elements, analysing large trees remains a challenge today [1, 38]. Moreover, numerical methods are usually restricted to exponential failure rates and combinations thereof, like Erlang and acyclic phase-type distributions [54].
Alternatively, fault trees can be analysed using standard Monte Carlo simulation, which embedded in formal system modelling is typically called statistical model checking (SMC [30, 52, 54]). Here, a large number of simulated system runs (samples) are produced. Reliability and availability are then statistically estimated from the resulting sample set. Such sampling does not involve storing the full state space so, although the result provided can only be correct with a certain probability, SMC is much more memory efficient than numerical techniques. Furthermore, SMC is not restricted to exponential probability distributions.
However, rare events are a known bottleneck of SMC: when the event of interest has a low probability, which is typically the case in highly reliable systems, millions of samples may be required to observe the event. Producing these samples can take an unacceptably long simulation time.
Rare event simulation. To alleviate this problem, the field of rare event simulation (RES) provides techniques that reduce the number of samples required to produce a useful estimate [49]. These techniques can be classically categorised as importance sampling and importance splitting.
Importance sampling means that the method will tweak the probabilities in the model, then compute the metric of interest for the changed system, and finally adjust the analysis results to the original model [33, 47]. Unfortunately, this approach has specific requirements on the stochastic model: in particular, it is generally applicable to models with exponential probability distributions only.
Importance splitting, deployed in this paper, does not have this limitation. Importance splitting relies on rare events that arise as a sequence of less rare intermediate events [3, 39]. It exploits this fact by generating more (partial) samples on paths where such intermediate events are observed.
As a simple example, consider a biased coin whose probability of heads is \(p=\frac{1}{80}\). Suppose we flip it eight times in a row, and say we are interested in observing at least three heads. If heads comes up at the first flip (H), then we are on a promising path. We can then clone (split) the current path H, generating e.g. 7 copies of it, each clone evolving independently from the second flip onwards. Say one clone observes three heads: the copied H plus two more. Then, this observation of the rare event (three heads) is counted as \(\frac{1}{7}\) rather than as one observation, to account for the splitting where the clone was spawned. Now, if a clone observes a new head (HH), this is even more promising than H, so the splitting can be repeated. If we make five copies of the HH clone, then observing three heads in any of these copies counts as \(\frac{1}{35} = \frac{1}{7}\cdot \frac{1}{5}\). Alternatively, observing tails as second flip (HT) is less promising than heads. One could then decide not to split such a path.
This example highlights a key ingredient of importance splitting: the importance function, which indicates for each state how promising it is w.r.t. the event of interest. This function, as well as other parameters such as thresholds [27], is used to choose e.g. the number of clones spawned when a simulation run visits a certain state. An importance function for our example could be the number of heads seen thus far. Another one could be such a number, multiplied by the number of coin flips yet to come. The goal is to give higher importance to states from which observing the rare event is more likely. The efficiency of an importance splitting implementation increases as the importance function better reflects this property.
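The coin example can be turned into a small experiment. The sketch below is only an illustration under stated assumptions: it uses the number of heads seen so far as importance function, splits into 7 copies at the first head and 5 copies at the second one (wherever in the sequence they occur, not only at the first flip), and weights each observation accordingly. The names `run`, `splitting_estimate`, and `exact_probability` are hypothetical and do not come from any tool discussed in this article.

```python
import random
from fractions import Fraction
from math import comb

P_HEADS = Fraction(1, 80)   # probability of heads, as in the example
FLIPS = 8                   # flips per experiment
TARGET = 3                  # rare event: at least three heads
SPLIT = {1: 7, 2: 5}        # heads count just reached -> number of copies


def exact_probability():
    """Exact binomial probability of at least TARGET heads in FLIPS flips."""
    return sum(comb(FLIPS, k) * P_HEADS**k * (1 - P_HEADS)**(FLIPS - k)
               for k in range(TARGET, FLIPS + 1))


def run(rng, flips_done, heads, weight):
    """Simulate one path and return the total weight of rare-event observations."""
    while flips_done < FLIPS:
        flips_done += 1
        if rng.random() < float(P_HEADS):
            heads += 1
            if heads >= TARGET:
                return weight            # rare event observed on this path
            copies = SPLIT[heads]        # importance went up: split the path
            return sum(run(rng, flips_done, heads, weight / copies)
                       for _ in range(copies))
    return 0.0                           # ran out of flips without the event


def splitting_estimate(n_runs, seed=0):
    """Average weighted count of rare events over n_runs root paths."""
    rng = random.Random(seed)
    return sum(run(rng, 0, 0, 1.0) for _ in range(n_runs)) / n_runs
```

A path that reaches three heads after both splits returns a weight of 1/35, matching the counting rule in the example. Comparing `splitting_estimate(100_000)` against `float(exact_probability())` gives a feeling for how close the weighted estimate gets with a moderate number of root paths.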
Rare event simulation has been successfully applied in several domains [5, 6, 48, 59, 60, 65]. However, a key bottleneck is that it critically relies on expert knowledge. In particular, for importance splitting, finding a good importance function is a well-known, highly non-trivial task [36, 49].
Our contribution: rare event simulation for fault trees. This article presents an importance splitting method to analyse RFTs. In particular, we automatically derive the importance function by exploiting the description of a system as a fault tree. This is crucial, since the importance function is normally given manually in an ad hoc fashion by a domain or RES expert. We use two general approaches to derive the importance function.
The first approach builds local importance functions for the (automata-semantics of the) nodes of the tree. Then these local functions are aggregated into an importance function for the full tree. Aggregation uses structural induction in the layered description of the tree, and the way they are aggregated depends strongly on the gate at the top of the subtree combining the propagation of the fault below it.
The second approach is based on (minimal) cut sets. Cut sets are sets of basic events such that, if all elements in any cut set fail, the tree fails—we note that cut sets are defined for static fault trees; we conservatively extend these to dynamic fault trees. Thus, the more elements in a cut set that have failed, the higher its importance. Since a fault tree usually has multiple cut sets, we take the maximum importance over all cut sets. We also explore some variants of this idea, where we also normalise the cut sets by their maximum weights, or prune them based on their cardinality or failure probability.
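As a rough illustration of the cut-set idea, the sketch below scores a state by counting, for each cut set, how many of its elements have failed, and then takes the maximum over all cut sets. The function name `cut_set_importance`, the element names, and the particular normalisation (dividing by the cut-set size) are assumptions made for this example only; the definitions actually used are those of Sec. 4.

```python
def cut_set_importance(failed, cut_sets, normalise=False):
    """Importance of a state, given the set of currently failed basic elements.

    Each cut set scores the number of its elements that have failed; the
    importance of the state is the maximum score over all cut sets.
    """
    best = 0.0
    for cut_set in cut_sets:
        score = len(cut_set & failed)
        if normalise:
            # Assumption: scale by cut-set size so that every cut set peaks at 1.
            score /= len(cut_set)
        best = max(best, score)
    return best


# Hypothetical cut sets over basic elements named a, b, c, d.
CUT_SETS = [frozenset("ab"), frozenset("bcd")]
```

With these hypothetical cut sets, a state where `b` and `c` have failed scores higher through `{b, c, d}` without normalisation, whereas normalisation favours states that are proportionally closer to completing a small cut set such as `{a, b}`.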
Using such importance functions, we implement importance splitting methods to run RES analyses. We use a variety of RES algorithms to estimate system unreliability and unavailability. Our approach converges to precise estimations in increasingly reliable systems at a much faster pace than standard Monte Carlo. This method has four advantages over earlier analysis methods for RFTs, which we overview in the related work section (Sec. 7): (1) we are able to estimate both the system reliability and availability; (2) we can handle arbitrary failure and repair distributions; (3) we can handle rare events; and (4) we can do it in a fully automatic fashion.
We implemented our theory in a full-stack toolchain that is also presented here. Within this toolchain, and in addition to the new importance function generation techniques, we introduce the language Kepler as an extension of Galileo [55, 56] to describe repairable fault trees in textual format. We also changed the RES engine at the core of the FIG statistical model checker [11] by resampling time values at the moment of splitting. This requires considering algorithms specifically tailored to generate conditional pseudorandom variates. We show how this modification provides significant improvements to the performance of the tool.
With this toolchain, we computed confidence intervals for the unreliability and unavailability of several case studies. Our case studies are RFTs whose failure and repair times are governed by arbitrary continuous probability density functions (PDFs). Each case study was analysed for a fixed runtime budget and in increasingly resilient configurations. In all cases, our approach could estimate the narrowest intervals for the most resilient configurations.
Summarising, the contributions reported in this work are:
1. two methods to automatically generate importance functions, one based recursively on the structure of the RFT, and the other on its collection of minimal cut sets;
2. a toolchain including the previous methods that, given the input RFT, estimates the required dependability metric through RES in a fully automated fashion;
3. an improvement of the FIG statistical model checker through resampling of the clocks at the moment of splitting;
4. the textual language Kepler to describe RFTs having failure and repair times with arbitrary distributions; and
5. an extensive validation of the techniques and tools in a variety of case studies.
Paper outline. This article is structured as follows. In Sec. 2, we discuss the concepts related to fault trees which are fundamental to our contribution, while in Sec. 3 we recall the basics of RES and the importance splitting techniques. Sec. 4 focuses on our fundamental contributions, namely, the techniques for automatically deriving importance functions and the technique for improving the engine of FIG. Sec. 5 describes the toolchain with particular attention on the language Kepler and its translation to IOSA, the input language of FIG. Using the toolchain, we performed an extensive experimental evaluation that we present in Sec. 6. We overview related work in Sec. 7 and conclude our contributions in Sec. 8.
We remark that this article merges and extends the contributions reported in [12, 19].

2 Fault tree analysis

2.1 Syntax

A fault tree ‘\(\triangle \)’ is a directed acyclic graph that models how component failures propagate and eventually cause the full system to fail. We consider repairable fault trees (RFTs), where the occurrence time of failures and repairs is governed by arbitrary probability distributions.
Basic elements. The leaves of the tree, called basic events or basic elements (\(\textsf {\MakeUppercase {BE}}\) s), model the failure behaviour of components. Thus, a \(\textsf {\MakeUppercase {BE}}\) b is equipped with a failure distribution \(F_b\) that governs the probability for b to fail before time t, and a repair distribution \(R_b\) that governs its repair time. Some \(\textsf {\MakeUppercase {BE}}\) s are used as spare components. The spare basic elements (\(\textsf {\MakeUppercase {SBE}}\) s) replace a primary component when it fails. \(\textsf {\MakeUppercase {SBE}}\) s are also equipped with a dormancy distribution \(D_b\), since spares fail less often when dormant, i.e. not in use. Only when an \(\textsf {\MakeUppercase {SBE}}\) becomes active is its failure distribution given by \(F_b\).
Gates. Non-leaf nodes are called intermediate events and are labelled with gates that describe how combinations of lower failures propagate to upper levels. Fig. 1 shows their syntax. Their meaning is as follows. The \(\textsf {\MakeUppercase {AND}} \) gate fails if all its children fail, the \(\textsf {\MakeUppercase {OR}} \) gate fails if at least one of its children fails, and the \(\textsf {\MakeUppercase {VOT}} _k\) gate fails if at least k of its m children fail (with \(1\leqslant k\leqslant m\)). The latter is called the voting or k out of m gate. Note that \(\textsf {\MakeUppercase {VOT}} _1\) is equivalent to an \(\textsf {\MakeUppercase {OR}}\) gate, and \(\textsf {\MakeUppercase {VOT}} _m\) is equivalent to an \(\textsf {\MakeUppercase {AND}}\). If any of these gates is in a fail state, it becomes repaired if the condition that produces the failure is falsified. Thus, for instance, a failing \(\textsf {\MakeUppercase {AND}} \) is repaired if at least one of its failing children is repaired.
Note that the three gates we presented so far react only based on changes in the combination of the inputs provided by their children. These are called static gates and are already present in Static Fault Trees. In contrast, the following gates are called dynamic gates and react to the change of state of their children taking into account also other aspects like timing and dependence.
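Since static gates depend only on the current combination of child outputs, they can be written as pure functions. The following is a minimal sketch, assuming outputs are encoded as 0 (operational) and 1 (failed) and reading the voting gate as "at least k failed children", which is consistent with the two equivalences stated above. The function names are illustrative.

```python
def vot(k, children):
    """VOT_k gate: output 1 (failed) iff at least k child outputs are 1."""
    return int(sum(children) >= k)


def and_gate(children):
    """AND gate, expressed as VOT_m over m children."""
    return vot(len(children), children)


def or_gate(children):
    """OR gate, expressed as VOT_1."""
    return vot(1, children)
```

Because the output is recomputed from the children, repair comes for free: `and_gate([1, 1, 1])` is failed, and once one child is repaired `and_gate([1, 0, 1])` is operational again.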
The priority-and gate (\(\textsf {\MakeUppercase {PAND}}\)) is an \(\textsf {\MakeUppercase {AND}}\) gate that only fails if its children fail in order from left to right (though adjacent children may also fail simultaneously). \(\textsf {\MakeUppercase {PAND}}\) gates express failures that can only happen in a particular order, e.g. a general electric failure can only happen if first the circuit breaker fails and then a short circuit occurs. A failing \(\textsf {\MakeUppercase {PAND}}\) gate gets repaired whenever its right-most failed child becomes repaired [45, 46].
\(\textsf {\MakeUppercase {SPARE}}\) gates have one primary child and one or more spare children: spares replace the primary when it fails. A \(\textsf {\MakeUppercase {SPARE}}\) gate fails if the primary (or currently active) child fails and no operational spare child can be found. It becomes repaired whenever the primary child is repaired or an operational spare child becomes available.
The \(\textsf {\MakeUppercase {FDEP}}\) gate has a trigger child and several dependent children: all dependent children become unavailable when the trigger fails. Note that the dependent children do not necessarily fail as a consequence of the failure of the trigger child, and that they become available again as soon as the trigger child is repaired. \(\textsf {\MakeUppercase {FDEP}}\) s can model, for instance, network elements that become unavailable if their connecting bus fails.
Repair boxes. An \(\textsf {\MakeUppercase {RBOX}}\) determines which basic element is repaired next according to a given policy. Thus all its inputs are \(\textsf {\MakeUppercase {BE}}\) s or \(\textsf {\MakeUppercase {SBE}}\) s. Unlike gates, an \(\textsf {\MakeUppercase {RBOX}}\) has no output since it does not propagate failures.
Top-level event. A full-system failure occurs if the top event (i.e. the root node) of the tree fails.
Example. The repairable fault tree in Fig. 2 models a railway-signal system, which fails if its high voltage and relay cabinets fail [31, 53]. Thus, the top event is an \(\textsf {\MakeUppercase {AND}}\) gate with children \(\textsf {HVcab}\) (a \(\textsf {\MakeUppercase {BE}}\)) and \(\textsf {Rcab}\). The latter is a \(\textsf {\MakeUppercase {SPARE}}\) gate with primary \(\textsf {P}\) and spare \(\textsf {S}\). All \(\textsf {\MakeUppercase {BE}}\) s are managed by one \(\textsf {\MakeUppercase {RBOX}}\) with repair priority \(\textsf {HVcab}>\textsf {P}>\textsf {S}\).
Notation. The nodes of a tree \(\triangle \) are given by the set
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_Equ4_HTML.png
We let \(v, w\) range over https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq49_HTML.gif . A function
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_Equ5_HTML.png
yields the type of each node in the tree. A function
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_Equ6_HTML.png
returns the ordered list of children of a node. If clear from context, we omit the superscript \(\triangle \) from function names.

2.2 Semantics

Following [45] we give semantics to RFT as Input/Output Stochastic Automata (IOSA), so that we can handle arbitrary probability distributions. Each state in the IOSA represents a system configuration, indicating which components are operational and which have failed. Transitions among states describe how the configuration changes when failures or repairs occur.
More precisely, a state of the IOSA derived from an RFT is a tuple https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq51_HTML.gif , where https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq52_HTML.gif is the state space and \({{\varvec{x}}}_v \) denotes the state of node \(v\) in \(\triangle \). The possible values for \({{\varvec{x}}}_v \) depend on the type of \(v\). The output https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq58_HTML.gif of node \(v\) indicates whether it is operational ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq60_HTML.gif ) or failed ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq61_HTML.gif ) and is calculated as follows:
  • \(\textsf {\MakeUppercase {BE}}\) s (white circles in Fig. 1) have a binary state: \({{\varvec{x}}}_v =0\) if \(\textsf {\MakeUppercase {BE}}\) \(v\) is operational and \({{\varvec{x}}}_v =1\) if it is failed. The output of a \(\textsf {\MakeUppercase {BE}}\) is its state: https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq68_HTML.gif .
  • \(\textsf {\MakeUppercase {SBE}}\) s (gray circles in Fig. 1e) have two additional states: \({{\varvec{x}}}_v =2,3\) if a dormant \(\textsf {\MakeUppercase {SBE}}\) \(v\) is respectively operational or failed. Here https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq73_HTML.gif .
  • \(\textsf {\MakeUppercase {AND}}\) s  have a binary state. The \(\textsf {\MakeUppercase {AND}}\) gate \(v\) fails iff all children fail: https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq77_HTML.gif . An \(\textsf {\MakeUppercase {AND}}\) gate outputs its internal state: https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq79_HTML.gif .
  • \(\textsf {\MakeUppercase {OR}}\) gates are analogous to \(\textsf {\MakeUppercase {AND}}\) gates, but fail iff any child fails, i.e. https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq82_HTML.gif for \(\textsf {\MakeUppercase {OR}}\) gate \(v \).
  • \(\textsf {\MakeUppercase {VOT}}\) gates also have a binary state: a \(\textsf {\MakeUppercase {VOT}} _k\) gate with m children fails whenever at least k of them fail (\(1\leqslant k\leqslant m\)), thus https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq88_HTML.gif if https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq89_HTML.gif , and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq90_HTML.gif otherwise.
  • \(\textsf {\MakeUppercase {PAND}}\) gates  admit multiple states to represent the failure order of the children. For \(\textsf {\MakeUppercase {PAND}}\) \(v\) with two children we let \({{\varvec{x}}}_v \) equal: 0 if both children are operational; 1 if the left child failed, but the right one has not; 2 if the right child failed, but the left one has not; 3 if both children have failed, the right one first; 4 if both children have failed, otherwise. The output of \(\textsf {\MakeUppercase {PAND}} \) gate \(v\) is https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq97_HTML.gif if \({{\varvec{x}}}_v=4\) and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq99_HTML.gif otherwise. \(\textsf {\MakeUppercase {PAND}}\) gates with more children are handled by exploiting the fact that \(\textsf {\MakeUppercase {PAND}} (w_1,w_2,w_3) = \textsf {\MakeUppercase {PAND}} (\textsf {\MakeUppercase {PAND}} (w_1,w_2),w_3)\).
  • The leftmost child of a \(\textsf {\MakeUppercase {SPARE}}\) gate \(v\) is its primary \(\textsf {\MakeUppercase {BE}}\). All other (spare) inputs are \(\textsf {\MakeUppercase {SBE}}\) s. \(\textsf {\MakeUppercase {SBE}}\) s can be shared among several \(\textsf {\MakeUppercase {SPARE}}\) gates. When the primary of \(v\) fails, it is replaced with an available \(\textsf {\MakeUppercase {SBE}}\). An \(\textsf {\MakeUppercase {SBE}}\) is unavailable if it is failed or if it is replacing the primary \(\textsf {\MakeUppercase {BE}}\) of another \(\textsf {\MakeUppercase {SPARE}}\). The output of \(v \) is https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq114_HTML.gif if its primary is failed and no spare is available. Else https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq115_HTML.gif .
  • An \(\textsf {\MakeUppercase {FDEP}}\) gate  has no output. Its leftmost input is the trigger. We consider non-destructive \(\textsf {\MakeUppercase {FDEP}}\) s [8]: if the trigger fails, the output of all other inputs is set to 1, without affecting the internal state. Since this can be modelled by a suitable combination of \(\textsf {\MakeUppercase {OR}}\) gates [45], we omit the details.
For example, the RFT from Fig. 2 starts with all elements operational, so the initial state is \({{\varvec{x}}}^0=(0,0,2,0,0)\). If then \(\textsf {P}\) fails, \({{\varvec{x}}}_\textsf {P}\) and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq122_HTML.gif are set to 1 (failed) and \(\textsf {S}\) becomes \({{\varvec{x}}}_\textsf {S}=0\) (active and operational spare), so the state changes to \({{\varvec{x}}}^1=(0,1,0,0,0)\). The traces of the IOSA are given by https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq126_HTML.gif , where a change from \({{\varvec{x}}}^j\) to \({{\varvec{x}}}^{j+1}\) corresponds to transitions triggered in the IOSA.
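The five-valued state of a two-child PAND gate listed above can be sketched from the failure times of its children. This is a simplified illustration that ignores repairs: a child is described only by the time at which it failed, or `None` if it is still operational. Simultaneous failures are treated as in-order, following the remark in Sec. 2.1 that adjacent children may fail simultaneously. All function names are hypothetical.

```python
def pand_state(left_fail_time, right_fail_time):
    """State 0..4 of a two-child PAND; None means the child is operational."""
    if left_fail_time is None and right_fail_time is None:
        return 0                      # both children operational
    if right_fail_time is None:
        return 1                      # only the left child failed
    if left_fail_time is None:
        return 2                      # only the right child failed
    if right_fail_time < left_fail_time:
        return 3                      # both failed, the right one first
    return 4                          # both failed, left first or simultaneously


def pand_output(state):
    """The gate outputs 1 (failed) only in state 4."""
    return int(state == 4)


def pand_fail_time(left_fail_time, right_fail_time):
    """Time at which the gate fails, or None if it is not failed."""
    if pand_state(left_fail_time, right_fail_time) == 4:
        return right_fail_time
    return None


def pand3_output(t1, t2, t3):
    """Three children, handled as PAND(PAND(w1, w2), w3)."""
    inner = pand_fail_time(t1, t2)
    return pand_output(pand_state(inner, t3))
```

For instance, `pand3_output(1.0, 2.0, 3.0)` yields a failed gate, whereas swapping the first two failure times leaves the inner gate in state 3 and the outer gate operational.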
Dynamic fault trees may exhibit nondeterministic behaviour as a consequence of underspecified failure behaviour [23, 37]. This can happen e.g. when two \(\textsf {\MakeUppercase {SPARE}}\) s have a single shared \(\textsf {\MakeUppercase {SBE}}\): if all elements are failed and the \(\textsf {\MakeUppercase {SBE}}\) is repaired first, the failure behaviour depends on which \(\textsf {\MakeUppercase {SPARE}}\) gets the \(\textsf {\MakeUppercase {SBE}}\). Monte Carlo simulation, however, requires fully stochastic models and cannot cope with nondeterminism. To overcome this problem, we deploy the theory from [24, 45]. If a fault tree adheres to some mild syntactic conditions, then its IOSA semantics is weakly deterministic, meaning that all resolutions of the nondeterministic choices lead to the same probability value. In particular, we require that (1) each \(\textsf {\MakeUppercase {BE}}\) is connected to at most one \(\textsf {\MakeUppercase {SPARE}}\) gate, and that (2) \(\textsf {\MakeUppercase {BE}}\) s and \(\textsf {\MakeUppercase {SBE}}\) s connected to \(\textsf {\MakeUppercase {SPARE}}\) s are not connected to \(\textsf {\MakeUppercase {FDEP}}\) s. In addition to this, some semantic decisions have been fixed. Notably, the semantics of \(\textsf {\MakeUppercase {PAND}}\), which normally has some ambiguity and has rarely been discussed in the context of repairs, is here fully specified. Besides, policies should be provided for \(\textsf {\MakeUppercase {RBOX}}\) and spare assignments.

2.3 Minimal cut sets

Cut sets are a well known qualitative technique in FTA for static FTs. A cut set is a set of basic elements (\(\textsf {\MakeUppercase {BE}}\) s and \(\textsf {\MakeUppercase {SBE}}\) s) whose joint failure will cause a top-level event. A minimal cut set (MCS) is a cut set of which no subset is a cut set. These concepts can be lifted to dynamic fault trees (and RFTs) in general, but this requires introducing an order to capture temporal dependencies, plus several other subtleties [37, 54]. Nevertheless, RFTs as defined above rule out several issues raised by cut sets in DFTs, such as event simultaneity [37]. Furthermore, for FT analysis related to MCS, we exclude order dependence by considering RFTs without \(\textsf {\MakeUppercase {PAND}}\) gates.
For an illustration, Fig. 3b lists all minimal cut sets of the tree \(\triangle \) from Fig. 3a. We use the notation: \({\mathcal {M}}(\triangle )\) for the family of all minimal cut sets of the FT \(\triangle \); \({\mathcal {M}}_{<N}(\triangle )\) for the subset of \({\mathcal {M}}(\triangle )\) that excludes cut sets with N or more \(\textsf {\MakeUppercase {BE}}\) s (called pruning of order N); \({\mathcal {M}}_{>\lambda }(\triangle )\) for the subset of \({\mathcal {M}}(\triangle )\) that excludes cut sets where the product of the failure rate of the \(\textsf {\MakeUppercase {BE}}\) s is \(\leqslant \lambda \in {\mathbb {R}} _{>0}\). The latter is only defined when all \(\textsf {\MakeUppercase {BE}}\) s and \(\textsf {\MakeUppercase {SBE}}\) s in the RFT have exponential failure and dormancy distributions. To obtain \({\mathcal {M}}_{>\frac{1}{4}}(\triangle )\) in Fig. 3b, we make this the case with failure rates: \(\frac{1}{4}\) for \(\textsf {\MakeUppercase {BE}} _1\) and \(\textsf {\MakeUppercase {BE}} _7\), \(\frac{6}{20}\) for \(\textsf {\MakeUppercase {BE}} _2\), \(\frac{2}{3}\) for \(\{\textsf {\MakeUppercase {BE}} _i\}_{i=3}^6\), and \(\frac{1}{2}\) for \(\textsf {\MakeUppercase {BE}} _8\) and \(\textsf {\MakeUppercase {SBE}} _9\).
Cut set pruning as in \({\mathcal {M}}_{<N}(\triangle )\) and \({\mathcal {M}}_{>\lambda }(\triangle )\) is a standard way to speed up FT analyses [57]. The goal is to ignore the most unlikely (and hard to compute) cut sets: pruning of order N assumes that the top-level event will most likely occur by cut sets with fewer than N \(\textsf {\MakeUppercase {BE}}\) s; pruning by rate \(\leqslant \lambda \) assumes that the top-level event will occur first by cut sets where \(\textsf {\MakeUppercase {BE}}\) s have higher rates and thus fail faster. Choosing such N and \(\lambda \) to prune irrelevant MCS of a given tree depends on its structure and the failure, dormancy, or repair distributions of its \(\textsf {\MakeUppercase {BE}}\) s.
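Both pruning criteria amount to simple filters over the family of minimal cut sets. The sketch below illustrates them with the failure rates quoted above for some of the basic elements; the cut sets themselves are hypothetical and are not those of Fig. 3b, and the function names are assumptions made for this example. Rates are kept as exact fractions so that the strict comparison with \(\lambda \) is not affected by rounding.

```python
from fractions import Fraction
from math import prod


def prune_by_order(cut_sets, n):
    """Keep only cut sets with fewer than n basic elements."""
    return [cs for cs in cut_sets if len(cs) < n]


def prune_by_rate(cut_sets, rates, lam):
    """Keep only cut sets whose product of failure rates exceeds lam."""
    return [cs for cs in cut_sets if prod(rates[be] for be in cs) > lam]


# Failure rates as given in the text for these basic elements.
RATES = {
    "BE1": Fraction(1, 4), "BE2": Fraction(6, 20),
    "BE3": Fraction(2, 3), "BE4": Fraction(2, 3),
    "BE8": Fraction(1, 2), "SBE9": Fraction(1, 2),
}

# Hypothetical minimal cut sets, for illustration only.
CUT_SETS = [
    frozenset({"BE1"}),
    frozenset({"BE2"}),
    frozenset({"BE3", "BE4"}),
    frozenset({"BE8", "SBE9"}),
    frozenset({"BE2", "BE3", "BE4"}),
]
```

With \(\lambda =\frac{1}{4}\), the cut sets `{BE1}` and `{BE8, SBE9}` are discarded because their rate product equals \(\frac{1}{4}\) and therefore does not exceed it.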
An RFT without \(\textsf {\MakeUppercase {PAND}}\) or \(\textsf {\MakeUppercase {SPARE}}\) gates can be translated into a static FT in such a way that they have exactly the same set of MCSs (notice that \(\textsf {\MakeUppercase {RBOX}}\) s have no effect on minimal cut sets). Obtaining the MCSs from a static FT can be easily done, e.g. by translating it into an FT in disjunctive normal form and taking the minimal clauses as the minimal cut sets. For this work, we also opt to translate \(\textsf {\MakeUppercase {SPARE}}\) gates into \(\textsf {\MakeUppercase {AND}}\) gates. However, this translation does not preserve all MCSs, yielding, on the translated FT, a subset of the MCSs of the original RFT. For example, the RFT on the left of Fig. 4 has three MCSs. In particular, since \(\textsf {\MakeUppercase {SBE}}\) cannot be operational in both \(\textsf {\MakeUppercase {SPARE}}\) gates at the same time, \(\{\textsf {\MakeUppercase {BE}} _1,\textsf {\MakeUppercase {BE}} _3\}\) is one of these MCSs. However, this set is not an MCS in the FT on the right side of Fig. 4, which replaces the \(\textsf {\MakeUppercase {SPARE}}\) gates with \(\textsf {\MakeUppercase {AND}}\) gates.
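The disjunctive-normal-form route mentioned above can be sketched for static trees as follows. This is a naive illustration, not the procedure used in the toolchain: trees are assumed to be nested tuples `("AND", [...])`, `("OR", [...])`, or `("VOT", k, [...])` with strings as basic elements, and the representation and function names are hypothetical. It enumerates clauses bottom-up and discards non-minimal ones, which can blow up on large trees.

```python
from itertools import combinations, product


def minimise(sets):
    """Drop every set that has a proper subset in the family."""
    return {s for s in sets if not any(t < s for t in sets)}


def minimal_cut_sets(node):
    """Minimal cut sets of a static fault tree given as nested tuples."""
    if isinstance(node, str):                      # basic element
        return {frozenset([node])}
    gate, *args = node
    if gate == "OR":                               # union of the children's clauses
        (children,) = args
        result = set().union(*(minimal_cut_sets(c) for c in children))
    elif gate == "AND":                            # one clause from each child, merged
        (children,) = args
        result = {frozenset().union(*combo)
                  for combo in product(*(minimal_cut_sets(c) for c in children))}
    elif gate == "VOT":                            # OR over all k-subsets, each an AND
        k, children = args
        result = set()
        for subset in combinations(children, k):
            result |= minimal_cut_sets(("AND", list(subset)))
    else:
        raise ValueError(f"unsupported gate: {gate}")
    return minimise(result)
```

For the railway example of Fig. 2 with its SPARE gate replaced by an AND gate, `minimal_cut_sets(("AND", ["HVcab", ("AND", ["P", "S"])]))` yields the single cut set containing `HVcab`, `P`, and `S`. As noted above, replacing SPARE gates in this way may lose some MCSs of the original RFT when spares are shared.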

2.4 Dependability metrics

An important use of fault trees is to compute relevant dependability metrics. Let \(X_t\) denote the random variable that represents the state of the top event at time t [22]. Two popular metrics are:
  • system reliability: the probability of observing no top event failure before some mission time \(T>0\), viz.
    $$\begin{aligned}\text {REL} _T =\textit{Prob}\left( \forall t\in [0,T]\,.\,X_t=0\right) \text {,}\end{aligned}$$
  • system availability: the proportion of time that the system remains operational in the long run, viz.
    $$\begin{aligned}\text {AVA} =\lim _{t\rightarrow \infty } \textit{Prob}\left( X_t=0\right) \text {.}\end{aligned}$$
System unreliability and unavailability are the complements of these metrics, i.e. \(\text {UNREL} _T=1-\text {REL} _T\) and \(\text {UNAVA} =1-\text {AVA} \).

3 Stochastic simulation for fault trees

Standard Monte Carlo simulation (SMC). Monte Carlo simulation takes random samples from stochastic models to estimate a (dependability) metric of interest. For instance, to estimate the unreliability of a tree \(\triangle \) we sample N independent traces from its IOSA semantics. An unbiased statistical estimator for https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq192_HTML.gif is the proportion of traces observing a top-level event, that is, \({\hat{p}}_N=\frac{1}{N}\sum _{j=1}^N X^j\) where \(X^j=1\) if the j-th trace exhibits a top-level failure before time T and \(X^j=0\) otherwise. The statistical error of \({\hat{p}}\) is typically quantified with two numbers \(\delta \) and \(\varepsilon \) s.t. \({\hat{p}}\in [p-\varepsilon ,p+\varepsilon ]\) with probability \(\delta \). The interval \({\hat{p}}\pm \varepsilon \) is called a confidence interval (CI) with coefficient \(\delta \) and precision \(2\varepsilon \).
Such procedures scale linearly with the number of tree nodes and cater for arbitrary PDFs, i.e. they are not restricted to exponential ones. However, they hit a bottleneck when estimating rare events: if \(p\approx 0\), very few traces observe \(X^j=1\). Therefore, the variance of estimators like \({\hat{p}}\) becomes huge relative to p, and CIs become too wide, easily degenerating to the trivial interval [0, 1]. Increasing the number of traces alleviates this problem, but even standard confidence interval settings (where \(\varepsilon \) is relative to p) require sampling an unacceptably large number of traces [49]. Rare event simulation techniques solve this specific problem.
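The bottleneck can be made concrete with a small Python sketch of the SMC estimator. It assumes a normal-approximation confidence interval with quantile \(z=1.96\) (i.e. \(\delta \approx 0.95\)); this choice, the function names, and the stand-in trace generator are assumptions made for illustration and are not taken from the tooling described here.

```python
import math

def smc_estimate(sample_trace, n, z=1.96):
    """Return (p_hat, eps): the proportion of traces that observe the
    event and the half-width of a normal-approximation CI."""
    hits = sum(sample_trace() for _ in range(n))
    p_hat = hits / n
    eps = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, eps

def samples_needed(p, rel, z=1.96):
    """Number of traces for which eps is about rel * p, obtained by
    solving z * sqrt(p * (1 - p) / n) = rel * p for n."""
    return math.ceil(z * z * (1.0 - p) / (rel * rel * p))

# Stand-in for a trace generator: 25 failures in 100 traces.
outcomes = iter([1, 0, 0, 0] * 25)
p_hat, eps = smc_estimate(lambda: next(outcomes), 100)

# Traces needed for 10% relative precision when p = 1e-9.
n_rare = samples_needed(1e-9, 0.1)
```

Under these assumptions, a relative precision of 10% for \(p=10^{-9}\) already requires in the order of \(10^{11}\) traces, since the required number of samples grows proportionally to \(1/p\).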
Rare Event Simulation (RES). RES techniques [49] increase the number of traces that observe the rare event, e.g. a top-level event in an RFT. Two prominent classes of RES techniques are importance sampling, which adjusts the PDF of failures and repairs, and importance splitting (ISPLIT) [43], which samples more (partial) traces from states that are closer to the rare event. We focus on ISPLIT due to its flexibility with respect to the probability distributions.
ISPLIT can be efficiently deployed as long as the rare event \(\gamma \), here defined as the set of states characterising the rare event, can be described as a nested sequence of less-rare events https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq209_HTML.gif . This decomposition allows ISPLIT to study the conditional probabilities \(p_k=\textit{Prob}(\gamma _{k+1}\,|\,\gamma _k)\) separately, to then compute \(p=\textit{Prob}(\gamma ) = \prod _{k=0}^{M\text {-}1}{} \textit{Prob}(\gamma _{k+1}\,|\,\gamma _k)\). Moreover, ISPLIT requires all conditional probabilities \(p_k\) to be much greater than p, so that estimating each \(p_k\) can be done efficiently with SMC.
The key idea behind ISPLIT is to define the events \(\gamma _k\) via a so-called importance function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq215_HTML.gif that assigns an importance to each state https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq216_HTML.gif . The higher the importance of a state, the closer it is to the rare event \(\gamma _M\). Event \(\gamma _k\) collects all states with importance at least \(\ell _k\), for a certain sequence of threshold levels \(0=\ell _0<\ell _1<\cdots <\ell _M\). Formally: https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq221_HTML.gif .
To exploit the importance function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq222_HTML.gif in the simulation procedure, ISPLIT samples more (partial) traces from states with higher importance. Two well-known methods are deployed and compared in this paper: RESTART and Fixed Effort. Fixed Effort (FE) [27] samples a predefined number of traces in each region https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq223_HTML.gif for \(0\leqslant k < M\). Thus, starting at \(\gamma _0\) it first estimates the proportion of traces that reach \(\gamma _1\), i.e. the probability https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq227_HTML.gif . Next, from the states that reached \(\gamma _1\) new traces are generated to estimate https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq229_HTML.gif , and so on until \(p_{M-1}\). Fixed Effort thus requires that (i) each trace has a clearly defined end, so that estimations of each \(p_k\) finish with probability 1, and (ii) all rare events reside in the uppermost region https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq232_HTML.gif .
Example. Fig. 5a shows Fixed Effort estimating the probability to visit states labelled https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq233_HTML.gif before others labelled https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq234_HTML.gif . States https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq235_HTML.gif have importance >13, and thresholds \(\ell _1=4\) and \(\ell _2=10\) partition the state space into regions https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq238_HTML.gif s.t. all https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq239_HTML.gif . The effort is 5 simulations per region, for all regions: we call this algorithm FE\(_{5}\). In region https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq241_HTML.gif , 2 simulations made it from the initial state to threshold \(\ell _1\), i.e. they reached some state with importance 4 before visiting a state https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq243_HTML.gif . In https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq244_HTML.gif , starting from these two states, 3 simulations reached \(\ell _2\). Finally, 2 out of 5 simulations visited states https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq246_HTML.gif in https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq247_HTML.gif . Thus, the estimated rare event probability of this run of FE\(_{5}\) is \({\hat{p}}=\prod _{i=0}^2{\hat{p}}_i=\frac{2}{5}\frac{3}{5}\frac{2}{5}=9.6\times 10^{-2}\).
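To make the mechanics of Fixed Effort concrete, the following Python sketch runs it on a toy model rather than on an RFT. The model is an assumption made for illustration: the state is an integer that doubles as its own importance, each step moves it up by one with probability `p_up` and down by one otherwise, and a trace ends (without success) as soon as the state drops below 0. All names are hypothetical.

```python
import random

def run_until(state, target, p_up, rng):
    """Simulate from `state` until importance `target` is reached
    (return the state reached) or the walk drops below 0 (return None)."""
    while 0 <= state < target:
        state += 1 if rng.random() < p_up else -1
    return state if state >= target else None

def fixed_effort(thresholds, effort, p_up, rng):
    """Estimate the probability of reaching the last threshold before
    dropping below 0, as a product of per-region success proportions."""
    starts = [0]        # entry states of the current region
    estimate = 1.0
    for target in thresholds:
        reached = []
        for _ in range(effort):
            end = run_until(rng.choice(starts), target, p_up, rng)
            if end is not None:
                reached.append(end)
        if not reached:
            return 0.0  # no trace crossed this threshold
        estimate *= len(reached) / effort
        starts = reached
    return estimate

# Two regions with 2000 simulations each; the last threshold is the rare event.
p_est = fixed_effort([2, 4], effort=2000, p_up=0.4, rng=random.Random(1))
```

For this toy walk the exact probability follows from the classical gambler's ruin formula and is approximately 0.0758, which gives a reference value against which the estimate can be compared.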
RESTART (RST) [62, 63] is another RES algorithm, which starts one trace in \(\gamma _0\) and monitors the importance of the states visited. If the trace up-crosses threshold \(\ell _1\), the first state visited in https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq251_HTML.gif is saved and the trace is cloned, aka split (see Fig. 5b). This mechanism rewards traces that get closer to the rare event. Each clone then evolves independently, and if one up-crosses threshold \(\ell _2\) the splitting mechanism is repeated. Instead, if a state with importance below \(\ell _1\) is visited, the trace is truncated ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq254_HTML.gif in Fig. 5b). This penalises traces that move away from the rare event. To avoid truncating all traces, the one that spawned the clones in region https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq255_HTML.gif can go below importance \(\ell _k\). To deploy an unbiased estimator for p, RESTART measures how much splitting was required to visit a rare state [62]. In particular, RESTART does not need the rare event to be defined as \(\gamma _M\) [58], and it was devised for steady-state analysis [63] (e.g. to estimate \(\text {UNAVA}\)) although it can also be used for transient studies as depicted in Fig. 5b [59].

4 Importance splitting for FTA

The effectiveness of ISPLIT crucially relies on the importance function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq259_HTML.gif , as well as the threshold levels \(\ell _1,\dots ,\ell _M\) [43]. Traditionally, these are given by domain and/or RES experts, thus requiring considerable expert knowledge. To alleviate this requirement, we focus on techniques that are able to obtain importance functions and splitting thresholds automatically. In particular, we introduce a first technique that derives https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq261_HTML.gif from the structure of the given RFT, and a second technique that defines https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq262_HTML.gif based on the MCSs of the RFT. We also discuss methods to select the threshold levels \(\ell _k\) and an improvement on FIG based on resampling time values at the moment of splitting simulation runs.

4.1 Compositional importance functions for fault trees

The core idea behind importance splitting is that states that are more likely to lead to the rare event should have a higher importance. To achieve this, the key lies in defining an importance function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq264_HTML.gif and thresholds \(\ell _k\) that are sensitive to both the state space https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq266_HTML.gif and the transition probabilities of the system. For us, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq267_HTML.gif are all possible states of an RFT. Its top event fails when certain nodes fail in a certain order, and remains failed until certain repairs occur. To exploit this for ISPLIT, the structure of the tree must be embedded into https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq268_HTML.gif .
The strong dependence of the importance function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq269_HTML.gif on the structure of the tree is easy to see in the following example. Take the RFT \(\triangle \) from Fig. 2 and let its current state \({\varvec{x}}\) be s.t. \(\textsf {P}\) is failed and \(\textsf {HVcab}\) and \(\textsf {S}\) are operational. If the next event is a repair of \(\textsf {P}\), then the new state \({{\varvec{x}}}'\) (where all basic elements are operational) is farther from a failure of the top event. Hence, a good importance function should satisfy https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq276_HTML.gif . Conversely, if the next event had been a failure of \(\textsf {S}\) leading to state \({\varvec{x}}''\), then one would want that https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq279_HTML.gif . The key observation is that these inequalities depend on the structure of \(\triangle \) as well as on the failures/repairs of basic elements.
In view of the above, any attempt to define an importance function for an arbitrary fault tree \(\triangle \) must put its (gate) structure in the forefront. In Table 1 we introduce a compositional heuristic for this, which defines local importance functions distinguished per node type. The importance function associated to node \(v\) is https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq283_HTML.gif . We define the global importance function of the tree ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq284_HTML.gif ) as the local importance function of the top event node of \(\triangle \).
Table 1 Compositional (“structural”) importance function for RFTs
type(\(v\)) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq287_HTML.gif
\(\textsf {\MakeUppercase {BE}}\), \(\textsf {\MakeUppercase {SBE}}\) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq290_HTML.gif
\(\textsf {\MakeUppercase {AND}}\) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq292_HTML.gif
\(\textsf {\MakeUppercase {OR}}\) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq294_HTML.gif
\(\textsf {\MakeUppercase {VOT}} _k\) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq296_HTML.gif
\(\textsf {\MakeUppercase {SPARE}}\) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq298_HTML.gif
\(\textsf {\MakeUppercase {PAND}}\) | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq300_HTML.gif where \({\textit{ord}}=1\) if \({\varvec{x}}_v \in \{1,4\}\) and \({\textit{ord}}=-1\) otherwise
with https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq304_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq305_HTML.gif
Thus, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq306_HTML.gif is defined in Table 1 via structural induction in the fault tree. It is defined so that it assigns to a failed node \(v\) its highest importance value. Functions with this property deploy the most efficient ISPLIT implementations [43], and some RES algorithms (e.g. Fixed Effort) require this property [27].
In the following we explain our definition of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq308_HTML.gif . If \(v\) is a failed \(\textsf {\MakeUppercase {BE}}\) or \(\textsf {\MakeUppercase {SBE}}\), then its importance is 1; else it is 0. This matches the output of the node, thus https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq312_HTML.gif . Intuitively, this reflects how failures of basic elements are positively correlated to top event failures. The importance of \(\textsf {\MakeUppercase {AND}}\), \(\textsf {\MakeUppercase {OR}}\), and \(\textsf {\MakeUppercase {VOT}} _k\) gates depends exclusively on their input. The importance of an \(\textsf {\MakeUppercase {AND}}\) is the sum of the importance of its children, scaled by a normalisation factor (explained below). This reflects that \(\textsf {\MakeUppercase {AND}}\) gates fail when all their children fail, and each failure of a child brings an \(\textsf {\MakeUppercase {AND}}\) closer to its own failure, hence increasing its importance. Instead, since \(\textsf {\MakeUppercase {OR}}\) gates fail as soon as a single child fails, their importance is the maximum importance among their children. The importance of a \(\textsf {\MakeUppercase {VOT}} _k\) gate is the sum of the importance of the k (out of m) children with highest importance value.
Omitting normalisation may yield an undesirable importance function. To understand why, consider a binary \(\textsf {\MakeUppercase {AND}}\) gate \(v\) with children l and r, and define https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq323_HTML.gif . Suppose that https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq324_HTML.gif takes its highest value in https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq325_HTML.gif while https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq326_HTML.gif in https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq327_HTML.gif and assume that states \({\varvec{x}}\) and \({{\varvec{x}}}'\) are s.t. https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq330_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq331_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq332_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq333_HTML.gif . This means that in both states one child of \(v\) is “good-as-new” and the other is “half-failed”, and hence the system is equally close to failure in both cases. Hence we expect https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq335_HTML.gif when actually https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq336_HTML.gif .
Instead, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq337_HTML.gif operates with https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq338_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq339_HTML.gif , which can be interpreted as the “percentage of failure” of the children of \(v\). To make these numbers integers we scale them by \(\text {lcm}_v \), the least common multiple of their max importance values. In our case \(\text {lcm}_v =6\) and hence https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq343_HTML.gif . Similar problems arise with all gates, hence normalisation is applied in all cases.
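The effect of normalisation can be reproduced with a short Python sketch covering basic elements and the static gates \(\textsf {\MakeUppercase {AND}}\), \(\textsf {\MakeUppercase {OR}}\), and \(\textsf {\MakeUppercase {VOT}} _k\) only. It is a reading of the prose above rather than a transcription of Table 1: the tree encoding, the function names, and the concrete example (children with maximum importance 2 and 6, chosen to be consistent with \(\text {lcm}_v =6\)) are assumptions. The multi-argument `math.lcm` requires Python 3.9 or later.

```python
from math import lcm

def max_importance(node):
    """Largest value the local importance function of `node` can take."""
    if isinstance(node, str):          # basic element
        return 1
    gate, children, *k = node
    scale = lcm(*(max_importance(c) for c in children))
    if gate == "AND":
        return scale * len(children)
    if gate == "OR":
        return scale
    if gate == "VOT":
        return scale * k[0]
    raise ValueError(f"unsupported gate: {gate}")

def importance(node, failed):
    """Normalised structural importance of `node` when the basic
    elements in `failed` are failed."""
    if isinstance(node, str):
        return 1 if node in failed else 0
    gate, children, *k = node
    maxima = [max_importance(c) for c in children]
    scale = lcm(*maxima)
    # "Percentage of failure" of each child, scaled to an integer.
    norm = [scale * importance(c, failed) // m
            for c, m in zip(children, maxima)]
    if gate == "AND":
        return sum(norm)
    if gate == "OR":
        return max(norm)
    if gate == "VOT":
        return sum(sorted(norm, reverse=True)[:k[0]])
    raise ValueError(f"unsupported gate: {gate}")

# Hypothetical gate v = AND(l, r): l has maximum importance 2, r has 6.
l = ("AND", ["l1", "l2"])
r = ("AND", ["r1", "r2", "r3", "r4", "r5", "r6"])
v = ("AND", [l, r])
half_l = importance(v, {"l1"})                # l half-failed, r as new
half_r = importance(v, {"r1", "r2", "r3"})    # r half-failed, l as new
```

Both states obtain importance 3, whereas the unnormalised sum would give 1 and 3, respectively. The integer division is exact because the maximum importance of each child divides the least common multiple.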
\(\textsf {\MakeUppercase {SPARE}}\) gates with m children (including their primary) behave similarly to \(\textsf {\MakeUppercase {AND}}\) gates: every failed child brings the gate closer to failure, as reflected in the left operand of the \(\max \) in Table 1. However, \(\textsf {\MakeUppercase {SPARE}}\) s fail when their primaries fail and no \(\textsf {\MakeUppercase {SBE}}\) s are available (as opposed to failed, e.g. possibly being used by another \(\textsf {\MakeUppercase {SPARE}}\)). This means that the gate could fail despite some children being operational. To account for this we exploit the gate output: multiplying https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq350_HTML.gif by m we give the gate its maximum value when it fails, even when this happens due to unavailable (but possibly operational) \(\textsf {\MakeUppercase {SBE}}\) s.
For a \(\textsf {\MakeUppercase {PAND}}\) gate \(v\) we have to carefully look at the states. If the left child l fails first, then the right child r contributes positively to the failure of the \(\textsf {\MakeUppercase {PAND}}\) and hence to the importance function of the node \(v\). If instead the right child has failed before the left child, then the \(\textsf {\MakeUppercase {PAND}}\) gate will not fail and hence we let it contribute negatively to the importance function of \(v\). Thus, we multiply https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq358_HTML.gif (the normalised importance function of the right child) by \(-1\) in the latter case, i.e. when state \({\varvec{x}}_v \notin \{1,4\}\). Instead, the left child always contributes positively. Finally, the max operation serves two purposes: on the one hand, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq361_HTML.gif ensures that the importance value remains at its maximum while failing (\(\textsf {\MakeUppercase {PAND}}\) s remain failed even after the left child is repaired); on the other, it ensures that the smallest possible value while operational is 0 (since importance values cannot be negative).

4.2 Importance functions based on minimal cut sets

In the previous section we introduced a compositional technique to derive the importance function. The key concept is that, in general, importance should reflect proximity to the rare event. This means that the importance of a gate should increase as the gate approaches its own failure. Note that the type of a gate defines how, as its children fail, the gate approaches its own failure. Following the structure of the RFT, its importance function is determined by the importance of its top-level gate.
In this section we follow a different approach based on the set of minimal cut sets. Recall that the joint failure of all basic elements in an MCS determines the failure of the top event. Thus, by counting how many \(\textsf {\MakeUppercase {BE}}\) s have failed in an MCS, we have an idea of the proximity to the occurrence of the top-level event. This suggests the following importance function: given a state of the RFT, its importance is defined by the maximum number of failed \(\textsf {\MakeUppercase {BE}}\) s in any MCS.
Table 2 Importance functions for automatic RES in fault trees
Name | Expression | Description
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq365_HTML.gif | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq366_HTML.gif | For each MCS of the tree, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq367_HTML.gif counts the number of \(\textsf {\MakeUppercase {BE}}\) s that have failed in the current state x. (Recall that a basic element b in x has failed if https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq369_HTML.gif .) The importance https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq370_HTML.gif of the current state of the tree is the maximum among these counts.
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq371_HTML.gif | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq372_HTML.gif | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq373_HTML.gif operates similarly to function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq374_HTML.gif above, but here the maximum ranges over a pruned set of MCS, discarding cut sets with N or more \(\textsf {\MakeUppercase {BE}}\) s.
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq376_HTML.gif | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq377_HTML.gif | Similar to https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq378_HTML.gif but using the failure rates for pruning, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq379_HTML.gif considers only MCS where the product of the failure rate of all \(\textsf {\MakeUppercase {BE}}\) s is greater than \(\lambda \). Applicable only to FTs whose failure and dormancy distributions are Markovian.
https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq382_HTML.gif | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq383_HTML.gif | https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq384_HTML.gif is a normalised version of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq385_HTML.gif . The normalisation follows a similar procedure to Sec. 4.1, where \(\text {lcm}\) is the least common multiple of the cardinality of every MCS in \({\mathcal {M}}(\triangle ^{\!*})\).
Formally, let \(\triangle ^{\!*}\) be the FT rewrite of the original RFT \(\triangle \) as described in Sec. 2.3 (i.e. replacing \(\textsf {\MakeUppercase {SPARE}}\) and \(\textsf {\MakeUppercase {FDEP}}\) gates with \(\textsf {\MakeUppercase {AND}}\) and \(\textsf {\MakeUppercase {OR}}\) gates, respectively) and let \({\mathcal {M}}(\triangle ^{\!*})\) be the set of all its minimal cut sets. Then, the importance function described above is given by function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq395_HTML.gif in Table 2.
We further consider pruned variants of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq396_HTML.gif that discard cut sets based on their cardinality ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq397_HTML.gif ) or, if \(\textsf {\MakeUppercase {BE}}\) failures are exponentially distributed, based on the product of the failure rates of the \(\textsf {\MakeUppercase {BE}}\) s in the cut set ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq400_HTML.gif ). The only difference between these functions and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq401_HTML.gif is the range of the \(\max \) operator, which reflects the pruning of some minimal cut sets.
A concept already discussed in Sec. 4.1 is importance normalisation. We also experiment with it here: in Table 2, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq403_HTML.gif stands for the normalised version of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq404_HTML.gif . By dividing each summation by its maximum possible value, namely \(|\textit{MCS}|\), we compute the percentage of failure introduced by each MCS in the current state. The scaling factor \(\text {lcm}\) ensures that the resulting value is an integer, as required by our tooling framework. Finally, although omitted in Table 2, we further define the functions https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq407_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq408_HTML.gif as the normalised versions of the functions https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq409_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq410_HTML.gif , respectively.
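The plain and normalised MCS-based importance functions can be sketched in a few lines of Python. The sketch follows the textual description (maximum count of failed basic elements per cut set, and its normalised variant scaled by the least common multiple of the cut set cardinalities); the function names and the family of cut sets are hypothetical, and `math.lcm` with several arguments requires Python 3.9 or later. The pruned variants are obtained by passing a pruned family.

```python
from math import lcm

def importance_mcs(mcs_family, failed):
    """Maximum number of failed basic elements in any minimal cut set."""
    return max(len(cs & failed) for cs in mcs_family)

def importance_mcs_normalised(mcs_family, failed):
    """Maximum 'percentage of failure' over the minimal cut sets, scaled
    by the lcm of their cardinalities so that the result is an integer."""
    scale = lcm(*(len(cs) for cs in mcs_family))
    return max(scale * len(cs & failed) // len(cs) for cs in mcs_family)

# Hypothetical family of minimal cut sets, for illustration only.
family = [frozenset("a"), frozenset("bc"), frozenset("def")]

plain_partial = importance_mcs(family, {"d", "e"})             # 2 of 3 failed
plain_full = importance_mcs(family, {"b", "c"})                # 2 of 2 failed
norm_partial = importance_mcs_normalised(family, {"d", "e"})   # 6 * 2/3
norm_full = importance_mcs_normalised(family, {"b", "c"})      # 6 * 2/2
```

In this example the plain function assigns importance 2 both to a state where the top event has occurred (cut set \(\{b,c\}\) fully failed) and to one where it has not (two out of three elements of \(\{d,e,f\}\) failed), whereas the normalised function separates them with values 6 and 4.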

4.3 Automatic importance splitting for FTA

The techniques to provide importance functions introduced in the previous subsections are based on the distribution of operational/failed basic elements in the fault tree, be it through the structure of the tree or the contribution to its minimal cut sets. This follows the core idea of importance splitting: the more failed \(\textsf {\MakeUppercase {BE}}\) s/\(\textsf {\MakeUppercase {SBE}}\) s (in the right order), the closer a tree is to its top event failure.
However, the ISPLIT strategy is to run more simulations from those states that have a higher probability to lead to rare states. This is only partially reflected by whether basic element b is failed. Probabilities lie also in the failure, repair and dormancy distributions (\(F_b,R_b,D_b\)). These distributions govern the transitions among states https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq414_HTML.gif , and can be exploited for importance splitting.
Thus, after determining the importance function, we run “pilot simulations” on the importance-labelled states of the tree. Running simulations exercises the failure and repair distributions of \(\textsf {\MakeUppercase {BE}}\) s and \(\textsf {\MakeUppercase {SBE}}\) s, imprinting this information in the thresholds \(\ell _k\). Several algorithms can perform this threshold selection. They operate sequentially, starting from the initial state (a fully operational tree), which has importance \(i_0=0\). For instance, Expected Success [13] runs N finite-life simulations. If \(K<\frac{N}{2}\) simulations reach the next smallest importance \(i_1>i_0\), then the first threshold will be \(\ell _1=i_1\). Next, N simulations start from states with importance \(i_1\), to determine whether the next importance \(i_2\) should be chosen as threshold \(\ell _2\), and so on.
Expected Success also computes the effort per splitting region https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq425_HTML.gif . For Fixed Effort, “effort” is the base number of simulations to run in region https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq426_HTML.gif . For RESTART, it is the number of clones spawned when threshold \(\ell _{k+1}\) is up-crossed. In general, if K out of N pilot simulations make it from \(\ell _{k-1}\) to \(\ell _k\), then the k-th effort is https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq430_HTML.gif . This is chosen so that, during RES estimations, one simulation makes it from threshold \(\ell _{k-1}\) to \(\ell _k\) on average.
Thus, using the method from [14, 15] based on any of our importance functions, we compute (automatically) the thresholds and their effort for the given RFT. This is all the meta-information required to apply importance splitting RES [14, 27, 28].
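As a small illustration of the effort computation, the Python sketch below derives an integer effort from the outcome of N pilot simulations of which K cross the next threshold. It assumes that the effort is the ceiling of N/K, which is the smallest integer for which at least one simulation is expected to cross the threshold; this rounding and the function name are assumptions made for the example.

```python
def splitting_effort(n_pilot, k_success):
    """Smallest integer effort e such that e * (k_success / n_pilot) >= 1,
    i.e. at least one simulation is expected to cross the threshold."""
    if not 0 < k_success <= n_pilot:
        raise ValueError("need 0 < k_success <= n_pilot")
    return -(-n_pilot // k_success)   # integer ceiling of n_pilot / k_success

effort_a = splitting_effort(1000, 125)   # success proportion 1/8
effort_b = splitting_effort(1000, 300)   # success proportion 3/10
```

With these numbers the efforts are 8 and 4: for Fixed Effort this would be the base number of simulations in the region, and for RESTART the number of clones spawned at the threshold.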

4.4 Sampling conditional variables

The engine of the simulation tool FIG is based on standard discrete-event simulation [42]: time advances in discrete steps until the first occurrence of an event and, at that moment, the occurrence times of the newly enabled events are sampled. Thus, once the occurrence time of an event is fixed, it remains unaltered until the event actually occurs. This has negative implications for the ISPLIT method: each time a simulation run up-crosses a threshold, the current state is cloned, and hence the occurrence times of all active (i.e. enabled but yet to occur) events are also copied to the split simulation runs. This introduces some dependence among the split runs, with the consequent adverse impact on the variance of the parameter estimated during simulation.
To improve the convergence of RESTART and FE, it is therefore convenient to resample, at each splitting point, the occurrence time of every active event conditioned on the time elapsed since it became active. For this, we have implemented two different methods in FIG, one based on the inverse transform method and the other using a rejection technique [42].
If the random variable X that determines the occurrence of an event has a cumulative distribution function (CDF) F whose inverse \(F^{-1}\) can be written in a closed form, then we use the inverse-transform method as described in the following. Let \(t\) be the value that we want to sample from X conditioned on the fact that \(t_e\) units of time have passed. Then
$$\begin{aligned} P(X\leqslant t\mid X>t_e) = \frac{F(t)-F(t_e)}{1-F(t_e)}. \end{aligned}$$
More precisely, we are interested in sampling the remaining time \(t_r\) such that \(t= t_e+ t_r\). Hence, we consider the random variable \(Y = X - t_e\), conditioned on \(X>t_e\), with CDF \(F_{t_e}\) such that:
$$\begin{aligned} F_{t_e}(t_r) = P(X\leqslant t_e+ t_r\mid X>t_e) = \frac{F(t_e+ t_r)-F(t_e)}{1-F(t_e)}. \end{aligned}$$
Following the inverse-transform method, a (pseudo) random value for Y can be obtained by generating a value \(u\sim U[0,1]\) and taking \(t_r= F_{t_e}^{-1}(u) = F^{-1}( u + (1-u) \cdot F(t_e) ) - t_e\).
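The formula above can be illustrated with a minimal sketch. It is not the implementation used in the tool: the Weibull distribution is chosen here only because its CDF inverts in closed form, and all function names are hypothetical.

```python
import math
import random

def weibull_cdf(t, shape, scale):
    """CDF of a Weibull distribution: F(t) = 1 - exp(-(t/scale)^shape)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def weibull_inv_cdf(p, shape, scale):
    """Closed-form inverse of the Weibull CDF, valid for 0 <= p < 1."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def sample_remaining_time(cdf, inv_cdf, t_elapsed, rng=random):
    """Sample t_r = F^{-1}(u + (1 - u) * F(t_e)) - t_e with u ~ U[0, 1)."""
    f_e = cdf(t_elapsed)
    if f_e >= 1.0:
        # F(t_e) rounded to 1: the conditional law is numerically undefined.
        raise ValueError("elapsed time is too deep in the tail")
    u = rng.random()
    t_r = inv_cdf(u + (1.0 - u) * f_e) - t_elapsed
    # Guard against tiny negative values caused by rounding when u is ~0.
    return max(0.0, t_r)
```

With shape 1 the Weibull reduces to an exponential distribution, so the sampled remaining time should have the same mean as the original variable (memorylessness), which makes a convenient sanity check.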
Conversely, when \(F^{-1}\) cannot be written in closed form (e.g. for log-normal and gamma distributions), we use the rejection method. The method is simple:
1. generate \(t\sim X\), and
2. if \(t> t_e\) output \(t\), otherwise repeat from 1.
However, as \(t_e\) increases, the likelihood of repeating the loop increases as well, which could make the algorithm inefficient. In fact, if N is the number of iterations until \(t\) is successfully chosen, then \(E(N)=\frac{1}{1-F(t_e)}\). In view of this, we only run the algorithm if \(F(t_e)\leqslant 0.75\), in which case \(E(N)\leqslant 4\). Otherwise we simply keep the previously sampled value.
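A minimal sketch of the guarded rejection step follows, using a log-normal variable as the example. The function names and the way the previous sample is passed in are assumptions made for illustration; only the loop and the 0.75 bound come from the text.

```python
import math
import random

def lognormal_cdf(t, mu, sigma):
    """CDF of a log-normal distribution (no closed-form inverse)."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))

def expected_iterations(f_te):
    """E(N) = 1 / (1 - F(t_e)) for the rejection loop."""
    return 1.0 / (1.0 - f_te)

def resample_by_rejection(draw, cdf, t_elapsed, old_time, max_cdf=0.75):
    """Resample an occurrence time conditioned on exceeding t_elapsed.

    draw() returns a fresh unconditioned sample of X. If F(t_e) > max_cdf
    the loop is skipped and the previously sampled time is kept.
    """
    if cdf(t_elapsed) > max_cdf:
        return old_time
    while True:
        t = draw()
        if t > t_elapsed:
            return t
```

Note that the returned value is the absolute occurrence time \(t\), not the remaining time \(t_r\); subtracting \(t_e\) gives the latter.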

5 Toolchain

Figure 6 outlines the complete toolchain implemented to deploy the theory described above. To model the input RFT, we introduce the new Kepler textual format, which extends Galileo [21, 55, 56], a widespread syntax to describe fault trees [9, 25, 44]. Kepler follows the syntax of Galileo, adding support for repairs and non-Markovian distributions. We present Kepler in Sec. 5.1 and give its complete grammar in Appendix A.
The toolchain is structured as follows. The Kepler specification file is given as input to a Python converter script that produces three outputs: the IOSA specification that encodes the semantics of the tree, the property queries for unreliability or unavailability synthesised for the tree, and our compositional importance functions in terms of variables of the IOSA semantic model. This information is dumped into a single text file and fed to FIG, a statistical model checker specialised in importance splitting RES. FIG interprets this importance function, deploying it into its internal model representation, which results in a global function for the whole tree. FIG can then use ISPLIT algorithms such as RESTART and Fixed Effort via the automatic methods described above. The results are confidence intervals that estimate the reliability or availability of the RFT.

5.1 The Kepler  language

Standard Galileo supports three PDF families, namely exponential, Weibull, and log-normal. Kepler extends Galileo with arbitrary failure distributions; its full syntax is given in Code 5. The current definition of Kepler supports a particular set of distributions, but it can be straightforwardly extended to support others.
Kepler  is a declarative language. Each line describes a node in the tree by its name, type, some extra characteristics and the names of its children. When declaring a (spare) basic element, we use the keyword https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq451_HTML.gif to precede the definition of its failure distribution, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq452_HTML.gif for the repair distribution and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq453_HTML.gif for its dormant failure distribution. The presence of a dormancy failure distribution is the only distinguishable factor between the definition of a \(\textsf {\MakeUppercase {BE}}\) and the definition of a \(\textsf {\MakeUppercase {SBE}}\). \(\textsf {\MakeUppercase {SPARE}}\) s are defined by the keyword https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq457_HTML.gif .
Code 1 provides an example of a \(\textsf {\MakeUppercase {SPARE}}\) gate (\(\texttt {Gate2}\)) with a primary basic element (\(\texttt {BE\_C}\)) and a spare one (\(\texttt {BE\_D}\)). Their respective failure PDFs are Rayleigh (\(\sigma =0.06\)) and exponential (\(\lambda =0.00111\)). Notice that, unlike Galileo, we allow the dormancy PDF of an \(\textsf {\MakeUppercase {SBE}}\) to be independent of its failure PDF. Thus we define the dormancy of \(\texttt {BE\_D}\) as an \(\text {Erlang}(k=3,\lambda =9)\).
Furthermore, Kepler  supports the use of (multiple) repair boxes with the keywords https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq467_HTML.gif . The first parameter after https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq468_HTML.gif defines the policy for serving the queue of failed \(\textsf {\MakeUppercase {BE}}\) s and \(\textsf {\MakeUppercase {SBE}}\) s. Currently, only non-preemptive priority policies are provided through the keyword https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq471_HTML.gif . Other policies are proposed in [45]. For example, in Code 2, all \(\textsf {\MakeUppercase {BE}}\) s are repairable, with repair time uniformly distributed. Line 6 of the code defines the \(\textsf {\MakeUppercase {RBOX}}\) of the system, which handles one repair at a time with the priority given by the order of the list. Thus, for instance, if \(\texttt {BE\_E}\) and \(\texttt {BE\_F}\) fail while \(\texttt {BE\_G}\) is being repaired, \(\texttt {BE\_E}\) will be chosen next.
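The non-preemptive priority policy described above can be sketched as a small queue model. This is an illustration only, not the IOSA semantics generated by the toolchain; the class name `RepairBox` and the priority list used in the example are assumptions, since Code 2 is not reproduced here.

```python
class RepairBox:
    """Serves failed elements one at a time, by fixed priority order."""

    def __init__(self, priority_order):
        self.priority_order = list(priority_order)  # earlier = higher priority
        self.waiting = set()                        # failed, not yet served
        self.in_repair = None                       # element being repaired

    def fail(self, name):
        """Register a failure; an ongoing repair is never interrupted."""
        if name not in self.priority_order:
            raise ValueError(f"{name} is not handled by this repair box")
        if self.in_repair is None:
            self.in_repair = name
        else:
            self.waiting.add(name)

    def finish_repair(self):
        """Complete the current repair and pick the next element to serve."""
        repaired = self.in_repair
        self.in_repair = next(
            (n for n in self.priority_order if n in self.waiting), None
        )
        if self.in_repair is not None:
            self.waiting.discard(self.in_repair)
        return repaired
```

With the assumed order `["BE_E", "BE_F", "BE_G"]`, if `BE_E` and `BE_F` fail while `BE_G` is in repair, the box finishes `BE_G` first and then serves `BE_E`, matching the behaviour described in the text.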

5.2 Compiling Kepler  to IOSA

We developed a Python textual converter that takes as input an RFT modelled in  Kepler. The converter automatically produces 3 outputs:
1. The IOSA model of the input RFT,
2. A property specification for evaluating the unreliability or unavailability of the tree, and
3. The importance functions for the tree.
The translation of a Kepler model to an IOSA specification follows the semantics of RFTs defined in [45, 46]. As an example, Code 3 shows the IOSA module corresponding to the basic element \(\texttt {BE\_E}\) in Code 2.
The syntax of IOSA is close to that of PRISM [41] but includes primitives to handle stochastic timing. In IOSA, systems are modelled as a set of interacting processes which communicate by synchronising equally named transitions. Transitions are split into input and output. Output transitions are generative, while input transitions are reactive and only take place by synchronising with output transitions of other modules. IOSA has a discrete-event continuous-time semantics. All clock variables in an IOSA model count down at the same rate and can be set to values sampled from their associated probability distribution. An example of setting clock \(\texttt {BE\_E\_rc}\) is seen at the end of line 10 in Code 3. When a clock variable expires (i.e. reaches zero), it may enable an output transition. We indicate this by preceding the clock name with the symbol \(\texttt {@}\) in the guard of a transition (see, for example, expression \(\texttt {@ BE\_E\_fc}\) at line 6 of Code 3).
The unreliability and unavailability queries are encoded as variants of PCTL [32] and CSL [2]. To do so, the script automatically identifies the state characterisation of the top-level event. This state condition depends solely on the semantic model of the top-level gate. For instance, Code 4 shows the unavailability property generated for the RFT of Code 2. Here, \(\texttt {Gate3\_count == 3}\) characterises the failing of the top-level \(\textsf {\MakeUppercase {AND}}\) gate with 3 children.
Finally, all importance functions are calculated following Tables 1 and 2. We recall that https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq484_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq485_HTML.gif are only available for models without \(\textsf {\MakeUppercase {PAND}}\) s.

5.3 FIG: RES to estimate rare dependability metrics

The FIG tool was devised to study temporal logic queries of IOSA models [10], described either in their native syntax or in the JANI model exchange format [16]. Using RES embedded in statistical model checking, FIG computes CIs that estimate the value with which a model satisfies (1) transient and (2) steady-state properties. An example of (1) is the probability of observing a system failure before a given mission time T, i.e. the unreliability value https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq487_HTML.gif . An example of (2) is the proportion of time that a repairable system remains inoperative, i.e. the unavailability value \(\text {UNAVA}\).
FIG was designed for automatic RES, implementing the algorithms from [14, 15] to derive an importance function from the system model. For this, FIG uses the property query to identify the states of relevant IOSA modules that represent the rare event. However, in FTA the relevant information is in the structure of the tree, not in the query, so this strategy fails to produce useful importance functions.
FIG can also take a composition function as input, to aggregate the local importance functions of the system modules that are relevant for the rare event. We exploit this feature with our Kepler to IOSA compiler, thus instructing FIG how to implement the importance functions from Sects. 4.1 and 4.2. Note nonetheless that those functions depend at least on the state of \(\textsf {\MakeUppercase {BE}}\) s and \(\textsf {\MakeUppercase {SBE}}\) s or, in the general case, on the output https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq491_HTML.gif of all nodes \(v\) in Table 1 for which https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq493_HTML.gif appears in https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq494_HTML.gif . So FIG offers an option to build local importance functions for those nodes, namely \({\{\textsf {\MakeUppercase {BE}},\textsf {\MakeUppercase {SBE}},\textsf {\MakeUppercase {PAND}},\textsf {\MakeUppercase {SPARE}} \}}\), which are the base cases of the global importance function for the whole tree.
Thus from an FT model and via our toolchain, FIG can build the thresholds required to perform ISPLIT using several heuristics, and then run diverse RES algorithms to estimate rare event properties (see e.g. [11, 13, 17]).

6 Experimental evaluation

To demonstrate the effectiveness of our theory, we used this toolchain to compute the unreliability and unavailability of 27 highly resilient repairable non-Markovian DFTs. These models come from seven case studies from the literature, which we enriched with \(\textsf {\MakeUppercase {RBOX}}\) elements and non-Markovian failure, dormancy, and repair stochastic distributions.

6.1 General setup

We estimated the \(\text {UNAVA}\) or unreliability of each tree in increasingly resilient configurations. Thus we show how Crude Monte Carlo (CMC) loses efficiency in comparison to our automatic implementations of RES, with an efficiency gap that increases as the dependability metrics become rarer.
Moreover, we compare the quality of the RES algorithms that result from three different importance functions, namely: the compositional function from Sec. 4.1 defined recursively in the structure of the tree \(\triangle \), denoted structural ( https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq499_HTML.gif ); function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq500_HTML.gif from Sec. 4.2 that works on the set of minimal cut sets of \(\triangle \); and its normalised variant https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq502_HTML.gif . We did not consider the pruned variants of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq503_HTML.gif , which [12] reported as less promising.
To compare the efficiency of these functions empirically, we passed them as composition functions to FIG for different types of ISPLIT algorithms (engines in FIG terminology) and heuristics for threshold selection. We tested the engines: Fixed Effort [27] with different effort values, i.e. different amounts of partial runs performed in each https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq504_HTML.gif region; standard RESTART [62], for different values of splitting per threshold; and RESTART with prolonged retrials of level 2 [61], RESTART-\(\text {P}_2\), also for different values of splitting per threshold. For each engine we experimented with three methods to build thresholds: a modified Sequential Monte Carlo algorithm [15] using global splitting 2 (for the RESTART variants) or global effort 8 (Fixed Effort); the same with values 5 and 16 resp.; and the Expected Success algorithm [13], which computes splitting/effort values independently for each https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq506_HTML.gif region.
All these configurations result in nine ISPLIT variants: FE\(_m\) for \({m=8,16,\text {es}}\) and RST\(_n\), RST2\(_n\) for \({n=2,5,\text {es}}\). If one of the importance functions https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq512_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq513_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq514_HTML.gif shows better efficiency than the rest to analyse all case studies, in one or more of these ISPLIT variants without showing worse performance in the rest, then we conclude that it is (for our practices) a higher-quality function for RES.
More precisely, we define an instance \(\mathfrak {y}\) as a combination of an algorithm algo (i.e. an ISPLIT variant using one of the three functions), an RFT, and a dependability metric. An RFT is identified by a case study (\({{\textit{CS}}}\)) and a parameter (\(\mathfrak {p}\)), where larger values of the parameter of the RFT \(\textit{CS}_p\) indicate smaller dependability values \(p_{{{\textit{CS}}}_p}\). Running algo for a fixed simulation time, instance \(\mathfrak {y}\) estimates the value https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq521_HTML.gif . We fix the confidence coefficient \(\delta =0.95\), so this experiment produces a confidence interval (CI) \({\hat{p}}_\mathfrak {y}\) that has a certain width \(\Vert {{\hat{p}}_\mathfrak {y}} \Vert \in [0,1]\). The performance of algo can be measured by that width: the smaller \(\Vert {{\hat{p}}_\mathfrak {y}} \Vert \), the more efficient the algorithm that achieved it.
This is a direct and standard approach to quantify efficiency: measure the confidence interval width for a fixed simulation budget. There are, however, two more dimensions to consider. First, the simulation budget may not suffice to observe rare events in certain cases, e.g. when using CMC on an instance with very low dependability value \(p_\mathfrak {y}\). In such cases the FIG tool reports a null estimate \({\hat{p}}_\mathfrak {y}=[0,0]\), which is an indication of poor performance. Second, the simulation of random events depends on the RNG (and its seed) used by FIG, so different runs yield CIs of different width. This is another indicator: the less variability observed for these widths, the better the algorithm, provided that the CI is narrow, i.e. relative to the first dimension mentioned.
We use these three dimensions to assess the performance of an algo: its capability to converge, the expected CI width achieved, and the variability of these widths. For this we repeated 10 times the estimation of \({\hat{p}}_\mathfrak {y}\) for each instance \(\mathfrak {y}\), measuring: (i) how many times it yielded not-null estimates; (ii) what was the average width \(\Vert {{\hat{p}}_\mathfrak {y}} \Vert \); and (iii) what was the standard deviation of those widths.
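The three measurements can be computed from the repeated runs as in the sketch below. The input format (a list of interval bounds, with `(0.0, 0.0)` marking a null estimate) and the use of the sample standard deviation are assumptions made for illustration; the name `summarise_runs` is hypothetical.

```python
import statistics

def summarise_runs(intervals):
    """Summarise repeated CI estimates for one instance.

    intervals: list of (low, high) pairs, one per repetition, where
    (0.0, 0.0) denotes a null estimate.
    Returns (number of non-null CIs, mean width, std. dev. of widths);
    the last two are None when they cannot be computed.
    """
    widths = [hi - lo for lo, hi in intervals if (lo, hi) != (0.0, 0.0)]
    n = len(widths)
    mean_width = statistics.fmean(widths) if n >= 1 else None
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd_width = statistics.stdev(widths) if n >= 2 else None
    return n, mean_width, sd_width
```

An algorithm that yields few non-null CIs, or wide or highly variable ones, scores poorly under this summary.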
We performed 10 repetitions to ensure statistical significance: in the bar plots that we present in Sec. 6.3, a 95% CI for a bar is narrower than the whiskers and, in the hardest configuration of every \(\textit{CS}\), the whiskers of algorithms under comparison never overlap.
Case studies. Our seven parametric case studies are:
1. the synthetic model DSPARE\(\mathfrak {p}\) [12], with \(\mathfrak {p}\in \{3,4,5\}\) shared \(\textsf {\MakeUppercase {SBE}}\) s and 1 \(\textsf {\MakeUppercase {RBOX}}\);
2. the synthetic model \(\textsf {\MakeUppercase {VOT}}\) \(\mathfrak {p}\) [12], with \(\mathfrak {p}\in \{1,\ldots ,4\}\) shared \(\textsf {\MakeUppercase {BE}}\) s and 1 \(\textsf {\MakeUppercase {RBOX}}\);
3. FTPP\(\mathfrak {p}\) [26], where we study one triad with \(\mathfrak {p}\in \{4,5,6\}\) shared \(\textsf {\MakeUppercase {SBE}}\) s, using one \(\textsf {\MakeUppercase {RBOX}}\) for the processors and another for the network elements;
4. HECS\(\mathfrak {p}\) [57], with 2 memory interfaces, 4 \(\textsf {\MakeUppercase {RBOX}}\) (one per subsystem), \(\mathfrak {p}\in \{1,\ldots ,5\}\) shared spare processors, and \(2\mathfrak {p}\) parallel buses;
5. RC\(\mathfrak {p}\) [30], with one \(\textsf {\MakeUppercase {RBOX}}\) and \(\mathfrak {p}\in \{3,\ldots ,6\}\) \(\textsf {\MakeUppercase {SPARE}}\) s;
6. HVC\(\mathfrak {p}\) [31], with one \(\textsf {\MakeUppercase {RBOX}}\) and \(\mathfrak {p}\in \{4,\ldots ,7\}\) shared \(\textsf {\MakeUppercase {SBE}}\) s;
7. RWC\(\mathfrak {p}\) [53], with \(\mathfrak {p}\in \{4,\ldots ,7\}\), which is a non-trivial combination of RC\(\mathfrak {p}\) and HVC\(\mathfrak {p}\) via \(\textsf {\MakeUppercase {VOT}}\) and \(\textsf {\MakeUppercase {OR}}\) gates that maintains both independent \(\textsf {\MakeUppercase {RBOX}}\) elements.
In total we have 27 RFTs with PDFs that include exponential, Erlang, uniform, Rayleigh, Weibull, normal, and log-normal distributions. All details of these case studies can be found in the artifact accompanying this work, where we provide the fault trees written in the Kepler syntax, as well as their IOSA translations [18].
Transient vs. steady-state. From the nine ISPLIT variants, only the three based on standard RESTART (RST\(_n\)) could be used to estimate both types of properties. Fixed Effort (FE\(_m\)) was not used for steady-state properties as this requires regeneration theory [27], which is not always feasible with non-Markovian models. Conversely, RESTART-\(\text {P}_2\) (RST2\(_n\)) was not used for transient properties because there is no current support for this combination of engine and property in FIG.
Hardware. Our experiments ran in a PBS-administered cluster running Linux CentOS 7 (kernel 3.10.0-957), whose nodes have CPUs Intel® Xeon® E5-2680 v4@2.40 GHz (14 cores, 35M cache), each with 384 GB of DDR4 RAM @1600 MHz.

6.2 Experimental results: scatter plots

We start by presenting scatter plots, which offer a high-level overview of our general results, later refined into bar plots.
Fig. 7 compares all runs of CMC against all RES runs that used RESTART. We arrange the plots in a \({3\times 3}\) matrix, where a row indicates a way to select thresholds, and a column indicates an importance function. For instance, the upper-left scatter plot of Fig. 7 compares CMC against RESTART using the https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq575_HTML.gif function and thresholds built with Sequential Monte Carlo for global splitting 2. Instead, the lower-right scatter plot compares CMC to RESTART using the https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq576_HTML.gif function and the Expected Success algorithm.
In each scatter plot, a mark at coordinates (x, y) corresponds to an instance whose CI width was x for RESTART and y for CMC. Thus, a mark above the solid diagonal line means that RESTART built a narrower CI than CMC in the same simulation time. Dotted diagonal lines indicate \(10\times \) narrower CIs (note that the axes are in logarithmic scale). A mark on the upper- or right-most bars labelled \(\emptyset \) respectively indicates that none of the 10 CMC or RESTART experiments managed to build a CI (e.g. CMC for \(\textsf {\MakeUppercase {VOT}}\)4).
These x and y values are the robust averages of the corresponding 10 + 10 runs, computed via Z-score\(_{m=2}\) to remove outliers [34]. Mark styles differentiate the case studies and properties, e.g. https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq581_HTML.gif compare unavailability for the four \(\textsf {\MakeUppercase {VOT}}\)\({\mathfrak {p}}\) case studies, while https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq584_HTML.gif represent unreliability for RWC\({\mathfrak {p}}\).
Fig. 7 shows that in general, and as expected, RESTART performs at least as well as CMC to estimate the dependability metrics, with an efficiency gap that increases as the RFT gains resilience. However, this relative gain concerns \(\text {UNAVA}\) studies, which are the first-row models in the legend of Fig. 7, marked with an A superscript and blue-tinted colours. For unreliability, RESTART fails to surpass CMC significantly: we will show next that Fixed Effort is a much better ISPLIT algorithm in these cases.
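A robust average of this kind can be sketched as below. The exact procedure of [34] is not given in the text, so this sketch assumes a common interpretation: discard every value whose Z-score exceeds m in absolute value, then average what remains. The function name `robust_mean` is hypothetical.

```python
import statistics

def robust_mean(values, m=2.0):
    """Mean of the values whose Z-score satisfies |z| <= m.

    Assumption: Z-scores use the mean and population standard deviation
    of the full sample; this may differ from the procedure in [34].
    """
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        return mu  # all values are equal, nothing to discard
    kept = [v for v in values if abs(v - mu) <= m * sd]
    # Fall back to the plain mean if the filter removed everything.
    return statistics.fmean(kept) if kept else mu
```

For example, nine widths equal to 1.0 and one equal to 100.0 give a robust mean of 1.0 under this rule, since only the large value lies more than two standard deviations from the mean.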
When comparing importance functions, Fig. 7 shows a tendency that favours https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq587_HTML.gif , evidenced in (each row of) plots by the higher mass of marks above the diagonal that occur on the column corresponding to https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq588_HTML.gif . Nevertheless, there are two outliers in this trend that we discuss here.
First, RESTART with https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq589_HTML.gif and Expected Success failed to build any CI for \(\text {UNAVA}\) of the RFT \(\textsf {\MakeUppercase {VOT}}\)3 (mark https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq592_HTML.gif on the right-most bar of the bottom-left scatter plot). The output logs of FIG revealed that Expected Success failed to select a threshold at importance value 44 (out of 70). Although FIG has recovery heuristics for such situations, in this case they resulted in the selection of a splitting value of 2234. This contrasts with the values observed in all other experiments, which seldom exceed 200 and never exceed 400. Therefore, oversampling occurred in the https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq593_HTML.gif region built above that threshold, which produced a bottleneck that could not be overcome in the 30 min of runtime allowed for this experiment, hence the lack of results.
Second, RESTART with https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq594_HTML.gif and Sequential Monte Carlo for global splitting 5 failed to build any CI for HVC7 when studying unreliability. This is less surprising given the lower efficiency of RESTART to study unreliability in our experiments, and the nature of this case study, which we discuss next when analysing Fixed Effort. Still, we studied the output logs of FIG, and in this case found undersampling caused by thresholds set too far apart by Sequential Monte Carlo. This is a known issue for this algorithm, which uses a single splitting value for all thresholds.
Thus, both outliers are rooted in a combination of the algorithms to build thresholds and their current implementation in FIG. The better performance of RESTART with https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq595_HTML.gif for the rest of the instances, when compared to the columns of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq596_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq597_HTML.gif , tilts the scale in favour of the structural function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq598_HTML.gif defined in Sec. 4.1.
Figure 8 shows our experimental results for RESTART-\(\text {P}_2\), which essentially delays the truncation of simulation retrials for 2 threshold levels. In comparison, RESTART can be defined as RESTART-\(\text {P}_0\), which truncates a retrial as soon as it visits a state whose importance is below the level of creation of the retrial. Empirical and theoretical studies have shown RESTART-\(\text {P}_1\) and RESTART-\(\text {P}_2\) to be more efficient than RESTART in the analysis of queueing systems [17, 61]. Here we experiment with the latter, within the current capabilities of FIG (\(\text {UNAVA}\) properties only), to further validate our previous discussion for https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq613_HTML.gif .
For that function, Fig. 8 shows a similar trend to Fig. 7, with the following difference. When thresholds are selected with Sequential Monte Carlo for global splitting 2 (upper-left scatter plots in the figures), RESTART-\(\text {P}_2\) managed to build narrower CIs than RESTART, most notably for the RWC\({\mathfrak {p}}\) case studies. The opposite effect is observed (but to a lower degree) when thresholds are selected with the Expected Success algorithm.
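The difference between RESTART and its prolonged variants can be sketched as a truncation predicate. This is a simplified reading of the description above, not the engine's code: levels are assumed to be integer indices of the splitting region a state lies in, and the name `must_truncate` is hypothetical.

```python
def must_truncate(current_level: int, creation_level: int, prolong: int = 0) -> bool:
    """Decide whether a retrial is discarded under RESTART-P_j.

    current_level:  region index of the state the retrial is visiting.
    creation_level: region index at which the retrial was spawned.
    prolong:        j in RESTART-P_j (0 gives standard RESTART).

    Assumption: a retrial survives until it falls more than `prolong`
    levels below the level where it was created. The main (original)
    simulation run is never truncated and is not covered here.
    """
    return current_level < creation_level - prolong
```

Under this reading, a retrial created at level 3 that drops to level 2 is truncated by standard RESTART but kept by RESTART-\(\text {P}_2\), which only discards it once it reaches level 0.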
However interesting, this does not change the fact that for any row of scatter plots, those produced with the https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq616_HTML.gif function in Fig. 8 generally produced the narrower CIs. Also, being more sensitive than RESTART to the choice of thresholds via Expected Success, RESTART-\(\text {P}_2\) puts in evidence that https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq618_HTML.gif can result in ISPLIT algorithms that perform worse than CMC, e.g. the lower-central scatter plot of Fig. 8. Thus, all in all these results increase the evidence in favour of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq619_HTML.gif as the most efficient of the three importance functions tested, at least for steady-state analysis.
Finally, Fig. 9 shows our experimental results when using Fixed Effort to study . Something that jumps to sight is the high number of failures of the algorithm for all importance functions, but exclusively for the HVC\({\mathfrak {p}}\) case studies (plus some RWC\(\mathfrak {p}\)), and paradoxically for the less rare instances of those models.
To understand this we study the structure of these RFTs, whose smallest instance (HVC4) require 6 basic elements to fail in order to trigger a top event. This is not too favourable for ISPLIT, specially considering the fast repair times (uniformly distributed in [0.15, 0.45]) with respect to failure times (\(\text {Erlang}(3,0.25)\) and \(\text {Rayleigh}(1.999)\)). Still this did not stop RESTART to at least match the performance of CMC.
In general, the issue with Fixed Effort is that it is more structured, and thus more brittle than RESTART (and CMC), as shown here. By conditioning the success of the whole run on the chained success on every https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq628_HTML.gif region, a single failed step produces a 0 estimate and starts all estimations anew. This can be very efficient—see e.g. the CIs \(\approx 100\times \) narrower than CMC for HECS5—as it avoids to waste effort in unpromissing simulations. In HVC however, where repair times are extremely faster than failures, such reset condition happens almost always before the top event of the tree. This does not affect RESTART and CMC so badly, which continue simulations as before. But for Fixed Effort and given the short runtimes allowed for the smallest instances—HVC4 and HVC5 are truncated after 90 and 300 s respectively—it results in null estimates as observed in Fig. 9. This is related to the fact that none of our importance functions considers time—the value of IOSA clocks—as a factor for splitting. We touch upon this subject in the conclusions.
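The brittleness described above can be seen in a toy version of the Fixed Effort estimator. This is a deliberately simplified sketch: each region is abstracted as an independent Bernoulli trial, whereas a real implementation restarts simulations from the states saved at the previous threshold. All names are hypothetical.

```python
import random

def bernoulli_steps(probs):
    """Toy model: crossing region k succeeds with probability probs[k]."""
    return lambda k, rng: rng.random() < probs[k]

def fixed_effort_estimate(step_success, n_regions, effort, rng):
    """Product of per-region success ratios, using `effort` runs per region.

    step_success(k, rng) -> bool simulates one attempt to cross region k.
    If no attempt crosses some region, the chain is broken and the whole
    estimate is 0, which is the reset condition discussed in the text.
    """
    estimate = 1.0
    for k in range(n_regions):
        hits = sum(step_success(k, rng) for _ in range(effort))
        if hits == 0:
            return 0.0
        estimate *= hits / effort
    return estimate
```

In this toy setting, a region whose crossing probability is tiny compared to 1/effort makes a null estimate the most likely outcome of a run, regardless of how easy the other regions are.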
For the rest of the cases, where redundancy (rather than fail vs. repair time) is the root factor for resilience, Fig. 9 shows an excellent performance of Fixed Effort to estimate unreliability. The dominance of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq630_HTML.gif is not as clear here as it is for \(\text {UNAVA}\) studies via RESTART and RESTART-\(\text {P}_2\); however, neither of the other two functions is clearly superior. Consider e.g. DSPARE\(\mathfrak {p}\), where https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq634_HTML.gif shows consistently better results than https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq635_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq636_HTML.gif ; and also the best-performance cases, namely HECS\(\mathfrak {p}\), where all functions perform similarly for the different ways of choosing thresholds.
Therefore, our general observations remain favourable for https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq638_HTML.gif , as the importance function that produces the most efficient ISPLIT implementations in general scenarios.

6.3 Experimental results: bar plots

Unlike the scatter plots in Figs. 7 to 9, the bar plots in this section show the variance of the CI widths produced by each algorithm. These are plotted as whiskers on top of the bars, where the height of a bar indicates the width of the CI achieved by the corresponding instance. Numbers in the range [0, 10] at the base of the bars tell how many of the 10 experimental repetitions managed to build a non-null CI. The label “\(p_\mathfrak {p}\approx \mu \pm \sigma ^2\)” to the right of each plot shows the robust mean and variance estimated for the corresponding dependability metric, computed from the complete series of runs (190 independent experiments per case study).
Steady-state studies for RC. Fig. 10 shows the widths of the CIs produced for unavailability estimation on the RC\(\mathfrak {p}\) case studies. The plots illustrate how https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq641_HTML.gif performs better than both https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq642_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq643_HTML.gif in every one of the ISPLIT variants tested. This difference between the RES implementations resulting from https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq644_HTML.gif and the other functions increases for lower values of the steady-state property, and the variance of these results remains among the lowest of all cases. Moreover, here https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq645_HTML.gif is the only function with which all RESTART algorithms maintain or increase the efficiency gap that sets them apart from CMC.
Transient studies for DSPARE. Fig. 11 shows the widths of the CIs produced for unreliability estimation on the case studies DSPARE\(\mathfrak {p}\). Once again we see https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq653_HTML.gif outperforming the other two functions in general, with an efficiency gap and accuracy that increase as the event becomes rarer. However, as discussed in Sec. 6.2, in this case this only happens with Fixed Effort variants, since RESTART fails to perform better than CMC. Yet the difference between the Fixed Effort implementations and the rest (especially with the https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq654_HTML.gif function) is remarkable, and in particular Fixed Effort is the only algorithm capable of producing consistently useful CIs (i.e. that exclude 0).

7 Related work

Most work on DFT analysis assumes discrete [4, 57] or exponentially distributed [23, 40] component failures. Furthermore, component repair is seldom studied in conjunction with dynamic gates [4, 7, 40, 44, 54]. In this work we addressed repairable DFTs, whose failure and repair times can follow arbitrary PDFs. In more detail, RFTs were first formally introduced as stochastic Petri nets in [7, 20]. Our work stands on [45, 46], which review [20] in the context of stochastic automata with arbitrary PDFs. In particular we also address non-Markovian continuous distributions: in Sec. 6 we experimented with exponential, Erlang, uniform, Rayleigh, Weibull, normal, and log-normal PDFs. Furthermore, and for the first time (with the exclusion of [12, 19] on which this work stands), we consider the application of [20, 45] to study rare events.
Much effort in RES has been dedicated to studying highly reliable systems, deploying either importance splitting or sampling. Typically, importance sampling can be used when the system takes a particular shape. For instance, a common assumption is that all failure (and repair) times are exponentially distributed with parameters \(\lambda ^i\), for some \(\lambda \in {\mathbb {R}} \) and \(i\in {\mathbb {N}} _{>0}\). In these cases, a favourable change of measure can be computed analytically [29, 33, 47, 48, 53, 65].
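To make the notion of a change of measure concrete, the following sketch applies importance sampling to a single exponential variable. It is a textbook one-variable case and not the failure-biasing schemes of the cited works; the function name and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def tail_prob_importance_sampling(rate, threshold, biased_rate, samples, rng):
    """Estimate P(X > threshold) for X ~ Exp(rate), sampling from Exp(biased_rate)."""
    total = 0.0
    for _ in range(samples):
        x = rng.expovariate(biased_rate)
        if x > threshold:
            # Likelihood ratio f(x) / g(x): original density over biased density.
            original = rate * math.exp(-rate * x)
            biased = biased_rate * math.exp(-biased_rate * x)
            total += original / biased
    return total / samples


if __name__ == "__main__":
    rng = random.Random(1)
    print(tail_prob_importance_sampling(1.0, 20.0, 0.05, 20000, rng))
    print(math.exp(-20.0))  # exact tail probability, for comparison
```

Sampling from a slower-decaying exponential makes the event `x > threshold` frequent, and the likelihood ratio corrects the bias. The closed-form ratio is available here precisely because both densities are exponential, which is the kind of structural assumption the paragraph above refers to.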
In contrast, when events occur at times following less-structured or even arbitrary distributions, importance splitting is more easily applicable. As long as a full system failure can be broken down into several smaller failures, an importance splitting method can be devised. Of course, its efficiency relies heavily on the choice of importance function. This choice is typically done ad hoc for the model under study [43, 58, 60]. In that sense [14, 15, 35, 36] are among the first to attempt a heuristic derivation of all parameters required to implement splitting, for which they exploit formal specifications of the model and property query.
Here we extended [10, 14, 15] in two different ways. One is the natural way in which we use the structure of the fault tree to define composition operands. With these operands we aggregate the automatically-computed local importance functions of the tree nodes. This aggregation results in an importance function for the whole model, that we present in Table 1 as the “structural” function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq663_HTML.gif .
The other extension relates to [58, 59], where [58] initially defined a “state variable” (an importance function for the RESTART algorithm) \(S({\varvec{x}})=\max _i\{c_i({\varvec{x}})\}\), where \(c_i({\varvec{x}})\) is the number of components of type i that are failed in state \({\varvec{x}}\). Even though this is defined for a specific system, in essence and viewing the system as a fault tree, \(S({\varvec{x}})\) counts the maximum number of failed components in any MCS. This was generalised in [59] using cut set analysis to define an importance function which, in our setting, is given by https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq668_HTML.gif , where in turn \(\min _{|{\textit{MCS}}|} = \big (\min _{{\textit{MCS}}\in {\mathcal {M}}(\triangle )}|{\textit{MCS}}|\big )\). Unlike \(S({\varvec{x}})\), \({\Phi }({\varvec{x}})\) does not require all cut sets to have the same cardinality. However, both functions are hindered when the branches of the fault tree have different failure probabilities.
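The reading of \(S({\varvec{x}})\) as the maximum number of failed components in any MCS can be sketched as follows. The cut sets and component names are hypothetical, and the helper names are assumptions introduced only for this example.

```python
def max_failed_in_any_mcs(cut_sets, failed):
    """Largest number of failed components found in a single cut set."""
    failed = set(failed)
    return max(len(mcs & failed) for mcs in cut_sets)


def top_event(cut_sets, failed):
    """The tree fails once every component of some cut set has failed."""
    failed = set(failed)
    return any(mcs <= failed for mcs in cut_sets)


# Hypothetical tree with cut sets of different cardinality.
CUT_SETS = [frozenset({"a", "b"}), frozenset({"c", "d", "e", "f"})]

if __name__ == "__main__":
    for state in ({"a", "b"}, {"c", "d"}):
        print(sorted(state),
              max_failed_in_any_mcs(CUT_SETS, state),
              top_event(CUT_SETS, state))
```

Both states receive the value 2, although the first is already a system failure and the second still needs two more components to fail. This is one way to see why a raw count is only informative when all cut sets have the same cardinality.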
In [19] it was proposed to alleviate that issue via cut set pruning and importance normalisation, but the former strategy proved inefficient in the general case. Thus here we chose to experiment with the original and normalised variants of this function, i.e. https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq672_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq673_HTML.gif in Table 2.

8 Conclusions

We have presented a theory to deploy automatic importance splitting (ISPLIT) for fault tree analysis of repairable dynamic fault trees (RFTs). This Rare Event Simulation approach supports arbitrary probability distributions of component failure and repair. The core of our theory lies in the general definition of importance functions. Thus, we provide the importance function https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq674_HTML.gif , which is defined structurally on the given tree \(\triangle \), and a family of importance functions, including https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq676_HTML.gif and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq677_HTML.gif , derived from the collection of minimal cut sets of \(\triangle \).
From such functions we have implemented ISPLIT algorithms and used them to estimate the unreliability and unavailability of highly-resilient RFTs. Setting itself apart from classical approaches, which define importance functions ad hoc using expert knowledge, our theory computes all metadata required for RES from the model and metric specifications. From this basis, we have shown how diverse ISPLIT algorithms can be automatically implemented from https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq679_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq680_HTML.gif , or https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq681_HTML.gif . Our experimentation shows that these algorithms can converge to narrower confidence intervals than crude Monte Carlo simulation (CMC).
The efficiency gap observed between CMC and the automatic ISPLIT implementations (i.e. how narrow the CIs achieved for a fixed simulation budget are) depends on a number of factors. These include the type of property, the specific RES algorithm, the relation between the fail and repair times of the \(\textsf {BE}\)s and \(\textsf {SBE}\)s of the tree, and (to a lesser degree) also the heuristic chosen to select thresholds. Nevertheless, in all cases, implementations with the structural importance function proved to be at least as performant as those based on minimal cut sets, and in many cases consistently better, most prominently with RESTART variants for RES.
Another main advantage of https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq684_HTML.gif over https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq685_HTML.gif and its variants is that the former is linear in the size of the tree, and its bottom-up computation from the tree structure (regardless of whether it is a proper tree or not) is likewise linear in the tree size. In contrast, https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq686_HTML.gif and its variants are worst-case exponential in the size of the tree. This has not been a problem for the relatively small RFTs considered here, but in industrial applications with thousands of \(\textsf {BE}\)s this could quickly become a limiting factor. Moreover, importance functions are constantly evaluated during rare event simulation, so the computation overhead (see e.g. [64]) could easily tilt the scale against https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq688_HTML.gif even in the cases where the exponential explosion is not observed.
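The worst-case blow-up of cut set enumeration can be observed on a small synthetic example. The sketch below assumes a static tree with only AND and OR gates and pairwise distinct basic elements (so every enumerated cut set is minimal); it ignores dynamic gates and repairs, and the tree encoding and function names are assumptions made for this illustration.

```python
from itertools import product


def cut_sets(node):
    """Enumerate the cut sets of an AND/OR tree with distinct leaves.

    A leaf is a string; a gate is a pair (kind, children).
    """
    if isinstance(node, str):
        return [frozenset([node])]
    kind, children = node
    child_sets = [cut_sets(child) for child in children]
    if kind == "OR":
        # Any cut set of any child is a cut set of the gate.
        return [cs for sets in child_sets for cs in sets]
    # AND: pick one cut set per child and merge them.
    return [frozenset().union(*combo) for combo in product(*child_sets)]


def tree_size(node):
    """Number of nodes (gates plus leaves) in the tree."""
    if isinstance(node, str):
        return 1
    return 1 + sum(tree_size(child) for child in node[1])


def and_of_ors(gates, width):
    """Synthetic tree: an AND over `gates` OR gates with `width` leaves each."""
    return ("AND", [("OR", [f"be_{i}_{j}" for j in range(width)])
                    for i in range(gates)])


if __name__ == "__main__":
    tree = and_of_ors(10, 2)
    print(tree_size(tree), len(cut_sets(tree)))
```

For this family the tree has `1 + gates * (1 + width)` nodes but `width ** gates` cut sets, so a tree with 31 nodes already yields 1024 cut sets, whereas a single bottom-up pass such as `tree_size` visits each node once.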
Besides theoretical contributions, and to provide an empirical basis to our studies, we have presented a toolchain that demonstrates our automatic RES applications. With the aim of improving a former toolchain we have also introduced Kepler, a textual format to represent RFTs, and changed the ISPLIT engine of FIG to include conditional sampling. This modification increases trace independence during splitting and had a positive impact on the performance of the importance splitting methods in FIG with respect to previous versions. All the experimental results presented here can be inspected and reproduced with the software artifact that we have made publicly available to that end [18].
There are several paths open for future development. First and foremost, we are looking into new ways to define the importance function, e.g. to cover more general categories of FTs such as fault maintenance trees [51]. In addition, we have defined https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq689_HTML.gif , https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq690_HTML.gif , and https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq691_HTML.gif based on the tree structure alone. It would be interesting to further include stochastic information in this phase, and not only afterwards during the threshold-selection phase. Finally, we are investigating enhancements in IOSA and our toolchain, to exploit the ratio between fail and dormancy PDFs of \(\textsf {SBE}\)s in warm \(\textsf {SPARE}\) gates.

Acknowledgements

The authors thank Marco Biagi for the many discussions that led to the first implementation of the structural importance function in FIG. Thanks extend also to José and Manuel Villén-Altamirano, for fruitful discussions that helped to better understand the application scope of this approach.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Kepler grammar for repairable DFTs

Kepler is a textual, human-readable syntax to describe Fault Trees with dynamic gates (\(\textsf {PAND}\), \(\textsf {SPARE}\), \(\textsf {FDEP}\)), repairs (\(\textsf {RBOX}\)), and general continuous distributions. Although a subset of such distributions is currently described, the extension to others is direct. We refer to these extended fault trees as RFTs.
The first line in a Kepler description of an RFT declares which is the top-level gate. Each subsequent line is either empty or it defines a node in the RFT. Each node can be described either with our newly introduced syntax, or by using the legacy Galileo syntax.
In Code 5 we describe the full Kepler syntax in standard BNF notation. Grammar symbols are capitalised words between carets, e.g. https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq698_HTML.gif . In particular, the entry point of the syntax is the production rule https://static-content.springer.com/image/art%3A10.1007%2Fs10009-022-00675-x/MediaObjects/10009_2022_675_IEq699_HTML.gif . The vertical bar ‘|’ separates options. Other characters in the production rules are either literals or terminal symbols. Notice that our definition includes a formalisation of the legacy Galileo standard syntax.
Literature
3. Bayes, A.J.: Statistical techniques for simulation models. Aust. Comput. J. 2(4), 180–184 (1970)
21. Coppit, D., Sullivan, K.J.: Galileo: A tool built from mass-market applications. In: Proceedings of the 2000 International Conference on Software Engineering 2000, pp. 750–753. IEEE (2000)
34. Iglewicz, B., Hoaglin, D.: How to detect and handle outliers. ASQC basic references in quality control. ASQC Quality Press (1993)
39. Kahn, H., Harris, T.E.: Estimation of particle transmission by random sampling. Natl. Bur. Stand. Appl. Math. Ser. 12, 27–30 (1951)
41. Kwiatkowska, M., Norman, G., Parker, D.: Prism: Probabilistic symbolic model checker. In: International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, pp. 200–204. Springer (2002)
42. Law, A.M.: Simulation modeling and analysis. McGraw-Hill (2014)
45. Monti, R.E.: Stochastic automata for fault tolerant concurrent systems. Ph.D. thesis, FAMAF, Universidad Nacional de Córdoba, Córdoba, Argentina (2018)
46. Monti, R.E., Budde, C.E., D’Argenio, P.R.: A compositional semantics for repairable fault trees with general distributions. In: LPAR, EPiC Series in Computing, vol. 73, pp. 354–372. EasyChair (2020). https://doi.org/10.29007/p16v
50. Rubino, G., Tuffin, B. (eds.): Rare event simulation using Monte Carlo methods. Wiley (2009)
52. Ruijters, E., Guck, D., van Noort, M., Stoelinga, M.: Reliability-centered maintenance of the electrically insulated railway joint via fault tree analysis: a practical experience report. In: DSN 2016, pp. 662–669. IEEE Computer Society (2016). https://doi.org/10.1109/DSN.2016.67
57. Vesely, W., Stamatelatos, M., Dugan, J., Fragola, J., Minarick, J., Railsback, J.: Fault tree handbook with aerospace applications. NASA Office of Safety and Mission Assurance (2002). Version 1.1
58. Villén-Altamirano, J.: RESTART method for the case where rare events can occur in retrials from any threshold. Int. J. Electron. Commun. 52(3), 183–189 (1998)
62. Villén-Altamirano, M., Martínez-Marrón, A., Gamo, J., Fernández-Cuesta, F.: Enhancement of the accelerated simulation method RESTART by considering multiple thresholds. In: Proc. 14th Int. Teletraffic Congress, Teletraffic Science and Engineering, vol. 1, pp. 797–810. Elsevier (1994). https://doi.org/10.1016/B978-0-444-82031-0.50084-6
63. Villén-Altamirano, M., Villén-Altamirano, J.: RESTART: a method for accelerating rare event simulations. In: Queueing, Performance and Control in ATM (ITC-13), pp. 71–76. Elsevier (1991)
Metadata
Title
Analysis of non-Markovian repairable fault trees through rare event simulation
Authors
Carlos E. Budde
Pedro R. D’Argenio
Raúl E. Monti
Mariëlle Stoelinga
Publication date
04-11-2022
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Software Tools for Technology Transfer / Issue 5/2022
Print ISSN: 1433-2779
Electronic ISSN: 1433-2787
DOI
https://doi.org/10.1007/s10009-022-00675-x
