Periodic inspection optimization model for a complex repairable system

https://doi.org/10.1016/j.ress.2010.04.003

Abstract

This paper proposes a model to find the optimal periodic inspection interval on a finite time horizon for a complex repairable system. In general, it may be assumed that components of the system are subject to soft or hard failures, with minimal repairs. Hard failures either are self-announcing or stop the system when they occur, and they are fixed instantaneously. Soft failures are unrevealed and can be detected only at scheduled inspections, but they do not stop the system from functioning. In this paper we consider a simple policy where soft failures are detected and fixed only at planned inspections, and not at moments of hard failures. One version of the model takes into account the elapsed times from soft failures to their detection. The other version considers a threshold for the total number of soft failures. A combined model is also proposed to incorporate both the threshold and the elapsed times. A recursive procedure is developed to calculate the probabilities of failures in every interval and the expected downtimes. Numerical examples of the calculation of optimal inspection frequencies are given. The data used in the examples are adapted from a hospital's maintenance data for a general infusion pump.

Introduction

Complex repairable systems such as medical devices, telecommunication systems, and electronic instruments consist of a large number of interacting components that perform the system's required functions. A repairable system, on failure, can be restored to satisfactory performance by any method except replacement of the entire system [1]. A repairable system is usually subject to periodically or non-periodically planned inspections during its life cycle. The scheduled inspections are performed to verify device safety and performance by detecting potential and hidden failures and taking appropriate actions. The action can be to fix the failure if the device is found defective, or to perform preventive maintenance if no failure is detected, in order to avoid or reduce future failures.

We will consider a system with several failure modes that may be classified into two broad categories, type I and type II, by their consequence or possibility of detection. Failure modes of type I are those that have a more significant influence on system operation. Failure modes of type II are those that are less critical for the system. Meeker and Escobar [2, p. 327] also consider two types of failures: hard and soft. They suggest that a soft failure may be defined as occurring when degradation (gradual loss of performance) exceeds a certain specified level, while hard failures cause the system to stop working. In this paper, type I failures (which we can also call “hard” failures) are those failures that either stop operation of the system or are self-announcing, and that are fixed as soon as they occur. On the other hand, type II failures (which we can also call “soft” failures) are failures that do not make the system stop, but can reduce the system's performance and need to be fixed eventually. Soft failures are usually not self-announcing and are detected and fixed only at the next scheduled inspection. Therefore, there is a time delay between the actual occurrence of a soft failure and its detection.

There are different categories of soft failure of components in complex devices. One category includes a wide range of integrated protective components used to protect a system from unwanted transient incidents such as current and voltage surges. For example, infusion pumps and medical ultrasounds are equipped with circuit breakers to protect them against overload and short circuits. If these protective components fail, the system can continue its main function, although the risk of damage increases if an overload or short circuit takes place. The failures of protective components are not self-announcing and can be rectified only at inspection. Components in this group do not age further from the moment a failure occurs until the component is fixed and put back in service. Standby redundant components are another category of components with soft failures. They are used in many critical systems to enhance system safety and reliability. The system, depending on its design, can continue functioning as long as a required number of these redundant components work. Uninterruptible power supplies (UPSs), dual redundant processors, and redundant fan trays are examples of redundant components used to increase system safety and availability. A CT (CAT) scanner may have several CPUs running different algorithms for image processing. If one CPU fails, processing is routed to other CPUs without interrupting or degrading the system's performance. Again, components in this category do not age after failure, and usually their failures are detected only at inspection.

In addition to protective and redundant components, there are other components whose failure does not halt the system, even though the failure can have serious or even catastrophic consequences if left unattended. With infusion pumps, for example, audible or visual signals are used to communicate with operators and inform them of the status of the patient to whom the device is attached. When the level of liquid delivered to a patient falls to a certain level, the component responsible for producing signals starts to send a warning alarm. If the component fails, the pump can still function, but the patient's health risk increases if the operator does not take action. Devices in overcrowded hospitals are always in use, and even though some of their informative features such as audible signals may fail, they remain in operation until the next scheduled inspection.

The results obtained from the study in [3] show that some components of a general infusion pump, such as indicators, switches, and occlusions, have hard failures. The system stops operating when its power switch fails, and an occlusion is self-announcing: as soon as there is an occlusion alarm, the user is notified that something is wrong. The pump can continue to operate if a protective component such as a circuit breaker or an informative component such as an audible signal generator fails, but such failures may cause damage if left unattended.

It should be noted that some components may have several failure modes, including soft and hard, but in the analysis these modes may be considered separately, as will be discussed later. For simplicity we will identify components with their dominant failure modes and assume that each component has only one failure mode, either soft or hard. Once a component experiences a soft failure, it stays in the same condition until it is fixed, as discussed above. Even if the component deteriorates in some way after the failure, the repair at inspection will return it to the state just before failure.

Components with both soft and hard failures should, in general, be incorporated in developing an inspection optimization model for a complex repairable system. Numerous models have been developed for inspection and maintenance optimization of a system subject to failure. Nakagawa and Mizutani [4] in their recent paper give an overview of replacement policies with minimal repair, block replacement, and simple replacement for an operating unit in a finite time horizon. In all three policies they assume that the unit is replaced at periodic replacement times and is then as good as new. In the policy with minimal repair, the unit is minimally repaired if it fails between periodic replacements. In the block policy the unit is always replaced at failures between replacements, and in the simple policy time elapses between a failure and its detection at the next inspection time after the failure. Moreover, the paper presents optimal periodic and sequential policies for imperfect preventive maintenance and an inspection model for a finite time span. In the inspection model it is assumed that the unit is replaced if a failure is detected at inspection; hence this model describes mainly a failure-finding optimization problem.

Further inspection and maintenance models for single-unit and multi-unit systems are discussed in detail in Refs. [5], [6], [7], [8], [9], [10] including age-dependent, periodic PM, failure limit, sequential PM, and repair limit policies in infinite time span. The proposed models are classified according to the degree to which the operating condition of an item is restored by maintenance [9], [11], [12], [13], [14]. Perfect repair restores the system to as good as new and minimal repair restores it to the same failure rate (intensity of failures) as before failure. Imperfect repair restores the system to somewhere between as good as new and as bad as old. Worse and worst repairs make the failure rate increase; however, the system does not break down in the worse repair, while the worst repair unintentionally makes the system fail.
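The repair categories above can be made concrete with a small sketch. It assumes a power-law failure intensity and models the effect of a repair as a rescaling of the component's effective age; both the functional form and the `restoration` parameter are illustrative assumptions, not part of the models reviewed here.

```python
def power_law_intensity(age, beta, eta):
    """Failure intensity of a power-law process at a given effective age.

    beta (shape) and eta (scale) are hypothetical parameters; for age == 0
    the sketch assumes beta >= 1 to avoid a division by zero.
    """
    return (beta / eta) * (age / eta) ** (beta - 1.0)


def age_after_repair(age, restoration):
    """Effective age after a repair (illustrative parameterization).

    restoration == 0.0     -> perfect repair (as good as new)
    restoration == 1.0     -> minimal repair (as bad as old)
    0 < restoration < 1    -> imperfect repair (somewhere in between)
    restoration > 1.0      -> worse repair (intensity increases if beta > 1)
    """
    if restoration < 0.0:
        raise ValueError("restoration must be non-negative")
    return restoration * age


if __name__ == "__main__":
    beta, eta, age = 2.0, 10.0, 5.0
    for label, r in [("perfect", 0.0), ("imperfect", 0.5), ("minimal", 1.0)]:
        new_age = age_after_repair(age, r)
        print(label, power_law_intensity(new_age, beta, eta))
```

With minimal repair the intensity right after the repair equals the intensity just before the failure, which is the property the model in this paper relies on.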

Total expected cost and expected downtime per unit time are usually considered for inspection/replacement decision problems. The majority of maintenance/inspection models assume that the system is not operative after failure, i.e., they consider failures as hard failures. However, the delay-time model [15], [16] regards the failure process as a two-stage process. First a defect is initiated and if unattended, it will lead to a failure. The delay time between initiation of the defect and time of failure represents a window of opportunity for preventing the failure. Wang [17] in his recent paper presents an inspection optimization model for minor and major inspections that are carried out for a production process subject to two types of deterioration. Minor and major defects are identified and repaired, respectively, at routine and major inspections.

In many real-world situations where safety and reliability of devices are vital, devices must be inspected periodically. For example, according to regulations, all major components and functions of clinical medical devices in a hospital must be checked at inspection to assure that the devices are safe for use on patients. If a device fails between two consecutive scheduled inspections and the failure is discovered by operators, it is checked and repaired immediately regardless of scheduled plans. Often hospitals take non-scheduled maintenance as an opportunity to check the device's failed component and to inspect all other major components/functions. In fact, clinical engineers working in a hospital are supposed to go through a predesigned checklist when inspecting major components/functions of a device at both scheduled and non-scheduled inspections.

In this paper, we consider a model for a multi-component repairable system on a finite horizon subject to both soft and hard failures. For practical purposes and simpler implementation, most organizations, especially those dealing with a large number of devices, use a periodic inspection policy. A non-periodic inspection policy can also be considered using the same mathematical model, but with more time-consuming calculation and greater demands on practical implementation. We assume that the components with soft failures, if found failed at inspection, are repaired to the same condition as just before failure (minimal repair), even if they age in a certain way while in the failed state. The components with hard failures are minimally repaired immediately on failure, without delay. Therefore, the components restart with the same failure rate as before failure, and the number of failures for each component follows a non-homogeneous Poisson process. It should be noted that at inspection only the states of components with soft failures are found, and not their age at failure (i.e., the times of soft failures are censored), which complicates the problem of estimating the failure rate [3]. At this stage we consider a simple policy in which at hard failures only the failed component is inspected and fixed, and at periodic scheduled inspections all components with soft failures are checked and fixed if found failed. Because of the instantaneous minimal repair of components with hard failures, they have no impact on the inspection policy and will be ignored in the calculations in this paper. Therefore, the model reduces to a model of a system consisting of several units/components with different failure rates. Depending on the consequences of failures of individual components, dependence between components should either be taken into account or can be ignored. For example, if only the downtime of each component contributes to the cost, then only marginal failure rates need to be considered and any relationship between components can be ignored. This case will be discussed in Section 3. Hence at this stage the main contribution of this paper is at the component level, in introducing delayed minimal repairs of the components on finite time intervals and in using recursive calculation.
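The policy for a single component with soft failures can be illustrated by simulation. The sketch below is a Monte Carlo estimate, not the recursive procedure developed in the paper, and it assumes a power-law intensity with hypothetical parameters `beta` and `eta`. The component does not age while failed, and the minimal repair at inspection lets it resume at the age it had when it failed. The closed-form helper holds only for a constant failure rate and is included as a sanity check.

```python
import math
import random


def simulate_soft_downtime(beta, eta, tau, n, runs=20000, seed=0):
    """Monte Carlo estimate of the expected total downtime of one
    soft-failure component over n inspection intervals of length tau.

    Assumptions (illustrative): cumulative intensity (x / eta) ** beta,
    no aging while failed, minimal repair at each inspection.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        age = 0.0  # operating age accumulated so far
        for _ in range(n):
            # Age at the next failure given survival to `age`: the cumulative
            # intensity grows by a unit-rate exponential variate.
            target = (age / eta) ** beta + rng.expovariate(1.0)
            fail_age = eta * target ** (1.0 / beta)
            time_to_fail = fail_age - age
            if time_to_fail < tau:
                # The failure stays hidden until the inspection at the end
                # of the interval, so the rest of the interval is downtime.
                total += tau - time_to_fail
                age = fail_age  # minimal repair: resume at the failure age
            else:
                age += tau  # survived the whole interval
    return total / runs


def exact_downtime_constant_rate(rate, tau, n):
    """Expected total downtime when the failure rate is constant (beta = 1)."""
    return n * (tau - (1.0 - math.exp(-rate * tau)) / rate)


if __name__ == "__main__":
    print(simulate_soft_downtime(beta=1.0, eta=10.0, tau=2.0, n=5))
    print(exact_downtime_constant_rate(rate=0.1, tau=2.0, n=5))
```

When only downtime contributes to the cost, such an estimate can be computed for each component separately and the results summed, which is why only marginal failure rates are needed in that case.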

This inspection optimization model is developed particularly for finding the optimal periodic inspection interval for medical devices used in hospitals. The assumptions used to construct the optimization model in this paper are obtained from the results of the study performed in Ref. [3].

A detailed problem description is given in Section 2. Section 3 describes the inspection optimization model considering downtime of components with type II failures and includes a numerical example using adapted data from the case study [3]. Section 4 discusses the model considering the number of type II failures exceeding a pre-defined threshold, and a numerical example. The combined model and its numerical example are given in Section 5. The final Section 6 presents conclusions.

Section snippets

Problem definition

Consider a complex repairable system consisting of two groups of components. Failure of a component in the first group (type I failure, or “hard” failure) is revealed or self-announcing and when it occurs the system stops working. However, the system can still continue operating when a component in the second group fails (type II failure, or “soft” failure). Therefore, type I failure is reported immediately as soon as it occurs and the failed component is inspected (non-scheduled inspection)

Model considering downtime and repair of components

When the downtime of a component with type II failure is important, a method considering downtime should be employed. At times, the longer a component is in a failed state, the more significant is its influence on system performance or safety. For example, a defective head in a thickness gauge can produce inaccurate measurements starting from the moment of its deformation and lasting until the defect is rectified at inspection. Therefore, a penalty should be imposed for the period of less

Model considering the number of failures

In this variation of the model we introduce a threshold N_trh for the acceptable number of failures observed at a scheduled inspection. A penalty cost is incurred according to the number of failures that exceed the threshold. This method can be used when only the total number of failures is important, regardless of which combination of failures has occurred. An example of an application of this method can be a computer system with several external hard drive backups; even if all hard drive

The combined model

Eqs. (5) and (11) can also be combined to take into account both the number of failures exceeding the threshold and the elapsed times of failures, thus making the cost model more flexible. It should be noted that the “threshold” part of the model requires calculation of the distribution of the total number of failures, whereas the “downtime” part of the model requires only marginal distributions of times to failure. We also assume that components are independent.
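Under the independence assumption, the distribution of the total number of failures found at an inspection can be sketched as follows. The per-component probabilities `p[j]` of being found failed are hypothetical inputs here; in the paper they would come from the recursive calculation, which is not reproduced in this excerpt.

```python
def total_failure_distribution(p):
    """Distribution of the number of failed components among independent
    components, where p[j] is the (assumed known) probability that
    component j is found failed at an inspection.

    Returns a list dist with dist[k] = P(total number of failures == k).
    """
    dist = [1.0]  # with no components, zero failures with certainty
    for pj in p:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1.0 - pj)  # component j found working
            nxt[k + 1] += prob * pj      # component j found failed
        dist = nxt
    return dist


def prob_exceeds(p, threshold):
    """P(total number of failures > threshold)."""
    return sum(total_failure_distribution(p)[threshold + 1:])


if __name__ == "__main__":
    p = [0.1, 0.2, 0.3]  # hypothetical failure probabilities
    print(total_failure_distribution(p))
    print(prob_exceeds(p, threshold=1))
```

The convolution adds one component at a time, so the cost is quadratic in the number of components and no enumeration of failure combinations is needed. The downtime part of the cost, by contrast, can be computed from each component's marginal distribution alone.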

In the combined model we have $E[C_S^T] = n c_I + \sum_{j=1}^{m} c_j \left( n \bar{P}_n^j(t) \right) + c_P \sum_{k=1} \dots$

Concluding remarks

A complex repairable system consists of various components with different types of failure, hard and soft. The so-called hard failures are revealed, i.e., either they are self-announcing or the system stops when they take place. However, the system can still continue functioning when soft failures occur. Both hard and soft failures should be incorporated in the general formulation of an inspection optimization model for a repairable system. Soft and hard failures are important in terms of the

Acknowledgements

We acknowledge the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Ontario Centre of Excellence (OCE), and the C-MORE Consortium members for their financial support. We are thankful to the referees, whose constructive criticism and comments have helped us to improve the presentation of the paper and to augment our literature review with new references. Specifically, remarks concerning applications of the proposed model were very helpful.

References (24)

  • W. Meeker et al. Statistical methods for reliability data (1998)
  • Taghipour S, Banjevic D, Jardine AKS. Reliability analysis of maintenance data for complex medical devices. Qual Reliab...