Introduction

The study of interdependent, networked systems has recently received considerable attention1,2,3,4,5,6,7,8,9,10,11, with the majority of work so far focussed on the interactions between different ‘critical infrastructures’12,13,14,15,16. We argue that critical infrastructures should themselves be viewed as a special class of interdependent systems, due to the presence of in-built monitoring and control mechanisms12,17,18. The type of control most prevalent in such systems is so-called ‘supervisory’ control— as distinguished from, say, controllability19 —which typically involves monitoring an underlying process, with the option of a pre-defined intervention once a critical state is reached. Here, in keeping with the picture of interdependent networks, both monitoring and intervention are local processes, associated with specific points on the underlying network. Furthermore, we are interested in the case where the control is ‘distributed’, that is, the local interventions are coordinated via communications between sensors. At the most general level, we aim to build a physics-like model of such systems: complicated enough to encompass any interesting behaviour, but sufficiently idealized that the mechanisms at play can be easily identified and understood.

Our ideas are based on the supervisory control and data acquisition (SCADA) concept, ubiquitous in real-world monitoring of industrial manufacturing, power generation and distribution processes (e.g., electricity, gas and water)20. To this end, our model comprises an underlying system, here an electrical network, where a simple control device is placed on each transmission line with probability p (see Fig. 1). The device monitors the load of that line and, if the line is overloaded, the device can dissipate the excess load with success probability q and prevent the failure. Otherwise, the line fails and its load is redistributed. The redistribution of loads may then lead to the overloading and failure of further power lines and so on, potentially resulting in large system-wide outages21. If, at any stage during this process, more than one line becomes overloaded, then it is assumed that the next line to fail will be the one with the largest excess load. In the case where these lines are supervised, it therefore helps if the control devices respond in a coordinated way— always dissipating the excess load on the line under the greatest threat of failure. We therefore stipulate that for a control device to be operational, it must be in contact with a central processing unit (CPU). We envisage a communication network composed of ICT-like links connecting the devices and the CPU where, in keeping with a distributed SCADA picture, each device can also act as a signal relay— so-called ‘daisy chaining’. Crucially, this means that when a control device fails, it can disconnect many other devices from the CPU, rendering them useless— and dramatically increasing the fragility of the system. The structure of the supervisory network is therefore very important and we consider two extremes. On one hand, a Euclidean minimum spanning tree (EMST) minimises the total length of the control network— and hence the cost— but typically sacrifices direct connectivity to the CPU. On the other hand, a mono-centric network maximises direct connectivity to the CPU, but can be very costly in terms of the total length of network needed. We interpolate between these two configurations by using a simple rewiring process: for each node in an EMST, replace, with probability μ, the edge connected to the neighbour closest to the CPU (along the network) by an edge that connects directly to the CPU. As a result, the topology of the supervisory network depends on one continuous parameter μ ∈ [0, 1], such that μ = 0 and μ = 1 correspond to the EMST and mono-centric networks respectively.
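
To illustrate the rewiring step, the following is a minimal sketch (ours, not the code used for the results) in which the EMST is stored as parent pointers towards the CPU, parent[i] being the neighbour of device i that lies closest to the CPU along the network; with probability μ the edge to the parent is replaced by a direct link to the CPU. The node labels and the parent array are purely illustrative.

```cpp
// Minimal sketch of the mu-rewiring of the supervisory network (illustrative
// only). The EMST is stored as parent pointers towards the CPU: parent[i] is
// the neighbour of device i that is closest to the CPU along the network.
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

int main() {
    const int cpu = 0;                              // the CPU is node 0 and is never rewired
    std::vector<int> parent = {-1, 0, 1, 1, 2, 4};  // toy EMST: 1-0, 2-1, 3-1, 4-2, 5-4
    const double mu = 0.5;                          // rewiring probability

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    std::vector<std::pair<int, int>> edges;         // supervisory network after rewiring
    for (int i = 1; i < (int)parent.size(); ++i) {
        if (u(rng) < mu)
            edges.emplace_back(i, cpu);             // replace the EMST edge by a direct CPU link
        else
            edges.emplace_back(i, parent[i]);       // keep the EMST edge
    }

    // mu = 0 recovers the EMST, mu = 1 the mono-centric (star) network.
    for (const auto& e : edges) std::printf("%d -- %d\n", e.first, e.second);
    return 0;
}
```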

Figure 1

Supervisory control.

(a): The underlying power grid is represented by blue nodes and black edges. The load carried by power-lines (edges) is supervised by control devices, shown by red squares. The control devices act as signal relays and form a supervisory network (red edges) with the CPU. (b): For a given overloaded edge, there is a probability p that a control device is present. If this device is connected to the CPU, then it can be determined whether the edge is carrying the largest excess load in the system. If so, the device will attempt to dissipate the excess load, with success rate q.

For modelling the electrical network we adopt a straightforward approach which has been proposed and analysed elsewhere22. The idea assumes a set of producers and consumers linked by power lines, where the resulting load carried by each line, or edge, may be represented by a random variable drawn from a uniform distribution U. Since U is properly normalized, the upper and lower bounds of the distribution, ℓmin and ℓmax, are related to the average load 〈ℓ〉, such that

〈ℓ〉 = (ℓmin + ℓmax)/2.

In keeping with the above, it is also assumed that the transmission lines have an intrinsic carrying capacity (taken here, without loss of generality, to be one) which, if exceeded, causes the line to fail and its load to be redistributed evenly amongst its nearest neighbours23. The crucial departure from Ref. 22 is in our choice of network topology. Since many critical infrastructures are, to a good approximation, planar subdivisions24, we use the well-known Delaunay triangulation25, which is a simple, reasonable model for planar networks such as power grids.
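
For concreteness, and under the assumption that the ‘nearest neighbours’ of an edge are the edges with which it shares a node, the redistribution rule used here can be written as follows (notation ours):

```latex
% Even redistribution of the load \ell_i of a failed edge i among its
% neighbouring edges (\partial i is the set of edges sharing an endpoint
% with i and k_i = |\partial i|).
\[
  \ell_j \;\longrightarrow\; \ell_j + \frac{\ell_i}{k_i},
  \qquad j \in \partial i .
\]
% Any edge pushed above its carrying capacity of one becomes overloaded.
```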

Results

We test the vulnerability of our model against failure cascades by using computer simulations (see the Methods section for details). For given values of the parameters p, q, μ and 〈ℓ〉, we repeatedly generate instances of the ensemble, each time initiating a cascade according to a ‘fallen tree’ approach— that is, an unspecified external event removes an edge and, if it is supervised, the associated control device. Following each cascade, Nlcc, the size of the remaining largest connected component of the underlying electricity network, is recorded. We assume that administrators/designers of real systems are interested in ensuring that cascades are bounded by a certain size. To this end, we consider

Pε = Prob(1 − Nlcc/N < ε),

the probability that, following a cascade, the number of nodes disconnected from the largest connected component— the effective cascade ‘size’: 1 − Nlcc/N— is less than a fraction ε ∈ (0, 1] of the original nodes.
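
As a purely illustrative reading of this quantity (the numbers below are hypothetical and not simulation results), take N = 500, as in the Methods section, and ε = 1/2:

```latex
% A cascade leaving N_lcc = 400 nodes in the largest connected component has
% effective size
\[
  1 - \frac{400}{500} \;=\; 0.2 \;<\; \tfrac{1}{2},
\]
% and so counts towards P_{1/2}, whereas a cascade leaving N_lcc = 200 nodes
% (size 0.6) does not.
```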

In general, as one would expect, the larger the average load carried by the system, the smaller the probability that the cascade size is bounded (see Fig. 2a). However, we also observe another feature of this type of cascading model, first identified in Ref. 22: for each value of p, there is a non-zero critical value 〈ℓ〉*, corresponding to the maximum average load below which cascade sizes are bounded with probability one (within a given accuracy, here 1 part in 5 × 10^3). Plotting the values of 〈ℓ〉* against p, a sharp transition can be observed at some point p* (see Fig. 2b). Above this value, the fraction of disconnected nodes is always bounded by ε = 1/2, regardless of how much load the system is carrying. In the completely reliable case (q = 1), p* simply corresponds to the percolation threshold pc (~0.33 for Delaunay triangulations26). The cascades are then ‘percolation controlled’ due to the formation of a giant component connected by supervised edges, coined here the giant supervised component (GSC). The upper bound on cascade size enforced by the GSC can be lowered by employing more control devices— i.e., by increasing p (see Fig. 2c). For p ≥ 1 − pc, most nodes are connected by supervised edges and cascades cannot disconnect nodes from the giant component.

Figure 2

The effects of reliable control devices (q = 1).

(a): The probability P1/2 that, following a cascade, the remaining largest connected component of the underlying grid contains more than half of the nodes depends on the average load 〈ℓ〉 carried by the system and on the density p of control devices. For each value of p, the system is characterized by a critical average load 〈ℓ〉*. Below this critical value, cascades never disconnect more than half of the system (P1/2 = 1), whilst above it, there is always a finite chance that this will happen (P1/2 < 1). (b): As the bond-percolation threshold p* = pc ~ 0.33 is approached, the critical value 〈ℓ〉* rises sharply to one due to the formation of a giant supervised component (GSC). Inset: Results are unchanged by increases in system size. (c): The bound on cascade size can be lowered by increasing p > pc and therefore the size of the GSC. For p ≥ 1 − pc, most nodes are connected by supervised edges and therefore cascades cannot disconnect any nodes completely. Inset: For values of the cost (estimated as the total length of the supervisory network) above C ~ 1/2, the average sub-tree size 〈s〉 of the control network— and therefore the average number of devices disconnected at cascade initiation— is less than 1 and negligible as a fraction of the system size. (d): The critical value 〈ℓ〉* for p > p*. In this case, increasing the cost of the supervisory network only increases the critical load associated with bounding small cascades, and not those of the order of the system size.

Whilst q = 1, the only impact of decreasing μ is to increase the number of devices disconnected by the initial external shock. Disregarding the correlation induced by starting the cascade at the point of disconnection, this effect corresponds to a small shift, governed by 〈s〉, in the positive x-direction of Figs. 2b and 2c. Here, 〈s〉 is the average sub-tree size associated with a randomly chosen node (see Fig. 2c inset). Figure 2d shows the effects of this shift when p > p*, for both large and small ε. It is then natural to characterize changes in μ by a normalized cost C, constructed from L(μ), the total length of the supervisory network. The message of Fig. 2d is that increasing the number of direct CPU connections, at the cost of increased network length, is only beneficial if the suppression of small cascades is desired.

If, in contrast to the above, the control devices have an inherent rate-of-failure (q < 1), then a GSC may be either disconnected or reduced in size as control devices fail. In the best case scenario, when the supervising network is mono-centric and q is close to one, the picture is one of ‘effective percolation’ with (see Methods)

p* = pc q^−α,     (6)

where α is determined by the topology of the underlying network (~2.4 for a Delaunay triangulation, see Methods section for details). This simple form shows good agreement with direct estimates of the value of p* (see Fig. 3b and the Methods section for details). For lower values of q, percolation-like descriptions are no longer appropriate: regardless of the number of control devices, it is not possible to bound cascade sizes in a way that is independent of the average load carried by the system. Indeed, if the control devices are unreliable (q < 1) and the control network is tree-like (μ < 1), the system is very susceptible to large failure cascades, with little impact made by increasing p (see Fig. 4). In this case, we see that for both large and small cascades, the topology of the control network is very relevant and can induce extreme fragility in the control system (see Fig. 5).
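
To give a sense of the size of this correction, here is a worked example using the quoted values pc ~ 0.33 and α ~ 2.4 (the arithmetic is ours):

```latex
% For q = 0.9:
\[
  p^{*} \simeq p_c\, q^{-\alpha} \approx 0.33 \times 0.9^{-2.4} \approx 0.42 ,
\]
% i.e. a 10% per-solicitation failure rate already pushes the threshold well
% above the bare percolation value p_c ~ 0.33.
```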

Figure 3

The effects of control device failure (q ≠ 1) when every device is connected directly to the CPU (μ = 1).

(a): As the reliability of the control devices decreases, more devices are needed to maintain the same critical load. (b): Agreement between the numerical value pmid obtained for the transition and the theoretical form pc q^−α motivated by simple arguments (see main text and Methods section).

Figure 4

The effect of μ (p = 0.5, q = 0.9).

When the supervising network is almost mono-centric (μ = 0.9), very few control devices fail and the remaining largest connected component therefore contains 85% of the nodes in the system. If the supervising network is almost a tree (μ = 0.1), then even though the inherent failure rate is low, many devices become disconnected from the CPU and only 10% of the nodes remain connected following a cascade.

Figure 5

The effect of topology in control networks with unreliable devices (q = 0.9).

For all cascade sizes (i.e., regardless of ε), the critical load depends strongly on the structure of the supervisory network, in contrast with the completely reliable case. In particular, even when there are already many direct-CPU links (C > 1/2), the critical load that the system can carry is drastically reduced by introducing more dependency into the supervisory network.

Discussion

In conclusion, we have introduced a minimal model which incorporates the salient features of many real-world control systems. Firstly, the control devices are simple: they have only the so-called ‘supervisory’ functions of monitoring and intervention. Secondly, the system is ‘distributed’: not only are the devices positioned in space, but they require coordination— in this case, by connection to a CPU. Thirdly, we also incorporate the effects of devices having an inherent rate-of-failure. With only these simple characteristics, the resulting behaviour is very rich. The primary feature concerns the fragility of such control systems: a small reduction in control device reliability leads to a regime where the ability to suppress cascades is dramatically affected by the topology of the control network. Our results suggest that it is much more cost-effective to improve the reliability of control devices than to work on the stability of the supervisory control network. We believe that these results represent a first step in understanding distributed supervisory control, whilst also providing helpful guidelines to designers and administrators of real systems. We welcome further work in the area.

Methods

Simulations

To simulate the system, N nodes are placed in the plane at random, the Delaunay triangulation is then formed and loads are allocated to the resulting edges by drawing from the uniform distribution U. The supervisory network is incorporated by first adding a control device to each edge with probability p, then forming the network according to the rewiring procedure described in the main text (dependent on the parameter μ). Cascades are initiated by assuming an external event that causes an edge to be removed at random, with its load redistributed amongst its nearest neighbours. If the failing edge was supervised, then the control device is also removed. During the ensuing cascade, we stipulate that for a control device to work, it must be connected to the CPU, a special node that cannot be removed. If a control device is disconnected, then it cannot work and is of no use. However, if a control device is connected and it is supervising an edge that is about to fail— i.e., it is carrying the largest excess load in the system— then there is a probability q that the excess load is dissipated, returning the load of the edge to within its carrying capacity. The quantity q can therefore be thought of as the inherent reliability of a device.
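
To make the cascade procedure concrete, the following is a minimal, self-contained C++ sketch of the dynamics just described (ours, not the simulation code used for the results, which was written with the Boost Graph library). For brevity it uses a small hand-built graph in place of the Delaunay triangulation and takes the supervisory network to be mono-centric (μ = 1), so that a device remains in contact with the CPU for as long as its edge survives; the load bounds and the post-dissipation reset value are assumptions of this sketch.

```cpp
// Simplified sketch of the cascade dynamics (illustrative only).
#include <cstdio>
#include <random>
#include <vector>

struct Edge {
    int a = 0, b = 0;        // endpoints in the underlying grid
    double load = 0.0;       // current load (the carrying capacity is 1)
    bool alive = true;       // edge still present
    bool supervised = false; // edge carries a control device
};

int main() {
    std::mt19937 rng(1);
    const double meanLoad = 0.6, p = 0.5, q = 0.9, capacity = 1.0;

    // Toy graph on 6 nodes (a Delaunay triangulation would be used in practice).
    std::vector<Edge> edges = {{0,1},{1,2},{2,3},{3,4},{4,5},{5,0},{0,2},{2,4},{4,0}};
    std::uniform_real_distribution<double> uload(meanLoad - 0.3, meanLoad + 0.3); // uniform, mean = meanLoad (bounds illustrative)
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    for (auto& e : edges) { e.load = uload(rng); e.supervised = (u01(rng) < p); }

    // Edges neighbouring edge i are those sharing one of its endpoints.
    auto neighbours = [&](int i) {
        std::vector<int> nb;
        for (int j = 0; j < (int)edges.size(); ++j)
            if (j != i && edges[j].alive &&
                (edges[j].a == edges[i].a || edges[j].a == edges[i].b ||
                 edges[j].b == edges[i].a || edges[j].b == edges[i].b))
                nb.push_back(j);
        return nb;
    };

    // Remove edge i (and its device, if any) and share its load evenly.
    auto fail = [&](int i) {
        std::vector<int> nb = neighbours(i);
        edges[i].alive = false;
        edges[i].supervised = false;
        if (!nb.empty())
            for (int j : nb) edges[j].load += edges[i].load / nb.size();
    };

    fail(0); // 'fallen tree': an external event removes a randomly chosen edge (here edge 0)

    while (true) {
        // Find the overloaded edge carrying the largest (excess) load.
        int worst = -1;
        for (int i = 0; i < (int)edges.size(); ++i)
            if (edges[i].alive && edges[i].load > capacity &&
                (worst < 0 || edges[i].load > edges[worst].load))
                worst = i;
        if (worst < 0) break; // no overloaded edges left: the cascade stops

        if (edges[worst].supervised && u01(rng) < q)
            edges[worst].load = capacity; // the device dissipates the excess (reset value assumed)
        else
            fail(worst);                  // otherwise the edge fails and redistributes its load
    }

    for (int i = 0; i < (int)edges.size(); ++i)
        std::printf("edge %d alive=%d load=%.2f\n", i, (int)edges[i].alive, edges[i].load);
    return 0;
}
```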

Simulations were written in C++ and implemented using the Boost Graph library27 where possible. Delaunay triangulations were produced using an iterative algorithm25.

Results are presented for systems of size N = 500 (~3 × 10^3 edges) and statistics are calculated over 5 × 10^3 instances of each ensemble (defined by the parameters p, q, μ and 〈ℓ〉). The critical values 〈ℓ〉* and p* are accurate up to an error of approximately ±0.02, since they are identified by varying the underlying parameter in finite increments. In Figs. 4c and 5, 〈ℓ〉* corresponds to Pε > 0.99 in order to accommodate the noise associated with different control network structures.

Formation of an effective GSC

Labelling each supervised edge by i = 1, 2, …, Es, the probability that supervised edge i survives a cascade is q^ni, where ni is the number of times its device is solicited— i.e., tries to dissipate the excess load, succeeding with probability q. Here, for large enough systems, the number of supervised edges is given by Es = pE. (Since the average degree of a Delaunay triangulation is peaked around six, the total number of edges E is well approximated by E ~ 3N.) Using a bar to denote the system average over supervised edges, we know that if Var[n] is small, then the mean survival probability is well approximated by q^n̄, where n̄ is the mean number of solicitations. Approximating a large system average with an ensemble average 〈…〉 over many smaller systems, the results are given in Table 1. Here it is clear that the average 〈n〉 is well approximated by the value 2.4, regardless of p and q, and that the variance is always very small compared to the average. We can then write the effective probability that a generic edge resists failure as

peff = p q^〈n〉 = p q^α,

with α = 〈n〉 ~ 2.4. The system will then remain percolation controlled provided peff ≥ pc; setting peff = pc implies Eq. (6).
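
The replacement of the system-averaged survival probability by q raised to the mean number of solicitations can be justified by a short expansion (our derivation), valid when Var[n] is small:

```latex
% Writing n = \bar{n} + \delta n and expanding to second order in \delta n:
\[
  \overline{q^{\,n}}
  \;=\; q^{\bar{n}}\;\overline{e^{\,\delta n \ln q}}
  \;\approx\; q^{\bar{n}}\left[\, 1 + \tfrac{1}{2}(\ln q)^2\,\mathrm{Var}[n] \,\right],
\]
% so the correction to q^{\bar{n}} is negligible whenever
% (\ln q)^2 \, \mathrm{Var}[n] \ll 1, as is the case here.
```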

Table 1 For different values of p and q, 〈n〉 ~ 2.4 and Var[n] ≪ 〈n〉

Equation (6) may be contrasted with a direct approximation of when an effective GSC forms. From the simulation results, we associate each transition with the value pmid, defined as halfway between pc and the lowest value of p for which 〈ℓ〉* is maximal (i.e., the midpoint of the transition).