Open Access 2016 | Original Paper | Book Chapter

2. Achieving Anti-fragility

Author: Kjell Jørgen Hole

Published in: Anti-fragile ICT Systems

Publisher: Springer International Publishing


Abstract

A stakeholder is a person or institution with a legitimate interest in a given information and communications technology (ICT) system. Examples of stakeholders are users, owners, operators, regulatory government agencies, system architects, and software developers. Given a set of stakeholders, a complex adaptive ICT system is fragile to a particular type of negative impact, for example, downtime, if a possible large impact is unacceptable to some stakeholders in the set and robust if all possible impacts are acceptable to all stakeholders. The ICT system is anti-fragile if it learns (perhaps with help from some stakeholders) to maintain an acceptable impact to all stakeholders as the system and environment change over time.
This chapter first considers rare failures causing unacceptable impact and argues that it is very hard to predict all such future events. Second, it argues that it is necessary to limit the impact of failures to gain robustness and to learn from the remaining small failures to achieve anti-fragility. Third, the chapter discusses limitations of classical risk analysis methods before finally introducing an alternative definition of risk in complex adaptive ICT systems.

2.1 Black and Gray Swans

As stated in Chap. 1, global emergent behaviors of complex adaptive systems are modeled as stochastic events with given probability distributions. For simplicity, we assume that the studied behavior of a system is modeled by a continuous random variable with a distribution given by a probability density function (PDF). Figure 2.1 shows two PDFs, each with a left and right tail. The tails determine the probability of outliers in the form of extreme global behavior. The left tail defines the probabilities of outliers with huge negative impact, while the right tail defines the probabilities of outliers with huge positive impact. We are only concerned with negative impact in this book.
As illustrated in Fig. 2.1, there are PDFs with thin tails and thick (or fat) tails. If a PDF has thin tails, then most events occur close to the mean of the PDF. Furthermore, outliers far from the mean have such low probabilities that they can be ignored for all practical purposes. This is the case for the thin-tailed bell curve (or normal distribution). However, if a PDF has thick tails, then the probabilities of the outliers are too large to be ignored. Observe that a PDF can also have one thick tail and one thin tail.
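To make the difference concrete, the following minimal sketch (our illustration, not from the chapter) samples a thin-tailed normal distribution and a thick-tailed Pareto distribution and counts how often outliers far from the typical value appear; the same logic applies to the left tail of an impact distribution.

```python
# Minimal sketch (illustration only): compare how often extreme outliers
# occur under a thin-tailed and a thick-tailed distribution.
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

normal = rng.normal(loc=1.0, scale=1.0, size=n)  # thin tails (bell curve)
pareto = rng.pareto(a=1.5, size=n) + 1.0         # thick tail (Pareto, alpha = 1.5)

threshold = 10.0  # an "outlier": ten times the typical value
print("normal outliers:", np.mean(normal > threshold))  # ~0; safe to ignore
print("pareto outliers:", np.mean(pareto > threshold))  # a few percent; cannot be ignored
```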
Large man-made systems designed in a top-down manner by successively being broken down into smaller parts tend to have global behaviors whose probabilities are defined by PDFs with thick left tails. In general, the thick tails are due to positive feedback loops created by a series of interacting processes that together result in systems adapting to the effect of their previous behaviors (see Fig. 1.2). Positive feedback loops allow for outliers with unacceptable impact [3, 4]. Taleb [9] distinguishes between two types of outliers with negative impact, namely, black and gray swans. Figure 2.2 depicts the differences in probability and impact between a nonrecurrent black or gray swan and so-called normal, recurrent incidents: both types of swans are surprising outliers that falsify assumptions most or all stakeholders of a system have made about the negative impact of incidents.
Assume an arbitrary but fixed set of stakeholders. A black swan is a metaphor for rare global behavior of a complex adaptive system whose huge negative impact comes as a total surprise to all stakeholders in the set. This type of extreme emergent behavior is the “unknown unknown,” a rare bombshell event that none of the stakeholders have considered.
Two important observations can be made about black swans. First, a black swan cannot be described by any of the stakeholders because the event is completely unknown to all of them. Second, while a black swan is a total surprise to all the stakeholders considered, there may be other individuals outside the group of stakeholders for whom the event is not a big surprise. For example, while the economic crisis of 2007/2008 came as a huge surprise to most people, a few individuals, including Taleb [9], foresaw the crisis, even though they could not say when it would occur or exactly how serious the consequences would be.
A gray swan is a metaphor for rare global behavior with a large negative impact that is somewhat predictable but typically overlooked by most of the stakeholders considered. It is the “known unknown,” a rare event that some know is possible but no one knows when or whether it will occur. Because a gray swan is not a complete surprise to all stakeholders, it tends to have less impact than a black swan. However, its impact is still huge. For simplicity, we often neglect to define a set of stakeholders when we discuss gray and black swans. However, the reader should assume that users, owners, software developers, operators, and regulatory government agencies are always among the stakeholders.

2.2 Examples of Swans

Hindsight bias, or the knew-it-all-along effect, is the natural tendency, after an incident has occurred, to conclude that the incident was foreseeable, despite there having been little or no objective basis for this conclusion. Hindsight bias [6, 9] causes observers to miscategorize black swans as gray swans after the fact. Moreover, differences in understanding, personal involvement, and available information cause individuals to disagree on whether a large-impact event is a black or gray swan at all. Consequently, it is hard to make all observers agree on what incidents are gray and black swans in complex ICT systems, especially when the observers have no access to the stakeholders. We can, however, give examples of incidents that many, but perhaps not all, security experts will categorize as swans.
When the computer worm Nimda first appeared on the Internet in September 2001, it spread quickly, causing hundreds of millions of dollars in damages, according to press reports. Although the public was familiar with worms at the time, we characterize Nimda as a black swan because it was the first infectious malware with multiple attack methods [30]. Nimda’s five attack methods made it extremely difficult to foresee all of their consequences. The large number of infected computers demonstrates that the attacks surprised computer owners, software vendors, and information technology departments. While Nimda caused much damage, it could have been much worse. The worm appeared only one week after the 9/11 terrorist attacks. According to Geer [30], the backdoor installed by Nimda could have been exploited to run denial-of-service attacks on emergency services all over the United States, causing a public loss of confidence on top of the nationwide uncertainty created by the shock of 9/11.
In August 2001, a company providing services to Norwegian banks installed new disks in a backup system used to mirror the production environment. System operators inadvertently routed the instruction to format the disks to the production environment rather than to the backup system. The error rendered production data inaccessible on about 280 disks, thus halting the production environment. This rare incident affected 114 banks and roughly 1 million users. It took seven days before payment card, ATM, Internet banking, and phone banking services were all back to normal operation. While the total cost to the company is not publicly known, it was likely very large, since the company had to compensate the banks for their financial losses. This gray swan occurred because administrators did not pay enough attention to the established security procedures and thus triggered a single point of failure in the system.
While we should always try to remove single points of failure from ICT systems, there exist systems for which a single point of failure is an essential side effect of the design [30]. The single red phone on the American president’s desk is a good example: many red phones would be a far worse solution from a risk management point of view. When a single point of failure is a design requirement, we need to deploy defense in depth, which is not a research-grade problem. Hence, we will not discuss single points of failure in any detail in this book.

2.3 Limiting the Impact of Failures

To understand the challenges of curbing the impact of failures in complex ICT systems, we study why it is so hard to predict rare events with large negative impacts [9, 34]. Let the term incident denote an event with negative impact. To predict any future incident, we must describe the incident, estimate its probability, and calculate the impact. Many incidents causing, for example, unplanned downtime are predictable, especially incidents due to single points of failure. As an example, ICT systems without redundant data storage or backup power are sure to fail sooner or later. However, swan incidents exist that are very hard or even impossible to predict.
In fact, it is very hard to accurately predict extreme global behavior in complex ICT systems [7, 34]. Because the systems have too many dynamic interactions for humans to even enumerate all the possible scenarios leading to outliers with a huge negative impact, it is easy for all stakeholders to overlook a future swan, thus making it black. Furthermore, it is hard to estimate the probabilities of identified gray swans, because a complex system changes significantly and perhaps abruptly over time and because a system’s recorded history might not contain a single swan; for example, a 100-year flood is not likely to show up in 10 years of historical data.
Complex systems’ lack of well-defined boundaries makes it hard to build models to accurately estimate the probabilities of gray swans. Taleb utilizes power laws to illustrate that small model errors greatly affect rare events’ estimated probabilities [11]. Experience with a particular system type helps estimate gray swan probabilities in a similar new system. However, because the estimation of gray swan probabilities in a large system requires many assumptions, especially when considering the design of a system that has not yet been implemented, the estimates carry significant uncertainty. All in all, it is very hard for stakeholders to accurately predict the gray swans that actually occur. In addition, even if a system owner mitigates all the gray swans, an unknown black swan can still cause huge damage.
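Taleb’s point about model error can be illustrated with a small calculation (ours, using an assumed power-law tail P(X > x) = x^(-alpha)): a modest error in the estimated exponent changes the probability of a rare event by a large factor.

```python
# Hedged illustration of model sensitivity for a power-law tail
# P(X > x) = x**(-alpha) (Pareto with minimum value 1).
def tail_probability(x: float, alpha: float) -> float:
    return x ** (-alpha)

x = 1000.0                 # a rare, extreme outcome
for alpha in (2.0, 2.2):   # a 10% error in the estimated exponent
    print(f"alpha = {alpha}: P(X > {x:g}) = {tail_probability(x, alpha):.2e}")
# The two estimates differ by a factor of 1000**0.2, roughly 4x, even
# though the two models look almost identical near the mean.
```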
Since the probability of each black or gray swan is both small and unknown, it is tempting to ignore swans altogether. However, because a complex ICT system is typically vulnerable to many swans, there is a significant probability that at least one swan will occur. Thus, no matter the quality of the risk analysis, swans causing unacceptable impact will occur in complex ICT systems sooner or later, unless the systems are specially designed and operated to limit the impact of rare, unforeseeable events [3, 4, 6, 10].
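A back-of-the-envelope calculation (the numbers are assumptions chosen only for illustration) shows why the probability of at least one swan is significant even when each individual swan is unlikely.

```python
# Assumed figures for illustration: 50 distinct swans, each with a yearly
# probability of 0.1%, over a 20-year operating period, treated as independent.
p_single = 0.001
n_swans = 50
years = 20

p_none = (1.0 - p_single) ** (n_swans * years)
print(f"P(at least one swan in {years} years) = {1.0 - p_none:.2f}")  # about 0.63
```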
To avoid surprising outliers and help ensure event distributions with thin left tails, Chap. 4 proposes four design principles to isolate local failures affecting small parts of systems, thus preventing them from propagating into systemic or global failures affecting complete systems.

2.4 Learning from Small Failures

In an interesting monograph, Sidney Dekker [17] recounts how series of small, seemingly insignificant everyday decisions led to major disasters, including large oil spills and plane crashes. No easily detectable property of the decisions signaled a future disaster. In fact, given the information available at the time, most of the decisions were reasonable when studied in isolation. However, over time, the decisions reduced the diversity and redundancy of the systems and made them steadily more fragile to disasters. This fragilizing process was mainly driven by pressure to use fewer resources and to produce results faster. Some stakeholders contributed to the systems’ fragility by introducing conflicting requirements and regulations, while other stakeholders encouraged risky behavior to reach certain goals, such as producing large quantities of oil.
The accident scenarios described by Dekker [17] further demonstrate that broken parts are not the major reason for disasters in complex adaptive systems. Rather, it is the stakeholders’ inability to cope with the complexity of a system and its changing environment. Lack of understanding, insufficient communication between stakeholders, and pressure to improve a system’s “efficiency” all increase its fragility to disasters. Dekker shows how stakeholders build and operate systems they do not fully understand. While stakeholders grasp the functionality of each part, the huge number of interactions between the many parts and the changing rules and regulations governing the operation of the systems make it impossible for stakeholders to prevent rare catastrophic events.
In summary, man-made complex systems in general and complex ICT systems in particular tend to drift into systemic failure because they become increasingly fragile due to internal and external changes. The drift occurs slowly, with few or no obvious indications of increased fragility before a major incident occurs [2, 17, 35]. Since black and gray swans in complex systems limit the stakeholders’ ability to predict extreme global behavior with a huge negative impact, the stakeholders must analyze local failures (with limited impact) and introduce countermeasures to avoid the increased fragility caused by local failures propagating into global failures. Daniel Kahneman’s pioneering work [36] and a monograph by Michael T. Nygard [35] confirm the discussed limits of prediction and the need to learn from local failures. Since the capacity to detect small failures is crucial for determining vulnerabilities, comprehensive monitoring of a system’s behavior is extremely important to achieve anti-fragility. The goal is not to prevent all failures in an ICT system but to avoid silent failures and to quickly start necessary repairs.
Because systemic failures are most often, but not always, initiated by local failures that propagate due to positive feedback loops, it is possible to prevent many swans by detecting local failures and keeping them from propagating. While not all swans can be prevented, it is possible to make rare events rarer and to reduce their impact. Chapter 4 proposes an operational principle that injects artificial failures into a system to quickly detect vulnerabilities with the potential to cause systemic failures. A team of experts with diverse skill sets should learn from the induced incidents because a team can respond faster and gain more insight than a single individual. All team members should have “skin in the game” [10, Chap. 23]: when the members face the consequences of their actions and suffer failure as well as enjoy success, they become motivated to learn rapidly and not take unwarranted chances. A team of software developers has skin in the game when it is responsible for both the development and operations (DevOps) of its software [37, 38]. Another way of introducing skin in the game is to let team members use their own software as much as possible.
The increasingly popular DevOps methodology emphasizes communication, collaboration, and integration between software developers and information technology operations professionals. DevOps is a response to the interdependence of software development and information technology operations. It facilitates learning from natural and induced failures and encourages software developers to create robust code so they do not have to fix problems at three o’clock in the morning.
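The operational principle of injecting artificial failures is developed in Chap. 4; the sketch below is only a hypothetical illustration in the spirit of that principle, with invented service names and a placeholder termination routine.

```python
# Hypothetical failure-injection loop (illustration only; the services and
# the terminate_instance helper are invented for this sketch).
import random

SERVICES = ["payment", "auth", "catalog", "search"]

def terminate_instance(service: str) -> None:
    # Placeholder: a real system would kill one redundant instance of the
    # service through the platform's management API.
    print(f"terminating one instance of '{service}'")

def inject_random_failure(failure_probability: float = 0.1) -> None:
    # Occasionally kill a random instance so the team can observe whether
    # the failure stays local (tolerable) or starts to propagate.
    if random.random() < failure_probability:
        terminate_instance(random.choice(SERVICES))

if __name__ == "__main__":
    for _ in range(20):  # e.g., run once per scheduling interval
        inject_random_failure()
```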

2.5 An Alternative Justification

We have argued that a complex ICT system exposed to swan incidents must be anti-fragile to the swans’ impacts to thrive over time. According to Taleb [10], the need for anti-fragility can be summarized as follows: Let X be a random variable representing events with some probability distribution (given by a PDF) and let h(X) be another random variable representing the possible impacts, for example, the financial costs to a stakeholder. In practice, we care about h(X) and not X. While it is often hard to change the thick-tail distribution of X, it can be much easier to change the distribution of h(X). Our goal is to ensure that the distribution of h(X) has a thin left tail to avoid intolerable costly outliers (see Fig. 2.2).
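A small simulation (our sketch, not Taleb’s) illustrates the point: the event distribution X keeps its thick tail, but a design that limits how far a failure can spread changes h(X) and thins the tail of the impact distribution. Impact is measured here as a positive cost, so the book’s “thick left tail” corresponds to large cost values.

```python
# Illustration only: capping the worst-case impact thins the tail of h(X)
# even though the underlying event distribution X is unchanged.
import numpy as np

rng = np.random.default_rng(7)
events = rng.pareto(a=1.5, size=1_000_000) + 1.0  # thick-tailed event sizes X

def impact_unconstrained(x):
    return x                   # impact grows with the event: thick tail

def impact_with_isolation(x, cap=20.0):
    return np.minimum(x, cap)  # isolation limits the worst-case impact

for name, h in [("unconstrained", impact_unconstrained),
                ("with isolation", impact_with_isolation)]:
    print(f"{name}: 99.99th percentile of h(X) = {np.percentile(h(events), 99.99):.1f}")
```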
Since a complex adaptive system and its environment change over time, perhaps abruptly, the distribution of h(X) also changes. The left tail of the changing distribution of h(X) is unknown because we do not have sufficient data, that is, the history of the system may not contain any outliers and, even if it did, there is no guarantee that the future of the system will be anything like its past. Hence, an anti-fragile system must prevent local failures from propagating into systemic failures and use local failures to detect and remove vulnerabilities that can lead to systemic failures in the future.
While the discussed approach leads to a thinning of the left tail of h(X), there is no absolute guarantee that a swan will not occur in a complex ICT system. Guaranteed swan-free ICT systems can only be achieved by keeping the systems relatively small to limit their importance and possible negative impact. It may also be necessary to isolate systems from each other; for example, systems with particularly sensitive information should not be connected to the Internet.

2.6 Risk Analyses Ignore Swans

The reader may wonder how classical methods for the risk analysis of ICT systems rate the impact of swans. The short answer is that they mostly ignore swans altogether. This unfortunate tendency partly explains why we continue building ICT systems with tightly interconnected parts, little diversity, and low redundancy that allow local failures to propagate into systemic failures.
Traditionally, analysts evaluate risk by estimating the probability of a threat exploiting a vulnerability and by determining the resulting incident’s negative impact. Analysts often use the values low, medium, and high to approximate the probability and impact, resulting in the five-level risk matrix in Fig. 2.3. The matrix incorrectly classifies a gray swan as a medium risk because it has a low probability and high impact according to the approximations.
As an example, a nationwide outage in a power grid is rated a medium risk despite the outage’s ability to inflict damage in the billions of dollars. Since swans, with their huge impacts, tend to dominate the total risk of complex ICT systems, the use of risk matrices has led to a gross underestimation of the total risk associated with many systems.
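The mechanics can be summarized in a few lines of code. This is a sketch of one common low/medium/high scheme; the exact matrix in Fig. 2.3 is not reproduced here.

```python
# Sketch of a classical five-level risk matrix built from low/medium/high
# approximations of probability and impact (assumed assignment, for illustration).
RISK_MATRIX = {
    ("low", "low"): "very low",   ("low", "medium"): "low",       ("low", "high"): "medium",
    ("medium", "low"): "low",     ("medium", "medium"): "medium", ("medium", "high"): "high",
    ("high", "low"): "medium",    ("high", "medium"): "high",     ("high", "high"): "very high",
}

def classify(probability: str, impact: str) -> str:
    return RISK_MATRIX[(probability, impact)]

# A gray swan is approximated as low probability and high impact and is
# therefore rated only "medium", no matter how large the impact really is.
print(classify("low", "high"))
```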
The underlying problem is that risk matrices of the type depicted in Fig. 2.3 implicitly assume that the distribution of the impact h(X) has a thin left tail. Since the probabilities of nonrecurrent outliers or swans are assumed to be so small that the incidents can be ignored, the risk matrix only represents recurrent incidents with larger probabilities and smaller impacts than those of swans. However, a complex adaptive ICT system with many tightly connected parts is very likely to have an h(X) distribution with a thick left tail, making it dangerous to use the risk matrix in Fig. 2.3 because it excludes the possibility of swans.

2.7 Understanding and Reducing Risk

An interesting video exists (https://www.youtube.com/watch?v=MKcZtvwch1w) of the late Peter L. Bernstein discussing risk. According to Bernstein, we talk about risk when we do not know what will happen. Risk simply means that more things can happen than will happen. Since this book focuses on swan incidents, we use a more specific and narrow definition of risk. Consider a group of one or more stakeholders with interests in a complex adaptive ICT system. We define the risk associated with the group of stakeholders as the largest negative impact of all incidents that can happen to the group during a fixed period. How the impact is actually measured depends on the system and the interests of the stakeholders. Impact is commonly measured in terms of financial loss. Note that our definition of risk is not based on the probability of an incident. Because the definition of risk is tailored to the book’s focus on intolerable incidents, it may not be the best choice in other settings.
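The definition can be stated compactly in code. This is a minimal sketch with an invented incident list; impact is measured as financial loss.

```python
# Sketch of the book's risk definition: the largest negative impact of all
# incidents that can happen to the stakeholders during a fixed period.
from dataclasses import dataclass

@dataclass
class Incident:
    description: str
    impact: float  # negative impact measured as financial loss in dollars

def risk(possible_incidents: list[Incident]) -> float:
    # Note: the probability of each incident plays no role in this definition.
    return max(incident.impact for incident in possible_incidents)

period_incidents = [  # invented examples
    Incident("routine outage", 50_000),
    Incident("data-center failure", 2_000_000),
    Incident("nationwide service outage", 500_000_000),
]
print(f"risk to the stakeholders: ${risk(period_incidents):,.0f}")
```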
Risk is a consequence of dependence [31]. A part (or system) \(\mathcal{X}\) depends on another part (system) \(\mathcal{Y}\) if a failure in \(\mathcal{Y}\) negatively affects the functionality of \(\mathcal{X}\). The main sources of risk in an ICT system are the dependencies between its parts, which create positive feedback loops that in turn cause local failures to propagate into global failures. In general, the growing number of dependencies in increasingly complex systems causes incidents impacting stakeholders to become less frequent, because the systems become better at handling recurrent incidents over the normal operating range. However, at the same time, the impacts of nonrecurrent incidents are increasing due to the positive feedback loops propagating (combinations of) rare local events outside the normal operating range.
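A minimal example (the dependency graph is invented) shows how a single local failure spreads to every part that depends on it, directly or indirectly.

```python
# Illustration only: propagation of a local failure along dependencies.
DEPENDS_ON = {  # part X depends on the parts listed: a failure there affects X
    "web_frontend": {"auth", "catalog"},
    "catalog": {"database"},
    "auth": {"database"},
    "reporting": {"database"},
    "database": set(),
}

def affected_by(failed_part: str) -> set[str]:
    # Fixed-point iteration: keep adding parts that depend on an affected part.
    affected = {failed_part}
    changed = True
    while changed:
        changed = False
        for part, deps in DEPENDS_ON.items():
            if part not in affected and deps & affected:
                affected.add(part)
                changed = True
    return affected

print(affected_by("database"))  # one local failure affects almost the whole system
```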
In Taleb’s [9] terminology, while incidents affecting stakeholders are becoming less frequent, gray and black swans occur more often in ICT systems with tight internal integration as their complexity grows. Since it is hard to determine all the dependencies of complex systems, the probability of swans in complex ICT systems is underestimated, causing intolerable impacts because most stakeholders are not prepared for swans.
As the risk of recurrent incidents is reduced and the intervals between incidents grow longer, the belief that complex ICT systems are “safe” also grows, creating a situation (actually a feedback loop) in which stakeholders build increasingly complex systems with tightly integrated parts [31]. To counter this development and reduce the risks to the stakeholders, it is necessary to create ICT systems with only tolerable failures. Since the causes of swans are, at best, hard to predict, it is necessary to limit the impact of incidents, even though we have no a priori knowledge of their causes.

2.8 Taleb’s Four Quadrants

Following Taleb [11, 12], we create a map to classify the negative impact of different failures in complex adaptive ICT systems. We again represent the impact of events in a complex adaptive ICT system by a continuous random variable with a particular PDF. Furthermore, we discriminate between two types of negative impacts, namely, local and global impacts. Some systems only permit the local impact of failures, while other systems allow local failures to propagate and create a global (systemic) impact. The PDF of the local or global impact has a thin or thick left tail.
The four quadrants of the map in Fig. 2.4 represent the four possible combinations of local and global impacts and thin and thick tails. The quadrants represent four classes of complex ICT systems with very different extreme behaviors. The map shows where classical risk analysis works well and where it is of questionable use and can lead to the gross underestimation of the risk by ignoring swans in the form of rare outliers with an intolerable negative impact.
A system in the first quadrant in Fig. 2.4 is very safe. It only experiences local failures with limited impact because the PDF of the local impact has a thin left tail. Unfortunately, it seems that today’s complex ICT systems are not in this quadrant. The second quadrant is also a fairly safe place for a system. Global failures may occur, but the global impact is tolerable due to the thin left tail of the PDF. Systems in the third quadrant only experience local failures, but these can have a relatively large impact because the PDF of the local impact has a thick left tail. Hence, rigorous risk management is needed.
Systems in the fourth quadrant must be avoided because they are vulnerable to gray and black swans with an intolerable impact. While the probability of a single swan is small, ICT systems in the fourth quadrant are usually vulnerable to many swans, making it inevitable that one will occur sooner or later. As explained in Sects. 2.3 and 2.6, classical risk analysis cannot handle nonrecurrent swans in the fourth quadrant.
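The four classes above can be summarized as a small lookup (the dictionary encoding is ours; the quadrant semantics follow the text and Fig. 2.4).

```python
# Sketch of the four-quadrant map: a system is placed by the scope of its
# failures (local or global) and the thickness of the left tail of its impact PDF.
QUADRANT = {
    ("local", "thin"): 1,    # very safe: only local failures with limited impact
    ("global", "thin"): 2,   # fairly safe: global failures with tolerable impact
    ("local", "thick"): 3,   # local failures with large impact: rigorous risk management
    ("global", "thick"): 4,  # vulnerable to gray and black swans: must be avoided
}

def classify_system(impact_scope: str, left_tail: str) -> int:
    return QUADRANT[(impact_scope, left_tail)]

print(classify_system("global", "thick"))  # 4: where classical risk analysis breaks down
```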
We want to develop and operate complex adaptive ICT systems where all failures are local with limited impact, that is, we want the systems to fall in the first quadrant in Fig. 2.4. However, since we will not succeed in limiting absolutely all failures of complex national and international ICT infrastructures, these systems will more likely end up in the second or third quadrant, which is also acceptable as long as we avoid swans with an intolerable impact in the fourth quadrant.

2.9 Discussion and Summary

If we consider a complex adaptive ICT system over a period of, say, 20 years, then normal incidents will occur repeatedly during the period. Hence, these recurrent incidents should become less and less surprising to the system’s stakeholders. The same is not true for gray and black swans. Because swans are so rare, they will not occur multiple times over the considered period. Consequently, swans are, at best, very hard to predict, since there is little or nothing in the system’s history to signal their future occurrence. However, since complex ICT systems are vulnerable to many swans, the probability that at least one swan will occur is too large to be ignored.
Given a set of stakeholders, a complex ICT system is fragile to a particular type of negative impact if a possible large impact is unacceptable to some stakeholders in the set and robust if all possible impacts are acceptable to all stakeholders. It is not enough for complex ICT systems to be robust, because internal and external changes fragilize complex systems over time, making them increasingly vulnerable to large-impact events, including swans. Since we cannot hope to predict all negative events that can significantly impact complex ICT systems, we must build systems that limit the impact of incidents of unknown origin and learn from events with a small negative impact how to limit the impact of all incidents. The resulting ICT systems are anti-fragile when they manage to reduce and maintain acceptable impacts to all stakeholders.
Stochastic modeling is much used in many research areas, particularly in modern financial theory. Financial models are very often based on PDFs with thin tails, leading to a gross underestimation of the risks associated with the economic processes being modeled. To better understand the devastating consequences of using the wrong stochastic models, the reader should consult the books of Pablo Triana [39] and Benoit Mandelbrot and Richard Hudson [40]. Both argue that standard financial models have led investors to take on huge hidden risks with ruinous consequences. Together, Taleb [8–11], Triana, Mandelbrot, and Hudson illustrate the folly of trying to predict extreme global behavior in complex adaptive systems of global importance.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution-Noncommercial 2.5 License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
The images or other third party material in this book are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.
Metadata
Title: Achieving Anti-fragility
Author: Kjell Jørgen Hole
Copyright year: 2016
DOI: https://doi.org/10.1007/978-3-319-30070-2_2