In general, an explanation provides an answer to a why question. In classification tasks, this question becomes: why was the classification made? Answering this question has ignited a whole research field [2, 5, 21, 24, 38]. In this section, we will discuss the main approaches that relate to our setting.
Multiple explanation methods for (image) classification have been proposed in the literature. A distinction can be made between global explanations, which apply to a model in general, and instance-level explanations, which focus on isolated model predictions [37]. In this paper, we focus on the latter. The main approaches to explain individual instances' predictions are feature importance and counterfactual methods, which will be discussed briefly next.
2.1 Feature importance methods
Feature importance methods provide a ranked list of the features that are deemed most important for the prediction made on a given instance. For an image explanation, this corresponds to showing the parts of the image (pixels or segments) that have contributed the most to the prediction. LIME [41] and the model-agnostic implementation of SHAP [36] are popular instance-level feature importance methods. They return a set of features (segments) together with the coefficients of a linear model that is fit around the instance to be explained and thus approximates the predictions of the actual model around that point. Although these methods show the contribution of each feature to the overall prediction, they do not use the decision boundary (and hence provide no counterfactual), thereby losing the ability to understand what needs to be changed in order to receive a desired outcome. Other drawbacks include the number of features in an explanation, which needs to be set by the user, and the randomization component in the method (the generation of random data points around the instance to be explained), which leads to unstable results [4]: running the explanation method twice for a given instance and prediction model can yield two different explanations. It also does not make the influence of interactions between features clear [19], as it uses a linear approximation. Finally, the computational time to generate explanations can be very large: for example, Lapuschkin [30] reports around 10 minutes of computation time to generate a LIME explanation for a single prediction of the GoogLeNet image classifier. That being said, these methods do offer valuable insights into an individual prediction (as demonstrated by their popularity) and are model-agnostic.
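To make the local-surrogate idea concrete, the minimal sketch below fits a ridge regression on randomly masked segment patterns, in the spirit of LIME. The `predict_proba` stand-in, the mean-value imputation used to "switch off" segments, and the omission of LIME's proximity weighting are simplifying assumptions of ours, not part of the original method.

```python
import numpy as np
from sklearn.linear_model import Ridge

def predict_proba(images):
    """Stand-in for a black-box scoring function: batch of images -> class probabilities."""
    return np.random.rand(len(images), 2)

def lime_like_explanation(image, segments, class_of_interest=1, n_samples=500):
    """Fit a local linear surrogate over on/off segment masks (simplified LIME-style sketch)."""
    n_seg = segments.max() + 1
    # Randomly switch segments on/off around the instance to be explained.
    masks = np.random.randint(0, 2, size=(n_samples, n_seg))
    perturbed = []
    for mask in masks:
        img = image.copy()
        for seg_id in np.where(mask == 0)[0]:
            img[segments == seg_id] = image.mean()  # "remove" a segment by mean imputation
        perturbed.append(img)
    scores = predict_proba(np.array(perturbed))[:, class_of_interest]
    surrogate = Ridge(alpha=1.0).fit(masks, scores)
    return surrogate.coef_  # one importance coefficient per segment
```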
Other feature importance methods can be used to create visual heat maps on top of the pixels. Occlusion is a first general strategy: it measures the influence of each pixel by masking regions and assessing the impact on the output score [52, 53]. A second approach is taken by Bach et al. [6], who introduce Layer-Wise Relevance Propagation (LRP) as a model-specific method to create instance-level explanations for neural networks. A third approach calculates the gradient of the prediction function at the instance to be explained, which indicates the importance of each pixel/feature in the prediction score [44]. The latter two approaches require access to the model weights and can therefore not be used when the prediction model is only available as a scoring function, without access to the model internals. Additionally, an important disadvantage of pixel-wise heat map methods is the low abstraction level of the explanations [42]. Since individual pixels are meaningless to humans, it is not always straightforward to derive interpretable concepts from such maps.
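As a sketch of the occlusion strategy, the snippet below slides a grey patch over the image and records the drop in the score of the class of interest; the patch size, the mean-value fill, and the `predict_proba` scoring function are illustrative assumptions rather than the exact procedure of [52, 53].

```python
import numpy as np

def occlusion_heatmap(image, predict_proba, target_class, patch=8):
    """Mask square regions one at a time and record the drop in the target-class score."""
    h, w = image.shape[:2]
    base = predict_proba(image[None])[0, target_class]
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()  # grey out one region
            score = predict_proba(occluded[None])[0, target_class]
            heat[i // patch, j // patch] = base - score  # large drop = important region
    return heat
```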
In general, a larger issue with feature importance methods is what they actually explain. Fernandez et al. [19] argue that feature importance rankings do not explain a classification, but rather a prediction score. End users typically wish to understand why a certain impactful decision has been made. And while data scientists often focus on prediction scores (cf. the popularity of the ROC curve and AUC), they too wish to understand certain classifications (instead of prediction scores), to answer questions such as: why was this image misclassified? That brings us to counterfactual explanations, which explain a classification made on a data instance by a prediction model.
2.2 Counterfactual reasoning for image classification
Many authors in the fields of philosophy and cognitive science have raised the importance of contrastive explanations [35, 38]. Martens and Provost [37] were the first to apply this idea to predictive modeling, in the context of document classification, and have sparked a large set of novel counterfactual methods [9, 10, 12, 19, 29, 40, 51]. Apart from their contrastiveness, counterfactual explanations have other benefits. It is argued that they are more likely to comply with recent regulatory developments such as the GDPR. Wachter et al. [51] state that counterfactual explanations are well suited to fill three important needs of data subjects: information on how a decision was reached, grounds to contest adverse decisions, and an idea of what could be changed to receive a desired outcome. Moreover, formulating an explanation as a set of features does not put constraints on model type and complexity [7], which should make it robust to developments in modeling techniques. Finally, the explanation can be given without disclosing the entire model [7], which allows companies to provide only the necessary information without revealing trade secrets.
Several authors have used approaches that are closely related to counterfactual reasoning for image classification. Adversarial example methods, for example, aim at finding very small image perturbations that lead to false classifications [22, 47, 48]. This has proven useful to protect a classifier against attempts to deceive it. However, since the perturbations found are often too small to be visible to humans (in extreme cases only a single pixel), they cannot be used as an interpretable counterfactual explanation.
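For illustration, the sketch below applies a one-step gradient-sign perturbation (in the style of FGSM) to a toy PyTorch model; the model, input, and epsilon are placeholders of ours, and the gradient access it requires is exactly what a purely score-based, black-box setting would not provide.

```python
import torch
import torch.nn as nn

# Toy stand-in classifier; in practice this would be any differentiable image model.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

def fgsm_perturbation(image, true_label, epsilon=0.01):
    """One-step FGSM: nudge every pixel slightly in the direction that increases the loss."""
    image = image.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), true_label)
    loss.backward()
    return (image + epsilon * image.grad.sign()).detach()

x = torch.rand(1, 3, 32, 32)      # placeholder "image"
x_adv = fgsm_perturbation(x, torch.tensor([3]))
print((x_adv - x).abs().max())    # perturbation bounded by epsilon, typically imperceptible
```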
Other authors explicitly used counterfactual explanations for explainability in image classification. These papers are summarized in Table 1 in terms of the following dimensions. We also include our novel approach, SEDC(-T), which will be further described in Sect. 3.
1. Model-agnostic (MA): does the method work without access to model internals?
2. Training data-agnostic (TA): does the method work without access to training data?
3. Abstraction level (AL): what is the granularity of the features?
4. Addition of evidence (AE): is purposefully adding evidence allowed?
5. Explanation focus (EF): does the explanation focus on the changes or the counterfactual?
Table 1 Summary of counterfactual explanation methods: reported dimensions are whether the method is model-agnostic (MA), whether the method is training data-agnostic (TA), the abstraction level (AL), whether adding evidence is allowed (AE), and whether the explanation focuses on the changes or the resulting counterfactual (EF)

| Method | MA | TA | AL | AE | EF |
| --- | --- | --- | --- | --- | --- |
|  | No | No | Pixel | Yes | Changes |
|  | No | No | Pixel | Yes | Counterfactual |
|  | No/Yes | No | Pixel | Yes | Counterfactual |
|  | No | No | Conceptual (textual) | Yes | Counterfactual |
|  | No | No | Conceptual | Yes | Counterfactual |
|  | No | No | Conceptual | No | Changes |
| SEDC(-T) | Yes | Yes | Conceptual | No | Changes |
The following four important observations can be made from the literature overview. First, there is quite some ambiguity and vagueness regarding the terminology used in the context of counterfactual explanations for image classification. Initial work on counterfactual explanations focuses on representing the changes that must be applied to alter a classification. Because it was first used for models based on textual and traditional data, an important advantage is reducing the typically large feature space to a smaller and more interpretable set of features. In the context of explaining image classifications, however, some authors use the modified (counterfactual) image as an explanation [23, 28, 49], while others focus on representing the necessary changes between the instance to be explained and the counterfactual [3, 26]. A possible reason for only considering the counterfactual as an explanation is that the necessary changes themselves are not interpretable. For instance, only showing the pixels that change the classification of digits [28, 49] or only showing a part of an image belonging to the counterfactual class [23] would clearly not suffice as an interpretable explanation. Since all authors refer to generating counterfactual explanations, we want to make a clear distinction between the necessary changes to alter an image classification (the evidence counterfactual or EdC, which we pronounce as 'Ed See') and the counterfactual image resulting from these changes (the counterfactual). Our approach aims at finding an explanation that identifies the changes that are necessary to alter a classification.
Second, pixel-level explanations have their merits in toy examples, for example when classifying digits, but quickly lose their interpretability in real-life applications. This might be a reason why these pixel-explanation methods are typically tested on relatively simple datasets and classification tasks, such as MNIST [32]. However, we see a shift towards explanations at a higher abstraction level. In line with this, we use image segments to compose a conceptual counterfactual explanation and test our approach on a broad classification task and dataset (a sample of ImageNet data [27]).
Third, existing counterfactual approaches allow the purposeful addition of evidence to the image, e.g., adding parts of an image belonging to the counterfactual class [23] or adding concepts to the image that support the counterfactual class [3]. This leads to an EdC containing evidence that is not actually present in the original image, which seems rather counter-intuitive in the context of images. It can also be argued that this does not necessarily lead to interpretable explanations (e.g., is mixing two types of animals in one image semantically clear?) or that the explanation is not necessarily useful (e.g., any image can be turned into a zebra prediction by simply adding a zebra to the image). Therefore, we limit our search for explanations to the removal of evidence, which results in EdCs that only contain evidence present in the image of interest.
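The greedy sketch below illustrates this removal-only idea: grey out the segment that most lowers the predicted-class score until the classification flips. It is only a simplified illustration under assumptions of ours (a black-box `predict_proba` scoring function and mean-value imputation for removed segments); the actual SEDC(-T) search procedure is described in Sect. 3.

```python
import numpy as np

def removal_explanation(image, segments, predict_proba, target_class, max_removed=10):
    """Greedily grey out segments until the image is no longer classified as target_class."""
    current = image.copy()
    fill = image.mean()                      # simple imputation for "removed" regions
    removed = []
    for _ in range(max_removed):
        if predict_proba(current[None])[0].argmax() != target_class:
            break                            # class changed: removed segments form the EdC
        candidates = [s for s in range(segments.max() + 1) if s not in removed]
        if not candidates:
            break
        # Pick the segment whose removal lowers the target-class score the most.
        def score_without(seg):
            trial = current.copy()
            trial[segments == seg] = fill
            return predict_proba(trial[None])[0, target_class]
        best = min(candidates, key=score_without)
        current[segments == best] = fill
        removed.append(best)
    return removed, current                  # removed segments and the counterfactual image
```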
Fourth, most methods require access to both the model internals (as most methods are not model-agnostic) and the training data. In practical applications, this is often not feasible. Many companies use classification models built by external vendors, e.g., Google Cloud Vision, Amazon Rekognition and the Computer Vision service in Microsoft Azure. Even if the model itself were open source (which is often not the case), the training data are rarely available, as they are either too large to share efficiently or considered part of the vendor's proprietary assets. This implies that in these cases, the previously proposed counterfactual methods cannot be used. Moreover, model-agnostic methods have a wide(r) applicability as they are not limited to specific model types or architectures. From an academic point of view, such approaches are arguably also more likely (or at least easier) to be re-used and built upon by other researchers. Our approach aims to fill this important gap in the literature by proposing the first model-agnostic counterfactual explanation method for image classification, based only on the black-box model and the image of interest.