1 Introduction
2 Related works
3 Explanation-based taxonomy
- Intrinsically (IN) explainable methods are explainable-by-design methods that return a decision, and the reasons for the decision are directly accessible because the model is transparent.
- Post-Hoc (PH) explanation methods provide explanations for a black-box model.
- Global (G) explanation methods aim at explaining the overall logic of a black-box model. Therefore, the explanation returned is a global, complete explanation valid for any instance;
- Local (L) explainers aim at explaining the reasons for the decision of a black-box model on a specific instance.
- Model-Agnostic (A) explanation methods can be used to interpret any type of black-box model;
- Model-Specific (S) explanation methods can be used to interpret only a specific type of black-box model.
- Transparency (Arrieta et al. 2020), or equivalently understandability or intelligibility, is the capacity of a model to be interpretable by itself. Thus, the model allows a human to directly understand its internal mechanism and its decision process.
- Complexity (Doshi-Velez and Kim 2017) is the degree of effort required by a user to comprehend an explanation. The complexity can take into account the user's background or possible time limitations affecting the understanding.
4 Evaluation measures for explanations
- Fidelity aims to evaluate how well f mimics b. There are different implementations of fidelity, depending on the type of explanator under analysis (Guidotti et al. 2019a). For example, in methods that create a surrogate model g to mimic b, fidelity compares the predictions of b and g on the instances used to train g. A minimal sketch of this and of the following two measures is reported after this list.
- Stability aims at validating whether similar instances obtain similar explanations. Stability can be evaluated through the Lipschitz constant (Alvarez-Melis and Jaakkola 2018) \( L_x = \max_{x' \in \mathcal{N}_x} \frac{\left\Vert e_{x} - e_{x'}\right\Vert }{\left\Vert x - x'\right\Vert } \), where x is the instance, \(e_x\) its explanation, and \(\mathcal{N}_x\) a neighborhood of instances similar to x.
- Deletion and Insertion (Petsiuk et al. 2018) are metrics that remove the features that the explanation method f found important and observe how the performance of b degrades. The intuition behind deletion is that removing the "cause" will force the black-box to change its decision. Among the deletion methods there is Faithfulness (Alvarez-Melis and Jaakkola 2018), which aims to validate whether the relevance scores indicate true importance: we expect higher importance values for attributes that greatly influence the final prediction. Given a black-box model b and the feature importance e extracted from an importance-based explanator f, the faithfulness method incrementally removes each of the attributes deemed important by f. At each removal, the effect on the performance of b is evaluated: in general, a sharp drop and a low area under the probability curve mean a good explanation. The insertion metric takes the complementary approach. Typically, insertion and deletion evaluations are tailored for specific types of explainers, namely Feature Importance explainers for tabular data, Saliency Maps for image data, and Sentence Highlighting for text data.
- Monotonicity (Luss et al. 2021) can be seen as an implementation of an insertion method: it evaluates the effect on b of incrementally adding each attribute in order of increasing importance. In this case, we expect the black-box performance to increase as more and more features are added, resulting in monotonically increasing model performance.
- Running Time: the time needed to produce the explanation is also an important evaluation criterion.
- Functionally-grounded metrics aim to evaluate interpretability by exploiting formal definitions that are used as proxies. They do not require humans for validation. The challenge is to define the proxy to employ, depending on the context. As an example, we can validate the interpretability of a model by showing its improvements w.r.t. another model already proven to be interpretable by human-based experiments.
- Application-grounded evaluation methods require human experts to validate the specific task under analysis (Williams et al. 2016; Suissa-Peleg et al. 2016). They are usually employed in specific settings. For example, if the model is an assistant in the decision-making process of doctors, the validation is done by the doctors.
- Human-grounded metrics evaluate the explanations through humans who are not experts. The goal is to measure the overall understandability of the explanation in simplified tasks (Lakkaraju et al. 2016; Kim et al. 2015). This validation is most appropriate for testing general notions of the quality of an explanation.
| Type | Name | Ref. | Data type | IN/PH | G/L | A/S |
|---|---|---|---|---|---|---|
| FI | lrp | Bach et al. (2015) | ANY | PH | L | A |
|  | lime | Ribeiro et al. (2016) | ANY | PH | L | A |
|  | shap | Lundberg and Lee (2017) | ANY | PH | G/L | A |
|  | maple | Plumb et al. (2018) | TAB | PH/IN | L | A |
|  | ebm | Nori et al. (2019) | TAB | IN | G/L | A |
|  | nam | Agarwal et al. (2021) | TAB | IN | L | S |
|  | ciu | Anjomshoae et al. (2020) | TAB | PH | L | A |
|  | eem | Chowdhury et al. (2022) | TAB | PH | G | A |
|  | dalex | Lipovetsky (2022) | ANY | PH | G/L | A |
| RB | trepan | Craven and Shavlik (1995) | TAB | PH | G | S |
|  | msft | Chipman et al. (1998) | TAB | PH | G | S |
|  | cmm | Domingos (1998) | TAB | PH | G | S |
|  | dectext | Boz (2002) | TAB | PH | G | S |
|  | sta | Zhou and Hooker (2016) | TAB | PH | G | S |
|  | scalable-brl | Yang et al. (2017) | TAB | IN | G/L | A |
|  | lore | Guidotti et al. (2019a) | TAB | PH | L | A |
|  | rulematrix | Ming et al. (2019) | TAB | PH | G/L | A |
|  | anchor | Ribeiro et al. (2018) | ANY | PH | G/L | A |
|  | glocalx | Setzu et al. (2019) | TAB | PH | G/L | A |
|  | skoperule | Friedman and Popescu (2008) | TAB | PH | G/L | A |
| PR | ps | Bien and Tibshirani (2011) | TAB | IN | G/L | S |
|  | mmd-critic | Kim et al. (2016) | ANY | IN | G | S |
|  | protodash | Gurumoorthy et al. (2019) | ANY | IN | G | A |
|  | tsp | Tan et al. (2020) | TAB | PH | L | S |
| CF | cem | Dhurandhar et al. (2018) | ANY | PH | L | S |
|  | cfx | Albini et al. (2020) | TAB | PH | L | S |
|  | dice | Mothilal et al. (2020) | TAB | PH | L | A |
|  | c-chave | Pawelczyk et al. (2020) | TAB | PH | L | A |
|  | face | Poyiadzi et al. (2020) | ANY | PH | L | A |
|  | Ares | Ley et al. (2022) | TAB | PH | G | A |
5 Explanations for tabular data
The experiments in this section are performed on the adult and german datasets.

5.1 Feature importance
We report lime explanations for adult (plots a/b) and german (plots c/d): we predicted the same record using LG and CAT, and then we explained it. Interestingly, for adult (plots a/b), lime considers a similar set of features as important for the two models (even if with different importance values): out of 6 features, only one differs. A different scenario is obtained applying lime on german (plots c/d): different features are considered important by the two classifiers. However, the confidence of the prediction differs between the two classifiers: both predict the output correctly, but CAT does so with a higher value, suggesting that this could be the cause of the differences between the two explanations.
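For reference, producing a lime explanation like the ones discussed above takes a few lines. This is a minimal sketch assuming a fitted scikit-learn-style classifier `model`, a training matrix `X_train`, and illustrative feature and class names:

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,          # e.g. ["Age", "Relationship", ...]
    class_names=["<=50k", ">50k"],
    discretize_continuous=True,
)
# Explain one record x of the test set with the 6 most important features.
exp = explainer.explain_instance(x, model.predict_proba, num_features=6)
print(exp.as_list())                      # (feature condition, importance) pairs
```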
We report shap explanations for adult through the force plot, which shows how each feature contributes to pushing the output value away from the base value, i.e., the average of the output values over the training dataset. The red features push the output value higher, while the blue ones push it lower. For each feature, the actual value for the record under analysis is reported. Only the features with the highest shap values are shown in this plot. In the first force plot, the features pushing the value higher contribute most to the output value: from a base value of 0.18, an actual output value of 0.79 is reached. In the force plot on the right, the output value is 0.0, and Age, Relationship, and Hours Per Week contribute to pushing it lower. Figure 4 (left and center) depicts the shap values through decision plots: the contributions of all the features are reported in decreasing order of importance. The line represents the feature importance for the record under analysis, and it starts at its actual output value. In the first plot, predicted as \(>50k\), Occupation is the most important feature, followed by Age and Relationship. For the second plot, Age, Relationship, and Hours Per Week are the most important ones. shap also offers a global interpretation of the model driven by the local interpretations. Figure 4 (right) reports a global decision plot that represents the feature importance of 30 records of adult.
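The plots above can be reproduced with shap's plotting API. A minimal sketch, assuming a tree-based model (e.g., CAT or XGB) and a pandas DataFrame X; for multi-class models, shap_values is a list with one array per class and one entry must be selected:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # per-feature contributions

# Local force plot: features pushing the output away from the base value
# (shap.initjs() is needed to render these plots in a notebook).
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])

# Local decision plot: contributions in decreasing order of importance.
shap.decision_plot(explainer.expected_value, shap_values[0], X.iloc[0])

# Global decision plot over 30 records, as in the figure described above.
shap.decision_plot(explainer.expected_value, shap_values[:30], X.iloc[:30])
```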
We also applied dalex on adult. On the left, two explanations are reported for a record classified as \(> 50k\); on the right, for one classified as \(< 50k\). On the top, there is a visualization based on Shapley values, which highlights the feature Age (35 years old) as most important, followed by Occupation. At the bottom, there is a Breakdown plot, in which the green bars represent positive changes in the mean predictions, while the red ones represent negative changes. The plot also shows the intercept, which is the overall mean value of the predictions. It is interesting to see that Age and Occupation are the most important features that positively contributed to the prediction in both plots. In contrast, Sex is positively important for the Shapley values but negatively important for the Breakdown plot. In this case, there are important differences in the features considered most important by the two methods: for the Shapley values, Age and Relationship are the two most important features, while in the Breakdown plot Hours Per Week is the most important one.

Feature importance explainers, due to the effort required to understand the explanation, may be better suited for domain experts who know the meaning of the features employed, while they may be too difficult for ordinary end-users, especially when obtaining such importance values is complex.
5.2 Rule-based explanation
Consider, as an example, two anchor rules extracted for adult. The first rule has a high precision (0.96) but a very low coverage (0.01). It is interesting to note that the first rule contains Relationship and Education Num, the features highlighted by most of the explainers analyzed so far. In particular, in this case, for a classification \(>50k\), the Relationship should be husband and the Education Num at least a bachelor's degree. Education Num can also be found in the second rule, in which case it has to be less than or equal to College, followed by the Marital Status, which can be anything other than married with a civilian. This rule has an even better precision (0.97) and suitable coverage (0.37). A runnable sketch of this kind of rule extraction is reported after the next paragraph.

Local rule-based explainers produce logical rules, which are close to human reasoning and make them suitable for non-experts.

Global tree-based explainers One of the most popular ways to generate explanation rules is by extracting them from a decision tree: due to the method's simplicity and interpretability, decision trees are used to explain the overall behavior of black-box models. Some explanation methods acting in this setting are model-specific explainers exploiting structural information of the black-box model under analysis. TREPAN (Craven and Shavlik 1995) is a model-specific global explainer tailored for neural networks. Given a neural network b, trepan generates a decision tree g that approximates the network by maximizing the gain ratio and the model fidelity. In particular, to leverage abstraction, trepan adopts n-of-m decision rules, in which only n out of m conditions must be satisfied for the rule to fire. DecText is a global model-specific explainer tailored for neural networks (Boz 2002). dectext resembles trepan, with the difference that it considers four different splitting methods. Moreover, it also considers a pruning strategy based on fidelity to reduce the size of the final explanation tree. In this way, dectext can maximize the fidelity while keeping the model simple. Both trepan and dectext are presented as model-specific explainers, but they can practically be employed to explain any black-box, as they do not use any internal information of neural networks. MSFT (Chipman et al. 1998) is a global, post-hoc, model-specific explainer for random forests that returns a decision tree. msft is based on the observation that, even if random forests contain hundreds of different trees, the trees are quite similar, differing only in a few nodes. Hence, it adopts dissimilarity metrics to summarize the random forest trees using a clustering method. Then, for each cluster, an archetype is retrieved as an explanation. CMM, Combined Multiple Model procedure (Domingos 1998), is another global, post-hoc, model-specific explainer for tree ensembles. The key point of cmm is data enrichment. Given an input dataset X, cmm first modifies it n times and learns a black-box on each of the n variants of the dataset. Random records are then generated and labeled using a bagging strategy on the black-boxes. In this way, the authors were able to increase the size of the dataset used to build the final decision tree. STA, Single Tree Approximation (Zhou and Hooker 2016), is another global, post-hoc, model-specific explainer tailored for random forests. In sta, the decision tree is constructed by exploiting test hypotheses on the trees in the forest to find the best splits.

Global tree-based explainers produce a transparent model allowing the understanding of the general behavior of the black-box. The actual ease of understanding of the explanation depends on the complexity of the tree.
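Returning to the anchor rules at the beginning of this section, this is the sketch referenced there: a minimal example using the anchor-exp package by the anchor authors, assuming a fitted classifier `model`, a training matrix `X_train`, and illustrative feature and class names:

```python
from anchor import anchor_tabular

explainer = anchor_tabular.AnchorTabularExplainer(
    class_names=["<=50k", ">50k"],
    feature_names=feature_names,
    train_data=X_train,
)
# Find a rule that "anchors" the prediction of x with precision at least 0.95.
exp = explainer.explain_instance(x, model.predict, threshold=0.95)

print("anchor:", " AND ".join(exp.names()))
print("precision:", exp.precision(), "coverage:", exp.coverage())
```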
Global rule-based explainers In this section, we present global explainers that extract not decision trees but lists or sets of rules as a global interpretable model. The majority of the methods described in the following extract rules by exploiting ensemble methods or rule-based classifiers. The explainers considered are all agnostic. SkopeRules is a global, post-hoc, model-agnostic explainer based on the rulefit (Friedman and Popescu 2008) idea of defining an ensemble method and then extracting the rules from it. skope-rules employs fast algorithms such as bagging or gradient-boosted decision trees. After extracting all the possible rules, skope-rules removes rules that are redundant or too similar according to a similarity threshold. Differently from rulefit, the scoring method does not rely on L1 regularization: the weights are instead assigned depending on the precision score of the rule. A usage sketch of skope-rules as a global surrogate is given below. Scalable-BRL (Yang et al. 2017) is an interpretable rule-based model that optimizes the posterior probability of a Bayesian hierarchical model over the rule lists. The theoretical part of this approach is based on Letham et al. (2015). GLocalX (Setzu et al. 2021) is a global, model-agnostic, post-hoc explainer that adopts the local-to-global paradigm, i.e., it derives a global explanation by subsuming local logical rules. GLocalX starts from an array of local explanation rules and follows a hierarchical bottom-up approach, merging similar rules expressing the same conditions. This small section comprises global explanation methods that extract rules in entirely different ways: they exploit an ensemble method (skope-rules), a rule-based model (scalable-BRL), or several local explanations (GLocalX). In terms of goodness of explanations, skope-rules and scalable-BRL are tailored for an overall explanation of the machine learning model, focusing mostly on the input data. GLocalX, instead, exploits local explanations and hence tackles the problem from a different point of view, merging several local explanations. The output of these methods is a list of rules, and even if there are techniques to filter out meaningless rules, the complexity of the explanation produced may be huge.

Global rule-based explainers produce sets of rules describing the overall behavior of the model for each target class. Depending on the filters applied, the list of rules extracted may be long and difficult to understand.
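As referenced above, a minimal sketch of skope-rules used as a global surrogate: fitting it on the black-box predictions rather than the true labels makes the extracted rules describe the model's behavior. Names and thresholds are illustrative.

```python
from skrules import SkopeRules

surrogate = SkopeRules(
    feature_names=feature_names,
    precision_min=0.40,       # minimum precision, as in the experiments reported below
    recall_min=0.01,
)
# Fit on the black-box predictions so the rules mimic the model, not the data labels.
surrogate.fit(X_train, model.predict(X_train))

for rule, performance in surrogate.rules_[:3]:   # top extracted rules with their scores
    print(rule, performance)
```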
Rules-based explainers comparison In this section, we presented a great variety of methods that provide logical rules as explanations by exploiting different strategies. Independently of the strategy, due to the simplicity of rules, they are often the preferred explanation for non-expert users. The majority of the explainers presented in this section are based on the extraction of decision trees as surrogate models (lore, trepan, cmm, sta, dectext, msft) or of ensemble methods based on decision trees, such as skope-rules. The remaining methods extract the rules in other ways, such as through rule-based classifiers (again surrogate models), as in the case of anchor, scalable-brl, and rulematrix. To further increase the comprehensibility of the explanation, some explainers complement the explanations with graphical visualizations, such as rulematrix, anchor, and skope-rules. Overall, the majority of the explainers require a long computing time due to the enrichment of the data or the use of rule-based classifiers, which are among the slowest interpretable models to train. Hence, they may be better suited for offline explanations. Depending on the complexity of the machine learning model in input, the explanations may be complex, such as deep trees or long lists of rules.
Rules-based explanation methods extract rules exploiting different approaches, which may require a longer time than feature importance methods, making them more suitable for offline settings. However, rule-based methods are well suited to common end-users due to their logical structure and simplicity.
5.3 Prototype-based explanations
Prototype-based explanations allow the users to reason by similarity and differences. Most of the methods in this setting are tailored to explain the data in input and not the black-box decisions.
5.4 Counterfactual-based explanations
Counterfactual-based explanations allow the users to understand what to do to achieve a different outcome. This kind of reasoning is close to how humans reason; hence, it is becoming quite popular. To make counterfactuals as realistic as possible, they must meet criteria such as plausibility and actionability. A minimal counterfactual-generation sketch with dice follows.
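A minimal sketch of counterfactual generation with dice, assuming a pandas DataFrame `df` with outcome column "income", a fitted scikit-learn classifier `model`, and illustrative column names:

```python
import dice_ml

data = dice_ml.Data(
    dataframe=df,
    continuous_features=["Age", "Hours Per Week"],
    outcome_name="income",
)
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(data, m, method="random")

# Three counterfactuals flipping the predicted class of one query instance.
cf = exp.generate_counterfactuals(
    df.drop(columns="income").iloc[[0]],
    total_CFs=3,
    desired_class="opposite",
)
cf.visualize_as_dataframe(show_only_changes=True)
```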
| Dataset | Black-box | Fidelity (lime) | Fidelity (shap) | Fidelity (anchor) | Fidelity (lore) | Faithfulness (lime) | Faithfulness (shap) |
|---|---|---|---|---|---|---|---|
| adult | LG | 0.979 | 0.613 | 0.989 | 0.984 | 0.099 (0.30) | 0.38 (0.37) |
| adult | XGB | 0.977 | 0.877 | 0.978 | 0.982 | 0.030 (0.32) | 0.36 (0.49) |
| adult | CAT | 0.96 | 0.777 | 0.988 | 0.989 | 0.077 (0.32) | 0.44 (0.37) |
| german | LG | 0.984 | 0.910 | 0.730 | 0.983 | 0.23 (0.60) | 0.19 (0.63) |
| german | XGB | 0.999 | 0.821 | 0.802 | 0.982 | 0.16 (0.26) | 0.44 (0.21) |
| german | CAT | 0.979 | 0.670 | 0.620 | 0.981 | 0.34 (0.33) | 0.43 (0.32) |
5.5 Tabular data explainers quantitative comparison
Table 2 reports the fidelity results on adult and german. In particular, shap has lower values for CAT (on both german and adult), suggesting that it may not be good at explaining this kind of ensemble model. Concerning rule-based models, the fidelity is high for both of them. However, we notice that anchor shows lower fidelity values for CAT on german, a behavior similar to that of shap. Besides fidelity, we also compare lime and shap on faithfulness and monotonicity. Overall, we did not find any model to be monotonic, and hence we do not report any results. The results for faithfulness are reported in Table 2. For adult, the faithfulness is quite low, especially for lime. The explainer with the highest faithfulness is shap explaining CAT. For german, instead, the values are higher, highlighting a better faithfulness overall. However, also for this dataset, shap has a better faithfulness w.r.t. lime. Table 3 reports the results of the stability analysis: a high value means that the explainer presents high instability, i.e., we can obtain quite different explanations for similar inputs. None of the methods is remarkably stable according to this metric. This weakness is widely shared by many explainers, independently of the data type and explanation type. Therefore, an important insight from these experiments is to work toward the stabilization of these procedures. Table 4 shows the explanation runtime approximated as order of magnitude. Overall, feature importance explanation algorithms are faster than the rule-based ones. In particular, shap is the most efficient, followed by lime. We remark that the computation time of lore depends on the number of neighbors generated by the genetic algorithm (in this case, we considered 1000 samples), while anchor and skoperule require a minimum precision (we selected a minimum precision of 0.40).

| Dataset | Black-box | lime | shap | anchor | lore |
|---|---|---|---|---|---|
| adult | LG | 24.37 (2.74) | 1.52 (4.49) | 22.36 (8.37) | 21.76 (11.80) |
| adult | XGB | 10.16 (6.48) | 2.17 (2.18) | 26.53 (13.08) | 30.01 (20.52) |
| adult | CAT | 0.35 (0.43) | 0.03 (0.01) | 6.51 (4.40) | 27.80 (70.05) |
| german | LG | 18.87 (0.73) | 19.01 (23.44) | 101.07 (62.75) | 622.12 (256.70) |
| german | XGB | 26.08 (14.50) | 38.43 (30.66) | 121.40 (98.43) | 725.81 (337.26) |
| german | CAT | 2.49 (9.91) | 15.92 (10.71) | 123.79 (76.86) | 756.70 (348.21) |
| Dataset | Black-box | lime | shap | dalex | anchor | lore | skoperule |
|---|---|---|---|---|---|---|---|
| adult | LG | 0.1 (0.01) | 0.001 (0.00) | 90 (0.09) | 2 (0.10) | 15 (0.32) | 100 (0.32) |
| adult | XGB | 0.1 (0.02) | 0.2 (0.03) | 108 (0.10) | 5 (0.11) | 50 (0.13) | – |
| adult | CAT | 0.2 (0.00) | 3 (0.02) | 110 (0.12) | 3 (0.21) | 35 (0.24) | – |
| german | LG | 0.007 (0.00) | 0.0008 (0.00) | 0.8 (0.00) | 2 (0.17) | 2 (0.31) | 70 (0.12) |
| german | XGB | 0.03 (0.01) | 0.002 (0.00) | 2 (0.12) | 2 (0.12) | 4 (0.32) | – |
| german | CAT | 0.03 (0.00) | 0.002 (0.02) | 1 (0.20) | 2 (0.42) | 6 (0.20) | – |
| Type | Name | References | Data type | IN/PH | G/L | A/S |
|---|---|---|---|---|---|---|
| SM | \(\epsilon\)-lrp | Bach et al. (2015) | ANY | PH | L | S |
|  | lime | Ribeiro et al. (2016) | ANY | PH | L | A |
|  | shap | Lundberg and Lee (2017) | ANY | PH | L | A |
|  | grad-cam | Selvaraju et al. (2020) | IMG | PH | L | S |
|  | deeplift | Shrikumar et al. (2017) | ANY | PH | L | S |
|  | smoothgrad | Smilkov et al. (2017) | IMG | PH | L | S |
|  | intgrad | Sundararajan et al. (2017) | ANY | PH | L | S |
|  | grad-cam++ | Chattopadhay et al. (2018) | IMG | PH | L | S |
|  | rise | Petsiuk et al. (2018) | IMG | PH | L | S |
|  | anchor | Ribeiro et al. (2018) | ANY | PH | L | A |
|  | extreme perturbation | Fong et al. (2019) | IMG | PH | L | S |
|  | xrai | Kapishnikov et al. (2019) | ANY | PH | L | S |
|  | cxplain | Schwab and Karlen (2019) | IMG | PH | L | S |
|  | eigen-cam | Muhammad and Yeasin (2020) | IMG | PH | L | S |
|  | ablation-cam | Desai and Ramaswamy (2020) | IMG | PH | L | S |
|  | score-cam | Wang et al. (2020) | IMG | PH | L | S |
|  | opti-cam | Zhang et al. (2023) | IMG | PH | L | S |
| CA | tcav | Kim et al. (2018) | IMG | PH | L | A |
|  | icnn | Shen et al. (2021) | IMG | IN | G | S |
|  | ace | Ghorbani et al. (2019) | IMG | PH | G | A |
|  | cace | Goyal et al. (2019) | IMG | IN | G | A |
|  | conceptshap | Yeh et al. (2020) | IMG | PH | G | A |
|  | pace | Kamakshi et al. (2021) | IMG | PH | G | S |
|  | gan style | Lang et al. (2021) | IMG | PH | G | A |
| CF | l2x | Chen et al. (2018) | ANY | PH | L | A |
|  | cem | Dhurandhar et al. (2018) | IMG | PH | L | A |
|  | guided proto | Looveren and Klaise (2021) | IMG | PH | L | A |
|  | abele | Guidotti et al. (2020a) | IMG | PH | L | A |
|  | piece | Kenny and Keane (2021) | IMG | PH | L | S |
|  | sedc | Vermeire et al. (2022) | IMG | PH | L | A |
|  | ecinn | Hvilshøj et al. (2021) | IMG | PH | L | A |
| PR | mmd-critic | Kim et al. (2016) | ANY | IN | G | A |
|  | influence functions | Koh and Liang (2017) | ANY | PH | L | A |
|  | protopnet | Chen et al. (2019) | IMG | IN | G | S |
|  | prototree | Nauta et al. (2021) | IMG | IN | G | S |
|  | deformable protopnet | Donnelly et al. (2022) | IMG | IN | G | S |
6 Explanations for image data
The experiments in this section are performed on mnist, cifar (in its 10-class flavor), and imagenet. We selected these datasets because they are widely used as benchmarks in ML in general and also in experimenting with XAI approaches. On these three datasets, we trained the models most used in the literature to evaluate the explanation methods: for mnist and cifar, we trained a CNN with two convolutional and two linear layers, while for imagenet, we used the VGG16 network (Simonyan and Zisserman 2015).

6.1 Saliency maps
\(\epsilon\)-lrp produces clear explanations for mnist, while the explanations returned for more complex images, such as those of cifar and imagenet, are quite unclear and only limitedly interpretable. Other variations of the lrp algorithm have been introduced in the literature. \(\gamma\)-lrp favors the effect of positive contributions over negative ones by separating the weights \(w_{ij}\) into \(w^{-}_{ij}+w^{+}_{ij}\) and adding a multiplier to the positive part: \(w^{-}_{ij}+\gamma w^{+}_{ij}\). Another variant of \(\epsilon\)-lrp is spray (Lapuschkin et al. 2019), which builds a spectral clustering on top of the local instance-based \(\epsilon\)-lrp explanations. Similarly to Li et al. (2019), it starts with the \(\epsilon\)-lrp of the input instance and finds the lrp attribution relevance for a single input of interest x.

lime struggles on mnist and cifar, as it segments the images into superpixels that in some cases are as big as the whole image. On the other hand, the SMs produced by xrai are much clearer. lime computes the segmentation at the very beginning of the algorithm on the raw images; thus, for low-resolution images, segmentation algorithms are more difficult to calibrate. xrai, instead, first computes intgrad values and then agglomerates them using segmentation. The result is much clearer, even with very small images. In general, we observe that segmentation methods work best for high-resolution images where the concepts in the image can be easily separated. For instance, in the SM of the "seashore" image produced by xrai, it is very clear how the method selected three parts of the image: the horizon, the sea, and the promontory. Since pixel-wise methods produce SMs in terms of single pixels, which are low-level features, they are useful only for an expert user who wants to check the robustness of the black-box. Overall, we deduce that the SMs returned by segmentation methods are more human-friendly than the ones returned by pixel-wise methods.

For cifar, we notice that all the methods highlight the background of the images, in particular in the "deer" class. This reveals a problem in the learning phase of the black-box and should not be attributed to the explainers. On the other hand, for imagenet, we observe very different SMs. For instance, for the ice hockey image in Fig. 7, the class in the dataset is "puck", i.e., the hockey disk. lime highlights the ice as important, xrai and grad-cam++ highlight the stick of the player, grad-cam highlights the fans, while rise highlights the hockey player. Thus, for the same image, we can obtain very different explanations, further highlighting the fragility of SMs. Regarding the second image of imagenet (the second from the right), we can observe that all the methods capture the same pattern: a straw hat in the background triggered the class "shower cap" while the correct one was "mask". Finally, in the "seashore" image of imagenet, there is an island in the sea. The top three predicted classes are seashore (0.91), promontory (0.04), and cliff (0.01). Half of the tested methods, like lime, smoothgrad, rise, and grad-cam, were fooled into marking the promontory as important for the class "seashore". We can conclude that SMs are very fragile when there are multiple classes in the image, even if these classes have a very low predicted probability.

Segmentation methods are more human-understandable than pixel-wise methods. Guided propagation methods can hardly be trusted due to confirmation bias, and therefore it is better not to adopt them.
6.2 Concept attribution
For a general audience, there is a need to build explanations in terms of higher-level features called concepts.
6.3 Prototype-based explanations
As an example, we report the prototypes and criticisms extracted by mmd-critic for cifar. We can extract some interesting knowledge from these methods: for example, from the prototype set we can deduce that birds usually stand on a tree or fly in the sky, while in the criticism images we see that planes are all on a white background or have a shape different from the usual one used for passengers. Influence Functions (Koh and Liang 2017) is a global, post-hoc, model-agnostic explainer that tries to find the images most responsible for a given prediction through influence functions, a technique from robust statistics that traces a model's prediction through the learning algorithm and back to its training data, thereby identifying the training points most responsible for a given prediction. Visualizing the training points most responsible for a prediction can provide more in-depth insights into the black-box behavior. PROTOPNET (Chen et al. 2019) is a global interpretable model for image data that aims at identifying prototypical parts of images (named prototypes) and using them to implement an interpretable classification process. A special deep learning architecture is designed to retrieve these prototypes: the network learns a limited number of prototypical parts from the training set and then identifies parts of the test image that look like the prototypical parts. It then predicts based on a weighted combination of the similarity scores between parts of the image and the learned prototypes.

Explanation prototypes are not very common for images because the usefulness of such explanations is not clear.
6.4 Counterfactual-based explanations
As an example, consider cem applied on mnist. From a human perspective, this approach might seem much more adversarial than useful as an explanation (Guidotti 2022). An extension of cem that resolves this problem is presented in Luss et al. (2021), where the authors leverage latent features created by a generative model to produce more trustful perturbations. L2X (Chen et al. 2018) is a local, post-hoc, model-agnostic explanation method that searches for the minimal number of pixels that change the classification. It is based on learning a function that extracts a subset of the most informative features for each given sample using Mutual Information. l2x adopts a variational approximation to efficiently compute the Mutual Information and assigns a value to each group of pixels, called a patch: if the value is positive, the group contributed positively to the prediction; otherwise, it contributed negatively. Guided Prototypes, Interpretable Counterfactual Explanations Guided by Prototypes (guidedproto) (Looveren and Klaise 2021), is a local, post-hoc, model-specific explainer that perturbs the input image by using a loss function \(\mathcal{L} = cL_{pred}+\beta L_1 + L_2\) optimized with gradient descent. The first term, \(cL_{pred}\), encourages the perturbed instance to be predicted as a class different from that of x, while the others are regularization terms. In Fig. 12, we show the application of guidedproto on mnist. It is interesting to notice that the counterfactuals unveil how easy it is to change the class with very few pixels. However, this kind of explanation is not easily human-understandable, because the few modified pixels can barely be noticed by human eyes. ABELE, Adversarial Black-box Explainer generating Latent Exemplars (Guidotti et al. 2019b), is a local, post-hoc, model-agnostic explainer that produces explanations composed of: (i) a set of exemplar and counter-exemplar images, i.e., prototypes and counterfactuals, and (ii) a SM. abele exploits an adversarial autoencoder (AAE) to generate the synthetic images of the neighborhood used to train the surrogate model that explains x. Indeed, it builds a latent local decision tree that mimics the behavior of b and selects prototypes and counterfactuals from the synthetic neighborhood by exploiting the tree. Finally, the SM is obtained by a pixel-by-pixel difference between x and the exemplars. In Fig. 12, we report an example of the application of abele on mnist.
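An open-source implementation of guidedproto is available in the alibi library. The following is a minimal sketch for mnist, assuming a Keras classifier `cnn` and training images `X_train`; details such as attribute names may differ across alibi versions:

```python
from alibi.explainers import CounterfactualProto

shape = (1, 28, 28, 1)                          # one mnist image
cf = CounterfactualProto(cnn, shape, use_kdtree=True, max_iterations=500)
cf.fit(X_train)                                 # builds the class prototypes

explanation = cf.explain(X_test[0:1])
counterfactual = explanation.cf["X"]            # perturbed image
target_class = explanation.cf["class"]          # class reached by the counterfactual
```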
Counterfactual explanations are more user-friendly than prototypes and other forms of explanations because they highlight the changes to make to obtain the desired prediction.
6.5 Image explainers quantitative comparison
We computed insertion and deletion on mnist, cifar, and imagenet. Table 6 reports the average results of these metrics over a set of 100 randomly selected images for every dataset. Independently of the explainer adopted, we notice that insertion scores decrease as the dataset dimension grows: since there is more information, more pixels have to be inserted to increase the performance. The deletion scores decrease as well, which might be tied to the fact that, with more information, it is easier to decrease the performance. We notice that rise is the best approach overall, followed by intgrad, deeplift, and \(\epsilon\)-lrp. All of these are pixel-wise methods, so this kind of evaluation seems to be advantageous for such explainers. On the contrary, the segmentation-based explainers lime and xrai struggle in general, and even more when handling low-resolution images. For imagenet, we only tested SM methods due to the increasing computational costs of the other explainers. Conversely, we tested tcav only on imagenet, as tcav needs different images representing different concepts, which is difficult to obtain for very simple images like those in mnist and cifar. From the runtime results, we notice that grad-cam and grad-cam++ are the fastest methods, especially for complex models like the VGG network. In general, pixel-wise SM explanations are faster to obtain, because segmentation slows the process down considerably, especially for high-resolution images. CA, CF, and PR methods are very slow compared to SM methods, because they require additional training or a search algorithm to return their explanations. CA, CF, and PR methods produce more useful explanations, but since SMs are easier and faster to obtain, they are more widely applied in the literature.

Explainer | mnist | cifar | imagenet |
---|---|---|---|
Insertion | | | |
lime | 0.807 (0.14) | 0.41 (0.21) | 0.34 (0.25) |
\(\epsilon \) -lrp | 0.976 (0.02) | 0.56 (0.20) | 0.28 (0.19) |
intgrad | 0.975 (0.03) | 0.64 (0.22) | 0.37 (0.23) |
deeplift | 0.976 (0.02) | 0.57 (0.20) | 0.28 (0.19) |
smoothgrad | 0.959 (0.03) | 0.55 (0.23) | 0.34 (0.26) |
xrai | 0.956 (0.04) | 0.58 (0.21) | 0.40 (0.26) |
grad-cam | 0.941 (0.04) | 0.57 (0.20) | 0.21 (0.19) |
grad-cam++ | 0.941 (0.04) | 0.52 (0.22) | 0.32 (0.26) |
rise | 0.978 (0.03) | 0.61 (0.21) | 0.50 (0.26) |
Deletion | | | |
lime | 0.388 (0.21) | 0.221 (0.19) | 0.051 (0.05) |
\(\epsilon \) -lrp | 0.120 (0.01) | 0.127 (0.11) | 0.014 (0.02) |
intgrad | 0.128 (0.01) | 0.118 (0.07) | 0.019 (0.04) |
deeplift | 0.120 (0.01) | 0.127 (0.11) | 0.014 (0.02) |
smoothgrad | 0.135 (0.04) | 0.153 (0.13) | 0.033 (0.05) |
xrai | 0.151 (0.04) | 0.144 (0.07) | 0.086 (0.11) |
grad-cam | 0.297 (0.20) | 0.153 (0.12) | 0.139 (0.12) |
grad-cam++ | 0.252 (0.13) | 0.283 (0.24) | 0.081 (0.10) |
rise | 0.120 (0.01) | 0.124 (0.07) | 0.044 (0.05) |
Dataset | Black-box | lime | \(\epsilon \) -lrp | intgrad | deeplift | smoothgrad | xrai | grad-cam | grad-cam++ | rise | tcav | mmd-critic | cem | guidedprop | abele |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mnist | CNN | 1 | 1 | 0.03 | 2 | 0.04 | 1 | 0.1 | 0.1 | 0.5 | – | 124 | 580 | 11 | 2000 |
cifar | CNN | 10 | 1 | 0.06 | 1 | 0.07 | 1.5 | 0.15 | 0.15 | 2 | – | 277 | 765 | 153 | 1800 |
imagenet | VGG16 | 50 | 2 | 5 | 3 | 0.8 | 18 | 0.25 | 0.25 | 21 | 300 | – | – | – | – |
| Type | Name | References | Data type | IN/PH | G/L | A/S |
|---|---|---|---|---|---|---|
| SH | shap | Lundberg and Lee (2017) | ANY | PH | L | S |
|  | lime | Ribeiro et al. (2016) | ANY | PH | L | A |
|  | deeplift | Shrikumar et al. (2017) | ANY | PH | L | S |
|  | intgrad | Sundararajan et al. (2017) | ANY | PH | L | S |
|  | l2x | Chen et al. (2018) | ANY | PH | L | A |
|  | lionets | Mollas et al. (2019) | ANY | PH | L | S |
| AB | – | Li et al. (2016) | TXT | PH | L | S |
|  | attentionmatrix | Vaswani et al. (2017) | TXT | PH | L | S |
|  | exbert | Hoover et al. (2019) | TXT | PH | L | S |
| CF | sedc | Martens and Provost (2014) | TXT | PH | L | A |
|  | lasts | Guidotti et al. (2020b) | TXT | PH | L | S |
|  | xspells | Lampridis et al. (2020) | TXT | PH | L | S |
|  | cat | Chemmengath et al. (2022) | TXT | PH | L | A |
|  | polyjuice | Pezeshkpour et al. (2019) | TXT | PH | L | A |
| Other | gyc | Madaan et al. (2021) | TXT | PH | L | A |
|  | quint | Abujabal et al. (2017) | TXT | PH | L | S |
|  | anchor | Ribeiro et al. (2018) | ANY | PH | L | A |
|  | criage | Pezeshkpour et al. (2019) | TXT | PH | L | S |
|  | – | Rajani et al. (2019) | TXT | PH | L | S |
|  | lasts | Guidotti et al. (2020b) | TXT | PH | L | S |
|  | doctorxai | Panigutti et al. (2020) | ANY | PH | L | S |
7 Explanations for text data
The experiments in this section are performed on sst, imdb, and yelp. We selected these datasets because they are the most used for sentiment classification and have different dimensions. On these datasets, we trained different black-box models. For every explainer, we present an example of an application on one or more datasets.
7.1 Sentence highlighting
Methods that use gradients, such as intgrad, perform better on textual data, since very deep models are usually employed in the NLP field.
7.2 Attention-based explainers
It is unclear if attention can be considered a valid explanation. We suggest focusing on other types of explanations.
7.3 Prototype and counterfactual-based explainers
Prototype and counterfactual explanations are very difficult to generate for text data because the meaning of the sentences must be preserved. Interactive methods such as polyjuice and gyc are promising approaches that introduce the human cognitive process into counterfactual generation.
7.4 Other types of explainers
Explanations for classifiers acting on text data are at the very early stages compared to tabular data and images.
Explainer | sst | imdb | yelp |
---|---|---|---|
intgrad | 0.6447 (0.21) | 0.647 (0.21) | 0.7595 (0.25) |
lime | 0.6199 (0.23) | 0.648 (0.21) | 0.7712 (0.25) |
deeplift | 0.6297 (0.23) | 0.600 (0.15) | 0.7565 (0.31) |
gradient x input | 0.6287 (0.23) | 0.630 (0.16) | 0.7590 (0.28) |
intgrad | 0.6107 (0.23) | 0.616 (0.16) | 0.7625 (0.33) |
lime | 0.6337 (0.23) | 0.599 (0.17) | 0.7513 (0.33) |
deeplift | 0.6137 (0.21) | 0.645 (0.16) | 0.7524 (0.30) |
gradient x input | 0.5852 (0.22) | 0.632 (0.16) | 0.7479 (0.31) |
7.5 Text explainers quantitative comparison
8 Explainers for other data types
8.1 Explanations for time series
8.2 Explanations for graphs
9 Explanation toolboxes
CaptumAI is a library providing explainability algorithms for PyTorch models. CaptumAI divides the available algorithms into three categories: Primary Attribution, which contains methods that evaluate the contribution of each input feature to the output of a model, e.g., intgrad (Sundararajan et al. 2017), grad-shap (Lundberg and Lee 2017), deeplift (Shrikumar et al. 2017), lime (Ribeiro et al. 2016), and grad-cam (Selvaraju et al. 2020); Layer Attribution, which focuses on the contribution of each neuron, e.g., grad-cam (Selvaraju et al. 2020) and layer-deeplift (Shrikumar et al. 2017); and Neuron Attribution, which analyzes the contribution of each input feature to the activation of a particular hidden neuron, e.g., neuron-intgrad (Sundararajan et al. 2017) and neuron-grad-shap (Lundberg and Lee 2017).

InterpretML (Nori et al. 2019) contains intrinsic and post-hoc methods for Python and R. InterpretML is particularly interesting due to the intrinsic methods it provides: Explainable Boosting Machine (ebm), Decision Tree, and Decision Rule List. These methods offer a user-friendly visualization of the explanations, with several local and global charts. InterpretML also contains the most popular methods, such as lime and shap.

DALEX (Lipovetsky 2022) is an R and Python package that provides post-hoc, model-agnostic explainers allowing local and global explanations. It is tailored for tabular data and can produce different kinds of visualization plots.

Alibi provides intrinsic and post-hoc methods. It can be used with any type of input dataset and for both classification and regression tasks. Alibi provides a set of counterfactual explainers, such as cem, and, interestingly, an implementation of anchor (Ribeiro et al. 2018). Regarding global explanation methods, Alibi contains ale (Accumulated Local Effects) (Apley and Zhu 2016), a method based on partial dependence plots (Guidotti et al. 2019c).

FAT-Forensics takes into account fairness, accountability, and transparency. Regarding explainability, it provides methods to assess it under three perspectives: data, models, and predictions. For accountability, it offers a set of techniques that assess privacy, security, and robustness. For fairness, it contains methods for bias detection.

What-If Tool is a toolkit providing a visual interface through which it is possible to probe models without coding. Moreover, it can work directly with ML models built on Cloud AI Platform (https://cloud.google.com/ai-platform). It contains a variety of approaches to obtain feature attribution values, such as shap (Lundberg and Lee 2017), intgrad (Sundararajan et al. 2017), and smoothgrad (Smilkov et al. 2017).

Shapash is a Python library that aims to make machine learning interpretable and understandable by everyone. It provides several types of interpretable visualizations that display explicit labels that everyone can understand. Shapash offers different types of interactive visualizations, from feature importance graphs to contribution plots.
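As an illustration of InterpretML's intrinsic models, a minimal sketch of training an ebm and opening its local and global dashboards, assuming scikit-learn-style training and test arrays:

```python
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

show(ebm.explain_global())                # per-feature shape functions of the model
show(ebm.explain_local(X_test, y_test))   # per-instance feature contributions
```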