1 Introduction

the development of a method to compare local explanations against the decision-making of three classes of fully transparent predictive models;

evaluation of four XAI methods in various experimental setups; and

insights to support the choice of XAI methods to be used within a particular context.
2 Background and related works
2.1 Explainable AI
2.2 Post hoc explanation methods
2.2.1 LIME
2.2.2 SHAP

\(z\) is a simplified representation of the input \(x\),

\(z' \subseteq z\) represents all vectors \(z'\) whose nonzero entries are a subset of the nonzero entries in \(z\),

\(|z'|\) is the number of nonzero entries in \(z'\), and

\(M\) is the number of features in \(z\).
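These terms come together in the standard Shapley value attribution on which SHAP is built (after Lundberg and Lee [27]), here stated in the notation defined above. For feature \(i\),

\[ \phi_i(f, x) = \sum_{z' \subseteq z} \frac{|z'|!\,(M - |z'| - 1)!}{M!} \left[ f_x(z') - f_x(z' \setminus i) \right], \]

where \(f_x(z')\) denotes the model's output when only the features present in \(z'\) are known.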
2.2.3 LINDA-BN
2.2.4 ACV
2.3 Evaluating explanations

Application-grounded evaluation, wherein the evaluation is conducted with real end users in full context, replicating a real-world application;

Human-grounded evaluation, wherein the evaluation is conducted with lay users in simpler simulated contexts or on proxy tasks that reflect a target application context; and

Functionally-grounded evaluation, which requires no input from users and relies on evaluating the inherent abilities of the system via a formal definition of interpretability.
2.4 Motivation

Local surrogate models are commonly used to generate local, post hoc explanations. To what degree do permutation and the choice of local surrogate model affect the quality of the explanation generated by the surrogate model?

Model-agnostic XAI methods often use some theoretical construct to define and calculate explanations. For example, principles from game theory are used to derive SHAP explanations [27, 28], and statistical models are applied by LINDA-BN [34]. However, some assumptions and design decisions may be needed to make these theoretical foundations suitable for application. How do the design choices made in the implementation of the explanation-generation mechanism affect explanation quality?

While many local explanation methods use local surrogate models (at the data point level), a few also use global surrogate models (i.e. at the dataset level). How does the use of a global surrogate model affect explanation quality?
3 Method
3.1 Evaluation method

LIME uses permutation, then a local, linear surrogate model to derive an explanation. Previous works have suggested that LIME’s permutation of input x affects explanation stability, particularly as the length of the input increases [46, 50]. In addition to the effect of permutation, in this work we will examine the effects of the choice of local surrogate model.
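As an illustration of this pipeline, the sketch below generates a LIME explanation with the `lime` package; the gradient-boosting classifier and the breast cancer data are placeholders standing in for any black-box model and dataset:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder black-box model and data; any sklearn-style classifier works.
data = load_breast_cancer()
X, y = data.data, data.target
model = GradientBoostingClassifier().fit(X, y)

# LIME permutes the input around x, labels the permuted samples with the
# black-box model, and fits a weighted local linear surrogate to them.
explainer = LimeTabularExplainer(
    X,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())  # (feature condition, weight) pairs from the surrogate
```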

LINDA-BN is grounded in statistical modelling and provides a queryable explanation.
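A minimal sketch of the underlying idea (not the reference implementation): sample a neighborhood around the instance, label it with the black-box model, and learn a Bayesian network over the features plus the prediction, which can then be queried. Here pgmpy's `HillClimbSearch`/`BicScore` stand in for the structure-learning step, and the ±10% perturbation band and tercile discretization are arbitrary illustrative choices:

```python
import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

def local_bn_sketch(model, x, feature_names, n_samples=1000, eps=0.1, seed=0):
    """Learn a Bayesian network over a local neighborhood of instance x."""
    rng = np.random.default_rng(seed)
    # Perturb each feature within a small band around the instance.
    noise = rng.uniform(1 - eps, 1 + eps, size=(n_samples, len(x)))
    neighborhood = x * noise
    df = pd.DataFrame(neighborhood, columns=feature_names)
    df["prediction"] = model.predict(neighborhood)
    # Discretize continuous features so a discrete network can be learned.
    for col in feature_names:
        df[col] = pd.qcut(df[col], q=3, labels=False, duplicates="drop")
    # Structure learning: the dependencies on "prediction" are what the
    # explanation is read from, and the network can be queried afterwards.
    return HillClimbSearch(df).estimate(scoring_method=BicScore(df))
```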

ACV uses a global, rather than local, surrogate model to generate explanations. Moreover, ACV provides explanations of multiple types, including feature subsets and feature attribution.
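Conceptually (a sketch of the idea, not the ACV library's API): fit a global random forest surrogate to the black box once, then, for a given instance, search for the smallest feature subset whose values alone let the surrogate reproduce the prediction when the remaining features are resampled from the data. The stability threshold and draw count below are illustrative choices:

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier

def smallest_sufficient_subset(surrogate, x, X_background, threshold=0.9,
                               n_draws=200, seed=0):
    """Find the smallest feature subset that keeps the surrogate's prediction
    stable when all other features are resampled from background data."""
    rng = np.random.default_rng(seed)
    target = surrogate.predict(x.reshape(1, -1))[0]
    n_features = x.shape[0]
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            idx = list(subset)
            # Hold the subset fixed at the instance's values; resample the rest.
            draws = X_background[rng.integers(len(X_background), size=n_draws)].copy()
            draws[:, idx] = x[idx]
            if np.mean(surrogate.predict(draws) == target) >= threshold:
                return subset  # first subset found at this size is minimal
    return tuple(range(n_features))

# Usage sketch: the global surrogate mimics the black box across the dataset.
# surrogate = RandomForestClassifier().fit(X_train, black_box.predict(X_train))
# subset = smallest_sufficient_subset(surrogate, X_train[0], X_train)
```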
3.2 Evaluation metrics

The correctness of the explanation in identifying the most impactful features;

The completeness of the explanation in identifying the most impactful features; and

The correctness of the importance ranking of all features, which is particularly important for feature attribution methods (a sketch of how these metrics can be computed follows below).
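These three properties map naturally onto precision, recall, and rank correlation, matching the Pr/Re/Tau columns reported in Section 5. The sketch below scores an explanation against a transparent model's ground truth, assuming Tau denotes Kendall's rank correlation:

```python
from scipy.stats import kendalltau

def explanation_scores(explained, true_features, explained_ranking, true_ranking):
    """Score an explanation against a transparent model's ground truth.

    explained / true_features: sets of feature indices;
    explained_ranking / true_ranking: importance ranks over all features,
    aligned by feature index.
    """
    hits = len(explained & true_features)
    precision = hits / len(explained) if explained else 0.0      # correctness
    recall = hits / len(true_features) if true_features else 0.0  # completeness
    tau, _ = kendalltau(explained_ranking, true_ranking)          # rank correctness
    return precision, recall, tau

# Example: explanation found features {0, 2}; ground truth is {0, 1, 2}.
print(explanation_scores({0, 2}, {0, 1, 2}, [1, 3, 2, 4], [1, 2, 3, 4]))
```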
4 Design of experiments
4.1 Datasets
Classification datasets:

| Dataset | Variable types | Num. variables | Training instances | Class balance (%) |
|---|---|---|---|---|
| Adult Income | Mixed | 104 | 10,977 | 50.47 |
| Breast Cancer | Continuous | 30 | 296 | 51.01 |
| COMPAS | Mixed | 20 | 2,793 | 50.13 |
| Diabetes | Continuous | 8 | 375 | 50.40 |
| Iris | Continuous | 4 | 70 | 52.86 |
| Mushroom | Discrete | 117 | 5,842 | 50.09 |
| Nursery | Discrete | 27 | 6,048 | 50.39 |
Regression datasets:

| Dataset | Variable types | Num. variables | Training instances | Target distribution |
|---|---|---|---|---|
| Bike Rentals | Mixed | 62 | 12,165 | Exponential |
| Facebook | Discrete | 49 | 349 | Exponential |
| Housing | Mixed | 23 | 354 | Normal |
| Real Estate | Continuous | 6 | 289 | Normal |
| Solar Flare | Discrete | 32 | 972 | Exponential |
| Student Scores | Mixed | 58 | 454 | Normal |
| Wine Quality | Continuous | 11 | 3,428 | Normal |
4.2 Predictive models
Predictive model performance on the classification datasets:

| Dataset | Decision tree | Logistic regression | Naïve Bayes |
|---|---|---|---|
| Adult Income | 0.82 | 0.82 | 0.81 |
| Breast Cancer | 0.88 | 0.98 | 0.95 |
| COMPAS | 0.71 | 0.73 | 0.72 |
| Diabetes | 0.69 | 0.71 | 0.68 |
| Iris | 1.00 | 1.00 | 1.00 |
| Mushroom | 1.00 | 1.00 | 1.00 |
| Nursery | 1.00 | 1.00 | 1.00 |
Predictive model performance on the regression datasets:

| Dataset | Decision tree | Linear regression |
|---|---|---|
| Bike Rentals | 0.88 | 0.68 |
| Facebook | 0.42 | 0.26 |
| Housing | 0.63 | 0.71 |
| Real Estate | 0.55 | 0.45 |
| Solar Flare | 0.14 | 0.17 |
| Student Scores | 0.79 | 0.90 |
| Wine Quality | 0.28 | 0.29 |
| Model | Applied to | Feature extraction | Ranking extraction |
|---|---|---|---|
| Decision tree | All datasets | Features along the decision path were used as true features | Features were ranked by order and frequency of appearance on the decision path |
| Logistic regression | All classification datasets | Features with coefficients in the top 5% of the range of coefficients | Features were ranked in order of the absolute values of their coefficients |
| Linear regression | All regression datasets | Features with coefficients in the top 5% of the range of coefficients | Features were ranked in order of the absolute values of their coefficients |
| Naïve Bayes | All classification datasets | Features for which the difference in likelihoods given each class was in the top 5% of the range of differences | Features were ranked in order of the absolute values of the difference in likelihoods given each class |
4.2.1 Decision tree
4.2.2 Linear and logistic regression
4.2.3 Naïve Bayes
4.3 XAI techniques
| XAI method | Explanation-generation mechanism | Used for | Feature extraction | Ranking extraction |
|---|---|---|---|---|
| LIME | Local, linear surrogate model to determine feature weights | All models | Features with weights in the top 5% of the range of weights | Features ranked in order of absolute value of weights |
| SHAP | TreeSHAP: tree traversal to determine the contribution of each feature | All decision tree models | Features with contributions in the top 5% of the range of contributions | Features ranked in order of absolute value of contributions |
| | LinearSHAP: examining feature coefficients and means to determine contributions | All linear and logistic regression models | | |
| | ExactExplainer and PermutationExplainer: use of Gray codes to determine feature contributions | All Naïve Bayes models | | |
| LINDA-BN | Local Bayesian network surrogate model to determine conditional dependence between all variables, including the target variable | All classification models | Features with the greatest impact on the target variable in the surrogate model | Features ranked in order of impact on the target variable in the surrogate model |
| ACV | Global, random forest surrogate model to determine sufficient features for prediction | All models | Smallest set of sufficient features returned as explanation | Features ranked by percentage of occurrence across all sufficient feature sets |
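For example, the TreeSHAP variant listed above can be invoked directly on a fitted tree model with the `shap` package; the decision tree and the Iris data below are placeholders:

```python
import shap
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# TreeSHAP traverses the fitted tree structure to compute exact
# Shapley-value contributions for each feature.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # per-class attributions for one instance
```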
4.3.1 LIME
4.3.2 SHAP
4.3.3 LINDA-BN
4.3.4 ACV
4.4 Identifying feature subsets
5 Results and analysis
Fidelity results on the classification datasets (Pr = precision, Re = recall, Tau = rank correlation; DT = decision tree, LR = logistic regression, NB = Naïve Bayes):

| Dataset | Model | LIME Pr | LIME Re | LIME Tau | SHAP Pr | SHAP Re | SHAP Tau | LINDA-BN Pr | LINDA-BN Re | LINDA-BN Tau | ACV Pr | ACV Re | ACV Tau |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult Income | DT | 0.95 | 0.15 | 0.24 | 1.00 | 0.19 | 0.59 | 0.47 | 0.42 | 0.07 | 0.62 | 0.15 | 0.43 |
| Adult Income | LR | 1.00 | 1.00 | 0.75 | 0.32 | 0.34 | 0.11 | 0.00 | 0.46 | 0.01 | 0.09 | 0.09 | 0.14 |
| Adult Income | NB | 0.15 | 0.15 | 0.48 | 0.68 | 0.68 | 0.67 | 0.03 | 0.91 | 0.04 | 0.22 | 0.24 | 0.20 |
| Breast Cancer | DT | 1.00 | 0.43 | 0.35 | 1.00 | 0.43 | 0.62 | 0.09 | 0.43 | 0.10 | 0.26 | 0.25 | 0.30 |
| Breast Cancer | LR | 0.75 | 0.24 | 0.52 | 0.60 | 0.17 | 0.53 | 0.24 | 0.57 | 0.06 | 0.38 | 0.20 | 0.29 |
| Breast Cancer | NB | 0.41 | 0.42 | 0.61 | 0.40 | 0.37 | 0.69 | 0.05 | 0.53 | −0.07 | 0.18 | 0.20 | 0.44 |
| COMPAS | DT | 0.96 | 0.26 | 0.45 | 1.00 | 0.26 | 0.59 | 0.33 | 0.11 | 0.13 | 0.70 | 0.35 | 0.34 |
| COMPAS | LR | 0.56 | 0.57 | 0.69 | 0.50 | 0.55 | 0.39 | 0.01 | 0.24 | −0.04 | 0.32 | 0.55 | 0.18 |
| COMPAS | NB | 0.06 | 0.05 | 0.35 | 0.82 | 0.82 | 0.52 | 0.11 | 0.55 | 0.01 | 0.25 | 0.38 | 0.21 |
| Diabetes | DT | 0.89 | 0.30 | 0.49 | 1.00 | 0.35 | 0.61 | 0.49 | 0.42 | 0.05 | 0.52 | 0.35 | 0.28 |
| Diabetes | LR | 0.53 | 0.54 | 0.53 | 0.50 | 0.51 | 0.55 | 0.25 | 0.66 | 0.02 | 0.36 | 0.48 | 0.23 |
| Diabetes | NB | 0.64 | 0.64 | 0.63 | 0.82 | 0.83 | 0.80 | 0.24 | 0.60 | −0.07 | 0.27 | 0.42 | 0.20 |
| Iris | DT | 1.00 | 1.00 | 0.71 | 1.00 | 1.00 | 1.00 | 0.23 | 0.88 | 0.03 | 0.75 | 1.00 | 0.24 |
| Iris | LR | 0.85 | 0.88 | 0.83 | 0.71 | 0.81 | 0.72 | 0.24 | 0.96 | −0.01 | 0.25 | 0.50 | 0.74 |
| Iris | NB | 0.81 | 0.75 | 0.83 | 0.94 | 0.94 | 0.88 | 0.27 | 0.96 | −0.01 | 0.44 | 0.63 | 0.58 |
| Mushroom | DT | 0.47 | 0.06 | 0.20 | 0.94 | 0.34 | 0.54 | 0.05 | 1.00 | 0.00 | 0.24 | 0.06 | 0.15 |
| Mushroom | LR | 1.00 | 1.00 | 0.86 | 0.48 | 0.58 | 0.33 | 0.01 | 1.00 | 0.00 | 0.29 | 0.58 | 0.11 |
| Mushroom | NB | 0.06 | 0.20 | 0.44 | 0.86 | 0.60 | 0.34 | 0.02 | 0.97 | 0.00 | 0.14 | 0.11 | 0.24 |
| Nursery | DT | 1.00 | 1.00 | 0.27 | 1.00 | 1.00 | 1.00 | 0.04 | 1.00 | 0.00 | 0.13 | 0.26 | 0.18 |
| Nursery | LR | 1.00 | 1.00 | 0.30 | 1.00 | 1.00 | 0.81 | 0.04 | 1.00 | 0.00 | 0.14 | 0.27 | 0.10 |
| Nursery | NB | 1.00 | 1.00 | 0.78 | 1.00 | 1.00 | 0.88 | 0.04 | 1.00 | 0.00 | 0.14 | 0.29 | 0.11 |
Fidelity results on the regression datasets (Pr = precision, Re = recall, Tau = rank correlation; DT = decision tree, LR = linear regression):

| Dataset | Model | LIME Pr | LIME Re | LIME Tau | SHAP Pr | SHAP Re | SHAP Tau | ACV Pr | ACV Re | ACV Tau |
|---|---|---|---|---|---|---|---|---|---|---|
| Bike Rentals | DT | 0.74 | 0.07 | 0.38 | 0.99 | 0.06 | 0.51 | 0.47 | 0.06 | 0.22 |
| Bike Rentals | LR | 1.00 | 0.88 | 0.78 | 1.00 | 0.14 | 0.63 | 0.06 | 0.01 | −0.16 |
| Facebook | DT | 1.00 | 0.18 | 0.50 | 0.92 | 0.17 | 0.59 | 0.34 | 0.03 | 0.37 |
| Facebook | LR | 1.00 | 1.00 | 0.74 | 1.00 | 0.33 | 0.59 | 0.00 | 0.00 | 0.17 |
| Housing | DT | 0.81 | 0.11 | 0.34 | 1.00 | 0.14 | 0.55 | 0.78 | 0.12 | 0.43 |
| Housing | LR | 0.53 | 0.33 | 0.44 | 0.44 | 0.23 | 0.45 | 0.18 | 0.12 | 0.41 |
| Real Estate | DT | 1.00 | 0.33 | 0.48 | 1.00 | 0.31 | 0.56 | 0.84 | 0.27 | 0.12 |
| Real Estate | LR | 0.40 | 0.48 | 0.44 | 0.30 | 0.30 | 0.45 | 0.40 | 0.43 | 0.10 |
| Solar Flare | DT | 0.89 | 0.16 | 0.51 | 0.96 | 0.22 | 0.84 | 0.21 | 0.03 | 0.33 |
| Solar Flare | LR | 1.00 | 1.00 | 0.92 | 0.16 | 0.16 | 0.51 | 0.00 | 0.00 | −0.02 |
| Student Scores | DT | 0.96 | 0.56 | 0.19 | 1.00 | 0.57 | 0.48 | 0.77 | 0.43 | 0.43 |
| Student Scores | LR | 1.00 | 1.00 | 0.91 | 0.08 | 0.08 | 0.61 | 0.00 | 0.00 | −0.30 |
| Wine Quality | DT | 0.99 | 0.33 | 0.59 | 1.00 | 0.31 | 0.70 | 0.45 | 0.18 | 0.10 |
| Wine Quality | LR | 0.43 | 0.52 | 0.65 | 0.34 | 0.39 | 0.68 | 0.20 | 0.22 | −0.08 |
Pairwise comparisons of XAI techniques on the classification datasets (T, z, and p per metric):

| XAI techniques | Precision (T, z, p) | Recall (T, z, p) | F1-score (T, z, p) | Rank correlation (T, z, p) |
|---|---|---|---|---|
| LIME & SHAP | 64,472, −6.6, 0.00 | 62,995, −4.9, 0.00 | 81,245, −5.4, 0.00 | 542,663, −12.9, 0.00 |
| LIME & LINDA-BN | 101,122, −28.8, 0.00 | 111,778, −11.5, 0.00 | 124,550, −27.6, 0.00 | 9,730, −37.1, 0.00 |
| LIME & ACV | 73,746, −25.5, 0.00 | 97,228, −15.4, 0.00 | 155,344, −20.9, 0.00 | 198,713, −29.0, 0.00 |
| SHAP & LINDA-BN | 47,991, −31.6, 0.00 | 108,833, −8.1, 0.00 | 62,908, −30.8, 0.00 | 3,148, −37.4, 0.00 |
| SHAP & ACV | 43,307, −29.1, 0.00 | 74,658, −19.6, 0.00 | 109,395, −25.4, 0.00 | 46,797, −35.5, 0.00 |
| LINDA-BN & ACV | 385,472, −11.1, 0.00 | 82,189, −23.1, 0.00 | 392,239, −11.6, 0.00 | 83,384, −29.7, 0.00 |
Pairwise comparisons of XAI techniques on the regression datasets (T, z, and p per metric):

| XAI techniques | Precision (T, z, p) | Recall (T, z, p) | F1-score (T, z, p) | Rank correlation (T, z, p) |
|---|---|---|---|---|
| LIME & SHAP | 19,874, −9.8, 0.00 | 11,408, −13.5, 0.00 | 16,542, −13.1, 0.00 | 190,176, −17.9, 0.00 |
| LIME & ACV | 31,400, −22.0, 0.00 | 34,864, −19.6, 0.00 | 39,536, −20.2, 0.00 | 118,162, −24.6, 0.00 |
| SHAP & ACV | 13,599, −18.5, 0.00 | 28,498, −13.3, 0.00 | 29,000, −14.4, 0.00 | 36,792, −30.0, 0.00 |
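The paired T, z, and p statistics above are consistent with a Wilcoxon signed-rank test; assuming that is the test used, each cell can be reproduced along these lines (the score arrays below are random placeholders for paired per-instance metric values):

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired scores: one value per explained instance for each of
# the two XAI techniques being compared on the same dataset and model.
lime_precision = np.random.default_rng(0).uniform(size=500)
shap_precision = np.random.default_rng(1).uniform(size=500)

# T is the signed-rank statistic; the p-value tests whether the paired
# differences are symmetric about zero.
T, p = wilcoxon(lime_precision, shap_precision)
print(T, p)
```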
5.1 Analysis of fidelity results
5.2 Comparison of XAI techniques
6 Discussion
6.1 Correctness versus completeness
6.2 LIME
6.2.1 Impact of surrogate model
6.2.2 Impact of permutation
6.3 SHAP
6.4 LINDA-BN
| XAI method | Strengths | Weaknesses | Should be used for |
|---|---|---|---|
| LIME | Generally precise in identifying features | Generally does not identify all important features | Model types with relatively distinct relationships between features and prediction |
| | Performs well for decision tree, linear regression, and logistic regression models | Does not correctly rank features by importance | Datasets with mostly categorical variables |
| | More accurate for datasets with a higher proportion of categorical variables | Cannot accurately work with Naïve Bayes models (the relationship between prediction and features is too nonlinear) | Contexts in which not all relevant variables may be important (for example, not in medical decision-making) |
| | | More continuous variables result in poorer explanation correctness | |
| | | Explanation quality is subject to dataset characteristics and, in some cases, model characteristics | |
| SHAP | Generally precise in identifying features | Generally does not identify all important features | Not for investigations of model quality or model fairness (performance may be inconsistent across model types) |
| | Performs well for decision tree and Naïve Bayes models | Does not correctly rank features by importance | End-user decision-making |
| | | Cannot accurately work with linear and logistic regression models | Contexts in which not all relevant variables may be important (for example, not in medical decision-making) |
| | | Explanation quality is subject to model type and the SHAP implementation for that model type | |
| LINDA-BN | Fidelity is relatively consistent across model types | Cannot always distinguish between the most and least important features | Not feature attribution |
| | | Explanations cannot be taken as accurate feature attribution or feature ranking | Model debugging and determining model confidence with no ground truth (i.e. in context) |
| ACV | Can show feature necessity for prediction (i.e. features without which predictions cannot be accurately made) | Cannot always distinguish between the most and least important features | As a self-explaining model (does not function well post hoc) |
| | | Explanations cannot be taken as accurate feature attribution or feature ranking | |
6.5 ACV
6.6 Summary of insights
6.7 Limitations and future work
7 Conclusion

the explanation mechanism, i.e. the engineering of the XAI method, has a strong effect on explanation quality, which may once again require a technical expert to examine and assess whether a given mechanism is suitable for the technical context; and

there is no one “best”, most faithful XAI method, even for a single dataset or model type; all of the methods show significant differences in performance across datasets and models.