1 Introduction

the development of a method to compare local explanations against the decision-making of three classes of fully transparent predictive models;

evaluation of four XAI methods in various experimental setups; and

insights to support the choice of XAI methods to be used within a particular context.
2 Background and related works
2.1 Explainable AI
2.2 Post hoc explanation methods
2.2.1 LIME
2.2.2 SHAP

\(z\) is a simplified representation of the input \(x\),

\(z' \subseteq z\) represents all vectors \(z'\) whose nonzero entries are a subset of the nonzero entries in \(z\),

\(|z'|\) is the number of nonzero entries in \(z'\), and

\(M\) is the number of features in \(z\).
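These terms come together in the standard Shapley value attribution on which SHAP is built (after Lundberg and Lee [27]), here stated in the notation defined above. For feature \(i\),

\[ \phi_i(f, x) = \sum_{z' \subseteq z} \frac{|z'|!\,(M - |z'| - 1)!}{M!} \left[ f_x(z') - f_x(z' \setminus i) \right], \]

where \(f_x(z')\) denotes the model's output when only the features present in \(z'\) are known.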
2.2.3 LINDA-BN
2.2.4 ACV
2.3 Evaluating explanations

Application-grounded evaluation, wherein the evaluation is conducted with real end users in full context, replicating a real-world application;

Human-grounded evaluation, wherein the evaluation is conducted with lay users in simpler simulated contexts or on proxy tasks that reflect a target application context; and

Functionally-grounded evaluation, which requires no input from users and relies on evaluating the inherent abilities of the system via a formal definition of interpretability.
2.4 Motivation

Local surrogate models are commonly used to generate local, post hoc explanations. To what degree do permutation and the choice of local surrogate model affect the quality of the explanation generated by the surrogate model?

Model-agnostic XAI methods often use some theoretical construct to define and calculate explanations. For example, principles from game theory are used to derive SHAP explanations [27, 28], and statistical models are applied by LINDA-BN [34]. However, some assumptions and design decisions may be needed to make these theoretical foundations suitable for application. How do the design choices made in the implementation of the explanation-generation mechanism affect explanation quality?

While many local explanation methods use local surrogate models (at the data point level), a few also use global surrogate models (i.e. at the dataset level). How does the use of a global surrogate model affect explanation quality?
3 Method
3.1 Evaluation method

LIME uses permutation, then a local, linear surrogate model to derive an explanation. Previous works have suggested that LIME’s permutation of input x affects explanation stability, particularly as the length of the input increases [46, 50]. In addition to the effect of permutation, in this work we will examine the effects of the choice of local surrogate model.
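As an illustration of this pipeline, the sketch below generates a LIME explanation with the `lime` package; the gradient-boosting classifier and the breast cancer data are placeholders standing in for any black-box model and dataset:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder black-box model and data; any sklearn-style classifier works.
data = load_breast_cancer()
X, y = data.data, data.target
model = GradientBoostingClassifier().fit(X, y)

# LIME permutes the input around x, labels the permuted samples with the
# black-box model, and fits a weighted local linear surrogate to them.
explainer = LimeTabularExplainer(
    X,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
exp = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(exp.as_list())  # (feature condition, weight) pairs from the surrogate
```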

LINDA-BN is grounded in statistical modelling and provides a queryable explanation.
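A minimal sketch of the underlying idea (not the reference implementation): sample a neighborhood around the instance, label it with the black-box model, and learn a Bayesian network over the features plus the prediction, which can then be queried. Here pgmpy's `HillClimbSearch`/`BicScore` stand in for the structure-learning step, and the ±10% perturbation band and tercile discretization are arbitrary illustrative choices:

```python
import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore

def local_bn_sketch(model, x, feature_names, n_samples=1000, eps=0.1, seed=0):
    """Learn a Bayesian network over a local neighborhood of instance x."""
    rng = np.random.default_rng(seed)
    # Perturb each feature within a small band around the instance.
    noise = rng.uniform(1 - eps, 1 + eps, size=(n_samples, len(x)))
    neighborhood = x * noise
    df = pd.DataFrame(neighborhood, columns=feature_names)
    df["prediction"] = model.predict(neighborhood)
    # Discretize continuous features so a discrete network can be learned.
    for col in feature_names:
        df[col] = pd.qcut(df[col], q=3, labels=False, duplicates="drop")
    # Structure learning: the dependencies on "prediction" are what the
    # explanation is read from, and the network can be queried afterwards.
    return HillClimbSearch(df).estimate(scoring_method=BicScore(df))
```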

ACV uses a global, rather than local, surrogate model to generate explanations. Moreover, ACV provides explanations of multiple types, including feature subsets and feature attribution.
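Conceptually (a sketch of the idea, not the ACV library's API): fit a global random forest surrogate to the black box once, then, for a given instance, search for the smallest feature subset whose values alone let the surrogate reproduce the prediction when the remaining features are resampled from the data. The stability threshold and draw count below are illustrative choices:

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier

def smallest_sufficient_subset(surrogate, x, X_background, threshold=0.9,
                               n_draws=200, seed=0):
    """Find the smallest feature subset that keeps the surrogate's prediction
    stable when all other features are resampled from background data."""
    rng = np.random.default_rng(seed)
    target = surrogate.predict(x.reshape(1, -1))[0]
    n_features = x.shape[0]
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            idx = list(subset)
            # Hold the subset fixed at the instance's values; resample the rest.
            draws = X_background[rng.integers(len(X_background), size=n_draws)].copy()
            draws[:, idx] = x[idx]
            if np.mean(surrogate.predict(draws) == target) >= threshold:
                return subset  # first subset found at this size is minimal
    return tuple(range(n_features))

# Usage sketch: the global surrogate mimics the black box across the dataset.
# surrogate = RandomForestClassifier().fit(X_train, black_box.predict(X_train))
# subset = smallest_sufficient_subset(surrogate, X_train[0], X_train)
```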
3.2 Evaluation metrics

The correctness of the explanation in identifying the most impactful features;

The completeness of the explanation in identifying the most impactful features; and

The correctness of the importance ranking of all features, which is particularly important for feature attribution methods (a sketch of how these metrics can be computed follows below).
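These three properties map naturally onto precision, recall, and rank correlation, matching the Pr/Re/Tau columns reported in Section 5. The sketch below scores an explanation against a transparent model's ground truth, assuming Tau denotes Kendall's rank correlation:

```python
from scipy.stats import kendalltau

def explanation_scores(explained, true_features, explained_ranking, true_ranking):
    """Score an explanation against a transparent model's ground truth.

    explained / true_features: sets of feature indices;
    explained_ranking / true_ranking: importance ranks over all features,
    aligned by feature index.
    """
    hits = len(explained & true_features)
    precision = hits / len(explained) if explained else 0.0      # correctness
    recall = hits / len(true_features) if true_features else 0.0  # completeness
    tau, _ = kendalltau(explained_ranking, true_ranking)          # rank correctness
    return precision, recall, tau

# Example: explanation found features {0, 2}; ground truth is {0, 1, 2}.
print(explanation_scores({0, 2}, {0, 1, 2}, [1, 3, 2, 4], [1, 2, 3, 4]))
```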
4 Design of experiments
4.1 Datasets
Classification datasets:

| Dataset | Variable types | Num. variables | Training instances | Class balance (%) |
|---|---|---|---|---|
| Adult Income | Mixed | 104 | 10,977 | 50.47 |
| Breast Cancer | Continuous | 30 | 296 | 51.01 |
| COMPAS | Mixed | 20 | 2,793 | 50.13 |
| Diabetes | Continuous | 8 | 375 | 50.40 |
| Iris | Continuous | 4 | 70 | 52.86 |
| Mushroom | Discrete | 117 | 5,842 | 50.09 |
| Nursery | Discrete | 27 | 6,048 | 50.39 |
Regression datasets:

| Dataset | Variable types | Num. variables | Training instances | Target distribution |
|---|---|---|---|---|
| Bike Rentals | Mixed | 62 | 12,165 | Exponential |
| Facebook | Discrete | 49 | 349 | Exponential |
| Housing | Mixed | 23 | 354 | Normal |
| Real Estate | Continuous | 6 | 289 | Normal |
| Solar Flare | Discrete | 32 | 972 | Exponential |
| Student Scores | Mixed | 58 | 454 | Normal |
| Wine Quality | Continuous | 11 | 3,428 | Normal |
4.2 Predictive models
Predictive model performance on the classification datasets:

| Dataset | Decision tree | Logistic regression | Naïve Bayes |
|---|---|---|---|
| Adult Income | 0.82 | 0.82 | 0.81 |
| Breast Cancer | 0.88 | 0.98 | 0.95 |
| COMPAS | 0.71 | 0.73 | 0.72 |
| Diabetes | 0.69 | 0.71 | 0.68 |
| Iris | 1.00 | 1.00 | 1.00 |
| Mushroom | 1.00 | 1.00 | 1.00 |
| Nursery | 1.00 | 1.00 | 1.00 |
Predictive model performance on the regression datasets:

| Dataset | Decision tree | Linear regression |
|---|---|---|
| Bike Rentals | 0.88 | 0.68 |
| Facebook | 0.42 | 0.26 |
| Housing | 0.63 | 0.71 |
| Real Estate | 0.55 | 0.45 |
| Solar Flare | 0.14 | 0.17 |
| Student Scores | 0.79 | 0.90 |
| Wine Quality | 0.28 | 0.29 |
| Model | Applied to | Feature extraction | Ranking extraction |
|---|---|---|---|
| Decision tree | All datasets | Features along the decision path were used as true features | Features were ranked by order and frequency of appearance on the decision path |
| Logistic regression | All classification datasets | Features with coefficients in the top 5% of the range of coefficients | Features were ranked in order of the absolute values of their coefficients |
| Linear regression | All regression datasets | Features with coefficients in the top 5% of the range of coefficients | Features were ranked in order of the absolute values of their coefficients |
| Naïve Bayes | All classification datasets | Features for which the difference in likelihoods given each class was in the top 5% of the range of differences | Features were ranked in order of the absolute values of the difference in likelihoods given each class |
4.2.1 Decision tree
4.2.2 Linear and logistic regression
4.2.3 Naïve Bayes
4.3 XAI techniques
| XAI method | Explanation-generation mechanism | Used for | Feature extraction | Ranking extraction |
|---|---|---|---|---|
| LIME | Local, linear surrogate model to determine feature weights | All models | Features with weights in the top 5% of the range of weights | Features ranked in order of absolute value of weights |
| SHAP | TreeSHAP: tree traversal to determine the contribution of each feature | All decision tree models | Features with contributions in the top 5% of the range of contributions | Features ranked in order of absolute value of contributions |
| | LinearSHAP: examining feature coefficients and means to determine contributions | All linear and logistic regression models | | |
| | ExactExplainer and PermutationExplainer: use of Gray codes to determine feature contributions | All Naïve Bayes models | | |
| LINDA-BN | Local Bayesian network surrogate model to determine conditional dependence between all variables, including the target variable | All classification models | Features with the greatest impact on the target variable in the surrogate model | Features ranked in order of impact on the target variable in the surrogate model |
| ACV | Global, random forest surrogate model to determine sufficient features for prediction | All models | Smallest set of sufficient features returned as explanation | Features ranked by percentage of occurrence across all sufficient feature sets |
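For example, the TreeSHAP variant listed above can be invoked directly on a fitted tree model with the `shap` package; the decision tree and the Iris data below are placeholders:

```python
import shap
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# TreeSHAP traverses the fitted tree structure to compute exact
# Shapley-value contributions for each feature.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # per-class attributions for one instance
```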
4.3.1 LIME
4.3.2 SHAP
4.3.3 LINDA-BN
4.3.4 ACV
4.4 Identifying feature subsets
5 Results and analysis
Fidelity results on the classification datasets (Pr = precision, Re = recall, Tau = rank correlation; DT = decision tree, LR = logistic regression, NB = Naïve Bayes):

| Dataset | Model | LIME Pr | LIME Re | LIME Tau | SHAP Pr | SHAP Re | SHAP Tau | LINDA-BN Pr | LINDA-BN Re | LINDA-BN Tau | ACV Pr | ACV Re | ACV Tau |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adult Income | DT | 0.95 | 0.15 | 0.24 | 1.00 | 0.19 | 0.59 | 0.47 | 0.42 | 0.07 | 0.62 | 0.15 | 0.43 |
| Adult Income | LR | 1.00 | 1.00 | 0.75 | 0.32 | 0.34 | 0.11 | 0.00 | 0.46 | 0.01 | 0.09 | 0.09 | 0.14 |
| Adult Income | NB | 0.15 | 0.15 | 0.48 | 0.68 | 0.68 | 0.67 | 0.03 | 0.91 | 0.04 | 0.22 | 0.24 | 0.20 |
| Breast Cancer | DT | 1.00 | 0.43 | 0.35 | 1.00 | 0.43 | 0.62 | 0.09 | 0.43 | 0.10 | 0.26 | 0.25 | 0.30 |
| Breast Cancer | LR | 0.75 | 0.24 | 0.52 | 0.60 | 0.17 | 0.53 | 0.24 | 0.57 | 0.06 | 0.38 | 0.20 | 0.29 |
| Breast Cancer | NB | 0.41 | 0.42 | 0.61 | 0.40 | 0.37 | 0.69 | 0.05 | 0.53 | −0.07 | 0.18 | 0.20 | 0.44 |
| COMPAS | DT | 0.96 | 0.26 | 0.45 | 1.00 | 0.26 | 0.59 | 0.33 | 0.11 | 0.13 | 0.70 | 0.35 | 0.34 |
| COMPAS | LR | 0.56 | 0.57 | 0.69 | 0.50 | 0.55 | 0.39 | 0.01 | 0.24 | −0.04 | 0.32 | 0.55 | 0.18 |
| COMPAS | NB | 0.06 | 0.05 | 0.35 | 0.82 | 0.82 | 0.52 | 0.11 | 0.55 | 0.01 | 0.25 | 0.38 | 0.21 |
| Diabetes | DT | 0.89 | 0.30 | 0.49 | 1.00 | 0.35 | 0.61 | 0.49 | 0.42 | 0.05 | 0.52 | 0.35 | 0.28 |
| Diabetes | LR | 0.53 | 0.54 | 0.53 | 0.50 | 0.51 | 0.55 | 0.25 | 0.66 | 0.02 | 0.36 | 0.48 | 0.23 |
| Diabetes | NB | 0.64 | 0.64 | 0.63 | 0.82 | 0.83 | 0.80 | 0.24 | 0.60 | −0.07 | 0.27 | 0.42 | 0.20 |
| Iris | DT | 1.00 | 1.00 | 0.71 | 1.00 | 1.00 | 1.00 | 0.23 | 0.88 | 0.03 | 0.75 | 1.00 | 0.24 |
| Iris | LR | 0.85 | 0.88 | 0.83 | 0.71 | 0.81 | 0.72 | 0.24 | 0.96 | −0.01 | 0.25 | 0.50 | 0.74 |
| Iris | NB | 0.81 | 0.75 | 0.83 | 0.94 | 0.94 | 0.88 | 0.27 | 0.96 | −0.01 | 0.44 | 0.63 | 0.58 |
| Mushroom | DT | 0.47 | 0.06 | 0.20 | 0.94 | 0.34 | 0.54 | 0.05 | 1.00 | 0.00 | 0.24 | 0.06 | 0.15 |
| Mushroom | LR | 1.00 | 1.00 | 0.86 | 0.48 | 0.58 | 0.33 | 0.01 | 1.00 | 0.00 | 0.29 | 0.58 | 0.11 |
| Mushroom | NB | 0.06 | 0.20 | 0.44 | 0.86 | 0.60 | 0.34 | 0.02 | 0.97 | 0.00 | 0.14 | 0.11 | 0.24 |
| Nursery | DT | 1.00 | 1.00 | 0.27 | 1.00 | 1.00 | 1.00 | 0.04 | 1.00 | 0.00 | 0.13 | 0.26 | 0.18 |
| Nursery | LR | 1.00 | 1.00 | 0.30 | 1.00 | 1.00 | 0.81 | 0.04 | 1.00 | 0.00 | 0.14 | 0.27 | 0.10 |
| Nursery | NB | 1.00 | 1.00 | 0.78 | 1.00 | 1.00 | 0.88 | 0.04 | 1.00 | 0.00 | 0.14 | 0.29 | 0.11 |
Fidelity results on the regression datasets (Pr = precision, Re = recall, Tau = rank correlation; DT = decision tree, LR = linear regression):

| Dataset | Model | LIME Pr | LIME Re | LIME Tau | SHAP Pr | SHAP Re | SHAP Tau | ACV Pr | ACV Re | ACV Tau |
|---|---|---|---|---|---|---|---|---|---|---|
| Bike Rentals | DT | 0.74 | 0.07 | 0.38 | 0.99 | 0.06 | 0.51 | 0.47 | 0.06 | 0.22 |
| Bike Rentals | LR | 1.00 | 0.88 | 0.78 | 1.00 | 0.14 | 0.63 | 0.06 | 0.01 | −0.16 |
| Facebook | DT | 1.00 | 0.18 | 0.50 | 0.92 | 0.17 | 0.59 | 0.34 | 0.03 | 0.37 |
| Facebook | LR | 1.00 | 1.00 | 0.74 | 1.00 | 0.33 | 0.59 | 0.00 | 0.00 | 0.17 |
| Housing | DT | 0.81 | 0.11 | 0.34 | 1.00 | 0.14 | 0.55 | 0.78 | 0.12 | 0.43 |
| Housing | LR | 0.53 | 0.33 | 0.44 | 0.44 | 0.23 | 0.45 | 0.18 | 0.12 | 0.41 |
| Real Estate | DT | 1.00 | 0.33 | 0.48 | 1.00 | 0.31 | 0.56 | 0.84 | 0.27 | 0.12 |
| Real Estate | LR | 0.40 | 0.48 | 0.44 | 0.30 | 0.30 | 0.45 | 0.40 | 0.43 | 0.10 |
| Solar Flare | DT | 0.89 | 0.16 | 0.51 | 0.96 | 0.22 | 0.84 | 0.21 | 0.03 | 0.33 |
| Solar Flare | LR | 1.00 | 1.00 | 0.92 | 0.16 | 0.16 | 0.51 | 0.00 | 0.00 | −0.02 |
| Student Scores | DT | 0.96 | 0.56 | 0.19 | 1.00 | 0.57 | 0.48 | 0.77 | 0.43 | 0.43 |
| Student Scores | LR | 1.00 | 1.00 | 0.91 | 0.08 | 0.08 | 0.61 | 0.00 | 0.00 | −0.30 |
| Wine Quality | DT | 0.99 | 0.33 | 0.59 | 1.00 | 0.31 | 0.70 | 0.45 | 0.18 | 0.10 |
| Wine Quality | LR | 0.43 | 0.52 | 0.65 | 0.34 | 0.39 | 0.68 | 0.20 | 0.22 | −0.08 |
Pairwise comparisons of XAI techniques on the classification datasets (T, z, and p per metric):

| XAI techniques | Precision (T, z, p) | Recall (T, z, p) | F1-score (T, z, p) | Rank correlation (T, z, p) |
|---|---|---|---|---|
| LIME & SHAP | 64,472, −6.6, 0.00 | 62,995, −4.9, 0.00 | 81,245, −5.4, 0.00 | 542,663, −12.9, 0.00 |
| LIME & LINDA-BN | 101,122, −28.8, 0.00 | 111,778, −11.5, 0.00 | 124,550, −27.6, 0.00 | 9,730, −37.1, 0.00 |
| LIME & ACV | 73,746, −25.5, 0.00 | 97,228, −15.4, 0.00 | 155,344, −20.9, 0.00 | 198,713, −29.0, 0.00 |
| SHAP & LINDA-BN | 47,991, −31.6, 0.00 | 108,833, −8.1, 0.00 | 62,908, −30.8, 0.00 | 3,148, −37.4, 0.00 |
| SHAP & ACV | 43,307, −29.1, 0.00 | 74,658, −19.6, 0.00 | 109,395, −25.4, 0.00 | 46,797, −35.5, 0.00 |
| LINDA-BN & ACV | 385,472, −11.1, 0.00 | 82,189, −23.1, 0.00 | 392,239, −11.6, 0.00 | 83,384, −29.7, 0.00 |
Pairwise comparisons of XAI techniques on the regression datasets (T, z, and p per metric):

| XAI techniques | Precision (T, z, p) | Recall (T, z, p) | F1-score (T, z, p) | Rank correlation (T, z, p) |
|---|---|---|---|---|
| LIME & SHAP | 19,874, −9.8, 0.00 | 11,408, −13.5, 0.00 | 16,542, −13.1, 0.00 | 190,176, −17.9, 0.00 |
| LIME & ACV | 31,400, −22.0, 0.00 | 34,864, −19.6, 0.00 | 39,536, −20.2, 0.00 | 118,162, −24.6, 0.00 |
| SHAP & ACV | 13,599, −18.5, 0.00 | 28,498, −13.3, 0.00 | 29,000, −14.4, 0.00 | 36,792, −30.0, 0.00 |
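The paired T, z, and p statistics above are consistent with a Wilcoxon signed-rank test; assuming that is the test used, each cell can be reproduced along these lines (the score arrays below are random placeholders for paired per-instance metric values):

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder paired scores: one value per explained instance for each of
# the two XAI techniques being compared on the same dataset and model.
lime_precision = np.random.default_rng(0).uniform(size=500)
shap_precision = np.random.default_rng(1).uniform(size=500)

# T is the signed-rank statistic; the p-value tests whether the paired
# differences are symmetric about zero.
T, p = wilcoxon(lime_precision, shap_precision)
print(T, p)
```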
5.1 Analysis of fidelity results
5.2 Comparison of XAI techniques
6 Discussion
6.1 Correctness versus completeness
6.2 LIME
6.2.1 Impact of surrogate model
6.2.2 Impact of permutation
6.3 SHAP
6.4 LINDA-BN
| XAI method | Strengths | Weaknesses | Should be used for |
|---|---|---|---|
| LIME | Generally precise in identifying features | Generally does not identify all important features | Model types with relatively distinct relationships between features and prediction |
| | Performs well for decision tree, linear regression, and logistic regression models | Does not correctly rank features by importance | Datasets with mostly categorical variables |
| | More accurate for datasets with a higher proportion of categorical variables | Cannot accurately work with Naïve Bayes models (the relationship between prediction and features is too nonlinear) | Contexts in which not all relevant variables may be important (for example, not in medical decision-making) |
| | | More continuous variables result in poorer explanation correctness | |
| | | Explanation quality is subject to dataset characteristics and, in some cases, model characteristics | |
| SHAP | Generally precise in identifying features | Generally does not identify all important features | Not for investigations of model quality or model fairness (performance may be inconsistent across model types) |
| | Performs well for decision tree and Naïve Bayes models | Does not correctly rank features by importance | End-user decision-making |
| | | Cannot accurately work with linear and logistic regression models | Contexts in which not all relevant variables may be important (for example, not in medical decision-making) |
| | | Explanation quality is subject to model type and the SHAP implementation for that model type | |
| LINDA-BN | Fidelity is relatively consistent across model types | Cannot always distinguish between the most and least important features | Not feature attribution |
| | | Explanations cannot be taken as accurate feature attribution or feature ranking | Model debugging and determining model confidence with no ground truth (i.e. in context) |
| ACV | Can show feature necessity for prediction (i.e. features without which predictions cannot be accurately made) | Cannot always distinguish between the most and least important features | As a self-explaining model (does not function well post hoc) |
| | | Explanations cannot be taken as accurate feature attribution or feature ranking | |
6.5 ACV
6.6 Summary of insights
6.7 Limitations and future work
7 Conclusion

the explanation mechanism, i.e. the engineering of the XAI method, has a strong effect on explanation quality, which may once again require a technical expert to examine and assess whether a given mechanism is suitable for the technical context; and

there is no one “best”, most faithful XAI method, even for a single dataset or model type; all of the methods show significant differences in performance across datasets and models.