1 Introduction
2 Shapley values
2.1 Shapley values in cooperative game theory
- Efficiency: They sum to the difference between the value of the grand coalition \({\mathcal {M}}\) and that of the empty set \(\emptyset \), that is, \(\sum _{j=1}^M \phi _j = v({\mathcal {M}}) - v(\emptyset )\).
- Symmetry: Two equally contributing players j and k, that is, \(v({\mathcal {S}}\cup \{j\}) = v({\mathcal {S}}\cup \{k\})\) for all \({\mathcal {S}}\subseteq {\mathcal {M}}{\setminus }\{j,k\}\), receive equal payouts \(\phi _j = \phi _k\).
- Dummy: A non-contributing player j, that is, \(v({\mathcal {S}}) = v({\mathcal {S}}\cup \{j\})\) for all \({\mathcal {S}}\subseteq {\mathcal {M}}{\setminus }\{j\}\), receives \(\phi _j = 0\).
- Linearity: A linear combination of n games \(\{v_1, \dots , v_n\}\), that is, \(v({\mathcal {S}}) = \sum _{k=1}^n c_k v_k({\mathcal {S}})\), has Shapley values given by \(\phi _j(v) = \sum _{k=1}^n c_k\phi _j(v_k)\).
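The axioms above uniquely determine the Shapley value \(\phi _j = \sum _{{\mathcal {S}}\subseteq {\mathcal {M}}{\setminus }\{j\}} \frac{|{\mathcal {S}}|!\,(M-|{\mathcal {S}}|-1)!}{M!}\big (v({\mathcal {S}}\cup \{j\}) - v({\mathcal {S}})\big )\), which can be evaluated exactly for small M by enumerating all coalitions. A minimal Python sketch for illustration (the three-player game `v` is our own toy example, not from the text):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, M):
    """Exact Shapley values for a value function v over frozensets of {0, ..., M-1}."""
    phi = [0.0] * M
    for j in range(M):
        for size in range(M):
            for S in combinations([p for p in range(M) if p != j], size):
                S = frozenset(S)
                # Shapley weight |S|!(M - |S| - 1)!/M!
                weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                phi[j] += weight * (v(S | {j}) - v(S))
    return phi

# Toy 3-player game: additive worth plus a bonus of 2 when players 0 and 1 cooperate
def v(S):
    return sum(S) + (2.0 if {0, 1} <= S else 0.0)

phi = shapley_values(v, 3)
# Efficiency axiom: the payouts sum to v(grand coalition) - v(empty set)
assert abs(sum(phi) - (v(frozenset({0, 1, 2})) - v(frozenset()))) < 1e-9
```

The double loop visits all \(2^{M-1}\) coalitions excluding j for each player, which is exactly the exponential cost that the approximation strategies of Sect. 2.2.3 aim to avoid.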
2.2 Shapley values in model explanation
2.2.1 Monte Carlo integration
2.2.2 Regression
2.2.3 Approximation strategies
A well-known approximation strategy is the KernelSHAP strategy introduced in Lundberg and Lee (2017) and improved by Covert and Lee (2021). In the KernelSHAP strategy, we sample, e.g., \(N_{\mathcal {S}}= 2000 < 2^M\) coalitions and use only these coalitions to approximate the Shapley value explanations. This strategy enables us to approximate the explanations in tractable time even for large values of M; however, \(N_{\mathcal {S}}\ll 2^M\) will (likely) produce poor approximations.

TreeSHAP is a model-specific speed-up algorithm for tree-based models. There are speed-up strategies for deep neural network models, too, but they are limited to marginal Shapley values (Ancona et al. 2019; Wang et al. 2020).

Another strategy is FastSHAP, which sidesteps the Shapley value formula by training a black-box neural network to directly output the Shapley value explanations. A further way to reduce the computations is to explain groups of similar/correlated features instead of individual features (Jullum et al. 2021).

3 Conditional expectation estimation
We consider six method classes: independence, empirical, parametric, generative, separate regression, and surrogate regression, described in Sects. 3.1, 3.2, 3.3, 3.4, 3.5, and 3.6, respectively. The first four classes estimate the conditional expectation in (2) using Monte Carlo integration, while the last two classes use regression.

3.1 The independence method
In the independence approach, the conditional distribution \(p(\varvec{x}_{\bar{\mathcal {S}}} \mid \varvec{x}_{\mathcal {S}}= \varvec{x}_{\mathcal {S}}^*)\) simplifies to the marginal distribution \(p(\varvec{x}_{\bar{\mathcal {S}}})\), and the corresponding Shapley values are the marginal Shapley values discussed in Sect. 1. The Monte Carlo samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) are generated by randomly sampling observations from the training data; thus, no modeling is needed and \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) follows the assumed true data distribution. However, for dependent features, which are common in observational studies, the independence approach produces biased estimates of the contribution function (2) and the conditional Shapley values. Thus, the independence approach can lead to incorrect conditional Shapley value explanations for real-world data (Aas et al. 2021a; Merrick and Taly 2020; Frye et al. 2021; Olsen et al. 2022).

3.2 The empirical method
The empirical method samples only from similar observations in the training data. The optimal procedure is to use only samples that perfectly match the feature values \({\varvec{x}_{\mathcal {S}}^*}\), as this approach exactly estimates the conditional expectation when the number of matching observations tends to infinity (Chen et al. 2022). However, this is not applicable in practice, as data sets can have few observations, contain a high number of features to match, or have continuous features where an exact match is very unlikely. A natural extension is to relax the perfect match criterion and allow for similar observations (Mase et al. 2019; Sundararajan and Najmi 2020; Aas et al. 2021a). However, this procedure is also influenced by the curse of dimensionality, as conditioning on many features can yield few similar observations and thereby inaccurate estimates of the conditional expectation (2). We can relax the similarity criterion and include less similar observations, but then we break the feature dependencies. The empirical approach coincides with the independence approach when the similarity measure defines all observations in the training data as similar.

We use the empirical approach described in Aas et al. (2021a). The approach uses a scaled version of the Mahalanobis distance to calculate a distance \(D_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{[i]})\) between the observation being explained \(\varvec{x}^*\) and every training instance \(\varvec{x}^{[i]}\). Then they use a Gaussian distribution kernel to convert the distance into a weight \(w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{[i]})\) for a given bandwidth parameter \(\sigma \). All the weights are sorted in decreasing order such that \(\varvec{x}^{\left\{ k \right\} }\) has the kth largest weight. Finally, they approximate (2) by a weighted version of (3), namely, \(v({\mathcal {S}}) \approx \sum _{k=1}^{K^*} w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{\left\{ k \right\} }) f(\varvec{x}_{\bar{\mathcal {S}}}^{\left\{ k \right\} }, \varvec{x}_{\mathcal {S}}^*) \big / \sum _{k=1}^{K^*} w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{\left\{ k \right\} })\). The number of samples used is \(K^* = \min _{L \in \mathbb {N}} \big \{\sum _{k=1}^L w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{\left\{ k \right\} }) \big / \sum _{i=1}^{N_\text {train}} w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{[i]}) > \eta \big \}\), that is, the ratio between the sum of the \(K^*\) largest weights and the sum of all weights must exceed a threshold \(\eta \), for instance, 0.95.

3.3 The parametric method class
In the parametric method class, we make a parametric assumption about the distribution of the data, which simplifies the generation of the conditional Monte Carlo samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\). The idea is to assume a distribution whose conditional distributions have closed-form solutions or are otherwise easily obtainable after estimating the parameters of the full joint distribution. The parametric approaches can yield very accurate representations if the data truly follow the assumed distribution, but they may impose a large bias for incorrect parametric assumptions. In this section, we discuss two previously proposed parametric approaches and introduce two new methods. The current parametric approaches do not support categorical features, which is a major drawback, but one can potentially use the same type of encodings or entity embeddings of the categorical variables as for the empirical method.

3.3.1 Gaussian
In the Gaussian approach, we assume that the features follow a multivariate Gaussian distribution. We sample the conditional samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) from the corresponding Gaussian conditional distribution \(p(\varvec{x}_{\bar{\mathcal {S}}} \mid \varvec{x}_{\mathcal {S}}= \varvec{x}_{\mathcal {S}}^*)\), for \(k=1,2,\dots ,K\) and \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), and use them in (3) to estimate the Shapley values in (1).

3.3.2 Gaussian copula
In the Gaussian copula approach, the idea is to represent the marginals of the features by their empirical distributions and then model the dependence structure by a Gaussian copula. See Appendix B.1 for additional information about copulas.

3.3.3 Burr and generalized hyperbolic
We introduce two new parametric approaches based on the Burr distribution and the generalized hyperbolic (GH) distribution, denoted by Burr and GH, respectively. In contrast to the Gaussian distribution, whose parameters can easily be estimated by the sample means and covariance matrix, the parameters of the Burr and GH distributions are more cumbersome to estimate. We describe the distributions in more detail in Appendix B. The GH distribution is unbounded and can model any continuous data set, while the Burr distribution is strictly positive and is therefore limited to positive data sets. The GH distribution is related to the Gaussian distribution through the t-distribution: the t-distribution is a special case of the GH distribution and coincides with the Gaussian distribution when the degrees of freedom tend to infinity.

3.4 The generative method class
The generative and parametric methods are similar in that both generate Monte Carlo samples from the estimated conditional distributions. However, the generative methods do not make a parametric assumption about the data. We consider two generative approaches: the ctree approach of Redelmeier et al. (2020) and the \(\texttt {VAEAC}\) approach of Olsen et al. (2022). The latter is an extension of the approach suggested by Frye et al. (2021). Both methods support mixed data, i.e., both continuous and categorical data.

3.4.1 Ctree
The ctree approach models the conditional distributions using conditional inference trees (ctree). A ctree is a recursive partitioning algorithm that builds trees by making binary splits on features until a stopping criterion is satisfied (Hothorn et al. 2006). The process is sequential: the splitting feature is chosen first using statistical significance tests, and then the splitting point is chosen using any type of splitting criterion. The ctree algorithm is independent of the dimension of the response, which in our case is \(\varvec{x}_{\bar{\mathcal {S}}}\), while the input features are \({\varvec{x}_{\mathcal {S}}}\), whose dimension varies with the coalition \({\mathcal {S}}\). That is, for each coalition \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), a ctree with \({\varvec{x}_{\mathcal {S}}}\) as the features and \(\varvec{x}_{\bar{\mathcal {S}}}\) as the response is fitted to the training data. For a given \({\varvec{x}_{\mathcal {S}}^*}\), the ctree approach finds the corresponding leaf node and samples K observations with replacement from the \(\varvec{x}_{\bar{\mathcal {S}}}\) part of the training observations in that node to generate the conditional Monte Carlo samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\). We get duplicated Monte Carlo samples when K is larger than the number of samples in the leaf node. Thus, the ctree method weighs the Monte Carlo samples based on their sampling frequencies to bypass redundant calls to f. Therefore, the contribution function \(v({\mathcal {S}})\) is not estimated by (3) but rather by the weighted average \(v({\mathcal {S}}) \approx \frac{1}{K}\sum _{k=1}^{K^*} c_k f(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}, \varvec{x}_{\mathcal {S}}^*)\), where \(K^*\) is the number of unique Monte Carlo samples and \(c_k\) is the sampling frequency of the kth unique sample. For more details, see Redelmeier et al. (2020, Sect. 3).

3.4.2 VAEAC
The ctree approach trains \(2^M-2\) different models, which eventually becomes computationally intractable for large M. In contrast, the \(\texttt {VAEAC}\) approach uses a single variational autoencoder with arbitrary conditioning to model all the conditional distributions simultaneously. The \(\texttt {VAEAC}\) model is trained by maximizing a variational lower bound, which conceptually corresponds to artificially masking features and then trying to reproduce them using a probabilistic representation. In deployment, the \(\texttt {VAEAC}\) method considers the unconditional features \(\varvec{x}_{\bar{\mathcal {S}}}\) as masked features to be imputed.

3.5 The separate regression method class
In the separate regression methods, we train a new regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\) to estimate the conditional expectation for each coalition of features. Related ideas have been explored by Lipovetsky and Conklin (2001), Strumbelj et al. (2009), and Williamson and Feng (2020). However, to the best of our knowledge, we are the first to compare different regression models for estimating the conditional expectation as the contribution function \(v(\mathcal {S})\) in the local Shapley value explanation framework.

3.5.1 Linear regression model
Here, \(g_{\mathcal {S}}\) is a linear regression model; we call this approach LM separate.

3.5.2 Generalized additive model

Here, \(g_{\mathcal {S}}\) is a generalized additive model; we call this approach GAM separate.

3.5.3 Projection pursuit regression

Here, \(g_{\mathcal {S}}\) is a projection pursuit regression model; we call this approach PPR separate.

3.5.4 Random forest

Here, \(g_{\mathcal {S}}\) is a random forest model; we call this approach RF separate.

3.5.5 Boosting

Here, \(g_{\mathcal {S}}\) is a gradient boosted tree model, for which we use the CatBoost implementation (Prokhorenkova et al. 2018). We call this approach CatBoost separate.

3.6 The surrogate regression method class
Since the separate regression methods train a new regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\) for each coalition \(\mathcal {S} \in {\mathcal {P}}^*({\mathcal {M}})\), a total of \(2^M-2\) models have to be trained, which can be time-consuming for slowly fitted models. The surrogate regression method class builds on the ideas from the separate regression class, but instead of fitting a new regression model for each coalition, we train a single regression model \(g(\tilde{\varvec{x}}_{\mathcal {S}})\) for all coalitions \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), where \(\tilde{\varvec{x}}_{\mathcal {S}}\) is defined in Sect. 3.6.1. The surrogate regression idea is used by Frye et al. (2021) and Covert et al. (2021), but their setup is limited to neural networks. In Sect. 3.6.1, we propose a general and novel framework that allows us to use any regression model. Then, we relate our framework to the previously proposed neural network setup in Sect. 3.6.2.

3.6.1 General surrogate regression framework
To develop a surrogate regression method, we must consider that most regression models g rely on a fixed-length input, while the size of \({\varvec{x}_{\mathcal {S}}}\) varies with the coalition \({\mathcal {S}}\). Thus, we are either limited to regression models that support variable-length input, or we can create a fixed-length representation \(\tilde{\varvec{x}}_{\mathcal {S}}\) of \({\varvec{x}_{\mathcal {S}}}\) for all coalitions \({\mathcal {S}}\). The \(\tilde{\varvec{x}}_{\mathcal {S}}\) representation must also include fixed-length information about the coalition \({\mathcal {S}}\) to enable the regression model g to distinguish between coalitions. Finally, we need to augment the training data to reflect that g is to predict the conditional expectation for all coalitions \({\mathcal {S}}\); this augmentation is necessary for the surrogate regression methods to work.

For the surrogate regression method class, we consider the same regression models as in Sect. 3.5. We call the methods LM surrogate, GAM surrogate, PPR surrogate, RF surrogate, and CatBoost surrogate, and they take the following forms:

- LM surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}}) = \beta _{0} + \sum _{j = 1}^{2\,M} \beta _{j}\tilde{x}_{{\mathcal {S}}, j} = \tilde{\varvec{x}}_{\mathcal {S}}^{T}\varvec{\beta }\).
- GAM surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}}) = \beta _{0} + \sum _{j = 1}^{M} g_{j}(\tilde{x}_{{\mathcal {S}}, j}) + \sum _{j = M+1}^{2\,M} \beta _{j}\tilde{x}_{{\mathcal {S}}, j}\). That is, we add nonlinear effect functions to the augmented features \(\hat{\varvec{x}}_{{\mathcal {S}}} = \varvec{x} \circ I({\mathcal {S}})\) in \(\tilde{\varvec{x}}_{{\mathcal {S}}}\) while letting the binary mask indicators in \(\tilde{\varvec{x}}_{{\mathcal {S}}}\) be linear.
- PPR surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}}) = \beta _{0} + \sum _{l=1}^L g_{l}(\varvec{\beta }_{l}^T\tilde{\varvec{x}}_{{\mathcal {S}}})\), where \(g_{l}\) and \(\varvec{\beta }_{l}\) are the lth ridge function and parameter vector, respectively.
- RF surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}})\) is a RF model fitted to the augmented data in the same form as in (5).
- CatBoost surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}})\) is a CatBoost model fitted to the augmented data in the same form as in (5).
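To make the fixed-length representation concrete, the following Python sketch (our own illustration; the function and variable names are not from the paper) builds an augmented input in the spirit of (5): each observation is repeated once per coalition \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), features outside \({\mathcal {S}}\) are set to zero, and the binary mask \(I({\mathcal {S}})\) is concatenated, giving \(2M\) columns:

```python
from itertools import combinations
import numpy as np

def augment(X):
    """Surrogate training input: one row per (observation, coalition), with
    masked features set to zero and the coalition mask I(S) appended."""
    n, M = X.shape
    rows = []
    for size in range(1, M):              # proper, non-empty coalitions S
        for S in combinations(range(M), size):
            mask = np.zeros(M)
            mask[list(S)] = 1.0
            # x o I(S) concatenated with I(S) itself
            rows.append(np.hstack([X * mask, np.tile(mask, (n, 1))]))
    return np.vstack(rows)

X = np.random.rand(5, 3)                  # toy data: N_train = 5, M = 3
X_aug = augment(X)
# 2^M - 2 = 6 coalitions, each contributing 5 rows of length 2M = 6
assert X_aug.shape == (6 * 5, 6)
```

The concatenated mask columns are what allow a single model g to tell apart an actual zero in the data from a zero induced by the masking.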
3.6.2 Surrogate regression: neural networks
The surrogate regression neural network (NN-Frye surrogate) approach in Frye et al. (2021) differs from our general setup above in that they do not train the model on the complete augmented data. Instead, for each observation in every batch in the training process, they randomly sample a coalition \({\mathcal {S}}\) with probability \(\frac{|\mathcal {S}|!(M-|\mathcal {S}|-1)!}{M!}\). Then they set the masked entries of the observation, i.e., the features not in \({\mathcal {S}}\), to an off-distribution value not present in the data. Furthermore, they do not concatenate the masks to the data, as we do in (5).

We also propose our own neural network surrogate (NN-Olsen surrogate) approach to illustrate that one can improve on the NN-Frye surrogate method. The main conceptual differences between the methods are the following. First, for each batch, we generate a missing completely at random (MCAR) mask with paired sampling. MCAR means that the binary entries in the mask \(I({\mathcal {S}})\) are Bernoulli distributed with probability 0.5, which ensures that all coalitions are equally likely to be considered. Further, paired sampling means that we duplicate the observations in the batch and apply the complement mask, \(I(\bar{{\mathcal {S}}})\), to these duplicates. This ensures more stable training, as the network can associate both \(\varvec{x}_{{\mathcal {S}}}\) and \(\varvec{x}_{\bar{{\mathcal {S}}}}\) with the response \(f(\varvec{x})\). Second, we set the masked entries to zero and include the binary mask entries as additional features, as done in (5) and Olsen et al. (2022). This enables the network to learn to distinguish between actual zeros in the data set and zeros induced by the masking, removing the need for an off-distribution masking value. Additional differences due to implementation, for example, network architecture and optimization routine, are elaborated in Appendix A.

3.7 Time complexity
- Training: The computation time of the training step depends on several method (class) specific attributes. The independence method is trivial, as no training is needed; hence, its training time is zero. For all other methods, the computation time is affected by the number of features M, the number of training observations \(N_\text {train}\), and the number of models within the method. The number of models is one for the \(\texttt {VAEAC}\) and surrogate regression methods, while it is \(2^M-2\) for the other methods. In the former case, only a single model is trained, but in return, the surrogate regression methods using (5) have doubled M and increased \(N_\text {train}\) by a factor of \(2^M-2\). In the latter case, the number of features in each of the \(2^M-2\) models varies from 1 to M, depending on the coalition \({\mathcal {S}}\), which also alters the computation time. The computation time also increases if cross-validation is used to tune some or all of the hyperparameters. Thus, the overall training time complexity is \(\mathcal {O}(C_\text {train}2^M)\), where \(C_\text {train}\) is method-specific and depends on the factors discussed above. For example, \(C_\text {train} = 0\) for the trivial independence method. In contrast, estimating a single conditional multivariate normal distribution in the Gaussian method yields \(C_\text {train} = \mathcal {O}(M^2(M+N_\text {train}))\), and the same holds for training a single non-cross-validated linear model in the LM separate method, while \(C_\text {train} = \mathcal {O}(MN_\text {trees}N_\text {train}\log _2N_\text {train})\) for the RF separate method, where \(N_\text {trees}\) is the number of balanced trees in the forest. Thus, the complexities of the four methods are constant, linear, linear, and log-linear in the number of training observations \(N_\text {train}\), respectively.
- Generating: The computation time of the generating step applies only to the Monte Carlo-based methods, as the regression-based methods do not generate any Monte Carlo samples. The time needed to generate a \(|\bar{\mathcal {S}}|\)-dimensional Monte Carlo sample can vary between coalitions due to the coalition size. The time complexity of generating the K Monte Carlo samples for the \(2^M-2\) coalitions and \(N_\text {test}\) test observations is \(\mathcal {O}(C_\text {MC} K N_\text {test} 2^M)\), where \(C_\text {MC}\) is method-specific and represents the cost of generating one Monte Carlo sample. For example, in the independence method, \(C_\text {MC}\) corresponds to sampling one observation from the training data, which is done in constant time, that is, \(C_\text {MC} = \mathcal {O}(1)\). In the Gaussian method, \(C_\text {MC}\) represents the cost of generating standard Gaussian data and converting it to the associated multivariate conditional Gaussian distribution using the Cholesky decomposition of the conditional covariance matrix, which is \(\mathcal {O}(M^3)\). Note, however, that the cost of the Cholesky decomposition can be shared among all K Monte Carlo samples and \(N_\text {test}\) test observations. The computation of the Cholesky decomposition could also have been considered part of the training step.
- Predicting: The computation time of the predicting step varies between the Monte Carlo and regression paradigms due to their conceptually different techniques for computing \(v({\mathcal {S}})\). The Monte Carlo paradigm computes the contribution function \(v({\mathcal {S}})\) based on (3), while the regression paradigm uses (4). The Monte Carlo paradigm relies on averaging K calls to the predictive model f for each of the \(2^M-2\) coalitions and \(N_\text {test}\) test observations. Thus, the overall predicting time complexity for the Monte Carlo paradigm is \(\mathcal {O}(C_f K N_\text {test} 2^M)\), where \(C_f\) represents the computation time of calling f once. In the regression paradigm, the value of the contribution function is directly estimated as the output of a regression model g. Thus, the overall predicting time complexity for the regression paradigm is \(\mathcal {O}(C_g N_\text {test} 2^M)\), where \(C_g\) represents the computation time of calling g once. Both \(C_f\) and \(C_g\) are influenced by, e.g., the number of features M and training observations \(N_\text {train}\), but also by the intricacy of the predictive and regression model, respectively. For example, the time complexity of calling a linear regression model with M features once is \(\mathcal {O}(M)\), while it is \(\mathcal {O}(N_\text {trees}\log _2N_\text {train})\) for a random forest model with \(N_\text {trees}\) balanced trees.
Paradigm/Method class | Training | Generating \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) | Predicting \(v({\mathcal {S}})\)
---|---|---|---
Monte Carlo | \(\mathcal {O}(C_\text {train}2^M)\) | \(\mathcal {O}(C_\text {MC} K N_\text {test} 2^M)\) | \(\mathcal {O}(C_f K N_\text {test} 2^M)\)
Separate regression | \(\mathcal {O}(C_\text {train}2^M)\) | – | \(\mathcal {O}(C_g N_\text {test} 2^M)\)
Surrogate regression | \(\mathcal {O}(C_\text {train})\) | – | \(\mathcal {O}(C_g N_\text {test} 2^M)\)
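To illustrate the generating step of the Gaussian method (Sect. 3.3.1) and the shared Cholesky cost discussed above, here is a minimal Python sketch (our own, with assumed variable names) that draws K conditional samples using the standard conditional multivariate normal formulas and a single Cholesky factorization reused for all K draws:

```python
import numpy as np

def sample_conditional_gaussian(mu, Sigma, S, x_S, K, rng):
    """Draw K samples of x_{Sbar} | x_S = x_S* under N(mu, Sigma),
    where S indexes the conditioning features."""
    Sbar = [j for j in range(len(mu)) if j not in S]
    A = Sigma[np.ix_(Sbar, Sbar)]
    B = Sigma[np.ix_(Sbar, S)]
    C = Sigma[np.ix_(S, S)]
    # Conditional mean mu_bar + B C^{-1} (x_S - mu_S) and covariance A - B C^{-1} B^T
    cond_mu = mu[Sbar] + B @ np.linalg.solve(C, x_S - mu[S])
    cond_Sigma = A - B @ np.linalg.solve(C, B.T)
    L = np.linalg.cholesky(cond_Sigma)    # O(|Sbar|^3), shared by all K draws
    Z = rng.standard_normal((K, len(Sbar)))
    return cond_mu + Z @ L.T

rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]])
draws = sample_conditional_gaussian(mu, Sigma, S=[0], x_S=np.array([1.0]), K=10000, rng=rng)
# Empirical mean of x_2 | x_1 = 1 should be close to Sigma_21 * 1 = 0.5
```

Factoring the conditional covariance once per coalition, rather than per sample, is exactly why the \(\mathcal {O}(M^3)\) Cholesky cost can be amortized over K and \(N_\text {test}\).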
3.8 Additional methods in the supplement
We present additional generative, separate regression, and surrogate regression methods in the Supplement. These methods are not included in the main text as they generally perform worse than the introduced methods. For the generative method class, we consider three additional \(\texttt {VAEAC}\) approaches with methodological differences and point to eleven other potential generative methods. For the separate regression method class, we consider twenty other regression models, most of which are also applicable to the surrogate regression method class. Among the regression methods are linear regression with interactions, polynomial regression with and without interactions, elastic nets, generalized additive models, principal component regression, partial least squares, K-nearest neighbors, support vector machines, decision trees, boosting, and neural networks. In the Supplement, we apply the additional methods to the numerical simulation studies and real-world data experiments conducted in Sects. 4 and 5, respectively.

4 Numerical simulation studies
4.1 Linear regression models
- lm_no_interactions: \(f_{\text {lm}, \text {no}}(\varvec{x}) = \beta _0 + \sum _{j=1}^{M} \beta _jx_j\),
- lm_more_interactions: \(f_{\text {lm}, \text {more}}(\varvec{x}) = f_{\text {lm}, \text {no}}(\varvec{x}) + \gamma _1x_1x_2 + \gamma _2x_3x_4\),
- lm_numerous_interactions: \(f_{\text {lm}, \text {numerous}}(\varvec{x}) = f_{\text {lm}, \text {more}}(\varvec{x}) + \gamma _3x_5x_6 + \gamma _4x_7x_8\).
In the lm_more_interactions setup, for example, the predictive linear model f has eight linear terms and two interaction terms, reflecting the form of \(f_{\text {lm}, \text {more}}\). We fit the predictive models using the lm function in base R.
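For concreteness, the data generation for, e.g., the lm_more_interactions setup can be sketched as follows. This is a Python illustration with placeholder coefficient values and equicorrelated Gaussian features; the actual experiments are implemented in R and need not use these exact settings:

```python
import numpy as np

def simulate_lm_more(n, M=8, rho=0.5, seed=1):
    """Correlated N(0, 1) features with equicorrelation rho, and a response
    following f_lm,more plus unit-variance noise. Coefficients are placeholders."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((M, M), rho) + (1 - rho) * np.eye(M)
    X = rng.multivariate_normal(np.zeros(M), Sigma, size=n)
    beta = np.ones(M + 1)                  # beta_0, ..., beta_M (placeholders)
    gamma = np.array([1.0, 1.0])           # gamma_1, gamma_2 (placeholders)
    y = (beta[0] + X @ beta[1:]
         + gamma[0] * X[:, 0] * X[:, 1]    # x_1 * x_2 interaction
         + gamma[1] * X[:, 2] * X[:, 3]    # x_3 * x_4 interaction
         + rng.standard_normal(n))         # Var(eps) = 1
    return X, y

X, y = simulate_lm_more(1000)
```

Dropping the two interaction terms gives the lm_no_interactions setup, and adding \(\gamma _3x_5x_6 + \gamma _4x_7x_8\) gives lm_numerous_interactions.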
- lm_no_interactions (Fig. 2): For \(\rho = 0\), we see that ctree and LM surrogate perform the best. The independence approach, which makes the correct feature independence assumption, is close behind. For \(\rho > 0\), the parametric and separate regression (LM, GAM, and PPR) methods generally perform the best. In particular, the LM separate method, which makes the correct model assumption, is the best-performing approach. The generative and empirical approaches form the mid-field, while the surrogate regression and independence methods seem to be the least precise.
- lm_more_interactions (Fig. 3): In this case, the LM separate method performs poorly, which is reasonable due to the incorrect model assumption. For \(\rho = 0\), the ctree approach is the most accurate, but the independence and parametric methods are close behind. For \(\rho > 0\), the parametric methods are clearly the best approaches, as they make the correct parametric assumption. The PPR separate method performs very well, and the generative approaches are almost on par for moderate correlation. The NN-Olsen surrogate method is the most accurate surrogate regression approach. In general, the separate regression methods perform better as \(\rho \) increases due to the simpler regression problems/prediction tasks. The performance of the GAM separate method particularly improves for larger values of \(\rho \).
- lm_numerous_interactions (Fig. 4): The overall tendencies are very similar to those in the lm_more_interactions experiment. The parametric methods are by far the most accurate. Further, ctree is the best generative approach, NN-Olsen surrogate is the best surrogate regression method, and the PPR separate method is the best separate regression approach.
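The rankings above are based on how close the estimated Shapley values are to the true ones, summarized by the mean absolute error (MAE) criterion referred to in Sect. 4.4. A minimal sketch of such a criterion, averaging absolute errors over test observations and features (our own formulation of the aggregation, not the paper's exact definition):

```python
import numpy as np

def shapley_mae(phi_hat, phi_true):
    """Mean absolute error between estimated and true Shapley values,
    averaged over the N_test observations (rows) and M features (columns)."""
    phi_hat = np.asarray(phi_hat)
    phi_true = np.asarray(phi_true)
    return np.mean(np.abs(phi_hat - phi_true))

# Toy example with N_test = 2 observations and M = 2 features
phi_true = np.array([[1.0, -0.5], [0.2, 0.3]])
phi_hat = np.array([[1.1, -0.4], [0.0, 0.3]])
# Absolute errors: 0.1, 0.1, 0.2, 0.0 -> MAE close to 0.1
```

A lower MAE means the method's contribution function estimates translate into more accurate Shapley value explanations.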
4.2 Generalized additive models
We extend the lm_no_interactions model to a full generalized additive model by applying the nonlinear function \(\cos (x_j)\) to a subset of the features in \(\varvec{x}\). Then, we extend the full generalized additive model by also including pairwise nonlinear interaction terms of the form \(g(x_j,x_k) = x_jx_k + x_jx_k^2 + x_kx_j^2\). We generate the features \(\varvec{x}^{[i]}\) as before, but the response value \(y^{[i]}\) is now generated according to:

- gam_three: \(f_{\text {gam}, \text {three}}(\varvec{x}) = \beta _0 + \sum _{j=1}^{3}\beta _j\cos (x_j) + \sum _{j=4}^{M} \beta _jx_j\),
- gam_all: \(f_{\text {gam}, \text {all}}(\varvec{x}) = \beta _0 + \sum _{j=1}^{M}\beta _j\cos (x_j)\),
- gam_more_interactions: \(f_{\text {gam}, \text {more}}(\varvec{x}) = f_{\text {gam}, \text {all}}(\varvec{x}) + \gamma _1g(x_1, x_2) + \gamma _2g(x_3, x_4)\),
- gam_numerous_interactions: \(f_{\text {gam}, \text {numerous}}(\varvec{x}) = f_{\text {gam}, \text {more}}(\varvec{x}) + \gamma _3g(x_5, x_6) + \gamma _4g(x_7, x_8)\).
In the gam_three experiment, the fitted predictive model f uses splines on the first three features while the others are linear. In the gam_more_interactions experiment, f uses splines on all eight features and tensor product smooths on the two nonlinear interaction terms. We fit the predictive models using the mgcv package with default parameters (Wood 2006a, 2022). In what follows, we briefly summarize the results of the different simulation setups.
- gam_three (Fig. 5): Contrary to the lm_no_interactions experiment, we see that the LM separate approach performs much worse than the GAM separate approach, which makes sense as we have moved from a linear to a nonlinear setting. For \(\rho = 0\), we see that ctree and independence are the best approaches. For \(\rho > 0\), the parametric approaches are superior, but the GAM separate approach is not far behind, while the NN-Olsen surrogate method is the best surrogate regression approach.
- gam_all (Fig. 6): The performance of the LM approaches continues to degrade. The separate regression methods get gradually better for higher values of \(\rho \), but the parametric methods are still superior. The generative methods constitute the second-best class for \(\rho \in \{0.3, 0.5\}\), but the GAM separate and PPR separate approaches are relatively close. The latter approaches outperform the generative methods when \(\rho = 0.9\).
- gam_more_interactions (Fig. 7): We see similar results to those in the gam_all experiment. The parametric approaches are superior in all settings. The generative methods perform quite well for \(\rho < 0.5\), but they are beaten by the PPR separate method for \(\rho = 0.9\). Note that the GAM separate approach now falls behind the PPR separate approach, as it is not complex enough to model the nonlinear interaction terms. This indicates that complex separate regression approaches are needed to model complex predictive models. Furthermore, the RF surrogate method is on par with or outperforms the NN-based surrogate regression approaches.
- gam_numerous_interactions (Fig. 8): We get nearly identical results to those in the previous experiment. Hence, we do not provide further comments.
4.3 Computation time
gam_more_interactions
experiment with \(\rho = 0.5\) and \(N_\text {train} = 1000\)gam_more_interactions
experiment with \(\rho = 0.5\), \(N_\text {train} = 1000\), and \(N_\text {test} = 250\) in Sect. 4.2. We split the total time into the same three time components as in Sect. 3.7. That is, time used training the approaches, time used generating the Monte Carlo samples, and time used predicting the \(v({\mathcal {S}})\) using Monte Carlo integration (including the calls to f) or regression. We denote these three components by training, generating, and predicting, respectively. The matrix multiplication needed to estimate the Shapley values from the estimated contribution functions is almost instantaneous and is part of the predicting time. Furthermore, creating the augmented training data for the surrogate regression
methods in Sect. 3.6.1 takes around one second and is part of the training time. We see a wide spread in the times, but, in general, the Monte Carlo approaches take, on average, around half an hour, while the regression methods are either much faster or slower, depending on the approach.gam_more_interactions
experiment is slow, as we can see in Table 2, since the predicting time constitutes the majority of the total time. To compare, \(N_f\) calls to the linear model in the lm_more_interactions
experiment takes approximately 3 CPU seconds, while the GAMs in the gam_three
and gam_more_interactions
experiments take roughly 13 and 35 CPU minutes, respectively. In the latter experiment, the PPR and RF models in Sect. 4.5 take around 0.5 and 40 CPU minutes, respectively.empirical
and ctree
approaches have lower predicting time than the other Monte Carlo-based methods due to fewer calls to f since they use weighted Monte Carlo samples; see Sects. 3.2 and 3.4.1. The three influential time factors for the Monte Carlo methods are: the training time of the approach (estimating the parameters), the sampling time of the Monte Carlo samples, and the computational cost of calling f; see Sect. 3.7.separate regression
and surrogate regression
methods use roughly the same time to estimate the Shapley values for different predictive models f, as f is only called \(N_\text {train}\) times when creating the training data sets. After that, we train the separate regression
and surrogate regression
approaches and use them to directly estimate the contribution functions. The influential factors for the regression methods are the training time of the \(2^M-2\) separate models (or the one surrogate model) and the prediction time of calling them a total of \(N_\text {test}(2^M-2)\) times. The former is the primary factor, and it is influenced by, e.g., hyperparameter tuning and the training data size. The latter can be a problem for the augmented training data for the surrogate regression
methods, as we will see in Sect. 5.4.gam_more_interactions
experiment with \(\rho = 0.5\), i.e., the Gaussian
and PPR separate
methods, respectively. The Gaussian
approach uses approximately 37 CPU minutes to explain 250 predictions, an average of 8.88 seconds per explanation. In contrast, the PPR separate
method explains all the \(N_\text {test} = 250\) predictions in half a second. Thus, the PPR separate
method is approximately 4440 times faster than the Gaussian
approach per explanation, which is essential for large values of \(N_\text {test}\). However, note that this factor is substantially lower for predictive models that are less computationally expensive to call.RF separate
method is by far the slowest separate regression
method, but the large training time is (mainly) due to the method’s extensive hyperparameter tuning described in Appendix A. As the number of folds and hyperparameter combinations in the cross-validation procedure is \(N_\text {folds} = 4\) and \(N_\text {hyper} = 12\), respectively, we fit a total of \(N_\text {folds}N_\text {hyper}(2^M-2) = 12\,192\) random forest models. In contrast, the LM separate
method directly fits the \(2^M-2\) separate models without any tuning. In the Supplement, we omit the cross-validation procedure and use default hyperparameter values, which significantly reduces the computation time but also the performance. The two slowest methods are the NN-Frye
and NN-Olsen surrogate
methods, which consider six and nine hyperparameter combinations each. Thus, using default values would reduce the training time by factors of 6 and 9, respectively, but at the cost of precision.
4.4 Number of training observations
First, the independence
approach becomes more accurate relative to the other methods when \(N_\text {train} = 100\), and relatively worse when \(N_\text {train} \in \left\{ 5000, 20{,}000 \right\} \). This is intuitive, as modeling the data distribution/response is easier when the methods have access to more data. Second, in the simple experiments in Sects. 4.1 and 4.2 with \(N_\text {train} \in \left\{ 5000, 20{,}000 \right\} \), the GAM separate
and PPR separate
approaches become even better, but are still beaten by the Gaussian
and copula
approaches in most experiments. Third, we observe that the MAE has a tendency to decrease when \(N_\text {train}\) increases. However, we cannot directly compare the MAE scores as they depend on the fitted predictive model f, which changes when \(N_\text {train}\) is adjusted.
4.5 Other choices for the predictive model
For the gam_more_interactions
experiment, the MSE test prediction values were 1.32, 3.67, and 7.36 for the GAM, PPR, and RF models, respectively, where 1 is the theoretical optimum as \({\text {Var}}(\varepsilon ) = 1\). We focus on the gam_more_interactions
experiment, as the corresponding figures for the other experiments are almost identical. The results are displayed in Figs. 9 and 10 for the PPR and RF models, respectively, and the results are quite similar to those obtained for the GAM model in Fig. 7. In general, the parametric
methods are superior, followed by the generative
methods, while the empirical
, separate regression
, and surrogate regression
approaches are worse. Some separate regression
approaches perform, however, much better for high dependence. The independence
method performs well when \(\rho = 0\), but it gradually degenerates as the dependence level increases, as expected. We see that the PPR separate
approach performs well for the PPR predictive model, but it is outperformed by the CatBoost separate
method for the RF models. These results indicate that for our experiments, it is beneficial to choose a regression method similar to the predictive model; that is, for a non-smooth model, one should consider using a non-smooth regression method. However, note that the difference in the MAE is minuscule.
4.6 Different data distribution
The parametric
Burr
approach, which correctly assumes Burr distributed data, is unsurprisingly the most accurate. The Gaussian
method, which now incorrectly assumes Gaussian distributed data, performs worse. The \(\texttt {VAEAC}\) approach performs very well on the Burr distributed data, which was also observed by Olsen et al. (2022). In general, \(\texttt {VAEAC}\) is the second-best approach after Burr
. The PPR separate
method also performs well, but compared to the Burr
and \(\texttt {VAEAC}\) approaches, it is less precise in the experiments with nonlinear interaction terms.
4.7 Summary of the experiments
The parametric
methods significantly outperform the other approaches in most settings. In general, if the distribution is unknown, the second-best option for low to moderate levels of dependence is the generative
method class. The separate regression
approaches improve relative to the other methods when the feature dependence increases, and for highly dependent features, the PPR separate
approach is a prime choice. Furthermore, the separate regression
methods that match the form of f often give more accurate Shapley value estimates. The PPR model in the PPR separate
approach is simple to fit but is still very flexible and can, therefore, accurately model complex predictive models. The independence
approach is accurate for no (or very low) feature dependence, but it is often the worst approach for high feature dependence. The NN-Olsen surrogate
method outperforms the NN-Frye surrogate
approach in most settings and is generally the best surrogate regression
approach.separate regression
and surrogate regression
methods to make them more competitive. Using default hyperparameter values usually resulted in less accurate Shapley value explanations; see additional experiments in the Supplement. The hyperparameter tuning can be time-consuming, but it was feasible in our setting with \(M=8\) features and \(N_\text {train} = 1000\) training observations. The regression-based methods use most of their computation time on training, while the predicting step is almost instantaneous for several methods. The opposite holds for the Monte Carlo-based approaches, which are overall slower than most regression-based methods. Hence, we have a trade-off between computation time and Shapley value accuracy in the numerical simulation studies. We did not conduct hyperparameter tuning for the empirical
, parametric
, and generative
methods. Thus, the methods for which we conducted hyperparameter tuning have an unfair advantage regarding the precision of the estimated Shapley values.
5 Real-world data experiments
q. Thus, a lower value of (7) indicates that the estimated contribution function \(\hat{v}_{\texttt {q}}\) is closer to the true counterpart \(v_{\texttt {true}}\). We compare the two criteria in the gam_more_interactions
experiment with Gaussian distributed data with \(\rho = 0.5\). Note that the orderings of the two criteria are not one-to-one, but they give fairly similar rankings of the methods.
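To make the final step concrete: once a method has produced estimates \(\hat{v}({\mathcal {S}})\) for all coalitions, the Shapley values follow from the classical weighted sum in Sect. 2.1. A minimal sketch (illustrative only; the function name and dict-based interface are our own, not the shapr implementation):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, M):
    """Compute Shapley values from a contribution function.

    v: dict mapping frozenset coalitions S (subsets of {0, ..., M-1})
       to the estimated contribution v(S); must contain all 2^M subsets.
    Returns a list of the M Shapley values.
    """
    phi = [0.0] * M
    for j in range(M):
        others = [k for k in range(M) if k != j]
        for size in range(M):
            for S in combinations(others, size):
                S = frozenset(S)
                # Shapley weight |S|! (M - |S| - 1)! / M!
                w = factorial(size) * factorial(M - size - 1) / factorial(M)
                phi[j] += w * (v[S | {j}] - v[S])
    return phi
```

By the efficiency property, the returned values sum to \(v({\mathcal {M}}) - v(\emptyset )\), which is a useful sanity check on the estimated contribution functions.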
5.1 Abalone
The features are Length
, Diameter
, Height
, WholeWeight
, ShuckedWeight
, VisceraWeight
, ShellWeight
, and Sex
. All features are continuous except for Sex
, which is a three-level categorical feature (infant, female, male). Thus, the empirical
and parametric
methods are not applicable. However, to remedy this, we train two PPR models to act as our predictive models: one based on all features (\(\text {PPR}_\text {all}\)) and another based solely on the continuous features (\(\text {PPR}_\text {cont}\)). We chose the PPR model as it outperformed the other prediction models we fitted (GAM, RF, CatBoost). The test MSE increases from 2.04 to 2.07 when excluding \(\texttt {Sex}\). Cross-validation determined that the number of terms in \(\text {PPR}_\text {all}\) and \(\text {PPR}_\text {cont}\) should be 4 and 7, respectively. The overall best methods are the PPR separate
, NN-Olsen surrogate
, and \(\texttt {VAEAC}\) methods. For the \(\text {Abalone}_\text {cont}\) data set, the PPR separate
and NN-Olsen surrogate
methods perform equally well and share first place, but both methods are marginally outperformed by the \(\texttt {VAEAC}\) approach for the \(\text {Abalone}_\text {all}\) data set. However, both the \(\texttt {VAEAC}\) and NN-Olsen surrogate
methods are very slow compared to the PPR separate
approach. The second-best Monte Carlo-based method for the \(\text {Abalone}_\text {cont}\) data set is the Gaussian copula
approach, even though the Abalone data set is far from Gaussian distributed. This is probably because the copula
method makes a parametric assumption only about the copula/dependence structure, not about the marginal distributions of the data, which makes it more robust than the Gaussian
approach.
5.2 Diabetes
The features are Age
, Sex
, BMI
, BP
(blood pressure), and six blood serum measurements (S1
, S2
, S3
, S4
, S5
, S6
) obtained from 442 diabetes patients. The response of interest is a quantitative measure of disease progression one year after the baseline. Like Efron et al. (2004), we treat Sex
as numerical and standardize all features; hence, we can apply all methods. Many features are strongly correlated, with a mean absolute correlation of 0.35, while the maximum is 0.90. The Age
feature is the least correlated with the other features. Most scatter plots and marginal density functions display structures and marginals somewhat similar to the Gaussian distribution, except those related to the S4
feature, which has a multi-modal marginal. We split the data into a training and test data set at a 75–25 ratio, and we let the predictive model be a principal component regression (PCR) model with six principal components. This model outperformed the linear model and cross-validated random forest, XGBoost, CatBoost, PPR, and NN models in prediction error on the test data. The PCR model is not easily interpretable as it does not directly depend on the features but on their principal components. The LM separate
, GAM separate
, and PPR separate
methods obtain the lowest \({\text {MSE}}_v\) scores, with the \(\texttt {VAEAC}\), Gaussian
, and copula
approaches having nearly as low \({\text {MSE}}_v\) scores. We are not surprised that the latter two methods are competitive due to the Gaussian-like structures in the Diabetes data set. The LM separate
method is the fastest approach, with a CPU time of 1.9 seconds.
5.3 Red wine
The response is the quality
score between 0 and 10, while the \(M = 11\) continuous features are based on physicochemical tests: fixed acidity
, volatile acidity
, citric acid
, residual sugar
, chlorides
, free sulfur dioxide
, total sulfur dioxide
, density
, pH
, sulphates
, and alcohol
. For the Red Wine data set, most scatter plots and marginal density functions display structures and marginals far from the Gaussian distribution, as most of the marginals are right-skewed. Many of the features have low to moderate correlation, with a mean absolute correlation of 0.20, while the largest correlation in absolute value is 0.683 between pH
and fix_acid
. The data set contains 1599 wines, and we split it into a training (1349) and a test (250) data set. A cross-validated XGBoost model and a random forest with 500 trees perform equally well on the test data, and we use the latter as the predictive model f. The RF separate
approach is the best method by far. Next, we have the CatBoost separate
, RF surrogate
, empirical
, and \(\texttt {VAEAC}\) methods. The RF surrogate
and CatBoost surrogate
perform well compared to the other surrogate regression
methods. The good performance of the non-smooth RF separate
and CatBoost separate
methods on the non-smooth predictive model f supports our findings from the simulation studies, where we observed that using a separate regression
method with the same form as f was beneficial. The generative
methods perform better than the GH
and copula
methods, while the Gaussian
method falls behind. This is intuitive, as the data distribution of the Red Wine data set is far from the Gaussian distribution.
5.4 Adult
The features are age
(cont.), workclass
(7 cat.), fnlwgt
(cont.), education
(16 cat.), education-num
(cont.), marital-status
(7 cat.), occupation
(14 cat.), relationship
(6 cat.), race
(5 cat.), sex
(2 cat.), capital-gain
(cont.), capital-loss
(cont.), hours-per-week
(cont.), and native-country
(41 cat.). The pairwise Pearson correlation coefficients for the continuous features are all close to zero, with a mean absolute correlation of 0.06. The data set contains 30,162 individuals, and we split it into a training (30,000) and a test (162) data set. We train a CatBoost model on the training data to predict an individual’s probability of making over $50,000 a year and use the test data to compute the evaluation criterion. We used a relatively small test set due to memory constraints, and we chose CatBoost as it outperformed the other prediction models we fitted (LM, GAM, RF, NN). The best method is the CatBoost separate
approach, while second place is shared by the RF separate
and \(\texttt {VAEAC}\) methods. Note that the difference in the \({\text {MSE}}_v\) score is very small. Like in the previous experiments, we observe that using a separate regression
method with the same form as f is beneficial. The ctree
approach supports mixed data, but we deemed it infeasible due to a very long computation time. Furthermore, the surrogate regression
methods based on (5) ran out of memory as \(\mathcal {X}_\text {aug}\) consists of \(30{,}000 \times (2^{14}-2) = 491{,}460{,}000\) training observations. The long training time of the RF separate
method is due to the hyperparameter tuning, as discussed in Sect. 4.3. In the Supplement, we include a random forest method without hyperparameter tuning but instead with default values and a lower number of trees, denoted by RF-def separate
. It obtains an \({\text {MSE}}_v = 0.028\) with a training time of 12:49:17.0, i.e., a \(99.5\%\) reduction in the training time. The competitive \({\text {MSE}}_v\) illustrates that hyperparameter tuning was not essential in this experiment, unlike in the simulation studies where hyperparameter tuning was crucial for obtaining a competitive method. Furthermore, as discussed in Sect. 2.2.3, using an approximation strategy with \(N_{\mathcal {S}}< 2^M-2\) coalitions would also speed up the computations. This could also make the other surrogate regression
methods applicable, as the size of the augmented training data would be reduced by a factor of \(N_{\mathcal {S}}/(2^M-2)\).
6 Recommendations
The independence
approach is the simplest method to use. The parametric
approach with the correct (or nearly correct) parametric assumption about the data distribution generates the most accurate Shapley values.-
The
copula
method makes an assumption only about the copula/dependence structure, not about the marginals of the data, which makes it a more robust method. -
For features that do not fit the assumed distribution in the
parametric
approach, one can consider transformations, for example, power transformations, to make the data more Gaussian-like distributed. -
For categorical features, one can use, e.g., encodings or entity embeddings to represent the categorical features as numerical. This is needed, as no directly applicable multivariate distribution exists for mixed data. However, there exist copulas that support mixed data.
-
If the
parametric
methods are not applicable, the next best option is (often) a generative
or separate regression
method, where all considered approaches support mixed data sets by default.
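To illustrate the Gaussian variant of the parametric approach recommended above: conditioning a multivariate Gaussian on the observed features \(x_{\mathcal {S}}\) has a closed form, from which the Monte Carlo samples are drawn. A numpy-based sketch (not the shapr implementation; the function name is ours):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, S, x_S):
    """Mean and covariance of the remaining features given x_S under N(mu, Sigma).

    S: list of indices of the conditioned-on (observed) features.
    x_S: observed values for the features in S.
    """
    M = len(mu)
    Sbar = [j for j in range(M) if j not in S]
    Sigma_SS = Sigma[np.ix_(S, S)]
    Sigma_bS = Sigma[np.ix_(Sbar, S)]
    Sigma_bb = Sigma[np.ix_(Sbar, Sbar)]
    # Standard conditioning formulas for the multivariate Gaussian
    mu_c = mu[Sbar] + Sigma_bS @ np.linalg.solve(Sigma_SS, x_S - mu[S])
    Sigma_c = Sigma_bb - Sigma_bS @ np.linalg.solve(Sigma_SS, Sigma_bS.T)
    return mu_c, Sigma_c
```

Monte Carlo samples for estimating \(v({\mathcal {S}})\) can then be drawn from the resulting Gaussian, e.g., via `np.random.default_rng().multivariate_normal`.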
For the separate
and surrogate regression
methods, using a method with the same form as the predictive model f provides more precise Shapley value estimates.-
For some predictive models, e.g., the linear regression model in Fig. 2, we know that the true conditional model is also a linear model. Thus, using a regression method that can model a linear model (e.g.,
lm
, GAM
, PPR
) produces more accurate Shapley values. However, the form of the true conditional model is usually unknown for most predictive models. -
It is important that the regression method used is flexible enough to estimate/model the predictive model f properly.
-
In the numerical simulation studies, the
separate regression
methods performed relatively better than the other method classes for higher feature dependence. In the real-world experiments, the separate regression
methods were also (among) the best approaches on data sets with moderate dependence. -
In general, conducting hyperparameter tuning of the regression methods improves the precision of the produced explanations, but this increases the computation time.
-
In the simulation studies, a \(\texttt {PPR separate}\) approach with fixed \(L = |{\mathcal {S}}|\) (often) provides fast and accurate Shapley value explanations; see the Supplement.
-
For popular data sets, one can fine-tune an
empirical
, parametric
, or generative
method and let other researchers reuse the method to estimate Shapley values for their own predictive models. -
If a researcher is to explain several predictive models fitted to the same data, then reusing the generated Monte Carlo samples will save computation time.
-
The simplest
separate
and surrogate regression
methods are rapidly trained, while the complex methods are time-consuming. This is, however, a one-time upfront time cost. In return, all regression-based methods produce the Shapley value explanations almost instantly. Thus, developers can develop the predictive model f simultaneously with a suitable regression-based method and deploy them together. The user of f will then get predictions and explanations almost instantaneously. -
In contrast, several of the Monte Carlo-based methods are trained considerably faster than many of the regression-based methods but are, in return, substantially slower at producing the Shapley value explanations. Generating Monte Carlo samples and using them to estimate the Shapley values for new predictions are computationally expensive and cannot be done in the development phase. Thus, the Monte Carlo-based methods cannot produce explanations in real-time.
-
If the predictive model f is computationally expensive to call, then the Monte Carlo-based methods will be extra time-consuming due to \(\mathcal {O}(KN2^M)\) calls to f. Here, K, N, and M are the number of Monte Carlo samples, predictions to explain, and features, respectively. In contrast, the
separate
and surrogate regression
methods make only \(\mathcal {O}(N2^M)\) calls to their fitted regression model(s). -
The regression-based methods can be computationally tractable when the Monte Carlo-based methods are not, for example, when N is large. We can reduce the time by decreasing the number of Monte Carlo samples K, but this results in less stable and accurate Shapley value explanations.
-
If accurate Shapley values are essential, then a suitable
parametric
, generative
, or separate regression
approach with the same form as f yields desirable estimates, depending on the dependence level. The NN-Olsen surrogate
method also provided accurate Shapley values for some real-world data sets. Furthermore, hyperparameter tuning should be conducted for extra accuracy. -
If coarsely estimated Shapley values are acceptable, then some of the simple
separate regression
methods can be trained and produce estimates almost immediately, such as LM separate
. The PPR separate
approach with fixed \(L = |{\mathcal {S}}|\) is often a fair trade-off between time and accuracy, especially for smooth predictive functions.
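The call counts in the bullets above can be made concrete with a small helper (a simplification that ignores constant factors and the cost of a single call):

```python
def n_calls(M, N, K=None):
    """Approximate number of model evaluations needed to explain N
    predictions with M features when all 2^M - 2 non-trivial coalitions
    are used.

    Monte Carlo-based methods call the predictive model f once per
    conditional sample, i.e., K times per coalition; regression-based
    methods call their fitted regression model(s) once per coalition.
    """
    n_coalitions = 2 ** M - 2
    if K is None:                  # regression-based: O(N 2^M)
        return N * n_coalitions
    return K * N * n_coalitions    # Monte Carlo-based: O(K N 2^M)
```

For example, with \(M = 8\) and \(N = 250\), a Monte Carlo-based method with \(K\) conditional samples needs roughly \(K\) times more calls than a regression-based one, which matters most when f itself is expensive to evaluate.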
The best-performing methods, e.g., PPR separate
, performed even better when trained on more training observations. A separate regression
method can train the individual models in parallel, while a surrogate regression
method can cross-validate the model’s hyperparameters on different cores. For data sets with many features, fitting the \(2^M-2\) models of the separate regression
class is infeasible. Then, the surrogate regression
methods and the \(\texttt {VAEAC}\) approach with arbitrary conditioning can be useful. However, their accuracy will likely also decrease with higher dimensions. In high-dimensional settings, one can, e.g., group the features into relevant groups (Jullum et al. 2021) or use approximation strategies to simplify the Shapley value computations, as described in Sect. 2.2.3.
7 Conclusion
In the simulation studies, the overall most accurate approach was the parametric
method with a correctly (or nearly correctly) assumed data distribution. This is intuitive, as making a correct parametric assumption is advantageous throughout statistics. However, the true data distribution is seldom known, e.g., for real-world data sets. In the simulation studies with moderate feature dependence levels, the second-best method class was generally the generative
class with the ctree
and \(\texttt {VAEAC}\) methods, which outperformed the independence
, empirical
, separate regression
, and surrogate regression
methods. For high feature dependence, the separate regression
methods improved relative to the other classes, particularly the PPR separate
method. Using a separate regression
method with the same form as the predictive model proved beneficial. In the real-world experiments, the parametric
methods fell behind the best approaches, except for the simplest data set with Gaussian-like structures. In general, the best approaches in the real-world data set experiments belong to the separate regression
method class and have the same form as the predictive model. However, the NN-Olsen surrogate
method tied the best separate regression
method in one experiment, and the \(\texttt {VAEAC}\) approach was marginally more precise in another experiment. The second-best method class varied for the different data sets, with all method classes, except the independence
and empirical
, taking at least one second place each. In future work, we would like to reduce the computation time of the generative
and surrogate regression
methods, investigate how non-optimal approaches change the estimated conditional Shapley values, and finally evaluate bias in estimated conditional Shapley values for data with known conditional distributions.