
Open Access 07-02-2025 | Original Research Paper

Accelerating the computation of Shapley effects for datasets with many observations

Authors: Giovanni Rabitti, George Tzougas

Published in: European Actuarial Journal

Abstract

Shapley effects are enjoying increasing popularity as importance measures. These indices allocate the variance of the quantity of interest among every risk factor, and a risk factor explaining more variance than another one is more important. Recently, Vallarino et al. (ASTIN Bull J IAA, 2023. https://doi.org/10.1017/asb.2023.34) propose a computational strategy for Shapley effects using the idea of cohorts of similar observations. However, this strategy becomes extremely computationally demanding if the dataset contains many observations. In this work we propose a computational shortcut based on design of experiments and clustering techniques to speed up the computational time. Using the well-known French claim frequency dataset, we demonstrate the huge reduction in computational time, without a significant loss of accuracy in the estimation of the Shapley effects.

1 Introduction

Shapley effects have gained widespread recognition as a variable importance measure, finding applications across various sectors. They are based on the Shapley value from the game theory literature [23], used as an attribution method. Considering a team of players generating a value, the Shapley value assigns to each player their fair contribution. If one associates players with the input variables (or risk factors) of a model and specifies a total value to be distributed among the variables, the Shapley value allocates this quantity. This framework is employed by [8] to identify the most important variables in variable annuity models. In the context of computer experiments, Owen [16] suggests adopting the variance of the quantity of interest as the value to distribute among variables using the Shapley value; the resulting indices are called Shapley effects [25]. Shapley effects are a variable importance measure and enjoy desirable properties. For example, they remain well-defined even in the case of dependent variables, unlike other variance-based importance indices [17, 25]. Plischke et al. [18] introduce a method to compute Shapley effects based on the concept of Moebius inverse, while Rabitti and Borgonovo [19] apply Shapley effects to identify the most important risk factors in annuity models with dependent mortality and interest rate risks. In the dynamic landscape of the insurance industry, Shapley effects can enhance risk management: by distributing the variance among risk factors, they enable ranking and prioritization of risk factors and aid the interpretation of machine learning models for insurance claims.
An open research area on Shapley effects is their statistical estimation. Recently, Mase et al. [11] proposed a strategy to estimate Shapley effects directly from a given sample using the concept of cohorts of similar observations. Vallarino et al. [26] combined this idea of cohorts with the Moebius inverse approach and demonstrated that constructing a rating system using Shapley effects leads to more homogeneous risk classes. However, computing the similarity cohorts requires processing all observations in the dataset, making it computationally demanding for datasets with a large number of observations, which is typical in the context of “big data”. This encompasses structured, unstructured, and semi-structured data commonly encountered in various insurance fields, such as analyzing information from telematics and usage-based insurance for computing insurance premiums, detecting insurance fraud, automatically diagnosing diseases from images, managing crop insurance, monitoring weather conditions, understanding variable annuity products, etc.
In this work, we propose an approach to address this computational issue by selecting “representative” observations from the dataset, rather than considering all of them. To achieve this, we adopt a strategy inspired by [3–5, 7], who employed experimental designs and clustering techniques to identify representative policyholders in extensive portfolios of variable annuities. Specifically, we explore the application of the Latin hypercube design [10, 13], the conditional Latin hypercube design [14] and the hierarchical k-means algorithm [15]. Using the observations selected through these three methods, we compute the Shapley effects employing the algorithm proposed by [26]. This approach is then applied to the well-known French claim frequency dataset, demonstrating a substantial reduction in computational time, with only a limited loss in the accuracy of the Shapley effects estimate.
The subsequent sections of this paper are organized as follows. Section 2 provides an exposition on Shapley effects and their computation. Section 3 presents the Latin hypercube, the conditional Latin hypercube, and the hierarchical k-means algorithm. Section 4 outlines the application of the proposed strategy, highlighting its implementation on the French Motor Third Party Liability (MTPL) dataset. Finally, Sect. 5 concludes the work, summarizing the key findings and implications.

2 Shapley effects and their computation

The Shapley value [23] is an attribution method originating from game theory. Its purpose is to allocate the total value generated by a coalition of players among the individual players according to their contributions.
Let \(\phi _j\) denote the Shapley value of the j-th player. Consider a set of players \(K = \{1,2, \dots ,k\}\) participating in a game. We denote by \(u\subseteq K\) any possible sub-coalition of players and by \(\text {val}:2^{K} \rightarrow \mathbb {R}\) the value function of the game, with \(\text {val}(\emptyset ) = 0\). The value generated by the players in the subset u is \(\text {val}(u)\), and the total value of the game to be allocated is \(\text {val}(K)\).
The Shapley value enjoys the following desirable properties as an attribution method:
Efficiency:
\(\sum _{j=1}^{k}\phi _{j} = \text {val}(K)\).
 
Symmetry:
If \(\text {val}(u \cup \{j_1\}) = \text {val}(u \cup \{j_2\})\) for all \(u \subseteq K\) with \(j_1,j_2 \notin u\), then \(\phi _{j_1}=\phi _{j_2}\).
 
Dummy:
If j is a dummy player, i.e. \(\text {val}(u \cup \{j\}) = \text {val}(u)\) for all \(u \subseteq K\), then \(\phi _{j}=0\).
 
Additivity:
If the games with \(\text {val}\) and \(\text {val}'\) have Shapley values \(\phi \) and \(\phi '\) respectively, then the game with value \(\text {val}(u) + \text {val}'(u)\) has Shapley value \(\phi _{j}+ \phi '_{j}\) for all \(j \in K\).
 
The interpretation of the properties is as follows: Efficiency ensures that the total payoff is distributed among all players. Symmetry states that if two players contribute the same value when joining the same coalition for all coalitions, their Shapley values are equal. The Dummy property requires that a player who contributes nothing to any coalition receives a Shapley value of zero. Additivity implies that if two games are combined, the Shapley value for any player is the sum of the Shapley values that the player would receive in the individual games.
Shapley [23] proves that the unique attribution \(\phi _{j}\) satisfying these four properties is
$$\begin{aligned} \phi _{j} = \sum _{u \subseteq K: j \notin u} \dfrac{(k - |u| -1)!|u|!}{k!}(\text {val}(u \cup \{j\}) - \text {val}(u)), \end{aligned}$$
(1)
where |u| denotes the cardinality of the subset u. The Shapley value is widely used to assess the sensitivity of a quantity of interest Y with respect to the variables \(\textbf{X}=(X_1,X_2,\ldots ,X_k)\). Owen [16] suggests using the Sobol’ index as the value function
$$\begin{aligned} \text {val}(u)=\frac{\mathbb {V}\left[ \mathbb {E}(Y|\textbf{X}_u)\right] }{\mathbb {V}[Y]}, \end{aligned}$$
(2)
where \(\textbf{X}_u\) denotes the sub-vector of \(\textbf{X}\) whose components are indexed by u, \(\mathbb {E}(\cdot )\) denotes the expectation and \(\mathbb {V}(\cdot )\) the variance. The Shapley values obtained with this value function are called Shapley effects [25]. With this choice of value function, Shapley effects have become very popular because they are always non-negative (i.e. \(\phi _j \ge 0\) for any j) and they sum up to unity (i.e. \(\sum _{j=1}^k \phi _j=1\)), because the value function in (2) is normalized by the variance of Y and \(\mathbb {V}[\mathbb {E}(Y|\textbf{X})]=\mathbb {V}[Y]\).1 The Shapley effects allocate the variance of Y among the variables, which each receive a non-negative fraction of it. The Shapley effect \(\phi _j\) is then interpretable as the fraction of the variance that the j-th variable explains. The higher the variance this variable explains, the more important it is. Moreover, Shapley effects are well-defined under any dependence structure among the variables \(\textbf{X}\) [25].
While Shapley effects enjoy the appealing properties mentioned above, their computation is an open problem. Song et al. [25] propose a double-loop Monte Carlo scheme, while Plischke et al. [18] suggest an algorithm based on the Moebius inverse of the value function, reducing the number of evaluations from \(k!\cdot k\) to \(2^k-1\). Denote by \(\text {mob}(u)\) the Moebius inverse of the value function \(\text {val}(u)\), that is \(\text {mob}(u)=\sum _{v\subseteq u}(-1)^{|u \setminus v|} \text {val}(v)\). Equivalently to Eq. (1), Shapley values can be written as:
$$\begin{aligned} \phi _{j}=\sum _{u \subseteq K: j \in u}\dfrac{\text {mob}(u)}{|u|}. \end{aligned}$$
(3)
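To make the two formulas concrete, the following minimal sketch (ours, in Python, and not part of the original implementation) computes Shapley values both directly from Eq. (1) and through the Moebius inverse of Eq. (3) for a small hypothetical value function with made-up numbers; the two routes coincide.

```python
# Minimal sketch: exact Shapley values from a value function via Eq. (1),
# and equivalently via the Moebius inverse route of Eq. (3).
from itertools import combinations
from math import factorial

def subsets(players):
    """All subsets of a collection of players, as frozensets."""
    players = list(players)
    for r in range(len(players) + 1):
        for c in combinations(players, r):
            yield frozenset(c)

def shapley_direct(val, players):
    """Eq. (1): weighted marginal contributions over all coalitions not containing j."""
    k = len(players)
    phi = {}
    for j in players:
        total = 0.0
        for u in subsets(players - {j}):
            w = factorial(k - len(u) - 1) * factorial(len(u)) / factorial(k)
            total += w * (val[u | {j}] - val[u])
        phi[j] = total
    return phi

def shapley_moebius(val, players):
    """Eq. (3): mob(u) = sum_{v subset of u} (-1)^(|u|-|v|) val(v); phi_j = sum_{u: j in u} mob(u)/|u|."""
    mob = {u: sum((-1) ** (len(u) - len(v)) * val[v] for v in subsets(u))
           for u in subsets(players)}
    return {j: sum(mob[u] / len(u) for u in mob if j in u) for j in players}

# Hypothetical normalized value function on k = 3 players (made-up numbers),
# with val(emptyset) = 0 and val(K) = 1, mimicking Eq. (2).
players = frozenset({1, 2, 3})
val = {frozenset(): 0.0,
       frozenset({1}): 0.30, frozenset({2}): 0.15, frozenset({3}): 0.05,
       frozenset({1, 2}): 0.55, frozenset({1, 3}): 0.40, frozenset({2, 3}): 0.25,
       frozenset({1, 2, 3}): 1.00}

print(shapley_direct(val, players))   # approx {1: 0.475, 2: 0.325, 3: 0.2}
print(shapley_moebius(val, players))  # identical; by efficiency the values sum to val(K) = 1
```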
To compute the Shapley effect \(\phi _j\) from given data, Mase et al. [11, 12] introduce the notion of cohorts of similar observations. Assume we have at hand a dataset of observations \(\{(y_i,\textbf{x}_i)\}_{i=1}^n\), where \(y_i\) and \(\textbf{x}_i\) represent the i-th realization of Y and \(\textbf{X}\) respectively. The cohorts are defined as subsets of observations sharing values of variables similar to those of a target observation denoted by t.
Let \(x_{ij}\) be the value of the j-th variable for the i-th observation (with \(j=1, \dots ,k\) and \(i=1, \dots , n\)), and \(x_{tj}\) be the target value for the selected variable j. A similarity function \(z_{tj}\) is constructed for each variable j; for continuous variables, similarity is defined using a measure of relative distance. A possible similarity function is given by:
$$\begin{aligned} z_{tj}(x_{ij}) = {\left\{ \begin{array}{ll} 1 & \quad \text {if } |x_{ij} - x_{tj}| \le \delta _{j}, \\ 0 & \quad \text {otherwise}, \end{array}\right. } \end{aligned}$$
(4)
where \(\delta _{j}\) is the specified distance. The cohort of t with respect to variables in u is defined as:
$$\begin{aligned} C_{t,u}=\{i \in 1:n \ | \ z_{tj}(x_{ij})=1, \ \forall j \in u\}. \end{aligned}$$
This set contains the observations similar to the target observation t for all variables \(j \in u\). By construction, \(C_{t,u}\) is never empty since it always includes the target observation \(\textbf{x}_t\) itself.
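As an illustration, the short sketch below (with notation of our own: a numeric matrix x of n observations and k variables, and a vector delta of tolerances \(\delta _j\)) builds the cohort \(C_{t,u}\) of a target observation t using the similarity function in Eq. (4).

```python
# Minimal sketch of Eq. (4) and of the cohort C_{t,u}.
import numpy as np

def cohort(x, t, u, delta):
    """Indices i such that |x[i, j] - x[t, j]| <= delta[j] for every j in u."""
    mask = np.ones(x.shape[0], dtype=bool)
    for j in u:
        mask &= np.abs(x[:, j] - x[t, j]) <= delta[j]
    return np.where(mask)[0]

# Toy illustration with made-up data: 6 observations, 2 variables.
x = np.array([[1.0, 10.0], [1.2, 30.0], [0.9, 11.0],
              [5.0, 12.0], [1.1, 29.0], [4.8, 10.5]])
delta = np.array([0.5, 2.0])        # delta_j = 0 would enforce exact matches (discrete case)
print(cohort(x, t=0, u=[0, 1]))     # cohort of target t = 0 w.r.t. both variables (here [0 2])
```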
Using cohorts, Mase et al. [11] estimate the conditional expectation in (2) by considering cohort means:
$$\begin{aligned} \frac{1}{|C_{t,u}|} \sum _{i \in C_{t,u}}y_i = \bar{y}_{t,u}, \end{aligned}$$
where \(|C_{t,u}|\) is the cardinality of the cohort. The cohort average \(\bar{y}_{t,u}\) is an estimate of the conditional mean \(\mathbb {E}[Y \mid \textbf{X}_{u}=\textbf{x}_{t,u}]\) at the target observation. With each observation having probability \(n^{-1}\), an estimate of the value function is obtained:
$$\begin{aligned} \widehat{\text {val}}(u)= \frac{1}{n} \frac{\sum _{t=1}^{n}{( \bar{y}_{t,u} - \bar{y})}^{2}}{\hat{\sigma }^2}, \end{aligned}$$
where \(\bar{y}\) and \(\hat{\sigma }^2\) are estimates of the mean and the variance of Y, respectively. Shapley effects are then obtained using Eq. (3). This algorithm is implemented in the Matlab function highercohortshapmob.m available at https://github.com/giovanni-rabitti/ratingsystemshapley.
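Putting the pieces together, the following sketch is a minimal, unoptimized Python analogue of the estimator described above (it is not the Matlab function highercohortshapmob.m, and the array names are ours); it estimates \(\widehat{\text {val}}(u)\) for a single subset u from arrays y, x and the tolerances delta.

```python
# Minimal sketch of the cohort-based estimate of val(u).
import numpy as np

def val_hat(y, x, u, delta):
    """val_hat(u) = (1/n) * sum_t (ybar_{t,u} - ybar)^2 / sigma2_hat."""
    n = len(y)
    ybar, sigma2 = y.mean(), y.var()
    num = 0.0
    for t in range(n):
        mask = np.all(np.abs(x[:, u] - x[t, u]) <= delta[u], axis=1)  # cohort C_{t,u}
        num += (y[mask].mean() - ybar) ** 2                           # cohort mean ybar_{t,u}
    return num / (n * sigma2)

# Toy illustration with synthetic data: y depends mainly on the first variable.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = 2.0 * x[:, 0] + 0.1 * x[:, 1] + rng.normal(scale=0.1, size=200)
delta = 0.1 * (np.percentile(x, 95, axis=0) - np.percentile(x, 5, axis=0))
print(val_hat(y, x, np.array([0]), delta))  # close to the share of variance explained by X_1
```

Looping such a function over all non-empty subsets u and applying Eq. (3) then yields the estimated Shapley effects.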
To determine the complexity of this algorithm, note that for each subset u the algorithm loops over the n target observations, and computing each cohort mean requires a pass over all n observations; this gives a complexity of \(O(n^2)\) per subset. Given that the outer loop runs \(2^k\) times (i.e. all possible coalition values are evaluated), the overall algorithmic complexity becomes \(O(2^k \cdot n^2)\). The computational complexity of the algorithm therefore exhibits a quadratic growth with respect to \( n \) (the number of observations) and an exponential growth with respect to \( k \) (the number of variables). When a reduced sample of size \( n_0 \) is used instead of the full sample of size \( n \), the theoretical factor by which the computational complexity is reduced can be expressed as the ratio:
$$\begin{aligned} \text {Reduction Factor} = \frac{O(2^k \cdot n^2)}{O(2^k \cdot n_0^2)} = \left( \frac{n}{n_0} \right) ^2. \end{aligned}$$
This ratio indicates the reduction in computational complexity when the number of observations is decreased from \( n \) to \( n_0 \). Hence, to reduce the complexity, we want to reduce the number of observations n without significantly sacrificing the accuracy of the estimated Sobol’ indices in Eq. (2). This is possible by selecting representative observations from the sample.
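For a concrete order of magnitude, using the sample sizes considered later in Sect. 4, reducing the dataset from \(n = 68{,}120\) observations to \(n_0 = 2500\) representative ones gives a theoretical reduction factor of
$$\begin{aligned} \left( \frac{68{,}120}{2500} \right) ^2 \approx 742, \end{aligned}$$
before accounting for the cost of selecting the representative observations.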

3 Speeding-up strategies based on experimental designs and clustering

To select the \(n_0\) representative observations, we follow the presentation of [4, 6]. The codes are implemented in \(\texttt {R}\), are described in [6] and are publicly available at https://www2.math.uconn.edu/~gan/software.html. All computations were conducted using a standard laptop equipped with 8 GB of RAM.

3.1 Latin hypercube sampling

The Latin hypercube (LH) sampling is a statistical design proposed by [13] for sampling points from a multidimensional space. The design points are sampled using a Latin square, a grid where only one point is sampled in each row and column, ensuring maximal coverage of the range of every variable. This sampling design is more space-filling than Monte Carlo sampling, as illustrated in Fig. 1, which contrasts 5 design points sampled with the two approaches.
Since the number of possible Latin hypercubes grows exponentially as the dimension k increases, it is impossible to compute them all to find the best Latin hypercube, i.e. the one having the largest minimum distance between its points. Gan [3] proposes the following procedure to select a “good” Latin hypercube with \(n_0\) points:
1.
divide the range of each continuous variable into \(n_0\) equal intervals. For each continuous variable, randomly sample points within these intervals. For categorical variables, randomly select samples from the available categories. This process results in a total of \(n_0\) points;
 
2.
calculate the distance between each pair of sampled points in the Latin hypercube. Distances are computed separately for continuous and categorical variables and then summed up;
 
3.
evaluate the quality of the sample by calculating the minimum distance between pairs of points. This score is used to select the best sample among those generated;
 
4.
repeat the steps above many times (we used 50 repetitions, as in [6]). The sample with the highest score (i.e., the largest minimum distance) is selected as the best sample;
 
5.
after generating a Latin hypercube sample, an optimization procedure is applied to find the policyholders in the dataset that are closest to these design points, ensuring that the sample accurately reflects the features of the data. This matching step is necessary because the Latin hypercube sample itself does not directly correspond to actual observations in the dataset.
 
Gan et al. [4, 6] explore the use of this sampling design to identify representative policyholders in a large dataset of variable annuities. For more technical details, we refer to [6].
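The sketch below is a simplified Python analogue of this procedure for continuous variables only (the function names and the use of scipy and scikit-learn are our own choices, not the implementation of [6]): several Latin hypercube samples are drawn in the unit hypercube, the one with the largest minimum pairwise distance is kept, and each design point is then matched to its nearest observation in the data.

```python
# Minimal sketch of LH-based selection of representative observations.
import numpy as np
from scipy.stats import qmc
from scipy.spatial.distance import pdist
from sklearn.neighbors import NearestNeighbors

def select_lh_representatives(x, n0, n_repeats=50, seed=0):
    rng = np.random.default_rng(seed)
    # scale observations to the unit hypercube
    lo, hi = x.min(axis=0), x.max(axis=0)
    x01 = (x - lo) / np.where(hi > lo, hi - lo, 1.0)
    # steps 1-4: keep the best of n_repeats Latin hypercubes by the maximin criterion
    best, best_score = None, -np.inf
    for _ in range(n_repeats):
        sample = qmc.LatinHypercube(d=x.shape[1], seed=rng).random(n=n0)
        score = pdist(sample).min()
        if score > best_score:
            best, best_score = sample, score
    # step 5: match each design point to the closest observation in the data
    nn = NearestNeighbors(n_neighbors=1).fit(x01)
    idx = nn.kneighbors(best, return_distance=False).ravel()
    return np.unique(idx)  # indices of the representative observations

# toy usage with synthetic data
x = np.random.default_rng(1).normal(size=(5000, 4))
print(select_lh_representatives(x, n0=20))
```

Since several design points may be matched to the same observation, the number of distinct representatives can be slightly smaller than \(n_0\).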

3.2 Conditional Latin hypercube sampling

The conditional Latin hypercube (CLH) sampling differs from the (unconditional) Latin hypercube sampling by directly selecting a sub-sample from a dataset [14]. This sub-sample is chosen to construct a Latin hypercube, ensuring that the sample is representative of the overall data structure while maintaining the conditional relationships between variables. To implement conditional Latin hypercube sampling, Gan and Valdez [4] adopt the clhs function from the R package clhs. The clhs function provides a set of indices for representative observations, eliminating the need to find the observations closest to the design points. The clhs function works as follows:
1.
divide the range of each continuous variable using quantiles to create \(n_0\) intervals (called strata), each containing an equal number of observations;
 
2.
select \(n_0\) random observations from the original dataset of \(n\) observations;
 
3.
apply a probabilistic search algorithm (simulated annealing) that iteratively improves the sample by replacing or swapping points with other points from the original dataset. This is done by optimizing an objective function consisting of three components:
  • one component used to match the empirical distribution of the continuous variables;
  • one component used to match the empirical distribution of the categorical variables;
  • one component used to match the observed correlations among the continuous variables.
 
In step 1, each continuous variable is divided into strata, with each stratum containing \( n/ n_0 \) observations. In step 3, the final sample is required to form a Latin hypercube, meaning that each stratum should contain exactly one sampled observation. To achieve this, the absolute deviations of the marginal stratum sample sizes from one are penalized: a stratum containing more than one sampled observation is oversampled, and a stratum containing none is undersampled. By construction, the selected sample preserves the empirical distribution as well as the dependencies observed in the data. For more details one can refer to [14].
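To illustrate the objective being minimized, the sketch below (a simplified stand-in of our own; the clhs R package also implements the simulated annealing search and may weight or define the components slightly differently) evaluates the three components for a candidate subsample idx of size \(n_0\).

```python
# Minimal sketch of the three components of the CLH objective for a candidate subsample.
import numpy as np

def clhs_objective(x_cont, x_cat, idx):
    n0 = len(idx)
    # component 1: each of the n0 quantile strata of every continuous variable
    # should contain exactly one sampled observation
    edges = np.quantile(x_cont, np.linspace(0, 1, n0 + 1), axis=0)
    o1 = 0.0
    for j in range(x_cont.shape[1]):
        counts, _ = np.histogram(x_cont[idx, j], bins=edges[:, j])
        o1 += np.abs(counts - 1).sum()
    # component 2: match the empirical class proportions of the categorical variables
    o2 = 0.0
    for j in range(x_cat.shape[1]):
        cats = np.unique(x_cat[:, j])
        p_full = np.array([(x_cat[:, j] == c).mean() for c in cats])
        p_sub = np.array([(x_cat[idx, j] == c).mean() for c in cats])
        o2 += np.abs(p_sub - p_full).sum()
    # component 3: match the correlation matrix of the continuous variables
    o3 = np.abs(np.corrcoef(x_cont.T) - np.corrcoef(x_cont[idx].T)).sum()
    return o1 + o2 + o3

# toy usage: evaluate a random candidate subsample on synthetic data
rng = np.random.default_rng(0)
x_cont = rng.normal(size=(1000, 3))
x_cat = rng.integers(0, 4, size=(1000, 2))
idx = rng.choice(1000, size=50, replace=False)
print(clhs_objective(x_cont, x_cat, idx))
```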

3.3 Hierarchical k-means

We also consider a sampling strategy different from LH and CLH, which are both based on statistical experimental designs. This approach is based on data clustering, a statistical technique for dividing a dataset into homogeneous groups, known as clusters, in such a way that observations in the same cluster are more similar to each other than to those in other clusters. Hierarchical k-means is a modification of the classical k-means clustering algorithm.
To divide the dataset into \(n_0\) clusters, the k-means algorithm operates through an iterative process consisting of the following steps: First, \(n_0\) initial centroids are randomly chosen. In the assignment step, each data point \(\textbf{x}_i\), \(i=1,\ldots ,n\), is assigned to the nearest centroid, forming \(n_0\) clusters based on the Euclidean distance. In the update step, the centroids are recalculated as the mean of all points assigned to each cluster. These two steps are repeated until convergence, meaning the centroids no longer move significantly, or a maximum number of iterations is reached.
The k-means algorithm becomes inefficient when applied to find a large number \(n_0\) of clusters. Nister and Stewenius [15] propose hierarchical k-means to address this scalability issue. Their algorithm is a sequential application of the k-means algorithm to split clusters into two until the required number of clusters is reached. Specifically:
1.
the first step of hierarchical k-means is to use k-means to divide the dataset into two clusters;
 
2.
the k-means algorithm is applied to divide the largest cluster into two;
 
3.
step 2 is repeated until \(n_0\) clusters are obtained.
 
Gan and Valdez [7] show with numerical experiments that the hierarchical k-means algorithm is superior to other clustering algorithms for selecting representative contracts in a large portfolio of variable annuities.
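A minimal Python sketch of this splitting scheme is given below; the implementation choices are ours (standard k-means from scikit-learn for each binary split, and the observation closest to each cluster mean taken as the representative of that cluster).

```python
# Minimal sketch of hierarchical k-means: repeatedly split the largest cluster
# in two until n0 clusters are obtained, then pick one representative per cluster.
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(x, n0, seed=0):
    clusters = [np.arange(len(x))]                 # start with one cluster containing all rows
    while len(clusters) < n0:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(x[idx])
        clusters.extend([idx[labels == 0], idx[labels == 1]])
    # representative observation: the point nearest to each cluster mean
    reps = []
    for idx in clusters:
        centroid = x[idx].mean(axis=0)
        reps.append(idx[np.argmin(np.linalg.norm(x[idx] - centroid, axis=1))])
    return np.array(reps)

# toy usage with synthetic data
x = np.random.default_rng(2).normal(size=(5000, 4))
print(hierarchical_kmeans(x, n0=10))
```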

4 Analysis of the French MTPL dataset

We apply the algorithms to the French MTPL dataset, which comprises 678,013 insurance policies. For every policyholder, this dataset includes the number of claims, the policy exposure and 9 explanatory variables: VehPower (the power of the car), VehAge (the vehicle age), DrivAge (the driver’s age), BonusMalus (the bonus/malus rating of the policyholder), VehBrand (the brand of the car), VehGas (the car’s fuel type), Area (the density rating of the community in which the car driver lives), Density (the density of inhabitants of the city in which the car driver lives), Region (the policy region in France). A detailed exploratory data analysis of this dataset is available in the tutorials by [22, 2] for the Swiss Association of Actuaries (SAA) and in the book by [27]. Because the dataset is imbalanced, containing many more policies without claims, we rebalanced it by randomly subsampling the observations without claims; this was undertaken for illustrative purposes2 and yields a balanced dataset of 68,120 observations, which we consider as our benchmark data. This operation is a common practice in data analysis [28].
In our numerical illustration, the negative binomial regression model (see, for instance, [9]), which can accommodate overdispersion in the data, is fitted on the number of claims using all nine explanatory variables. The Shapley methodology is then used to assess the contribution of each covariate in the negative binomial regression, that is, its impact on the predicted number of claims as measured by the explained variance. To compute the Shapley effects, we follow the suggestion of Mase et al. [11] and set \(\delta _j\) to one-tenth of the range between the 95th and 5th percentiles for continuous variables, and \(\delta _{j}=0\) for discrete variables. The computational times are presented in Table 1.
Table 1
Computational times for estimating the Shapley effects

Sample size    Fraction of selected observations (%)    LH time      CLH time    HK time     Shapley time
n = 500        0.73                                     106.19       0.81        23          3.88
n = 1500       2.2                                      216.75       1.71        183.93      25.1
n = 2500       3.67                                     351.49       2.89        533.12      61.23
n = 68,120     100                                      –            –           –           65,760.46
The sample size denotes the number of selected observations from the 68,120 observations of the dataset. The LH time, CLH time and HK time denote the time in seconds required to find the representative observations of the specified dimension using the Latin hypercube, the conditional Latin hypercube and the hierarchical k-means, respectively. The Shapley time is the time in seconds required by the algorithm of [26] to estimate the Shapley effects from the given sample identified by LH, CLH or HK (the three times being identical). At \(n=68{,}120\) no design is applied because the whole sample is used.
Table 1 reveals that the selection of representative observations based on the conditional Latin hypercube is significantly faster than the one based on the Latin hypercube (for instance, 2.89 s versus 351.49 s at \(n=2500\)). The computational time for the Shapley effects is the duration required by the algorithm of [26] for a given sample size, and it increases non-linearly with the sample size. In particular, estimating the Shapley effects with the full dataset requires more than 18 h (65,760.46 s), while the computation with \(n=2500\) selected observations takes approximately one minute in total (2.89 s to find the representative observations using the CLH plus 61.23 s to estimate the Shapley effects from those observations). The selection of representative observations using the Latin hypercube and hierarchical k-means is significantly slower compared to the conditional Latin hypercube method. Interestingly, as the sample size increases, the time required to find the representative policies with these approaches exceeds the time needed to compute the Shapley effects for that sample.
The estimated Shapley effects are illustrated in Fig. 2.
In particular, Fig. 2 presents the Shapley effects estimated with varying numbers of observations selected using the Latin hypercube, the conditional Latin hypercube and the hierarchical k-means. Several observations can be made. Firstly, BonusMalus is identified as the most important variable, as also found in Richman et al. [20, 21]. Secondly, the Shapley effects computed with a small sample size (\(n=500\)) using representative observations effectively distinguish between relevant and irrelevant risk factors, with Density and VehGas showing small Shapley effects across all sample dimensions (with the exception of the Shapley effect estimated using hierarchical k-means, which is slightly biased). In general, accuracy tends to increase with larger sample sizes. To confirm this, for each sample size and sampling strategy we considered the sum over the variables of the mean squared errors of the estimated Shapley effects, \( \frac{1}{20}\sum _{j=1}^k\sum _{m=1}^{20}(\widehat{\phi }_{j,m}-\phi _j)^2,\) where \(\widehat{\phi }_{j,m}\) is the Shapley effect of the \(j\)-th variable computed in the \(m\)-th replication, with \(m=1,\ldots ,20\), using the specified sampling method and sample size, and \(\phi _j\) is the Shapley effect of the \(j\)-th variable computed using the entire dataset. In other words, for each sampling method and sample size, the squared errors of the k estimated Shapley effects are summed within each of the 20 replications and then averaged over the replications. Figure 3 shows the results.
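For concreteness, the accuracy metric can be computed as in the following short sketch (the array names are ours): phi_full contains the k Shapley effects estimated from the full dataset, and phi_reps the estimates from the 20 replications of a given sampling method and sample size.

```python
# Minimal sketch of the accuracy metric described above.
import numpy as np

def sum_of_mse(phi_reps, phi_full):
    """(1/20) * sum_j sum_m (phi_hat_{j,m} - phi_j)^2 for phi_reps of shape (20, k)."""
    return np.mean(np.sum((phi_reps - phi_full) ** 2, axis=1))

# toy usage with made-up numbers (k = 3 effects, 20 replications)
rng = np.random.default_rng(3)
phi_full = np.array([0.5, 0.3, 0.2])
phi_reps = phi_full + rng.normal(scale=0.02, size=(20, 3))
print(sum_of_mse(phi_reps, phi_full))
```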
Figure 3 shows that at \(n=500\) the error for HK is higher than for the LH and CLH methods. This difference decreases as n increases, and all three methods exhibit comparable errors at \(n=2500\). Interestingly, we also observe that the LH method shows very small oscillations in the sum of mean squared errors across sample sizes. We believe this is a specific feature of this dataset, since in general the accuracy is expected to increase as the sample size increases.
Overall, taking into account the trade-off between the small loss of accuracy and the significant reduction in computational time, the CLH sampling strategy seems to be the optimal choice in this application.

5 Conclusions

In this work, we proposed a strategy to speed up the computation of Shapley effects, especially when dealing with a large number of observations, a common scenario in various insurance settings involving structured, semi-structured, and unstructured data. We selected representative observations using the Latin hypercube, conditional Latin hypercube, and hierarchical k-means methods, following a methodology similar to that of Gan and Valdez [4] applied to a large portfolio of variable annuities. Subsequently, we computed the Shapley effects using the selected sample.
Applying this strategy to the well-known French MTPL dataset, we observed a notable reduction in computational time without a significant loss of accuracy. Moreover, we emphasize that in our work we assumed a negative binomial predictive model linking the variables \(\textbf{x}\) to the output y (the claim number standardized by policy exposure). Nonetheless, the proposed strategy can be applied to identify the most crucial variables for any model by substituting the observed values \(\{y_i\}_{i=1}^n\) with the predicted values \(\{\hat{y}_i\}_{i=1}^n\) obtained using a specified machine learning model.
In our opinion, our proposed strategy holds promise for a big-data framework, and new research in this direction could be pursued. In particular, when the dataset has a large number of variables (i.e., more than 20), it becomes impractical to estimate all possible coalition values, making it necessary to use stochastic methods to approximate the Shapley value (see [1, 24]). These methods reduce the number of \(2^k\) coalition values to be computed by sampling only a subset of coalitions for which the value is then estimated. We believe that combining our proposed approach with stochastic approximation methods could represent a promising direction for future research, especially when dealing with datasets containing many observations and covariates.
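As an illustration of this idea, the sketch below implements a simple permutation-sampling approximation in the spirit of [1] (the value function and all names are ours): marginal contributions are averaged over randomly sampled orderings of the players instead of enumerating all \(2^k\) coalitions.

```python
# Minimal sketch of a sampling-based Shapley approximation: average marginal
# contributions over randomly sampled permutations of the players.
import numpy as np

def shapley_by_permutation_sampling(val, k, n_perms=1000, seed=0):
    rng = np.random.default_rng(seed)
    phi = np.zeros(k)
    for _ in range(n_perms):
        perm = rng.permutation(k)
        coalition, prev = frozenset(), val(frozenset())
        for j in perm:
            coalition = coalition | {j}
            current = val(coalition)
            phi[j] += current - prev          # marginal contribution of j in this ordering
            prev = current
    return phi / n_perms

# toy usage: a hypothetical additive value function on k = 3 players
weights = np.array([0.5, 0.3, 0.2])
val = lambda u: float(sum(weights[list(u)]))
print(shapley_by_permutation_sampling(val, k=3))   # close to (0.5, 0.3, 0.2)
```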
Finally, we note that while our strategy accelerates computation, it introduces an error as it is an approximation. Therefore, deriving theoretical error bounds for the Shapley effects under different sampling designs is an open issue for future research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnotes
1
This is true as long as we consider all possible variables \((X_1,\ldots ,X_k)\) which can make the output Y vary. If Y depends on other variables—for instance, a stochastic term—then the variance explained by the covariates does not generally coincide with \(\mathbb {V}[Y]\).
 
2
We emphasize that this rebalancing was done solely for illustrative purposes; our analysis remains unaffected, and the resulting benchmark dataset serves its illustrative role without compromising the integrity of our analysis.
 
Literature
1. Castro J, Gómez D, Tejada J (2009) Polynomial calculation of the Shapley value based on sampling. Comput Oper Res 36(5):1726–1730
2. Ferrario A, Noll A, Wüthrich MV (2020) Insights from inside neural networks. Available at SSRN 3226852
3. Gan G, Lin XS (2015) Valuation of large variable annuity portfolios under nested simulation: a functional data approach. Insur Math Econ 62:138–150
4. Gan G, Valdez EA (2017) Valuation of large variable annuity portfolios: Monte Carlo simulation and synthetic datasets. Depend Model 5(1):354–374
5.
6. Gan G, Valdez EA (2019) Metamodeling for variable annuities. Chapman & Hall/CRC Press, London
8. Godin F, Hamel E, Gaillardetz P, Ng EH-M (2023) Risk allocation through Shapley decompositions, with applications to variable annuities. ASTIN Bull J IAA 53(2):311–331
9.
10. Iman RL, Conover WJ (1980) Small sample sensitivity analysis techniques for computer models. With an application to risk assessment. Commun Stat Theory Methods 9(17):1749–1842
11.
13. McKay MD, Beckman RJ, Conover WJ (1979) A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2):239–245
14. Minasny B, McBratney AB (2006) A conditioned Latin hypercube method for sampling in the presence of ancillary information. Comput Geosci 32(9):1378–1388
15. Nister D, Stewenius H (2006) Scalable recognition with a vocabulary tree. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06), vol 2, pp 2161–2168
17. Owen AB, Prieur C (2017) On Shapley value for measuring importance of dependent inputs. SIAM/ASA J Uncertain Quantif 5(1):986–1002
18. Plischke E, Rabitti G, Borgonovo E (2021) Computing Shapley effects for sensitivity analysis. SIAM/ASA J Uncertain Quantif 9(4):1411–1437
19. Rabitti G, Borgonovo E (2020) Is mortality or interest rate the most important risk in annuity models? A comparison of sensitivity analysis methods. Insur Math Econ 95:48–58
22. Schelldorfer J, Wüthrich MV (2019) Nesting classical actuarial models into neural networks. Available at SSRN 3320525
23. Shapley LS (1953) A value for n-person games. In: Kuhn HW, Tucker AW (eds) Contributions to the theory of games. Princeton University Press, Princeton, pp 307–317
24. Simon G, Vincent T (2020) A projected stochastic gradient algorithm for estimating Shapley value applied in attribute importance. In: Holzinger A, Kieseberg P, Tjoa AM, Weippl E (eds) Machine learning and knowledge extraction. Springer, Cham, pp 97–115
25. Song E, Nelson BL, Staum J (2016) Shapley effects for global sensitivity analysis: theory and computation. SIAM/ASA J Uncertain Quantif 4(1):1060–1083
27. Wüthrich MV, Merz M (2023) Statistical foundations of actuarial learning and its applications. Springer, London
28. Xue J-H, Hall P (2015) Why does rebalancing class-unbalanced data improve AUC for linear discriminant analysis? IEEE Trans Pattern Anal Mach Intell 37(5):1109–1112
Metadata
Title
Accelerating the computation of Shapley effects for datasets with many observations
Authors
Giovanni Rabitti
George Tzougas
Publication date
07-02-2025
Publisher
Springer Berlin Heidelberg
Published in
European Actuarial Journal
Print ISSN: 2190-9733
Electronic ISSN: 2190-9741
DOI
https://doi.org/10.1007/s13385-025-00412-z