1 Introduction
Training the Credibility Transformer benefits from the NormFormer proposal of Shleifer et al. [34], which applies a special Transformer pre-training that can cope with different gradient magnitudes in stochastic gradient descent training. We verify that this proposal of Shleifer et al. [34] is beneficial in our Credibility Transformer architecture, resulting in superior performance compared to plain-vanilla trained architectures. Building on this initial exploration, we then augment the Credibility Transformer with several advances from the LLM and deep learning literature, producing a deep and multi-head attention version of the initial Credibility Transformer architecture. Furthermore, we implement the concept of Gated Linear Units (GLU) of Dauphin et al. [7] to select the more important covariate components, and we exploit the Piecewise Linear Encoding (PLE) of Gorishniy et al. [14], which should be thought of as a more informative embedding for numerical covariates than its one-hot encoded counterpart. It is more informative because PLE preserves the ordinal relation of continuous covariates, while enabling subsequent network layers to produce a multidimensional embedding of the numerical data. This improved deep Credibility Transformer equips us with regression models that have excellent predictive performance. Remarkably, we show that the Credibility Transformer approach can improve a state-of-the-art deep Transformer model applied to a non-life pricing problem. Finally, we examine the explainability of the Credibility Transformer approach by exploring a fitted model.
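To illustrate the PLE idea, here is a minimal sketch in Python/NumPy; the covariate name and bin boundaries are purely illustrative assumptions and not the encoding used later in the paper.

```python
import numpy as np

def piecewise_linear_encoding(x, boundaries):
    """Piecewise Linear Encoding (PLE) of a scalar covariate x.

    boundaries: increasing bin edges b_0 < b_1 < ... < b_K defining K bins.
    Entry t of the returned vector is 1 if x lies above bin t, 0 if x lies
    below it, and the linear interpolation weight if x falls inside bin t,
    so the ordinal relation of x is preserved.
    """
    b = np.asarray(boundaries, dtype=float)
    enc = np.zeros(len(b) - 1)
    for t in range(len(b) - 1):
        if x >= b[t + 1]:
            enc[t] = 1.0
        elif x > b[t]:
            enc[t] = (x - b[t]) / (b[t + 1] - b[t])
    return enc

# illustrative example: an age of 33 with hypothetical bin edges
print(piecewise_linear_encoding(33.0, [18, 25, 35, 50, 70, 100]))
# entries: 1.0, 0.8, 0.0, 0.0, 0.0
```

In contrast to one-hot encoding, neighbouring covariate values receive similar encodings, and a subsequent dense layer can turn this vector into a learned multidimensional embedding.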
2 The credibility transformer

2.1 Construction of the input tensor
2.1.1 Feature tokenizer
2.1.2 Positional encoding
2.1.3 Classify (CLS) token
| Module | Variable/layer | # Weights |
|---|---|---|
| Feature tokenizer (raw input tensor) | \({\varvec{x}}_{1:9}^\circ \) | 405 |
| Positional encoding | \({\varvec{e}}^{\textrm{pos}}_{1:9}\) | 45 |
| CLS token | \({\varvec{c}}\) | 10 |
| Time-distributed normalization layer | \({\varvec{z}}^{\textrm{norm}}\) | 20 |
| Credibility Transformer | \({\varvec{c}}^{\textrm{cred}}\) | 1,073 |
| FNN decoder | \({\varvec{z}}^{(2:1)}\) | 193 |
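The module sizes above sum to the 1,746 network weights reported for the Credibility Transformer in Table 2 below. One plausible reading of these counts is that each of the 9 covariates is mapped to a 5-dimensional feature token, extended by a 5-dimensional learned positional encoding (9 × 5 = 45 weights), with a learned 10-dimensional CLS token appended as the (T+1)-st token. The NumPy sketch below illustrates this reading; the concatenation (rather than addition) of the positional encodings and the exact dimensions are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_feat, d_pos = 9, 5, 5            # 9 covariate tokens; 9 x 5 = 45 positional weights

# stand-ins for learned parameters of the input tokenizer
pos_enc = rng.normal(size=(T, d_pos))             # positional encodings e^pos_{1:9}
cls_token = rng.normal(size=(1, d_feat + d_pos))  # CLS token c, dimension 10

def build_input_tensor(feature_tokens):
    """feature_tokens: (batch, T, d_feat) output of the feature tokenizer."""
    batch = feature_tokens.shape[0]
    pos = np.broadcast_to(pos_enc, (batch, T, d_pos))
    tokens = np.concatenate([feature_tokens, pos], axis=-1)       # (batch, T, 10)
    cls = np.broadcast_to(cls_token, (batch, 1, d_feat + d_pos))
    return np.concatenate([tokens, cls], axis=1)                  # (batch, T+1, 10)

x = build_input_tensor(rng.normal(size=(4, T, d_feat)))
print(x.shape)   # (4, 10, 10): 9 covariate tokens plus the CLS token at position T+1
```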
2.2 Credibility transformer layer
2.2.1 Normalization layer
2.2.2 Transformer architecture
2.2.3 Credibility mechanism
2.2.4 Rationale of credibility weighted CLS token
2.3 Decoding of the tokenized information
3 Real data example
3.1 Description of data
3.2 Input tokenizer
3.3 Description of the selected credibility transformer architecture
3.4 Gradient descent network fitting: 1st version
3.4.1 Plain-vanilla gradient descent fitting
For gradient descent fitting we use the nadam version of stochastic gradient descent (SGD) with its pre-selected parametrization implemented in the keras package [6]. For SGD we use a batch size of \(2^{10}=1,024\) instances, and since neural networks are prone to overfitting, we exploit an early stopping strategy by partitioning the learning data at random into a training dataset and a validation dataset at a ratio of 9:1. We use the standard callback of keras [6] with its pre-selected parametrization to retrieve the learned weights with the smallest validation loss. Optimal network weights were found in less than 200 SGD epochs.
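As a minimal sketch of this fitting setup in Python with Keras: the data, the placeholder network and the patience of the callback below are illustrative stand-ins, not the paper's exact choices.

```python
import numpy as np
import keras
from keras import layers

# toy stand-in data; in the paper these are the tokenized MTPL covariates and claim counts
rng = np.random.default_rng(0)
X_learn = rng.normal(size=(10_000, 40)).astype("float32")
y_learn = rng.poisson(lam=0.1, size=(10_000, 1)).astype("float32")

# 9:1 random split of the learning data into training and validation sets
n_val = len(X_learn) // 10
X_train, X_val = X_learn[n_val:], X_learn[:n_val]
y_train, y_val = y_learn[n_val:], y_learn[:n_val]

# placeholder network; the paper uses the Credibility Transformer instead
model = keras.Sequential([
    layers.Dense(20, activation="tanh"),
    layers.Dense(1, activation="exponential"),   # positive frequency output
])
# Keras's poisson loss shares its minimizer with the Poisson deviance loss
model.compile(optimizer="nadam", loss="poisson")

# callback retrieving the weights with the smallest validation loss
cb = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,  # patience is an illustrative choice
                                   restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=2**10,          # 1,024 instances per SGD batch
          epochs=200,                # optimal weights found in fewer than 200 epochs
          callbacks=[cb], verbose=0)
```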
Table 2: In-sample and out-of-sample average Poisson deviance losses (in units of \(10^{-2}\))

| Model | # Param | In-sample Poisson loss | Out-of-sample Poisson loss |
|---|---|---|---|
| Poisson null | 1 | 25.213 | 25.445 |
| Poisson GLM3 | 50 | 24.084 | 24.102 |
| Poisson plain-vanilla FNN | 792 | 23.728 (± 0.026) | 23.819 (± 0.017) |
| Ensemble Poisson plain-vanilla FNN | 792 | 23.691 | 23.783 |
| Credibility Transformer: nadam | 1,746 | 23.648 (± 0.071) | 23.796 (± 0.037) |
| Ensemble Credibility Transformer: nadam | 1,746 | 23.565 | 23.717 |
| Credibility Transformer: NormFormer | 1,746 | 23.641 (± 0.053) | 23.788 (± 0.040) |
| Ensemble Credibility Transformer: NormFormer | 1,746 | 23.562 | 23.711 |
3.4.2 Results of 1st fitting approach
The upper part of Table 2 reports the results of this 1st fitting approach (nadam). These results of the Credibility Transformer are benchmarked against the ones in Wüthrich–Merz [38, Tables 7.4–7.7 and 7.9] and Brauer [3, Tables 2 and 4]. We conclude that the Credibility Transformer clearly outperforms any of these other proposals out-of-sample; even the ensembled plain-vanilla FNN model is only slightly better than a single run of the Credibility Transformer. By building the ensemble predictor of the Credibility Transformers we receive a significantly better model, providing an out-of-sample average Poisson deviance loss of \(23.717 \cdot 10^{-2}\), see Table 2. The best ensemble predictor in Brauer [3, Table 4], called \(\hbox {CAFFT}_{\textrm{def}}\), has an average Poisson deviance loss of \(23.726 \cdot 10^{-2}\).
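As a side note on how such an ensemble predictor can be built, the sketch below simply averages the predicted frequencies of several independently fitted networks; the averaging on the frequency scale and the number of runs are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predicted claim frequencies of independently fitted networks."""
    preds = np.stack([m.predict(X, verbose=0) for m in models], axis=0)
    return preds.mean(axis=0)

# usage: y_hat = ensemble_predict(fitted_models, X_test)
```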
3.5 NormFormer gradient descent fitting: 2nd version

Plain-vanilla SGD fitting with nadam and a standard early stopping callback does not work particularly well on Transformers because of gradient magnitude mismatches: typically, gradients in earlier layers are much bigger than gradients in later layers; see Shleifer et al. [34]. Therefore, in our 2nd fitting attempt we exploit the NormFormer proposed by Shleifer et al. [34], using the adam version of SGD with a learning rate of 0.002 and \(\beta _2=0.98\); reducing this parameter of adam has been suggested, e.g., in Zhai et al. [40]. The optimal network weights were reached in less than 200 SGD epochs. The results are given in the lower part of Table 2. We observe a slight improvement in prediction accuracy, the Credibility Transformer ensemble having a smaller out-of-sample average Poisson deviance loss of \(23.711 \cdot 10^{-2}\).
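For concreteness, the corresponding optimizer configuration could look as follows in Keras; only the learning rate and \(\beta_2\) come from the text, all other arguments keep their defaults.

```python
import keras

# adam with a reduced second-moment decay rate beta_2, as used for the NormFormer fitting
optimizer = keras.optimizers.Adam(learning_rate=0.002, beta_2=0.98)
# this optimizer is then passed to model.compile(...) as in the 1st fitting approach
```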
Figure 3 compares the Credibility Transformer predictions (NormFormer SGD fitting) against the GLM predictions (out-of-sample). The red line gives a GAM smoother. On average the two predictors are rather similar, except in the tails. However, the black cloud of predictors has individual points that substantially differ from the blue diagonal line, indicating that there are bigger differences on an individual insurance policy scale. This is verified by the density plot of the individual log-differences of the predictors on the right-hand side of Fig. 3.

Fig. 4: Credibility Transformer predictions: NormFormer SGD fitting against nadam fitting

Figure 4 compares the Credibility Transformer predictions from the NormFormer SGD fitting procedure and the nadam fitting procedure, the former slightly outperforming the latter in terms of out-of-sample loss. The figure shows that the individual predictions lie fairly close to the diagonal line, saying that there are no bigger individual differences. This supports the robustness of the fitting procedure.

4 Improving the credibility transformer
4.1 Multi-head attention and deep transformer architecture
4.1.1 Multi-head attention
4.1.2 Deep credibility transformer
4.2 Gated layers
4.3 Improving the continuous covariate embedding
4.4 Other modifications
Among other modifications, we use the adamW optimizer of Loshchilov–Hutter [25], with a weight decay set to 0.02. Finally, we set \(\beta _2=0.95\) in the optimizer, which is a best practice to stabilize optimization with Transformer architectures that use large batches and versions of the adam optimizer; see Zhai et al. [40].
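Assuming a recent Keras version in which the adamW optimizer is available, the corresponding configuration might look as follows; the learning rate is not specified here and is therefore left at its default.

```python
import keras

# adamW with decoupled weight decay 0.02 and reduced beta_2 = 0.95
optimizer = keras.optimizers.AdamW(weight_decay=0.02, beta_2=0.95)
```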
4.5 Results of the improved deep credibility transformer

| Model | In-sample Poisson loss | Out-of-sample Poisson loss |
|---|---|---|
| Ensemble Poisson plain-vanilla FNN | 23.691 | 23.783 |
| Ensemble Credibility Transformer (best-performing) | 23.562 | 23.711 |
| Improved Credibility Transformer with \(\alpha = 90\%\) | 23.533 (± 0.058) | 23.670 (± 0.031) |
| Ensemble Credibility Transformer (\(\alpha = 90\%\)) | 23.454 | 23.587 |
| Improved Credibility Transformer with \(\alpha = 95\%\) | 23.557 (± 0.058) | 23.676 (± 0.027) |
| Ensemble Credibility Transformer (\(\alpha = 95\%\)) | 23.465 | 23.593 |
| Improved Credibility Transformer with \(\alpha = 98\%\) | 23.544 (± 0.042) | 23.670 (± 0.032) |
| Ensemble Credibility Transformer (\(\alpha = 98\%\)) | 23.460 | 23.577 |
| Improved pure Transformer with \(\alpha = 100\%\) | 23.535 (± 0.051) | 23.689 (± 0.044) |
| Ensemble pure Transformer (\(\alpha = 100\%\)) | 23.447 | 23.607 |
Figure 6 compares the Credibility Transformer predictions (NormFormer fitting) against the improved deep Credibility Transformer predictions (out-of-sample). Similarly to the above, the red line gives a local GAM smoother. We see that, on average, the two predictors are rather similar, in this case even in the tails. Furthermore, we see that there are bigger differences on an individual insurance policy scale; nonetheless, the density plot of the individual log-differences of the predictors on the right-hand side of Fig. 6 shows that these differences are smaller than the differences between the GLM and the Credibility Transformer shown in Fig. 3.

We consider the attention weights for the BonusMalus covariate (which is the most significant explanatory variable, as we will see below). To these attention weights we applied a principal component analysis (PCA) on both attention heads separately. Figure 7 shows the first two PCA scores of the two attention heads (in red and blue). Moreover, we have labelled five individual instances in both attention heads to show the calibration of these two attention heads. From this figure we conclude that the two attention heads have learned different representations and structure of the data, because they have different shapes.
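A minimal sketch of this per-head PCA, assuming the per-instance attention weights for the BonusMalus token have already been extracted from the fitted model into one array per attention head (the arrays below are random stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# hypothetical inputs: per-instance attention weight vectors for the BonusMalus
# token, extracted separately from the two attention heads of the fitted model
attn_head_1 = np.random.rand(5000, 10)   # stand-in for extracted weights, head 1
attn_head_2 = np.random.rand(5000, 10)   # stand-in for extracted weights, head 2

fig, ax = plt.subplots()
for attn, color, label in [(attn_head_1, "red", "head 1"), (attn_head_2, "blue", "head 2")]:
    scores = PCA(n_components=2).fit_transform(attn)   # PCA fitted separately per head
    ax.scatter(scores[:, 0], scores[:, 1], s=2, c=color, label=label)
ax.set_xlabel("PCA score 1"); ax.set_ylabel("PCA score 2"); ax.legend()
plt.show()
```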
5 Exploring the credibility transformer predictions

In this section we explore the predictions of the improved deep Credibility Transformer fitted with the NormFormer variation. On average, the most attention is paid to the BonusMalus score, followed by VehAge, VehBrand and the Area code. The CLS token itself, which we recall is calibrated to produce the portfolio average frequency, has an attention probability of about 6.5% on average. That is, with reference to equation (2.13), \(P = a_{T+1,T+1}\) is comparably low, implying that the complement probability, \(1-P\), referring to the weight given to the covariates, is relatively high. This is exactly as we would expect for a large MTPL insurance portfolio. In Fig. 8 (rhs), we show a histogram of the attention outputs for \(P=a_{T+1,T+1}\) over all test instances. This histogram, which takes values between zero and approximately \(12.5\%\) across the test dataset, shows that even in the most extreme cases only a comparably small weight is allocated to the portfolio average experience.
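The following sketch illustrates how such a histogram can be produced once the attention matrices have been extracted from the fitted Credibility Transformer; the Dirichlet-generated array below is only a stand-in for those extracted matrices.

```python
import numpy as np
import matplotlib.pyplot as plt

# stand-in for the extracted attention matrices of the Credibility Transformer layer;
# in practice these would be read off the fitted model for each test instance
# (shape: n_test x (T+1) x (T+1), with rows summing to one)
T = 9
attention = np.random.dirichlet(np.ones(T + 1), size=(5000, T + 1))

# credibility weight P = a_{T+1,T+1}: the attention that the CLS token pays to itself
P = attention[:, -1, -1]

plt.hist(P, bins=50)
plt.xlabel(r"$P = a_{T+1,T+1}$")
plt.ylabel("count")
plt.show()
```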
Fig. 9: Attention scores for the BonusMalus, DrivAge, VehBrand and Region covariates
In Fig. 9, we show the attention scores for the covariates BonusMalus, DrivAge, VehBrand and Region (see the next paragraph for a description of how these variables were chosen). The figures show interesting relationships which indicate that the model attends strongly to the information contained within the embedding of these variables, for certain values of the covariates. For example, low values of the (correlated) BonusMalus and DrivAge covariates receive strong attention scores. Likewise, some values of the VehBrand and Region covariates receive very strong attention scores. These relationships can be investigated further by trying to understand the context in which variations may occur; e.g., in the top-left panel of Fig. 9 we can see that, at low values of the BonusMalus covariate, two different attention patterns occur for some segments of the data. In Fig. 10, we reproduce the top-left panel of Fig. 9, but color the points according to the Density covariate, revealing that drivers with a low BonusMalus score in high-density regions have lower attention scores for the BonusMalus covariate than drivers in medium- and low-Density areas; in other words, these variables interact quite strongly.
Fig. 10: The BonusMalus covariate and the attention scores; points colored by the value of the Density covariate
Fig. 11: CLS token attention scores against the attention scores for the BonusMalus, DrivAge, VehBrand and Region covariates; points are colored according to the value taken by the covariate under consideration
In Fig. 11, we show the relationship between the CLS token attention scores and the attention scores for the BonusMalus, DrivAge, VehBrand and Region covariates. These covariates were selected as the top four covariates maximizing the variable importance scores of a Random Forest model fitted to predict the CLS token attention scores based on the attention scores for the other covariates. The analysis shows a very strong relationship between these covariates and the CLS attention scores, i.e., for how the credibility given to the portfolio experience varies with the values of the other covariates. In particular, we can see that the highest credibility is given to the portfolio experience when the BonusMalus scores are low, when the DrivAge corresponds to middle-aged drivers, when the VehBrand is not B12, and in several of the Regions in the dataset. In summary, this analysis leads us to expect that - at least to some extent - the Credibility Transformer produces frequency predictions close to the portfolio average for these, and similar, values of the covariates. We test this insight in Fig. 12, which shows density plots of the values of the BonusMalus and DrivAge covariates, together with the average value of each covariate and the average value of the covariate for policies whose predictions are close to the predicted portfolio mean. It can be seen that for low values of the BonusMalus covariate, and for middle-aged drivers, the Credibility Transformer produces predictions close to the portfolio mean. As just mentioned, for these “average" policyholders, we give higher credibility to the CLS token.
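A minimal sketch of this variable-importance-based selection, assuming the per-instance attention scores have been collected into a data frame (random stand-in values below); the covariate list follows the standard French MTPL dataset and the Random Forest hyperparameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# hypothetical data frame of per-instance attention scores, one column per covariate
# token plus one for the CLS token, extracted from the fitted Credibility Transformer
covariate_cols = ["Area", "VehPower", "VehAge", "DrivAge", "BonusMalus",
                  "VehBrand", "VehGas", "Density", "Region"]
attn = pd.DataFrame(np.random.rand(5000, 10), columns=covariate_cols + ["CLS"])  # stand-in values

# Random Forest predicting the CLS attention score from the covariate attention scores
rf = RandomForestRegressor(n_estimators=500, random_state=1)   # hyperparameters are illustrative
rf.fit(attn[covariate_cols], attn["CLS"])

# rank covariates by variable importance and keep the top four
importance = pd.Series(rf.feature_importances_, index=covariate_cols).sort_values(ascending=False)
print(importance.head(4))
```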
Fig. 12: Density plots of the BonusMalus (lhs) and the DrivAge (rhs) covariates; the dotted red line indicates the average value of the covariate and the green dot represents the average value of the covariate producing predictions close to the predicted portfolio mean