Open Access 15-02-2025 | Original Research Paper

The credibility transformer

Authors: Ronald Richman, Salvatore Scognamiglio, Mario V. Wüthrich

Published in: European Actuarial Journal


Abstract

Inspired by the great success of Transformers in large language models, these architectures are increasingly being applied to tabular data. This is achieved by embedding tabular data into low-dimensional Euclidean spaces, resulting in structures similar to time-series data. We introduce a novel credibility mechanism into this Transformer architecture. This credibility mechanism is based on a special token that should be seen as an encoder consisting of a credibility weighted average of prior information and observation-based information. We demonstrate that this novel credibility mechanism is very beneficial for stabilizing training, and our Credibility Transformer leads to predictive models that are superior to state-of-the-art deep learning models.

1 Introduction

Feed-forward neural networks (FNNs) provide state-of-the-art deep learning regression models for actuarial pricing. FNNs can be seen as extensions of generalized linear models (GLMs), taking covariates as inputs to these FNNs, feature-engineering these covariates through several hidden FNN layers, and then using these feature-engineered covariates as inputs to a GLM. Advantages of FNNs over classical GLMs are that they are able to find functional forms and interactions in the covariates that cannot easily be captured by GLMs, and which typically require the modeler to have specific deeper insights into the data generation process. Since these specific deeper insights are not always readily available, FNNs may support the modeler in finding such structure and insight; for a broader discussion of FNNs in an actuarial context we refer to Wüthrich–Merz [38, Chapter 7].
Taking inspiration from the recent huge success of large language models (LLMs), the natural question arises whether there are network architectures other than FNNs that share more similarity with LLMs and which can further improve predictive performance of neural networks in actuarial pricing. LLMs use the Transformer architecture which has been invented by Vaswani et al. [35]. The Transformer architecture is based on attention layers; and for an early work on the attention mechanism, see Bahdanau et al. [2]. Attention layers are special network modules that allow covariate components to communicate with each other. Specifically, one may think of each covariate component receiving a so-called query and a key, and the attention mechanism tries to find queries and keys of different covariate components that match in order to send forward a positive or negative signal. For example, in the case of pricing car insurance, young drivers may have a key labeled ‘risky’, and car brands may have queries, for instance, a sports car label trying to find the age group of risky drivers. Having a match of a query and a key then leads to an increase of the expected claim frequency, which accounts for a corresponding interaction of these two covariates in the regression function. This idea of keys, queries and values, which is central to Transformer architectures, originates from the information retrieval literature, where a search term (the query) is matched to relevant documents (the keys), the contents of which provide the information the user is looking for (the values); for a general reference on information retrieval, see Manning et al. [26].
Transformer architectures have been introduced to process natural language data, which have a canonical time-series structure; see Vaswani et al. [35]. A main question is whether there is a similarly beneficial way to use Transformers and attention layers on tabular input data, which lack this time-series structure. The main tool for applying Transformers to tabular data is the so-called entity embedding mechanism, which is a specific way of representing information. Embedding all covariates into a low-dimensional Euclidean space, one receives embedded tabular covariates that share similar features with time-series data. The first attempt at building this type of model is found in Huang et al. [20]. These authors proposed a model embedding only the categorical components of tabular data, applying a Transformer encoder to these, and then concatenating the Transformer encoded outputs with the numerical input variables, which are further processed by a FNN. This TabTransformer model of Huang et al. [20] has been applied by Kuo–Richman [22], who found that it provides a small advantage over FNN architectures for predicting the severity of flood claims. A more comprehensive approach has recently been proposed by Gorishniy et al. [15], the feature tokenizer (FT) Transformer architecture, which also embeds the numerical components of tabular data. This proposal has been considered in the actuarial literature by Brauer [3]. The numerical results presented in Brauer [3] show that using a FT-Transformer architecture for tabular data enhances the predictive performance over classical FNN architectures on a specific commonly used motor insurance dataset (which we also use below). This motivated us to explore whether this approach can be improved further. We modify the FT-Transformer architecture with a novel credibility mechanism. Using this additional credibility mechanism, we obtain Transformer-based network architectures that outperform the previous network results. This gives clear evidence that Transformer encoders can be very beneficial on tabular input data if the network architectures are designed and trained in a sophisticated manner.
The FT-Transformer architecture of Gorishniy et al. [15] and Brauer [3] includes a special token that has been added to the classical Transformer architecture. This additional token is inspired by the Bidirectional Encoder Representations from Transformers (BERT) architecture of Devlin et al. [9]. This special token (called the CLS token by Devlin et al. [9]) is used to encode the covariate information gained from the attention mechanism in a lower dimensional representation. Our modification uses an additional credibility structure on this special CLS token. Through the training process described later, one is able to interpret this CLS token as playing the role of prior information consisting of the portfolio mean in a Bayesian credibility sense. This prior information-based predictor is then combined in a linear credibility fashion (in the abstract embedding space of the model) with a second predictor that plays the role of observed information in Bayesian credibility theory. Specifically, we consider a credibility-based average of the two sets of information, and this provides us with a similar credibility structure as, e.g., the classical linear credibility formula of Bühlmann–Straub [5]. This motivates the use of the term Credibility Transformer for our proposal. In other words, the Credibility Transformer does not only try to find matches between queries and keys, but it also weighs this information by a credibility mechanism. As noted above, in this work, we make two main connections to the linear credibility formula: first, when training the model, we force the special CLS token to provide prior information to the model using a credibility factor, and second, we interpret the attention mechanism within the Transformer as a linear credibility formula between the prior information contained in the CLS token and the rest of the covariate information. This second, inherent credibility mechanism is more hidden and can only be seen by decomposing the attention mechanism into its single parts. It is useful for interpretation, and, in particular, it allows for a variable importance measure which also quantifies the importance of the prior information.
Moreover, we explore special fitting strategies for these Credibility Transformers, since training Transformer architectures requires a form of tempered learning to avoid premature early-stopping decisions. For this we adapt the NormFormer proposal of Shleifer et al. [34], which applies a special Transformer pre-training that can cope with different gradient magnitudes in stochastic gradient descent training. We verify that this proposal of Shleifer et al. [34] is beneficial in our Credibility Transformer architecture, resulting in superior performance compared to plain-vanilla trained architectures. Building on this initial exploration, we then augment the Credibility Transformer with several advances from the LLM and deep learning literature, producing a deep and multi-head attention version of the initial Credibility Transformer architecture. Furthermore, we implement the concept of Gated Linear Units (GLU) of Dauphin et al. [7] to select the more important covariate components, and we exploit the Piecewise Linear Encoding (PLE) of Gorishniy et al. [14], which should be thought of as a more informative embedding than its one-hot encoded counterpart, especially adapted for numerical covariates. It is more informative because PLE preserves the ordinal relation of continuous covariates, while enabling subsequent network layers to produce a multidimensional embedding for the numerical data. This improved deep Credibility Transformer equips us with regression models that have an excellent predictive performance. Remarkably, we show that the Credibility Transformer approach can improve a state-of-the-art deep Transformer model applied to a non-life pricing problem. Finally, we examine the explainability of the Credibility Transformer approach by exploring a fitted model.
Organization of this manuscript. This paper is organized as follows. Section 2 describes the architecture of the Credibility Transformer in detail, including the input tokenization process, the Transformer layer with its attention mechanism, and our novel credibility mechanism. Section 3 presents a real data example using the French motor third party liability (MTPL) claims count dataset, demonstrating the implementation and performance of the Credibility Transformer. Section 4 explores improvements to the Credibility Transformer, incorporating insights from LLMs and recent advances in deep learning. Section 5 discusses the insights that can be gained from a fitted Credibility Transformer. Finally, Sect. 6 concludes by summarizing our contributions and findings, and it discusses potential future research directions.

2 The credibility transformer

This section describes the architecture of the Credibility Transformer. The reader should have the classical Transformer architecture of Vaswani et al. [35] in mind, with one essential difference, namely, the entire set of relevant covariate information is going to be encoded into additional classify (CLS) tokens, and these CLS tokens are going to be combined with a credibility mechanism. For this we extend the classical Transformer encoder of Vaswani et al. [35] by CLS tokens. CLS tokens have been introduced by Devlin et al. [9] in the language model Bidirectional Encoder Representations from Transformers (BERT). In BERT, these tokens are used to summarize the probability of one sentence following another one; therefore, these authors have used the term ‘classify token’. In our architecture, this terminology may be a bit misleading because our CLS tokens will just be real numbers not relating to any classification probabilities; rather, they will represent biases in regression models. Nevertheless, we keep the term CLS token to emphasize its structural analogy to BERT.

2.1 Construction of the input tensor

We first describe the input pre-processing. This input pre-processing tokenizes all input variables (features, covariates) by embedding them into a low dimensional Euclidean space. These embeddings are complemented by positional embeddings and CLS tokens. We describe this in detail in the next three subsections. The main difference to the classical Transformer of Vaswani et al. [35] concerns the last step of adding the CLS tokens for encoding the information from the attention mechanism; this feature is described in Sect. 2.1.3, below.

2.1.1 Feature tokenizer

We start by describing the tokenization of the covariates (input features). Assume we have \(T_1\) categorical covariates \((x_t)_{t=1}^{T_1}\) and \(T_2\) continuous covariates \((x_t)_{t=T_1+1}^{T} \in {{\mathbb {R}}}^{T_2}\); we set \(T=T_1+T_2\) for the total number of covariate components in input \({\varvec{x}}=(x_t)_{t=1}^T\).
We bring these categorical and continuous input variables into the same structure by applying entity embedding \(({\varvec{e}}^{\textrm{EE}}_t)_{t=1}^{T_1}\) to the categorical variables and FNN embedding \(({\varvec{e}}^{\textrm{FNN}}_t)_{t=T_1+1}^T\) to the continuous variables; this pre-processing step is called the feature tokenizer. Entity embedding has been introduced in the context of natural language processing (NLP) by Brébisson et al. [4] and Guo–Berkhahn [17], and it has been introduced to the actuarial community by Richman [27, 28] and Delong–Kozak [8]. The main purpose of entity embedding is to represent (high-cardinality) nominal features by low dimensional Euclidean embeddings such that proximity in these low dimensional Euclidean spaces means similarity w.r.t. the prediction task at hand. Additionally, we also embed the continuous input variables into the same low dimensional Euclidean space such that all input variables have the same structure. This approach has been considered in Gorishniy et al. [15], and it is used by Brauer [3] in the actuarial literature.
The embeddings are defined as follows. For the entity embeddings of the categorical covariates we select mappings
$$\begin{aligned} {\varvec{e}}^{\textrm{EE}}_t: \{a_1,\ldots , a_{n_t}\} \rightarrow {{\mathbb {R}}}^b, \qquad x_t \mapsto {\varvec{e}}^{\textrm{EE}}_t(x_t), \end{aligned}$$
where \(a_1,\ldots , a_{n_t}\) are the different levels of the t-th categorical covariate \(x_t\), and \(b \in {{\mathbb {N}}}\) is the fixed, pre-chosen embedding dimension; this embedding dimension b is a hyperparameter that needs to be selected by the modeler. Importantly, for our Credibility Transformer all embeddings must have the same embedding dimension b (possibly enlarged, as we do later, by concatenating other vectors to these embeddings); this embedding dimension b is called the model dimension of the Transformer model.
Each of these categorical entity embeddings involves \(b n_t\) embedding weights (parameters), thus, we have \(\sum _{t=1}^{T_1} b n_t\) parameters in total from the embedding of the categorical covariates \((x_t)_{t=1}^{T_1}\).
For the continuous input variables we select fully-connected FNN architectures \({\varvec{z}}_t^{(2:1)}={\varvec{z}}_t^{(2)}\circ {\varvec{z}}_t^{(1)}\) of depth 2 being composed of two FNN layers
$$\begin{aligned} {\varvec{z}}_t^{(1)}: {{\mathbb {R}}}\rightarrow {{\mathbb {R}}}^b \qquad \text { and } \qquad {\varvec{z}}_t^{(2)}: {{\mathbb {R}}}^b \rightarrow {{\mathbb {R}}}^b. \end{aligned}$$
(2.1)
We select such a FNN architecture \({\varvec{z}}_t^{(2:1)}\) for each continuous covariate component \(x_t\), \(T_1+1\le t \le T\). Each of these continuous FNN embeddings \({\varvec{z}}_t^{(2:1)}\) involves \(2b + b(b+1)\) network weights (including the bias parameters). This gives us \(T_2 (2b + b(b+1))\) parameters for the embeddings of the continuous covariates \((x_t)_{t=T_1+1}^{T}\); for more technical details and the notation used for FNNs we refer to Wüthrich–Merz [38, Chapter 7]. Remark that using two FNN layers is a specific choice that can be replaced by other ones. This choice has been successfully used in Rossouw–Richman [32].
Concatenating and reshaping these categorical and continuous covariate embeddings equips us with the raw input tensor
$$\begin{aligned} {\varvec{x}}=(x_t)_{t=1}^T ~\mapsto ~ {\varvec{x}}^\circ _{1:T} & =\left[ {\varvec{e}}^{\textrm{EE}}_1(x_1), \ldots , {\varvec{e}}^{\textrm{EE}}_{T_1}(x_{T_1}), {\varvec{z}}_{T_1+1}^{(2:1)}(x_{T_1+1}), \ldots , {\varvec{z}}_{T}^{(2:1)}(x_{T})\right] ^\top \\ & \in ~ {{\mathbb {R}}}^{T \times b}. \end{aligned}$$
This raw input tensor \({\varvec{x}}^\circ _{1:T}\) involves the following number of parameters to be fitted/learned
$$\begin{aligned} \varrho ^{\textrm{input}} = b \sum _{t=1}^{T_1} n_t + T_2 (2b + b(b+1)). \end{aligned}$$
Having the input in this raw tensor structure \({\varvec{x}}^\circ _{1:T}\in {{\mathbb {R}}}^{T \times b}\) allows us to employ attention layers and Transformers to further process this input data.
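To make the feature tokenizer concrete, the following Keras sketch builds the raw input tensor for illustrative dimensions; the cardinalities and the activation choices follow the example in Sect. 3, but all layer and variable names are our own and the snippet is a minimal sketch rather than the exact implementation used in the paper.

```python
import tensorflow as tf

# Minimal sketch of the feature tokenizer: entity embeddings for the T1
# categorical covariates and two-layer FNN embeddings for the T2 continuous
# covariates, all mapped to the same embedding dimension b.
b = 5
n_levels = [6, 2, 11, 22]                     # cardinalities n_t (cf. Sect. 3.2)
T1, T2 = len(n_levels), 5

cat_inputs = [tf.keras.Input(shape=(1,), dtype="int32") for _ in range(T1)]
num_inputs = [tf.keras.Input(shape=(1,)) for _ in range(T2)]

# Entity embeddings e_t^EE: one b-dimensional vector per categorical level.
cat_tokens = [tf.keras.layers.Embedding(n_t, b)(x)                # (batch, 1, b)
              for n_t, x in zip(n_levels, cat_inputs)]

# FNN embeddings z_t^(2:1): R -> R^b -> R^b, linear then tanh (cf. Sect. 3.2).
num_tokens = []
for x in num_inputs:
    h = tf.keras.layers.Dense(b, activation="linear")(x)          # z_t^(1)
    h = tf.keras.layers.Dense(b, activation="tanh")(h)            # z_t^(2)
    num_tokens.append(tf.keras.layers.Reshape((1, b))(h))         # (batch, 1, b)

# Raw input tensor x^o_{1:T} of shape (batch, T, b), with T = T1 + T2 = 9.
x_raw = tf.keras.layers.Concatenate(axis=1)(cat_tokens + num_tokens)
```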

2.1.2 Positional encoding

Since attention layers do not have a natural notion of position and/or time in the tensor information, we complement the raw tensor \({\varvec{x}}^\circ _{1:T}\) by a b-dimensional positional encoding; see Vaswani et al. [35] for positional encoding in the context of NLP. However, we choose a simpler learned positional encoding compared to the sine-cosine encoding used by Vaswani et al. [35], as this simpler version is sufficient for our purpose of learning on tabular data which do not have the natural ordering exploited in the original Transformer architecture.
The positional encoding can be achieved by another embedding that considers the positions \(t \in \{1,\ldots , T\}\) in a b-dimensional representation
$$\begin{aligned} {\varvec{e}}^{\textrm{pos}}: \{1,\ldots , T\} \rightarrow {{\mathbb {R}}}^b, \qquad t \mapsto {\varvec{e}}^{\textrm{pos}}(t). \end{aligned}$$
This is equivalent to the entity embedding, where each label (position) t is mapped to a b-dimensional vector \({\varvec{e}}^{\textrm{pos}}(t) \in {{\mathbb {R}}}^b\). This positional encoding involves another \(\varrho ^{\textrm{position}}=Tb\) parameters. We concatenate this positional encoding with the raw input tensor giving us the input tensor
$$\begin{aligned} {\varvec{x}}_{1:T}= & \begin{bmatrix}{\varvec{x}}^\circ _{1:T} & \begin{pmatrix}{\varvec{e}}^{\textrm{pos}}(1)^\top \\ \vdots \\ {\varvec{e}}^{\textrm{pos}}(T)^\top \end{pmatrix}\end{bmatrix} \\ = & \begin{bmatrix} {\varvec{e}}^{\textrm{EE}}_1(x_1)& \cdots & {\varvec{e}}^{\textrm{EE}}_{T_1}(x_{T_1})& {\varvec{z}}_{T_1+1}^{(2:1)}(x_{T_1+1})& \cdots & {\varvec{z}}_{T}^{(2:1)}(x_{T})\\ {\varvec{e}}^{\textrm{pos}}(1)& \cdots & {\varvec{e}}^{\textrm{pos}}(T_1)& {\varvec{e}}^{\textrm{pos}}(T_1+1) & \cdots & {\varvec{e}}^{\textrm{pos}}(T) \end{bmatrix}^\top ~\in ~ {{\mathbb {R}}}^{T \times 2b}. \end{aligned}$$
Remark 2.1
There is an ongoing discussion about the necessity of having positional encodings in the case of tabular input data; see, e.g., Huang et al. [20]. We cannot give a definite answer to this question, but in our example these positional encodings provided better predictive results. Intuitively, the positional encodings seem unnecessary because the covariate components can be arbitrarily permuted without changing the model, as there is no canonical adjacency in tabular data.
We emphasize that all components of this input tensor \({\varvec{x}}_{1:T}\) originate from a specific input being either a covariate \(x_t\) or the position t of that covariate, i.e., it contains information about the individual instances. In the next subsection, we are going to add an additional variable, the CLS token, to this input tensor, but this additional variable does not originate from a covariate variable with a specific meaning.

2.1.3 Classify (CLS) token

We extend the input tensor \({\varvec{x}}_{1:T}\) by an additional component, the CLS token. This is the step that differs from the classical Transformer proposal of Vaswani et al. [35]. The purpose of the CLS token is to encode every column \(1\le j \le 2b\) of the input tensor \({\varvec{x}}_{1:T}\in {{\mathbb {R}}}^{T \times 2b}\) into a single variable. This then gives us the augmented input tensor
$$\begin{aligned} {\varvec{x}}^+_{1:T+1}= & \begin{bmatrix}{\varvec{x}}_{1:T} \\ {\varvec{c}}^\top \end{bmatrix} \\ = & \nonumber \begin{bmatrix} {\varvec{e}}^{\textrm{EE}}_1(x_1)& \cdots & {\varvec{e}}^{\textrm{EE}}_{T_1}(x_{T_1})& {\varvec{z}}_{T_1+1}^{(2:1)}(x_{T_1+1})& \cdots & {\varvec{z}}_{T}^{(2:1)}(x_{T})& {\varvec{c}}_1\\ {\varvec{e}}^{\textrm{pos}}(1)& \cdots & {\varvec{e}}^{\textrm{pos}}(T_1)& {\varvec{e}}^{\textrm{pos}}(T_1+1) & \cdots & {\varvec{e}}^{\textrm{pos}}(T)& {\varvec{c}}_2 \end{bmatrix}^\top \in ~ {{\mathbb {R}}}^{(T+1) \times 2b}, \end{aligned}$$
(2.2)
where \({\varvec{c}}=({\varvec{c}}_1^\top ,{\varvec{c}}_2^\top )^\top =(c_1,\ldots , c_{2b})^\top \in {{\mathbb {R}}}^{2b}\) denote the CLS tokens. Each of the scalars \(c_j \in {{\mathbb {R}}}\) comprising the CLS tokens, \(1\le j \le 2b\), will encode one column of the input tensor \({\varvec{x}}_{1:T} \in {{\mathbb {R}}}^{T\times 2b}\), i.e., it will provide a one-dimensional projection of the corresponding j-th T-dimensional vector to a scalar \(c_j \in {{\mathbb {R}}}\); in text recognition models, one may imagine that the input tensor \({\varvec{x}}_{1:T}\) consists of T embeddings of size 2b, and the j-th dimension of each token embedding is summarized into a real-valued token \(c_j\). For further processing, only the information contained in the CLS token \({\varvec{c}}\) will be forwarded to make predictions, as it reflects a compressed (encoded) version of the entire tensor information (after training of course). We illustrate the augmented input tensor \({\varvec{x}}^+_{1:T+1}\) in Fig. 1.
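As an illustration of how the augmented input tensor is assembled, the following sketch (continuing the feature-tokenizer snippet above) concatenates learned positional encodings feature-wise and appends a trainable CLS token row; the shapes and names are our own illustrative choices.

```python
import tensorflow as tf

# Minimal sketch of building the augmented input tensor x^+_{1:T+1}.
T, b, batch = 9, 5, 32
x_raw = tf.random.normal((batch, T, b))                       # stands in for x^o_{1:T}

# Learned positional encoding e^pos(t) in R^b for the positions t = 1, ..., T.
pos_emb = tf.keras.layers.Embedding(T, b)(tf.range(T))        # (T, b)
pos_emb = tf.repeat(pos_emb[None, :, :], batch, axis=0)       # (batch, T, b)
x_pos = tf.concat([x_raw, pos_emb], axis=-1)                  # x_{1:T}, (batch, T, 2b)

# Trainable CLS token c in R^{2b}, initialized uniformly (cf. Sect. 2.2.3) and
# shared across the batch; appended as the (T+1)-st row.
cls = tf.Variable(tf.random.uniform((1, 1, 2 * b)))
x_aug = tf.concat([x_pos, tf.repeat(cls, batch, axis=0)], axis=1)   # (batch, T+1, 2b)
```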
Table 1
Credibility Transformer architecture used on the French MTPL dataset

| Module | Variable/layer | # Weights |
|---|---|---|
| Feature tokenizer (raw input tensor) | \({\varvec{x}}_{1:9}^\circ \) | 405 |
| Positional encoding | \({\varvec{e}}^{\textrm{pos}}_{1:9}\) | 45 |
| CLS tokens | \({\varvec{c}}\) | 10 |
| Time-distributed normalization layer | \({\varvec{z}}^{\textrm{norm}}\) | 20 |
| Credibility Transformer | \({\varvec{c}}^{\textrm{cred}}\) | 1,073 |
| FNN decoder | \({\varvec{z}}^{(2:1)}\) | 193 |

2.2 Credibility transformer layer

2.2.1 Normalization layer

The CLS token augmented input tensor \({\varvec{x}}^+_{1:T+1}\in {{\mathbb {R}}}^{(T+1) \times 2b}\) is first processed through a normalization layer (not indicated in our notation, but highlighted in Table 1, below) before entering the Transformer architecture. Whereas several previous works in the actuarial literature have used batch-normalization (BN) [21], Transformer models usually use layer-normalization (LN) [1], which is what is utilized here. We briefly describe the difference between the two. Network training often requires normalizing the outputs of each layer before entering the next one, to ensure stable training. This is required since the weights in each layer are strongly dependent on the outputs of the previous layer, and if those outputs are modified significantly during the training process, the network training procedure can become destabilized. BN [21] normalizes the layer output activations across the batch dimension and for each feature channel. This typically works well when batches are large and statistically representative. However, in scenarios with very small or variable batch sizes, BN can become unstable or less effective. These scenarios are typically realized when training larger models using smaller batches, or when using recurrent models with variable sequence lengths. In the latter case, the calculation of batch statistics at each time step of the model is difficult. In contrast, LN [1] normalizes across the feature dimension for each individual sample, i.e., there is no dependence on the size of the batch, thus, LN is not sensitive to the batch size and is more suitable for sequential models, where each sample (or each time step in a sequence) requires its own stable normalization. Because Transformers applied for NLP often process sequences (e.g., tokens in language models) and because they can handle varying lengths and batch sizes, LN has become the standard approach to normalizing activations in these architectures, ensuring consistent scaling and shifting of the hidden representations across different samples, leading to more stable training.
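The difference between the two normalization schemes can be illustrated on the tensor shapes used here; the snippet below is a hedged Keras illustration with arbitrary dimensions, not part of the model code.

```python
import tensorflow as tf

# Illustration: BN normalizes each feature channel across the batch, whereas LN
# normalizes across the feature dimension of each individual sample; for the
# (batch, T+1, 2b) tensors used here, LN is applied on the last axis.
x = tf.random.normal((4, 10, 10))                                    # (batch, T+1, 2b)
ln = tf.keras.layers.LayerNormalization(axis=-1)(x)                  # per sample, per token
bn = tf.keras.layers.BatchNormalization(axis=-1)(x, training=True)   # per channel, across batch
```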

2.2.2 Transformer architecture

We briefly describe the crucial modules of Transformers that are relevant for our Credibility Transformer architecture; for a more detailed description (and illustration) we refer to Vaswani et al. [35] and Richman–Wüthrich [31, Section 3.6]. We remark that we only use the encoder part of the Transformer of Vaswani et al. [35].
Transformers are based on attention layers, and attention layers consist of queries \({\varvec{q}}_t\), keys \({\varvec{k}}_t\) and values \({\varvec{v}}_t\), \(1\le t \le T+1\), given by the FNN layer transformed inputs
$$\begin{aligned} {\varvec{q}}_t= & \phi \left( {\varvec{b}}_Q +W_Q x^+_t\right) ~\in ~{{\mathbb {R}}}^{2b},\nonumber \\ {\varvec{k}}_t= & \phi \left( {\varvec{b}}_K + W_K x^+_t\right) ~\in ~{{\mathbb {R}}}^{2b},\nonumber \\ {\varvec{v}}_t= & \phi \left( {\varvec{b}}_V +W_V x^+_t\right) ~\in ~{{\mathbb {R}}}^{2b}, \end{aligned}$$
(2.3)
with weight matrices \(W_Q, W_K, W_V \in {{\mathbb {R}}}^{2b\times 2b}\), biases \({\varvec{b}}_Q, {\varvec{b}}_K, {\varvec{b}}_V \in {{\mathbb {R}}}^{2b}\), and where the activation function \(\phi :{{\mathbb {R}}}\rightarrow {{\mathbb {R}}}\) is applied element-wise. Since the weight matrices and biases are assumed to be t-independent, we apply the same FNN transformation (2.3) in a time-distributed manner to all components \(1\le t \le T+1\) of the augmented input tensor \({\varvec{x}}^+_{1:T+1}\). This gives us the tensors
$$\begin{aligned} Q~=~Q({\varvec{x}}^+_{1:T+1})= & [ {\varvec{q}}_1, \ldots , {\varvec{q}}_{T+1}]^\top ~\in ~ {{\mathbb {R}}}^{(T+1)\times 2b},\nonumber \\ K~=~K({\varvec{x}}^+_{1:T+1})= & [ {\varvec{k}}_1, \ldots , {\varvec{k}}_{T+1}]^\top ~\in ~ {{\mathbb {R}}}^{(T+1)\times 2b},\nonumber \\ V~=~V({\varvec{x}}^+_{1:T+1})= & [ {\varvec{v}}_1, \ldots , {\varvec{v}}_{T+1}]^\top ~\in ~ {{\mathbb {R}}}^{(T+1)\times 2b}. \end{aligned}$$
(2.4)
The queries Q and keys K are used to construct the attention weight matrix A by applying the softmax function to all rows (in a row-wise manner) in the following matrix
$$\begin{aligned} A =A({\varvec{x}}^+_{1:T+1})= \textsf{softmax}\left( \frac{Q K^\top }{\sqrt{2b}} \right) ~\in ~ {{\mathbb {R}}}^{(T+1)\times (T+1)},\end{aligned}$$
(2.5)
where the softmax operation is defined for each row as
$$\begin{aligned} \textsf{softmax}({\varvec{z}})_j = \frac{\exp (z_j)}{\sum _{k=1}^{T+1} \exp (z_k)}, \quad \text {for } j = 1, \ldots , T+1, \end{aligned}$$
(2.6)
with \({\varvec{z}}\) being a row of the matrix \(QK^\top /\sqrt{2b}\).
Finally, the attention head of the Transformer is obtained by the matrix multiplication
$$\begin{aligned} H=H({\varvec{x}}^+_{1:T+1}) = A \, V ~\in ~ {{\mathbb {R}}}^{(T+1) \times 2b}. \end{aligned}$$
(2.7)
This function encodes the augmented input tensor \({\varvec{x}}^+_{1:T+1}\in {{\mathbb {R}}}^{(T+1)\times 2b}\) into the attention head \(H({\varvec{x}}^+_{1:T+1}) \in {{\mathbb {R}}}^{(T+1) \times 2b}\) of the same dimension. Essentially, this attention mechanism is a weighting scheme reflected by the attention weights \(A({\varvec{x}}^+_{1:T+1})\) applied to the values \(V({\varvec{x}}^+_{1:T+1})=[ {\varvec{v}}_1, \ldots , {\varvec{v}}_{T+1}]^\top \). For our first application of the Transformer, we only use one attention head; we modify this to a multi-head attention in Sect. 4, below.
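For concreteness, a small numpy sketch of the single-head attention (2.3)–(2.7) is given below; the activation \(\phi\) and the random weight initializations are illustrative placeholders.

```python
import numpy as np

# Minimal numpy sketch of single-head attention with model dimension 2b and an
# augmented input x_plus of shape (T+1, 2b).
rng = np.random.default_rng(0)
T, b = 9, 5
d = 2 * b
x_plus = rng.normal(size=(T + 1, d))

def phi(z):                                    # element-wise activation
    return np.tanh(z)

def softmax(z, axis=-1):                       # row-wise softmax (2.6)
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
b_Q, b_K, b_V = (rng.normal(size=(d,)) for _ in range(3))

Q = phi(x_plus @ W_Q.T + b_Q)                  # queries (2.3), shape (T+1, 2b)
K = phi(x_plus @ W_K.T + b_K)                  # keys
V = phi(x_plus @ W_V.T + b_V)                  # values

A = softmax(Q @ K.T / np.sqrt(d), axis=-1)     # attention weights (2.5), rows sum to 1
H = A @ V                                      # attention head (2.7), shape (T+1, 2b)
```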
A Transformer layer is then obtained by adding the attention head to the augmented input tensor resulting in a so-called skip-connection transformation
$$\begin{aligned} {\varvec{x}}^+_{1:T+1}~\mapsto ~ {\varvec{z}}^{\textrm{skip1}}({\varvec{x}}^+_{1:T+1})= {\varvec{x}}^+_{1:T+1}+H({\varvec{x}}^+_{1:T+1}) ~\in ~ {{\mathbb {R}}}^{(T+1) \times 2b}. \end{aligned}$$
(2.8)
Typically, this transformed input is further processed by a time-distributed normalization layer \({\varvec{z}}^{\textrm{norm}}\), drop-out layers \({\varvec{z}}^{\textrm{drop}}\) and post-processing time-distributed FNN layers \({\varvec{z}}^{\mathrm{t-FNN}}\) in combination with skip connections \({\varvec{z}}^{\textrm{skip}}\). Remark that a time-distributed layer is a network layer that applies a fixed operation to all time slices \(x^+_{t}\) of a (time-series) tensor \({\varvec{x}}^+_{1:T+1}\), i.e., the identical network weights/parameters are used for all indices \(1\le t \le T+1\); this construction was already used in the query, key and value definitions (2.3)–(2.4); for more details we refer to Vaswani et al. [35]. In our architecture, we process the output \({\varvec{z}}^{\textrm{skip1}}({\varvec{x}}^+_{1:T+1})\) of (2.8) further by applying the composed transformations
$$\begin{aligned} {\varvec{z}}^{\textrm{trans}}({\varvec{x}}^+_{1:T+1})= & {\varvec{z}}^{\textrm{skip1}}({\varvec{x}}^+_{1:T+1}) \nonumber \\ & + \left( {\varvec{z}}^{\textrm{norm2}}\circ {\varvec{z}}^{\textrm{drop2}}\circ {\varvec{z}}^{\mathrm{t-FNN2}}\circ {\varvec{z}}^{\textrm{drop1}}\circ {\varvec{z}}^{\mathrm{t-FNN1}}\circ {\varvec{z}}^{\textrm{norm1}}\right) \nonumber \\ & \times \left( {\varvec{z}}^{\textrm{skip1}}({\varvec{x}}^+_{1:T+1})\right) . \end{aligned}$$
(2.9)
Thus, we first normalize \({\varvec{z}}^{\textrm{norm1}}\), then apply a time-distributed FNN layer followed by drop-out \({\varvec{z}}^{\textrm{drop1}}\circ {\varvec{z}}^{\mathrm{t-FNN1}}\), and a second time-distributed FNN layer followed by drop-out \({\varvec{z}}^{\textrm{drop2}}\circ {\varvec{z}}^{\mathrm{t-FNN2}}\), and finally, we normalize again with \({\varvec{z}}^{\textrm{norm2}}\). In a second skip-connection transformation we aggregate the input and output of this transformation. This Transformer architecture outputs a tensor of shape \({{\mathbb {R}}}^{(T+1) \times 2b}\). We remark that the drop-out layers are only used during model fitting, to prevent the architecture from in-sample overfitting.
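A Keras-style sketch of the layer composition (2.8)–(2.9) is given below; the hidden sizes, GELU activation and 1% drop-out rate follow the choices reported in Sect. 3.3, but the snippet is a sketch of the described structure (assuming a recent TensorFlow/Keras version) rather than the authors' exact code.

```python
import tensorflow as tf

def transformer_layer(x_plus, attention_head, d_model=10):
    # first skip connection (2.8): augmented input plus attention head
    z_skip1 = tf.keras.layers.Add()([x_plus, attention_head])
    # norm1 -> t-FNN1 -> drop-out -> t-FNN2 -> drop-out -> norm2, cf. (2.9);
    # Dense layers act on the last axis, i.e. they are time-distributed.
    h = tf.keras.layers.LayerNormalization()(z_skip1)
    h = tf.keras.layers.Dense(32, activation="gelu")(h)
    h = tf.keras.layers.Dropout(0.01)(h)
    h = tf.keras.layers.Dense(d_model, activation="gelu")(h)
    h = tf.keras.layers.Dropout(0.01)(h)
    h = tf.keras.layers.LayerNormalization()(h)
    # second skip connection: aggregate input and output of the composition
    return tf.keras.layers.Add()([z_skip1, h])        # shape (batch, T+1, 2b)
```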
While the Transformer architecture is quite complex, we want to emphasize that the model presented here is now quite standard in the machine learning literature. We provide a diagram for ease of understanding the various components of the Transformer in Fig. 13 in the appendix.

2.2.3 Credibility mechanism

The remaining steps differ from Vaswani et al. [35], and here, we take advantage of the integrated CLS tokens to complement this Transformer encoder \({\varvec{z}}^{\textrm{trans}}(\cdot )\) by a credibility mechanism. For this we recall that the augmented input tensor \({\varvec{x}}^+_{1:T+1}\) considers the (embedded) covariates \({\varvec{x}}=(x_t)_{t=1}^T\) and their positional encodings \(({\varvec{e}}^{\textrm{pos}}(t))_{t=1}^T\) in the first T components, and it is complemented by the CLS tokens \({\varvec{c}}={\varvec{x}}^+_{T+1} \in {{\mathbb {R}}}^{2b}\) in the \((T+1)\)-st component of the augmented input tensor \({\varvec{x}}^+_{1:T+1}\), see (2.2).
At this stage, i.e., before processing the CLS tokens through the Transformer encoder, the CLS tokens do not carry any covariate specific information. This is because before applying the attention mechanism, these CLS tokens \({\varvec{c}}={\varvec{x}}^+_{T+1}\) have not interacted with the covariates; and they are going to play the role of a global prior parameter in the following credibility considerations. For this purpose, we will extract the CLS tokens before the Transformer processing. After the Transformer processing, these CLS tokens have interacted with the covariates through the attention mechanism (2.5)–(2.7), and we are going to extract them a second time after this interaction, providing an embedded covariate summary.
The first version of the CLS token is extracted from the value matrix \(V=[ {\varvec{v}}_1, \ldots , {\varvec{v}}_{T+1}]^\top \). Importantly, to get this CLS token to be in the same embedding space as the outputs of the Transformer encodings, we need to process this first version of the CLS token with exactly the same operations (with the same weights) as in the Transformer (2.9), except that we do not need to apply the attention mechanism nor time-distribute the layers. This provides us with the following prior information
$$\begin{aligned} {\varvec{c}}^{\textrm{prior}} = \left( {\varvec{z}}^{\textrm{norm2}}\circ {\varvec{z}}^{\textrm{drop2}}\circ {\varvec{z}}^{\textrm{FNN2}}\circ {\varvec{z}}^{\textrm{drop1}}\circ {\varvec{z}}^{\textrm{FNN1}}\circ {\varvec{z}}^{\textrm{norm1}}\right) \left( {\varvec{v}}_{T+1}\right) ~\in ~ {{\mathbb {R}}}^{2b},\nonumber \\ \end{aligned}$$
(2.10)
with (2.9) and (2.10) sharing the identical weights.
Second, we extract the CLS token after processing through the Transformer encoder (2.9), providing us with the tokenized information of the covariates and their positional embeddings
$$\begin{aligned} {\varvec{c}}^{\textrm{trans}} = {\varvec{z}}_{T+1}^{\textrm{trans}}({\varvec{x}}^+_{1:T+1})~\in ~ {{\mathbb {R}}}^{2b}, \end{aligned}$$
(2.11)
i.e., this is the \((T+1)\)-st row of the Transformer output (2.9). In this case, the CLS token has had the attention mechanism applied, and we emphasize that this version of the CLS token now contains a summary of the covariate information.
These two tokens (2.10) and (2.11) give us two different predictors for the responses, the former representing only prior information, and the latter, the prior information augmented by the covariates. We will use both of these tokens to make predictions from the Credibility Transformer by assigning weights to these two representations.
This is done by selecting a fixed probability weight \(\alpha \in (0,1)\), and then, during gradient descent training, sampling independent Bernoulli random variables \(Z \sim \textrm{Bernoulli}(\alpha )\) that encode which CLS token is forwarded to the rest of the network to make predictions, by setting
$$\begin{aligned} {\varvec{c}}^{\textrm{cred}}= Z\, {\varvec{c}}^{\textrm{trans}} + \left( 1-Z\right) {\varvec{c}}^{\textrm{prior}}~\in ~ {{\mathbb {R}}}^{2b}. \end{aligned}$$
(2.12)
Thus, in \(\alpha \cdot 100\%\) of the gradient descent steps we forward the Transformer processed CLS token \({\varvec{c}}^{\textrm{trans}}\) which has interacted with the covariates, and in \((1-\alpha )\cdot 100\%\) of the gradient descent steps we forward the prior value CLS token \({\varvec{c}}^{\textrm{prior}}\). This can be seen as assigning a credibility of \(\alpha \) to the Transformer token \({\varvec{c}}^{\textrm{trans}}\) and the complementary credibility of \(1-\alpha \) to the prior value \({\varvec{c}}^{\textrm{prior}}\) of that CLS token, to receive reasonable network parameters during gradient descent training. This mechanism (2.12) is only applied during training, and for forecasting we set \(Z \equiv 1\). The probability \(\alpha \in (0,1)\) is treated as a hyperparameter that can be optimized by a grid search. We selected it bigger than 1/2, because we would like to put more emphasis on the tokenized covariate information. This procedure (2.12) shares similarity with the attention drop-out of Zehui et al. [39]. In our construction, we set the entire attention matrix to zero if \(Z=0\), and in that case we only forward the prior CLS token \({\varvec{c}}^{\textrm{prior}}\). As we will see below, a main benefit of this mechanism is a better network training result, as this version of the gradient descent algorithm allows for more gradient descent steps before (in-sample) overfitting compared to the standard procedure. Typically, this leads to better predictive models as more systematic structure can be extracted from the (noisy) data.
We call \({\varvec{c}}^{\textrm{prior}}\) the prior CLS token because it plays the role of the (collective, prior) global portfolio mean in the Bühlmann–Straub [5] sense. We will not input any prior information into the learning procedure, but we rather let the prior CLS token \({\varvec{c}}^{\textrm{prior}}\) learn the global portfolio average in an empirical Bayesian way. That is, in every gradient descent step, the terms with \(Z=0\) will contribute to this global portfolio average learning. For gradient descent training, we initialize this prior token randomly with a uniform distribution.
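The gating (2.12) can be written in a few lines; the sketch below uses a single Bernoulli draw per call (one per gradient descent step) and fixes \(Z=1\) at prediction time, with \(\alpha = 0.9\) as in Sect. 3.3. The function name and signature are illustrative.

```python
import tensorflow as tf

# Sketch of the credibility mechanism (2.12): during training, a Bernoulli(alpha)
# draw decides whether the Transformer-processed CLS token c_trans or the prior
# CLS token c_prior is forwarded; for forecasting, Z is fixed to 1.
def credibility_cls(c_trans, c_prior, alpha=0.9, training=True):
    if not training:
        return c_trans                                    # Z = 1 for prediction
    z = tf.cast(tf.random.uniform(()) < alpha, c_trans.dtype)   # Z ~ Bernoulli(alpha)
    return z * c_trans + (1.0 - z) * c_prior              # credibility weighted CLS token
```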

2.2.4 Rationale of credibility weighted CLS token

We explain the rationale of the proposed credibility weighted CLS token \({\varvec{c}}^{\textrm{cred}}\). There are two credibility mechanisms involved, a more obvious one that involves the hyperparameter \(\alpha \in (0,1)\), but, also, there is a second credibility mechanism involved in the CLS token which learns an optimal credibility weight. This latter credibility mechanism arises from the prior information learned by the CLS token as a result of training the network with the hyperparameter \(\alpha \). We now discuss these two credibility mechanisms.
First, when training the Credibility Transformer, the CLS token \({\varvec{c}}^{\textrm{cred}}\) will randomly be set to be either the token \({\varvec{c}}^{\textrm{trans}}\), including the covariate information, or the token \({\varvec{c}}^{\textrm{prior}}\), encoding only the prior information. In the latter case, the best prediction (in the sense of minimizing a deviance based loss function) that can be made without any covariate information will be the portfolio mean. Thus, in \((1-\alpha )\cdot 100\%\) of the training iterations of the Credibility Transformer, we will encourage the CLS token to incorporate prior information in the form of the global portfolio average. In the remaining iterations of the training, the CLS token will be trained on the covariate information to make more precise predictions than can be made using the portfolio average. In this sense, we can consider the CLS token \({\varvec{v}}_{T+1}\) before processing, i.e., as given in (2.10), representing the portfolio average in the embedding space of the rest of the tokens.
We now dive deeper into the attention mechanism as it relates to the CLS token. Let \({\varvec{a}}_{T+1}=(a_{T+1,1},\ldots , a_{T+1,T+1})^\top \in {{\mathbb {R}}}^{T+1}\) denote the \((T+1)\)-st row of the attention weight matrix A defined in (2.5). This row corresponds to the attention weights for the CLS token that forwards the extracted information. By using the softmax function, we have the normalization
$$\begin{aligned} \sum _{j=1}^{T+1} a_{T+1,j} = 1. \end{aligned}$$
We can interpret the last element of this vector, \(a_{T+1,T+1}\), as the probability assigned to the CLS token itself, and we set
$$\begin{aligned} P = a_{T+1,T+1}~\in ~ (0,1). \end{aligned}$$
Consequently, the remaining probability \(1-P\) is distributed among the covariate information
$$\begin{aligned} \sum _{j=1}^{T} a_{T+1,j} =1-P. \end{aligned}$$
This formulation allows us to interpret the attention mechanism in a way that is analogous to a linear credibility formula. Specifically, the attention output for the CLS token can be expressed as
$$\begin{aligned} {\varvec{v}}^{\textrm{trans}} = P \, {\varvec{v}}_{T+1} + (1-P) \, {\varvec{v}}^{\textrm{covariate}}, \end{aligned}$$
(2.13)
where \({\varvec{v}}^{\textrm{trans}}\) is the CLS embedding processed by the attention mechanism, in fact, this is the \((T+1)\)-st row of the attention head H given in (2.7), and the covariate information is a weighted sum of all values that consider the covariate information
$$\begin{aligned} {\varvec{v}}^{\textrm{covariate}} = \sum _{j=1}^{T}\, \frac{a_{T+1,j}}{1-P} \, {\varvec{v}}_{j}. \end{aligned}$$
(2.14)
This formulation demonstrates how the attention mechanism for the CLS token can be interpreted as a credibility weighted average of the CLS token’s own information (representing collective experience) and the information from the covariates (representing individual experience), see Fig. 2. Essentially, this is a Bühlmann–Straub [5] type linear credibility formula, or a time-dynamic version thereof, with the credibility weights in this context learned and being input-dependent; this is similar to Wüthrich [37, Chapter 5].
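Continuing the numpy attention sketch from Sect. 2.2.2, the decomposition (2.13)–(2.14) can be verified numerically:

```python
# Decompose the CLS row of the attention head into the credibility form
# (2.13)-(2.14), reusing A, V, H from the single-head attention sketch above.
a_cls = A[-1, :]                     # (T+1)-st row of the attention matrix
P = a_cls[-1]                        # credibility weight on the CLS token itself
v_covariate = (a_cls[:-1] / (1.0 - P)) @ V[:-1, :]      # (2.14)
v_trans = P * V[-1, :] + (1.0 - P) * v_covariate        # (2.13)
assert np.allclose(v_trans, H[-1, :])                   # equals the CLS row of H
```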

2.3 Decoding of the tokenized information

In the previous two sections we have encoded the covariate input \({\varvec{x}}\) in two steps to a tokenized variable \({\varvec{c}}^{\textrm{cred}}\):
$$\begin{aligned} {\varvec{x}}\qquad \mapsto \qquad {\varvec{x}}^+_{1:T+1}={\varvec{x}}^+_{1:T+1}({\varvec{x}})\qquad \mapsto \qquad {\varvec{c}}^{\textrm{cred}}={\varvec{c}}^{\textrm{cred}}({\varvec{x}}). \end{aligned}$$
The final step is to decode this tokenized variable \({\varvec{c}}^{\textrm{cred}}={\varvec{c}}^{\textrm{cred}}({\varvec{x}})\) so that it is suitable to predict a response variable Y. This decoder is problem-specific. Because our example below corresponds to a one-dimensional single-task prediction problem, we use a plain-vanilla FNN with a single output component to decode the credibilitized token \({\varvec{c}}^{\textrm{cred}}\). This results in the decoder
$$\begin{aligned} {\varvec{z}}^{(2:1)}: {{\mathbb {R}}}^{2b} \rightarrow {{\mathbb {R}}}, \qquad {\varvec{c}}^{\textrm{cred}}({\varvec{x}}) ~ \mapsto ~ {\varvec{z}}^{(2:1)}({\varvec{c}}^{\textrm{cred}}({\varvec{x}})), \end{aligned}$$
where \({\varvec{z}}^{(2:1)}\) is a shallow FNN that maps from \({{\mathbb {R}}}^{2b}\) to \({{\mathbb {R}}}\). Since our response Y is a non-negative random variable (we are going to model claims counts), we apply the exponential output activation, which altogether results in the Credibility Transformer (CT)
$$\begin{aligned} {\varvec{x}}~ \mapsto ~ \mu ^{\textrm{CT}}({\varvec{x}})= \exp \left\{ {\varvec{z}}^{(2:1)}({\varvec{c}}^{\textrm{cred}}({\varvec{x}}))\right\} ~>~0. \end{aligned}$$
(2.15)
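A hedged sketch of the decoder (2.15) in Keras, with the 16 hidden units and GELU activation reported in Sect. 3.3; the exponential output activation guarantees a strictly positive frequency.

```python
import tensorflow as tf

# Sketch of the FNN decoder (2.15): a shallow network mapping the credibility
# weighted CLS token in R^{2b} to a positive expected claims frequency.
def decoder(c_cred):
    h = tf.keras.layers.Dense(16, activation="gelu")(c_cred)   # hidden FNN layer
    log_mu = tf.keras.layers.Dense(1)(h)                        # linear read-out
    return tf.exp(log_mu)                                       # mu^CT(x) > 0
```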
The next section presents a real data example to exemplify the use of the Credibility Transformer.

3 Real data example

We benchmark the Credibility Transformer on the commonly used French motor third party liability (MTPL) claims count dataset of Dutang et al. [10].

3.1 Description of data

We perform the data pre-processing exactly as in Wüthrich–Merz [38], so that all our results can directly be benchmarked by the results in that reference, and they are also directly comparable to the ones in Brauer [3, Table 2]. For a detailed description of the French MTPL dataset we refer to Wüthrich–Merz [38, Section 13.1]; and the data cleaning and pre-processing is described in [38, Listings 13.1 and 5.2]. We use the identical split into learning and test datasets as in [38]. The learning dataset consists of 610,206 instances and the test dataset of 67,801 instances; we refer to [38, Section 5.2.4]. This partition provides an aggregated time exposure of 322,392 calendar years on the learning dataset, and 35,967 calendar years on the test dataset. There are 23,738 claim events on the learning data, providing an empirical frequency of \({\widehat{\lambda }}=7.35\%\), i.e., on average we observe an accident every \(13.58=1/{\widehat{\lambda }}\) calendar years. This is a commonly observed expected frequency in European MTPL insurance. On the test dataset we have 2,645 claim events, providing almost the same empirical frequency as on the learning dataset.

3.2 Input tokenizer

There are \(T_1=4\) categorical covariates with numbers of levels \(n_t=6, 2, 11\) and 22, and there are \(T_2=5\) continuous covariates, providing \(T=T_1+T_2=9\) covariates. For embedding and tokenizing these input covariates we select an embedding dimension of \(b=5\). This provides us with \(\varrho ^{\textrm{input}} = 405\) weights from the feature tokenizer giving us the raw input tensor \({\varvec{x}}_{1:9}^\circ \in {{\mathbb {R}}}^{9\times 5}\). For the continuous input variable tokenizer we select the linear activation function in the first FNN layers \({\varvec{z}}_t^{(1)}\) and the hyperbolic tangent activation function in the second FNN layers \({\varvec{z}}_t^{(2)}\).

3.3 Description of the selected credibility transformer architecture

Table 1 summarizes the Credibility Transformer architecture used.
The total number of weights that need to be fitted is 1,746. The Credibility Transformer uses 1,073 weights: 330 weights from the time-distributed layers providing the queries \({\varvec{q}}_t\), keys \({\varvec{k}}_t\) and values \({\varvec{v}}_t\), 682 weights from the two time-distributed FNN layers \({\varvec{z}}^{\mathrm{t-FNN1}}\) and \({\varvec{z}}^{\mathrm{t-FNN2}}\) having 32 and \(2b=10\) neurons, respectively, and the remaining weights are from the normalization layers. The decoder \({\varvec{z}}^{(2:1)}\) uses 193 weights coming from a first FNN layer having 16 neurons and the output layer, resulting in the one-dimensional real-valued positive predictor (2.15). In all hidden layers we use the Gaussian error linear unit (GELU) activation function, which takes the form \(x \in {{\mathbb {R}}}\,\mapsto \, x\Phi (x)\) with the standard Gaussian distribution \(\Phi \). The credibility parameter is set to \(\alpha =90\%\); this is an optimal value found by a grid search. For the drop-out layers we choose a drop-out rate of 1%. Both the credibility mechanism and the drop-out are only used during network fitting, and for prediction we set \(\alpha =1\) and the drop-out rate to zero.

3.4 Gradient descent network fitting: 1st version

3.4.1 Plain-vanilla gradient descent fitting

The Poisson deviance loss is used for claims counts model fitting. Minimizing the Poisson deviance loss is equivalent to maximum likelihood estimation (MLE) under a Poisson assumption. The average Poisson deviance loss is given by, see Wüthrich–Merz [38, formula (5.28)],
$$\begin{aligned} \frac{2}{n}\sum _{i=1}^n \left[ w_i \mu ^{\textrm{CT}}({\varvec{x}}_i) - Y_i - Y_i \log \left( \frac{w_i \mu ^{\textrm{CT}}({\varvec{x}}_i)}{Y_i}\right) \right] ~\ge ~0, \end{aligned}$$
(3.1)
where \(Y_i\) are the observed numbers of claims on policy i, \({w_i}>0\) is the time exposure of policy i, and \(\mu ^{\textrm{CT}}({\varvec{x}}_i)\) is the estimated expected claims frequency of that policy received from the Credibility Transformer (2.15). For \(Y_i=0\), the term under the summation in (3.1) is set equal to \(w_i \mu ^{\textrm{CT}}({\varvec{x}}_i)\); this agrees with Wüthrich–Merz [38, Remark 4.4]. The sum runs over all instances (insurance policies) \(1\le i \le n\) that are included in the learning data. Note that the Poisson deviance loss is a strictly consistent loss function for mean estimation which is an important property that selected loss functions should possess for mean estimation; see Gneiting–Raftery [13] and Gneiting [12].
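For reference, a minimal TensorFlow sketch of the average Poisson deviance loss (3.1), with the zero-claim convention handled explicitly; the tensor names are illustrative and the exposures \(w_i\) are assumed to be passed alongside the claim counts.

```python
import tensorflow as tf

# Sketch of the average Poisson deviance loss (3.1): y are observed claim
# counts, mu the predicted frequencies and w the time exposures; the summand
# of a zero-claim policy reduces to w_i * mu(x_i) as described in the text.
def poisson_deviance(y, mu, w):
    lam = w * mu                                               # expected number of claims
    log_ratio = tf.math.log(tf.where(y > 0, y / lam, tf.ones_like(y)))
    return 2.0 * tf.reduce_mean(lam - y + y * log_ratio)
```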
In our first fitting approach we use the nadam version of stochastic gradient descent (SGD) with its pre-selected parametrization implemented in the keras package [6]. For SGD we use a batch size of \(2^{10}=1,024\) instances, and since neural networks are prone to overfitting, we exploit an early stopping strategy by partitioning the learning data at random into a training dataset and a validation dataset at a ratio of 9:1. We use the standard callback of keras [6] using its pre-selected parametrization to retrieve the learned weights with the smallest validation loss. Optimal network weights were found in less than 200 SGD epochs.
Table 2
Number of parameters, in-sample and out-of-sample Poisson deviance losses (units are in \(10^{-2}\)); benchmark models in the upper part of the table are taken from [38, Table 7.9]

| Model | # Param | In-sample Poisson loss | Out-of-sample Poisson loss |
|---|---|---|---|
| Poisson null | 1 | 25.213 | 25.445 |
| Poisson GLM3 | 50 | 24.084 | 24.102 |
| Poisson plain-vanilla FNN | 792 | 23.728 (± 0.026) | 23.819 (± 0.017) |
| Ensemble Poisson plain-vanilla FNN | 792 | 23.691 | 23.783 |
| Credibility Transformer: nadam | 1,746 | 23.648 (± 0.071) | 23.796 (± 0.037) |
| Ensemble Credibility Transformer: nadam | 1,746 | 23.565 | 23.717 |
| Credibility Transformer: NormFormer | 1,746 | 23.641 (± 0.053) | 23.788 (± 0.040) |
| Ensemble Credibility Transformer: NormFormer | 1,746 | 23.562 | 23.711 |
Neural network fitting involves several elements of randomness such as the (random) initialization of the SGD algorithm; see Wüthrich–Merz [38, Section 7.4.4]. To improve the robustness of the results we always run 20 SGD fittings with different random initializations, and the reported results correspond to averaged losses over the 20 different SGD fittings; in round brackets we state the standard deviation of the losses over these 20 SGD fittings.
Finally, in Richman–Wüthrich [29] we have seen that one can substantially improve predictive performance by ensembling over different SGD runs. On average it takes 10 to 20 SGD fittings to get good ensemble predictors on this data, see Wüthrich–Merz [38, Figure 7.19]. For this reason, in the last step of our procedure, we consider the ensemble predictor over the 20 different SGD runs. From our results it can be verified that ensembling significantly improves predictive models.
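A hedged sketch of this ensembling step follows: the ensemble predictor simply averages the predicted frequencies of the independently fitted networks; `models` and `X` are illustrative placeholders for the fitted predictors and the pre-processed covariates.

```python
import numpy as np

# Average the predictions of independently trained networks (ensembling over
# different SGD runs); each element of `models` is assumed to expose a
# Keras-style predict method.
def ensemble_predict(models, X):
    return np.mean([m.predict(X) for m in models], axis=0)
```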

3.4.2 Results of 1st fitting approach

The results are given in Table 2 (rows nadam). These results of the Credibility Transformer are benchmarked against the ones in Wüthrich–Merz [38, Tables 7.4–7.7 and 7.9] and Brauer [3, Tables 2 and 4]. We conclude that the Credibility Transformer clearly outperforms any of these other proposals out-of-sample; even the ensembled plain-vanilla FNN model is not much better than a single run of the Credibility Transformer. By building the ensemble predictor of the Credibility Transformers we receive a significantly better model providing an out-of-sample average Poisson deviance loss of \(23.717 \cdot 10^{-2}\), see Table 2. The best ensemble predictor in Brauer [3, Table 4], called \(\hbox {CAFFT}_{\textrm{def}}\), has an average Poisson deviance loss of \(23.726 \cdot 10^{-2}\).
Remark 3.1
The fitted Credibility Transformer will not automatically fulfill the balance property, though the prior value CLS token \({\varvec{c}}^{\textrm{prior}}\) serves to adjust to the overall portfolio level. The failure of the balance property is mainly a deficiency of the gradient descent method for model fitting. We early stop the algorithm before it reaches a critical point of the loss function and, therefore, also under a canonical link choice, the balance property typically fails to hold; see Wüthrich [36]. If one works with the canonical link for the readout in (2.15), the balance property can be rectified by a simple additional GLM-step in the output layer, see Wüthrich [36], and under non-canonical link choices, the bias parameter in the output can be adjusted to rectify the balance property, see Lindholm–Wüthrich [23]. The results in Table 2 do not consider this additional balance correction step.

3.5 NormFormer gradient descent fitting: 2nd version

Standard SGD optimizers such as nadam with a standard early stopping callback do not work particularly well on Transformers because of gradient magnitude mismatches; typically, gradients in earlier layers are much bigger than gradients in later layers; see Shleifer et al. [34]. Therefore, in our 2nd fitting attempt we exploit the NormFormer proposed by Shleifer et al. [34] with the adam version of SGD with a learning rate of 0.002 and \(\beta _2=0.98\); reducing this parameter of adam has been suggested, e.g., in Zhai et al. [40]. The optimal network weights were reached in less than 200 SGD epochs. The results are given in the lower part of Table 2. We observe a slight improvement in prediction accuracy, the Credibility Transformer ensemble having a smaller out-of-sample average Poisson deviance loss of \(23.711 \cdot 10^{-2}\).
Figure 3 (lhs) gives a scatterplot that compares one run of the Credibility Transformer predictions (NormFormer SGD fitting) against the GLM predictions (out-of-sample). The red line gives a GAM smoother. On average the two predictors are rather similar, except in the tails. However, the black cloud of predictors has individual points that substantially differ from the blue diagonal line, indicating that there are bigger differences on an individual insurance policy scale. This is verified by the density plot of the individual log-differences of the predictors on the right-hand side of Fig. 3.
Figure 4 compares the two ensemble Credibility Transformer predictions obtained from the NormFormer SGD fitting procedure and the nadam fitting procedure, the former slightly outperforming the latter in terms of out-of-sample loss. The figure shows that the individual predictions lie fairly close to the diagonal line, indicating that there are no major individual differences. This supports the robustness of the fitting procedure.
We emphasize that the credibility mechanism (2.12) is only applied during SGD training. For prediction, we turn this credibility mechanism off by setting \(\alpha =100\%\), resulting in weights \(Z\equiv 1\) in (2.12); thus, only the observation based CLS token \({\varvec{c}}^{\textrm{trans}}\) is considered for prediction. We may ask whether the network has also learned any structure for the prior CLS token \({\varvec{c}}^{\textrm{prior}}\). This can be checked by setting \(\alpha =0\%\), resulting in weights \(Z\equiv 0\) in (2.12). Using the resulting predictor we receive a constant prediction of \({\widehat{\lambda }}=7.35\%\), i.e., the homogeneous model without considering any covariate information. Of course, this is expected by the design of the network architecture. However, it is still worth verifying this property to ensure that the network architecture works as expected. If we set \(Z\equiv 90\%\), the same rate as used for network fitting, we receive a bigger out-of-sample loss of \(23.727 \cdot 10^{-2}\), not supporting a credibility robustification of the predictor once properly trained.

4 Improving the credibility transformer

We now work to further improve the Credibility Transformer. The following modifications can be categorized into three main areas. First, we augment the Transformer with multi-head attention and utilize a deep version of the architecture. Second, we use insights from the LLM literature to more flexibly handle the inputs to the model by using gated layers. Third, we modify the inputs to the network by using a differentiable continuous covariate embedding layer based on a novel idea of Gorishniy et al. [15]. Through applying these changes, we achieve a very strong out-of-sample performance of the Credibility Transformer. We start by discussing each of these changes and then return to demonstrate the performance of the improved architecture.

4.1 Multi-head attention and deep transformer architecture

4.1.1 Multi-head attention

In Sect. 2.2.2, we described a Transformer model with a single attention head, see (2.7). An important extension of this simpler model is a multi-head attention (MHA), which applies the attention mechanism (2.5)–(2.7) with multiple independent copies of the query, key and value projections of the input data. This allows the model to attend to information from different learned representations. After learning these multiple representations, these are projected back to the original dimension of the model, which is 2b in our case.
Formally, we choose the number of attention heads \(M \in {{\mathbb {N}}}\). The MHA is defined as
$$\begin{aligned} H^{\text {MHA}}({\varvec{x}}^+_{1:T+1}) = \text {Concat}(H_1, \ldots , H_M) W^O ~\in ~ {{\mathbb {R}}}^{(T+1) \times 2b}, \end{aligned}$$
(4.1)
where we concatenate multiple attention heads to a tensor
$$\begin{aligned} \text {Concat}(H_1, \ldots , H_M) = [H_1, H_2, \ldots , H_M] ~\in ~ {{\mathbb {R}}}^{(T+1) \times (M d)}, \end{aligned}$$
with attention heads \(H_m \in {{\mathbb {R}}}^{(T+1) \times d}\) of dimension \(d \in {{\mathbb {N}}}\), for \(1 \le m \le M\), and output projection matrix \(W^O \in {{\mathbb {R}}}^{(M d) \times 2b}\). Typically, one chooses \(M d = 2b\); in other words, the total model dimension \(2b\) is allocated proportionally to the M attention heads.
For each head, \(1 \le m \le M\), we use the same formulation as above for the attention operation, with attention head specific queries, keys and values
$$\begin{aligned} H_m= & A_m \, V_m ~\in ~ {{\mathbb {R}}}^{(T+1) \times d}, \\ A_m= & \textsf{softmax}\left( \frac{Q_m K_m^\top }{\sqrt{d}}\right) ~\in ~ {{\mathbb {R}}}^{(T+1) \times (T+1)}, \\ Q_m= & \phi \left( {\varvec{b}}^{(m)}_Q + {\varvec{x}}^+_{1:T+1} W_Q^{(m)} \right) ~\in ~ {{\mathbb {R}}}^{(T+1) \times d}, \\ K_m= & \phi \left( {\varvec{b}}^{(m)}_K + {\varvec{x}}^+_{1:T+1} W_K^{(m)}\right) ~\in ~ {{\mathbb {R}}}^{(T+1) \times d}, \\ V_m= & \phi \left( {\varvec{b}}^{(m)}_V + {\varvec{x}}^+_{1:T+1} W_V^{(m)} \right) ~\in ~ {{\mathbb {R}}}^{(T+1) \times d}. \end{aligned}$$
At this point, we have described a standard implementation of a MHA. To improve the stability of fitting this model we again follow Shleifer et al. [34] by adding a multiplicative scaling coefficient to each head before it enters into (4.1), i.e., we update this to
$$\begin{aligned} H^{\text {MHA}}({\varvec{x}}^+_{1:T+1}) = \text {Concat}(\alpha _1 H_1, \ldots , \alpha _M H_M) W^O ~\in ~ {{\mathbb {R}}}^{(T+1) \times 2b}, \end{aligned}$$
(4.2)
where \(\alpha _m \in (0,1]\), for \(1 \le m \le M\). During training, we apply a drop-out mechanism to \(\alpha _m\) to improve the out-of-sample performance of the network. After this MHA modification, the rest of the Transformer layer encoding used here is the same as in Sects. 2.2.2 and 2.2.3.
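A minimal NumPy sketch of the scaled multi-head attention (4.2) is given below; the parameter containers, the choice of activation \(\phi = \tanh\) and all variable names are assumptions of this sketch, and the drop-out applied to the scaling coefficients \(\alpha_m\) during training is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, head_params, W_O, head_scales, phi=np.tanh):
    # x has shape (T+1, 2b); head_params is a list of per-head dicts holding the
    # query, key and value weights and biases; all names are illustrative.
    heads = []
    for p, alpha_m in zip(head_params, head_scales):
        Q = phi(p["b_Q"] + x @ p["W_Q"])           # (T+1, d)
        K = phi(p["b_K"] + x @ p["W_K"])           # (T+1, d)
        V = phi(p["b_V"] + x @ p["W_V"])           # (T+1, d)
        d = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d))          # (T+1, T+1) attention weights
        heads.append(alpha_m * (A @ V))            # per-head scaling alpha_m, eq. (4.2)
    return np.concatenate(heads, axis=-1) @ W_O    # output projection back to (T+1, 2b)
```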

4.1.2 Deep credibility transformer

We compose multiple Transformer layers to create a deep version of the model applied above, assuming that we have Transformer layers \(1 \le \ell \le L\), for \(L\in {{\mathbb {N}}}\). To compose these layers, while retaining the ability to use the credibility mechanism (2.12), we need to modify the inputs to the Transformer layers. The first layer is exactly the same as presented above, producing outputs \({\varvec{z}}^{\textrm{trans, 1}}({\varvec{x}}^+_{1:T+1})\), including the CLS token \({\varvec{c}}^{\textrm{trans, 1}}\), which could be stripped out at this point if needed, and \({\varvec{c}}^{\textrm{prior, 1}}\), where the superscript \(^1\) indicates the outputs of the first Transformer layer.
For the second and subsequent layers, \(2 \le \ell \le L\), we input \({\varvec{z}}^{\textrm{trans}, \ell -1}\) and \({\varvec{c}}^{\textrm{prior}, \ell -1}\) recursively into layer specific versions of (2.9)
$$\begin{aligned} {\varvec{z}}^{\textrm{trans}, \ell }({\varvec{x}}^+_{1:T+1})= & {\varvec{z}}^{\textrm{skip1},\ell }({\varvec{z}}^{\textrm{trans}, \ell -1}({\varvec{x}}^+_{1:T+1})) \nonumber \\ & +~ \left( {\varvec{z}}^{\textrm{norm2}, \ell }\circ {\varvec{z}}^{\textrm{drop2}, \ell }\circ {\varvec{z}}^{\mathrm{t-FNN2}, \ell }\circ {\varvec{z}}^{\textrm{drop1}, \ell }\circ {\varvec{z}}^{\mathrm{t-FNN1}, \ell }\circ {\varvec{z}}^{\textrm{norm1}, \ell }\right) \nonumber \\ & \qquad \circ ~ \left( {\varvec{z}}^{\textrm{skip1}, \ell }({\varvec{z}}^{\textrm{trans}, \ell -1}({\varvec{x}}^+_{1:T+1}))\right) ~\in ~ {{\mathbb {R}}}^{(T+1) \times 2b}, \end{aligned}$$
(4.3)
and (2.10)\(^{4}\)
$$\begin{aligned} {\varvec{c}}^{\textrm{prior}, \ell } & = \left( {\varvec{z}}^{\textrm{norm2}, \ell }\circ {\varvec{z}}^{\textrm{drop2,} \ell }\circ {\varvec{z}}^{\textrm{FNN2}, \ell }\circ {\varvec{z}}^{\textrm{drop1}, \ell }\circ {\varvec{z}}^{\textrm{FNN1}, \ell }\circ {\varvec{z}}^{\textrm{norm1}, \ell }\right) \left( {\varvec{c}}^{\textrm{prior}, \ell -1}\right) \nonumber \\ & \in ~ {{\mathbb {R}}}^{2b}. \end{aligned}$$
(4.4)
Finally, the credibility weighting between the outputs is performed
$$\begin{aligned} {\varvec{c}}^{\textrm{cred, L}}= Z\, {\varvec{c}}^{\textrm{trans, L}} + \left( 1-Z\right) {\varvec{c}}^{\textrm{prior, L}}~\in ~ {{\mathbb {R}}}^{2b}. \end{aligned}$$
(4.5)
The same rationale of linear credibility as given in Sect. 2.2.4 applies here too; in each attention head (a projection of) the prior information in the CLS token is reweighted with the (projected) covariate information being processed there. For an illustration, see Fig. 14 in the appendix.
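Schematically, the deep composition (4.3)–(4.5) can be summarized as follows; `transformer_layer` and `prior_mlp` are placeholders for the layer-specific maps in (4.3) and (4.4), so this is a structural sketch only, not the authors' code.

```python
def deep_credibility_transformer(x_tokens, c_prior, layers, Z):
    # x_tokens has shape (T+1, 2b) with the CLS token in the last row;
    # layers is a list of L dicts holding the layer-specific sub-networks.
    z = x_tokens
    for layer in layers:
        z = layer["transformer_layer"](z)        # all tokens updated, eq. (4.3)
        c_prior = layer["prior_mlp"](c_prior)    # prior CLS token updated, eq. (4.4)
    c_trans = z[-1]                              # CLS token after the last layer
    return Z * c_trans + (1.0 - Z) * c_prior     # credibility weighting, eq. (4.5)
```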

4.2 Gated layers

The model presented to this point uses a standard FNN to process the outputs of the attention mechanism. Modern LLMs usually rely on a more advanced mechanism incorporating a gating principle for the FNN; this has been introduced to the LLM literature by Dauphin et al. [7] with the concept of a Gated Linear Unit (GLU). The main difference between the usual FNNs and a GLU is that not all inputs to a FNN have equal importance, and, indeed, some of the less important inputs should be down-weighted or even removed from the computations performed by the network. A GLU accomplishes this by adding a second FNN with a sigmoid activation function, which produces outputs between 0 and 1, to the usual FNN, and then combining the two by an (element-wise) Hadamard product \(\odot \), i.e.,
$$\begin{aligned} {\varvec{z}}^{\textrm{GLU}}({\varvec{x}}) = {\varvec{z}}^{\textrm{FNN}_{\textrm{sigmoid}}}({\varvec{x}})\odot {\varvec{z}}^{\textrm{FNN}_{\textrm{linear}}}({\varvec{x}}), \end{aligned}$$
(4.6)
where the GLU layer \({\varvec{z}}^{\textrm{GLU}}\) is the Hadamard product of a standard affine projection \({\varvec{z}}^{\mathrm{FNN_{linear}}}\) and a FNN layer with sigmoid activation \({\varvec{z}}^{\mathrm{FNN_{sigmoid}}}\). Beyond this simple formulation of the GLU, modern LLMs usually follow Shazeer [33], who replaces the sigmoid activation in (4.6) with a different activation function. A particularly popular choice is the sigmoid linear unit (SiLU) activation function, due to Elfwing et al. [11], defined as
$$\begin{aligned} \textrm{SiLU}(x) = x \odot \sigma (x). \end{aligned}$$
The SiLU-GLU layer, abbreviated to SwiGLU for Swish GLU, see Shazeer [33], is defined as
$$\begin{aligned} {\varvec{z}}^{\textrm{SwiGLU}}({\varvec{x}}) = {\varvec{z}}^{\mathrm{FNN_{SiLU}}}({\varvec{x}}) \odot {\varvec{z}}^{\textrm{FNN}_{\textrm{linear}}}({\varvec{x}}). \end{aligned}$$
In our implementation, we replace the first FNNs in equations (4.3) and (4.4) with these SwiGLU layers. This modification allows for more complex interactions between features and improves the model's ability to down-weight less important components of the input data.
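As an illustration, a SwiGLU layer can be sketched in a few lines of NumPy; the weight and bias names are placeholders and the dimensions are left generic.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x), see Elfwing et al. [11].
    return x / (1.0 + np.exp(-x))

def swiglu(x, W_gate, b_gate, W_lin, b_lin):
    # Element-wise (Hadamard) product of a SiLU-activated projection (the gate)
    # and a linear projection, as in the SwiGLU layer of Shazeer [33].
    return silu(x @ W_gate + b_gate) * (x @ W_lin + b_lin)
```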

4.3 Improving the continuous covariate embedding

The final major modification to the Credibility Transformer is to improve the continuous covariate embedding (2.1) by replacing the simple two-layer FNN with a more advanced approach. We start with the Piecewise Linear Encoding (PLE) of Gorishniy et al. [14], which encodes continuous covariates using a numerical approach that is, in principle, similar to an extension of one-hot encoding to continuous variables. The main idea of PLE is to split the range of each continuous covariate into bins, and then encode the covariate value based on which bin it falls into. This provides a more expressive representation compared to the original scalar values.
Formally, for the t-th continuous covariate of the \(T_2\) continuous covariates available in the dataset, we partition its range of values into \(B_t \in {{\mathbb {N}}}\) disjoint intervals, called bins, \({{\mathfrak {B}}}_t^j = [b_{t}^{j-1}, b_t^j)\), for \(1\le j \le B_{t}\), and with \(b_{t}^{j-1}< b_t^j\). The PLE encoding of covariate value \(x_t\) is then defined by the vector
$$\begin{aligned} \textrm{PLE}_t(x_t) = \left( e^{\textrm{PLE}}_1, \ldots , e^{\textrm{PLE}}_{B_t}\right) ^\top ~\in ~{{\mathbb {R}}}^{B_t}, \end{aligned}$$
where for \(1\le j \le B_{t}\)
$$\begin{aligned} e^{\textrm{PLE}}_j ~=~ e^{\textrm{PLE}}_j(x_t) ~=~ \frac{x_t - b_t^{j-1}}{b_t^j - b_t^{j-1}}\, \mathbbm {1}_{\{x_t \in {\mathfrak B}_t^j\}} +\mathbbm {1}_{\{x_t \ge b_t^j\}}. \end{aligned}$$
The j-th component of the resulting \(B_t\)-dimensional vector \(\textrm{PLE}_t(x_t)\) equals one if \(x_t \ge b_t^{j}\), equals zero if \(x_t < b_t^{j-1}\), and is obtained by linear interpolation for \(x_t \in {{\mathfrak {B}}}_t^j\). This encoding has several desirable properties, among which we note that it provides a more expressive yet lossless representation of the original scalar values that preserves the ordinal nature of continuous covariates. In the original proposal of Gorishniy et al. [14], the bin boundaries are determined either using quantiles of the observed values of the covariate components \(x_t\) or they are derived from a decision tree. In contrast to these approaches, we introduce a differentiable version of PLE that allows for end-to-end training of the bin boundaries. This approach enables the model to learn optimal bin placements for each continuous covariate during the training process. The key idea is to parameterize the bin boundaries indirectly through trainable weights that define the length of each bin. The bin boundaries \((b_t^j)_{j=0}^{B_t}\) are computed as
$$\begin{aligned} b_t^j = s_t + \sum _{k=0}^j \delta _t^k, \end{aligned}$$
where \(s_t\in {{\mathbb {R}}}\) is a given (fixed) starting value, and \(\delta _t^k>0\) are learnable bin lengths. To ensure numerical stability and enforce non-negative bin widths, the trainable parameters are the logarithms \(\log (\delta _t^k)\) of the bin lengths, which are recovered via
$$\begin{aligned} \delta _t^k = \exp (\log (\delta _t^k)). \end{aligned}$$
The logged bin lengths \(\log (\delta _t^k)\) are initialized randomly or from pre-computed initial bins using quantiles. Finally, to handle potential numerical issues and to ensure a minimum bin length, we introduce a threshold \(\epsilon >0\)
$$\begin{aligned} \delta _t^k ~\leftarrow ~\delta _t^k \,\mathbbm {1}_{\{\delta _t^k \ge \epsilon \}}. \end{aligned}$$
Thus, if the size of the learned bin is less than \(\epsilon \), we collapse the bin into the previous one. This differentiable PLE allows the model to adapt the bin boundaries during training, which, together with the embedding described next, allows for more effective feature representations to be learned.
To use the PLE in the Transformer architecture, following Gorishniy et al. [14], we add a trainable FNN, \({\varvec{z}}_t^{\textrm{FNN}}\), with a hyperbolic tangent activation after the PLE to produce the final feature embedding
$$\begin{aligned} x_t~\mapsto ~ {\varvec{e}}^{\textrm{NE}}_t(x_t) = \left( {\varvec{z}}_t^{\textrm{FNN}} \circ \textrm{PLE}_t\right) (x_t) ~\in ~ {{\mathbb {R}}}^b, \end{aligned}$$
where \({\varvec{e}}_t^{\textrm{NE}}\) is the numerical embedding (NE) from the PLE layer. We expect that the PLE followed by a hyperbolic tangent layer is more effective than a simple two-layer FNN for continuous covariate embeddings since, unlike an FNN which applies a global transformation to its inputs, this numerical embedding captures local patterns within the different intervals of the covariate's range. This is because the FNN with a hyperbolic tangent activation expands the information contained in the PLE into a fixed-size embedding \({\varvec{e}}_t^{\textrm{NE}} \in {{\mathbb {R}}}^b\), which varies linearly within bins but non-linearly across different bins. A diagram of the differentiable PLE is shown in Fig. 5.
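The following NumPy sketch summarizes the differentiable PLE and the subsequent hyperbolic tangent embedding for a single continuous covariate; the starting value, the minimum bin length, the small numerical guard and all weight names are illustrative, and in the actual model the logged bin lengths and the FNN weights are trainable parameters.

```python
import numpy as np

def bin_boundaries(s_t, log_deltas, eps=1e-3):
    # Bin boundaries b_t^j = s_t + sum_{k <= j} delta_t^k with delta_t^k = exp(log delta_t^k);
    # bins shorter than eps are collapsed by setting their length to zero.
    deltas = np.exp(log_deltas)
    deltas = np.where(deltas >= eps, deltas, 0.0)
    return s_t + np.cumsum(deltas)               # boundaries b_t^0, ..., b_t^{B_t}

def ple_encode(x_t, boundaries):
    # Component j is 0 below bin j, linearly interpolated inside bin j, and 1 above it;
    # the maximum() call is only a numerical guard for collapsed (zero-width) bins.
    lo, hi = boundaries[:-1], boundaries[1:]
    return np.clip((x_t - lo) / np.maximum(hi - lo, 1e-12), 0.0, 1.0)

def numerical_embedding(x_t, boundaries, W, b_vec):
    # PLE followed by a hyperbolic tangent layer, mapping x_t to an embedding in R^b.
    return np.tanh(ple_encode(x_t, boundaries) @ W + b_vec)
```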

4.4 Other modifications

Here, we briefly mention some of the other less complex modifications made to the Credibility Transformer. Two of these are inspired by Holzmüller et al. [19]. First, continuous inputs to the network are scaled by subtracting the median of the data and then dividing by the inter-quartile range; this provides inputs that are more robust to outliers. Second, we allow for learned feature selection by scaling each embedding by a constant in the range (0, 1] before these enter the Transformer. We initialize all FNN layers that are followed by the GeLU activation function using the scheme of He et al. [18], and we use the AdamW optimizer of Loshchilov–Hutter [25] with a weight decay of 0.02. Finally, we set \(\beta _2=0.95\) in the optimizer, which is a best practice for stabilizing the optimization of Transformer architectures trained with large batches and Adam-type optimizers, see Zhai et al. [40].
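As an illustration of these smaller modifications, the snippet below combines robust input scaling, He initialization for a GeLU layer and the AdamW settings mentioned above; the placeholder data, the layer width and the learning rate are assumptions of this sketch and do not reproduce the authors' configuration.

```python
import numpy as np
import keras
from sklearn.preprocessing import RobustScaler

# Placeholder design matrix of continuous covariates (illustrative only).
X_cont = np.random.rand(1000, 9)

# Robust scaling: subtract the median and divide by the inter-quartile range.
X_cont_scaled = RobustScaler().fit_transform(X_cont)

# He initialization for an FNN layer followed by the GeLU activation
# (the width of 64 is illustrative).
dense = keras.layers.Dense(64, activation="gelu",
                           kernel_initializer=keras.initializers.HeNormal())

# AdamW optimizer with decoupled weight decay 0.02 and beta_2 = 0.95;
# the learning rate is an assumption of this sketch.
optimizer = keras.optimizers.AdamW(learning_rate=1e-3,
                                   weight_decay=0.02, beta_2=0.95)
```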

4.5 Results of the improved deep credibility transformer

The improved version of the Credibility Transformer was fit to the same split of the French MTPL dataset as above. Compared to the details discussed in Sect. 3, we follow the same approach, except that we fit on batches of size \(2^{12} = 4096\) instances. We set the number of attention heads in each layer to \(M=2\), and we use three Transformer layers, i.e., we use a deep Transformer architecture. Moreover, we set \(b=40\), in other words, we use a model that is eight times wider than the one used above. Since this results in a very large number of parameters to optimize (approximately 320,000), it becomes infeasible to fit these models without utilizing Graphics Processing Units (GPUs); the approach taken was to utilize a cloud computing service to fit these models. Each training run of the model takes about 7 min,\(^{5}\) i.e., fitting such a large and complex model is entirely feasible using GPUs. We show the in-sample and out-of-sample Poisson deviance losses for each selected credibility parameter in Table 3, where we evaluate the model's performance for different values of the credibility parameter \(\alpha \), ranging from \(90\%\) to \(100\%\). Recall that setting the credibility parameter to \(\alpha = 100\%\) corresponds to the pure Transformer training approach without using the credibility mechanism for training.
Table 3
In-sample and out-of-sample Poisson deviance losses (units are \(10^{-2}\)) for each credibility parameter applied to the improved deep Credibility Transformer of Sect. 4; ensembling results are shown for each credibility parameter

Model                                                      In-sample Poisson loss    Out-of-sample Poisson loss
Ensemble Poisson plain-vanilla FNN                         23.691                    23.783
Ensemble Credibility Transformer (best-performing)         23.562                    23.711
Improved Credibility Transformer with \(\alpha = 90\%\)    23.533 (± 0.058)          23.670 (± 0.031)
Ensemble Credibility Transformer (\(\alpha = 90\%\))       23.454                    23.587
Improved Credibility Transformer with \(\alpha = 95\%\)    23.557 (± 0.058)          23.676 (± 0.027)
Ensemble Credibility Transformer (\(\alpha = 95\%\))       23.465                    23.593
Improved Credibility Transformer with \(\alpha = 98\%\)    23.544 (± 0.042)          23.670 (± 0.032)
Ensemble Credibility Transformer (\(\alpha = 98\%\))       23.460                    23.577
Improved pure Transformer with \(\alpha = 100\%\)          23.535 (± 0.051)          23.689 (± 0.044)
Ensemble pure Transformer (\(\alpha = 100\%\))             23.447                    23.607
From Table 3 we observe that the different credibility parameters \(\alpha \in \{90\%, 95\%, 98\%, 100\%\}\) lead to similar in-sample Poisson deviance losses, and they are relatively stable, with minor fluctuations. Specifically, the in-sample losses are all very close to \(23.544 \cdot 10^{-2}\). The standard deviations are also comparable, indicating consistent performance across different fitting runs.
When examining the out-of-sample Poisson deviance losses, we notice a slightly different pattern. The losses slightly decrease as the credibility parameter increases from \(90\%\) to \(98\%\), reaching the lowest value at \(\alpha = 98\%\) with an average loss of \(23.670 \cdot 10^{-2}\) (standard deviation \(\pm 0.032\)). Setting \(\alpha = 98\%\) yields the best out-of-sample performance among the individual models, suggesting that incorporating the credibility mechanism enhances the model’s generalization ability. Comparing the models with and without the credibility mechanism, we find that the models with \(\alpha < 100\%\) generally outperform the pure Transformer approach (\(\alpha = 100\%\)) in out-of-sample predictions. This indicates that the credibility mechanism effectively leverages prior information, improving the predictive accuracy on unseen data.
The benefits of the ensemble modeling through averaging are evident from the results, i.e., even with a state-of-the-art Transformer architecture, significant performance gains can be made through ensembling. This is a somewhat surprising result, since the usual practice with large Transformer-based models is to use only a single model. The ensemble models consistently achieve lower Poisson deviance losses compared to their individual counterparts. For example, the ensemble with \(\alpha = 98\%\) attains the lowest out-of-sample loss of \(23.577 \cdot 10^{-2}\), outperforming both the individual models and the pure Transformer ensemble approach.
Moreover, in comparison to the original Credibility Transformer presented earlier, the improved version shows substantial performance gains. The best-performing model with \(\alpha = 98\%\) achieves an out-of-sample loss of \(23.577 \cdot 10^{-2}\), which is a significant improvement over the original model’s performance of \(23.711 \cdot 10^{-2}\). We observe that the model’s performance appears to be somewhat sensitive to the credibility parameter, with an optimal performance achieved at \(\alpha = 98\%\). This suggests that while the credibility mechanism is beneficial, it is helpful to fine-tune this parameter for optimal results. We also observe that the out-of-sample standard deviation of the Poisson deviance loss is somewhat higher in the plain-vanilla training approach, i.e., when the credibility parameter is set to \(100\%\). Thus, we see that the proposed credibility approach helps to stabilize the training process.
While the improved model’s complexity necessitates the use of GPUs, the performance gains likely justify the increased computational requirements for many practical applications in insurance pricing and risk assessment. However, practitioners should carefully consider the trade-offs between model complexity, performance, and interpretability in their specific context. In summary, the improved deep Credibility Transformer with a credibility parameter \(\alpha \) slightly less than \(100\%\), i.e., incorporating the credibility regularization, achieves superior predictive performance. We thus conclude that even in a state-of-the-art Transformer approach, as presented here, incorporating the credibility mechanism can improve the out-of-sample performance of the model.
Figure 6 (lhs) gives a scatterplot that compares the original Credibility Transformer predictions (NormFormer fitting) against the improved deep Credibility Transformer predictions (out-of-sample). Similarly to the above, the red line gives a local GAM smoother. We see that, on average, the two predictors are rather similar, in this case even in the tails. Furthermore, there are larger differences at the individual insurance policy level; nonetheless, the density plot of the individual log-differences of the predictors on the right-hand side of Fig. 6 shows that these differences are smaller than the differences between the GLM and the Credibility Transformer shown in Fig. 3.
While the MHA mechanism significantly enhances LLMs, its benefit on tabular data is less clear. Figure 7 verifies that the \(M=2\) attention heads learn different structures. For this graph we extracted the attention weights of the two attention heads for the input variable BonusMalus (which is the most significant explanatory variable, as we will see below). We then applied a principal component analysis (PCA) to these attention weights for each attention head separately. Figure 7 shows the first two PCA scores of the two attention heads (in red and blue). Moreover, we have labelled five individual instances in both heads to illustrate their calibration. From this figure we conclude that the two attention heads have learned different representations and structure of the data, because the resulting point clouds have different shapes.
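This comparison can be reproduced along the following lines; the attention weight arrays below are random placeholders standing in for the per-instance attention weights of the BonusMalus token extracted from the two heads of the trained model.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder attention weights of the BonusMalus token, one row per test
# instance and one column per input token, for each of the two heads.
attn_head_1 = np.random.rand(5000, 10)
attn_head_2 = np.random.rand(5000, 10)

# First two principal component scores, computed separately for each head
# (these are the coordinates plotted in Fig. 7).
scores_1 = PCA(n_components=2).fit_transform(attn_head_1)
scores_2 = PCA(n_components=2).fit_transform(attn_head_2)
```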

5 Exploring the credibility transformer predictions

The Credibility Transformer architecture provides rich information about how the model's predictions are formed; in particular, we refer to Sect. 2.2.4 for a discussion of how the attention operation updating the CLS token can be seen as a linear credibility formula. We focus on the single-head variations of Sect. 2 for simplicity; with some effort, the same analysis can be produced for the improved deep Credibility Transformer, after aggregating the attention outputs across heads and layers. Here, we select one trained Credibility Transformer model to explore its workings, noting that the selected model achieves an out-of-sample Poisson deviance loss similar to the one reported in Table 2 for the NormFormer variation.
We start by examining the mean values of the attention outputs for the CLS token, \((a_{T+1,j})_{j = 1}^{T+1}\). These mean values are produced by taking the average over the test dataset used for assessing the model's out-of-sample performance. In Fig. 8 (lhs), we show that the variable to which the model attends the most is the BonusMalus score, followed by VehAge, VehBrand and the Area code. The CLS token itself, which we recall is calibrated to produce the portfolio average frequency, has an attention probability of about 6.5% on average. That is, with reference to equation (2.13), \(P = a_{T+1,T+1}\) is comparably low, implying that the complement probability, \(1-P\), referring to the weight given to the covariates, is relatively high. This is exactly as we would expect for a large MTPL insurance portfolio. In Fig. 8 (rhs), we show a histogram of the attention outputs \(P=a_{T+1,T+1}\) over all test instances. These attention weights take values between zero and approximately \(12.5\%\) across the test dataset, i.e., even in the most extreme cases only a comparably small weight is allocated to the portfolio average experience.
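The quantities shown in Fig. 8 can be computed from the extracted attention weights of the CLS token as sketched below; the array `cls_attention` is a placeholder for the softmax weights \((a_{T+1,j})_{j=1}^{T+1}\) of each test instance and is not the paper's data.

```python
import numpy as np

# Placeholder: one row of softmax attention weights per test instance,
# the last column corresponding to the CLS token itself.
cls_attention = np.random.rand(5000, 10)
cls_attention /= cls_attention.sum(axis=1, keepdims=True)

mean_attention = cls_attention.mean(axis=0)   # average attention per token (Fig. 8, lhs)
P = cls_attention[:, -1]                      # credibility weight of the CLS token, eq. (2.13)
```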
In Fig. 9, we investigate the relationships learned by the model between the covariates and the attention scores; this is done for two continuous and two categorical variables, namely, BonusMalus, DrivAge, VehBrand and Region (see the next paragraph for a description of how these variables were chosen). The figures show interesting relationships which indicate that the model attends strongly to the information contained within the embedding of these variables for certain values of the covariate. For example, low values of the (correlated) BonusMalus and DrivAge covariates receive strong attention scores. Likewise, some values of the VehBrand and Region covariates receive very strong attention scores. These relationships can be investigated further by trying to understand the context in which variations may occur, e.g., in the top-left panel of Fig. 9 we can see that, at low values of the BonusMalus covariate, two different attention patterns occur for some segments of the data. In Fig. 10, we reproduce the top-left panel of Fig. 9, but color the points according to the Density covariate, revealing that drivers with a low BonusMalus score in high Density regions have lower attention scores for the BonusMalus covariate than drivers in medium and low Density areas; in other words, these variables interact quite strongly.
Finally, in Fig. 11, we show how the attention scores for the CLS token vary with the attention scores given to the other covariates, for the BonusMalus, DrivAge, VehBrand and Region covariates. These covariates were selected as the top four covariates maximizing the variable importance scores of a Random Forest model fitted to predict the CLS token attention scores based on the attention scores for the other covariates. The analysis shows a very strong relationship between the covariate attention scores and the CLS attention scores, i.e., it shows how the credibility given to the portfolio experience varies with the values of the other covariates. In particular, we can see that the highest credibility is given to the portfolio experience when the BonusMalus scores are low, when the DrivAge is middle-aged, when the VehBrand is not B12, and in several of the Regions in the dataset. In summary, this analysis leads us to expect that, at least to some extent, the Credibility Transformer produces frequency predictions close to the portfolio average for these, and similar, values of the covariates. We test this insight in Fig. 12, which shows density plots of the values of the BonusMalus and DrivAge covariates, together with the average value of each covariate over the whole portfolio and the average value among the policies whose predictions are close to the predicted portfolio mean. It can be seen that for low values of the BonusMalus covariate, and for middle-aged drivers, the Credibility Transformer produces predictions close to the portfolio mean. As just mentioned, for these "average" policyholders, we give higher credibility to the CLS token.

6 Conclusions

In this paper, we introduced and developed the Credibility Transformer, a novel approach to using Transformers for actuarial modeling in a non-life pricing context, which integrates traditional credibility theory with state-of-the-art deep learning techniques for tabular data. We have demonstrated that the Credibility Transformer consistently outperforms standard Transformer encoders in out-of-sample prediction as measured by the Poisson deviance loss. This improvement is significant and robust across multiple runs, illustrating the effectiveness of our approach in enhancing predictive accuracy for insurance pricing. Moreover, our results support the use of ensembling techniques even when applying complex state-of-the-art Transformers, since ensemble models consistently achieved lower Poisson deviance losses than the individual models.
Building on the initial Credibility Transformer that was introduced using a single-head attention mechanism within a shallow architecture, the enhanced version incorporates several architectural improvements, including multi-head attention, gated layers, and improved numerical embeddings of continuous covariates. These modifications collectively contributed to the model's superior performance. While the improved model's complexity necessitates the use of GPUs for training, the performance gains appear to justify the increased computational requirements. Moreover, the ability to train these complex models in a reasonable time-frame (approximately 7 min per run) on cloud-based GPUs makes them feasible for real-world deployment.
The consistent performance across different credibility parameters and the reduced out-of-sample standard deviation (compared to pure Transformers) indicate that the approach introduced here enhances the stability and reliability of predictions. Thus, we can conclude that this work demonstrates a successful integration of traditional actuarial concepts (credibility theory) with state-of-the-art deep learning techniques.
Finally, an analysis of the rich information provided by the model shows that, in line with expectations on a personal lines motor insurance portfolio, the trained Credibility Transformer gives only small weight to the CLS token representing the portfolio’s average experience and that, for example, interactions can be identified using the attention scores of the model.
Future research directions could include developing methods to dynamically adjust the credibility parameter during training or even on a per-instance basis.

Acknowledgements

We would like to kindly thank the two anonymous referees for their very constructive and useful reports that helped us to significantly improve this manuscript. Parts of this research were carried out while Mario V. Wüthrich was a KAW guest professor at Stockholm University, while he was hosted at Ewha Womans University, Seoul, and while he was a Distinguished Visiting Scholar at UNSW Business School, Sydney.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix

See Figs. 13 and 14.
Footnotes
2. The cleaned data can be downloaded from https://people.math.ethz.ch/~wueth/Lecture/freMTPL2freq.rda, and for the partition into learning and test data we use the random number generator under R version 3.5.0.
3. We used keras 3.7 with TensorFlow.
4. For notational convenience, we have not included the head-specific query, key and value projections of \({\varvec{c}}^{\textrm{prior}, \ell -1}\) to derive \({\varvec{c}}^{\textrm{prior}, \ell }\); nonetheless, these are also performed.
5. This was done on an L4 GPU on the Google Colab service.
Literature
2.
4. Brébisson de A, Simon É, Auvolat A, Vincent P, Bengio Y (2015) Artificial neural networks applied to taxi destination prediction. arXiv:1508.00021
5. Bühlmann H, Straub E (1970) Glaubwürdigkeit für Schadensätze. Bull Swiss Assoc Actuaries 1970:111–131
7. Dauphin YN, Fan A, Auli M, Grangier D (2017) Language modeling with gated convolutional networks. In: Proceedings of the 34th international conference on machine learning, 70, 933-941
8. Delong Ł, Kozak A (2023) The use of autoencoders for training neural networks with mixed categorical and numerical features. ASTIN Bull J IAA 53(2):213–232
9. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv:1810.04805
11. Elfwing S, Uchibe E, Doya K (2018) Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw 107:3–11
13.
14. Gorishniy Y, Rubachev I, Babenko A (2022) On embeddings for numerical features in tabular deep learning. Adv Neural Inf Process Syst 35:24991–25004
15. Gorishniy Y, Rubachev I, Khrulkov V, Babenko A (2021) Revisiting deep learning models for tabular data. In: Beygelzimer A, Dauphin Y, Liang P, Wortman Vaughan J (eds) Advances in neural information processing systems, 34. Curran Associates Inc, New York, pp 18932–18943
16. Grinsztajn L, Oyallon E, Varoquaux G (2022) Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Inf Process Syst 35:507–520
18. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision (ICCV), 1026-1034
19. Holzmüller D, Grinsztajn L, Steinwart I (2024) Better by default: strong pre-tuned MLPs and boosted trees on tabular data. arXiv:2407.04491
20.
21. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning, 37, 448-456
23. Lindholm M, Wüthrich MV (2024) The balance property in insurance pricing. SSRN Manuscript ID 4925165
25. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. In: International conference on learning representations (ICLR)
26. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press
27. Richman R (2020) AI in actuarial science - a review of recent advances - part 1. Ann Actuarial Sci 15(2):207–229
28. Richman R (2020) AI in actuarial science - a review of recent advances - part 2. Ann Actuarial Sci 15(2):230–258
29.
30.
31. Richman R, Wüthrich MV (2024) High-cardinality categorical covariates in network regressions. Japan J Stat Data Sci 7(2):921–965
35.
36.
37. Wüthrich MV (2024) Experience rating in insurance pricing. SSRN Manuscript ID 4726206
39. Zehui L, Liu P, Huang L, Chen J, Qiu X, Huang X (2019) DropAttention: a regularization method for fully-connected self-attention networks. arXiv:1907.11065
40. Zhai X, Mustafa B, Kolesnikov A, Beyer L (2023) Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision, 11975-11986