Insurance: Mathematics and Economics

Volume 64, September 2015, Pages 417-428
Dependent frequency–severity modeling of insurance claims

https://doi.org/10.1016/j.insmatheco.2015.07.006

Abstract

Standard ratemaking techniques in non-life insurance assume independence between the number and size of claims. Relaxing this assumption, this article explores methods that allow for correlation between the frequency and severity components of micro-level insurance data. To introduce granular dependence, we rely on a hurdle modeling framework in which the hurdle component concerns the occurrence of claims and the conditional component examines the number and size of claims given occurrence. We propose two strategies to correlate the number of claims and the average claim size in the conditional component. The first is based on conditional probability decomposition and treats the number of claims as a covariate in the regression model for the average claim size; the second employs a mixed copula approach to formulate the joint distribution of the number and size of claims. We perform a simulation study to evaluate the performance of the two approaches and then demonstrate their application using a U.S. auto insurance dataset. Hold-out sample validation shows that the proposed model is superior to industry benchmarks, including the Tweedie and two-part generalized linear models.

Introduction

Insurance claims modeling is a critical actuarial task in property–casualty insurance. A direct output, the predictive distribution of claims, serves as a foundation for various actuarial decision-making processes. At the individual level, predictive models are used for risk classification and to determine the premium and loadings for each policyholder. At the aggregate level, predictive models quantify the risk of a portfolio or a block of business, which helps insurers choose the appropriate level of risk capital and treaty or facultative reinsurance arrangements.

The function of insurance as a risk management tool relies on the law of large numbers, i.e. the insurer can spread the risk of an individual among a pool of homogeneous policyholders. This risk pooling mechanism gives rise to the unique semi-continuous feature of insurance claims data. Specifically, when examining the claims from a random sample of policyholders, one often observes a significant fraction of zeros alongside a relatively small percentage of positive claim amounts, a pattern known as zero-inflated data in the literature. The zeros correspond to policyholders without any claim during the policy year and usually account for the majority of observations. The phenomenon of zero inflation is not surprising when one imagines the odds of car accidents or hail damage to real property.

Modeling insurance claims is commonly done within the generalized linear model (GLM) framework (de Jong and Heller, 2008). Popular GLM-based approaches to handling zero inflation include the frequency–severity model (see, for example, Frees (2014)) and the Tweedie compound Poisson model (see, for example, Jørgensen and de Souza (1994)). The former, also known as the two-part model, decomposes the cost of claims into two pieces: the frequency part examines whether or not a claim occurs (a logit regression) or the number of claims (a Poisson regression), and the severity part looks into the amount of claims conditional on occurrence (a gamma or inverse Gaussian regression). The latter is defined as a Poisson sum of i.i.d. gamma variables; a probability mass at zero is naturally incorporated into an otherwise continuous distribution through the compound Poisson process. The resulting Tweedie distribution belongs to the exponential family, and hence the inference procedure for GLMs applies directly. There is a vast literature extending these two classes of approaches to capture different features of insurance data. For example, hurdle and zero-inflated models are employed to accommodate overdispersion and zero inflation in claim counts (see Boucher et al. (2007)), various fat-tailed regression techniques have been proposed to model the claim size (see Shi (2014)), and a dispersion model is introduced for the Tweedie model in the context of double GLMs (see Smyth and Jørgensen (2002)). Each method has its own advantages. For instance, the frequency–severity model is more flexible in modeling the occurrence and the size of insurance claims. In contrast, with a more parsimonious specification, the Tweedie model simplifies the variable selection process.
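To see how the compound Poisson mechanism produces semi-continuous (zero-inflated) data, the following is a minimal simulation sketch; the parameter values are purely illustrative and do not correspond to any model fitted in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_compound_poisson(n_policies, lam, shape, scale, rng):
    """Aggregate loss per policy as a Poisson sum of i.i.d. gamma severities.

    lam   : Poisson mean (expected claim count per policy year)
    shape : gamma shape parameter of an individual claim size
    scale : gamma scale parameter of an individual claim size
    """
    counts = rng.poisson(lam, size=n_policies)
    # The aggregate loss is exactly zero when no claim occurs and
    # continuous otherwise -- the semi-continuous feature of claims data.
    totals = np.array([rng.gamma(shape, scale, size=n).sum() for n in counts])
    return counts, totals

counts, totals = simulate_compound_poisson(
    10_000, lam=0.15, shape=2.0, scale=500.0, rng=rng
)
print(f"share of policies with zero loss: {np.mean(totals == 0):.2%}")
```

With a small claim rate such as lam = 0.15, the share of zero losses is roughly exp(-0.15) ≈ 86%, matching the zero inflation described above. Note that this construction assumes the count and the severities are independent, which is precisely the assumption the paper relaxes.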

However, both methods assume an independent relationship between the frequency and severity of claims. In the Tweedie compound Poisson model, the derivation of the Tweedie distribution is based on the independence between the Poisson and gamma models in the underlying stochastic process. In the two-part framework, the explicit effect of claim frequency on claim severity is usually ignored, although, technically speaking, the decomposition is based on conditional probability and does not require the independence assumption between the two components. Yet the number and the size of insurance claims could be correlated, and this is the case for the automobile claims data in this study. In the spirit of estimating simultaneous equations, failure to account for the dependence among a set of regression models could bias the parameter estimates and thus the predictive distribution. Despite the developments in the recent actuarial literature, efforts toward relaxing the independence assumption remain sparse. We note a few papers in line with our study. In the spatial modeling of frequency and severity of automobile insurance claims, Gschlößl and Czado (2007) used claim counts as a predictor for the claim size. This conditional probability approach has also been used in modeling health care expenditures, for example, see Frees et al. (2011) and Erhardt and Czado (2012). Czado et al. (2012) and Krämer et al. (2013) employed parametric copulas to jointly model the number and average size of claims for aggregated car insurance data.

Motivated by the above observations, this work aims to design a claims modeling framework that can accommodate the statistical association between the number and the size of insurance claims. We focus on the modeling strategy for cross-sectional data. We assume that the insurance dataset contains at least policy-level information, including whether there is any accident, the number and the amount of claims if one or more accidents occur during the observation period, as well as a set of predictors. To introduce granular dependence, we propose a hurdle modeling framework where the hurdle component examines the probability of at least one claim occurring, and the conditional component models the number of claims and their size, given that at least one claim has occurred. We employ two strategies to correlate the number of claims and the average claim size in the conditional model. The first is based on conditional probability decomposition and treats the number of claims as a covariate in the regression model for the average claim size; the second employs a mixed copula approach to formulate the joint distribution of the number and size of claims. We evaluate the performance of both methods through a simulation study and then demonstrate their application using a U.S. auto insurance dataset.
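The first strategy rests on the elementary factorization of a joint density into a marginal and a conditional; in notation of our choosing (not the paper's), with covariate vector $\mathbf{x}$ it reads

```latex
f_{N,S}(n, s \mid \mathbf{x})
  = f_N(n \mid \mathbf{x}) \, f_{S \mid N}(s \mid n, \mathbf{x}),
```

so that the dependence is induced by letting the observed count enter the severity regression, for instance via $\log \mathbb{E}[S_i \mid N_i, \mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \gamma N_i$, where a nonzero $\gamma$ captures the frequency effect on severity. This particular link and linear form are an illustrative sketch, not the paper's exact specification.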

Our work differs from and contributes to the existing literature in that we extend the traditional two-part model to a three-part framework so as to incorporate the association between the frequency and severity components for micro-level insurance claims data. First, we emphasize inference for granular observations, because aggregating claims data could result in significant information loss. Second, the proposed framework preserves the flexibility to capture some unique features of insurance claims data, such as the zero inflation in claim counts and the heavy tails in claim severity. Note that, unlike existing studies, one is not limited to the GLM framework for the marginal models. Third, for model comparison, we emphasize hold-out sample validation in addition to goodness-of-fit, which is of more practical importance for prediction purposes.

The rest of the article is structured as follows: Section 2 introduces the hurdle modeling framework under which two different approaches are proposed to accommodate the dependence between the frequency and severity of insurance claims. Section 3 presents a simulation study to investigate the performance of the two approaches. Section 4 summarizes the application to Massachusetts automobile insurance data, including the characteristics of the dataset, inference and model comparison, and prediction of the proposed methods. Section 5 concludes the paper.

Section snippets

Modeling

For a portfolio of business containing I policyholders, we use three random variables to describe the claim experience of the ith (i = 1, …, I) policyholder during the policy year: R_i indicates whether there is any claim; N_i represents the observed number of claims; S_i denotes the observed average amount of claims. The three outcomes are assumed to be dictated by the underlying latent variables according to the following relations:

$$
R_i = \begin{cases} 1 & \text{if } R_i^* > 0 \\ 0 & \text{if } R_i^* \le 0 \end{cases},
\qquad
N_i = \begin{cases} N_i^* & \text{if } R_i = 1 \\ \cdot & \text{if } R_i = 0 \end{cases},
\qquad
S_i = \begin{cases} S_i^* & \text{if } R_i = 1 \\ \cdot & \text{if } R_i = 0, \end{cases}
$$

where a dot indicates that the outcome is not observed.
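As a toy illustration of this observation mechanism, the sketch below generates the observed triple (R_i, N_i, S_i) from latent draws. All latent distributions here are placeholders of our own choosing, not the models fitted in the paper, and we encode the unobserved ("dot") outcomes as 0 for counts and NaN for severities.

```python
import numpy as np

rng = np.random.default_rng(7)
I = 8  # small illustrative portfolio

# Latent variables (placeholder distributions, not the paper's specifications)
r_star = rng.normal(size=I)          # drives claim occurrence
n_star = 1 + rng.poisson(0.5, I)     # claim count, at least 1 given occurrence
s_star = rng.gamma(2.0, 500.0, I)    # average claim size, given occurrence

occurred = r_star > 0                    # R_i = 1{R_i^* > 0}
N = np.where(occurred, n_star, 0)        # N_i unobserved (coded 0) when R_i = 0
S = np.where(occurred, s_star, np.nan)   # S_i unobserved (coded NaN) when R_i = 0
```

The hurdle structure is visible in the code: the occurrence indicator alone decides whether the count and severity components are ever observed.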

Simulation

The simulation focuses on the in-sample inference for the correlated frequency–severity models and emphasizes the importance of dependence modeling. We simulate a portfolio of policyholders of size I=5000 and I=20,000 for each model. Two predictors are used in the simulation, x1 indicates a rating class and x2 indicates territory class. Both are generated from Bernoulli(0.5) independently. Claim experiences of the policyholders are generated from the methods described in Section  2. We use
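The covariate design described above can be sketched in a few lines; the seed and the cell-count summary are our own additions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2015)
I = 5000  # portfolio size of the smaller simulation design

# Two binary predictors, drawn independently from Bernoulli(0.5)
x1 = rng.binomial(1, 0.5, I)   # rating class indicator
x2 = rng.binomial(1, 0.5, I)   # territory class indicator

# Each policyholder falls into one of four risk cells defined by (x1, x2)
cells, cell_counts = np.unique(
    np.stack([x1, x2], axis=1), axis=0, return_counts=True
)
```

With I = 5000 each of the four cells receives about 1250 policyholders, so every covariate combination is well represented when fitting the correlated frequency–severity models.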

Data

In the application, we consider a database of personal automobile insurance from the Commonwealth Automobile Reinsurers (CAR) in the state of Massachusetts in the United States. The CAR is a statistical agent for motor vehicle insurance in the Commonwealth of Massachusetts and collects insurance data for both private passengers and commercial automobiles in the state. In Massachusetts, all drivers are required to purchase third party liability (property damage and bodily injury) and personal

Conclusion

Current practice in modeling insurance claims often assumes independence among the frequency and severity of claims. This article explored strategies that could explicitly incorporate the association between the two components. We investigated two approaches and both were based on a hurdle modeling framework, the conditional probability model and the mixed copula model. It was noted that the latter captures the dependence in a broader sense than the former. The proposed framework is flexible to

Acknowledgments

We are grateful to the reviewers for their comments and suggestions that helped improve the quality and presentation of the paper. We acknowledge the financial support from the Centers of Actuarial Excellence (CAE) Research Grant from the Society of Actuaries.

References (28)

  • K. Aas et al.

    Pair-copula constructions of multiple dependence

    Insur. Math. Econ.

    (2009)
  • J.-P. Boucher et al.

    Risk classification for claim counts: a comparative analysis of various zero-inflated mixed Poisson and hurdle models

    N. Am. Actuar. J.

    (2007)
  • J.-P. Boucher et al.

    Models of insurance claim counts with time dependence based on generalisation of Poisson and negative binomial distributions

    Variance

    (2008)
  • C. Czado et al.

    Predictive model assessment for count data

    Biometrics

    (2009)
  • C. Czado et al.

    A mixed copula model for insurance claims and claim sizes

    Scand. Act. J.

    (2012)
  • P. de Jong et al.

    Generalized Linear Models for Insurance Data

    (2008)
  • V. Erhardt et al.

    Modeling dependent yearly claim totals including zero claims in private health insurance

    Scand. Act. J.

    (2012)
  • E. Frees

    Frequency and severity models

    (2014)

  • E.W. Frees et al.

    Predicting the frequency and amount of health care expenditures

    N. Am. Actuar. J.

    (2011)
  • E.W. Frees et al.

    Summarizing insurance scores using a Gini index

    J. Amer. Statist. Assoc.

    (2011)
  • P. Ghosh et al.

    A Bayesian analysis for longitudinal semicontinuous data with an application to an acupuncture clinical trial

    Comput. Stat. Data Anal.

    (2009)
  • S. Gschlößl et al.

    Spatial modelling of claim frequency and claim size in non-life insurance

    Scand. Act. J.

    (2007)
  • B. Jørgensen et al.

    Fitting Tweedie's compound Poisson model to insurance claims data

    Scand. Act. J.

    (1994)
  • N. Krämer et al.

    Total loss estimation using copula-based regression models

    Insur. Math. Econ.

    (2013)