Are we modelling the right thing? The impact of incorrect problem specification in credit scoring

https://doi.org/10.1016/j.eswa.2008.12.016

Abstract

Classification and regression models are widely used by mainstream credit granting institutions to assess the risk of customer default. In practice, the objectives used to derive model parameters and the business objectives used to assess models differ. Model parameters are determined by minimising some function of error or by maximising likelihood, but performance is assessed using global measures such as the GINI coefficient, or the misclassification rate at a specific point in the score distribution. This paper seeks to determine the impact on performance that results from having different objectives for model construction and model assessment. To do this a genetic algorithm (GA) is utilized to generate linear scoring models that directly optimise business measures of interest. The performance of the GA models is then compared to those constructed using logistic and linear regression. Empirical results show that all models perform similarly well, suggesting that modelling and business objectives are well aligned.

Introduction

All mainstream credit granting institutions use credit scoring – mechanically derived forecasting models of customer behaviour – to make decisions about whom to extend credit to and on what terms. The most widely used credit scoring models predict a simple binary outcome; that is, the likelihood that an individual will be a ‘good’ customer who repays the credit advanced to them, or a ‘bad’ customer who defaults. Despite much research into the applicability of a wide variety of classification and regression methods to credit scoring problems, logistic regression remains the most widely used method in practice (Crook et al., 2007, Finlay, 2008). This is mainly attributed to the fact that logistic regression produces simple models that are easily interpretable, as well as empirical evidence suggesting that the performance of simple linear models is only fractionally worse than more complex model forms such as neural networks and support vector machines (Baesens et al., 2003).

In many real world situations, the objective a lender is trying to optimise through the use of a credit scoring model is different from the objective used during model development. Therefore, a key question – that has not been widely considered by the credit scoring community – is: are we modelling the right thing? And if not, what is the impact of not doing so? As a simple illustration, consider logistic regression applied to a binary classification problem, where the dependent variable, y, takes values of 0 or 1. Through the application of an appropriate algorithm, a model is derived that maximises likelihood over the set of n observed cases: $\prod_{i=1}^{n} P_i^{y_i}(1-P_i)^{1-y_i}$, where $P_i$ is the posterior probability that $y_i = 1$, calculated as a function of the independent variables. Yet, for many practitioners the actual point estimate for an observation is of little interest. What is of primary importance is the relative performance at specific points in the distribution of ranked model scores (Thomas, Banasik, & Crook, 2001). It is also true that for some decisions (such as where a fixed accept rate policy is in operation) the only concern is that observations fall on the correct side of the decision rule applied. Whether an individual only just passes the cut-off score or exceeds it by a great margin is irrelevant (Hand, 2005). This can be demonstrated by considering a hypothetical example. Imagine that there exist two models that generate probabilistic estimates of credit applications being good credit risks. Two credit applications are scored by each model to produce the results shown in Table 1.

Now assume that both cases are revealed to be good payers. From a maximum likelihood perspective, Model 1 outperforms Model 2. Yet, if a lender was using these models to make credit granting decisions, say on the basis of accepting only those where the estimated probability of being good exceeds 0.8, then Model 2 is better because both cases would be accepted. Maximising likelihood is therefore no guarantee of optimal model performance in this case.
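The point can be reproduced numerically. The sketch below uses hypothetical probability estimates (the actual values in Table 1 are not reproduced here) to show a model with the higher likelihood nonetheless accepting fewer good payers at a 0.8 cut-off:

```python
import math

CUTOFF = 0.8
y = [1, 1]   # both applicants turn out to be good payers

# Hypothetical probability estimates, chosen purely for illustration.
model_1 = [0.99, 0.75]
model_2 = [0.81, 0.82]

def likelihood(p, y):
    """Product over cases of P_i^y_i * (1 - P_i)^(1 - y_i)."""
    return math.prod(pi if yi == 1 else 1 - pi for pi, yi in zip(p, y))

def accepted(p):
    """Number of applicants scoring at or above the cut-off."""
    return sum(pi >= CUTOFF for pi in p)

# Model 1 has the higher likelihood, yet rejects one of the good payers;
# Model 2 accepts both, so it makes the better lending decisions here.
print(likelihood(model_1, y), accepted(model_1))   # ~0.7425, 1 accepted
print(likelihood(model_2, y), accepted(model_2))   # ~0.6642, 2 accepted
```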

A Genetic Algorithm (GA) is a data-driven, non-parametric heuristic search process, where the training algorithm can be chosen to optimise a wide range of objective functions. Because the training algorithm is guided only by the performance of competing solutions, GAs have the potential to generate models that outperform other approaches to credit scoring in situations where the objective function that a user wishes to optimise differs from that used within the modelling process.

Previous studies where GAs have been used to develop credit scoring models have reported mixed findings. Fogarty and Ireson (1993/4) took a sample of over fifty thousand accepted credit card applications and compared a GA derived Bayesian classifier with decision rules derived from a number of techniques including a nearest neighbour clustering algorithm, a decision tree and a simple Bayesian classifier. They found that the GA derived classifier performed better than other methods when assessed on classification rates, but did not perform better than a simple decision rule to classify all cases as good. Desai, Conway, Crook, and Overstreet (1997) looked at a three-way classification problem where accounts were classified as good, poor or bad payers. They reported that a GA approach was marginally better at classifying the worst accounts (bad payers) than linear discriminant analysis, logistic regression and a variety of neural network models, but did not perform as well when measured in terms of classification performance on good and poor paying accounts. Yobas, Crook, and Ross (2000) reported that while a GA derived model performed better than neural networks and decision trees on the development sample (no validation sample performance was available for the GA derived model), all three methods were outperformed by linear discriminant analysis. While the results and methodologies applied in these previous studies differ, one feature that they all have in common is that they only considered misclassification performance metrics for which the non-GA approaches used in the study were generally known to provide good levels of performance. It is, therefore, no surprise that a GA approach was not found to significantly outperform the alternative model development approaches examined.

In this paper, a GA approach is again explored, but incorporating a number of features that differentiate it from previous studies. First, the objective is primarily to determine the sensitivity of models developed using standard approaches to differences between modelling and business objectives. The actual performance of GA derived models is only a secondary consideration. Second, rather than simply judging performance of competing models on the basis of a single misclassification measure, model performance is assessed using several different criteria:

  • The maximisation of the GINI coefficient (a measure of the area under the receiver operating characteristic curve), which for a discrete population of n observations that fall into one of two classes and are ranked by model score, can be calculated using the Brown formula: $1-\sum_{i=2}^{n}[G(i)+G(i-1)][B(i)-B(i-1)]$, where G and B represent the cumulative proportions of cases falling into each class respectively.

  • The minimisation of the proportion of bads within the highest scoring x% of the population; that is, the number of bads scoring ⩾ c, where c is the cut-off score at or above which x% of the population scores. For the purposes of this study, values of x of 5, 10, 25 and 50 percent were considered.
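Both assessment measures can be sketched in code. The toy sample and the conventions below (scores ranked ascending for the Brown formula, goods labelled 1 and bads 0, arbitrary tie-breaking) are illustrative assumptions, not the paper's exact implementation:

```python
def gini_brown(scores, labels):
    """GINI coefficient via the Brown formula; labels: 1 = good, 0 = bad.
    Cases are ranked in ascending score order; ties broken arbitrarily."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0])
    n_good = sum(lab for _, lab in ranked)
    n_bad = len(ranked) - n_good
    total, G, B = 0.0, 0.0, 0.0
    for _, lab in ranked:
        prev_G, prev_B = G, B
        G += (lab == 1) / n_good   # cumulative proportion of goods
        B += (lab == 0) / n_bad    # cumulative proportion of bads
        total += (G + prev_G) * (B - prev_B)
    return 1.0 - total

def bad_rate_top(scores, labels, x):
    """Proportion of bads among the highest-scoring fraction x of cases."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    k = max(1, round(len(ranked) * x))
    return sum(lab == 0 for _, lab in ranked[:k]) / k

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # higher score = better risk
labels = [1, 1, 1, 0, 0, 0]               # a perfectly ranked toy sample
print(gini_brown(scores, labels))          # 1.0 for a perfect ranking
print(bad_rate_top(scores, labels, 0.5))   # 0.0: no bads in the top half
```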

In each case a GA was applied to generate a scoring model that optimised each objective independently, whereas a single competing model was constructed and assessed using each competitor approach. Third, two large real world data sets are used, whereas previous studies have been based on relatively low dimensional data sets and small samples (with the exception of Fogarty and Ireson’s study). Fourth, solutions are considered with and without seeding – the process whereby a genetic algorithm is initialised using a number of pre-existing solutions found using some alternative technique. The GA is then applied in an attempt to improve upon the performance of the original seed solution(s).

Empirical results are presented for the two data sets, with the performance of the GA derived models compared to models constructed using logistic regression and multiple OLS regression.

Section snippets

Overview of genetic algorithms

The theory of GAs was developed in the late 1960s and early 1970s by John Holland and his associates as a means to study evolutionary processes in nature (Holland, 1975), but they were quickly adopted as a heuristic approach applicable to a wide range of optimisation problems (De Jong, 1975, Hollstien, 1971). The general principles of GAs are analogous to Darwinian principles of natural selection and survival of the fittest, and the terminology employed to describe GA training and selection is

Design and implementation of genetic algorithms

A number of parameters need to be selected for GA training, and as with methods such as neural networks, the parameters that deliver the best solution tend to be problem specific. Consequently, a good deal of trial and error can be required to find the most appropriate training parameters. The first question is how to encode solutions to a given problem? The approach favoured by Holland (1975) is to use binary strings. For example, if the objective is to optimise some function of two
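A minimal sketch of a binary-string GA of the kind described, applied to a toy two-variable objective. The population size, crossover and mutation rates, and the objective function here are all illustrative assumptions, not the settings used in the study:

```python
import random

random.seed(0)

BITS, N_VARS = 8, 2            # 8 bits per variable, two variables
POP, GENS = 30, 50             # population size, generations
P_CROSS, P_MUT = 0.7, 0.01     # crossover and per-bit mutation rates

def decode(chrom):
    """Map each 8-bit slice of the chromosome to a real value in [0, 1]."""
    return [int("".join(map(str, chrom[v * BITS:(v + 1) * BITS])), 2)
            / (2 ** BITS - 1) for v in range(N_VARS)]

def fitness(chrom):
    """Toy objective: maximise f(x, y) = 1 - (x - 0.5)^2 - (y - 0.5)^2."""
    x, y = decode(chrom)
    return 1 - (x - 0.5) ** 2 - (y - 0.5) ** 2

def tournament(pop):
    """Binary tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

pop = [[random.randint(0, 1) for _ in range(BITS * N_VARS)]
       for _ in range(POP)]
for _ in range(GENS):
    new_pop = []
    while len(new_pop) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < P_CROSS:           # one-point crossover
            cut = random.randrange(1, BITS * N_VARS)
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for child in (p1, p2):                  # bit-flip mutation
            new_pop.append([1 - g if random.random() < P_MUT else g
                            for g in child])
    pop = new_pop[:POP]

best = max(pop, key=fitness)
print(decode(best), fitness(best))   # should approach (0.5, 0.5) and 1.0
```

Note that nothing in the loop assumes the objective is differentiable or even smooth, which is what makes the approach attractive when the business objective differs from a standard statistical loss.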

Data

Two data sets were available for study. The first data set (Set A) was supplied by Experian UK and contained details of credit applications made between April and June, 2002, and for which performance information was attached 12 months after the application date. After removal of outliers and indeterminates the sample contained 88,792 observations of which 75,528 were classified as good credit risks and 13,264 as bad credit risks. Goods were classified as no more than 1 month in arrears, bads

Methodology

The goal of the exercise was to apply a GA to create linear scoring functions of the form Y = B^T X, where X is a column vector of independent variables and B a column vector of parameter coefficients. It should be noted that the resulting score, Y, should not be interpreted as an estimate of individual performance, but merely as the relative score of each observation within the dataset. Although other types of scoring function exist, such as a multi-layer perceptron, a linear function of the
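The scoring function itself is simple to sketch. The coefficient values and applicant records below are hypothetical, chosen only to show that Y = B^T X yields a relative ranking rather than a probability:

```python
# Minimal sketch of a linear scoring function Y = B^T X.
# Coefficient values are hypothetical, purely for illustration.
B = [0.4, -1.2, 0.05]                     # one coefficient per variable

def score(x):
    """Relative score of one observation; not a probability estimate."""
    return sum(b * xi for b, xi in zip(B, x))

# Hypothetical applicants described by three independent variables.
applicants = [[1.0, 0.0, 35.0], [0.0, 1.0, 22.0]]
ranked = sorted(applicants, key=score, reverse=True)
print([score(a) for a in ranked])   # only the ordering is meaningful
```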

Results

The performance of the GA derived models produced without seeding, and that of the competitor models are shown in Table 2. The performance of the GA derived models produced with seeding and that of the competitor models are shown in Table 3. The best performing model for each measure is highlighted in bold.

The first point to note from Table 2, Table 3 is how similar the performance of many of the competing models was across all performance measures for both data sets. It is also the case that

Concluding remarks

In this paper, the suitability of the modelling objectives used to create credit scoring models has been called into question, given that they differ from the business measures that are widely used to assess model performance. To explore this issue, genetic algorithms were used to create a set of linear scoring models that directly optimised individual measures of business interest. In all cases, there were no significant differences between the performance of the GA derived models and models

Acknowledgements

The author would like to thank the ESRC and Experian UK for their support for this research, and Professor Robert Fildes of Lancaster University for his comments on an early draft of the paper. The author is also grateful for the contribution from a third organisation that has requested that its identity remains anonymous.

References (30)

  • R.K. Ahuja et al. (2000). A greedy genetic algorithm for the quadratic assignment problem. Computers & Operations Research.
  • J.N. Crook et al. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research.
  • C.R. Reeves (1995). A genetic algorithm for flowshop sequencing. Computers & Operations Research.
  • R.K. Ahuja et al. (1997). Developing fitter genetic algorithms. INFORMS Journal on Computing.
  • B. Baesens et al. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society.
  • D.A. Coley (1999). An introduction to genetic algorithms for scientists and engineers.
  • K.A. De Jong (1975). An analysis of the behavior of a class of genetic adaptive systems.
  • V.S. Desai et al. (1997). Credit-scoring models in the credit union environment using neural networks and genetic algorithms. IMA Journal of Mathematics Applied in Business and Industry.
  • S. Finlay (2008). The management of consumer credit: Theory and practice.
  • T.C. Fogarty et al. (1993/4). Evolving Bayesian classifiers for credit control – A comparison with other machine learning methods. IMA Journal of Mathematics Applied in Business and Industry.
  • J. Fox (2000). Nonparametric simple regression.
  • D.E. Goldberg (1989). Genetic algorithms in search optimization & machine learning.
  • Goldberg, D. E. (1989b). Sizing populations for serial and parallel genetic algorithms. In J. D. Schaffer (Ed.),...
  • D.J. Hand (2005). Good practice in retail credit scorecard assessment. Journal of the Operational Research Society.
  • D.J. Hand et al. (2000). Defining attributes for scorecard construction in credit scoring. Journal of Applied Statistics.