Are we modelling the right thing? The impact of incorrect problem specification in credit scoring

https://doi.org/10.1016/j.eswa.2008.12.016

Abstract

Classification and regression models are widely used by mainstream credit granting institutions to assess the risk of customer default. In practice, the objectives used to derive model parameters and the business objectives used to assess models differ. Model parameters are determined by minimising some function of error or by maximising likelihood, but performance is assessed using global measures such as the GINI coefficient, or the misclassification rate at a specific point in the score distribution. This paper seeks to determine the impact on performance that results from having different objectives for model construction and model assessment. To do this a genetic algorithm (GA) is utilized to generate linear scoring models that directly optimise business measures of interest. The performance of the GA models is then compared to those constructed using logistic and linear regression. Empirical results show that all models perform similarly well, suggesting that modelling and business objectives are well aligned.

Introduction

All mainstream credit granting institutions use credit scoring – mechanically derived forecasting models of customer behaviour – to make decisions about whom to extend credit to and on what terms. The most widely used credit scoring models predict a simple binary outcome; that is, the likelihood that an individual will be a ‘good’ customer who repays the credit advanced to them, or a ‘bad’ customer who defaults. Despite much research into the applicability of a wide variety of classification and regression methods to credit scoring problems, logistic regression remains the most widely used method in practice (Crook et al., 2007, Finlay, 2008). This is mainly attributed to the fact that logistic regression produces simple models that are easily interpretable, as well as empirical evidence suggesting that the performance of simple linear models is only fractionally worse than more complex model forms such as neural networks and support vector machines (Baesens et al., 2003).

In many real world situations, the objective a lender is trying to optimise through the use of a credit scoring model is different from the objective used during model development. Therefore, a key question – that has not been widely considered by the credit scoring community – is: are we modelling the right thing? And if not, what is the impact of not doing so? As a simple illustration, consider logistic regression applied to a binary classification problem, where the dependent variable, y, takes values of 0 or 1. Through the application of an appropriate algorithm, a model is derived that maximises likelihood over the set of n observed cases: $\prod_{i=1}^{n} P_i^{y_i}(1-P_i)^{1-y_i}$, where $P_i$ is the posterior probability that $y_i = 1$, calculated as a function of the independent variables. Yet, for many practitioners the actual point estimate for an observation is of little interest. What is of primary importance is the relative performance at specific points in the distribution of ranked model scores (Thomas, Banasik, & Crook, 2001). It is also true that for some decisions (such as where a fixed accept rate policy is in operation) the only concern is that observations fall on the correct side of the decision rule applied. Whether an individual only just passes the cut-off score or exceeds it by a great margin is irrelevant (Hand, 2005). This can be demonstrated by considering a hypothetical example. Imagine that there exist two models that generate probabilistic estimates of credit applications being good credit risks. Two credit applications are scored by each model to produce the results shown in Table 1.

Now assume that both cases are revealed to be good payers. From a maximum likelihood perspective, Model 1 outperforms Model 2. Yet, if a lender was using these models to make credit granting decisions, say on the basis of accepting only those where the estimated probability of being good exceeds 0.8, then Model 2 is better because both cases would be accepted. Maximising likelihood is therefore no guarantee of optimal model performance in this case.
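The point can be reproduced numerically. The sketch below uses hypothetical probability estimates (the actual values in Table 1 are not reproduced here) to show a model with the higher likelihood nonetheless accepting fewer good payers at a 0.8 cut-off:

```python
import math

CUTOFF = 0.8
y = [1, 1]   # both applicants turn out to be good payers

# Hypothetical probability estimates, chosen purely for illustration.
model_1 = [0.99, 0.75]
model_2 = [0.81, 0.82]

def likelihood(p, y):
    """Product over cases of P_i^y_i * (1 - P_i)^(1 - y_i)."""
    return math.prod(pi if yi == 1 else 1 - pi for pi, yi in zip(p, y))

def accepted(p):
    """Number of applicants scoring at or above the cut-off."""
    return sum(pi >= CUTOFF for pi in p)

# Model 1 has the higher likelihood, yet rejects one of the good payers;
# Model 2 accepts both, so it makes the better lending decisions here.
print(likelihood(model_1, y), accepted(model_1))   # ~0.7425, 1 accepted
print(likelihood(model_2, y), accepted(model_2))   # ~0.6642, 2 accepted
```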

A Genetic Algorithm (GA) is a data-driven, non-parametric heuristic search process, where the training algorithm can be chosen to optimise a wide range of objective functions. Because the training algorithm is guided only by the performance of competing solutions, GAs have the potential to generate models that outperform other approaches to credit scoring in situations where the objective function that a user wishes to optimise differs from that used within the modelling process.

Previous studies where GAs have been used to develop credit scoring models have reported mixed findings. Fogarty and Ireson (1993/4) took a sample of over fifty thousand accepted credit card applications and compared a GA derived Bayesian classifier with decision rules derived from a number of techniques including a nearest neighbour clustering algorithm, a decision tree and a simple Bayesian classifier. They found that the GA derived classifier performed better than other methods when assessed on classification rates, but did not perform better than a simple decision rule to classify all cases as good. Desai, Conway, Crook, and Overstreet (1997) looked at a three-way classification problem where accounts were classified as good, poor or bad payers. They reported that a GA approach was marginally better at classifying the worst accounts (bad payers) than linear discriminant analysis, logistic regression and a variety of neural network models, but did not perform as well when measured in terms of classification performance on good and poor paying accounts. Yobas, Crook, and Ross (2000) reported that while a GA derived model performed better than neural networks and decision trees on the development sample (no validation sample performance was available for the GA derived model), all three methods were outperformed by linear discriminant analysis. While the results and methodologies applied in these previous studies differ, one feature that they all have in common is that they only considered misclassification performance metrics for which the non-GA approaches used in the study were generally known to provide good levels of performance. It is, therefore, no surprise that a GA approach was not found to significantly outperform the alternative model development approaches examined.

In this paper, a GA approach is again explored, but incorporating a number of features that differentiate it from previous studies. First, the objective is primarily to determine the sensitivity of models developed using standard approaches to differences between modelling and business objectives. The actual performance of GA derived models is only a secondary consideration. Second, rather than simply judging performance of competing models on the basis of a single misclassification measure, model performance is assessed using several different criteria:

  • The maximisation of the GINI coefficient (a measure of the area under the receiver operating characteristic curve), which for a discrete population of n observations that fall into one of two classes and are ranked by model score, can be calculated using the Brown formula: $1-\sum_{i=2}^{n}[G(i)+G(i-1)][B(i)-B(i-1)]$, where G and B represent the cumulative proportions of cases falling into each class respectively.

  • The minimisation of the proportion of bads within the highest scoring x% of the population; that is, the number of bads scoring ⩾ c, where c is the cut-off score at or above which x% of the population scores. For the purposes of this study, values of x of 5, 10, 25 and 50 percent were considered.
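Both assessment measures can be sketched in code. The toy sample and the conventions below (scores ranked ascending for the Brown formula, goods labelled 1 and bads 0, arbitrary tie-breaking) are illustrative assumptions, not the paper's exact implementation:

```python
def gini_brown(scores, labels):
    """GINI coefficient via the Brown formula; labels: 1 = good, 0 = bad.
    Cases are ranked in ascending score order; ties broken arbitrarily."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0])
    n_good = sum(lab for _, lab in ranked)
    n_bad = len(ranked) - n_good
    total, G, B = 0.0, 0.0, 0.0
    for _, lab in ranked:
        prev_G, prev_B = G, B
        G += (lab == 1) / n_good   # cumulative proportion of goods
        B += (lab == 0) / n_bad    # cumulative proportion of bads
        total += (G + prev_G) * (B - prev_B)
    return 1.0 - total

def bad_rate_top(scores, labels, x):
    """Proportion of bads among the highest-scoring fraction x of cases."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    k = max(1, round(len(ranked) * x))
    return sum(lab == 0 for _, lab in ranked[:k]) / k

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # higher score = better risk
labels = [1, 1, 1, 0, 0, 0]               # a perfectly ranked toy sample
print(gini_brown(scores, labels))          # 1.0 for a perfect ranking
print(bad_rate_top(scores, labels, 0.5))   # 0.0: no bads in the top half
```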

In each case a GA was applied to generate a scoring model that optimised each objective independently, whereas a single competing model was constructed and assessed using each competitor approach. Third, two large real world data sets are used, whereas previous studies have been based on relatively low dimensional data sets and small samples (with the exception of Fogarty and Ireson’s study). Fourth, solutions are considered with and without seeding – the process whereby a genetic algorithm is initialised using a number of pre-existing solutions found using some alternative technique. The GA is then applied in an attempt to improve upon the performance of the original seed solution(s).

Empirical results are presented for the two data sets, with the performance of the GA derived models compared to models constructed using logistic regression and multiple OLS regression.

Section snippets

Overview of genetic algorithms

The theory of GAs was developed in the late 1960s and early 1970s by John Holland and his associates as a means to study evolutionary processes in nature (Holland, 1975), but they were quickly adopted as a heuristic approach applicable to a wide range of optimisation problems (De Jong, 1975, Hollstien, 1971). The general principles of GAs are analogous to Darwinian principles of natural selection and survival of the fittest, and the terminology employed to describe GA training and selection is

Design and implementation of genetic algorithms

A number of parameters need to be selected for GA training, and as with methods such as neural networks, the parameters that deliver the best solution tend to be problem specific. Consequently, a good deal of trial and error can be required to find the most appropriate training parameters. The first question is how to encode solutions to a given problem? The approach favoured by Holland (1975) is to use binary strings. For example, if the objective is to optimise some function of two
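A minimal sketch of a binary-string GA of the kind described, applied to a toy two-variable objective. The population size, crossover and mutation rates, and the objective function here are all illustrative assumptions, not the settings used in the study:

```python
import random

random.seed(0)

BITS, N_VARS = 8, 2            # 8 bits per variable, two variables
POP, GENS = 30, 50             # population size, generations
P_CROSS, P_MUT = 0.7, 0.01     # crossover and per-bit mutation rates

def decode(chrom):
    """Map each 8-bit slice of the chromosome to a real value in [0, 1]."""
    return [int("".join(map(str, chrom[v * BITS:(v + 1) * BITS])), 2)
            / (2 ** BITS - 1) for v in range(N_VARS)]

def fitness(chrom):
    """Toy objective: maximise f(x, y) = 1 - (x - 0.5)^2 - (y - 0.5)^2."""
    x, y = decode(chrom)
    return 1 - (x - 0.5) ** 2 - (y - 0.5) ** 2

def tournament(pop):
    """Binary tournament selection: the fitter of two random individuals."""
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

pop = [[random.randint(0, 1) for _ in range(BITS * N_VARS)]
       for _ in range(POP)]
for _ in range(GENS):
    new_pop = []
    while len(new_pop) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < P_CROSS:           # one-point crossover
            cut = random.randrange(1, BITS * N_VARS)
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for child in (p1, p2):                  # bit-flip mutation
            new_pop.append([1 - g if random.random() < P_MUT else g
                            for g in child])
    pop = new_pop[:POP]

best = max(pop, key=fitness)
print(decode(best), fitness(best))   # should approach (0.5, 0.5) and 1.0
```

Note that nothing in the loop assumes the objective is differentiable or even smooth, which is what makes the approach attractive when the business objective differs from a standard statistical loss.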

Data

Two data sets were available for study. The first data set (Set A) was supplied by Experian UK and contained details of credit applications made between April and June, 2002, and for which performance information was attached 12 months after the application date. After removal of outliers and indeterminates the sample contained 88,792 observations of which 75,528 were classified as good credit risks and 13,264 as bad credit risks. Goods were classified as no more than 1 month in arrears, bads

Methodology

The goal of the exercise was to apply a GA to create linear scoring functions of the form Y = B^T X, where X is a column vector of independent variables and B a column vector of parameter coefficients. It should be noted that the resulting score, Y, should not be interpreted as an estimate of individual performance, but merely as the relative score of each observation within the dataset. Although other types of scoring function exist, such as a multi-layer perceptron, a linear function of the
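The scoring function itself is simple to sketch. The coefficient values and applicant records below are hypothetical, chosen only to show that Y = B^T X yields a relative ranking rather than a probability:

```python
# Minimal sketch of a linear scoring function Y = B^T X.
# Coefficient values are hypothetical, purely for illustration.
B = [0.4, -1.2, 0.05]                     # one coefficient per variable

def score(x):
    """Relative score of one observation; not a probability estimate."""
    return sum(b * xi for b, xi in zip(B, x))

# Hypothetical applicants described by three independent variables.
applicants = [[1.0, 0.0, 35.0], [0.0, 1.0, 22.0]]
ranked = sorted(applicants, key=score, reverse=True)
print([score(a) for a in ranked])   # only the ordering is meaningful
```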

Results

The performance of the GA derived models produced without seeding, and that of the competitor models are shown in Table 2. The performance of the GA derived models produced with seeding and that of the competitor models are shown in Table 3. The best performing model for each measure is highlighted in bold.

The first point to note from Table 2, Table 3 is how similar the performance of many of the competing models was across all performance measures for both data sets. It is also the case that

Concluding remarks

In this paper, the suitability of the modelling objectives used to create credit scoring models has been called into question, given that they differ from the business measures that are widely used to assess model performance. To explore this issue, genetic algorithms were used to create a set of linear scoring models that directly optimised individual measures of business interest. In all cases, there were no significant differences between the performance of the GA derived models and models

Acknowledgements

The author would like to thank the ESRC and Experian UK for their support for this research, and Professor Robert Fildes of Lancaster University for his comments on an early draft of the paper. The author is also grateful for the contribution from a third organisation that has requested that its identity remains anonymous.

References (30)

  • R.K. Ahuja et al. (2000). A greedy genetic algorithm for the quadratic assignment problem. Computers & Operations Research.
  • J.N. Crook et al. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research.
  • C.R. Reeves (1995). A genetic algorithm for flowshop sequencing. Computers & Operations Research.
  • R.K. Ahuja et al. (1997). Developing fitter genetic algorithms. INFORMS Journal on Computing.
  • B. Baesens et al. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society.
  • D.A. Coley (1999). An introduction to genetic algorithms for scientists and engineers.
  • K.A. De Jong (1975). An analysis of the behavior of a class of genetic adaptive systems.
  • V.S. Desai et al. (1997). Credit-scoring models in the credit union environment using neural networks and genetic algorithms. IMA Journal of Mathematics Applied in Business and Industry.
  • S. Finlay (2008). The management of consumer credit: Theory and practice.
  • T.C. Fogarty et al. (1993/4). Evolving Bayesian classifiers for credit control – A comparison with other machine learning methods. IMA Journal of Mathematics Applied in Business and Industry.
  • J. Fox (2000). Nonparametric simple regression.
  • D.E. Goldberg (1989). Genetic algorithms in search optimization & machine learning.
  • Goldberg, D. E. (1989b). Sizing populations for serial and parallel genetic algorithms. In J. D. Schaffer (Ed.),...
  • D.J. Hand (2005). Good practice in retail credit scorecard assessment. Journal of the Operational Research Society.
  • D.J. Hand et al. (2000). Defining attributes for scorecard construction in credit scoring. Journal of Applied Statistics.