Stochastics and Statistics
Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research
Introduction
Credit scoring is concerned with developing empirical models to support decision making in the retail credit business (Crook, Edelman, & Thomas, 2007). This sector is of considerable economic importance. For example, the volume of consumer loans held by banks in the US was $1132bn in 2013, compared to $1541bn in the corporate business. In the UK, loans and mortgages to individuals even exceeded corporate loans in 2012 (£11,676 m vs. £10,388 m). These figures indicate that financial institutions require formal tools to inform lending decisions.
A credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future. In application scoring, for example, lenders employ predictive models, called scorecards, to estimate how likely an applicant is to default. Such PD (probability of default) scorecards are routinely developed using classification algorithms (e.g., Hand & Henley, 1997). Many studies have examined the accuracy of alternative classifiers. One of the most comprehensive classifier comparisons to date is the benchmarking study of Baesens et al. (2003).
Despite much research, we argue that the credit scoring literature does not reflect several recent advancements in predictive learning. For example, the development of selective multiple classifier systems that pool different algorithms and optimize their weighting through heuristic search represents an important trend in machine learning (e.g., Partalas, Tsoumakas, & Vlahavas, 2010). Yet, no attempt has been made to systematically examine the potential of such an approach for credit scoring. More generally, recent advancements concern three dimensions: (i) novel classification algorithms to develop scorecards (e.g., extreme learning machines, rotation forest, etc.), (ii) novel performance measures to assess scorecards (e.g., the H-measure or the partial Gini coefficient), and (iii) statistical hypothesis tests to compare scorecard performance (e.g., García, Fernández, Luengo, & Herrera, 2010). An analysis of the PD modeling literature confirms that these developments have received little attention in credit scoring, and reveals further limitations of previous studies, namely (i) using few and/or small data sets, (ii) not comparing different state-of-the-art classifiers to each other, and (iii) using only a small set of conceptually similar accuracy indicators. We elaborate on these issues in Section 2.
The above research gaps warrant an update of Baesens et al. (2003). Therefore, the motivation of this paper is to provide a holistic view of the state-of-the-art in predictive modeling and how it can support decision making in the retail credit business. In pursuing this objective, we make the following contributions. First, we perform a large-scale benchmark of 41 classification methods across eight credit scoring data sets. Several of the classifiers are new to the community and are assessed in credit scoring for the first time. Second, using the principles of cost-sensitive learning, we shed light on the link between the (statistical) accuracy of scorecard predictions and the business value of a scorecard. This offers some guidance on whether deploying advanced, more accurate classification models is economically sensible. Third, we examine the correspondence between empirical results obtained using different accuracy indicators. In particular, we clarify the reliability of scorecard comparisons in the light of recently identified limitations of the area under a receiver operating characteristics curve (Hand, 2009; Hand & Anagnostopoulos, 2013). Finally, we illustrate the use of advanced nonparametric testing procedures to secure empirical findings and, thereby, offer guidance on how to organize future classifier comparisons.
In the remainder of the paper we first review related work in Section 2. We then summarize the classifiers that we compare (Section 3) and describe our experimental design (Section 4). Next, we discuss empirical results (Section 5). Section 6 concludes the paper. The online appendix provides a detailed description of the classification algorithms and additional results.
Literature review
Much literature explores the development, application, and evaluation of predictive decision support models in the credit industry (see Crook et al., 2007; Kumar & Ravi, 2007 for reviews). Such models estimate credit worthiness based on a set of explanatory variables. Corporate risk models employ data from balance sheets, financial ratios, or macro-economic indicators, whereas retail models use data from application forms, customer demographics, and transactional data from the customer
Classification algorithms for scorecard construction
We illustrate the development of a credit scorecard in the context of application scoring. Let $x = (x_1, \ldots, x_m) \in \mathbb{R}^m$ be an m-dimensional vector with application characteristics and let $y \in \{-1, +1\}$ be a binary variable that distinguishes good ($y = -1$) and bad ($y = +1$) loans. A scorecard estimates the (posterior) probability $p(+ \mid x_i)$ that a default event will be observed for loan i, where $p(+ \mid x)$ is a shorthand form of $p(y = +1 \mid x)$. To decide on an application, a credit analyst compares the estimated default
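As a minimal sketch of such a decision rule, the following Python snippet scores one applicant with a logistic model and compares the estimated default probability with a cut-off. The logistic form, the coefficients, the example applicant, and the threshold `tau` are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math


def default_probability(x, weights, bias):
    """Logistic estimate of the probability that an applicant defaults."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def decide(x, weights, bias, tau):
    """Reject the application if the estimated default probability exceeds tau."""
    p = default_probability(x, weights, bias)
    return ("reject" if p > tau else "accept"), p


# Hypothetical scorecard with m = 3 application characteristics.
weights = [0.8, -1.2, 0.5]
bias = -1.0
applicant = [1.0, 0.5, 2.0]

# tau = 0.5 is an assumed cut-off, not a recommended value.
decision, p = decide(applicant, weights, bias, tau=0.5)
print(decision, round(p, 3))
```

In practice the cut-off would typically reflect the relative cost of accepting a bad loan versus rejecting a good one, so a lender could well choose a value other than 0.5.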
Credit scoring data sets
The empirical evaluation includes eight retail credit scoring data sets. The data sets Australian credit (AC) and German credit (GC) from the UCI Library (Lichman, 2013) and the Th02 data set from Thomas, Edelman, and Crook (2002) have been used in several previous papers (see Section 2). Three other data sets, Bene-1, Bene-2, and UK, also used in Baesens et al. (2003), were collected from major financial institutions in the Benelux and UK, respectively. Note that our data set UK encompasses
Empirical results
The empirical results consist of performance estimates of the 41 classifiers across the eight credit scoring data sets in terms of the six performance measures. Interested readers can find these raw results in Tables A.2–A.7 in the online appendix. Below, we report aggregated results.
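One common way to aggregate a classifier-by-data-set table of this kind is to rank the classifiers on each data set, average the ranks, and compute the Friedman statistic. The sketch below assumes this approach for illustration only: the text above does not state which aggregation or test is applied, the AUC values are made up, and the statistic is computed without a tie correction.

```python
def ranks_desc(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


def friedman(results):
    """results[d][c]: performance of classifier c on data set d (higher is better)."""
    n, k = len(results), len(results[0])
    per_dataset = [ranks_desc(row) for row in results]
    avg_ranks = [sum(r[c] for r in per_dataset) / n for c in range(k)]
    # Friedman chi-square statistic (no tie correction).
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    return avg_ranks, chi2


# Made-up AUC values: 4 data sets (rows) x 3 classifiers (columns).
auc = [
    [0.78, 0.80, 0.83],
    [0.70, 0.74, 0.73],
    [0.88, 0.90, 0.92],
    [0.65, 0.66, 0.69],
]
avg_ranks, chi2 = friedman(auc)
print(avg_ranks, round(chi2, 3))
```

Under the null hypothesis of equal performance, the statistic is approximately chi-square distributed with k - 1 degrees of freedom; the p-value is not computed here.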
Conclusions
We set out to update Baesens et al. (2003) and to explore the relative effectiveness of alternative classification algorithms in retail credit scoring. To that end, we compared 41 classifiers in terms of six performance measures across eight real-world credit scoring data sets. Our results suggest that several classifiers predict credit risk significantly more accurately than the industry standard LR. Heterogeneous ensemble classifiers perform especially well. We also provide some evidence
Acknowledgements
We thank Immanuel Bomze for his efforts, support, and advice during the handling of the paper. We are also grateful to five anonymous referees for constructive comments and valuable suggestions on how to further improve the paper.
References (94)
Genetic programming for credit scoring: The case of Egyptian public sector banks
Expert Systems with Applications (2009)
Neural nets versus conventional techniques in credit scoring in Egyptian banking
Expert Systems with Applications (2008)
Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring
Expert Systems with Applications (2014)
An empirical comparison of conventional techniques, neural networks and the three stage hybrid Adaptive Neuro Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data
European Journal of Operational Research (2012)
Support vector machines for credit scoring and discovery of significant features
Expert Systems with Applications (2009)
An experimental comparison of classification algorithms for imbalanced credit scoring data sets
Expert Systems with Applications (2012)
Downturn loss given default: Mixture distribution estimation
European Journal of Operational Research (2014)
Mining the customer credit using hybrid support vector machine technique
Expert Systems with Applications (2009)
The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing
European Journal of Operational Research (2006)
Recent developments in consumer credit risk assessment
European Journal of Operational Research (2007)
An Akaike information criterion for multiple event mixture cure models
European Journal of Operational Research
An introduction to ROC analysis
Pattern Recognition Letters
Multiple classifier architectures and their application to credit risk assessment
European Journal of Operational Research
Stochastic gradient boosting
Computational Statistics & Data Analysis
Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power
Information Sciences
A Kolmogorov–Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction
Expert Systems with Applications
When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance?
Pattern Recognition Letters
Optimal bipartite scorecards
Expert Systems with Applications
Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method
Expert Systems with Applications
Adapting a classification rule to local and global shift when only unlabelled data are available
European Journal of Operational Research
A data driven ensemble classifier for credit scoring analysis
Expert Systems with Applications
Credit scoring with a data mining approach based on support vector machines
Expert Systems with Applications
Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem
Nonlinear Analysis: Real World Applications
Consumer credit risk: Individual probability estimates using machine learning
Expert Systems with Applications
A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines
Expert Systems with Applications
Mining the customer credit using classification and regression tree and multivariate adaptive regression splines
Computational Statistics & Data Analysis
An evolution strategy-based multiple kernels multi-criteria programming approach: The case of credit decision making
Decision Support Systems
The evaluation of consumer loans using support vector machines
Expert Systems with Applications
Relevance vector machine based infinite decision agent ensemble learning for credit risk analysis
Expert Systems with Applications
Identifying future defaulters: A hierarchical Bayesian method
European Journal of Operational Research
Evaluating consumer loans using neural networks
Omega
Exploring the behaviour of base classifiers in credit scoring ensembles
Expert Systems with Applications
Two-level classifier ensembles for credit risk assessment
Expert Systems with Applications
An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring
Expert Systems with Applications
Building credit scoring models using genetic programming
Expert Systems with Applications
Subagging for credit scoring models
European Journal of Operational Research
Pruning an ensemble of classifiers via reinforcement learning
Neurocomputing
Neighborhood rough set and SVM based hybrid credit scoring classifier
Expert Systems with Applications
Incorporating domain knowledge into data mining classifiers: An application in indirect lending
Decision Support Systems
Modelling the profitability of credit cards by Markov decision processes
European Journal of Operational Research
Consumer credit scoring models with limited data
Expert Systems with Applications
Mixture cure models in credit scoring: if and when borrowers default
European Journal of Operational Research
Combining cluster analysis with classifier ensembles to predict financial distress
Information Fusion
Using neural network ensembles for bankruptcy prediction and credit scoring
Expert Systems with Applications
The consumer loan default predicting model—An application of DEA-DA and neural network
Expert Systems with Applications
Multiple classifier application to credit risk assessment
Expert Systems with Applications
New insights into churn prediction in the telecommunication sector: A profit driven data mining approach
European Journal of Operational Research