Stochastics and Statistics
Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research

https://doi.org/10.1016/j.ejor.2015.05.030

Highlights

  • Large-scale benchmark of 41 classifiers across eight real-world credit scoring data sets.

  • Introduction of ensemble selection routines to the credit scoring community.

  • Analysis of six established and novel indicators to measure scorecard accuracy.

  • Assessment of the financial impact of different scorecards.

Abstract

Many years have passed since Baesens et al. published their benchmarking study of classification algorithms in credit scoring [Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627–635.]. The interest in prediction methods for scorecard development is unbroken. However, there have been several advancements including novel learning methods, performance measures and techniques to reliably compare different classifiers, which the credit scoring literature does not reflect. To close these research gaps, we update the study of Baesens et al. and compare several novel classification algorithms to the state-of-the-art in credit scoring. In addition, we examine the extent to which the assessment of alternative scorecards differs across established and novel indicators of predictive accuracy. Finally, we explore whether more accurate classifiers are managerially meaningful. Our study provides valuable insight for professionals and academics in credit scoring. It helps practitioners to stay abreast of technical advancements in predictive modeling. From an academic point of view, the study provides an independent assessment of recent scoring methods and offers a new baseline to which future approaches can be compared.

Introduction

Credit scoring is concerned with developing empirical models to support decision making in the retail credit business (Crook, Edelman, & Thomas, 2007). This sector is of considerable economic importance. For example, the volume of consumer loans held by banks in the US was $1132bn in 2013, compared to $1541bn in the corporate business. In the UK, loans and mortgages to individuals were even higher than corporate loans in 2012 (£11,676 m cf. £10,388 m). These figures indicate that financial institutions require formal tools to inform lending decisions.

A credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future. In application scoring, for example, lenders employ predictive models, called scorecards, to estimate how likely an applicant is to default. Such PD (probability of default) scorecards are routinely developed using classification algorithms (e.g., Hand & Henley, 1997). Many studies have examined the accuracy of alternative classifiers. One of the most comprehensive classifier comparisons to date is the benchmarking study of Baesens et al. (2003).

Despite much research, we argue that the credit scoring literature does not reflect several recent advancements in predictive learning. For example, the development of selective multiple classifier systems that pool different algorithms and optimize their weighting through heuristic search represents an important trend in machine learning (e.g., Partalas, Tsoumakas, & Vlahavas, 2010). Yet, no attempt has been made to systematically examine the potential of such approaches for credit scoring. More generally, recent advancements concern three dimensions: (i) novel classification algorithms to develop scorecards (e.g., extreme learning machines, rotation forest, etc.), (ii) novel performance measures to assess scorecards (e.g., the H-measure or the partial Gini coefficient), and (iii) statistical hypothesis tests to compare scorecard performance (e.g., García, Fernández, Luengo, & Herrera, 2010). An analysis of the PD modeling literature confirms that these developments have received little attention in credit scoring, and reveals further limitations of previous studies; namely (i) using few and/or small data sets, (ii) not comparing different state-of-the-art classifiers to each other, and (iii) using only a small set of conceptually similar accuracy indicators. We elaborate on these issues in Section 2.

The above research gaps warrant an update of Baesens et al. (2003). Therefore, the motivation of this paper is to provide a holistic view of the state-of-the-art in predictive modeling and how it can support decision making in the retail credit business. In pursuing this objective, we make the following contributions: first, we perform a large-scale benchmark of 41 classification methods across eight credit scoring data sets. Several of the classifiers are new to the community and are assessed in credit scoring for the first time. Second, using the principles of cost-sensitive learning, we shed light on the link between the (statistical) accuracy of scorecard predictions and the business value of a scorecard. This offers some guidance on whether deploying advanced (more accurate) classification models is economically sensible. Third, we examine the correspondence between empirical results obtained using different accuracy indicators. In particular, we clarify the reliability of scorecard comparisons in the light of recently identified limitations of the area under a receiver operating characteristics curve (Hand, 2009; Hand & Anagnostopoulos, 2013). Finally, we illustrate the use of advanced nonparametric testing procedures to secure empirical findings and, thereby, offer guidance on how to organize future classifier comparisons.
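For readers less familiar with the accuracy indicators named above, the following minimal sketch shows one standard way to compute the area under the ROC curve: as the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan, with ties counted as one half. The Gini coefficient then follows as 2·AUC − 1. The function names and the score values are hypothetical and chosen for illustration only; this is not the evaluation code used in the study.

```python
def auc(scores_bad, scores_good):
    """Share of (bad, good) pairs in which the bad loan has the higher score.

    Ties contribute one half. Scores are assumed to be estimated default
    probabilities, so a higher score should indicate a riskier loan.
    """
    wins = 0.0
    for sb in scores_bad:
        for sg in scores_good:
            if sb > sg:
                wins += 1.0
            elif sb == sg:
                wins += 0.5
    return wins / (len(scores_bad) * len(scores_good))


def gini(auc_value):
    """Gini coefficient as a linear rescaling of the AUC."""
    return 2.0 * auc_value - 1.0


# Hypothetical scores, for illustration only.
bad = [0.9, 0.7, 0.4]
good = [0.1, 0.3, 0.5, 0.2]

a = auc(bad, good)   # 11 of 12 pairs are ordered correctly
print(round(a, 4), round(gini(a), 4))
```

The pairwise loop is quadratic in the number of loans and is meant to make the definition visible; a rank-based computation would be preferable on large data sets.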

In the remainder of the paper, we first review related work in Section 2. We then summarize the classifiers that we compare (Section 3) and describe our experimental design (Section 4). Next, we discuss empirical results (Section 5). Section 6 concludes the paper. The online appendix provides a detailed description of the classification algorithms and additional results.

Section snippets

Literature review

Much literature explores the development, application, and evaluation of predictive decision support models in the credit industry (see Crook et al., 2007; Kumar & Ravi, 2007, for reviews). Such models estimate credit worthiness based on a set of explanatory variables. Corporate risk models employ data from balance sheets, financial ratios, or macro-economic indicators, whereas retail models use data from application forms, customer demographics, and transactional data from the customer

Classification algorithms for scorecard construction

We illustrate the development of a credit scorecard in the context of application scoring. Let $x=(x_1,x_2,\ldots,x_m)\in\mathbb{R}^m$ be an $m$-dimensional vector with application characteristics and let $y\in\{-1,+1\}$ be a binary variable that distinguishes good ($y=-1$) and bad loans ($y=+1$). A scorecard estimates the (posterior) probability $p(+|x_i)$ that a default event will be observed for loan $i$, where $p(+|x)$ is a shorthand form of $p(y=+1|x)$. To decide on an application, a credit analyst compares the estimated default
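The snippet above is cut off before the decision rule is stated. As a minimal sketch, assuming the estimated default probability is compared against a fixed cutoff, a logistic-regression-style scorecard could look as follows. The weights, bias, cutoff, and applicant values are hypothetical, and the logistic form is only one of many classifiers that can produce p(+|x).

```python
import math


def default_probability(x, weights, bias):
    """Estimate p(+|x) with a logistic model: sigmoid(bias + w . x)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def decide(p_default, cutoff):
    """Assumed decision rule: reject when the estimated risk exceeds the cutoff."""
    return "reject" if p_default > cutoff else "accept"


# Hypothetical scorecard parameters and applicant, for illustration only.
WEIGHTS = [0.8, -1.2]
BIAS = -0.5
CUTOFF = 0.5

applicant = [1.0, 2.0]
p = default_probability(applicant, WEIGHTS, BIAS)  # roughly 0.109
print(round(p, 3), decide(p, CUTOFF))
```

In practice the cutoff would depend on the lender's misclassification costs rather than being fixed at 0.5.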

Credit scoring data sets

The empirical evaluation includes eight retail credit scoring data sets. The data sets Australian credit (AC) and German credit (GC) from the UCI Library (Lichman, 2013) and the Th02 data set from Thomas, Edelman, and Crook (2002) have been used in several previous papers (see Section 2). Three other data sets, Bene-1, Bene-2, and UK, also used in Baesens et al. (2003), were collected from major financial institutions in the Benelux and UK, respectively. Note that our data set UK encompasses

Empirical results

The empirical results consist of performance estimates of the 41 classifiers across the eight credit scoring data sets in terms of the six performance measures. Interested readers can find these raw results in Tables A.2–A.7 in the online appendix. Below, we report aggregated results.
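The snippet does not say how the raw results are aggregated. One common approach in classifier comparisons, sketched here as an assumption rather than as the procedure used in the study, is to rank the classifiers on each data set, average the ranks, and compute the Friedman statistic as a first nonparametric check for differences. The performance values below are hypothetical, and the statistic is shown without a tie correction.

```python
def ranks_desc(values):
    """Rank values so the largest gets rank 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def average_ranks(results):
    """results[d][c] is the performance of classifier c on data set d (higher is better)."""
    n = len(results)
    k = len(results[0])
    per_dataset = [ranks_desc(row) for row in results]
    return [sum(r[c] for r in per_dataset) / n for c in range(k)]


def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic from average ranks (no tie correction)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Hypothetical AUC values: 4 data sets (rows) x 3 classifiers (columns).
results = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.80, 0.72],
    [0.85, 0.81, 0.70],
]

avg = average_ranks(results)                  # [1.25, 1.75, 3.0]
stat = friedman_statistic(avg, len(results))  # 6.5
print(avg, stat)
```

Under the null hypothesis of equal performance, the statistic is approximately chi-square distributed with k − 1 degrees of freedom; with as few data sets as in this toy example, that approximation is rough.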

Conclusions

We set out to update Baesens et al. (2003) and to explore the relative effectiveness of alternative classification algorithms in retail credit scoring. To that end, we compared 41 classifiers in terms of six performance measures across eight real-world credit scoring data sets. Our results suggest that several classifiers predict credit risk significantly more accurately than the industry standard LR. Heterogeneous ensemble classifiers perform especially well. We also provide some evidence

Acknowledgements

We thank Immanuel Bomze for his efforts, support, and advice during the handling of the paper. We are also grateful to five anonymous referees for constructive comments and valuable suggestions on how to further improve the paper.

References (94)

  • Dirick, L. et al. (2015). An Akaike information criterion for multiple event mixture cure models. European Journal of Operational Research.
  • Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters.
  • Finlay, S. (2011). Multiple classifier architectures and their application to credit risk assessment. European Journal of Operational Research.
  • Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis.
  • García, S. et al. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences.
  • Gong, R. et al. (2012). A Kolmogorov–Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction. Expert Systems with Applications.
  • Hand, D.J. et al. (2013). When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance? Pattern Recognition Letters.
  • Hand, D.J. et al. (2005). Optimal bipartite scorecards. Expert Systems with Applications.
  • Hens, A.B. et al. (2012). Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications.
  • Hofer, V. (2015). Adapting a classification rule to local and global shift when only unlabelled data are available. European Journal of Operational Research.
  • Hsieh, N.-C. et al. (2010). A data driven ensemble classifier for credit scoring analysis. Expert Systems with Applications.
  • Huang, C.-L. et al. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications.
  • Huang, Y.-M. et al. (2006). Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications.
  • Kruppa, J. et al. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications.
  • Lee, T.-S. et al. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications.
  • Lee, T.-S. et al. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis.
  • Li, J. et al. (2011). An evolution strategy-based multiple kernels multi-criteria programming approach: The case of credit decision making. Decision Support Systems.
  • Li, S.-T. et al. (2006). The evaluation of consumer loans using support vector machines. Expert Systems with Applications.
  • Li, S. et al. (2012). Relevance vector machine based infinite decision agent ensemble learning for credit risk analysis. Expert Systems with Applications.
  • Liu, F. et al. (2015). Identifying future defaulters: A hierarchical Bayesian method. European Journal of Operational Research.
  • Malhotra, R. et al. (2003). Evaluating consumer loans using neural networks. Omega.
  • Marqués, A.I. et al. (2012). Exploring the behaviour of base classifiers in credit scoring ensembles. Expert Systems with Applications.
  • Marqués, A.I. et al. (2012). Two-level classifier ensembles for credit risk assessment. Expert Systems with Applications.
  • Nanni, L. et al. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications.
  • Ong, C.-S. et al. (2005). Building credit scoring models using genetic programming. Expert Systems with Applications.
  • Paleologo, G. et al. (2010). Subagging for credit scoring models. European Journal of Operational Research.
  • Partalas, I. et al. (2009). Pruning an ensemble of classifiers via reinforcement learning. Neurocomputing.
  • Ping, Y. et al. (2011). Neighborhood rough set and SVM based hybrid credit scoring classifier. Expert Systems with Applications.
  • Sinha, A.P. et al. (2008). Incorporating domain knowledge into data mining classifiers: An application in indirect lending. Decision Support Systems.
  • So, M.M.C. et al. (2011). Modelling the profitability of credit cards by Markov decision processes. European Journal of Operational Research.
  • Šušteršič, M. et al. (2009). Consumer credit scoring models with limited data. Expert Systems with Applications.
  • Tong, E.N.C. et al. (2012). Mixture cure models in credit scoring: if and when borrowers default. European Journal of Operational Research.
  • Tsai, C.-F. (2014). Combining cluster analysis with classifier ensembles to predict financial distress. Information Fusion.
  • Tsai, C.-F. et al. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications.
  • Tsai, M.-C. et al. (2009). The consumer loan default predicting model – An application of DEA-DA and neural network. Expert Systems with Applications.
  • Twala, B. (2010). Multiple classifier application to credit risk assessment. Expert Systems with Applications.
  • Verbeke, W. et al. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research.