Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research

doi:10.1016/j.ejor.2015.05.030

European Journal of Operational Research

Volume 247, Issue 1, 16 November 2015, Pages 124-136

https://doi.org/10.1016/j.ejor.2015.05.030

Highlights

• Large-scale benchmark of 41 classifiers across eight real-world credit scoring data sets.
• Introduction of ensemble selection routines to the credit scoring community.
• Analysis of six established and novel indicators to measure scorecard accuracy.
• Assessment of the financial impact of different scorecards.

Abstract

Many years have passed since Baesens et al. published their benchmarking study of classification algorithms in credit scoring [Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627–635.]. The interest in prediction methods for scorecard development is unbroken. However, there have been several advancements, including novel learning methods, performance measures, and techniques to reliably compare different classifiers, which the credit scoring literature does not reflect. To close these research gaps, we update the study of Baesens et al. and compare several novel classification algorithms to the state-of-the-art in credit scoring. In addition, we examine the extent to which the assessment of alternative scorecards differs across established and novel indicators of predictive accuracy. Finally, we explore whether more accurate classifiers are managerially meaningful. Our study provides valuable insight for professionals and academics in credit scoring. It helps practitioners to stay abreast of technical advancements in predictive modeling. From an academic point of view, the study provides an independent assessment of recent scoring methods and offers a new baseline to which future approaches can be compared.

Introduction

Credit scoring is concerned with developing empirical models to support decision making in the retail credit business (Crook, Edelman, & Thomas, 2007). This sector is of considerable economic importance. For example, the volume of consumer loans held by banks in the US was $1132bn in 2013, compared to $1541bn in the corporate business.¹ In the UK, loans and mortgages to individuals were even higher than corporate loans in 2012 (£11,676m cf. £10,388m).² These figures indicate that financial institutions require formal tools to inform lending decisions.

A credit score is a model-based estimate of the probability that a borrower will show some undesirable behavior in the future. In application scoring, for example, lenders employ predictive models, called scorecards, to estimate how likely an applicant is to default. Such PD (probability of default) scorecards are routinely developed using classification algorithms (e.g., Hand & Henley, 1997). Many studies have examined the accuracy of alternative classifiers. One of the most comprehensive classifier comparisons to date is the benchmarking study of Baesens et al. (2003).

Despite much research, we argue that the credit scoring literature does not reflect several recent advancements in predictive learning. For example, the development of selective multiple classifier systems that pool different algorithms and optimize their weighting through heuristic search represents an important trend in machine learning (e.g., Partalas, Tsoumakas, & Vlahavas, 2010). Yet, no attempt has been made to systematically examine the potential of such approaches for credit scoring. More generally, recent advancements concern three dimensions: (i) novel classification algorithms to develop scorecards (e.g., extreme learning machines, rotation forest, etc.), (ii) novel performance measures to assess scorecards (e.g., the H-measure or the partial Gini coefficient), and (iii) statistical hypothesis tests to compare scorecard performance (e.g., García, Fernández, Luengo, & Herrera, 2010). An analysis of the PD modeling literature confirms that these developments have received little attention in credit scoring, and reveals further limitations of previous studies, namely (i) using few and/or small data sets, (ii) not comparing different state-of-the-art classifiers to each other, and (iii) using only a small set of conceptually similar accuracy indicators. We elaborate on these issues in Section 2.
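To make the idea of a selective multiple classifier system concrete, the following Python sketch pools validation-set predictions from a small model library and grows an ensemble by greedy forward selection with replacement, so that repeated picks act as weights. This is only one simple heuristic search, chosen for illustration. The function names, the toy predictions, and the use of the AUC as selection criterion are assumptions and are not taken from the study.

```python
def auc(scores, labels):
    """AUC by pairwise comparison; labels are +1 (bad) or -1 (good)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    # A bad loan scored above a good loan counts as a win, a tie as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_selection(library, labels, rounds=10):
    """Greedy forward selection with replacement (illustrative heuristic).

    library maps a model name to its predicted default probabilities on a
    validation set. Returns the selected names and the ensemble's AUC.
    """
    selected = []
    total = [0.0] * len(labels)  # running sum of the selected predictions
    best_score = float("-inf")
    for _ in range(rounds):
        best_name, best_round = None, best_score
        for name, preds in library.items():
            # Average prediction of the ensemble if this model were added.
            candidate = [(t + p) / (len(selected) + 1)
                         for t, p in zip(total, preds)]
            score = auc(candidate, labels)
            if score > best_round:
                best_name, best_round = name, score
        if best_name is None:  # no model improves the validation AUC
            break
        selected.append(best_name)
        total = [t + p for t, p in zip(total, library[best_name])]
        best_score = best_round
    return selected, best_score


# Hypothetical validation data: +1 = bad loan, -1 = good loan.
labels = [1, 1, -1, -1, 1, -1]
library = {
    "lr":   [0.9, 0.4, 0.5, 0.2, 0.7, 0.3],
    "tree": [0.6, 0.8, 0.3, 0.6, 0.4, 0.1],
}
selected, score = ensemble_selection(library, labels)
print(selected, score)
```

On these toy numbers, the search first picks the best single model and then adds the second one because averaging the two separates bad from good loans better than either model alone.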

The above research gaps warrant an update of Baesens et al. (2003). Therefore, the motivation of this paper is to provide a holistic view of the state-of-the-art in predictive modeling and how it can support decision making in the retail credit business. In pursuing this objective, we make the following contributions: first, we perform a large-scale benchmark of 41 classification methods across eight credit scoring data sets. Several of the classifiers are new to the community and are assessed in credit scoring for the first time. Second, using the principles of cost-sensitive learning, we shed light on the link between the (statistical) accuracy of scorecard predictions and the business value of a scorecard. This offers some guidance on whether deploying advanced (more accurate) classification models is economically sensible. Third, we examine the correspondence between empirical results obtained using different accuracy indicators. In particular, we clarify the reliability of scorecard comparisons in the light of recently identified limitations of the area under a receiver operating characteristics curve (Hand, 2009; Hand & Anagnostopoulos, 2013). Finally, we illustrate the use of advanced nonparametric testing procedures to secure empirical findings and, thereby, offer guidance on how to organize future classifier comparisons.
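The link between statistical accuracy and business value can be illustrated with a small cost calculation. The Python sketch below computes the average misclassification cost per loan for two hypothetical scorecards. The 5:1 cost ratio, the predicted probabilities, and all names are assumptions made for illustration and do not reproduce the cost setting of the study. The threshold follows the standard cost-sensitive rule for calibrated probabilities: reject when $p(+|x)$ exceeds $c_{FP} / (c_{FP} + c_{FN})$.

```python
def misclassification_cost(probs, labels, threshold, cost_fn, cost_fp):
    """Average cost per loan when rejecting applicants with p(+|x) > threshold.

    cost_fn: cost of accepting a bad loan (false negative).
    cost_fp: cost of rejecting a good loan (false positive).
    Labels are +1 (bad) or -1 (good).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        predicted_bad = p > threshold
        if y == 1 and not predicted_bad:
            total += cost_fn  # bad loan was accepted
        elif y == -1 and predicted_bad:
            total += cost_fp  # good loan was rejected
    return total / len(labels)


# Hypothetical costs: accepting a bad loan is five times as costly
# as rejecting a good one.
cost_fn, cost_fp = 5.0, 1.0
threshold = cost_fp / (cost_fp + cost_fn)  # = 1/6

# Hypothetical outcomes and predictions of two scorecards.
labels = [1, -1, -1, 1, -1, -1]
probs_a = [0.6, 0.3, 0.1, 0.2, 0.05, 0.4]
probs_b = [0.7, 0.1, 0.1, 0.1, 0.05, 0.1]

cost_a = misclassification_cost(probs_a, labels, threshold, cost_fn, cost_fp)
cost_b = misclassification_cost(probs_b, labels, threshold, cost_fn, cost_fp)
print(cost_a, cost_b)
```

In this toy case scorecard A rejects two good applicants while scorecard B accepts one bad applicant, and under the assumed cost ratio the single accepted bad loan is the more expensive error.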

In the remainder of the paper we first review related work in Section 2. We then summarize the classifiers that we compare (Section 3) and describe our experimental design (Section 4). Next, we discuss empirical results (Section 5). Section 6 concludes the paper. The online appendix³ provides a detailed description of the classification algorithms and additional results.

Literature review

Much literature explores the development, application, and evaluation of predictive decision support models in the credit industry (see Crook et al., 2007; Kumar & Ravi, 2007 for reviews). Such models estimate creditworthiness based on a set of explanatory variables. Corporate risk models employ data from balance sheets, financial ratios, or macro-economic indicators, whereas retail models use data from application forms, customer demographics, and transactional data from the customer

Classification algorithms for scorecard construction

We illustrate the development of a credit scorecard in the context of application scoring. Let $x = (x_{1}, x_{2}, \dots, x_{m}) \in \mathbb{R}^{m}$ be an $m$-dimensional vector with application characteristics and let $y \in \{-1, +1\}$ be a binary variable that distinguishes good $(y = -1)$ and bad loans $(y = +1)$. A scorecard estimates the (posterior) probability $p(+ | x_{i})$ that a default event will be observed for loan $i$, where $p(+ | x)$ is a shorthand form of $p(y = +1 | x)$. To decide on an application, a credit analyst compares the estimated default
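The preview breaks off before the decision rule is stated. As a minimal sketch, assuming the usual rule of comparing the estimated default probability $p(+|x)$ with a threshold, and assuming a logistic model as one possible scorecard form, the notation above maps to code as follows. The coefficients, the applicant, and the threshold values are hypothetical.

```python
import math


def default_probability(x, weights, bias):
    """Estimate p(+|x) with a logistic model (one possible scorecard form)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def decide(x, weights, bias, threshold):
    """Return +1 (predicted bad, reject) if p(+|x) exceeds the threshold,
    otherwise -1 (predicted good, accept)."""
    return 1 if default_probability(x, weights, bias) > threshold else -1


# Hypothetical coefficients and applicant, for illustration only.
weights = [0.8, -1.2]
bias = -0.5
applicant = [1.0, 0.5]

p = default_probability(applicant, weights, bias)
label = decide(applicant, weights, bias, threshold=0.5)
print(round(p, 4), label)
```

Lowering the threshold makes the lender more conservative: the same applicant is accepted at a threshold of 0.5 but rejected at 0.4.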

Credit scoring data sets

The empirical evaluation includes eight retail credit scoring data sets. The data sets Australian credit (AC) and German credit (GC) from the UCI Library (Lichman, 2013) and the Th02 data set from Thomas, Edelman, and Crook (2002) have been used in several previous papers (see Section 2). Three other data sets, Bene-1, Bene-2, and UK, also used in Baesens et al. (2003), were collected from major financial institutions in the Benelux and UK, respectively. Note that our data set UK encompasses

Empirical results

The empirical results consist of performance estimates of the 41 classifiers across the eight credit scoring data sets in terms of the six performance measures. Interested readers can find these raw results in Tables A.2–A.7 in the online appendix.⁸ Below, we report aggregated results.
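The preview does not say how the raw results are aggregated. One common approach in classifier comparisons, assumed here purely for illustration, is to rank the classifiers on each data set, average the ranks, and compute the Friedman statistic as a nonparametric test of whether the rank differences exceed chance. The Python sketch below uses three hypothetical classifiers and four hypothetical data sets, and it omits a tie correction for the statistic.

```python
def average_ranks(results):
    """results maps classifier -> list of scores, one per data set
    (higher is better). Returns classifier -> mean rank (1 = best).
    Tied scores share the average of their rank positions."""
    names = list(results)
    n_sets = len(next(iter(results.values())))
    rank_sums = {name: 0.0 for name in names}
    for j in range(n_sets):
        ordered = sorted(names, key=lambda n: results[n][j], reverse=True)
        i = 0
        while i < len(ordered):
            k = i
            # Extend k over all classifiers tied with position i.
            while (k + 1 < len(ordered)
                   and results[ordered[k + 1]][j] == results[ordered[i]][j]):
                k += 1
            mean_rank = (i + k) / 2 + 1  # average 1-based rank of i..k
            for name in ordered[i:k + 1]:
                rank_sums[name] += mean_rank
            i = k + 1
    return {name: rank_sums[name] / n_sets for name in names}


def friedman_statistic(mean_ranks, n_sets):
    """Friedman chi-square statistic from the mean ranks of k classifiers
    over n_sets data sets (no tie correction)."""
    k = len(mean_ranks)
    sum_sq = sum(r * r for r in mean_ranks.values())
    return 12.0 * n_sets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


# Hypothetical AUC values of three classifiers on four data sets.
results = {
    "A": [0.80, 0.75, 0.90, 0.70],
    "B": [0.78, 0.77, 0.85, 0.68],
    "C": [0.70, 0.70, 0.80, 0.65],
}
ranks = average_ranks(results)
stat = friedman_statistic(ranks, n_sets=4)
print(ranks, stat)
```

The statistic is compared against a chi-square distribution with $k - 1$ degrees of freedom. A significant result would typically be followed by pairwise post-hoc tests.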

Conclusions

We set out to update Baesens et al. (2003) and to explore the relative effectiveness of alternative classification algorithms in retail credit scoring. To that end, we compared 41 classifiers in terms of six performance measures across eight real-world credit scoring data sets. Our results suggest that several classifiers predict credit risk significantly more accurately than the industry standard, logistic regression (LR). Heterogeneous ensemble classifiers perform especially well. We also provide some evidence

Acknowledgements

We thank Immanuel Bomze for his efforts, support, and advice during the handling of the paper. We are also grateful to five anonymous referees for constructive comments and valuable suggestions on how to further improve the paper.

References (94)

Abdou, H.A. (2009). Genetic programming for credit scoring: The case of Egyptian public sector banks. Expert Systems with Applications.
Abdou, H.A., et al. (2008). Neural nets versus conventional techniques in credit scoring in Egyptian banking. Expert Systems with Applications.
Abellán, J., et al. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications.
Akkoc, S. (2012). An empirical comparison of conventional techniques, neural networks and the three stage hybrid Adaptive Neuro Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data. European Journal of Operational Research.
Bellotti, T., et al. (2009). Support vector machines for credit scoring and discovery of significant features. Expert Systems with Applications.
Brown, I., et al. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications.
Calabrese, R. (2014). Downturn loss given default: Mixture distribution estimation. European Journal of Operational Research.
Chen, W., et al. (2009). Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications.
Crone, S.F., et al. (2006). The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. European Journal of Operational Research.
Crook, J.N., et al. (2007). Recent developments in consumer credit risk assessment. European Journal of Operational Research.
Dirick, L., et al. (2015). An Akaike information criterion for multiple event mixture cure models. European Journal of Operational Research.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters.
Finlay, S. (2011). Multiple classifier architectures and their application to credit risk assessment. European Journal of Operational Research.
Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis.
García, S., et al. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences.
Gong, R., et al. (2012). A Kolmogorov–Smirnov statistic based segmentation approach to learning from imbalanced datasets: With application in property refinance prediction. Expert Systems with Applications.
Hand, D.J., et al. (2013). When is the area under the receiver operating characteristic curve an appropriate measure of classifier performance? Pattern Recognition Letters.
Hand, D.J., et al. (2005). Optimal bipartite scorecards. Expert Systems with Applications.
Hens, A.B., et al. (2012). Computational time reduction for credit scoring: An integrated approach based on support vector machine and stratified sampling method. Expert Systems with Applications.
Hofer, V. (2015). Adapting a classification rule to local and global shift when only unlabelled data are available. European Journal of Operational Research.
Hsieh, N.-C., et al. (2010). A data driven ensemble classifier for credit scoring analysis. Expert Systems with Applications.
Huang, C.-L., et al. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications.
Huang, Y.-M., et al. (2006). Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Analysis: Real World Applications.
Kruppa, J., et al. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications.
Lee, T.-S., et al. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications.
Lee, T.-S., et al. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis.
Li, J., et al. (2011). An evolution strategy-based multiple kernels multi-criteria programming approach: The case of credit decision making. Decision Support Systems.
Li, S.-T., et al. (2006). The evaluation of consumer loans using support vector machines. Expert Systems with Applications.
Li, S., et al. (2012). Relevance vector machine based infinite decision agent ensemble learning for credit risk analysis. Expert Systems with Applications.
Liu, F., et al. (2015). Identifying future defaulters: A hierarchical Bayesian method. European Journal of Operational Research.
Malhotra, R., et al. (2003). Evaluating consumer loans using neural networks. Omega.
Marqués, A.I., et al. (2012). Exploring the behaviour of base classifiers in credit scoring ensembles. Expert Systems with Applications.
Marqués, A.I., et al. (2012). Two-level classifier ensembles for credit risk assessment. Expert Systems with Applications.
Nanni, L., et al. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications.
Ong, C.-S., et al. (2005). Building credit scoring models using genetic programming. Expert Systems with Applications.
Paleologo, G., et al. (2010). Subagging for credit scoring models. European Journal of Operational Research.
Partalas, I., et al. (2009). Pruning an ensemble of classifiers via reinforcement learning. Neurocomputing.
Ping, Y., et al. (2011). Neighborhood rough set and SVM based hybrid credit scoring classifier. Expert Systems with Applications.
Sinha, A.P., et al. (2008). Incorporating domain knowledge into data mining classifiers: An application in indirect lending. Decision Support Systems.
So, M.M.C., et al. (2011). Modelling the profitability of credit cards by Markov decision processes. European Journal of Operational Research.
Šušteršič, M., et al. (2009). Consumer credit scoring models with limited data. Expert Systems with Applications.
Tong, E.N.C., et al. (2012). Mixture cure models in credit scoring: if and when borrowers default. European Journal of Operational Research.
Tsai, C.-F. (2014). Combining cluster analysis with classifier ensembles to predict financial distress. Information Fusion.
Tsai, C.-F., et al. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications.
Tsai, M.-C., et al. (2009). The consumer loan default predicting model: An application of DEA-DA and neural network. Expert Systems with Applications.
Twala, B. (2010). Multiple classifier application to credit risk assessment. Expert Systems with Applications.
Verbeke, W., et al. (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research.

Stochastics and Statistics