2021  OriginalPaper  Chapter Open Access
Sharpening the Accuracy of Credit Scoring Models with Machine Learning Algorithms
Published in:
Data Science for Economics and Finance
1 Introduction
Credit scoring consists of a set of risk management techniques that help lenders to decide whether to grant a loan to a given applicant [
42]. More precisely, financial institutions use credit scoring models to make two types of credit decisions. First, a lender should decide whether to grant a loan to a new customer. The process that leads to this decision is called
application scoring. Second, a lender may want to monitor the risk associated with existing customers (the so-called behavioral scoring). In the field of retail lending, credit scoring typically consists of a binary classification problem, where the objective is to predict whether an applicant will be a “good” one (i.e., she will repay her liabilities within a certain period of time) or a “bad” one (i.e., she will default in part or fully on her obligations) based on a set of observed characteristics (
features) of the borrower.
^{1} A feature can be of two types: continuous, when the value of the feature is a real number (an example can be the income of the applicant) or categorical, when the feature takes a value from a predefined set of categories (an example can be the rental status of the applicant, e.g., “owner,” “living with parents,” “renting,” or “other”). Notably, besides traditional categories, new predictive variables, such as those based on “soft” information have been proposed in the literature to improve the accuracy of the credit score forecasts. For instance, Wang et al. [
44] use text mining techniques to exploit the content of descriptive loan texts submitted by borrowers to support credit decisions in peer-to-peer lending.
Credit scoring plays a crucial role in lending decisions, considering that the cost of an error is relatively high. Starting in the 1990s, most financial institutions have been making lending decisions with the help of automated credit scoring models [
17]. However, according to the Federal Reserve Board [
15] the average delinquency rate on consumer loans has been increasing again since 2016 and has reached 2.28% in the first quarter of 2018, thus indicating that wide margins for improvement in the accuracy of credit scoring models remain. Given the size of the retail credit industry, even a small reduction in the hazard rate may yield significant savings for financial institutions in the future [
45].
Credit scoring also carries considerable regulatory importance. Since the Basel Committee on Banking Supervision released the Basel Accords, especially the second accord in 2004, the use of credit scoring has grown considerably, not only for credit granting decisions but also for risk management purposes. Basel III, released in 2013, enforced increasingly accurate calculations of default risk, especially in consideration of the limitations that external rating agencies have shown during the 2008–2009 financial crisis [
38]. As a result, over the past decades, the problem of developing superior credit scoring models has attracted significant attention in the academic literature. More recently, thanks to the increased availability of data and the progress in computing power, attention has moved towards the application of Artificial Intelligence (AI) and, in particular, Machine Learning (ML) algorithms to credit scoring, whereby machines learn and make predictions without being explicitly programmed.
There are four major ML paradigms: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning [32]. In supervised learning, a training dataset should consist of both input data and their corresponding output target values (also called labels). Then, the algorithm is trained on the data to find relationships between the input variables and the selected output labels. If only some target output values are available in a training dataset, then the problem is known as semi-supervised learning. Unsupervised learning requires only the input data to be available. Finally, reinforcement learning does not need labelled inputs/outputs but focuses instead on agents making optimal decisions in a certain environment; feedback is provided to the algorithm in terms of “reward” and “punishment” so that the final goal is to maximize the agent’s cumulative reward. Typically, lending institutions store both the input characteristics and the output historical data concerning their customers. As a result, supervised learning is the main focus of this chapter.
Simple linear classification models remain a popular choice among financial institutions, mainly due to their adequate accuracy and straightforward implementation [
29]. Furthermore, the majority of advanced ML techniques lack the necessary transparency and are regarded as “black boxes”, which means that one is not able to easily explain how the decision to grant a loan is made and on which parameters it is based. In the financial industry, however, transparency and simplicity play a crucial role, and that is the main reason why advanced ML techniques have not yet become widely adopted for credit scoring purposes.
^{2} However, Chui et al. [
12] emphasize that the financial industry is one of the leading sectors in terms of current and prospective ML adoption, especially in credit scoring and lending applications, as they document that more than 25% of the companies that provide financial services have adopted at least one advanced ML solution in their day-to-day business processes.
Even though the number of papers on advanced scoring techniques has increased dramatically, a consensus regarding the best-performing models has not yet been reached. Therefore, in this chapter, besides providing an overview of the most common classification methods adopted in the context of credit scoring, we will also try to answer three key questions:

Which individual classifiers show the best performance both in terms of accuracy and of transparency?

Do ensemble classifiers consistently outperform individual classification models and which (if any) is the superior ensemble method?

Do one-class classification models score higher accuracy compared to the best individual classifiers when tested on imbalanced datasets (i.e., datasets where one class is underrepresented)?
Our survey shows that, although ML techniques rarely outperform simple linear methods by a significant margin as far as individual classifiers are concerned, ensemble methods tend to show a considerably better classification performance than individual methods, especially when the financial costs of misclassification are accounted for.
2 Preliminaries and Linear Methods for Classification
A (supervised) learning problem is an attempt to predict a certain output using a set of variables (
features in ML jargon) that are believed to exercise some influence on the output. More specifically, what we are trying to learn is the function
\(h(\vec {x})\) that best describes the relationship between the predictors (the features) and the output. Technically, we are looking for the function
h ∈
H that minimizes a
loss function.
When the outcome is a categorical variable
C (a label), the problem is said to be a classification problem and the function that maps the inputs
\(\vec {x}\) into the output is called
classifier. The estimate
\(\hat {C}\) of
C takes values in
\(\mathscr {C}\), the set of all possible classes. As discussed in Sect.
1, credit scoring is usually a classification problem where only two classes are possible: the applicant is either of the “good” (G) or of the “bad” (B) type. In a binary classification problem, the loss function can be represented by a 2 × 2 matrix L with zeros on the main diagonal and nonnegative values elsewhere, where L(k, l) is the cost of classifying an observation belonging to class \(\mathscr {C}_k\) as \(\mathscr {C}_l\). The expected prediction error (EPE) is
$$\displaystyle \begin{aligned} EPE= E[L(C,\hat{C}(X))] = E_X \sum_{k=1}^{2}{L[\mathscr{C}_k,\hat{C}(X)]}\, p(\mathscr{C}_k|X), \end{aligned} $$
(1)
where \( \hat {C}(X)\) is the predicted class based on X (the matrix of the observed features), \(\mathscr {C}_k\) represents the class with label k, and \(p(\mathscr {C}_k|X)\) is the probability that the actual class has label k conditional on the observed values of the features. Accordingly, the optimal prediction \(\hat {C}(X)\) is the one that minimizes the EPE pointwise, i.e.,
$$\displaystyle \begin{aligned} \hat{C}(x) = \mathrm{arg}\min_{c\in \mathscr{C}} \sum_{k=1}^{2}{L(\mathscr{C}_k,c)}\, p(\mathscr{C}_k|X=x), \end{aligned} $$
(2)
where x is a realization of the features. Notably, when the loss function is of the 0–1 type, i.e., all misclassifications are charged a unit cost, the problem simplifies to
$$\displaystyle \begin{aligned} \hat{C}(x) = \mathrm{arg}\min_{c\in \mathscr{C}} [1-p(c|X=x)], \end{aligned} $$
(3)
which means that
$$\displaystyle \begin{aligned} \hat{C}(x) = \mathscr{C}_k \; \; \mathit{if} \; \; p(\mathscr{C}_k|X=x) = \max_{c\in \mathscr{C}}p(c|X=x). \end{aligned} $$
(4)
In this section, we shall discuss two popular classification approaches that result in linear decision boundaries: logistic regression (LR) and linear discriminant analysis (LDA). In addition, we also introduce the Naïve Bayes method, which is related to LR and LDA as it also considers a log-odds scoring function.
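The pointwise minimization of the expected loss can be illustrated with a short Python sketch. The posterior probabilities and the cost values below are hypothetical and chosen only for illustration; the function name `bayes_decision` is likewise an assumption, not something defined in the chapter.

```python
def bayes_decision(posteriors, loss):
    """Pick the class c that minimizes sum_k loss[k][c] * posteriors[k].

    posteriors: dict mapping each class label k to p(C_k | X = x)
    loss:       nested dict, loss[k][c] = cost of predicting c when the truth is k
    """
    classes = list(posteriors)
    expected = {
        c: sum(loss[k][c] * posteriors[k] for k in classes)
        for c in classes
    }
    return min(expected, key=expected.get), expected


# Hypothetical posterior probabilities for one applicant.
posteriors = {"G": 0.8, "B": 0.2}

# 0-1 loss: every misclassification costs one unit.
zero_one = {"G": {"G": 0, "B": 1}, "B": {"G": 1, "B": 0}}

# Assumed asymmetric loss: accepting a "bad" applicant costs five times
# as much as rejecting a "good" one.
asymmetric = {"G": {"G": 0, "B": 1}, "B": {"G": 5, "B": 0}}

label_01, _ = bayes_decision(posteriors, zero_one)
label_cost, risk = bayes_decision(posteriors, asymmetric)
print(label_01, label_cost, risk)
```

Under the 0–1 loss the rule reduces to choosing the class with the largest posterior probability, whereas under the assumed asymmetric costs the same applicant is rejected even though the posterior probability of being “good” is higher.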
2.1 Logistic Regression
Because of its simplicity, LR is still one of the most popular approaches used in the industry for the classification of applicants (see, e.g., [23]). This approach allows one to model the posterior probabilities of K different applicant classes using a linear function of the features, while at the same time ensuring that the probabilities sum to one and that each of them lies between zero and one. More specifically, when there are only two classes (coded via y, a dummy variable that takes a value of 0 if the applicant is “good” and of 1 if she is “bad”), the posterior probabilities are modeled as
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} p(C=G|X=x)=\frac{exp(\beta_0+\vec{\beta}^T \vec{x})}{1+exp(\beta_0+\vec{\beta}^T \vec{x})} \\ p(C=B|X=x)=\frac{1}{1+exp(\beta_0+\vec{\beta}^T \vec{x})}. \end{array} \end{aligned} $$
(5)
Applying the logit transformation, one obtains the log of the probability odds (the log-odds ratio) as
$$\displaystyle \begin{aligned} \log{\frac{p(C=G|X=x)}{p(C=B|X=x)}}= \beta_0+\vec{\beta}^T \vec{x}. \end{aligned} $$
(6)
The input space is optimally divided by the set of points for which the log-odds ratio is zero, meaning that the posterior probability of being in one class or in the other is the same. Therefore, the decision boundary is the hyperplane defined by \(\left \{\vec {x}\,|\,\beta _0+\vec {\beta }^T\vec {x}=0\right \}\). Logistic regression models are usually estimated by maximum likelihood, assuming that all the observations in the sample are independently Bernoulli distributed, such that the log-likelihood function is
$$\displaystyle \begin{aligned} \mathscr{L}(\theta|x)= \log{p(y|x;\theta)}=\sum_{i=1}^{T_0}\log{p_{C_i}(\vec{x}_i;\theta)}, \end{aligned} $$
(7)
where T_{0} is the number of observations in the training sample, θ is the vector of parameters, and \(p_k(\vec {x}_i;\theta )=p(C=k|X=\vec {x}_i;\theta )\). Because in our case there are only two classes coded via a binary response variable y_{i} that can take a value of either zero or one, \(\hat {\beta }\) is found by maximizing
$$\displaystyle \begin{aligned} \mathscr{L}(\beta)= \sum_{i=1}^{T_0}\left(y_i \vec{\beta}^T \vec{x}_i - \log{(1+exp(\vec{\beta}^T \vec{x}_i))}\right). \end{aligned} $$
(8)
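As a rough illustration of how the log-likelihood in Eq. (8) can be maximized, the following Python sketch uses plain gradient ascent. It is a minimal sketch under several assumptions: the toy data are invented, the intercept is absorbed into \(\vec {\beta }\) by adding a constant feature equal to 1, y_{i} = 1 is taken to mark the class whose probability is modeled as exp(·)∕(1 + exp(·)), and the step size and number of iterations are arbitrary. In practice, Newton-type routines from a statistical library would normally be used instead.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def log_likelihood(beta, X, y):
    """Eq. (8): sum_i [ y_i * beta'x_i - log(1 + exp(beta'x_i)) ]."""
    total = 0.0
    for xi, yi in zip(X, y):
        t = sum(b * v for b, v in zip(beta, xi))
        total += yi * t - math.log(1.0 + math.exp(t))
    return total


def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient ascent; the gradient of Eq. (8) is sum_i (y_i - p_i) * x_i."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j, v in enumerate(xi):
                grad[j] += (yi - p) * v
        # Average the gradient so that the step size does not depend on sample size.
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Invented toy sample: first column is the constant 1, second is a single feature.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

beta_hat = fit_logistic(X, y)
print(beta_hat, log_likelihood(beta_hat, X, y))
```

The toy sample is deliberately not linearly separable, so the maximum-likelihood estimate is finite and the iterations settle down rather than drifting towards infinity.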
2.2 Linear Discriminant Analysis
A second popular approach used to separate “good” and “bad” applicants that leads to linear decision boundaries is LDA. The LDA method approaches the problem of separating two classes based on a set of observed characteristics \(\vec {x}\) by modeling the class densities \(f_G(\vec {x})\) and \(f_B(\vec {x})\) as multivariate normal distributions with means \(\vec {\mu }_G\) and \(\vec {\mu }_B\) and the same covariance matrix Σ, i.e.,
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} f_G(\vec{x})=(2\pi)^{-{K}/{2}}\;|\varSigma|^{-{1}/{2}} \;\text{exp}\left(-\frac{1}{2}(\vec{x}-\vec{\mu}_G)^T \varSigma^{-1} (\vec{x}-\vec{\mu}_G)\right) \\ f_B(\vec{x})=(2\pi)^{-{K}/{2}}\;|\varSigma|^{-{1}/{2}} \;\text{exp}\left(-\frac{1}{2}(\vec{x}-\vec{\mu}_B)^T \varSigma^{-1} (\vec{x}-\vec{\mu}_B)\right). \end{array} \end{aligned} $$
(9)
To compare the two classes (“good” and “bad” applicants), one then has to compute and investigate the log-ratio
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \log{\frac{p(C=G|X=x)}{p(C=B|X=x)}} = \log{\frac{f_G(\vec{x})}{f_B(\vec{x})}}+\log{\frac{\pi_G}{\pi_B}} \\ = \log{\frac{\pi_G}{\pi_B}} - \frac{1}{2}(\vec{\mu}_G+\vec{\mu}_B)^T \varSigma^{-1}(\vec{\mu}_G-\vec{\mu}_B) + \vec{x}^T \varSigma^{-1} (\vec{\mu}_G-\vec{\mu}_B), \end{array} \end{aligned} $$
(10)
which is linear in \(\vec {x}\). Therefore, the decision boundary, which is the set where p(C = G|X = x) = p(C = B|X = x), is also linear in \(\vec {x}\). Clearly, the Gaussian parameters \(\vec {\mu }_G\), \(\vec {\mu }_B\), and Σ are not known and should be estimated using the training sample, as should the prior probabilities π_{G} and π_{B} (set equal to the proportions of good and bad applicants in the training sample). Rearranging Eq. (10), it appears evident that the Bayesian optimal solution is to predict a point to belong to the “bad” class if
$$\displaystyle \begin{aligned} \vec{x}^T\hat{\varSigma}^{-1}(\hat{\vec{\mu}}_B-\hat{\vec{\mu}}_G)> \frac{1}{2}\hat{\vec{\mu}}_B^T\hat{\varSigma}^{-1}\hat{\vec{\mu}}_B-\frac{1}{2}\hat{\vec{\mu}}_G^T\hat{\varSigma}^{-1}\hat{\vec{\mu}}_G+\log{\hat{\pi}_G} - \log{\hat{\pi}_B}, \end{aligned} $$
(11)
which can be rewritten as
$$\displaystyle \begin{aligned} \vec{x}^T \vec{w} > z \end{aligned} $$
(12)
where \( \vec {w}=\hat {\varSigma }^{-1}(\hat {\vec {\mu }}_B-\hat {\vec {\mu }}_G)\) and \(z=\frac {1}{2}\hat {\vec {\mu }}_B^T\hat {\varSigma }^{-1}\hat {\vec {\mu }}_B-\frac {1}{2}\hat {\vec {\mu }}_G^T\hat {\varSigma }^{-1}\hat {\vec {\mu }}_G+\log {\hat {\pi }_G} - \log {\hat {\pi }_B}\).
Another way to approach the problem, which leads to the same coefficients \(\vec {w}\), is to look for the linear combination of the features that gives the maximum separation between the means of the classes and the minimum variation within the classes, which is equivalent to maximizing the separating distance M
$$\displaystyle \begin{aligned} M=\vec{\omega}^T\frac{\hat{\vec{\mu}}_G-\hat{\vec{\mu}}_B}{(\vec{\omega}^T \hat{\varSigma} \vec{\omega})^{{1}/{2}}}.\end{aligned} $$
(13)
Notably, the derivation of the coefficients \(\vec {w}\) does not require that \(f_G(\vec {x})\) and \(f_B(\vec {x})\) follow a multivariate normal distribution as postulated in Eq. (9), but only that Σ_{G} = Σ_{B} = Σ. However, the choice of z as a cutoff point in Eq. (12) requires normality. An alternative is to use a cutoff point that minimizes the training error for a given dataset.
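The plug-in rule that classifies an applicant as “bad” when \(\vec {x}^T \vec {w} > z\) can be sketched in a few lines of Python. The sketch below is restricted to two features so that the covariance matrix can be inverted by hand, the pooled covariance is estimated with the usual degrees-of-freedom correction (an assumption, since the chapter does not specify the estimator), and the data and feature meanings are invented.

```python
import math


def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def pooled_cov(good, bad, mu_g, mu_b):
    """Pooled 2x2 covariance matrix, assuming the same covariance in both classes."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, mu in ((good, mu_g), (bad, mu_b)):
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    dof = len(good) + len(bad) - 2
    return [[v / dof for v in row] for row in s]


def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def fit_lda(good, bad):
    """Return (w, z) such that an applicant is classified as 'B' when x'w > z."""
    mu_g, mu_b = mean(good), mean(bad)
    s_inv = inv2(pooled_cov(good, bad, mu_g, mu_b))
    pi_g = len(good) / (len(good) + len(bad))
    pi_b = 1.0 - pi_g
    w = matvec(s_inv, [mu_b[0] - mu_g[0], mu_b[1] - mu_g[1]])
    z = (0.5 * dot(mu_b, matvec(s_inv, mu_b))
         - 0.5 * dot(mu_g, matvec(s_inv, mu_g))
         + math.log(pi_g) - math.log(pi_b))
    return w, z


def predict(x, w, z):
    return "B" if dot(x, w) > z else "G"


# Invented sample: (monthly income in thousands, debt-to-income ratio).
good = [[3.0, 0.20], [3.5, 0.30], [4.0, 0.25], [4.5, 0.10]]
bad = [[1.0, 0.60], [1.5, 0.70], [2.0, 0.50], [1.2, 0.80]]

w, z = fit_lda(good, bad)
print(predict([3.8, 0.2], w, z), predict([1.3, 0.7], w, z))
```

With equal class proportions the prior terms cancel, so the cutoff sits halfway between the two projected class means.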
2.3 Naïve Bayes
The Naïve Bayes (NB) approach is a probabilistic classifier that assumes that, given a class (G or B), the applicant’s attributes are independent. Let π_{G} denote the prior probability that an applicant is “good” and π_{B} the prior probability that an applicant is “bad.” Then, because of the assumption that each attribute x_{i} is conditionally independent of any other attribute x_{j} for i ≠ j, the following holds:
$$\displaystyle \begin{aligned} p\left(\vec{x}\,|\,G\right)=p\left(x_1|G\right)p\left(x_2|G\right)\ldots p\left(x_n|G\right),\end{aligned} $$
(14)
where \(p(\vec {x}\,|\,G)\) is the probability that a “good” applicant has attributes \(\vec {x}\). The probability of an applicant being “good” if she is characterized by the attributes \(\vec {x}\) can now be found by applying Bayes’ theorem:
$$\displaystyle \begin{aligned} p\left(G\,|\,\vec{x}\right)=\frac{p\left(\vec{x}\,|\,G\right)\pi_G}{p(\vec{x})}. \end{aligned} $$
(15)
The probability of an applicant being “bad” if she is characterized by the attributes \(\vec {x}\) is
$$\displaystyle \begin{aligned} p\left(B\,|\,\vec{x}\right)=\frac{p\left(\vec{x}\,|\,B\right)\pi_B}{p(\vec{x})}. \end{aligned} $$
(16)
The attributes \(\vec {x}\) are typically converted into a score, \(s(\vec {x})\), which is such that \(p(G\,|\,\vec {x})=p(G\,|\,s(\vec {x}))\). A popular score function is the log-odds score [42]:
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} s\left(\vec{x}\right)=\log{\left(\frac{p\left(G|\vec{x}\right)}{p\left(B|\vec{x}\right)}\right)}=\log{\left(\frac{\pi_G\, p\left(\vec{x}|G\right)}{ \pi_B\, p\left(\vec{x}|B\right)}\right)}\\ =\log{\left(\frac{\pi_G}{\pi_B}\right)}+\log{\left(\frac{p\left(\vec{x}|G\right)}{p\left(\vec{x}|B\right)}\right)}= s_{pop}+ woe\left(\vec{x}\right), {} \end{array} \end{aligned} $$
(17)
where s_{pop} is the log of the relative proportion of “good” and “bad” applicants in the population and \(woe\left (\vec {x}\right )\) is the weight of evidence of the attribute combination \(\vec {x}\). Because of the conditional independence of the attributes, we can rewrite Eq. (17) as
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} s\left(\vec{x}\right)=\ln{\left(\frac{\pi_G}{\pi_B}\right)}+\ln{\left(\frac{p\left(x_1|G\right)}{p\left(x_1|B\right)}\right)}+\ldots+\ln{\left(\frac{p\left(x_n|G\right)}{p\left(x_n|B\right)}\right)}\\=s_{pop}+woe\left(x_1\right)+woe\left(x_2\right)+\ldots+woe\left(x_n\right). \end{array} \end{aligned} $$
(18)
If \(woe\left (x_i\right )\) is equal to 0, then this attribute does not affect the estimation of the status of an applicant. The prior probabilities π_{G} and π_{B} are estimated using the proportions of good and bad applicants in the training sample; the same applies to the weight of evidence of the attributes, as illustrated in the example below.
Example Let us assume that a bank makes a lending decision based on two attributes: the residential status and the monthly income of the applicant. The data belonging to the training sample are given in Fig.
1. An applicant who has a monthly income of USD 2000 and owns a flat, will receive a score of:
$$\displaystyle \begin{aligned} s\left(\vec{x}\right)=\ln{\left(\frac{1300}{300}\right)}+\ln{\left(\frac{{950}/{1300}}{{150}/{300}}\right)}+\ln{\left(\frac{{700}/{1300}}{{100}/{300}}\right)}=2.32. \end{aligned}$$
If \(p(G\,|\,s(\vec {x})=2.32)\), the conditional probability of being “good” when the score is 2.32, is higher than \(p(B\,|\,s(\vec {x})=2.32)\), i.e., the conditional probability of being “bad,” this applicant is classified as “good” (and vice versa).
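The arithmetic of the log-odds score can be reproduced with a few lines of Python. The counts are those appearing in the worked example (1300 “good” and 300 “bad” applicants overall, with 950∕150 and 700∕100 of them sharing the applicant’s two attribute values); the function name `log_odds_score` is an illustrative choice and not part of the chapter.

```python
import math


def log_odds_score(n_good, n_bad, counts):
    """Log-odds score: s_pop plus the sum of the weights of evidence.

    counts: list of (good, bad) pairs, one per attribute, giving how many
            good and bad applicants share the applicant's value of that attribute.
    """
    score = math.log(n_good / n_bad)  # s_pop
    for g, b in counts:
        score += math.log((g / n_good) / (b / n_bad))  # woe(x_i)
    return score


score = log_odds_score(1300, 300, [(950, 150), (700, 100)])
print(round(score, 4))
```

A positive weight of evidence means that the attribute value is relatively more frequent among “good” applicants and therefore raises the score.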
A lender can therefore define a cutoff score, below which applicants are automatically rejected as “bad.” Usually, the score
\(s(\vec {x})\) is linearly transformed so that its interpretation is more straightforward. The NB classifier performs relatively well in many applications but, according to Thomas et al. [
42], it shows poor performance in the field of credit scoring. However, its most significant advantage is that it is easy to interpret, which is a property of growing importance in the industry.
3 Nonlinear Methods for Classification
Although simple linear methods are still fairly popular with practitioners, because of their simplicity and their satisfactory accuracy [
29], more than 25% of the financial companies have recently adopted at least one advanced ML solution in their day-to-day business processes [
12], as emphasized in Sect.
1. Indeed, these models have the advantage of being much more flexible and they may be able to uncover complex, nonlinear relationships in the data. For instance, the popular LDA approach postulates that an applicant will be “bad” if her/his score exceeds a given threshold; however, the path to default may be highly nonlinear in the mapping between scores and probability of default (see [
39]).
Therefore, in this section, we review several popular ML techniques for classification, such as Decision Trees (DT), Neural Networks (NN), Support Vector Machines (SVM), k-Nearest Neighbor (kNN), and Genetic Algorithms (GA). Even though GA are not exactly classification methods but rather evolutionary computing techniques that help to find the “fittest” solution, we cover them in our chapter as this method is widely used in credit scoring applications (see, e.g., [
49,
35,
1]). Finally, we discuss ensemble methods that combine different classifiers to obtain better classification accuracy. For the sake of brevity, we do not cover deep learning techniques, which are also employed for credit scoring purposes; the interested reader can find useful references in [
36].
3.1 Decision Trees
Decision Trees (also known as Classification Trees) are a classification method that uses the training dataset to construct decision rules organized into tree-like structures, where each branch represents an association between the input values and the output label. Although different algorithms exist (such as classification and regression trees, also known as CART), we focus on the popular C4.5 algorithm developed by Quinlan [37]. At each node, the C4.5 algorithm splits the training dataset according to the most influential feature through an iterative process. The most influential feature is the one with the lowest entropy (or, similarly, with the highest information gain). Let \(\hat {\pi }_G\) be the proportion of “good” applicants and \(\hat {\pi }_B\) the proportion of “bad” applicants in the sample S. The entropy of S is then defined as in Baesens et al. [5]:
$$\displaystyle \begin{aligned} \mathrm{Entropy}\left(S\right)= -\hat{\pi}_G\log_2{\left(\hat{\pi}_G\right)}-\hat{\pi}_B\log_2{\left(\hat{\pi}_B\right)}. \end{aligned} $$
(19)
According to this formula, the maximum value of the entropy is equal to 1 when \(\hat {\pi }_G = \hat {\pi }_B = 0.5\) and it is minimal at 0, which happens when either \(\hat {\pi }_G=0\) or \(\hat {\pi }_B=0\). In other words, an entropy of 0 means that we have been able to identify the characteristics that lead to a group of good (bad) applicants. In order to split the sample, we compute the gain ratio:
$$\displaystyle \begin{aligned} \mathrm{Gain\ ratio}\left(S, x_i\right)=\frac{\mathrm{Gain}\left(S, x_i\right)}{\mathrm{Split\ Information}(S, x_i)}.\end{aligned} $$
(20)
\(Gain\left (S, x_i\right )\) is the expected reduction in entropy due to splitting the sample according to feature x_{i} and it is calculated as
$$\displaystyle \begin{aligned} \mathrm{Gain}\left(S, x_i\right)= \mathrm{Entropy}\left(S\right) - \sum_{\upsilon}\frac{\left|S_\upsilon\right|}{\left|S\right|}\mathrm{Entropy}\left(S_\upsilon\right),\end{aligned} $$
(21)
where υ ∈ values(x_{i}), S_{υ} is a subset of the individuals in S that share the same value of the feature x_{i}, and
$$\displaystyle \begin{aligned} \mathrm{Split\ Information}\left(S, x_i\right)= -\sum_{k }{\frac{\left|S_k\right|}{\left|S\right|}\log_2{\frac{\left|S_k\right|}{\left|S\right|}}}\end{aligned} $$
(22)
where k ∈ values(x_{i}) and S_{k} is a subset of the individuals in S that share the same value of the feature x_{i}. The latter term represents the entropy of S relative to the feature x_{i}. Once such a tree has been constructed, we can predict the probability that a new applicant will be a “bad” one using the proportion of “bad” customers in the leaf that corresponds to the applicant’s characteristics.
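The entropy, information gain, and gain ratio used to rank candidate splits can be computed directly from their definitions. The following Python sketch does so for categorical features; the small sample and the feature names are invented, and the code covers only the split criterion, not the full C4.5 tree-growing and pruning procedure.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Information gain of a categorical feature divided by its split information."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    # A feature with a single value carries no split information.
    return gain / split_info if split_info > 0 else 0.0


# Invented sample of six applicants.
labels = ["G", "G", "B", "B", "G", "B"]
status = ["owner", "owner", "renting", "renting", "owner", "renting"]
region = ["north", "south", "north", "south", "north", "south"]

print(entropy(labels), gain_ratio(status, labels), gain_ratio(region, labels))
```

In this toy sample the residential status separates the two classes perfectly and obtains the maximum gain ratio, so it would be selected for the first split ahead of the uninformative second feature.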
3.2 Neural Networks
NN models were initially inspired by studies of the human brain [8, 9]. A NN model consists of input, hidden, and output layers of interconnected neurons. Neurons in one layer are combined through a set of weights and fed to the next layer. In its simplest single-layer form, a NN consists of an input layer (containing the applicants’ characteristics) and an output layer. More precisely, a single-layer NN is modeled as follows:
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} u_k= \omega_{k0}+\sum_{i=1}^{n}{\omega_{ki}x_i} \\ y_k= f\left(u_k\right), \end{array} \end{aligned} $$
(23)
where x_{1}, …, x_{n} are the applicant’s characteristics, which in a NN are typically referred to as signals, ω_{k1}, …, ω_{kn} are the weights connecting characteristic i to the layer k (also called synaptic weights), and ω_{k0} is the “bias” (which plays a similar role to the intercept term in a linear regression). Eq. (23) describes a single-layer NN, so that k = 1. A positive weight is called excitatory because it increases the effect of the corresponding characteristic, while a negative weight is called inhibitory because it decreases the effect of a positive characteristic [42]. The function f that transforms the inputs into the output is called the activation function and may take a number of specifications. However, in binary classification problems, it may be convenient to use a logistic function, as it produces an output value in the range [0, 1]. A cutoff value is applied to y_{k} to decide whether the applicant should be classified as good or bad. Figure 2 illustrates how a single-layer NN works.
A single-layer NN model shows a satisfactory performance only if the classes can be linearly separated. However, if the classes are not linearly separable, a multilayer model could be used [33]. Therefore, in the rest of this section, we describe multilayer perceptron (MLP) models, which are the most popular NN models in classification problems [5]. According to Bishop [9], even though multiple hidden layers may be used, a considerable number of papers have shown that MLP NN models with one hidden layer are universal nonlinear discriminant functions that can approximate any continuous function arbitrarily well. An MLP model with one hidden layer, which is also called a three-layer NN, is shown in Fig. 3. This model can be represented algebraically as
$$\displaystyle \begin{aligned} y_k=f^{(1)}\left(\sum_{i=0}^{n}{\omega_{ki}x_i}\right), \end{aligned} $$
(24)
where f^{(1)} is the activation function on the second (hidden) layer and y_{k} for k = 1, …, r are the outputs from the hidden layer that simultaneously represent the inputs to the third layer. Therefore, the final output values z_{v} can be written as
$$\displaystyle \begin{aligned} z_v= f^{(2)}\left(\sum_{k=1}^{r}{K_{vk}y_k}\right)=f^{(2)}\left(\sum_{k=1}^{r}{K_{vk}f^{(1)}\left(\sum_{i=0}^{n}{\omega_{ki}x_i}\right)}\right) \end{aligned} $$
(25)
where f^{(2)} is the activation function of the third (output) layer, z_{v} for v = 1, …, s are the final outputs, and K_{vk} are the weights applied to the y_{k} values. The estimation of the weights is called training of the model, and the most popular method for this purpose is the backpropagation algorithm, in which the pairs of input values and output values are presented to the model many times with the goal of finding the weights that minimize an error function [42].
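The forward pass of an MLP with one hidden layer amounts to two nested weighted sums. The Python sketch below evaluates it for given weights, assuming a logistic activation in both layers and assuming that x_{0} = 1, so that ω_{k0} acts as the bias when the sum starts at i = 0. The weights are hypothetical, and the training step (backpropagation) is not shown.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def mlp_forward(x, hidden_weights, output_weights):
    """Forward pass of a one-hidden-layer MLP.

    x:              list of n input signals
    hidden_weights: r rows [w_k0, w_k1, ..., w_kn], one per hidden neuron
    output_weights: s rows [K_v1, ..., K_vr], one per output neuron
    """
    x_ext = [1.0] + list(x)  # assumed x_0 = 1 so that w_k0 plays the role of the bias
    y = [logistic(sum(w * xi for w, xi in zip(w_k, x_ext))) for w_k in hidden_weights]
    z = [logistic(sum(k * yk for k, yk in zip(k_v, y))) for k_v in output_weights]
    return y, z


# Hypothetical weights: two inputs, two hidden neurons, one output.
hidden_weights = [[0.0, 1.0, -1.0], [0.5, -0.5, 0.5]]
output_weights = [[1.0, -1.0]]

y, z = mlp_forward([1.0, 1.0], hidden_weights, output_weights)
print(y, z)
```

Because the output activation is logistic, the single output can be read as a score between 0 and 1 to which a cutoff is applied, in the same way as for the single-layer model.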
3.3 Support Vector Machines
The SVM method was initially developed by Vapnik [43]. The idea of this method is to transform the input space into a high-dimensional feature space by using a nonlinear function φ(•). Then, a linear classifier can be used to distinguish between “good” and “bad” applicants. Given a training dataset of N pairs of observations \({\left \{\left (\vec {x}_i, y_i\right )\right \}}_{i=1}^N\), where \(\vec {x}_i\) are the attributes of customer i and y_{i} is the corresponding binary label, such that y_{i} ∈{−1, +1}, the SVM model should satisfy the following conditions:
$$\displaystyle \begin{aligned} \begin{array}{rcl} \begin{cases} \vec{w}^T \varphi(\vec{x}_i) + b \geq +1 \; \; \text{if} \; \; y_i = +1 \\ \vec{w}^T \varphi(\vec{x}_i) + b \leq -1 \; \; \text{if} \; \; y_i = -1, \end{cases} \end{array} \end{aligned} $$
which is equivalent to
$$\displaystyle \begin{aligned} y_i\left[\vec{w}^T\varphi\left(\vec{x}_i\right)+b\right]\geq 1,\; \; i=1,\ldots,N. \end{aligned} $$
(26)
The above inequalities construct a hyperplane in the feature space, defined by \(\left \{\vec {x}\,|\,\vec {w}^T\varphi \left (\vec {x}\right )+b=0\right \}\), which distinguishes between the two classes (see Fig. 4 for the illustration of a simple two-dimensional case). The observations on the lines \(\vec {w}^T\varphi \left (\vec {x}_i\right )+b=1\) and \(\vec {w}^T\varphi \left (\vec {x}_i\right )+b=-1\) are called the support vectors. The parameters of the separating hyperplane are estimated by maximizing the perpendicular distance (called the margin) between the closest support vector and the separating hyperplane while at the same time minimizing the misclassification error.
The optimization problem is defined as:
$$\displaystyle \begin{aligned} \begin{array}{rcl} {} \begin{cases}\min_{w, b, \xi}{\mathscr{J}(\vec{w}, b, \xi)}=\frac{1}{2}\vec{w}^T\vec{w}+C\sum_{i=1}^{N}{\xi_i,} \\ \text{subject to:} \\ y_i\left[\vec{w}^T \varphi\left(\vec{x}_i\right)+b\right]\geq 1-\xi_i,\; \; \; i=1, \ldots, N \\ \xi_i\geq 0,\; \; \; i=1, \ldots, N, \end{cases} \end{array} \end{aligned} $$
(27)
where the variables ξ_{i} are slack variables and C is a positive tuning parameter [5]. The Lagrangian of this optimization problem is defined as follows:
$$\displaystyle \begin{aligned} \mathscr{L}\left(\vec{w}, b, \xi; \vec{\alpha}, \vec{\nu}\right)= \mathscr{J}\left(\vec{w}, b,\xi\right)-\sum_{i=1}^{N}\alpha_i\left\{y_i\left[\vec{w}^T \varphi\left(\vec{x}_i\right)+b\right]-1+\xi_i\right\} -\sum_{i=1}^{N}\nu_i\xi_i. \end{aligned} $$
(28)
The classifier is obtained by minimizing \(\mathscr {L}\left (\vec {w}, b, \xi ; \vec {\alpha }, \vec {\nu }\right )\) with respect to \(\vec {w}, b,\xi \) and maximizing it with respect to \(\vec {\alpha }\), \(\vec {\nu }\). In the first step, by taking the derivatives with respect to \(\vec {w},b,\xi \), setting them to zero, and exploiting the results, one may represent the classifier as
$$\displaystyle \begin{aligned} y\left(\vec{x}\right)=sign\left(\sum_{i=1}^{N}{\alpha_iy_iK\left(\vec{x}_i, \vec{x}\right)}+b\right) \end{aligned} $$
(29)
where \(K\left (\vec {x}_i,\vec {x}\right )={\varphi \left (\vec {x}_i\right )}^T\varphi \left (\vec {x}\right )\) is computed using a positive-definite kernel function. Some possible kernel functions are the radial basis function \(K\left (\vec {x}_i, \vec {x}_j\right )=exp(-{\|\vec {x}_i-\vec {x}_j\|_2^2}/{\sigma ^2})\) and the linear function \(K\left (\vec {x}_i, \vec {x}_j\right )=\vec {x}_i^T\vec {x}_j\). At this point, the Lagrange multipliers α_{i} can be found by solving:
$$\displaystyle \begin{aligned} \begin{array}{rcl} \begin{cases} \max_{\alpha_i}{-\frac{1}{2}\sum_{i,j=1}^{N}{y_iy_jK\left(\vec{x}_i, \vec{x}_j\right)\alpha_i\alpha_j}+\sum_{i=1}^{N}\alpha_i} \\ \text{subject to:} \\ \sum_{i=1}^{N}{\alpha_iy_i=0}\\ 0\le\alpha_i\le C,\; \; \; i=1, \ldots, N. \end{cases} \end{array} \end{aligned} $$
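Once the Lagrange multipliers are available, classifying a new applicant only requires kernel evaluations against the support vectors. The Python sketch below evaluates the kernel classifier with a radial basis function kernel. The support vectors, multipliers, and intercept are hypothetical values chosen to satisfy the constraint that the α_{i}y_{i} sum to zero; they are not the solution of an actual dual problem, which in practice would be obtained with a quadratic programming solver.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """Radial basis function kernel exp(-||a - b||^2 / sigma^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / sigma ** 2)


def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) + b; a value of zero is mapped to +1."""
    s = sum(a * y * kernel(sv, x) for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1


# Hypothetical support vectors, labels, multipliers, and intercept.
support_vectors = [[1.0, 0.0], [-1.0, 0.0]]
labels = [+1, -1]
alphas = [1.0, 1.0]  # satisfies sum_i alpha_i * y_i = 0
b = 0.0

print(svm_decision([2.0, 0.0], support_vectors, labels, alphas, b),
      svm_decision([-2.0, 0.0], support_vectors, labels, alphas, b))
```

Only observations with a strictly positive multiplier contribute to the sum, which is why the classifier can be stored compactly in terms of its support vectors.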
3.4 kNearest Neighbor
In the kNN method, any new applicant is classified based on a comparison with the training sample using a distance metric. The approach consists of calculating the distances between the new instance that needs to be classified and each instance in the training sample that has already been classified, and selecting the set of the k-nearest observations. Then, the class label is assigned according to the most common class among the k-nearest neighbors using a majority voting scheme or a distance-weighted voting scheme [41]. One major drawback of the kNN method is that it is extremely sensitive to the choice of the parameter k, as illustrated in Fig. 5. Given the same dataset, if k = 1 the new instance is classified as “bad,” while if k = 3 the neighborhood contains one “bad” and two “good” applicants; thus, the new instance will be classified as “good.” In general, using a small k leads to overfitting (i.e., excessive adaptation to the training dataset), while using a large k reduces accuracy by including data points that are too far from the new case [41].
The most common choice of a distance metric is the Euclidean distance, which can be computed as:
$$\displaystyle \begin{aligned} d\left(\vec{x}_i, \vec{x}_j\right)= \left\|\vec{x}_i-\vec{x}_j\right\| = \left[(\vec{x}_i-\vec{x}_j)^T(\vec{x}_i-\vec{x}_j)\right]^{\frac{1}{2}} \end{aligned} $$
(30)
where \(\vec {x}_i\) and \(\vec {x}_j\) are the vectors of the input data of instances i and j, respectively. Once the distances between the new instance and every instance in the training sample are calculated, the new instance can be classified based on the information available from its k-nearest neighbors. As seen above, the most common approach is to use the majority class of the k-nearest examples, the so-called majority voting approach:
$$\displaystyle \begin{aligned} y^{new}=\mathrm{arg}\max_\nu\sum_{\left(x_i, y_i\right)\in\vec{S}_k}I(\nu=y_i), \end{aligned} $$
(31)
where \(y^{new}\) is the class of the new instance, ν is a class label, \(\vec {S}_k\) is the set containing the k closest training instances, \(y_i\) is the class label of one of the k-nearest observations, and I(•) is a standard indicator function.
The major drawback of the majority voting approach is that it gives the same weight to each of the k-nearest neighbors. This makes the method very sensitive to the choice of k, as discussed previously. However, this problem might be overcome by attaching to each neighbor a weight based on its distance from the new instance, i.e.,
$$\displaystyle \begin{aligned} \omega_i=\frac{1}{{d\left(\vec{x}_i,\vec{x}_j\right)}^2} \end{aligned} $$
(32)
This approach is known as the distance-weighted voting scheme, and the class label of the new instance can be found in the following way:
$$\displaystyle \begin{aligned} y^{new}=\mathrm{arg}\max_\nu\sum_{\left(x_i, y_i\right)\in\vec{S}_k}\omega_i I(\nu=y_i), \end{aligned} $$
(33)
One of the main advantages of kNN is its simplicity. Indeed, its logic is similar to the process of traditional credit decisions, which were made by comparing a new applicant with similar applicants [10]. However, because the distances must be computed afresh every time a new instance is classified, the classification speed may be slow, especially with large training samples.
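The two voting rules in Eqs. (31) and (33) can be illustrated with a minimal sketch. The function names, the toy applicants, and the handling of a zero distance are assumptions made for illustration only; they are not taken from the chapter.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two feature vectors, as in Eq. (30)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train, new_x, k, weighted=False):
    """Classify new_x from its k nearest training instances.

    train is a list of (features, label) pairs. With weighted=False each
    neighbor casts one vote (Eq. 31); with weighted=True each vote is
    scaled by 1 / d^2 (Eqs. 32 and 33).
    """
    ranked = sorted(
        ((euclidean(x, new_x), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    votes = defaultdict(float)
    for distance, label in ranked:
        if weighted:
            if distance == 0.0:
                # Assumption: an exact match decides the class outright,
                # which avoids dividing by zero.
                return label
            votes[label] += 1.0 / distance ** 2
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical applicants described by two rescaled features.
applicants = [
    ((1.0, 0.0), "bad"),
    ((2.0, 0.0), "good"),
    ((0.0, 2.5), "good"),
    ((5.0, 5.0), "good"),
    ((6.0, 0.0), "bad"),
]
new_applicant = (0.0, 0.0)

for k, weighted in [(1, False), (3, False), (3, True)]:
    print(k, weighted, knn_classify(applicants, new_applicant, k, weighted))
```

On this toy sample the single nearest neighbor is “bad,” the three nearest neighbors are one “bad” and two “good,” and the distance weights restore the “bad” label because the “bad” neighbor is much closer than the two “good” ones. This mirrors the sensitivity to k discussed above.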
3.5 Genetic Algorithms
GA are heuristic, combinatorial optimization search techniques employed to determine automatically the adequate discriminant functions and the valid attributes [35]. The search for the optimal solution to a problem with GA imitates the evolutionary process of biological organisms, as in Darwin’s natural selection theory. In order to understand how a GA works in the context of credit scoring, let us suppose that \((x_1, \ldots, x_N)\) is a set of attributes used to decide whether an applicant is good or bad according to a simple linear function:
$$\displaystyle \begin{aligned} y= \beta_0+\sum_{i=1}^{N}\beta_ix_i. \end{aligned} $$
(34)
Each solution is represented by the vector \(\vec {\beta }=(\beta _0,\beta _1, \ldots , \beta _N)\) whose elements are the coefficients assigned to each attribute. The initial step of the process is the generation of a random population of solutions \(\vec {\beta }_{J}^{0}\) and the evaluation of their fitness using a fitness function. Then, the following algorithms are applied:
1. Selection: a genetic operator selects the solutions that survive (the fittest solutions)
2. Crossover: a genetic operator recombines the surviving solutions
3. Mutation: a genetic operator allows for random mutations in the surviving solutions, with a low probability
The application of these algorithms results in the generation of a new population of solutions \(\vec {\beta }_{J}^{1}\). The selection, crossover, and mutation algorithms are applied recursively until the population converges to an (approximately) optimal solution \(\vec {\beta }_{J}^\ast \).
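The selection, crossover, and mutation loop can be sketched as follows for the linear score in Eq. (34). This is only one possible reading of the procedure: the fitness function (share of correctly classified applicants), the cut-off at zero, single-point crossover, Gaussian mutation, keeping the survivors unchanged, and all parameter values and toy data are assumptions made for illustration.

```python
import random


def score(beta, x):
    """Linear score y = beta_0 + sum_i beta_i * x_i, as in Eq. (34)."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


def fitness(beta, data):
    """Share of applicants classified correctly (label 1 = good, 0 = bad).

    Assumption: an applicant is classified as good when the score is >= 0.
    """
    hits = sum((score(beta, x) >= 0) == (label == 1) for x, label in data)
    return hits / len(data)


def evolve(data, n_features, pop_size=30, generations=40,
           mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(n_features + 1)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Selection: the fittest half of the population survives.
        population.sort(key=lambda b: fitness(b, data), reverse=True)
        survivors = population[: pop_size // 2]
        history.append(fitness(survivors[0], data))
        children = []
        while len(survivors) + len(children) < pop_size:
            # Crossover: recombine two survivors at a random cut point.
            mother, father = rng.sample(survivors, 2)
            cut = rng.randint(1, n_features)
            child = mother[:cut] + father[cut:]
            # Mutation: perturb each coefficient with a low probability.
            child = [g + rng.gauss(0, 0.5) if rng.random() < mutation_rate
                     else g for g in child]
            children.append(child)
        population = survivors + children
    best = max(population, key=lambda b: fitness(b, data))
    return best, history


# Hypothetical applicants: (income, debt) in arbitrary units, label 1 = good.
toy_data = [((3.0, 1.0), 1), ((2.5, 0.5), 1), ((4.0, 2.0), 1),
            ((1.0, 2.0), 0), ((0.5, 3.0), 0), ((1.5, 2.5), 0)]

best_beta, best_fitness_per_generation = evolve(toy_data, n_features=2)
print(best_beta, fitness(best_beta, toy_data))
```

Because the survivors are carried over unchanged in this sketch, the best fitness in the population cannot decrease from one generation to the next; a variant that also mutates the survivors would not have this property.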
3.6 Ensemble Methods
In order to improve the accuracy of the individual (or base) classifiers illustrated above, ensemble (or classifier combination) methods are often used [41]. Ensemble methods are based on the idea of training multiple models to solve the same problem and then combining them to get better results. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or robust models. In order to understand why ensemble classifiers may reduce the error rate of individual models, it may be useful to consider the following example.
Example. Suppose that an ensemble classifier is created by using 25 different base classifiers, that each classifier has an error rate \(\epsilon_i = 0.25\), and that the errors of the base classifiers are independent. If the final credit decision is taken through a majority vote (i.e., if the majority of the classifiers suggests that the customer is a “good” one, then the credit is granted), the error rate of the ensemble model is
$$\displaystyle \begin{aligned} \epsilon_{ensemble}=\sum_{i=13}^{25}\binom{25}{i}\epsilon^i(1-\epsilon)^{25-i}=0.003, \end{aligned} $$
(35)
where the index i = 13, …, 25 counts the base classifiers that yield a wrong estimate. This error rate is much lower than the individual rate of 0.25, because the ensemble model would make a wrong decision only if more than half of the base classifiers yield a wrong estimate.
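The figure in Eq. (35) can be checked numerically with a short sketch. The function name is hypothetical, and the calculation relies on the same assumption as the example, namely that the base classifiers err independently and with a common error rate.

```python
from math import comb


def majority_vote_error(n_classifiers, eps):
    """Probability that more than half of n independent base classifiers,
    each with error rate eps, are wrong at the same time (Eq. 35)."""
    smallest_majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, i) * eps ** i * (1 - eps) ** (n_classifiers - i)
        for i in range(smallest_majority, n_classifiers + 1)
    )


print(round(majority_vote_error(25, 0.25), 3))
```

With 25 classifiers and an individual error rate of 0.25, the sum is roughly 0.0034, which rounds to the 0.003 reported above. The same function shows that the benefit vanishes when the individual error rate reaches 0.5, where the ensemble error is also 0.5.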
It is easy to understand that ensemble classifiers perform especially well when their base classifiers are uncorrelated. Although in real-world applications it is difficult to obtain base classifiers that are totally uncorrelated, considerable improvements in the performance of ensemble classifiers are observed even when some correlations exist but are low [17]. Ensemble models can be split into homogeneous and heterogeneous. Homogeneous ensemble models use only one type of classifier and rely on resampling techniques to generate k different classifiers that are then aggregated according to some rule (e.g., majority voting). Examples of homogeneous ensemble models are bagging and boosting methods. More precisely, the bagging algorithm creates k bootstrapped samples of the same size as the original one by drawing with replacement from the dataset. All the samples are created in parallel and the estimated classifiers are aggregated according to majority voting. Boosting algorithms work in the same spirit as bagging but the models are not fitted in parallel: a sequential approach is used and at each step of the algorithm the model is fitted, giving more importance to the observations in the training dataset that were badly handled in the previous iteration. Although different boosting algorithms are possible, one of the most popular is AdaBoost, which was first introduced by Freund and Schapire [19]. This algorithm starts by calculating the error of a base classifier \(h_t\):
$$\displaystyle \begin{aligned} \epsilon_t=\frac{1}{N}\left[\sum_{j=1}^{N}{\omega_jI\left(h_t\left(x_j\right)\neq y_j\right)}\right]. \end{aligned} $$
(36)
Then, the importance of the base classifier \(h_t\) is calculated as:
$$\displaystyle \begin{aligned} \alpha_t=\frac{1}{2}\ln{\left(\frac{1-\epsilon_t}{\epsilon_t}\right)}. \end{aligned} $$
(37)
The parameter \(\alpha_t\) is used to update the weights assigned to the training instances. Let \(\omega _i^{(t)}\) be the weight assigned to the training instance i in the t-th boosting round. Then, the updated weight is calculated as:
$$\displaystyle \begin{aligned} \omega_i^{(t+1)}=\frac{\omega_i^{(t)}}{Z_t}\times\begin{cases} \text{exp}(-\alpha_t) & \mbox{if} \; \; h_t (\vec{x}_i)=y_i \\ \text{exp}(\alpha_t) & \mbox{if} \; \; h_t (\vec{x}_i)\neq y_i, \end{cases} \end{aligned} $$
(38)
where \(Z_t\) is the normalization factor, such that \(\sum _{i}{\omega _i^{(t+1)}=1}\). Finally, the AdaBoost algorithm decision is based on
$$\displaystyle \begin{aligned} h\left(\vec{x}\right)=sign\left(\sum_{t=1}^{T}\alpha_th_t\left(\vec{x}\right)\right). \end{aligned} $$
(39)
In contrast to homogeneous ensemble methods, heterogeneous ensemble methods combine different types of classifiers. The main idea behind these methods is that different algorithms might have different views on the data and thus combining them helps to achieve remarkable improvements in predictive performance [47]. An example of a heterogeneous ensemble method is the following:
1. Create a set of different classifiers \(H=\left \{h_t, t=1, \ldots , T\right \}\) that map an instance in the training sample to a class \(y_i\): \( h_t\left (\vec {x}_i\right )=y_i.\)
2. Start with an empty ensemble (S = ∅).
3. Add to the ensemble the model from the initial set that maximizes the performance of the ensemble on the validation dataset according to the error metric.
4. Repeat Step 3 for k iterations, where k is usually less than the number of models in the initial set.
A comparative evaluation of alternative ensemble methods is provided in Sect. 4.2.
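Returning to the boosting steps in Eqs. (36)–(39), the AdaBoost recursion can be sketched in a few lines. The sketch uses one-feature decision stumps as base classifiers, labels coded as +1 and −1, and a weighted error computed with weights that sum to one (so without a separate 1/N factor); these choices, the function names, and the small guard against a zero error rate are assumptions made for illustration.

```python
import math


def stump(x, threshold, polarity):
    """Decision stump on one feature: predicts polarity above the threshold."""
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    candidates = [(t, p) for t in sorted(set(xs)) for p in (1, -1)]

    def weighted_error(candidate):
        # Assumption: weights sum to one, so this is the weighted error rate.
        return sum(w for x, y, w in zip(xs, ys, weights)
                   if stump(x, *candidate) != y)

    best = min(candidates, key=weighted_error)
    return weighted_error(best), best[0], best[1]


def reweight(weights, correct, alpha):
    """Eq. (38): shrink the weights of correctly classified instances,
    inflate the others, then divide by Z_t so the weights sum to one."""
    raw = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    z = sum(raw)
    return [w / z for w in raw]


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        eps, threshold, polarity = fit_stump(xs, ys, weights)
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)  # Eq. (37)
        ensemble.append((alpha, threshold, polarity))
        correct = [stump(x, threshold, polarity) == y
                   for x, y in zip(xs, ys)]
        weights = reweight(weights, correct, alpha)
    return ensemble


def predict(ensemble, x):
    """Eq. (39): sign of the alpha-weighted vote of the base classifiers."""
    total = sum(alpha * stump(x, t, p) for alpha, t, p in ensemble)
    return 1 if total >= 0 else -1
```

A useful check of the update rule: if four instances carry equal weights and the base classifier misclassifies one of them, then \(\epsilon_t = 0.25\) and, after reweighting, the misclassified instance alone carries half of the total weight, which is what forces the next base classifier to concentrate on it.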
4 Comparison of Classifiers in Credit Scoring Applications
The selection of the best classification algorithm among all methods that have been proposed in the literature has always been a challenging research area. Although many studies have examined the performance of different classifiers, most of these papers have traditionally focused only on a few novel algorithms at the time and, thus, have generally failed to provide a comprehensive overview of pros and cons of alternative methods. Moreover, in most of these papers, a relatively small number of datasets were used, which limited the practical applicability of the empirical results reported. One of the most comprehensive studies that attempts to overcome these issues and to apply thorough statistical tests to compare different algorithms has been published by Stefan Lessmann and his coauthors [
29]. By combining their results with other, earlier studies, this section seeks to isolate the best classification algorithms for the purposes of credit scoring.
4.1 Comparison of Individual Classifiers
In the first decade of the 2000s, the focus of most papers had been on performing comparisons among individual classifiers. Understandably, the question of whether advanced methods of classification, such as NN and SVM, might outperform LR and LDA had attracted much attention. While some authors have since then concluded that NN classifiers are superior to both LR and LDA (see, e.g., [2]), generally, it has been shown that simple linear classifiers lead to a satisfactory performance and, in most cases, that the differences between NN and LR are not statistically significant [5]. This section compares the findings of twelve papers concerning individual classifiers in the field of credit scoring. Papers were selected based on two features: first, the number of citations, and, second, the publishing date. The sample combines well-known papers (i.e., [45, 5]) with recent work (e.g., [29, 3]) in an attempt to provide a well-rounded overview.
One of the first comprehensive comparisons of linear methods with more advanced classifiers was West [45]. He tested five NN models, two parametric models (LR, LDA), and three nonparametric models (kNN, kernel density, and DT) on two real-world datasets. He found that in the case of both datasets, LR led to the lowest credit scoring error, followed by the NN models. He also found that the differences in performance scores of the superior models (LR and three different ways of implementing NN) vs. the outperformed models were not statistically significant. Overall, he concluded that LR was the best choice among the individual classifiers he tested. However, his methodology presented a few drawbacks that made some of his findings potentially questionable. First, West [45] used only one method of performance evaluation and ranking, namely, average scoring accuracy. Furthermore, the size of his datasets was small, containing approximately 1700 observations in total (1000 German credit applicants, 700 of which were creditworthy, and 690 Australian applicants, 307 of which were creditworthy).
Baesens et al. [5] remains one of the most comprehensive comparisons of different individual classification methods. This paper overcame the limitations in West [45] by using eight extensive datasets (for a total of 4875 observations) and multiple evaluation methods, such as the percentage of correctly classified cases, sensitivity, specificity, and the area under the receiver operating characteristic curve (henceforth, AUC, an accuracy metric that is widely used when evaluating different classifiers).^{3} However, the results reported by Baesens et al. [5] were similar to West’s [45]: NN and SVM classifiers had the best average results, but LR and LDA also showed a very good performance, suggesting that most of the credit datasets are only weakly nonlinear. These results have found further support in the work of Lessmann et al. [29], who updated the findings in [5] and showed that NN models perform better than the LR model, but only slightly.^{4}
These early papers did not contain any evidence on the performance of GA. One of the earliest papers comparing genetic algorithms with other credit scoring models is Yobas et al. [49], who compared the predictive performance of LDA with three computational intelligence techniques (a NN, a decision tree, and a genetic algorithm) using a small sample (1001 individuals) of credit scoring data. They found that LDA was superior to genetic algorithms and NN. Fritz and Hosemann [20] also reached a similar conclusion, even though doubts existed on their use of the same training and test sets for different techniques. Recently, these early results have been overturned. Ong et al. [35] compared the performance of genetic algorithms to MLP, decision trees (CART and C4.5), and LR using two real-world datasets, which included 1690 observations. Genetic algorithms turned out to outperform other methods, showing a solid performance even on relatively small datasets. Huang et al. [26] compared the performance of GA against NN, SVM, and decision tree models in a credit scoring application using the Australian and German benchmark data (for a total of almost 1700 credit applicants). Their study revealed that GA achieved a higher classification accuracy than the other techniques, although the differences are marginal. Abdou [1] has investigated the relative performance of GA using data from Egyptian public sector banks, comparing this technique with probit analysis, and reported that GA achieved the highest accuracy rate and also the lowest type-I and type-II errors when compared with other techniques.
One more recent and comprehensive study is that of Finlay [16], who evaluated the performance of five alternative classifiers, namely, LR, LDA, CART, NN, and kNN, using the rather large dataset of Experian UK on credit applications (including a total of 88,789 applications, 13,261 of which were classified as “bad”). He found that the individual model with the best performance is NN; however, he also showed that the advantage of nonlinear models over their linear counterparts is rather limited (in line with [5]).
Starting in 2010, most papers have shifted their focus to comparisons of the performance of ensemble classifiers, which are covered in the next section. However, some recent studies exist that evaluate the performance of individual classifiers. For instance, Ala’raj and Abbod [2] (who used five real-world datasets for a total of 3620 credit applications) and Bequé and Lessmann [7] (who used three real-world credit datasets for a total of 2915 applications) have found that LR has the best performance among the range of individual classifiers they considered. Although ML approaches are better at capturing nonlinear relationships, such as those that are typical in credit risk applications (see [4]), it could be concluded that, in general, a simple LR model remains a solid choice among individual classifiers.
4.2 Comparison of Ensemble Classifiers
According to Lessmann et al. [
29], the new methods that have appeared in ML have led to superior performance when compared to individual classifiers. However, only a few papers concerning credit scoring have examined the potential of ensemble methods, and most papers have focused on simple approaches. This section attempts to determine whether ensemble classifiers offer significant improvements in performance when compared to the best available individual classifiers and examines the issue of uncovering which ensemble methods may provide the most promising results. To succeed in this objective, we have selected and surveyed ten key papers concerning ensemble classifiers in the field of credit scoring.
West et al. [46] were among the first researchers to test the relative performance of ensemble methods in credit scoring. They selected three ensemble strategies, namely, cross-validation, bagging, and boosting, and compared them to the MLP NN as a base classifier on two datasets.^{5} West and coauthors concluded that among the three chosen ensemble classifiers, boosting was the most unstable and had a mean error higher than their baseline model. The remaining two ensemble methods showed statistically significant improvements in performance compared to MLP NN; however, they were not able to single out which ensemble strategy performed the best since they obtained contrasting results on the two test datasets. One of the main limitations of this seminal study is that only one metric of performance evaluation was employed. Another extensive paper on the comparative performance of ensemble classifiers is Zhou et al. [51]. They compared six ensemble methods based on LSSVM to 19 individual classifiers, with applications to two different real-world datasets (for a total of 1113 observations). The results were evaluated using three different performance measures, i.e., sensitivity, the percentage of correctly classified cases, and AUC. They reported that the ensemble methods assessed in their paper could not lead to results that would be statistically superior to an LR individual classifier. Even though the differences in performance were not large, the ensemble models based on the LSSVM provided promising solutions to the classification problem that were not worse than linear methods. Similarly, Louzada et al. [30] have recently used three famous and publicly available datasets (the Australian, the German, and the Japanese credit data) to perform simulations under both balanced (p = 0.5, 50% of bad payers) and imbalanced cases (p = 0.1, 10% of bad payers). They report that two methods, SVM and fuzzy complex systems, offer a superior and statistically significant predictive performance. However, they also notice that in most cases there is a shift in predictive performance when the method is applied to imbalanced data. Huang and Wu [25] report that the use of boosted GA methods improves the performance of underlying classifiers and appears to be more robust than single prediction methods. Marqués et al. [31] have evaluated the performance of seven individual classifier techniques when used as members of five different ensemble methods (among them, bagging and AdaBoost) on six real-world credit datasets using a five-fold cross-validation method (each original dataset was randomly divided into five stratified parts of equal size; for each fold, four blocks were pooled as the training data, and the remaining part was employed as the hold-out sample). Their statistical tests show that decision trees constitute the best solution for most ensemble methods, closely followed by the MLP NN and LR, whereas the kNN and the NB classifiers appear to be significantly the worst.
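The stratified five-fold scheme described for Marqués et al. can be sketched as follows. The function names, the round-robin assignment of shuffled indices, and the 70/30 toy labels are assumptions made for illustration and do not reproduce the authors' code.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Split instance indices into n_folds parts that preserve the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def train_test_splits(folds):
    """Yield (train, hold-out) index lists, holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, folds[held_out]


# Hypothetical sample with 70% "good" and 30% "bad" applicants.
toy_labels = ["good"] * 70 + ["bad"] * 30
toy_folds = stratified_folds(toy_labels)
for train_idx, test_idx in train_test_splits(toy_folds):
    bad_share = sum(toy_labels[i] == "bad" for i in test_idx) / len(test_idx)
    print(len(train_idx), len(test_idx), bad_share)
```

Each hold-out part in this toy sample contains 20 applicants with the same 30% share of “bad” cases as the full sample, so every fold is evaluated under the same class balance.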
All the papers discussed so far did not offer a comprehensive comparison of different ensemble methods, but rather they focused on a few techniques and compared them on a small number of datasets. Furthermore, they did not always adopt appropriate statistical tests of equal classification performance. The first comprehensive study that has attempted to overcome these issues is Lessmann et al. [
29], who have compared 16 individual classifiers with 25 ensemble algorithms over 8 datasets. The selected classifiers include both homogeneous (including bagging and boosting) and heterogeneous ensembles. The models were evaluated using six different performance metrics. Their results show that the best individual classifiers, namely, NN and LR, had average ranks of 14 and 16 respectively, being systematically dominated by ensemble methods. Based on the modest performance of individual classifiers, Lessmann et al. [
29] conclude that ML techniques have progressed notably since the first decade of the 2000s. Furthermore, they report that heterogeneous ensemble classifiers provide the best predictive performance.
Lessmann et al. [
29] have also examined the potential financial implications of using ensemble scoring methods. They considered 25 different cost ratios based on the assumption that accepting a “bad” application always costs more than denying a “good” application [
42]. After testing three models (NN, RF, and HCES-Bag) against LR, Lessmann et al. [29] conclude that for all cost ratios, the more advanced classifiers led to significant cost savings. However, the most accurate ensemble classifier, HCES-Bag, on average achieved lower cost savings than the radial basis function NN method, 4.8 percent and 5.7 percent, respectively. Based on these results, they suggested that the most statistically accurate classifier may not always be the best choice for improving the profitability of the credit lending business.
Two additional studies, Florez-Lopez and Ramon-Jeronimo [
18] and Xia et al. [
48], have focused on the interpretability of ensemble methods, constructing ensemble models that can be used to support managerial decisions. Their empirical results confirmed the findings of Lessmann et al. [
29] that ensemble methods consistently lead to better performances than individual scoring. Furthermore, they concluded that it is possible to build an ensemble model that has both high interpretability and a high accuracy rate. Overall, based on the papers considered in this section, it is evident that ensemble models offer higher accuracy compared to the best individual models. However, it is impossible to select one ensemble approach that will have the best performance over all datasets and error costs. We expect that scores of future papers will appear with new, more advanced methods and that the search for “the silver bullet” in the field of credit scoring will not end soon.
4.3 One-Class Classification Methods
Another promising development in credit scoring concerns one-class classification methods (OCC), i.e., ML methods that try to learn from one class only. One of the biggest practical obstacles to applying scoring methods is the class imbalance feature of most (if not all) datasets, the so-called low-default portfolio problem. Because financial institutions only store historical data concerning the accepted applicants, the characteristics of “bad” applicants present in their databases may not be statistically reliable to provide a basis for future predictions [27]. Empirical and theoretical work has demonstrated that the accuracy rate may be strongly biased with respect to imbalance in class distribution and that it may ignore a range of misclassification costs [14], as in applied work it is generally believed that the costs associated with type-II errors (bad customers misclassified as good) are much higher than the misclassification costs associated with type-I errors (good customers mispredicted as bad). OCC attempts to differentiate a set of target instances from all the others. The distinguishing feature of OCC is that it requires labeled instances in the training sample for the target class only, which in the case of credit scoring are “good” applicants (as the number of “good” applicants is larger than that of “bad” applicants). This section surveys whether OCC methods can offer a comparable performance to the best two-class classifiers in the presence of imbalanced data features.
The literature on this topic is still limited. One of the most comprehensive studies is a paper by Kennedy [27], who compared eight OCC methods, in which models are separately trained over different classes of datasets, with eight two-class individual classifiers (e.g., kNN, NB, LR) over three datasets. Two important conclusions emerged. First, the performance of two-class classifiers deteriorates significantly with an increasing class imbalance. However, the performance of some classifiers, namely, LR and NB, remains relatively robust even for imbalanced datasets, while the performance of NN, SVM, and kNN deteriorates rapidly. Second, one-class classifiers show superior performance compared to two-class classifiers only at high levels of imbalance (starting at 99% of “good” and 1% of “bad” applicants). However, the differences in performance between OCC models and the LR model were not statistically significant in most cases. Kennedy [27] concluded that OCC methods failed to show statistically significant improvements in performance compared to the best two-class classification methods. Using a proprietary dataset from a major US commercial bank from January 2005 to April 2009, Khandani et al. [28] showed that conditioning on certain changes in a consumer’s bank account activity can lead to considerably more accurate forecasts of credit card delinquencies. They analyzed subtle nonlinear patterns in consumer expenditures, savings, and debt payments using CART and SVM, and compared the results with simple regression and logit approaches. Importantly, their trees are “boosted” to deal with the imbalanced features of the data: instead of equally weighting all the observations in the training set, they weight the scarcer observations more heavily than the more populous ones.
These findings are in line with studies in other fields. Overall, the conclusion that can be drawn is that OCC methods should not be used for classification problems in credit scoring. Two-class individual classifiers show superior or comparable performance for all cases, except for cases of extreme imbalances.
5 Conclusion
The field of credit scoring represents an excellent example of how the application of novel ML techniques (including deep learning and GA) is in the process of revolutionizing both the computational landscape and the perception by practitioners and end-users of the relative merits of traditional vs. new, advanced techniques. On the one hand, in spite of their logical appeal, the available empirical evidence shows that ML methods often struggle to outperform simpler, traditional methods, such as LDA, especially when adequate tests of equal predictive accuracy are deployed. Although some of these findings may be driven by the fact that some of the datasets used by the researchers (especially in early studies) were rather small (as in the case, for instance, of West [45]), linear methods show a performance that is often comparable to that of ML methods also when larger datasets are employed (see, e.g., Finlay [17]). On the other hand, there is mounting experimental and on-the-field evidence that ensemble methods, especially those that involve ML-based individual classifiers, perform well, especially when realistic cost functions of erroneous classifications are taken into account. In fact, it appears that the issues of ranking and assessing alternative methods under adequate loss functions, and the dependence of such rankings on the cost structure specifications, may turn into a fertile ground for research development.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (
http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Footnotes
1
There are also applications in which the outcome variable is not binary; for instance, multinomial models are used to predict the probability that an applicant will move from one class of risk to another. For example, Sirignano et al. [
40] propose a nonlinear model of the performance of a pool of mortgage loans over their life; they use neural networks to model the conditional probability that a loan will transition to a different state (e.g., prepayment or default).
2
Causal interpretations of “black box” ML models have attracted considerable attention. Zhao and Hastie [
50] provide a summary and propose partial dependence plots (PDP) and individual conditional expectations (ICE) as tools to enhance the interpretation of ML models. Dorie et al. [
13] report interesting results of a data analysis competition where different strategies for causal inference—including “black box” models—are compared.
3
A detailed description of the performance measurement metrics that are generally used to evaluate the accuracy of different classification methods can be found in the previous chapter by Bargagli-Stoffi et al. [
6].
4
Importantly, compared to Baesens et al. [
5], Lessmann et al. [
29] used the more robust H-measure instead of the AUC as a key performance indicator for their analysis. Indeed, as emphasized by Hand [
21], the AUC has an important drawback as it uses different misclassification cost distributions for different classifiers (see also Hand and Anagnostopoulos [
22]).