A comparative assessment of ensemble learning for credit scoring

doi:10.1016/j.eswa.2010.06.048

Expert Systems with Applications

Volume 38, Issue 1, January 2011, Pages 223-230

Abstract

Both statistical techniques and Artificial Intelligence (AI) techniques have been explored for credit scoring, an important finance activity. Although there is no consistent conclusion on which is better, recent studies suggest that combining multiple classifiers, i.e., ensemble learning, may perform better. In this study, we conduct a comparative assessment of the performance of three popular ensemble methods, i.e., Bagging, Boosting, and Stacking, based on four base learners, i.e., Logistic Regression Analysis (LRA), Decision Tree (DT), Artificial Neural Network (ANN) and Support Vector Machine (SVM). Experimental results reveal that the three ensemble methods can substantially improve individual base learners. In particular, Bagging performs better than Boosting across all credit datasets. In our experiments, Stacking and Bagging DT achieve the best performance in terms of average accuracy, type I error and type II error.

Introduction

The recent global financial tsunami has drawn unprecedented attention from financial institutions to credit risk. Credit scoring has become one of the primary ways for financial institutions to assess credit risk, improve cash flow, reduce possible risks and make managerial decisions (Huang, Chen, & Wang, 2007).

The purpose of credit scoring is to classify applicants into two types: applicants with good credit and applicants with bad credit. Applicants with good credit are highly likely to repay their financial obligations. Applicants with bad credit are highly likely to default. The accuracy of credit scoring is critical to financial institutions’ profitability. Even a 1% improvement in the accuracy of credit scoring for applicants with bad credit can prevent a great loss for financial institutions (Hand & Henley, 1997).

Credit scoring was originally performed subjectively, according to personal experience, and was later based on the 5Cs: the character of the consumer, the capital, the collateral, the capacity and the economic conditions. With the tremendous increase in the number of applicants, however, it is impossible to conduct this work manually. Two categories of automatic credit scoring techniques, i.e., statistical techniques and Artificial Intelligence (AI) techniques, have been studied in prior research (e.g., Huang, Chen, Hsu, Chen, & Wu, 2004).

Some statistical techniques have been widely applied to build credit scoring models, such as Linear Discriminant Analysis (LDA) (Karels and Prakash, 1987, Reichert et al., 1983), Logistic Regression Analysis (LRA) (Thomas, 2000, West, 2000), and Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991). However, the problem with applying these statistical techniques to credit scoring is that some of their assumptions, such as the multivariate normality assumption for independent variables, are frequently violated in practice, which makes these techniques theoretically invalid for finite samples (Huang et al., 2004).

In recent years, many studies have demonstrated that AI techniques such as Artificial Neural Networks (ANN) (Desai et al., 1996, West, 2000), Decision Tree (DT) (Hung and Chen, 2009, Makowski, 1985), Case-Based Reasoning (CBR) (Buta, 1994, Shin and Han, 2001), and Support Vector Machine (SVM) (Baesens et al., 2003, Huang et al., 2007, Schebesch and Stecking, 2005) can be used as alternative methods for credit scoring. In contrast with statistical techniques, AI techniques do not assume certain data distributions. These techniques automatically extract knowledge from training samples. According to previous studies, AI techniques are superior to statistical techniques in dealing with credit scoring problems, especially for nonlinear pattern classification (Huang et al., 2004).

However, there is no overall best AI technique for building credit scoring models, because what is best depends on the details of the problem, the data structure, the characteristics used, the extent to which it is possible to segregate the classes by using those characteristics, and the objective of the classification (Hand and Henley, 1997, Yu et al., 2008). Recently, there has been growing interest in the idea that existing applications of a single AI technique can be further improved by ensemble methods. Recent studies (Hung and Chen, 2009, Yu et al., 2008) have shown that such ensemble methods perform better than a single AI technique for credit scoring. However, the application of ensemble methods in credit scoring is a relatively new and untried area. To the best of our knowledge, this may be the first attempt to systematically compare classical ensemble methods for credit scoring.

Based on these considerations, we conduct a comparative assessment of the performance of three popular ensemble methods, i.e., Bagging, Boosting, and Stacking, on credit scoring problems. The aim of this study is to examine the performance of different ensemble methods in the field of credit scoring in terms of average accuracy, type I error and type II error. Besides two commonly used datasets, i.e., the Australian and German credit datasets, which are from the UCI machine learning repository (Asuncion & Newman, 2007), our study uses a new credit dataset from China, collected mainly by the Industrial and Commercial Bank of China. In the experiments, we choose four methods that are popular in the literature, i.e., LRA, DT, ANN and SVM, as base learners. The results reveal that the application of ensemble learning can bring substantial improvement to individual base learners. In particular, Bagging performs better than Boosting across all datasets in our experiments. In addition, Stacking and Bagging DT achieve the best results in terms of the three performance indicators, i.e., average accuracy, type I error and type II error. Among the four base learners, DT shows the greatest improvement on the three performance indicators after the application of ensemble learning.
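As an illustration of the three performance indicators named above, the following is a minimal sketch of how they could be computed from true and predicted labels. The text shown here does not define the two error types, so the sketch assumes one common credit scoring convention: type I error is the share of bad applicants classified as good, and type II error is the share of good applicants classified as bad. If the opposite convention is intended, the two values simply swap. The function name `credit_metrics` and the label coding (1 for good, 0 for bad) are hypothetical.

```python
def credit_metrics(y_true, y_pred, good=1, bad=0):
    """Return accuracy, type I error and type II error for binary credit labels.

    Assumed convention (not taken from the paper):
      type I error  = bad applicants predicted as good / all bad applicants
      type II error = good applicants predicted as bad / all good applicants
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    n_bad = sum(1 for t, _ in pairs if t == bad)
    n_good = sum(1 for t, _ in pairs if t == good)
    bad_as_good = sum(1 for t, p in pairs if t == bad and p == good)
    good_as_bad = sum(1 for t, p in pairs if t == good and p == bad)
    correct = sum(1 for t, p in pairs if t == p)

    return {
        "accuracy": correct / len(pairs),
        # Guard against an empty class to avoid division by zero.
        "type_i_error": bad_as_good / n_bad if n_bad else 0.0,
        "type_ii_error": good_as_bad / n_good if n_good else 0.0,
    }


# Toy example: three good applicants and two bad applicants.
metrics = credit_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Averaging the `accuracy` value over repeated runs or cross-validation folds would give an average accuracy figure; the exact evaluation protocol is not stated in the text shown here.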

The remainder of the paper is organized as follows. Section 2 presents the details of the three types of ensemble methods for credit scoring. Section 3 describes the experimental design. Section 4 reports the experimental results. Based on the observations and results of these experiments, Section 5 draws conclusions and outlines future research directions.

Overviews of ensemble learning

Ensemble learning is a machine learning paradigm in which multiple learners are trained to solve the same problem (Polikar, 2006). In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use (Zhou, 2009). The learners that make up an ensemble are usually called base learners.
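The idea of building a set of hypotheses and combining them can be sketched with Bagging (bootstrap aggregating): each base learner is trained on a bootstrap resample of the training data, and predictions are combined by majority vote. The sketch below is only illustrative and is not the study's WEKA setup. The base learner is a hypothetical one-feature threshold rule (a decision stump) standing in for the four base learners named in the paper, and the toy data, function names and parameters (`n_learners`, `seed`) are assumptions.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-feature rule 'predict 1 when x >= threshold' with fewest errors."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in samples}):
        errors = sum(1 for x, y in samples if (1 if x >= threshold else 0) != y)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def stump_predict(threshold, x):
    """Apply a trained stump to a single feature value."""
    return 1 if x >= threshold else 0


def bagging_fit(samples, n_learners=25, seed=0):
    """Train n_learners stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        # Sample with replacement, same size as the original training set.
        bootstrap = [rng.choice(samples) for _ in samples]
        learners.append(train_stump(bootstrap))
    return learners


def bagging_predict(learners, x):
    """Combine the base learners' predictions by majority vote."""
    votes = Counter(stump_predict(t, x) for t in learners)
    return votes.most_common(1)[0][0]


# Toy data: (feature value, label), with label 1 for larger feature values.
samples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
           (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
learners = bagging_fit(samples)
low_vote = bagging_predict(learners, 0.05)
high_vote = bagging_predict(learners, 0.95)
```

An odd number of base learners is used so that a binary majority vote cannot tie. Boosting and Stacking combine base learners differently and are not covered by this sketch.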

One of the earliest studies on ensemble learning is Dasarathy and Sheela’s research (1979), which

Real world credit dataset

Three real world credit datasets are used to evaluate the performance of the three ensemble methods: the Australian credit dataset, the German credit dataset and the China credit dataset. The first two are from the UCI machine learning repository (Asuncion & Newman, 2007) and have been widely used in credit scoring research. The third is derived from 239 companies that were granted loans by the Industrial and Commercial Bank of China, a premier bank of China, between 2006 and 2007. This dataset

Results and analyses

The experiments described in this section were performed on a PC with a 3.00 GHz Intel Core Duo CPU and 4 GB RAM, running the Windows XP operating system. The data mining toolkit WEKA (Waikato Environment for Knowledge Analysis), version 3.6.0, was used for classification. WEKA is an open source toolkit that consists of a collection of machine learning algorithms for solving data mining problems (Witten & Frank, 2005).

For implementation of base learners, i.e., LRA, DT, ANN, and SVM, we chose logistic

Conclusions and future directions

Ensemble learning is a powerful machine learning paradigm that has exhibited clear advantages in many applications. In this study, a comparative assessment of three popular ensemble methods, i.e., Bagging, Boosting, and Stacking, based on four base learners, i.e., LRA, DT, ANN and SVM, is carried out. All these ensemble methods have been applied to three real world credit datasets, i.e., the Australian and German credit datasets, which are from the UCI machine learning repository, and China credit

Acknowledgements

The authors would like to thank the Editor-in-Chief and reviewers for their recommendation and comments. This work is partially supported by the grants from the Innovation and Technology Fund (ITF) of HK (GHP/006/07, InP/007/08).

References (28)

V. Desai et al. A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational Research (1996)

Z. Huang et al. Credit rating analysis with support vector machines and neural networks: A market comparative study. Decision Support Systems (2004)

C.L. Huang et al. Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications (2007)

C. Hung et al. A selective ensemble based on expected probabilities for bankruptcy prediction. Expert Systems with Applications (2009)

K.S. Shin et al. A case-based approach using inductive indexing for corporate bond rating. Decision Support Systems (2001)

L.C. Thomas. A survey of credit and behavioral scoring: Forecasting financial risks of lending to customers. International Journal of Forecasting (2000)

D. West. Neural network credit scoring models. Computers and Operations Research (2000)

D.H. Wolpert. Stacked generalization. Neural Networks (1992)

L.A. Yu et al. Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications (2008)

Asuncion, A. & Newman, D. J. (2007). UCI machine learning repository. Irvine, CA: University of California, School of...

B. Baesens et al. Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society (2003)

L. Breiman. Bagging predictors. Machine Learning (1996)

P. Buta. Mining for financial knowledge with CBR. AI Expert (1994)

B.V. Dasarathy et al. Composite classifier system design: Concepts and methodology. Proceedings of the IEEE (1979)

