A comparative assessment of ensemble learning for credit scoring

https://doi.org/10.1016/j.eswa.2010.06.048

Abstract

Both statistical techniques and Artificial Intelligence (AI) techniques have been explored for credit scoring, an important finance activity. Although there is no consistent conclusion on which are better, recent studies suggest that combining multiple classifiers, i.e., ensemble learning, may perform better. In this study, we conduct a comparative assessment of the performance of three popular ensemble methods, i.e., Bagging, Boosting, and Stacking, based on four base learners, i.e., Logistic Regression Analysis (LRA), Decision Tree (DT), Artificial Neural Network (ANN) and Support Vector Machine (SVM). Experimental results reveal that the three ensemble methods can substantially improve individual base learners. In particular, Bagging performs better than Boosting across all credit datasets. In our experiments, Stacking and Bagging DT achieve the best performance in terms of average accuracy, type I error and type II error.

Introduction

The recent global financial tsunami has drawn unprecedented attention from financial institutions to credit risk. Credit scoring has become one of the primary ways for financial institutions to assess credit risk, improve cash flow, reduce possible risks and make managerial decisions (Huang, Chen, & Wang, 2007).

The purpose of credit scoring is to classify applicants into two types: applicants with good credit and applicants with bad credit. Applicants with good credit are highly likely to repay their financial obligations. Applicants with bad credit are highly likely to default. The accuracy of credit scoring is critical to financial institutions’ profitability. Even a 1% improvement in the accuracy of scoring applicants with bad credit will avert a great loss for financial institutions (Hand & Henley, 1997).

Credit scoring was originally performed subjectively, according to personal experience, and later it was based on the 5Cs: the character of the consumer, the capital, the collateral, the capacity and the economic conditions. But with the tremendous increase in the number of applicants, it has become impossible to conduct the work manually. Two categories of automatic credit scoring techniques, i.e., statistical techniques and Artificial Intelligence (AI) techniques, have been studied in prior research (e.g., Huang, Chen, Hsu, Chen, & Wu, 2004).

Some statistical techniques have been widely applied to build credit scoring models, such as Linear Discriminant Analysis (LDA) (Karels and Prakash, 1987, Reichert et al., 1983), Logistic Regression Analysis (LRA) (Thomas, 2000, West, 2000), and Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991). However, the problem with applying these statistical techniques to credit scoring is that some assumptions, such as the multivariate normality assumption for independent variables, are frequently violated in credit scoring practice, which makes these techniques theoretically invalid for finite samples (Huang et al., 2004).

In recent years, many studies have demonstrated that AI techniques such as Artificial Neural Networks (ANN) (Desai et al., 1996, West, 2000), Decision Tree (DT) (Hung and Chen, 2009, Makowski, 1985), Case-Based Reasoning (CBR) (Buta, 1994, Shin and Han, 2001), and Support Vector Machine (SVM) (Baesens et al., 2003, Huang et al., 2007, Schebesch and Stecking, 2005) can be used as alternative methods for credit scoring. In contrast with statistical techniques, AI techniques do not assume certain data distributions. These techniques automatically extract knowledge from training samples. According to previous studies, AI techniques are superior to statistical techniques in dealing with credit scoring problems, especially for nonlinear pattern classification (Huang et al., 2004).

However, there is no overall best AI technique for building credit scoring models, for what is best depends on the details of the problem, the data structure, the characteristics used, the extent to which it is possible to segregate the classes by using those characteristics, and the objective of the classification (Hand and Henley, 1997, Yu et al., 2008). Recently, there has been growing interest in the idea that existing applications of a single AI technique can be further improved by ensemble methods. Recent studies (Hung and Chen, 2009, Yu et al., 2008) have shown that such ensemble methods perform better than a single AI technique for credit scoring. However, the application of ensemble methods in credit scoring is a relatively new and little-explored area. To the best of our knowledge, this may be the first attempt to systematically compare classical ensemble methods for credit scoring.

Based on these considerations, we conduct a comparative assessment of the performance of three popular ensemble methods, i.e., Bagging, Boosting, and Stacking, on credit scoring problems. The aim of this study is to examine the performance of different ensemble methods in the field of credit scoring in terms of average accuracy, type I error and type II error. Besides two commonly used datasets, i.e., the Australian and German credit datasets, which are from the UCI machine learning repository (Asuncion & Newman, 2007), our study uses a new credit dataset from China, collected mainly by the Industrial and Commercial Bank of China. In the experiments we choose four methods that are popular in the literature, i.e., LRA, DT, ANN and SVM, as base learners. The results reveal that the application of ensemble learning can bring substantial improvement to individual base learners. In particular, in our experiments Bagging performs better than Boosting across all datasets. In addition, Stacking and Bagging DT achieve the best results in terms of the three performance indicators, i.e., average accuracy, type I error and type II error. Among the four base learners, DT improves the most on the three performance indicators after the application of ensemble learning.
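The three performance indicators named above can be illustrated with a small Python sketch. The text shown here does not define the two error types, so the convention in the code is an assumption: type I error is taken as the share of good applicants classified as bad, and type II error as the share of bad applicants classified as good. The function name `credit_metrics`, the label coding (0 = good, 1 = bad) and the toy labels are hypothetical. The accuracy computed here is for a single run; "average accuracy" would presumably be this value averaged over repeated runs or folds.

```python
def credit_metrics(actual, predicted, bad_label=1):
    """Return (accuracy, type I error, type II error) for one test run.

    Assumed convention (not stated in the surrounding text):
      type I error  = share of good applicants classified as bad
      type II error = share of bad applicants classified as good
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")

    good_total = bad_total = 0
    good_as_bad = bad_as_good = 0
    for a, p in zip(actual, predicted):
        if a == bad_label:
            bad_total += 1
            if p != bad_label:
                bad_as_good += 1
        else:
            good_total += 1
            if p == bad_label:
                good_as_bad += 1

    total = good_total + bad_total
    accuracy = (total - good_as_bad - bad_as_good) / total
    # Guard against a test set that contains only one class.
    type_1 = good_as_bad / good_total if good_total else 0.0
    type_2 = bad_as_good / bad_total if bad_total else 0.0
    return accuracy, type_1, type_2


# Made-up labels for illustration only: 0 = good credit, 1 = bad credit.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
accuracy, type_1, type_2 = credit_metrics(actual, predicted)
print(accuracy, type_1, type_2)
```

Reporting the two error rates separately matters because the two mistakes have different costs: accepting an applicant who later defaults is usually more expensive than rejecting one who would have repaid.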

The remainder of the paper is organized as follows. In Section 2, the details of three different types of ensemble methods for credit scoring are presented. In Section 3, we present the details of the experimental design. Section 4 reports the experimental results. Based on the observations and results of these experiments, Section 5 draws conclusions and outlines future research directions.

Section snippets

Overviews of ensemble learning

Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem (Polikar, 2006). In contrast to ordinary machine learning approaches that try to learn one hypothesis from the training data, ensemble methods try to construct a set of hypotheses and combine them for use (Zhou, 2009). The learners that make up an ensemble are usually called base learners.
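As an illustration of constructing a set of hypotheses and combining them, the following is a minimal Python sketch of Bagging in the general sense of Breiman (1996): each base learner is trained on a bootstrap sample of the training data, and predictions are combined by majority vote. It is not the WEKA setup used in the experiments. The base learner is a one-feature decision stump chosen only to keep the example short, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels):
    """Fit a one-feature threshold rule (decision stump) by exhaustive search."""
    best = None  # (errors, feature, threshold, label_if_below, label_if_above)
    classes = sorted(set(labels))
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for lo in classes:
                for hi in classes:
                    errors = sum(
                        (lo if r[f] <= t else hi) != y
                        for r, y in zip(rows, labels)
                    )
                    if best is None or errors < best[0]:
                        best = (errors, f, t, lo, hi)
    _, f, t, lo, hi = best
    return lambda row: lo if row[f] <= t else hi


def bagging_fit(rows, labels, n_learners=15, seed=0):
    """Train n_learners stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(rows)
    learners = []
    for _ in range(n_learners):
        # Bootstrap: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        learners.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx])
        )
    return learners


def bagging_predict(learners, row):
    """Combine the base learners' predictions by majority vote."""
    votes = Counter(h(row) for h in learners)
    return votes.most_common(1)[0][0]


# Made-up data: one feature (e.g. a debt ratio); 0 = good credit, 1 = bad credit.
rows = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = bagging_fit(rows, labels)
low_risk = bagging_predict(ensemble, [0.05])
high_risk = bagging_predict(ensemble, [0.95])
print(low_risk, high_risk)
```

An odd number of base learners is used so that a two-class vote cannot tie. Boosting and Stacking differ in how the hypotheses are built and combined, and are not covered by this sketch.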

One of the earliest studies on ensemble learning is Dasarathy and Sheela’s research (1979), which

Real world credit dataset

Three real-world credit datasets are used to evaluate the performance of the three ensemble methods: the Australian credit dataset, the German credit dataset and the China credit dataset. The first two are from the UCI machine learning repository (Asuncion & Newman, 2007) and have been widely used in credit scoring research. The third is derived from 239 companies that were granted loans by the Industrial and Commercial Bank of China, a premier bank in China, between 2006 and 2007. This dataset

Results and analyses

The experiments described in this section were performed on a PC with a 3.00 GHz Intel Core Duo CPU and 4 GB RAM, running the Windows XP operating system. The data mining toolkit WEKA (Waikato Environment for Knowledge Analysis) version 3.6.0 was used for classification. WEKA is an open source toolkit that consists of a collection of machine learning algorithms for solving data mining problems (Witten & Frank, 2005).

For implementation of base learners, i.e., LRA, DT, ANN, and SVM, we chose logistic

Conclusions and future directions

Ensemble learning is a powerful machine learning paradigm that has exhibited clear advantages in many applications. In this study, a comparative assessment of three popular ensemble methods, i.e., Bagging, Boosting, and Stacking, based on four base learners, i.e., LRA, DT, ANN and SVM, is carried out. All these ensemble methods have been applied to three real-world credit datasets, i.e., the Australian and German credit datasets, which are from the UCI machine learning repository, and China credit

Acknowledgements

The authors would like to thank the Editor-in-Chief and reviewers for their recommendation and comments. This work is partially supported by the grants from the Innovation and Technology Fund (ITF) of HK (GHP/006/07, InP/007/08).

References (28)

  • B. Baesens et al. Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society (2003)
  • L. Breiman. Bagging predictors. Machine Learning (1996)
  • P. Buta. Mining for financial knowledge with CBR. AI Expert (1994)
  • B.V. Dasarathy et al. Composite classifier system design: Concepts and methodology. Proceedings of the IEEE (1979)