Published in: Complex & Intelligent Systems 2/2023

Open Access 12.09.2022 | Original Article

Credit risk assessment mechanism of personal auto loan based on PSO-XGBoost Model

Authors: Congjun Rao, Ying Liu, Mark Goh


Abstract

As online P2P lending in automotive financing grows, there is a need to manage and control the credit risk of personal auto loans. In this paper, a personal auto loan data set from the Kaggle platform is used to build a machine-learning-based credit risk assessment mechanism for personal auto loans. An integrated Smote-Tomek Link algorithm is proposed to convert the data set into a balanced data set. Then, an improved Filter-Wrapper feature selection method is presented to select credit risk assessment indexes for the loans. Combining Particle Swarm Optimization (PSO) with the eXtreme Gradient Boosting (XGBoost) model, a PSO-XGBoost model is formed to assess the credit risk of the loans. The PSO-XGBoost model is compared against the XGBoost, Random Forest, and Logistic Regression models on the standard performance evaluation indexes of accuracy, precision, ROC curve, and AUC value. The PSO-XGBoost model is found to be superior in classification performance and classification effect.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

China's auto finance market started relatively late, and the idea of buying a car by installment only appeared in 1993. In 1998, the Government introduced a policy encouraging automobile consumer loans, effectively kick-starting China's automobile finance market. By 2018, China's automotive finance market had reached 139 million yuan, a growth of 19.2%. With better personal credit information, this market is set to grow. According to China's banking regulatory commission, from 2013 to 2017, the compounded annual growth rate of outstanding loans in China's auto financing business was as high as 29%. By the end of 2017, the loan balance of the auto finance business in China had reached 668.8 billion yuan, an increase of 28.39% year-on-year. Today, the auto finance industry accounts for an increasing proportion of the overall personal credit and finance industry, and its influence on China's economy is also increasing, together with the accompanying financial credit risks. Two factors compound this phenomenon: greater lifestyle consumption and easier access to online finance [1, 2]. Indeed, the auto finance industry has many advantages, such as a flexible credit verification process and simpler vetting procedures, compared to the traditional financial institutions [3–5].
At present, a variety of auto finance products is available in the market, such as the highly popular P2P online auto finance or the micro-loan network. In the tide of the Internet, various companies are trying to attract consumers with new technologies and new models, hoping to take the lead in the field of auto finance. All these signs indicate that the auto finance industry will develop rapidly in the future. This development has brought not only many benefits but also some drawbacks. Because the auto finance industry is characterized by high risks and high returns, without effective industry-wide control measures it cannot develop into a sustainable and healthy auto finance industry [6–9]. At present, China's auto finance industry is at a preliminary stage of development, with many remaining problems, such as an imperfect personal credit investigation system, inadequate laws and regulations, and inadequate risk supervision and management, all of which make the credit risk problem particularly important. Therefore, scientific and effective control of corporate credit risk has become an important issue in the development of China's auto finance industry.
The main participants in China's auto finance industry are auto finance companies, financial leasing companies, internet finance companies, and banking institutions, and the whole industrial chain is relatively complete. In terms of market share in China, however, auto financing companies occupy the main part of the market, and most Chinese consumers choose the financial products of auto financing companies. As the economic entities behind the subsidiary products of the automobile industry, auto financing companies mainly handle automobile consumer finance loans. In the process of operation, an auto financing company not only pursues its own profits, but also undertakes the task of providing consumer loans to customers to promote the sales of vehicles.
Compared with banks, auto financing companies bear more credit risk due to their specific business purposes. Due to various institutional defects in the current business model, it is difficult to form sound and effective risk management measures. In addition, the imperfection of the credit system and industrial factors such as the fluctuation of automobile prices lead to a large number of bad debts in the auto finance industry [10–12]. Against this background, financial institutions have suffered significant losses due to vehicle loan defaults, auto underwriting has been tightened, and the rejection rate of auto loans has increased. Credit institutions demand rigorous credit risk assessment models that accurately predict the probability of a borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Installment) on the due date, so as to identify customers with high credit risk and further reduce the default rate. Moreover, doing so ensures that clients capable of repayment are not rejected, and important determinants can be identified and further used to minimize default rates. Motivated by this, this paper studies how to establish a credit risk assessment model for auto financing companies with high classification and prediction accuracy, so as to not only guarantee their own earnings, but also control the bad debt rate generated by credit. This has important practical significance for auto financing companies and even the whole auto financing industry.
Compared with the existing congeneric methods for the credit risk assessment of personal auto loan, this paper makes two contributions as follows.
(i)
First, to reduce the feature dimension, enhance the generalizability of the model, and reduce the possibility of overfitting, and given the 45 preliminary indexes and the limitations of any single feature selection method, this paper proposes an improved Filter-Wrapper feature selection method by combining Filter and Wrapper. In the Filter stage, three evaluation criteria, namely the Relief algorithm, the Maximum Information Coefficient method, and the Quasi-separable method, are selected. Then, the order relation analysis method is used to determine the corresponding weights of the three evaluation criteria, and a fusion model of multiple evaluation criteria is constructed to comprehensively rank the feature importance. In the Wrapper stage, the RF is selected as the classifier and the SBS method is used to screen the final optimal feature subset, thus effectively improving the classification accuracy of subsequent models.
 
(ii)
Second, most scholars study credit risk assessment in the traditional financial field, but there is little research on the credit risk assessment of personal auto loans in the auto finance industry. In today's internet era, China's auto finance industry is developing rapidly, and it is necessary to study the increasingly prominent credit risks of auto loans in its development process. Based on this, this paper proposes a PSO-XGBoost model for the credit risk assessment of personal auto loans, which is novel for the research on the credit risk assessment of auto loans in China's auto finance industry. To evaluate the performance of the models, the PSO-XGBoost model is compared against the XGBoost, RF, and LR models on performance evaluation indexes such as accuracy, precision, ROC curve, and AUC value. The results show the PSO-XGBoost model to be superior to the other models in classification performance and classification effect. This validates the choice of the PSO-XGBoost model for the credit risk assessment of personal auto loans.
 
This paper is organized as follows. Section “Literature review” surveys the literature. Section “Data preprocessing and unbalanced data set transformation” presents the data preprocessing and the transformation of the unbalanced data set. Section “Feature selection method of credit risk assessment index” proposes a Filter-Wrapper feature selection method to select the credit risk assessment indexes for the personal auto loans. Section “Credit risk assessment of personal auto loans using PSO-XGBoost model” presents a PSO-XGBoost model for the credit risk assessment of the personal auto loans and the accompanying empirical analysis. The final section concludes the paper.

Literature review

Adopting appropriate feature selection method to remove redundant features and reduce the dimension of data can effectively improve the computational speed and classification performance of the algorithm. Therefore, feature selection is indispensable in processing massive data. Currently, the popular feature selection methods include Filter [13], Wrapper [14], Modified-Dynamic Feature Importance based Feature Selection (M-DFIFS) algorithm [15], Mean Fisher-based Feature Selection Algorithm (MFFSA) [16], Markov Blanket-based Universal Feature Selection [17], Improved Binary Global Harmony Search (IBGHS) [18], MCDM-based method [19], joint semantic and structural information of labels [20], and the fast multi-objective evolutionary feature selection algorithm (FMABC-FS) [21].
The Filter method is simple and feasible, and researchers have developed evaluation criteria such as the Relief algorithm, the Maximal Information Coefficient method, and the Information Gain method. The Relief algorithm, a feature weighting method proposed by Kira [22], assigns weights to the features according to their ability to distinguish the samples. The weight is then compared with a threshold value; if the weight of a feature is less than the threshold value, the feature is deleted. In applying the Filter method, Ma and Gao [23] employed a filter-based feature selection approach using Genetic Programming (GP) with a correlation-based evaluation method, and their experiments on nine datasets show that features selected by their feature construction approach (FCM) improve the classification performance compared to the original features. Thabtah et al. [13] proposed a simple filter method to quantify the similarity between the observed and expected probabilities and generate scores for the features. They report that their approach significantly reduces the number of selected features on 27 datasets. The Wrapper method takes the accuracy obtained by the subsequent learning algorithm as the evaluation criterion. Compared to the Filter method, the Wrapper method is computationally complex with low operation efficiency albeit high accuracy. Gokalp et al. [24] proposed a wrapper feature selection algorithm using an iterative greedy metaheuristic for sentiment classification. Khammassi and Krichen [25] presented a NSGA2-LR wrapper approach for feature selection in network intrusion detection. González et al. [26] applied a new wrapper method for feature selection, based on a multi-objective evolutionary algorithm, to analyze the accuracy and stability for BCI. Mafarja and Mirjalili [27] proposed a wrapper feature selection approach based on the Whale Optimization algorithm.
A single feature selection method is often not comprehensive, and the Filter and Wrapper methods have their own merits and drawbacks. As such, some studies combine both methods and propose fusion feature selection methods that combine a variety of evaluation criteria. For example, Rajab [28] analyzed the advantages and disadvantages of the Information Gain (IG) algorithm and the Chi-square (CHI) algorithm, and then used them in combination. Solorio-Fernández et al. [29] presented a hybrid filter–wrapper method for clustering, which combines the spectral feature selection framework using the Laplacian Score ranking and a modified Calinski–Harabasz index. Rao et al. [30] presented a two-stage feature selection method based on the filter and wrapper to select the main features from 35 borrower credit features. In the Filter stage, three filter methods are used to compute the importance of the unbalanced features. In the Wrapper stage, a Lasso-logistic method is used to filter the feature subset using a search algorithm.
Thus, following the earlier works, this paper combines the Filter and Wrapper methods to propose an improved Filter-Wrapper two-stage feature selection method to select the credit risk assessment indexes of the personal auto loans. However, compared to the existing fusion approach of the Filter and Wrapper methods, our two-stage feature selection method is different on the following aspects. In the Filter stage, we consider the aspects of information relevance, amount of information, and quasi-separable ability to, respectively, select three evaluation criteria, i.e., Relief algorithm, Maximal Information Coefficient method and Quasi-separable method to evaluate the importance of the features. A fusion model of multiple evaluation criteria is then constructed to rank the importance of the features. In the Wrapper stage, the Random Forest (RF) is selected as the classifier; the classification accuracy is used as the measurement standard, and the Sequence Backward Selection (SBS) method [31] is used for the feature selection. Based on the classification accuracy, the quality of the corresponding feature subset is evaluated, and the optimal feature subset is selected as a result of the evaluation indexes for the credit risk assessment of the personal auto loans.
Auto finance credit stems from consumer credit finance, notably individual credit risk assessment. The traditional analysis methods, such as 5C and LAPP, are subjective and highly dependent on expert experience. Research then switched to mathematical models for credit risk assessment; Durand [32] was the first to use discriminant analysis to assess individual credit risk. With the advent of better computing power and the availability of massive data sets, artificial intelligence methods such as machine learning, data mining, and deep learning have emerged.
However, traditional statistics, non-parametric statistics, machine learning, and data mining have been applied separately to credit risk assessment. With these single-technique methods, there are often problems associated with low prediction precision, model overfitting, and low algorithm efficiency. Therefore, researchers have since combined statistical methods with artificial intelligence methods such as machine learning and data mining to address those shortcomings when applied to individual credit risk assessment. For example, Yu and Wang [33] proposed a kernel principal components analysis based least squares fuzzy support vector machine method with variable penalty factors for credit classification, and conducted an empirical analysis to prove the effectiveness of the model. Combining decision tree theory with machine learning methods, Rao et al. [34] selected a loan data set on the Pterosaur Loan platform, and used a two-stage Syncretic Cost-sensitive Random Forest (SCSRF) model to evaluate the credit risk of the borrowers. Further, Lanzarini et al. [35] combined particle swarm optimization with competitive neural networks to propose an LVQ + PSO model to predict a credit customer's loan situation. Barani et al. [36] proposed a new improved Particle Swarm Optimization (PSO) combined with Chaotic Cellular Automata (CCA). Similarly, Mojarrad and Ayubi [37] proposed a novel approach in particle swarm optimization (PSO) that combines chaos and velocity clamping with the aim of eliminating its known disadvantage that forces particles to keep searching at the boundaries of the search space. However, as credit datasets are typically high-dimensional, class-imbalanced, and of large sample size, Liu et al. [38] recently proposed an Evolutionary Multi-Objective Soft Subspace Clustering (EMOSSC) algorithm for credit risk assessment. Luo et al. [39] employed a two-stage clustering method using a kernel-free support vector machine, and applied the method incorporating t-test feature weights for credit risk assessment.
While there is rich research on personal credit risk assessment, particularly on optimizing the performance of the current credit risk assessment models by either improving or combining statistical methods with artificial intelligence to obtain better prediction, there is little literature on the credit risk assessment of personal auto loans in the auto finance industry. In this paper, we study the problem of the credit risk assessment of personal auto loans, and combine Particle Swarm Optimization (PSO) with the XGBoost model to form a PSO-XGBoost model to evaluate the credit risk of personal auto loans. We validate the PSO-XGBoost model against three evaluation models (XGBoost, RF, and LR).

Data preprocessing and unbalanced data set transformation

To study the current credit risk problem in the auto finance industry and to reduce the loan default rate of the auto financing institutions, we select the data set of personal auto loans on the Kaggle platform as the research samples. The data set is first preprocessed, and transformed according to the specific indexes to construct an overall index. Next, based on the description and value range of the indexes in the data set, the credit risk assessment indexes are preliminarily pre-screened. The unbalanced data set is processed and transformed into a balanced data set.

Data preparation and preprocessing

The data set studied in this paper is a set of personal auto loan records from auto financing institutions, available on an open data platform, the Kaggle platform. The data set can be downloaded from https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction.
The selected data set contains 233,154 customer loan records, of which 182,543 loan records represent the set of non-defaulting customers and 50,611 loan records represent the set of defaulters. In addition, the data set contains 41 indexes, of which 40 indexes are the independent variables used to predict a borrower's loan default. The following information regarding the loan and loanee is provided in the 40 indexes: loanee information (demographic data such as age, identity proof, etc.), loan information (disbursal details, loan-to-value ratio, etc.), and bureau data and history (bureau score, number of active accounts, the status of other loans, credit history, etc.). These indexes reveal the borrower's personal information, economic health, and credit history. Another index, loan_default, marks whether the borrower has defaulted, and is labeled the dependent variable. This index divides borrowers into binary categories: “0” to denote the non-defaulters, and “1” to denote defaulters. The data set has no missing values except in the index “Employment.Type”. Table 1 provides a description of the notation used.
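As a minimal loading sketch (the local file name train.csv is an assumption; adjust the path to wherever the downloaded Kaggle file is saved), the raw data set can be read and checked as follows:

```python
import pandas as pd

# Hypothetical local copy of the Kaggle vehicle-loan data set.
df = pd.read_csv("train.csv")

print(df.shape)                                   # expected: (233154, 41)
print(df["loan_default"].value_counts())          # 0 = non-defaulters, 1 = defaulters
print(df.isna().sum().loc[lambda s: s > 0])       # only Employment.Type should report missing values
```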
Table 1
Notation and description
Index | Label | Description
Y | loan_default | Payment default in the first EMI on due date
X1 | UniqueID | Identifier for customers
X2 | disbursed_amount | Amount of loan disbursed
X3 | asset_cost | Cost of the Asset
X4 | ltv | Loan to Value of the asset
X5 | branch_id | Branch where the loan was disbursed
X6 | supplier_id | Vehicle dealer where the loan was disbursed
X7 | manufacturer_id | Vehicle manufacturer (Hero, Honda, TVS)
X8 | Current_pincode_ID | Current pincode of the customer
X9 | Date.of.Birth | Date of birth of the customer
X10 | Employment.Type | Employment type of the customer (Salaried/Self Employed)
X11 | DisbursalDate | Date of disbursement
X12 | State_ID | State of disbursement
X13 | Employee_code_ID | Employee of the organization who logged the disbursement
X14 | MobileNo_Avl_Flag | If Mobile no. was shared by the customer then flag as 1
X15 | Aadhar_flag | If aadhar was shared by the customer then flag as 1
X16 | PAN_flag | If pan was shared by the customer then flag as 1
X17 | VoterID_flag | If voter was shared by the customer then flag as 1
X18 | Driving_flag | If DL was shared by the customer then flagged as 1
X19 | Passport_flag | If passport was shared by the customer then flag as 1
X20 | PERFORM_CNS.SCORE | Bureau Score
X21 | PERFORM_CNS.SCORE.DESCRIPTION | Bureau score description
X22 | PRI.NO.OF.ACCTS | Count of total loans taken by the customer at the time of first disbursement
X23 | PRI.ACTIVE.ACCTS | Count of active loans taken by the customer at the time of first disbursement
X24 | PRI.OVERDUE.ACCTS | Count of default accounts at the time of first disbursement
X25 | PRI.CURRENT.BALANCE | Total principal outstanding of the active loans at the time of first disbursement
X26 | PRI.SANCTIONED.AMOUNT | Total amount that was sanctioned for all the loans at the time of first disbursement
X27 | PRI.DISBURSED.AMOUNT | Total amount that was disbursed for all the loans at the time of first disbursement
X28 | SEC.NO.OF.ACCTS | Count of total loans taken by the customer at the time of second disbursement
X29 | SEC.ACTIVE.ACCTS | Count of active loans taken by the customer at the time of second disbursement
X30 | SEC.OVERDUE.ACCTS | Count of default accounts at the time of disbursement
X31 | SEC.CURRENT.BALANCE | Total principal outstanding of the active loans at the time of second disbursement
X32 | SEC.SANCTIONED.AMOUNT | Total amount that was sanctioned for all the loans at the time of second disbursement
X33 | SEC.DISBURSED.AMOUNT | Total amount that was disbursed for all the loans at the time of second disbursement
X34 | PRIMARY.INSTAL.AMT | Equated Monthly Installment (EMI) Amount of the primary loan
X35 | SEC.INSTAL.AMT | EMI Amount of the secondary loan
X36 | NEW.ACCTS.IN.LAST.SIX.MONTHS | New loans taken by the borrower in last 6 months before the disbursement
X37 | DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | Loans defaulted in the last 6 months
X38 | AVERAGE.ACCT.AGE | Average loan tenure
X39 | CREDIT.HISTORY.LENGTH | Time since first loan
X40 | NO.OF_INQUIRIES | Enquiries done by the customer for loans
Table 1 shows four types of data in the data set: integer, floating point, date, and character. Among them, index X4 is floating point, X9 and X11 are date types, X21, X38, and X39 are character types, and the other indexes are integer types. The date-type and character-type data cannot be used directly, so data cleansing and data conversion are needed. Data cleansing mainly deals with data exceptions, including missing values processing, type values processing, exception point processing, and outliers processing. Data conversion enhances data processing through data discretization, data specification, or the creation of new variables.
(1) Data cleansing.
(i) Type values processing.
In the data set, index X9 (Date of birth of the customer) and X11 (Date of disbursement) are date type indexes, which are processed as follows. The date of birth of the customer is converted to the current age, and the date of disbursement is converted to the number of months from the current time. For the character type indexes X38 (Average loan tenure) and X39 (Time since first loan), their index values are converted to the number of months. For index X10 (Employment type of the customer), the Self Employed type is denoted as 0, and the Salaried type is denoted as 1. There are missing values in this index X10. In addition, there are 20 components in index X21 (Bureau score description), which are converted using the literal meaning of the description. Table 2 contains the specific conversion results.
Table 2
Risk transformation of bureau score description
Risk range of bureau score description | Score
No Bureau History Available | 0
Not Scored: Sufficient History Not Available | 0
Not Scored: Not Enough Info available on the customer | 0
Not Scored: No Activity seen on the customer (Inactive) | 0
Not Scored: No Updates available in last 36 months | 0
Not Scored: Only a Guarantor | 0
Not Scored: More than 50 active Accounts found | 0
M-Very High Risk | 1
L-Very High Risk | 2
K-High Risk | 3
J-High Risk | 4
I-Medium Risk | 5
H-Medium Risk | 6
G-Low Risk | 7
F-Low Risk | 8
E-Low Risk | 9
D-Very Low Risk | 10
C-Very Low Risk | 11
B-Very Low Risk | 12
A-Very Low Risk | 13
(ii) Exception point processing and outliers processing.
In looking for outliers in the data set, we note that some age values derived from index X9 (Date of birth of the customer) are less than or equal to zero, which is implausible. Hence, we replace them with null values and treat them as missing values. Also, for the indexes X25 (Total principal outstanding of the active loans at the time of first disbursement) and X31 (Total principal outstanding of the active loans at the time of second disbursement), some index values are less than zero, which is invalid, and they are likewise replaced with null values.
(iii) Missing values processing.
The objects with a null value in index X9 (Date of birth of the customer) are filled with the values of the mean age. For the missing values in index X10 (Employment type of the customer), the RF machine learning algorithm is used to fill them. The employment type of the borrower is taken as a dependent variable; the other indexes are treated as independent variables. The existing employment type data are trained in the random forest, to classify and predict the unknown employment types.
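A hedged sketch of this imputation step is given below; the column names follow Table 1 and the pre-screened features, and the helper and its settings are illustrative rather than the authors' exact code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_employment_type(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Fill missing Employment.Type values with a Random Forest trained on the labeled records."""
    df["age"] = df["age"].fillna(df["age"].mean())            # mean-age fill for the X9-derived age
    known = df[df["Employment.Type"].notna()]
    unknown = df[df["Employment.Type"].isna()]
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(known[feature_cols], known["Employment.Type"])     # the other indexes act as independent variables
    df.loc[unknown.index, "Employment.Type"] = rf.predict(unknown[feature_cols])
    return df
```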
(2) Data transformation.
As the data set contains many indexes with the same meaning occurring at different times (the first and second disbursements), notably indexes X22 and X28, and indexes X23 and X29, we merge the indexes with the same or similar meaning to yield composite indexes, as shown in Table 3.
Table 3
Composite indexes
Index | Label | Description
X41 | Loan_to_asset_ratio | Ratio of loan disbursed amount to the asset cost
X42 | Total_no_of_accts | Count of total loans taken by the customer at the first and second time of disbursement
X43 | Pri_inactive_accts | Count of total inactive loans taken by the customer at the first time of disbursement
X44 | Sec_inactive_accts | Count of total invalid loans taken by the customer at the second time of disbursement
X45 | Total_inactives_accts | Count of total invalid loans taken by the customer at the first and second time of disbursement
X46 | Total_actives_accts | Count of total active loans taken by the customer at the first and second time of disbursement
X47 | Total_current_balance | Total principal outstanding amount of the active loans at the first and second time of disbursement
X48 | Total_sanctioned_amount | Total amount that was sanctioned for all the loans at the first and second time of disbursement
X49 | Total_disbursed_amount | Total amount that was disbursed for all the loans at the first and second time of disbursement
X50 | Total_instal_amt | EMI amount of the primary and secondary loan
X51 | Pri_loan_proportions | Proportion of the primary total loans to the principal
X52 | Sec_loan_proportions | Proportion of the secondary total loan to the principal
X53 | Active_to_inactive_act_ratio | Ratio of the customer's total loans to the invalid loans
The approach for merging the indexes in Table 3 is as follows. The indexes loan_to_asset_ratio, Total_no_of_accts, Pri_inacitve_accts, Sec_inactive_accts, Total_inactives_accts, Total_actives_accts, Total_current_balance, Total_sanctioned_amount, Total_disbursed_amount, Total_instal_amt, Pri_loan_proportions, Sec_loan_proportions, and Active_to_inactive_act_ratio, are denoted by X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, and X53, respectively, and their index values are as follows:
$$ \begin{aligned} X_{41} &= \frac{X_{2}}{X_{3}}, & X_{42} &= X_{22} + X_{28}, \\ X_{43} &= X_{22} - X_{23}, & X_{44} &= X_{28} - X_{29}, \\ X_{45} &= X_{22} - X_{23} + X_{28} - X_{29}, & X_{46} &= X_{23} + X_{29}, \\ X_{47} &= X_{25} + X_{31}, & X_{48} &= X_{26} + X_{32}, \\ X_{49} &= X_{27} + X_{33}, & X_{50} &= X_{34} + X_{35}, \\ X_{51} &= \frac{X_{27}}{X_{34} + 1}, & X_{52} &= \frac{X_{33}}{X_{35} + 1}, \\ X_{53} &= \frac{X_{22} + X_{28}}{X_{22} - X_{23} + X_{28} - X_{29} + 1}. \end{aligned} $$
Creating the new composite indexes yields 54 indexes in total. Of these, 53 indexes are independent variables related to the borrower's information and one index is the dependent variable. From the data, there are 12 indexes with no zero values, namely X1, X2, X3, X4, X5, X6, X7, X8, X9, X11, X12, and X13, but there are many zeros in the index values of the other 42 indexes. Hence, if more than three-quarters of the index values of a record are zero, then the record is deemed invalid and deleted accordingly. As a result, 117,156 loan records remain for research and analysis.
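A sketch of this construction in pandas is shown below, using the short index codes X2, X3, ... as column names (the raw file uses the longer labels from Table 1, so a renaming step is assumed):

```python
df["X41"] = df["X2"] / df["X3"]               # loan_to_asset_ratio
df["X42"] = df["X22"] + df["X28"]             # total_no_of_accts
df["X43"] = df["X22"] - df["X23"]             # pri_inactive_accts
df["X44"] = df["X28"] - df["X29"]             # sec_inactive_accts
df["X45"] = df["X43"] + df["X44"]             # total_inactive_accts
df["X46"] = df["X23"] + df["X29"]             # total_active_accts
df["X47"] = df["X25"] + df["X31"]             # total_current_balance
df["X48"] = df["X26"] + df["X32"]             # total_sanctioned_amount
df["X49"] = df["X27"] + df["X33"]             # total_disbursed_amount
df["X50"] = df["X34"] + df["X35"]             # total_instal_amt
df["X51"] = df["X27"] / (df["X34"] + 1)       # pri_loan_proportions
df["X52"] = df["X33"] / (df["X35"] + 1)       # sec_loan_proportions
df["X53"] = df["X42"] / (df["X45"] + 1)       # active_to_inactive_act_ratio

# Drop records in which more than three-quarters of the index values are zero.
zero_ratio = (df == 0).mean(axis=1)
df = df[zero_ratio <= 0.75]
```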

Pre-screening credit risk assessment indexes

The credit risk assessment indexes of the personal auto loans are generally divided into three categories, i.e., personal indexes, economic indexes, and credit indexes. Personal indexes generally reveal the basic information of the borrower, such as age, gender, job, and education, which can be used to predict the change in a borrower’s loan repayment behavior. Economic indexes reflect the economic standing of the borrower. The better the economic standing, the less is the likelihood to default. Credit indexes reflect a borrower’s credit history, including the credit data generated in their life, work, and so on. This information can be used to understand the borrower's credit history of repayment, the borrower's repayment willingness, and can be used to predict future repayment behavior changes.
From the description and value range of the indexes in the data set, it is easy to infer whether an index is a credit risk factor. For index X1 (UniqueID) and index X8 (Current pincode of the customer), a borrower's identifier is equivalent to a person's name; these indexes are not factors affecting credit risk and are deleted. Similarly, the indexes X5 (Branch where the loan was disbursed), X6 (Vehicle dealer where the loan was disbursed), X7 (Vehicle manufacturer (Hero, Honda, TVS)), X12 (State of disbursement), and X13 (Employee of the organization who logged the disbursement) are assigned by the system, have no real impact on the credit risk assessment, and are also deleted. In addition, as the value of index X14 is 1 in all the loan records, index X14 has no predictive role in credit risk assessment and is deleted. This screening eliminates 8 indexes. Thus, 45 indexes related to customer information and 1 dependent variable index remain in the data set, as shown in Table 4.
Table 4
Resulting credit risk assessment indexes
Index | Label | Index | Label
Z1 | Aadhar_flag | Z24 | VoterID_flag
Z2 | DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | Z25 | age
Z3 | Driving_flag | Z26 | asset_cost
Z4 | Employment.Type | Z27 | average_acct_age_month
Z5 | NEW.ACCTS.IN.LAST.SIX.MONTHS | Z28 | credit_history_length_month
Z6 | NO.OF_INQUIRIES | Z29 | credit_risk_grade
Z7 | PAN_flag | Z30 | disbursal_months_passed
Z8 | PERFORM_CNS.SCORE | Z31 | disbursed_amount
Z9 | PRI.ACTIVE.ACCTS | Z32 | ltv
Z10 | PRI.CURRENT.BALANCE | Z33 | loan_to_asset_ratio
Z11 | PRI.DISBURSED.AMOUNT | Z34 | total_no_of_accts
Z12 | PRI.NO.OF.ACCTS | Z35 | pri_inactive_accts
Z13 | PRI.OVERDUE.ACCTS | Z36 | sec_inactive_accts
Z14 | PRI.SANCTIONED.AMOUNT | Z37 | total_inactive_accts
Z15 | PRIMARY.INSTAL.AMT | Z38 | total_active_accts
Z16 | Passport_flag | Z39 | total_current_balance
Z17 | SEC.ACTIVE.ACCTS | Z40 | total_sanctioned_amount
Z18 | SEC.CURRENT.BALANCE | Z41 | total_disbursed_amount
Z19 | SEC.DISBURSED.AMOUNT | Z42 | total_instal_amt
Z20 | SEC.INSTAL.AMT | Z43 | pri_loan_proportion
Z21 | SEC.NO.OF.ACCTS | Z44 | sec_loan_proportion
Z22 | SEC.OVERDUE.ACCTS | Z45 | active_to_inactive_act_ratio
Z23 | SEC.SANCTIONED.AMOUNT | |

Transforming unbalanced data set

After the data preprocessing, we convert the unbalanced data set into a balanced data set. Traditional machine learning algorithms focus on the overall accuracy, and the trained classifiers tend to favor the majority category during training [40–42], so the prediction accuracy of the minority category is very low. We propose a Smote-Tomek Link algorithm to convert the imbalanced data set into a balanced data set, to improve the prediction accuracy of the minority category and the overall classification effect of the data set.
In this section, based on the traditional Smote algorithm [42–44], a Smote-Tomek Link algorithm is proposed to transform the unbalanced data set into a balanced one.

Smote-Tomek Link algorithm

The basic steps of the Smote-Tomek Link algorithm are as follows: (i) Randomly select n minority-class sample points using the Smote algorithm, and find the m minority-class sample points closest to each of them. (ii) Select any point among these m nearest minority-class samples; a new data sample is generated between the selected pair of points. On this basis, an integrated Smote-Tomek Link algorithm is designed by combining Smote with the Tomek Link. The basic idea is as follows.
Each newly generated data point and the non-synthetic sample point closest to it form a Tomek Link pair. A rule is then defined: a neighborhood is framed with the newly generated point as its center and the Tomek Link distance as its radius.
If the number of minority-class or majority-class samples in this neighborhood is less than a minimum threshold, the newly generated point is regarded as a "trash point" and is either removed or subjected to another round of Smote training. If the number of minority-class or majority-class samples in the neighborhood is greater than or equal to the minimum threshold, samples are drawn from the retained set of minority-class samples and put into Smote training. Following this rule, the "trash points" are eliminated and the new data points that meet the criteria are retained. The above steps are repeated, and the generated samples are finally added to the data set to obtain a new balanced data sample set.
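The same combined resampling idea is available off the shelf in the imbalanced-learn library; the sketch below is a stand-in for the integrated variant described above, not the authors' own implementation, and assumes the preprocessed frame df from the earlier sketches:

```python
from imblearn.combine import SMOTETomek

X = df.drop(columns=["loan_default"])
y = df["loan_default"]

resampler = SMOTETomek(random_state=0)            # SMOTE over-sampling followed by Tomek Link cleaning
X_bal, y_bal = resampler.fit_resample(X, y)
print(y.value_counts(), y_bal.value_counts(), sep="\n")
```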

Unbalanced data set transformation based on Smote-Tomek Link algorithm

From section “Data preparation and preprocessing”, 117,156 loan records are obtained, of which 93,315 records are the auto loan data of the non-defaulters, and 23,841 are the auto loan data of the defaulters. The imbalance ratio of the data set is almost four times, which would affect the model effect. Thus, we use the Smote-Tomek Link algorithm proposed in Subsection “Smote-Tomek Link algorithm” to process and transform the imbalanced data set into a balanced data set. To highlight the superiority of this algorithm in processing the data set, several machine learning models are adopted to make predictions and the effects are compared using the relevant evaluation indexes.
(1) Experimental methods
We use the Smote and Smote-Tomek Link algorithms to process data set T, yielding two data sets T1 and T2, respectively. The data sets T, T1, and T2 are each further divided into 70–30 training-test sets. Then, we apply two machine learning methods, i.e., the Logistic Regression (LR) model and the Random Forest (RF) model, as the classifiers for training and prediction. The effect of the models is compared using relevant evaluation indexes such as F1-score, G-means, MCC, and AUC [45–47].
(2) Evaluation indexes of unbalanced learning
For a two-category problem in machine learning, the majority category is usually labeled the negative category, while the minority category with high recognition importance is labeled the positive category. Based on the true category of the sample and the category predicted by the classifier, there are four classification types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). TP and TN are positive and negative samples, respectively, that are correctly predicted by the classifier; FP denotes a negative sample wrongly predicted as positive, and FN denotes a positive sample wrongly predicted as negative. Table 5 shows the confusion matrix of the classification results.
Table 5
Confusion matrix of classification results
Actual category | Predicted positive | Predicted negative
Positive | TP | FN
Negative | FP | TN
From the confusion matrix, the recall rate R and precision rate P are found using [45, 47, 48]:
$$ R = \frac{TP}{{TP + FN}},\;P = \frac{TP}{{TP + FP}}. $$
(i) F1-measure
The F1-measure is the harmonic mean of the recall rate R and precision rate P, which can evaluate the overall classification of unbalanced data sets [45–47]. The larger the value of F1, the better is the classification effect of the classifier:
$$F1{ - }measure = \frac{2}{{\frac{1}{P} + \frac{1}{R}}} = \frac{2PR}{{P + R}}.$$
(ii) G-means
The G-means evaluates the performance of the unbalanced data classification. For an unbalanced data set, the value of the G-means will be high only if the classification accuracy of both the positive category samples and the negative category samples is relatively high. Otherwise, the value of G-means will be low. The G-means is expressed as follows [45–47]:
$$ G{ - }means = \sqrt {\frac{TP}{{TP + FN}} \times \frac{TN}{{TN + FP}}} $$
(iii) MCC
The Matthews Correlation Coefficient (MCC) is an important index to evaluate the performance of unbalanced data classification. In general, the greater the MCC, the better is the classification effect of the model. The MCC is expressed as [45–47]:
$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{\left( TP + FP \right)\left( TP + FN \right)\left( TN + FP \right)\left( TN + FN \right)}} $$
(iv) AUC
The AUC is the area under the ROC (Receiver Operating Characteristic) curve, and is a common index to measure the overall classification performance of the classifier [45–47]. The F1, G-means, and MCC assessment indexes are threshold-based, whereas the AUC does not depend on the selection of a threshold.
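The four indexes above can be computed directly from a classifier's predictions; a minimal sketch with scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef, roc_auc_score

def unbalanced_scores(y_true, y_pred, y_score):
    """F1, G-means, MCC, and AUC for a binary classifier; y_score is the predicted probability of class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = tp / (tp + fn)                 # accuracy on the positive (minority) class
    specificity = tn / (tn + fp)            # accuracy on the negative (majority) class
    return {
        "F1": f1_score(y_true, y_pred),
        "G-means": np.sqrt(recall * specificity),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }
```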
(3) Analysis of experimental results
For the original untreated data set, the data set processed by the Smote algorithm, and the data set processed by the Smote-Tomek Link algorithm, two machine learning methods (the LR model and the RF model) are used as the classifiers for training and prediction. The final classification results are shown in Table 6. Panels A and B show the classification results obtained by the LR and RF models, respectively.
Table 6
Classification results based on LR and RF models
Data set | F1 | G-means | MCC | AUC
Panel A: LR model
Unprocessed data set | 0.012352 | 0.079060 | 0.032097 | 0.502130
Smote algorithm | 0.620625 | 0.618822 | 0.237656 | 0.618827
Smote-Tomek Link algorithm | 0.624523 | 0.620736 | 0.239479 | 0.619983
Panel B: RF model
Unprocessed data set | 0.039540 | 0.143128 | 0.066204 | 0.507563
Smote algorithm | 0.842869 | 0.842489 | 0.684974 | 0.842489
Smote-Tomek Link algorithm | 0.851321 | 0.848532 | 0.670156 | 0.857371
From Table 6, when the imbalanced data set is not processed, the fitting effect of both the LR and RF models is extremely poor. This is because the distribution of the majority category and the minority category in the data set is uneven; as a result, the model tends to predict the minority category as the majority category during training, thus lowering the prediction accuracy of the minority category. Using the Smote algorithm or the Smote-Tomek Link algorithm to process the data greatly improves the performance of the classifier. Comparing the F1, G-means, MCC, and AUC values obtained when the same classifier is trained and tested on the data set processed by the Smote algorithm and on the data set processed by the Smote-Tomek Link algorithm, the classification effect and predictive performance of the Smote-Tomek Link algorithm are better than those of the Smote algorithm. Thus, we use the Smote-Tomek Link algorithm to transform the imbalanced data set into a balanced data set.

Feature selection method of credit risk assessment index

From the balanced data set, 186,630 auto loan records are obtained. From them, 45 features (indexes) are used to reflect the borrower's auto loan information. Due to the large number of feature dimensions of the auto loan borrowers, there may be features that are irrelevant or redundant to credit risk. Therefore, it is necessary to make a feature selection of these 45 features to further screen the indexes and simplify the feature subsets, so as to reduce the dimension of the feature space. In this way, the generalizability of the established credit risk assessment model of personal auto loans can be enhanced and any overfitting can be reduced.

Improved Filter-Wrapper feature selection method

An improved Filter-Wrapper feature selection method is presented for selecting the main features from among the 45 preliminary indexes in Table 4. In the Filter stage, three evaluation criteria, namely, the Relief algorithm [48, 49], the Maximal Information Coefficient [50], and the Quasi-separable method [51], are used to evaluate the importance of the features from three aspects: information relevance, information quantity, and quasi-separable ability. A fusion model of multiple evaluation criteria is constructed to rank the importance of the features; to overcome the subjectivity in determining the weight coefficients of the feature importance, the order relation analysis method [51–53] is used to determine the corresponding weights of the three evaluation criteria. In the Wrapper stage, the classification accuracy is used as the measurement standard, and the SBS method [31] removes the 45 preliminary features one by one in reverse order of their comprehensive ranking; the lower the rank order, the lesser is the importance of that feature. At the same time, the feature subset after each deletion is trained and predicted, so as to obtain the classification accuracy on the data set. The feature subsets are then evaluated on classification accuracy, and the optimal feature subset is found.
(1) Filter stage
It is difficult for an evaluation criterion to comprehensively evaluate the quality of the feature subsets. If the evaluation criteria are combined, they can complement each other and improve the evaluation quality. For the 45 preliminary features listed in Table 4, three evaluation criteria: Relief algorithm, Maximum Information Coefficient method, and Quasi-separable method, are selected.
The dimensionality of the three evaluation criteria is different, which may lead to significant differences in the corresponding values of the features and affect the subsequent fusion process of the evaluation criteria, resulting in large deviations in the results. With this in mind, the dimensions of the three evaluation criteria are harmonized using:
$$ Re_{i} = \frac{re_{i} - \min_{i} re_{i}}{\max_{i} re_{i} - \min_{i} re_{i}}, \quad i = 1,2,\ldots,45 $$
$$ M_{i} = \frac{m_{i} - \min_{i} m_{i}}{\max_{i} m_{i} - \min_{i} m_{i}}, \quad i = 1,2,\ldots,45 $$
$$ C_{i} = \frac{c_{i} - \min_{i} c_{i}}{\max_{i} c_{i} - \min_{i} c_{i}}, \quad i = 1,2,\ldots,45 $$
where rei, mi and ci are the values obtained by the Relief algorithm, Maximum Information Coefficient method, and Quasi-separable method, respectively. The max and min represent the maximum and minimum values respectively. Rei, Mi, and Ci are the values after range standardization.
Though the three evaluation criteria are measured differently, they all conform to the same rule: the greater the evaluation value of feature i, the stronger is the classification ability of that feature. Thus, the values obtained by the three evaluation criteria are fused to form a fusion model of multiple evaluation criteria. The fusion evaluation value of feature i, denoted as totali, expresses the importance degree of feature i and is written as
$$ total_{i} = w_{1} Re_{i} + w_{2} C_{i} + w_{3} M_{i} $$
(1)
where \(w_{1}\), \(w_{2}\) and \(w_{3}\) are the weights corresponding to the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method, respectively.
As the influence of each evaluation criterion on the result is different, their weights are different, and the choice of weights affects the fitting effect of the subsequent model, so determining the weights is key. For this, we employ the order relation analysis method [51–53] to obtain the weights of the evaluation criteria, as shown in Fig. 1.
The steps to determine the weights are as follows.
Step 1: Determine the order relationship among the evaluation criteria. From the effect of the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method, the rank relation among the evaluation criteria is as follows:
$$ U_{1} > U_{2} > U_{3} $$
where \(U_{1}\) is the Relief algorithm, \(U_{2}\) is the Quasi-separable method, and \(U_{3}\) is the Maximum Information Coefficient method, respectively.
Step 2: Obtain the relative importance of the three evaluation criteria using comparative judgment. Suppose the ratio of the importance of evaluation criteria \(U_{k - 1}\) to \(U_{k}\) is \(\gamma_{k}\) [51, 52], that is,
$$ \gamma_{k} = \frac{{U_{k - 1} }}{{U_{k} }},\;k = 2,3,...,n $$
(2)
where the value of \(\gamma_{k}\) is as defined in Table 7.
Table 7
Value of γk and description
γk | Description
1.0 | Uk-1 is just as important as Uk
1.2 | Uk-1 is slightly more important than Uk
1.4 | Uk-1 is obviously more important than Uk
1.6 | Uk-1 is highly more important than Uk
1.8 | Uk-1 is extremely more important than Uk
Using Table 7 and Eq. (2), the importance of the order relation among the three evaluation criteria can be assessed. The Relief algorithm is slightly more important than the Quasi-separable method, which is slightly more important than the Maximum Information Coefficient method. Thus, we have:
$$ \gamma_{2} = \frac{U_{1}}{U_{2}} = 1.2, \quad \gamma_{3} = \frac{U_{2}}{U_{3}} = 1.2 $$
(3)
Step 3: Compute the importance weight \(w_{m}\). The ranking of the weights of the three evaluation criteria is consistent with their corresponding positions in the order relation among them. The importance weights are found [51, 52] as follows:
$$ w_{m} = \left( {1 + \sum\limits_{k = 2}^{m} {\prod\limits_{i = k}^{m} {\gamma_{k} } } } \right)^{ - 1} $$
(4)
$$ w_{k - 1} = \gamma_{k} w_{k} ,\quad k = m, m - 1, \ldots, 2 $$
(5)
Combining Eqs. (4) and (5) yields
$$ w_{3} = \left( {1 + \gamma_{2} \times \gamma_{3} + \gamma_{3} } \right)^{ - 1} = \left( {1 + 1.2 \times 1.2 + 1.2} \right)^{ - 1} = 0.2747 $$
(6)
$$ w_{2} = \gamma_{3} w_{3} = 1.2 \times 0.27473 = 0.3297 $$
(7)
$$ w_{1} = \gamma_{2} w_{2} = 1.2 \times 0.32968 = 0.3956 $$
(8)
Thus, the importance weights of the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method are 0.3956, 0.3297 and 0.2747, respectively, satisfying \(w_{1} + w_{2} + w_{3} = 1\).
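The computation in Eqs. (4)–(8) can be reproduced for any number of criteria; a small sketch (the function name is ours):

```python
def order_relation_weights(gammas):
    """Order relation analysis: gammas = [gamma_2, ..., gamma_m] as defined in Eq. (2)."""
    m = len(gammas) + 1
    denom = 1.0
    for k in range(len(gammas)):             # Eq. (4): 1 + sum over k of prod_{i=k}^{m} gamma_i
        prod = 1.0
        for g in gammas[k:]:
            prod *= g
        denom += prod
    weights = [0.0] * m
    weights[-1] = 1.0 / denom                # w_m
    for k in range(m - 2, -1, -1):           # Eq. (5): w_{k-1} = gamma_k * w_k
        weights[k] = gammas[k] * weights[k + 1]
    return weights

print(order_relation_weights([1.2, 1.2]))    # -> approximately [0.3956, 0.3297, 0.2747]
```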
Step 4: Compute the fusion evaluation value \(total_{i}\). Substituting \(w_{1}\), \(w_{2}\) and \(w_{3}\) into Eq. (1), the fusion model of multiple evaluation criteria is expressed as
$$ total_{i} = 0.3956Re_{i} + 0.3297C_{i} + 0.2747M_{i} $$
(9)
Step 5: Rank the features. Using the fusion evaluation value \(total_{i}\), the features are now ranked.
Figure 2 shows the flowchart of the comprehensive ranking of the features during the Filter stage.
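Putting the Filter stage together, a sketch of the fusion and ranking follows; the raw criterion scores relief_scores, mic_scores, and sep_scores are assumed to be pandas Series indexed by feature name, produced by whatever implementations of the three criteria are used:

```python
import pandas as pd

def fuse_and_rank(relief_scores, mic_scores, sep_scores, w=(0.3956, 0.3297, 0.2747)):
    """Range-standardize each criterion and combine them with the weights of Eq. (9)."""
    norm = lambda s: (s - s.min()) / (s.max() - s.min())
    total = w[0] * norm(relief_scores) + w[1] * norm(sep_scores) + w[2] * norm(mic_scores)
    return total.sort_values(ascending=False)   # higher fusion value = more important feature

# ranking = fuse_and_rank(relief_scores, mic_scores, sep_scores)  # yields a comprehensive ranking as in Table 8
```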
(2) Wrapper stage
The feature selection now enters the Wrapper stage, where we further screen the sorted features, simplify the feature subset, and reduce the dimension, so as to improve the classification accuracy. Here, the RF [54] is selected as the classifier in the Wrapper stage, and the SBS method is used to eliminate the features in accordance with their rank order. Starting from the complete feature set, the least important feature is removed at each iteration. At the same time, the classifier is used to train and predict on the current feature subset, so as to obtain the classification accuracy under this feature subset and compare it with the classification accuracy obtained in the previous iteration. The feature subset with the highest classification accuracy (that is, the optimal feature subset) is selected as the result of the evaluation index selection for the credit risk of the personal auto loans.
The steps of the Wrapper algorithm are as follows.
Input: The original feature set F = {f1, f2, …, fk}, where k is the number of original features; k = 45.
Output: Select the optimal feature subset with the highest classification accuracy.
Figure 3 shows the algorithm flowchart.
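A hedged sketch of this Wrapper stage, with RF as the classifier and ten-fold cross-validated accuracy as the evaluation measure (the helper name and RF settings are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sbs_wrapper(X, y, ranked_features):
    """ranked_features: column names ordered from most to least important (the Filter-stage ranking)."""
    current = list(ranked_features)
    best_subset, best_acc = list(current), 0.0
    while len(current) >= 1:
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        acc = cross_val_score(clf, X[current], y, cv=10, scoring="accuracy").mean()
        if acc > best_acc:
            best_acc, best_subset = acc, list(current)
        current.pop()                         # drop the currently least important feature (SBS)
    return best_subset, best_acc
```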

Analysis of selection of credit risk assessment indexes

Comprehensive ranking of features in Filter stage

Using the method described in Sect. “Improved Filter-Wrapper feature selection method”, the three evaluation criteria are used to evaluate the importance of the features in the Filter stage, and the evaluation value of each feature is obtained. Then, the evaluation values are standardized to ensure dimensional consistency. Next, the fusion model of multiple evaluation criteria is used for information fusion to obtain the fusion evaluation value of each feature, and the importance of each feature is then ranked according to the fusion evaluation value. The results obtained using Python are shown in Table 8.
Table 8
Comprehensive ranking of 45 preliminary features
Feature | Relief | MIC | Quasi-separable method | Fusion evaluation value | Rank
Z1 | 0.656250 | 0.057894 | 0.036012 | 0.287389 | 7
Z2 | 0.157813 | 0.002484 | 0.166743 | 0.118088 | 26
Z3 | 0.000000 | 0.014015 | 0.084077 | 0.031570 | 45
Z4 | 1.000000 | 0.116463 | 0.741940 | 0.672210 | 1
Z5 | 0.158703 | 0.031215 | 0.157836 | 0.123396 | 24
Z6 | 0.159447 | 0.000837 | 0.000001 | 0.063307 | 44
Z7 | 0.218750 | 0.034355 | 0.206417 | 0.164030 | 16
Z8 | 0.505691 | 0.378509 | 0.015686 | 0.309199 | 5
Z9 | 0.156034 | 0.027407 | 0.133023 | 0.113114 | 27
Z10 | 0.156250 | 0.096885 | 0.253947 | 0.172153 | 15
Z11 | 0.156250 | 0.179913 | 0.004739 | 0.112797 | 29
Z12 | 0.156013 | 0.020293 | 0.065588 | 0.088918 | 40
Z13 | 0.158750 | 0.000000 | 0.276819 | 0.154069 | 18
Z14 | 0.156250 | 0.203496 | 0.048480 | 0.133697 | 20
Z15 | 0.156256 | 0.085870 | 0.502350 | 0.251028 | 9
Z16 | 0.125000 | 0.001669 | 0.131155 | 0.093150 | 34
Z17 | 0.159247 | 0.000472 | 0.034478 | 0.074495 | 41
Z18 | 0.156297 | 0.039070 | 0.000001 | 0.072564 | 42
Z19 | 0.156253 | 0.041563 | 0.158975 | 0.125645 | 23
Z20 | 0.156250 | 0.034128 | 0.000000 | 0.071187 | 43
Z21 | 0.161916 | 0.002349 | 0.079134 | 0.090790 | 37
Z22 | 0.168713 | 0.001104 | 0.071656 | 0.090671 | 38
Z23 | 0.156263 | 0.043679 | 0.165562 | 0.128402 | 21
Z24 | 0.250000 | 0.021885 | 0.136613 | 0.149953 | 19
Z25 | 0.393644 | 0.059518 | 0.071771 | 0.195738 | 12
Z26 | 0.156303 | 0.128334 | 0.000007 | 0.097089 | 33
Z27 | 0.166422 | 0.007934 | 0.099641 | 0.100868 | 32
Z28 | 0.160613 | 0.033051 | 0.059533 | 0.092245 | 36
Z29 | 0.717303 | 0.157933 | 0.413192 | 0.463379 | 3
Z30 | 0.356469 | 0.064443 | 0.014473 | 0.163493 | 17
Z31 | 0.156378 | 0.547479 | 1.000000 | 0.541956 | 2
Z32 | 0.191141 | 0.425568 | 0.230247 | 0.268431 | 8
Z33 | 0.189450 | 0.106937 | 0.229650 | 0.180038 | 13
Z34 | 0.156056 | 0.019809 | 0.067408 | 0.089402 | 39
Z35 | 0.155972 | 0.021773 | 0.479931 | 0.225917 | 11
Z36 | 0.159753 | 0.003789 | 0.118671 | 0.103365 | 30
Z37 | 0.155953 | 0.021338 | 0.497559 | 0.231602 | 10
Z38 | 0.156181 | 0.027768 | 0.132485 | 0.113094 | 28
Z39 | 0.156253 | 0.096018 | 0.264068 | 0.175253 | 14
Z40 | 0.156250 | 0.202428 | 0.522285 | 0.289617 | 6
Z41 | 0.156250 | 0.179345 | 0.051005 | 0.127895 | 22
Z42 | 0.156256 | 0.085308 | 0.052022 | 0.102401 | 31
Z43 | 0.156250 | 0.112157 | 0.000000 | 0.092622 | 35
Z44 | 0.156350 | 0.035648 | 0.151286 | 0.121523 | 25
Z45 | 0.160081 | 1.000000 | 0.038522 | 0.350729 | 4

Feature selection in Wrapper stage

With the 45 preliminary features in Table 8 ranked, we now use the Wrapper algorithm to select the optimal feature subset. To ensure a reliable estimate of classification accuracy, we use ten-fold cross validation [55] and take the average classification accuracy over the ten predictions. Figure 4 shows the change in classification accuracy as the data dimension decreases. From Fig. 4, when the data dimension is 34, the classification accuracy of the RF classifier reaches its highest level; thereafter, the classification accuracy decreases. Thus, the first 34 features in the comprehensive ranking are chosen as the optimal feature subset, that is, they form the credit risk assessment indexes of the personal auto loans.

Credit risk assessment of personal auto loans using PSO-XGBoost model

Next, a credit risk assessment model of the personal auto loans based on the PSO-XGBoost model is formed. The XGBoost model [56] has good characteristics such as high prediction accuracy and fast runtime, while the Particle Swarm Optimization (PSO) algorithm [57–59] is used to optimize the parameters in the XGBoost model. Then, the PSO-XGBoost model, XGBoost model, RF model, and LR model [60] are used on the training data set. Prediction is made on the test data set to obtain their respective prediction outcomes, and the performance of the four models is evaluated and compared against performance evaluation indexes such as accuracy and precision, ROC curve, and AUC value.
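A hedged sketch of the PSO-tuned XGBoost idea is given below; the searched hyperparameters, their ranges, and the PSO settings are illustrative choices, since the paper's exact configuration is not reproduced here, and a recent xgboost release that accepts eval_metric in the constructor is assumed:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
LB = np.array([0.01, 3, 50])       # lower bounds: learning_rate, max_depth, n_estimators
UB = np.array([0.30, 10, 500])     # upper bounds

def fitness(p, X, y):
    """Cross-validated accuracy of XGBoost with the candidate hyperparameters p."""
    model = xgb.XGBClassifier(learning_rate=p[0], max_depth=int(p[1]),
                              n_estimators=int(p[2]), eval_metric="logloss")
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

def pso_xgboost(X, y, n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5):
    pos = rng.uniform(LB, UB, size=(n_particles, 3))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)   # PSO velocity update
        pos = np.clip(pos + vel, LB, UB)
        fit = np.array([fitness(p, X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest, pbest_fit.max()
```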

XGBoost model

XGBoost (eXtreme Gradient Boosting) [56] is a C++ implementation based on the Gradient Boosting Machine algorithm, which is a boosting algorithm. The XGBoost model constantly adds trees and splits the features to make the trees grow. As the data set is divided into a training set and a test set in a 7:3 ratio, 130,641 records of auto loan data are selected to train the XGBoost model. The training set D, containing 130,641 samples with 34 features, is expressed as \(D = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}\left( {\left| D \right| = 130{,}641,x_{i} \in R^{34} ,y_{i} \in R} \right)\), where \(x_{i}\) represents the i-th sample, and \(y_{i} = \left\{ {\left. {0,1} \right\}} \right.\) represents the category of the default group, with 0 (1) being the non-default (default) group, respectively.
Now, suppose the total number of trees is K, then the predicted value of K to the sample is [56]
$$ \hat{y}_{i} = \phi \left( {x_{i} } \right) = \sum\limits_{k = 1}^{K} {f_{k} \left( {x_{i} } \right),f_{k} \in F} $$
(10)
$$ F = \left\{ {f\left( x \right) = w_{q\left( x \right)} } \right\}\left( {q:R^{34} \to T,w \in R^{T} } \right) $$
(11)
where \(\hat{y}_{i}\) is the predicted value of the model representing the predicted category label of sample \(i\), \(F\) is the set of classification and regression trees (CART), \(f\left( x \right)\) is a regression tree, and \(w_{q\left( x \right)}\) represents the set of all node scores of this tree, namely, the prediction of samples; q represents the classification of samples on the leaf node, that is, input a sample, map the sample to the predicted category output by the leaf node according to the model, and judge whether it is a non-defaulting or defaulting population; \(w\) is the leaf node score, and T is the number of leaf nodes in the tree.
From Eq. (10), we note that the predicted values of the XGBoost model are the sum of the predicted values of the K trees. To learn these K trees, we define the objective function, which contains a loss function and a regularization function [56], and this can be expressed as
$$ Obj = L\left( v \right) + \Omega \left( v \right) = \sum\limits_{i} {l\left( {\hat{y}_{i} ,y_{i} } \right)} + \sum\limits_{k} {\Omega \left( {f_{k} } \right)} $$
(12)
where \(L\left( v \right)\) is the loss function, which can evaluate the fitting degree of the model; \(\Omega \left( v \right)\) is the regularization function used to simplify the model and control its complexity; \(\hat{y}_{i}\) is the predicted value of the model representing the predicted category label of sample \(i\); \(y_{i}\) is the true category label of sample \(i\); \(l\left( {\hat{y}_{i} ,y_{i} } \right)\) is used to measure the deviation degree between the actual and the predicted values obtained by the credit risk assessment model, which is a non-negative real valued function; \(k\) is the number of trees, and \(f_{k}\) is the kth tree.
The term \(\Omega \left( {f_{k} } \right)\) in Eq. (12) is the regularization term [56], which is given by
$$ \Omega \left( {f_{k} } \right) = \gamma T + \frac{1}{2}\lambda \left\| w \right\|^{2} $$
(13)
where T is the number of leaf nodes in each tree, \(w\) is the vector of leaf node scores of the tree, \(\gamma\) is the penalty on the number of leaves, and \(\lambda\) is the regularization coefficient on the leaf scores. \(\gamma\) and \(\lambda\) jointly determine the model’s complexity.
According to the XGBoost model, each newly generated tree fits the residual left after the previous round. Therefore, after t trees have been generated, Eq. (10) can be written as
$$ \hat{y}_{i}^{\left( t \right)} = \hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right) $$
(14)
Substituting Eq. (14) into Eq. (12), the objective function can be rewritten as [56]
$$ Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right)} \right) + \Omega \left( {f_{t} } \right)} $$
(15)
The goal is to find the tree \(f_{t}\) that minimizes Eq. (15). In the Gradient Boosted Decision Tree (GBDT), only the first-order gradient is used. In contrast, the XGBoost model expands the objective function with a second-order Taylor series. Thus, Eq. (15) is approximated by
$$ Obj^{\left( t \right)} \approx \sum\limits_{i = 1}^{n} {\left[ {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right) + g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} $$
(16)
where \(g_{i} = \partial_{{\hat{y}_{i}^{{\left( {t - 1} \right)}} }} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) and \(h_{i} = \partial_{{\hat{y}_{i}^{{\left( {t - 1} \right)}} }}^{2} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) are the first- and second-order derivatives of the loss function with respect to \(\hat{y}_{i}^{{\left( {t - 1} \right)}}\). As \(l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) is a constant with respect to \(f_{t}\), it can be dropped, and Eq. (16) can be rewritten as
$$ Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} $$
(17)
Clearly, \(Obj^{\left( t \right)}\) depends only on the first- and second-order derivatives of the loss function at each data point. Since every sample falls into exactly one leaf, the sum over samples can be regrouped as a sum over leaf nodes, which gives the following result [56].
$$ \begin{gathered} Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} \hfill \\ = \sum\limits_{i = 1}^{n} {\left[ {g_{i} w_{{q\left( {x_{i} } \right)}} + \frac{1}{2}h_{i} w_{{q\left( {x_{i} } \right)}}^{2} } \right]} + \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} {w_{j}^{2} } \hfill \\ = \sum\limits_{j = 1}^{T} {\left[ {\left( {\sum\limits_{{i \in I_{j} }} {g_{i} } } \right)w_{j} + \frac{1}{2}\left( {\sum\limits_{{i \in I_{j} }} {h_{i} } } \right)w_{j}^{2} } \right]} + \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} {w_{j}^{2} } \hfill \\ = \sum\limits_{j = 1}^{T} {\left[ {\left( {\sum\limits_{{i \in I_{j} }} {g_{i} } } \right)w_{j} + \frac{1}{2}\left( {\sum\limits_{{i \in I_{j} }} {h_{i} } + \lambda } \right)w_{j}^{2} } \right]} + \gamma T . \hfill \\ \end{gathered} $$
(18)
Therefore, letting \(G_{j} = \sum_{i \in I_{j} } g_{i}\) and \(H_{j} = \sum_{i \in I_{j} } h_{i}\), the problem is transformed into finding the extreme value of a quadratic function in each \(w_{j}\). That is, we must find the value of \(w_{j}\) that minimizes Eq. (18). Solving this quadratic minimization yields the optimal \(w_{j}^{*}\) and the minimum value of the objective function [56] as follows.
$$ w_{j}^{*} = - \frac{{G_{j} }}{{H_{j} + \lambda }},\;Obj = - \frac{1}{2}\sum\limits_{j = 1}^{T} {\frac{{G_{j}^{2} }}{{H_{j} + \lambda }}} + \gamma T $$
(19)
Compared to the GBDT model, the XGBoost model adds the regularization term to the objective function of the credit risk assessment model to prevent the model from overfitting. At the same time, Taylor expansion is used to optimize the objective function to find the best segmentation point in the CART regression tree. Therefore, the constructed credit risk assessment model has higher accuracy and better fitting performance than the other models.
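For illustration, the following minimal sketch (not from the paper; it assumes NumPy and the binary logistic loss) computes the gradients \(g_i\) and Hessians \(h_i\) used in Eq. (16) and the optimal leaf weight and objective contribution of Eq. (19) for the samples falling into one leaf.

```python
import numpy as np

def logistic_grad_hess(y_true, y_pred_raw):
    """First- and second-order derivatives of the binary logistic loss
    with respect to the raw (pre-sigmoid) prediction, as in Eq. (16)."""
    p = 1.0 / (1.0 + np.exp(-y_pred_raw))  # predicted default probability
    g = p - y_true                          # g_i
    h = p * (1.0 - p)                       # h_i
    return g, h

def optimal_leaf(g_leaf, h_leaf, lam=1.0, gamma=0.0):
    """Optimal score w_j* and its objective contribution, per Eq. (19)."""
    G, H = g_leaf.sum(), h_leaf.sum()
    w_star = -G / (H + lam)
    obj = -0.5 * G ** 2 / (H + lam) + gamma
    return w_star, obj

# Toy check: five samples currently assigned to one leaf
y = np.array([1, 0, 1, 1, 0])
raw = np.zeros(5)                  # raw predictions from the previous round
g, h = logistic_grad_hess(y, raw)
print(optimal_leaf(g, h, lam=1.0))
```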

PSO-XGBoost model

The parameters of the XGBoost model are often tuned manually, which lengthens the search time and raises the computational cost. When the PSO algorithm is used to optimize the parameters of the XGBoost model instead, each candidate parameter vector is encoded as a particle in the search space. The PSO algorithm then searches for the optimal XGBoost parameters within a fixed number of iterations, so as to find the best configuration of the XGBoost model. Thus, we integrate the PSO algorithm into the parameter optimization of the XGBoost model to form the PSO-XGBoost model, which converges quickly and achieves higher precision at lower cost. The steps of the PSO-XGBoost model are shown in Fig. 5.
We must first determine the parameters to be optimized for the XGBoost model. As the accuracy of the XGBoost model is important, three parameters are selected for optimization, i.e., the learning rate, maximum depth of the tree, and the sample weight of the minimum leaf node. Thus, the dimension of the particle swarm space in the PSO algorithm is 3. Next, the maximum number of iterations, learning factor, inertia weight, and the number of particles N in the PSO must be determined. Finally, the PSO-XGBoost model is constructed, and the predicted error rate is taken as the fitness of the PSO algorithm, that is, the calculated error rate function is taken as the fitness function.
The next steps are to initialize the entire particle swarm in the three-dimensional space (including each particle's position and velocity) and to compute the error rate of each particle according to the error rate function. The local and global optimal values of the entire particle swarm are obtained by comparison. We then check whether the termination condition is met (i.e., whether the maximum number of iterations has been reached). If it is not met, the velocity and position of each particle are updated, the error rate of each updated particle is computed with the error rate function, and each particle's error rate is compared with the current local and global optimal values. If the error rate is less than an optimal value, that optimal value is replaced with the current error rate; otherwise, the optimal value is kept. The iteration continues in this way until the termination condition is satisfied, at which point the minimum error rate and the corresponding parameter values are output. From these optimal values, the best parameter values of the model are known, and the PSO-XGBoost model can then be used to assess the credit risk of the personal auto loans.
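A minimal sketch of this search loop is given below. It is not the authors' implementation; it assumes the Python xgboost and scikit-learn packages, a pre-split X_train/y_train/X_test/y_test, and illustrative PSO settings (10 particles, 20 iterations, inertia weight 0.7, learning factors 1.5).

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Search space: learning_rate, max_depth, min_child_weight (illustrative bounds)
LOWER = np.array([0.01, 3.0, 1.0])
UPPER = np.array([0.30, 10.0, 10.0])

def error_rate(position, X_train, y_train, X_test, y_test):
    """Fitness function: test-set error rate of XGBoost with the encoded parameters."""
    lr, depth, mcw = position
    model = xgb.XGBClassifier(learning_rate=float(lr),
                              max_depth=int(round(depth)),
                              min_child_weight=float(mcw),
                              n_estimators=100)
    model.fit(X_train, y_train)
    return 1.0 - accuracy_score(y_test, model.predict(X_test))

def pso_xgboost(X_train, y_train, X_test, y_test,
                n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5):
    dim, rng = 3, np.random.default_rng(0)
    pos = rng.uniform(LOWER, UPPER, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest_pos = pos.copy()
    pbest_err = np.array([error_rate(p, X_train, y_train, X_test, y_test) for p in pos])
    gbest_pos = pbest_pos[pbest_err.argmin()].copy()
    gbest_err = pbest_err.min()

    for _ in range(n_iter):                     # termination: maximum number of iterations
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, LOWER, UPPER)  # keep particles inside the search space
        for i in range(n_particles):
            err = error_rate(pos[i], X_train, y_train, X_test, y_test)
            if err < pbest_err[i]:              # update local optimum
                pbest_err[i], pbest_pos[i] = err, pos[i]
            if err < gbest_err:                 # update global optimum
                gbest_err, gbest_pos = err, pos[i].copy()
    return gbest_pos, gbest_err
```

The returned gbest_pos holds the tuned learning_rate, max_depth, and min_child_weight, which can then be used to train the final PSO-XGBoost classifier.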

Analysis of credit risk assessment of personal auto loans

(1) Data set partitioning
From Sect. “Data preparation and preprocessing”, 186,630 auto loan records are available for the empirical analysis; they are divided into a training set and a test set in a 7:3 ratio, as shown in Table 9.
Table 9
Information on data set

Data set   Number of features   Number of samples   Positive/negative ratio   Missing value
Training   34                   130,641             1.00                      NA
Test       34                   55,989              1.00                      NA
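As a minimal sketch of this 7:3 partition (assuming pandas and scikit-learn, with a hypothetical file name and a hypothetical 0/1 label column named "default" in the balanced data set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and label-column names for the balanced data set
df = pd.read_csv("auto_loan_balanced.csv")
X = df.drop(columns=["default"])
y = df["default"]

# 7:3 stratified split, matching the Training/Test rows of Table 9
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(len(X_train), len(X_test))   # roughly 130,641 and 55,989 records
```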
(2) Parameter optimization
To improve the execution and classification performance of the XGBoost model, a parameter adjustment of the XGBoost model is required. For this, the PSO algorithm is used to optimize the parameters of the XGBoost model.
When forming the XGBoost model, three parameters need to be adjusted, i.e., the learning rate (learning_rate), the maximum depth of the tree (max_depth), and the sample weight of the minimum leaf node (min_child_weight), to improve the accuracy of the XGBoost model. The three parameters are described as follows:
(i)
Learning rate: the shrinkage step size applied during the updating process to prevent overfitting. After each boosting round, the contribution of the newly added tree is scaled down by the learning rate, which makes the update more conservative. The learning rate and the maximum number of iterations jointly determine the fitting effect of the algorithm, and shrinking the weights at each step also improves the robustness of the model.
 
(ii)
Maximum depth of the tree: if no specific value is entered for the maximum depth of the decision trees in the XGBoost model, a default value is assumed and the growth of the subtrees is not otherwise constrained when they are created. However, when the model sample has a large amount of data and many features, the depth needs to be limited so as to avoid overfitting.
 
(iii)
Sample weight of the minimum leaf node: this is similar to the parameter min_child_leaf in the gradient boosting tree algorithm. The parameter min_child_leaf in the gradient boosting tree algorithm counts the minimum number of samples in a leaf, while min_child_weight in the XGBoost model is the minimum sum of sample weights in a leaf; both are used to avoid overfitting. When the value of min_child_weight is large, the model avoids learning overly local patterns, so this parameter can be adjusted to prevent the model from overfitting.
 
The three parameters, learning rate (learning_rate), maximum depth of the tree (max_depth), and sample weight of the minimum leaf node (min_child_weight), are adjusted by the PSO algorithm on the 130,641 records of the training set, so as to optimize the model and improve the accuracy of its predictions. In the iterative optimization of the three parameters, the error rate of the XGBoost model is used as the fitness function of the PSO algorithm. Figure 6 shows how the error rate of the model varies with the number of iterations.
Figure 6 shows that the PSO algorithm continues to optimize the parameters as the number of iterations increases, with a decreasing error rate of the model. When a stationary value is reached, the optimal value of the parameter is found and the PSO-XGBoost model has a minimum error rate.

Performance evaluation of PSO-XGBoost model

To evaluate model performance, the PSO-XGBoost model is compared with the XGBoost, RF, and LR models [60]. As the problem studied is a binary (two-category) classification problem, we use evaluation indexes such as accuracy, precision, complexity, the ROC curve, and the AUC value to evaluate the models.
(1) Confusion matrix
Expanding on Table 5, we provide a confusion matrix to visualize the model’s outcome (see Table 10).
Table 10
Confusion matrix

Actual category                Predicted: Positive (Default: “1”)   Predicted: Negative (Non-default: “0”)
Positive (Default: “1”)        TP                                   FN
Negative (Non-default: “0”)    FP                                   TN
From the confusion matrix in Table 10, there are four possibilities for the results predicted by the model. The first is the true positive (TP): the borrower has defaulted, and the model also predicts the borrower to belong to the high-risk group that is likely to breach the contract, so the agency should be highly alert to such borrowers. The second is the false negative (FN): the customer has in fact defaulted, but the model wrongly predicts the customer to belong to the low-risk group; approving such customers causes large financial losses to the auto financing firms. The third is the false positive (FP): the borrower has no default record, yet the model predicts the customer to be a high-risk borrower likely to default on the loan; such borrowers are filtered out by the institution and potential revenue is lost. A similar argument applies to the fourth category, the true negative (TN).
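As a small illustration (assuming scikit-learn and a fitted classifier's predictions y_pred for the test labels y_test), the four cells of Table 10 can be extracted as follows.

```python
from sklearn.metrics import confusion_matrix

# labels=[1, 0] orders rows/columns as in Table 10: default first, non-default second
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])
tp, fn = cm[0]   # actual default:     predicted default / predicted non-default
fp, tn = cm[1]   # actual non-default: predicted default / predicted non-default
print(tp, fn, fp, tn)
```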
In this paper, the data set is divided into a training and a test set, and the PSO-XGBoost, XGBoost, RF, and LR models are used for training and prediction, as shown by the confusion matrix of Table 11.
Table 11
Model comparison by confusion matrix

Actual \ Predicted    PSO-XGBoost         XGBoost             RF                  LR
                      1        0          1        0          1        0          1        0
1 (Defaulted)         21,392   6,704      20,745   7,351      20,977   7,119      19,369   8,727
0 (No-default)        2,753    25,140     2,794    25,099     3,702    24,191     7,699    20,194
(2) Evaluation indexes of model performance
(i) Accuracy and error
Accuracy is the proportion of the number of correctly predicted samples in the total number of samples [24, 34, 61], expressed as:
$$ Accuracy = \frac{TP + TN}{{TP + FN + FP + TN}} $$
(20)
Error is the proportion of the number of incorrectly predicted samples in the total number of samples [24, 34, 61], expressed as:
$$ Error = \frac{FN + FP}{{TP + FN + FP + TN}} $$
(21)
The higher the accuracy and the smaller the error, the better the classifier model performs, and vice versa.
(ii) Precision and Recall
Precision refers to the proportion of true positive samples in the total positive samples judged by the model [24, 34, 61], that is,
$$ \, P = \frac{TP}{{TP + FP}} $$
(22)
Recall refers to the proportion of actual positive samples that are correctly judged as positive by the model [24, 34, 61], that is,
$$ R = \frac{TP}{{TP + FN}} $$
(23)
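As a quick check (plain Python, using the PSO-XGBoost cells of Table 11), plugging the confusion-matrix counts into Eqs. (20)–(23) reproduces the corresponding entries of Table 12 below.

```python
# PSO-XGBoost confusion-matrix counts from Table 11
TP, FN, FP, TN = 21392, 6704, 2753, 25140
total = TP + FN + FP + TN

accuracy  = (TP + TN) / total    # Eq. (20): 0.8311
error     = (FN + FP) / total    # Eq. (21)
precision = TP / (TP + FP)       # Eq. (22): 0.8860
recall    = TP / (TP + FN)       # Eq. (23): 0.7614
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```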
Using Eqs. (20)-(23), the evaluation indexes of each model are obtained as shown in Table 12.
Table 12
Comparison of evaluation indexes of models

Evaluation index    PSO-XGBoost   XGBoost   RF        LR
Accuracy            0.8311        0.7888    0.8067    0.7066
Precision           0.8860        0.7613    0.8500    0.7156
Recall              0.7614        0.7384    0.7466    0.6894
Time complexity     9 s           5 s       6 s       3 s
Space complexity    77 M          74.3 M    66 M      36 M
It can be seen from Table 12 that the classification accuracy of the XGBoost model is 78.88%, while that of the PSO-XGBoost model is 83.11%, an improvement of 4.23 percentage points. At the same time, the Precision and Recall of the PSO-XGBoost model are better than those of the XGBoost model, indicating that the evaluating performance of the PSO-XGBoost model is superior. The Logistic Regression model has the worst performance among the four models, as all of its evaluation indexes are the lowest. The classification accuracy of the RF model is 80.67%, which the PSO-XGBoost model exceeds by 2.44 percentage points; in terms of Precision and Recall, the PSO-XGBoost model is also better than the RF model. In conclusion, among the four models, the PSO-XGBoost model offers the best performance for the credit risk evaluation of personal auto loans.
(3) Complexity.
The complexity of all the algorithms being compared (PSO-XGBoost, XGBoost, RF, and LR) is measured along two dimensions, time and space. The time dimension refers to the running time taken to execute the algorithm, i.e., its time complexity; the space dimension refers to the amount of memory required to run the algorithm, i.e., its space complexity. The measured time and space complexity of each algorithm are shown in Table 12. From these results, the memory required by the proposed PSO-XGBoost is only 77 M, which is slightly higher than that of the other methods but still modest in absolute terms. Similarly, the running time of the proposed PSO-XGBoost is only 9 s, which shows that the time complexity of the proposed algorithm is not high.
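One simple way to obtain such figures (a sketch only, and not necessarily the measurement protocol used for Table 12) is to time the training-and-prediction call and track peak Python memory allocations:

```python
import time
import tracemalloc

def measure(run):
    """Return (result, elapsed seconds, peak MiB) for a callable such as
    lambda: model.fit(X_train, y_train).predict(X_test)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024 ** 2
```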
(4) ROC curve and AUC value.
The ROC curve is obtained by ranking the samples according to the predicted scores of the learner and then moving the classification threshold through this ranking, treating each sample in turn as the cut-off for the positive class. At each threshold, the true positive rate (TPR, sensitivity) and the false positive rate (FPR) are computed [24, 34, 61] as
$$ TPR = \frac{TP}{{TP + FN}},\;FPR = \frac{FP}{{FP + TN}} $$
When plotting the ROC curve, the FPR is taken as the horizontal axis and the TPR as the vertical axis. The AUC is the area under the ROC curve. When the ROC curve of one learner completely encloses the ROC curve of another, it can safely be concluded that the first learner performs better than the second. When the two curves intersect, however, a more reasonable judgment is to compare their AUC values: the higher the AUC value, the better the performance of the learner.
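A minimal plotting sketch (assuming scikit-learn and matplotlib, with y_score holding the predicted default probabilities for the test set) is:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# e.g. y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)   # FPR on the x-axis, TPR on the y-axis
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"PSO-XGBoost (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```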
Figure 7 shows the ROC curves and AUC values of the PSO-XGBoost, XGBoost, RF, and LR models. According to the comparison rule for ROC curves [45–47], the closer the ROC curve is to the upper left corner, the better the evaluation performance. From Fig. 7, the ROC curve of the PSO-XGBoost model is the closest to the top left corner and covers the ROC curves of the other three models. Furthermore, the AUC value of the PSO-XGBoost model is 0.90, higher than the AUC values of the other three models. Hence, the PSO-XGBoost model has the best performance and the highest prediction accuracy for the credit risk evaluation of personal auto loans, which confirms the results of the earlier evaluation indexes.

Further analysis of model performance

In this subsection, to judge the performance of the proposed model more fully and to support its generalization ability, a further experiment is carried out as a comparative analysis. The data set selected for this experiment comes from a Chinese vehicle loan agency and is publicly available on the Kaggle platform at https://www.kaggle.com/xiaochou/auto-loan-default-risk.

Data processing and feature selection

The selected data set contains 199,717 customer loan records, of which 164,289 records belong to customers who have not defaulted and 35,428 to customers who have defaulted. The data set contains 54 indexes: 53 of them are information indexes used to predict customer loan default, known as independent variables, and mainly reflect the customer's basic personal information, economic status, and credit record; the remaining index, Loan_default, is the dependent variable marking whether a customer has defaulted. The decision-making task is to establish a risk identification model to predict which borrowers may default.
First, data processing and transformation are carried out for the categorical values, abnormal values, and missing values in the data set. The credit risk assessment indexes of auto loans in this data set are then preliminarily screened, and 42 independent variable indexes and 1 dependent variable index are retained. Because the default records differ greatly in number from the non-default records, with an imbalance ratio of nearly five, the data set must be rebalanced into a balanced auto loan data set. The Smote-Tomek Link algorithm proposed in Sect. “Smote-Tomek Link algorithm” is used for this imbalance processing, so as to improve the prediction accuracy for the minority class and the overall classification effect on the imbalanced data. Finally, the improved Filter-Wrapper feature selection method proposed in Sect. “Improved Filter-Wrapper feature selection method” is applied for feature selection, and 30 features are selected as the optimal feature subset, as shown in Table 13.
Table 13
The optimal feature subset

Index   Label                            Index   Label
Z1      main_account_loan_no             Z16     Driving_flag
Z2      main_account_active_loan_no      Z17     passport_flag
Z3      main_account_overdue_no          Z18     credit_score
Z4      main_account_outstanding_loan    Z19     main_account_monthly_payment
Z5      main_account_sanction_loan       Z20     sub_account_monthly_payment
Z6      main_account_disbursed_loan      Z21     last_six_month_new_loan_no
Z7      sub_account_loan_no              Z22     last_six_month_defaulted_no
Z8      sub_account_active_loan_no       Z23     average_age
Z9      sub_account_overdue_no           Z24     credit_history
Z10     sub_account_outstanding_loan     Z25     enquirie_no
Z11     sub_account_sanction_loan        Z26     loan_to_asset_ratio
Z12     sub_account_disbursed_loan       Z27     total_account_loan_no
Z13     disbursed_amount                 Z28     sub_account_inactive_loan_no
Z14     asset_cost                       Z29     total_inactive_loan_no
Z15     ltv                              Z30     main_account_inactive_loan_no
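For the rebalancing step described above, an off-the-shelf analogue of the integrated Smote-Tomek Link procedure is available in the imbalanced-learn package. The sketch below is an illustration rather than the exact implementation used in this paper; it assumes X holds the 42 retained independent variables and y the Loan_default label.

```python
from collections import Counter
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)        # SMOTE over-sampling followed by Tomek-link cleaning
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # roughly 164,289 / 35,428 -> near-balanced classes
```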

The decision-making process of risk assessment

(1) Data set partitioning
From Sect. “Data processing and feature selection”, 199,717 auto loan records can be used for the empirical analysis, and they are divided into a training set and test set in a 7:3 ratio, as shown in Table 14.
Table 14
Information on data set

Data set   Number of features   Number of samples   Positive/negative ratio   Missing value
Training   30                   139,802             1.00                      NA
Test       30                   59,915              1.00                      NA
(2) Parameter optimization
The PSO algorithm is used to optimize the three parameters of the XGBoost model, i.e., learning_rate, max_depth, and min_child_weight. In the iterative optimization of these three parameters, the error rate of the XGBoost model is used as the fitness function of the PSO algorithm, and the relationship between the number of iterations and the error rate of the model is shown in Fig. 8.
Figure 8 shows the variation trend of the error rate of the PSO-XGBoost model. As the number of iterations increases, the PSO algorithm continues to optimize the parameters and the error rate of the model decreases. When a stationary value is reached, the optimal parameter values are found and the PSO-XGBoost model attains its minimum error rate.
(3) Performance evaluation
To verify the performance of the PSO-XGBoost model on this data set, we compare the proposed PSO-XGBoost model with the comparable XGBoost, RF, and LR models, using the performance evaluation indexes of accuracy, precision, recall, complexity, ROC curve, and AUC value. The evaluation indexes of each model are shown in Table 15.
Table 15
Comparison of evaluation indexes of models

Evaluation index    PSO-XGBoost   XGBoost   RF        LR
Accuracy            0.7805        0.7458    0.7733    0.6527
Precision           0.7827        0.7498    0.7645    0.6418
Recall              0.7745        0.7353    0.7676    0.6853
Time complexity     24 s          12 s      13 s      4 s
Space complexity    116.2 M       110.4 M   111.7 M   54.4 M
According to the evaluation indexes in Table 15, the Accuracy, Precision, and Recall of the PSO-XGBoost model are all better than those of the other three models. Thus, the PSO-XGBoost model is again superior in classification performance and classification effect. In addition, the time and space complexity of the proposed PSO-XGBoost model are not high, which shows that the proposed model is effective and practical to operate.
In addition, the ROC curves and AUC values of the four compared models (PSO-XGBoost, XGBoost, RF, and LR) are plotted in the same figure, as shown in Fig. 9.
As can be seen from Fig. 9, the ROC curve of the PSO-XGBoost model is the closest to the upper left corner, followed by those of the RF, XGBoost, and LR models. The AUC value of the PSO-XGBoost model is 0.86, the highest of the four models. From both the ROC curves and the AUC values, it can be concluded that the PSO-XGBoost model presented in this paper has the best performance and the highest prediction accuracy for the credit risk evaluation of personal auto loans, which is consistent with the results in Sect. “Performance evaluation of PSO-XGBoost model”. Notably, Carrington et al. [63] pointed out that, in classification and diagnostic tests, the ROC curve and AUC describe how an adjustable threshold trades off two types of errors, false positives and false negatives, but that they are only partially meaningful when applied to imbalanced data. In this sense, if the ROC curve and AUC are used, it is best to first convert the imbalanced data set into a balanced one; otherwise, alternatives such as the concordant partial AUC and the partial c statistic for ROC data proposed by Carrington et al. [63] are good choices.

Conclusion

Seeking to address the problem of credit risk assessment for personal auto loans, this paper studies the feature selection method of credit risk assessment, and constructs a machine learning based credit risk assessment mechanism. Two data sets of personal auto loans on the Kaggle platform are selected as the research samples. Noting the imbalanced data set, the Smote-Tomek Link algorithm is proposed to achieve a balanced data set. An improved Filter-Wrapper feature selection method is then proposed to select the credit risk assessment indexes of the personal auto loans. A PSO-XGBoost model for the credit risk assessment is constructed and an empirical analysis is made.
Moreover, the proposed PSO-XGBoost model is compared with the XGBoost, RF, and LR models using performance evaluation indexes such as accuracy, precision, complexity, ROC curve, and AUC value. In the empirical analysis on the first data set given in Sect. “Data preparation and preprocessing”, the comparison results show that the classification accuracy of the PSO-XGBoost model is 83.11%, which is 4.23, 2.44, and 12.45 percentage points higher than that of the XGBoost, RF, and LR models, respectively; in terms of Precision and Recall, the PSO-XGBoost model is also better than the XGBoost, RF, and LR models; and its AUC value of 0.90 is higher than those of the three comparison models. The results of the second empirical analysis on the data set introduced in Sect. “Further analysis of model performance” likewise show the PSO-XGBoost model to be superior to the other models in classification performance and classification effect. This validates the choice of the PSO-XGBoost model for the credit risk assessment of personal auto loans.
Because the data set selected in this paper contains two-category data, the problem discussed here is a two-category credit risk assessment of personal auto loans. In the actual field of personal auto lending, however, loan customers can be classified into multiple credit levels so that auto financing institutions can apply differentiated strategies to different customers and thereby improve their core competitiveness. Therefore, collecting multi-category personal auto loan data and extending the two-category model established in this paper to a multi-category credit risk assessment model are directions for future research.

Acknowledgements

We would like to thank the editor and the anonymous reviewers for their helpful comments.

Declarations

Conflict of interest

The authors declare that they have no competing interests.
No ethical approval or patient consent to participate was required for this study.
The authors confirm that the final version of the manuscript has been reviewed, approved, and consented for publication by all authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Chen Y, Lawell C, Wang YS (2020) The Chinese automobile industry and government policy. Research in Transportation Economics 100849
2. Walks A (2018) Driving the poor into debt? Automobile loans, transport disadvantage, and automobile dependence. Transp Policy 65:137–149
3. Kang YX, Mao SH, Zhang YH (2022) Fractional time-varying grey traffic flow model based on viscoelastic fluid and its application. Transportation Research Part B: Methodological 157:149–174
4. Wells P, Wang XB, Wang LQ, Liu HK, Orsato R (2020) More friends than foes? The impact of automobility-as-a-service on the incumbent automotive industry. Technol Forecast Soc Chang 154:119975
5. Gao MY, Yang HL, Xiao QZ, Goh M (2021) A novel method for carbon emission forecasting based on Gompertz's law and fractional grey model: Evidence from American industrial sector. Renewable Energy 181:803–819
7. Li B, Dong XJ, Wen JH (2022) Cooperative-driving control for mixed fleets at wireless charging sections for lane changing behaviour. Energy 243:122976
8. Wu DM, Fang M, Wang Q (2018) An empirical study of bank stress testing for auto loans. J Financ Stab 39:79–89
9. Xiao QZ, Chen L, Xie M, Wang C (2021) Optimal contract design in sustainable supply chain: Interactive impacts of fairness concern and overconfidence. Journal of the Operational Research Society 72(7):1505–1524
10. Chen L, Nan GF, Li MQ, Feng B, Liu QR (2021) Manufacturer's online selling strategies under spillovers from online to offline sales. Journal of the Operational Research Society, forthcoming
11. Duan HQ, Snyder T, Yuan WC (2018) Corruption, economic development, and auto loan delinquency: Evidence from China. J Econ Bus 99:28–38
12. Li P, Rao CJ, Goh M, Yang ZQ (2021) Pricing strategies and profit coordination under a double echelon green supply chain. J Clean Prod 278:123694
13. Thabtah F, Kamalov F, Hammoud S, Shahamiri SR (2020) Least loss: A simplified filter method for feature selection. Inf Sci 534:1–15
14. Aremu OO, Cody RA, Hyland-Wood D, McAree PR (2020) A relative entropy based feature selection framework for asset data in predictive maintenance. Comput Ind Eng 145:106536
15. Wei GF, Zhao J, Feng YL, He AX, Yu J (2020) A novel hybrid feature selection method based on dynamic feature importance. Appl Soft Comput 93:106337
16. Shah SMS, Shah FA, Hussain SA, Batool S (2020) Support vector machines-based heart disease diagnosis using feature subset, wrapping selection and extraction methods. Comput Electr Eng 84:106628
17. Lee J, Jeong JY, Jun CH (2020) Markov blanket-based universal feature selection for classification and regression of mixed-type data. Expert Syst Appl 158:113398
18. Gholami J, Pourpanah F, Wang XZ (2020) Feature selection based on improved binary global harmony search for data classification. Appl Soft Comput 93:106402
19. Kou G, Yang P, Peng Y, Xiao F, Chen Y, Alsaadi FE (2020) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput 86:105836
20. Huang H, Liu H (2020) Feature selection for hierarchical classification via joint semantic and structural information of labels. Knowl-Based Syst 195:105655
21. Wang XH, Zhang Y, Sun XY (2020) Multi-objective feature selection based on artificial bee colony: An acceleration approach with variable sample size. Appl Soft Comput 88:106041
22. Kira K, Rendell LA (1992) The feature selection problem: Traditional methods and a new algorithm. Proc. of 10th National Conference on Artificial Intelligence, Canada: AAAI Press, pp 129–134
23. Ma JB, Gao XY (2020) A filter-based feature construction and feature selection approach for classification using genetic programming. Knowl-Based Syst 196:105806
24. Gokalp O, Tasci E, Ugur A (2020) A novel wrapper feature selection algorithm based on iterated greedy metaheuristic for sentiment classification. Expert Syst Appl 146:113176
25. Khammassi C, Krichen S (2020) A NSGA2-LR wrapper approach for feature selection in network intrusion detection. Comput Netw 172:107183
26. González J, Ortega J, Damas M, Martín-Smith P, Gan JQ (2019) A new multi-objective wrapper method for feature selection – Accuracy and stability analysis for BCI. Neurocomputing 333:407–418
27. Mafarja M, Mirjalili S (2018) Whale optimization approaches for wrapper feature selection. Appl Soft Comput 62:441–453
28. Rajab KD (2017) New hybrid features selection method: a case study on websites phishing. Security & Communication Networks 2:1–10
29. Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2016) A new hybrid filter–wrapper feature selection method for clustering based on ranking. Neurocomputing 214:866–880
30. Rao CJ, Lin H, Liu M (2020) Design of comprehensive evaluation index system for P2P credit risk of "three rural" borrowers. Soft Comput 24(15):11493–11509
31. Lin YP, Chen LL, Zou JZ (2019) Application of hybrid feature selection algorithm based on particle swarm optimization in fatigue driving. Comput Eng 45(2):278–283
32. Durand D (1941) Risk elements in consumer instalment financing, technical edition. National Bureau of Economic Research 218(1):237
33. Yu LA, Wang SY (2009) A kernel principal component analysis based least squares fuzzy support vector machine methodology with variable penalty factors for credit classification. Journal of System Science and Mathematical Science 29(10):1311–1326
34. Rao CJ, Liu M, Goh M, Wen JH (2020) 2-stage modified random forest model for credit risk assessment of P2P network lending to "Three Rurals" borrowers. Appl Soft Comput 95:106570
35. Lanzarini LC, Monte AV, Bariviera AF, Santana PJ (2017) Simplifying credit scoring rules using LVQ + PSO. Kybernetes 46(1):8–16
38. Liu C, Xie J, Zhao Q, Xie QW, Liu CQ (2019) Novel evolutionary multi-objective soft subspace clustering algorithm for credit risk assessment. Expert Syst Appl 138:112827
39. Luo J, Yan X, Tian Y (2020) Unsupervised quadratic surface support vector machine with application to credit risk assessment. Eur J Oper Res 280:1008–1017
40. Blagus R, Lusa L (2013) SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 14(1):1–16
41. Roshan SE, Asadi S (2020) Improvement of bagging performance for classification of imbalanced datasets using evolutionary multi-objective optimization. Eng Appl Artif Intell 87:103319
42. Xie YX, Peng LZ, Chen ZX, Yang B, Zhang HL, Zhang HB (2019) Generative learning for imbalanced data using the Gaussian mixed model. Appl Soft Comput 79:439–451
43. Hong WH, Yap JH, Selvachandran G, Thong PH, Son LH (2021) Forecasting mortality rates using hybrid Lee-Carter model, artificial neural network and random forest. Complex & Intelligent Systems 7:163–189
45. Gan D, Shen J, An B, Xu M, Liu N (2020) Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis. Comput Ind Eng 140:106266
46. Li YS, Chi H, Shao XY, Qi ML, Xu BG (2020) A novel random forest approach for imbalance problem in crime. Knowl-Based Syst 195:105738
47. Sharma D, Willy C, Bischoff J (2021) Optimal subset selection for causal inference using machine learning ensembles and particle swarm optimization. Complex & Intelligent Systems 7:41–59
48. He YY, Zhou JH, Lin YP, Zhu TF (2019) A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data. Comput Biol Chem 80:121–127
49. Liao K, Fu J, Yang W (2010) Modified relief algorithm for radar HRRP target recognition. Journal of Electronic Measurement and Instrument 24(9):831–836
50. Sun GL, Li JB, Dai J, Song ZC, Lang F (2018) Feature selection for IoT based on maximal information coefficient. Futur Gener Comput Syst 89:606–616
51. Zhang YS, Yang C, Yang AR, Xiong C, Zhou XG, Zhang ZG (2015) Feature selection for classification with class-separability strategy and data envelopment. Neurocomputing 166(10):172–184
52. Fu PH, Zhan ZG, Wu CJ (2013) Efficiency analysis of Chinese road systems with DEA and order relation analysis method: Externality concerned. Procedia Soc Behav Sci 966:1227–1238
53. Rao CJ, Gao Y (2022) Evaluation mechanism design for the development level of urban-rural integration based on an improved TOPSIS method. Mathematics 10:380
54. Mercadier M, Lardy JP (2019) Credit spread approximation and improvement using random forest regression. Eur J Oper Res 277(1):351–365
55. Wei J, Chen H (2020) Determining the number of factors in approximate factor models by twice K-fold cross validation. Econ Lett 191:109149
56. Nobre J, Neves RF (2019) Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Syst Appl 125:181–194
57. Zou J, Deng Q, Zheng JH, Yang SX (2020) A close neighbor mobility method using particle swarm optimizer for solving multimodal optimization problems. Inf Sci 519:332–347
60. Zhang CX, Xu S, Zhang JS (2019) A novel variational Bayesian method for variable selection in logistic regression models. Comput Stat Data Anal 133:1–19
62. Rao CJ, He YW, Wang XL (2021) Comprehensive evaluation of non-waste cities based on two-tuple mixed correlation degree. Int J Fuzzy Syst 23:369–391
63. Carrington AM, Fieguth PW, Qazi H et al (2020) A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Med Inform Decis Mak 20:4