Published in: Complex & Intelligent Systems 2/2023

Open Access 12.09.2022 | Original Article

Credit risk assessment mechanism of personal auto loan based on PSO-XGBoost Model

Authors: Congjun Rao, Ying Liu, Mark Goh


Abstract

As online P2P lending in automotive financing grows, there is a need to manage and control the credit risk of personal auto loans. In this paper, a personal auto loan data set from the Kaggle platform is used to build a machine-learning-based credit risk assessment mechanism for personal auto loans. An integrated Smote-Tomek Link algorithm is proposed to convert the data set into a balanced data set. Then, an improved Filter-Wrapper feature selection method is presented to select credit risk assessment indexes for the loans. Combining Particle Swarm Optimization (PSO) with the eXtreme Gradient Boosting (XGBoost) model, a PSO-XGBoost model is formed to assess the credit risk of the loans. The PSO-XGBoost model is compared against the XGBoost, Random Forest, and Logistic Regression models on the standard performance evaluation indexes of accuracy, precision, ROC curve, and AUC value. The PSO-XGBoost model is found to be superior in classification performance and classification effect.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Introduction

China's auto finance market started relatively late, and the idea of buying a car by installment only appeared in 1993. In 1998, the Government introduced a policy encouraging automobile consumer loans, effectively kick-starting China's automobile finance market. By 2018, China's automotive finance market had reached 139 million yuan, a growth of 19.2%. With better personal credit information, this market is set to grow. According to China's banking regulatory commission, from 2013 to 2017, the compounded annual growth rate of outstanding loans in China's auto financing business was as high as 29%. By the end of 2017, the loan balance of the auto finance business in China had reached 668.8 billion yuan, an increase of 28.39% year-on-year. Today, the auto finance industry accounts for an increasing proportion of the overall personal credit and finance industry, and its influence on China's economy is also increasing, together with the accompanying financial credit risks. Two factors compound this phenomenon: greater lifestyle consumption and easier access to online finance [1, 2]. Indeed, the auto finance industry has many advantages, such as a flexible credit verification process and simpler vetting procedures, compared to the traditional financial institutions [3–5].
At present, a variety of auto finance products is available in the market, such as the highly popular P2P online auto finance or the micro-loan network. In the tide of the Internet, various companies are trying to attract consumers with new technologies and new models, hoping to take the lead in the field of auto finance. All these signs indicate that the auto finance industry will develop rapidly in the future. This development has brought not only many benefits but also some drawbacks. Because the auto finance industry is characterized by high risks and high returns, without effective industry-wide control measures it cannot develop into a sustainable and healthy auto finance industry [6–9]. At present, China's auto finance industry is at a preliminary stage of development, with many remaining problems, such as an imperfect personal credit investigation system, inadequate laws and regulations, and inadequate risk supervision and management, all of which make the credit risk problem particularly important. Therefore, scientific and effective control of corporate credit risk has become an important issue in the development of China's auto finance industry.
The main participants in China's auto finance industry are auto finance companies, financial leasing companies, internet finance companies, and banking institutions, and the whole industrial chain is relatively complete. In terms of market share in China, however, auto financing companies occupy the main part of the market, and most Chinese consumers choose the financial products of auto financing companies. As the economic entities behind the subsidiary products of the automobile industry, auto financing companies mainly handle automobile consumer finance loans. In the process of operation, an auto financing company not only pursues its own profits, but also undertakes the task of providing consumer loans to customers to promote the sales of vehicles.
Compared with banks, auto financing companies bear more credit risk due to their specific business purposes. Due to various institutional defects in the current business model, it is difficult to form sound and effective risk management measures. In addition, the imperfection of the credit system and industrial factors such as the fluctuation of automobile prices lead to a large number of bad debts in the auto finance industry [10–12]. Against this background, financial institutions have suffered significant losses due to vehicle loan defaults, auto underwriting has been tightened, and the rejection rate of auto loans has increased. Credit institutions demand rigorous credit risk assessment models that accurately predict the probability of a borrower defaulting on a vehicle loan in the first EMI (Equated Monthly Installment) on the due date, so as to identify customers with high credit risk and further reduce the default rate. Moreover, doing so ensures that clients capable of repayment are not rejected, and important determinants can be identified and further used to minimize default rates. Motivated by this, this paper studies how to establish a credit risk assessment model for auto financing companies with high classification and prediction accuracy, so as to not only guarantee their own earnings, but also control the bad debt rate generated by credit. This has important practical significance for auto financing companies and even the whole auto financing industry.
Compared with the existing congeneric methods for the credit risk assessment of personal auto loan, this paper makes two contributions as follows.
(i)
First, to reduce the feature dimension, enhance the generalizability of the model, and reduce the possibility of overfitting, and given the 45 preliminary indexes and the limitations of any single feature selection method, this paper proposes an improved Filter-Wrapper feature selection method by combining Filter and Wrapper. In the Filter stage, three evaluation criteria, namely the Relief algorithm, the Maximum Information Coefficient method, and the Quasi-separable method, are selected. Then, the order relation analysis method is used to determine the corresponding weights of the three evaluation criteria, and a fusion model of multiple evaluation criteria is constructed to comprehensively rank the feature importance. In the Wrapper stage, the RF is selected as the classifier and the SBS method is used to screen the final optimal feature subset, thus effectively improving the classification accuracy of subsequent models.
 
(ii)
Second, most scholars study credit risk assessment in the traditional financial field, but there is little research on the credit risk assessment of personal auto loans in the auto finance industry. In today's internet era, China's auto finance industry is developing rapidly, and it is necessary to study the increasingly prominent credit risks of auto loans in its development process. Based on this, this paper proposes a PSO-XGBoost model for the credit risk assessment of personal auto loans, which is novel for the research on the credit risk assessment of auto loans in China's auto finance industry. To evaluate the performance of the models, the PSO-XGBoost model is compared against the XGBoost, RF, and LR models on performance evaluation indexes such as accuracy, precision, ROC curve, and AUC value. The results show the PSO-XGBoost model to be superior to the other models in classification performance and classification effect. This validates the choice of the PSO-XGBoost model for the credit risk assessment of personal auto loans.
 
This paper is organized as follows. Section “Literature review” surveys the literature. Section “Data preprocessing and unbalanced data set transformation” presents the data preprocessing and the transformation of the unbalanced data set. Section “Feature selection method of credit risk assessment index” proposes a Filter-Wrapper feature selection method to select the credit risk assessment indexes for the personal auto loans. Section “Credit risk assessment of personal auto loans using PSO-XGBoost model” presents a PSO-XGBoost model for the credit risk assessment of the personal auto loans and the accompanying empirical analysis. The final section concludes the paper.

Literature review

Adopting appropriate feature selection method to remove redundant features and reduce the dimension of data can effectively improve the computational speed and classification performance of the algorithm. Therefore, feature selection is indispensable in processing massive data. Currently, the popular feature selection methods include Filter [13], Wrapper [14], Modified-Dynamic Feature Importance based Feature Selection (M-DFIFS) algorithm [15], Mean Fisher-based Feature Selection Algorithm (MFFSA) [16], Markov Blanket-based Universal Feature Selection [17], Improved Binary Global Harmony Search (IBGHS) [18], MCDM-based method [19], joint semantic and structural information of labels [20], and the fast multi-objective evolutionary feature selection algorithm (FMABC-FS) [21].
The Filter method is simple and feasible, and researchers have developed evaluation criteria such as the Relief algorithm, the Maximal Information Coefficient method, and the Information Gain method. The Relief algorithm, a feature weighting method proposed by Kira [22], assigns weights to the features according to their ability to distinguish the samples. The weight is then compared with a threshold value; if the weight of a feature is less than the threshold value, the feature is deleted. In applying the Filter method, Ma and Gao [23] employed a filter-based feature selection approach using Genetic Programming (GP) with a correlation-based evaluation method, and their experiments on nine datasets show that features selected by their feature construction approach (FCM) improve the classification performance compared to the original features. Thabtah et al. [13] proposed a simple filter method to quantify the similarity between the observed and expected probabilities and generate scores for the features. They report that their approach significantly reduces the number of selected features on 27 datasets. The Wrapper method takes the accuracy obtained by the subsequent learning algorithm as the evaluation criterion. Compared to the Filter method, the Wrapper method is computationally complex with low operation efficiency albeit high accuracy. Gokalp et al. [24] proposed a wrapper feature selection algorithm using an iterative greedy metaheuristic for sentiment classification. Khammassi and Krichen [25] presented a NSGA2-LR wrapper approach for feature selection in network intrusion detection. González et al. [26] applied a new wrapper method for feature selection, based on a multi-objective evolutionary algorithm, to analyze the accuracy and stability for BCI. Mafarja and Mirjalili [27] proposed a wrapper feature selection approach based on the Whale Optimization algorithm.
A single feature selection method is often not comprehensive, and the Filter and Wrapper methods have their own merits and drawbacks. As such, some studies combine both methods and propose fusion feature selection methods that combine a variety of evaluation criteria. For example, Rajab [28] analyzed the advantages and disadvantages of the Information Gain (IG) algorithm and the Chi-square (CHI) algorithm, and then used them in combination. Solorio-Fernández et al. [29] presented a hybrid filter–wrapper method for clustering, which combines the spectral feature selection framework using the Laplacian Score ranking and a modified Calinski–Harabasz index. Rao et al. [30] presented a two-stage feature selection method based on the filter and wrapper to select the main features from 35 borrower credit features. In the Filter stage, three filter methods are used to compute the importance of the unbalanced features. In the Wrapper stage, a Lasso-logistic method is used to filter the feature subset using a search algorithm.
Thus, following the earlier works, this paper combines the Filter and Wrapper methods to propose an improved Filter-Wrapper two-stage feature selection method to select the credit risk assessment indexes of the personal auto loans. However, compared to the existing fusion approach of the Filter and Wrapper methods, our two-stage feature selection method is different on the following aspects. In the Filter stage, we consider the aspects of information relevance, amount of information, and quasi-separable ability to, respectively, select three evaluation criteria, i.e., Relief algorithm, Maximal Information Coefficient method and Quasi-separable method to evaluate the importance of the features. A fusion model of multiple evaluation criteria is then constructed to rank the importance of the features. In the Wrapper stage, the Random Forest (RF) is selected as the classifier; the classification accuracy is used as the measurement standard, and the Sequence Backward Selection (SBS) method [31] is used for the feature selection. Based on the classification accuracy, the quality of the corresponding feature subset is evaluated, and the optimal feature subset is selected as a result of the evaluation indexes for the credit risk assessment of the personal auto loans.
Auto finance credit stems from consumer credit finance, notably individual credit risk assessment. The traditional analysis methods, such as 5C and LAPP, are subjective and highly dependent on expert experience. Research then switched to mathematical models for credit risk assessment; Durand [32] was the first to use discriminant analysis to assess individual credit risk. With the advent of better computing power and the availability of massive data sets, artificial intelligence methods such as machine learning, data mining, and deep learning have emerged.
However, traditional statistics, non-parametric statistics, machine learning, and data mining have been applied separately to credit risk assessment. With these single-technique methods, there are often problems associated with low prediction precision, model overfitting, and low algorithm efficiency. Therefore, researchers have since combined statistical methods with artificial intelligence methods such as machine learning and data mining to address those shortcomings when applied to individual credit risk assessment. For example, Yu and Wang [33] proposed a kernel principal components analysis based least squares fuzzy support vector machine method with variable penalty factors for credit classification, and conducted an empirical analysis to prove the effectiveness of the model. Combining decision tree theory with machine learning methods, Rao et al. [34] selected a loan data set on the Pterosaur Loan platform, and used a two-stage Syncretic Cost-sensitive Random Forest (SCSRF) model to evaluate the credit risk of the borrowers. Further, Lanzarini et al. [35] combined particle swarm optimization with competitive neural networks to propose an LVQ + PSO model to predict a credit customer's loan situation. Barani et al. [36] proposed a new improved Particle Swarm Optimization (PSO) combined with Chaotic Cellular Automata (CCA). Similarly, Mojarrad and Ayubi [37] proposed a novel approach in particle swarm optimization (PSO) that combines chaos and velocity clamping with the aim of eliminating its known disadvantage that forces particles to keep searching at the boundaries of the search space. However, as credit datasets are typically high-dimensional, class-imbalanced, and of large sample size, Liu et al. [38] recently proposed an Evolutionary Multi-Objective Soft Subspace Clustering (EMOSSC) algorithm for credit risk assessment. Luo et al. [39] employed a two-stage clustering method using a kernel-free support vector machine, and applied the method incorporating t-test feature weights for credit risk assessment.
While there is rich research on personal credit risk assessment, particularly on optimizing the performance of the current credit risk assessment models by either improving or combining statistical methods with artificial intelligence to obtain better prediction, there is little literature on the credit risk assessment of personal auto loans in the auto finance industry. In this paper, we study the problem of the credit risk assessment of personal auto loans, and combine Particle Swarm Optimization (PSO) with the XGBoost model to form a PSO-XGBoost model to evaluate the credit risk of personal auto loans. We validate the PSO-XGBoost model against three evaluation models (XGBoost, RF, and LR).

Data preprocessing and unbalanced data set transformation

To study the current credit risk problem in the auto finance industry and to reduce the loan default rate of the auto financing institutions, we select the data set of personal auto loans on the Kaggle platform as the research samples. The data set is first preprocessed, and transformed according to the specific indexes to construct an overall index. Next, based on the description and value range of the indexes in the data set, the credit risk assessment indexes are preliminarily pre-screened. The unbalanced data set is processed and transformed into a balanced data set.

Data preparation and preprocessing

The data set studied in this paper is a set of personal auto loan records from auto financing institutions, available on an open data platform, the Kaggle platform. The data set can be downloaded from https://www.kaggle.com/mamtadhaker/lt-vehicle-loan-default-prediction.
The selected data set contains 233,154 customer loan records, of which 182,543 loan records represent the set of non-defaulting customers and 50,611 loan records represent the set of defaulters. In addition, the data set contains 41 indexes, of which 40 indexes are the independent variables used to predict a borrower's loan default. The following information regarding the loan and loanee is provided in the 40 indexes: loanee information (demographic data such as age, identity proof, etc.), loan information (disbursal details, loan-to-value ratio, etc.), and bureau data and history (bureau score, number of active accounts, the status of other loans, credit history, etc.). These indexes reveal the borrower's personal information, economic health, and credit history. Another index, loan_default, marks whether the borrower has defaulted, and is labeled the dependent variable. This index divides borrowers into binary categories: “0” to denote the non-defaulters, and “1” to denote defaulters. The data set has no missing values except in the index “Employment.Type”. Table 1 provides a description of the notation used.
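As a minimal loading sketch (the local file name train.csv is an assumption; adjust the path to wherever the downloaded Kaggle file is saved), the raw data set can be read and checked as follows:

```python
import pandas as pd

# Hypothetical local copy of the Kaggle vehicle-loan data set.
df = pd.read_csv("train.csv")

print(df.shape)                                   # expected: (233154, 41)
print(df["loan_default"].value_counts())          # 0 = non-defaulters, 1 = defaulters
print(df.isna().sum().loc[lambda s: s > 0])       # only Employment.Type should report missing values
```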
Table 1
Notation and description
Index | Label | Description
Y | loan_default | Payment default in the first EMI on due date
X1 | UniqueID | Identifier for customers
X2 | disbursed_amount | Amount of loan disbursed
X3 | asset_cost | Cost of the Asset
X4 | ltv | Loan to Value of the asset
X5 | branch_id | Branch where the loan was disbursed
X6 | supplier_id | Vehicle dealer where the loan was disbursed
X7 | manufacturer_id | Vehicle manufacturer (Hero, Honda, TVS)
X8 | Current_pincode_ID | Current pincode of the customer
X9 | Date.of.Birth | Date of birth of the customer
X10 | Employment.Type | Employment type of the customer (Salaried/Self Employed)
X11 | DisbursalDate | Date of disbursement
X12 | State_ID | State of disbursement
X13 | Employee_code_ID | Employee of the organization who logged the disbursement
X14 | MobileNo_Avl_Flag | If Mobile no. was shared by the customer then flag as 1
X15 | Aadhar_flag | If aadhar was shared by the customer then flag as 1
X16 | PAN_flag | If pan was shared by the customer then flag as 1
X17 | VoterID_flag | If voter was shared by the customer then flag as 1
X18 | Driving_flag | If DL was shared by the customer then flagged as 1
X19 | Passport_flag | If passport was shared by the customer then flag as 1
X20 | PERFORM_CNS.SCORE | Bureau Score
X21 | PERFORM_CNS.SCORE.DESCRIPTION | Bureau score description
X22 | PRI.NO.OF.ACCTS | Count of total loans taken by the customer at the time of first disbursement
X23 | PRI.ACTIVE.ACCTS | Count of active loans taken by the customer at the time of first disbursement
X24 | PRI.OVERDUE.ACCTS | Count of default accounts at the time of first disbursement
X25 | PRI.CURRENT.BALANCE | Total principal outstanding of the active loans at the time of first disbursement
X26 | PRI.SANCTIONED.AMOUNT | Total amount that was sanctioned for all the loans at the time of first disbursement
X27 | PRI.DISBURSED.AMOUNT | Total amount that was disbursed for all the loans at the time of first disbursement
X28 | SEC.NO.OF.ACCTS | Count of total loans taken by the customer at the time of second disbursement
X29 | SEC.ACTIVE.ACCTS | Count of active loans taken by the customer at the time of second disbursement
X30 | SEC.OVERDUE.ACCTS | Count of default accounts at the time of disbursement
X31 | SEC.CURRENT.BALANCE | Total principal outstanding of the active loans at the time of second disbursement
X32 | SEC.SANCTIONED.AMOUNT | Total amount that was sanctioned for all the loans at the time of second disbursement
X33 | SEC.DISBURSED.AMOUNT | Total amount that was disbursed for all the loans at the time of second disbursement
X34 | PRIMARY.INSTAL.AMT | Equated Monthly Installment (EMI) Amount of the primary loan
X35 | SEC.INSTAL.AMT | EMI Amount of the secondary loan
X36 | NEW.ACCTS.IN.LAST.SIX.MONTHS | New loans taken by the borrower in last 6 months before the disbursement
X37 | DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | Loans defaulted in the last 6 months
X38 | AVERAGE.ACCT.AGE | Average loan tenure
X39 | CREDIT.HISTORY.LENGTH | Time since first loan
X40 | NO.OF_INQUIRIES | Enquiries done by the customer for loans
Table 1 shows four types of data in the data set: integer, floating point, date, and character. Among them, index X4 is floating point, X9 and X11 are date types, X21, X38, and X39 are character types, and the other indexes are integer types. The date-type and character-type data cannot be used directly, so data cleansing and data conversion are needed. Data cleansing mainly deals with data exceptions, including missing values processing, type values processing, exception point processing, and outliers processing. Data conversion enhances data processing through data discretization, data specification, or the creation of new variables.
(1) Data cleansing.
(i) Type values processing.
In the data set, index X9 (Date of birth of the customer) and X11 (Date of disbursement) are date type indexes, which are processed as follows. The date of birth of the customer is converted to the current age, and the date of disbursement is converted to the number of months from the current time. For the character type indexes X38 (Average loan tenure) and X39 (Time since first loan), their index values are converted to the number of months. For index X10 (Employment type of the customer), the Self Employed type is denoted as 0, and the Salaried type is denoted as 1. There are missing values in this index X10. In addition, there are 20 components in index X21 (Bureau score description), which are converted using the literal meaning of the description. Table 2 contains the specific conversion results.
Table 2
Risk transformation of bureau score description
Risk range of bureau score description | Score
No Bureau History Available | 0
Not Scored: Sufficient History Not Available | 0
Not Scored: Not Enough Info available on the customer | 0
Not Scored: No Activity seen on the customer (Inactive) | 0
Not Scored: No Updates available in last 36 months | 0
Not Scored: Only a Guarantor | 0
Not Scored: More than 50 active Accounts found | 0
M-Very High Risk | 1
L-Very High Risk | 2
K-High Risk | 3
J-High Risk | 4
I-Medium Risk | 5
H-Medium Risk | 6
G-Low Risk | 7
F-Low Risk | 8
E-Low Risk | 9
D-Very Low Risk | 10
C-Very Low Risk | 11
B-Very Low Risk | 12
A-Very Low Risk | 13
(ii) Exception point processing and outliers processing.
In looking for outliers in the data set, we note that some age values derived from index X9 (Date of birth of the customer) are less than or equal to zero, which is implausible. Hence, we replace them with null values and treat them as missing values. Also, for the indexes X25 (Total principal outstanding of the active loans at the time of first disbursement) and X31 (Total principal outstanding of the active loans at the time of second disbursement), some index values are less than zero, which is invalid, and they are likewise replaced with null values.
(iii) Missing values processing.
The objects with a null value in index X9 (Date of birth of the customer) are filled with the values of the mean age. For the missing values in index X10 (Employment type of the customer), the RF machine learning algorithm is used to fill them. The employment type of the borrower is taken as a dependent variable; the other indexes are treated as independent variables. The existing employment type data are trained in the random forest, to classify and predict the unknown employment types.
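A hedged sketch of this imputation step is given below; the column names follow Table 1 and the pre-screened features, and the helper and its settings are illustrative rather than the authors' exact code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_employment_type(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Fill missing Employment.Type values with a Random Forest trained on the labeled records."""
    df["age"] = df["age"].fillna(df["age"].mean())            # mean-age fill for the X9-derived age
    known = df[df["Employment.Type"].notna()]
    unknown = df[df["Employment.Type"].isna()]
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(known[feature_cols], known["Employment.Type"])     # the other indexes act as independent variables
    df.loc[unknown.index, "Employment.Type"] = rf.predict(unknown[feature_cols])
    return df
```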
(2) Data transformation.
As the data set contains many indexes with the same meaning occurring at different times (the first and second disbursements), notably indexes X22 and X28, and indexes X23 and X29, we merge the indexes with the same or similar meaning to yield composite indexes, as shown in Table 3.
Table 3
Composite indexes
Index | Label | Description
X41 | Loan_to_asset_ratio | Ratio of loan disbursed amount to the asset cost
X42 | Total_no_of_accts | Count of total loans taken by the customer at the first and second time of disbursement
X43 | Pri_inactive_accts | Count of total inactive loans taken by the customer at the first time of disbursement
X44 | Sec_inactive_accts | Count of total invalid loans taken by the customer at the second time of disbursement
X45 | Total_inactives_accts | Count of total invalid loans taken by the customer at the first and second time of disbursement
X46 | Total_actives_accts | Count of total active loans taken by the customer at the first and second time of disbursement
X47 | Total_current_balance | Total principal outstanding amount of the active loans at the first and second time of disbursement
X48 | Total_sanctioned_amount | Total amount that was sanctioned for all the loans at the first and second time of disbursement
X49 | Total_disbursed_amount | Total amount that was disbursed for all the loans at the first and second time of disbursement
X50 | Total_instal_amt | EMI amount of the primary and secondary loan
X51 | Pri_loan_proportions | Proportion of the primary total loans to the principal
X52 | Sec_loan_proportions | Proportion of the secondary total loan to the principal
X53 | Active_to_inactive_act_ratio | Ratio of the customer's total loans to the invalid loans
The approach for merging the indexes in Table 3 is as follows. The indexes loan_to_asset_ratio, Total_no_of_accts, Pri_inacitve_accts, Sec_inactive_accts, Total_inactives_accts, Total_actives_accts, Total_current_balance, Total_sanctioned_amount, Total_disbursed_amount, Total_instal_amt, Pri_loan_proportions, Sec_loan_proportions, and Active_to_inactive_act_ratio, are denoted by X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, and X53, respectively, and their index values are as follows:
$$ \begin{aligned} X_{41} &= \frac{X_{2}}{X_{3}}, & X_{42} &= X_{22} + X_{28}, \\ X_{43} &= X_{22} - X_{23}, & X_{44} &= X_{28} - X_{29}, \\ X_{45} &= X_{22} - X_{23} + X_{28} - X_{29}, & X_{46} &= X_{23} + X_{29}, \\ X_{47} &= X_{25} + X_{31}, & X_{48} &= X_{26} + X_{32}, \\ X_{49} &= X_{27} + X_{33}, & X_{50} &= X_{34} + X_{35}, \\ X_{51} &= \frac{X_{27}}{X_{34} + 1}, & X_{52} &= \frac{X_{33}}{X_{35} + 1}, \\ X_{53} &= \frac{X_{22} + X_{28}}{X_{22} - X_{23} + X_{28} - X_{29} + 1}. \end{aligned} $$
Creating the new composite indexes yields 54 indexes in total. Of these, 53 indexes are independent variables related to the borrower's information and one index is the dependent variable. From the data, there are 12 indexes with no zero values, namely X1, X2, X3, X4, X5, X6, X7, X8, X9, X11, X12, and X13, but there are many zeros in the index values of the other 42 indexes. Hence, if more than three-quarters of the index values of a record are zero, then the record is deemed invalid and deleted accordingly. As a result, 117,156 loan records remain for research and analysis.
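A sketch of this construction in pandas is shown below, using the short index codes X2, X3, ... as column names (the raw file uses the longer labels from Table 1, so a renaming step is assumed):

```python
df["X41"] = df["X2"] / df["X3"]               # loan_to_asset_ratio
df["X42"] = df["X22"] + df["X28"]             # total_no_of_accts
df["X43"] = df["X22"] - df["X23"]             # pri_inactive_accts
df["X44"] = df["X28"] - df["X29"]             # sec_inactive_accts
df["X45"] = df["X43"] + df["X44"]             # total_inactive_accts
df["X46"] = df["X23"] + df["X29"]             # total_active_accts
df["X47"] = df["X25"] + df["X31"]             # total_current_balance
df["X48"] = df["X26"] + df["X32"]             # total_sanctioned_amount
df["X49"] = df["X27"] + df["X33"]             # total_disbursed_amount
df["X50"] = df["X34"] + df["X35"]             # total_instal_amt
df["X51"] = df["X27"] / (df["X34"] + 1)       # pri_loan_proportions
df["X52"] = df["X33"] / (df["X35"] + 1)       # sec_loan_proportions
df["X53"] = df["X42"] / (df["X45"] + 1)       # active_to_inactive_act_ratio

# Drop records in which more than three-quarters of the index values are zero.
zero_ratio = (df == 0).mean(axis=1)
df = df[zero_ratio <= 0.75]
```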

Pre-screening credit risk assessment indexes

The credit risk assessment indexes of the personal auto loans are generally divided into three categories, i.e., personal indexes, economic indexes, and credit indexes. Personal indexes generally reveal the basic information of the borrower, such as age, gender, job, and education, which can be used to predict the change in a borrower’s loan repayment behavior. Economic indexes reflect the economic standing of the borrower. The better the economic standing, the less is the likelihood to default. Credit indexes reflect a borrower’s credit history, including the credit data generated in their life, work, and so on. This information can be used to understand the borrower's credit history of repayment, the borrower's repayment willingness, and can be used to predict future repayment behavior changes.
From the description and value range of the indexes in the data set, it is easy to infer whether an index is a credit risk factor. For index X1 (UniqueID) and index X8 (Current pincode of the customer), a borrower's identifier is equivalent to a person's name; these indexes are not factors affecting credit risk and are deleted. Similarly, the indexes X5 (Branch where the loan was disbursed), X6 (Vehicle dealer where the loan was disbursed), X7 (Vehicle manufacturer (Hero, Honda, TVS)), X12 (State of disbursement), and X13 (Employee of the organization who logged the disbursement) are assigned by the system, have no real impact on the credit risk assessment, and are also deleted. In addition, as the value of index X14 is 1 in all the loan records, index X14 has no predictive role in credit risk assessment and is deleted. This screening eliminates 8 indexes. Thus, 45 indexes related to customer information and 1 dependent variable index remain in the data set, as shown in Table 4.
Table 4
Resulting credit risk assessment indexes
Index | Label | Index | Label
Z1 | Aadhar_flag | Z24 | VoterID_flag
Z2 | DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | Z25 | age
Z3 | Driving_flag | Z26 | asset_cost
Z4 | Employment.Type | Z27 | average_acct_age_month
Z5 | NEW.ACCTS.IN.LAST.SIX.MONTHS | Z28 | credit_history_length_month
Z6 | NO.OF_INQUIRIES | Z29 | credit_risk_grade
Z7 | PAN_flag | Z30 | disbursal_months_passed
Z8 | PERFORM_CNS.SCORE | Z31 | disbursed_amount
Z9 | PRI.ACTIVE.ACCTS | Z32 | ltv
Z10 | PRI.CURRENT.BALANCE | Z33 | loan_to_asset_ratio
Z11 | PRI.DISBURSED.AMOUNT | Z34 | total_no_of_accts
Z12 | PRI.NO.OF.ACCTS | Z35 | pri_inactive_accts
Z13 | PRI.OVERDUE.ACCTS | Z36 | sec_inactive_accts
Z14 | PRI.SANCTIONED.AMOUNT | Z37 | total_inactive_accts
Z15 | PRIMARY.INSTAL.AMT | Z38 | total_active_accts
Z16 | Passport_flag | Z39 | total_current_balance
Z17 | SEC.ACTIVE.ACCTS | Z40 | total_sanctioned_amount
Z18 | SEC.CURRENT.BALANCE | Z41 | total_disbursed_amount
Z19 | SEC.DISBURSED.AMOUNT | Z42 | total_instal_amt
Z20 | SEC.INSTAL.AMT | Z43 | pri_loan_proportion
Z21 | SEC.NO.OF.ACCTS | Z44 | sec_loan_proportion
Z22 | SEC.OVERDUE.ACCTS | Z45 | active_to_inactive_act_ratio
Z23 | SEC.SANCTIONED.AMOUNT | |

Transforming unbalanced data set

After the data preprocessing, we convert the unbalanced data set into a balanced data set. Traditional machine learning algorithms focus on the overall accuracy, and the trained classifiers tend to favor the majority category during training [40–42], so the prediction accuracy of the minority category is very low. We propose a Smote-Tomek Link algorithm to convert the imbalanced data set into a balanced data set, to improve the prediction accuracy of the minority category and the overall classification effect of the data set.
In this section, based on the traditional Smote algorithm [42–44], a Smote-Tomek Link algorithm is proposed to transform the unbalanced data set into a balanced one.

Smote-Tomek Link algorithm

The basic steps of the Smote-Tomek Link algorithm are as follows: (i) Randomly select n minority-class sample points using the Smote algorithm, and find the m minority-class sample points closest to each of them. (ii) Select any point among these m nearest minority-class samples; a new data sample is generated between the selected pair of points. On this basis, an integrated Smote-Tomek Link algorithm is designed by combining Smote with the Tomek Link. The basic idea is as follows.
Each newly generated data point and the non-synthetic sample point closest to it form a Tomek Link pair. A rule is then defined: a neighborhood is framed with the newly generated point as its center and the Tomek Link distance as its radius.
If the number of minority-class or majority-class samples in this neighborhood is less than a minimum threshold, the newly generated point is regarded as a "trash point" and is either removed or subjected to another round of Smote training. If the number of minority-class or majority-class samples in the neighborhood is greater than or equal to the minimum threshold, samples are drawn from the retained set of minority-class samples and put into Smote training. Following this rule, the "trash points" are eliminated and the new data points that meet the criteria are retained. The above steps are repeated, and the generated samples are finally added to the data set to obtain a new balanced data sample set.
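The same combined resampling idea is available off the shelf in the imbalanced-learn library; the sketch below is a stand-in for the integrated variant described above, not the authors' own implementation, and assumes the preprocessed frame df from the earlier sketches:

```python
from imblearn.combine import SMOTETomek

X = df.drop(columns=["loan_default"])
y = df["loan_default"]

resampler = SMOTETomek(random_state=0)            # SMOTE over-sampling followed by Tomek Link cleaning
X_bal, y_bal = resampler.fit_resample(X, y)
print(y.value_counts(), y_bal.value_counts(), sep="\n")
```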

Unbalanced data set transformation based on Smote-Tomek Link algorithm

From section “Data preparation and preprocessing”, 117,156 loan records are obtained, of which 93,315 records are the auto loan data of the non-defaulters, and 23,841 are the auto loan data of the defaulters. The imbalance ratio of the data set is almost four times, which would affect the model effect. Thus, we use the Smote-Tomek Link algorithm proposed in Subsection “Smote-Tomek Link algorithm” to process and transform the imbalanced data set into a balanced data set. To highlight the superiority of this algorithm in processing the data set, several machine learning models are adopted to make predictions and the effects are compared using the relevant evaluation indexes.
(1) Experimental methods
We use the Smote and Smote-Tomek Link algorithms to process data set T, yielding two data sets T1 and T2, respectively. The data sets T, T1, and T2 are each further divided into 70–30 training-test sets. Then, we apply two machine learning methods, i.e., the Logistic Regression (LR) model and the Random Forest (RF) model, as the classifiers for training and prediction. The effect of the models is compared using relevant evaluation indexes such as F1-score, G-means, MCC, and AUC [45–47].
(2) Evaluation indexes of unbalanced learning
For a two-category problem in machine learning, the majority category is usually labeled the negative category, while the minority category with high recognition importance is labeled the positive category. Based on the true category of the sample and the category predicted by the classifier, there are four classification types: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). TP and TN are positive and negative samples, respectively, that are correctly predicted by the classifier; FP denotes a negative sample wrongly predicted as positive, and FN denotes a positive sample wrongly predicted as negative. Table 5 shows the confusion matrix of the classification results.
Table 5
Confusion matrix of classification results
Actual category | Predicted positive | Predicted negative
Positive | TP | FN
Negative | FP | TN
From the confusion matrix, the recall rate R and precision rate P are found using [45, 47, 48]:
$$ R = \frac{TP}{{TP + FN}},\;P = \frac{TP}{{TP + FP}}. $$
(i) F1-measure
The F1-measure is the harmonic mean of the recall rate R and precision rate P, which can evaluate the overall classification of unbalanced data sets [45–47]. The larger the value of F1, the better is the classification effect of the classifier:
$$F1{ - }measure = \frac{2}{{\frac{1}{P} + \frac{1}{R}}} = \frac{2PR}{{P + R}}.$$
(ii) G-means
The G-means evaluates the performance of the unbalanced data classification. For an unbalanced data set, the value of the G-means will be high only if the classification accuracy of both the positive category samples and the negative category samples is relatively high. Otherwise, the value of G-means will be low. The G-means is expressed as follows [45–47]:
$$ G{ - }means = \sqrt {\frac{TP}{{TP + FN}} \times \frac{TN}{{TN + FP}}} $$
(iii) MCC
The Matthews Correlation Coefficient (MCC) is an important index to evaluate the performance of unbalanced data classification. In general, the greater the MCC, the better is the classification effect of the model. The MCC is expressed as [45–47]:
$$ MCC = \frac{TP \times TN - FP \times FN}{\sqrt{\left( TP + FP \right)\left( TP + FN \right)\left( TN + FP \right)\left( TN + FN \right)}} $$
(iv) AUC
The AUC is the area under the ROC (Receiver Operating Characteristic) curve, and is a common index to measure the overall classification performance of the classifier [45–47]. The F1, G-means, and MCC assessment indexes are threshold-based, whereas the AUC does not depend on the selection of a threshold.
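The four indexes above can be computed directly from a classifier's predictions; a minimal sketch with scikit-learn (the helper name is ours):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef, roc_auc_score

def unbalanced_scores(y_true, y_pred, y_score):
    """F1, G-means, MCC, and AUC for a binary classifier; y_score is the predicted probability of class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    recall = tp / (tp + fn)                 # accuracy on the positive (minority) class
    specificity = tn / (tn + fp)            # accuracy on the negative (majority) class
    return {
        "F1": f1_score(y_true, y_pred),
        "G-means": np.sqrt(recall * specificity),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }
```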
(3) Analysis of experimental results
For the original untreated data set, the data set processed by the Smote algorithm, and the data set processed by the Smote-Tomek Link algorithm, two machine learning methods (the LR model and the RF model) are used as the classifiers for training and prediction. The final classification results are shown in Table 6. Panels A and B show the classification results obtained by the LR and RF models, respectively.
Table 6
Classification results based on LR and RF models
Data set | F1 | G-means | MCC | AUC
Panel A: LR model
Unprocessed data set | 0.012352 | 0.079060 | 0.032097 | 0.502130
Smote algorithm | 0.620625 | 0.618822 | 0.237656 | 0.618827
Smote-Tomek Link algorithm | 0.624523 | 0.620736 | 0.239479 | 0.619983
Panel B: RF model
Unprocessed data set | 0.039540 | 0.143128 | 0.066204 | 0.507563
Smote algorithm | 0.842869 | 0.842489 | 0.684974 | 0.842489
Smote-Tomek Link algorithm | 0.851321 | 0.848532 | 0.670156 | 0.857371
From Table 6, when the imbalanced data set is not processed, the fitting effect of both the LR and RF models is extremely poor. This is because the distribution of the majority category and the minority category in the data set is uneven; as a result, the model tends to predict the minority category as the majority category during training, thus lowering the prediction accuracy of the minority category. Using the Smote algorithm or the Smote-Tomek Link algorithm to process the data greatly improves the performance of the classifier. Comparing the F1, G-means, MCC, and AUC values obtained when the same classifier is trained and tested on the data set processed by the Smote algorithm and on the data set processed by the Smote-Tomek Link algorithm, the classification effect and predictive performance of the Smote-Tomek Link algorithm are better than those of the Smote algorithm. Thus, we use the Smote-Tomek Link algorithm to transform the imbalanced data set into a balanced data set.

Feature selection method of credit risk assessment index

From the balanced data set, 186,630 auto loan records are obtained. From them, 45 features (indexes) are used to reflect the borrower's auto loan information. Due to the large number of feature dimensions of the auto loan borrowers, there may be features that are irrelevant or redundant to credit risk. Therefore, it is necessary to make a feature selection of these 45 features to further screen the indexes and simplify the feature subsets, so as to reduce the dimension of the feature space. In this way, the generalizability of the established credit risk assessment model of personal auto loans can be enhanced and any overfitting can be reduced.

Improved Filter-Wrapper feature selection method

An improved Filter-Wrapper feature selection method is presented for selecting the main features from among the 45 preliminary indexes in Table 4. In the Filter stage, three evaluation criteria, namely, the Relief algorithm [48, 49], the Maximal Information Coefficient [50], and the Quasi-separable method [51], are used to evaluate the importance of the features from three aspects: information relevance, information quantity, and quasi-separable ability. A fusion model of multiple evaluation criteria is constructed to rank the importance of the features; to overcome the subjectivity in determining the weight coefficients of the feature importance, the order relation analysis method [51–53] is used to determine the corresponding weights of the three evaluation criteria. In the Wrapper stage, the classification accuracy is used as the measurement standard, and the SBS method [31] removes the 45 preliminary features one by one in reverse order of their comprehensive ranking; the lower the rank order, the lesser is the importance of that feature. At the same time, the feature subset after each deletion is trained and predicted, so as to obtain the classification accuracy on the data set. The feature subsets are then evaluated on classification accuracy, and the optimal feature subset is found.
(1) Filter stage
It is difficult for an evaluation criterion to comprehensively evaluate the quality of the feature subsets. If the evaluation criteria are combined, they can complement each other and improve the evaluation quality. For the 45 preliminary features listed in Table 4, three evaluation criteria: Relief algorithm, Maximum Information Coefficient method, and Quasi-separable method, are selected.
The dimensionality of the three evaluation criteria is different, which may lead to significant differences in the corresponding values of the features and affect the subsequent fusion process of the evaluation criteria, resulting in large deviations in the results. With this in mind, the dimensions of the three evaluation criteria are harmonized using:
$$ Re_{i} = \frac{re_{i} - \min_{i} re_{i}}{\max_{i} re_{i} - \min_{i} re_{i}}, \quad i = 1,2,\ldots,45 $$
$$ M_{i} = \frac{m_{i} - \min_{i} m_{i}}{\max_{i} m_{i} - \min_{i} m_{i}}, \quad i = 1,2,\ldots,45 $$
$$ C_{i} = \frac{c_{i} - \min_{i} c_{i}}{\max_{i} c_{i} - \min_{i} c_{i}}, \quad i = 1,2,\ldots,45 $$
where rei, mi and ci are the values obtained by the Relief algorithm, Maximum Information Coefficient method, and Quasi-separable method, respectively. The max and min represent the maximum and minimum values respectively. Rei, Mi, and Ci are the values after range standardization.
Though the three evaluation criteria are measured differently, they all conform to the same rule: the greater the evaluation value of feature i, the stronger is the classification ability of that feature. Thus, the values obtained by the three evaluation criteria are fused to form a fusion model of multiple evaluation criteria. The fusion evaluation value of feature i, denoted as totali, expresses the importance degree of feature i and is written as
$$ total_{i} = w_{1} Re_{i} + w_{2} C_{i} + w_{3} M_{i} $$
(1)
where \(w_{1}\), \(w_{2}\) and \(w_{3}\) are the weights corresponding to the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method, respectively.
As the influence of each evaluation criterion on the result is different, their weights are different, and the choice of weights affects the fitting effect of the subsequent model, so determining the weights is key. For this, we employ the order relation analysis method [51–53] to obtain the weights of the evaluation criteria, as shown in Fig. 1.
The steps to determine the weights are as follows.
Step 1: Determine the order relationship among the evaluation criteria. From the effect of the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method, the rank relation among the evaluation criteria is as follows:
$$ U_{1} > U_{2} > U_{3} $$
where \(U_{1}\) is the Relief algorithm, \(U_{2}\) is the Quasi-separable method, and \(U_{3}\) is the Maximum Information Coefficient method, respectively.
Step 2: Obtain the relative importance of the three evaluation criteria using comparative judgment. Suppose the ratio of the importance of evaluation criteria \(U_{k - 1}\) to \(U_{k}\) is \(\gamma_{k}\) [51, 52], that is,
$$ \gamma_{k} = \frac{{U_{k - 1} }}{{U_{k} }},\;k = 2,3,...,n $$
(2)
where the value of \(\gamma_{k}\) is as defined in Table 7.
Table 7
Value of γk and description
γk | Description
1.0 | Uk-1 is just as important as Uk
1.2 | Uk-1 is slightly more important than Uk
1.4 | Uk-1 is obviously more important than Uk
1.6 | Uk-1 is highly more important than Uk
1.8 | Uk-1 is extremely more important than Uk
Using Table 7 and Eq. (2), the importance of the order relation among the three evaluation criteria can be assessed. The Relief algorithm is slightly more important than the Quasi-separable method, which is slightly more important than the Maximum Information Coefficient method. Thus, we have:
$$ \gamma_{2} = \frac{U_{1}}{U_{2}} = 1.2, \quad \gamma_{3} = \frac{U_{2}}{U_{3}} = 1.2 $$
(3)
Step 3: Compute the importance weight \(w_{m}\). The ranking of the weights of the three evaluation criteria is consistent with their corresponding positions in the order relation among them. The importance weights are found [51, 52] as follows:
$$ w_{m} = \left( {1 + \sum\limits_{k = 2}^{m} {\prod\limits_{i = k}^{m} {\gamma_{k} } } } \right)^{ - 1} $$
(4)
$$ w_{k - 1} = \gamma_{k} w_{k} ,\quad k = m, m - 1, \ldots, 2 $$
(5)
Combining Eqs. (4) and (5) yields
$$ w_{3} = \left( {1 + \gamma_{2} \times \gamma_{3} + \gamma_{3} } \right)^{ - 1} = \left( {1 + 1.2 \times 1.2 + 1.2} \right)^{ - 1} = 0.2747 $$
(6)
$$ w_{2} = \gamma_{3} w_{3} = 1.2 \times 0.27473 = 0.3297 $$
(7)
$$ w_{1} = \gamma_{2} w_{2} = 1.2 \times 0.32968 = 0.3956 $$
(8)
Thus, the importance weights of the Relief algorithm, Maximum Information Coefficient method and Quasi-separable method are 0.3956, 0.3297 and 0.2747, respectively, satisfying \(w_{1} + w_{2} + w_{3} = 1\).
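The computation in Eqs. (4)–(8) can be reproduced for any number of criteria; a small sketch (the function name is ours):

```python
def order_relation_weights(gammas):
    """Order relation analysis: gammas = [gamma_2, ..., gamma_m] as defined in Eq. (2)."""
    m = len(gammas) + 1
    denom = 1.0
    for k in range(len(gammas)):             # Eq. (4): 1 + sum over k of prod_{i=k}^{m} gamma_i
        prod = 1.0
        for g in gammas[k:]:
            prod *= g
        denom += prod
    weights = [0.0] * m
    weights[-1] = 1.0 / denom                # w_m
    for k in range(m - 2, -1, -1):           # Eq. (5): w_{k-1} = gamma_k * w_k
        weights[k] = gammas[k] * weights[k + 1]
    return weights

print(order_relation_weights([1.2, 1.2]))    # -> approximately [0.3956, 0.3297, 0.2747]
```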
Step 4: Compute the fusion evaluation value \(total_{i}\). Substituting \(w_{1}\), \(w_{2}\) and \(w_{3}\) into Eq. (1), the fusion model of multiple evaluation criteria is expressed as
$$ total_{i} = 0.3956Re_{i} + 0.3297C_{i} + 0.2747M_{i} $$
(9)
Step 5: Rank the features. Using the fusion evaluation value \(total_{i}\), the features are now ranked.
Figure 2 shows the flowchart of the comprehensive ranking of the features during the Filter stage.
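Putting the Filter stage together, a sketch of the fusion and ranking follows; the raw criterion scores relief_scores, mic_scores, and sep_scores are assumed to be pandas Series indexed by feature name, produced by whatever implementations of the three criteria are used:

```python
import pandas as pd

def fuse_and_rank(relief_scores, mic_scores, sep_scores, w=(0.3956, 0.3297, 0.2747)):
    """Range-standardize each criterion and combine them with the weights of Eq. (9)."""
    norm = lambda s: (s - s.min()) / (s.max() - s.min())
    total = w[0] * norm(relief_scores) + w[1] * norm(sep_scores) + w[2] * norm(mic_scores)
    return total.sort_values(ascending=False)   # higher fusion value = more important feature

# ranking = fuse_and_rank(relief_scores, mic_scores, sep_scores)  # yields a comprehensive ranking as in Table 8
```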
(2) Wrapper stage
The feature selection now enters the Wrapper stage, where we further screen the sorted features, simplify the feature subset, and reduce the dimension, so as to improve the classification accuracy. Here, the RF [54] is selected as the classifier in the Wrapper stage, and the SBS method is used to eliminate the features in accordance with their rank order. Starting from the complete feature set, the least important feature is removed at each iteration. At the same time, the classifier is used to train and predict on the current feature subset, so as to obtain the classification accuracy under this feature subset and compare it with the classification accuracy obtained in the previous iteration. The feature subset with the highest classification accuracy (that is, the optimal feature subset) is selected as the result of the evaluation index selection for the credit risk of the personal auto loans.
The steps of the Wrapper algorithm are as follows.
Input: The original feature set F = {f1, f2, …, fk}, where k is the number of original features; k = 45.
Output: Select the optimal feature subset with the highest classification accuracy.
Figure 3 shows the algorithm flowchart.
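A hedged sketch of this Wrapper stage, with RF as the classifier and ten-fold cross-validated accuracy as the evaluation measure (the helper name and RF settings are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sbs_wrapper(X, y, ranked_features):
    """ranked_features: column names ordered from most to least important (the Filter-stage ranking)."""
    current = list(ranked_features)
    best_subset, best_acc = list(current), 0.0
    while len(current) >= 1:
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        acc = cross_val_score(clf, X[current], y, cv=10, scoring="accuracy").mean()
        if acc > best_acc:
            best_acc, best_subset = acc, list(current)
        current.pop()                         # drop the currently least important feature (SBS)
    return best_subset, best_acc
```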

Analysis of selection of credit risk assessment indexes

Comprehensive ranking of features in Filter stage

Using the method described in Sect. “Improved Filter-Wrapper feature selection method”, the three evaluation criteria are used to evaluate the importance of the features in the Filter stage, and the evaluation value of each feature is obtained. Then, the evaluation values are standardized to ensure dimensional consistency. Next, the fusion model of multiple evaluation criteria is used for information fusion to obtain the fusion evaluation value of each feature, and the importance of each feature is then ranked according to the fusion evaluation value. The results obtained using Python are shown in Table 8.
Table 8
Comprehensive ranking of 45 preliminary features
Feature | Relief | MIC | Quasi-separable method | Fusion evaluation value | Rank
Z1 | 0.656250 | 0.057894 | 0.036012 | 0.287389 | 7
Z2 | 0.157813 | 0.002484 | 0.166743 | 0.118088 | 26
Z3 | 0.000000 | 0.014015 | 0.084077 | 0.031570 | 45
Z4 | 1.000000 | 0.116463 | 0.741940 | 0.672210 | 1
Z5 | 0.158703 | 0.031215 | 0.157836 | 0.123396 | 24
Z6 | 0.159447 | 0.000837 | 0.000001 | 0.063307 | 44
Z7 | 0.218750 | 0.034355 | 0.206417 | 0.164030 | 16
Z8 | 0.505691 | 0.378509 | 0.015686 | 0.309199 | 5
Z9 | 0.156034 | 0.027407 | 0.133023 | 0.113114 | 27
Z10 | 0.156250 | 0.096885 | 0.253947 | 0.172153 | 15
Z11 | 0.156250 | 0.179913 | 0.004739 | 0.112797 | 29
Z12 | 0.156013 | 0.020293 | 0.065588 | 0.088918 | 40
Z13 | 0.158750 | 0.000000 | 0.276819 | 0.154069 | 18
Z14 | 0.156250 | 0.203496 | 0.048480 | 0.133697 | 20
Z15 | 0.156256 | 0.085870 | 0.502350 | 0.251028 | 9
Z16 | 0.125000 | 0.001669 | 0.131155 | 0.093150 | 34
Z17 | 0.159247 | 0.000472 | 0.034478 | 0.074495 | 41
Z18 | 0.156297 | 0.039070 | 0.000001 | 0.072564 | 42
Z19 | 0.156253 | 0.041563 | 0.158975 | 0.125645 | 23
Z20 | 0.156250 | 0.034128 | 0.000000 | 0.071187 | 43
Z21 | 0.161916 | 0.002349 | 0.079134 | 0.090790 | 37
Z22 | 0.168713 | 0.001104 | 0.071656 | 0.090671 | 38
Z23 | 0.156263 | 0.043679 | 0.165562 | 0.128402 | 21
Z24 | 0.250000 | 0.021885 | 0.136613 | 0.149953 | 19
Z25 | 0.393644 | 0.059518 | 0.071771 | 0.195738 | 12
Z26 | 0.156303 | 0.128334 | 0.000007 | 0.097089 | 33
Z27 | 0.166422 | 0.007934 | 0.099641 | 0.100868 | 32
Z28 | 0.160613 | 0.033051 | 0.059533 | 0.092245 | 36
Z29 | 0.717303 | 0.157933 | 0.413192 | 0.463379 | 3
Z30 | 0.356469 | 0.064443 | 0.014473 | 0.163493 | 17
Z31 | 0.156378 | 0.547479 | 1.000000 | 0.541956 | 2
Z32 | 0.191141 | 0.425568 | 0.230247 | 0.268431 | 8
Z33 | 0.189450 | 0.106937 | 0.229650 | 0.180038 | 13
Z34 | 0.156056 | 0.019809 | 0.067408 | 0.089402 | 39
Z35 | 0.155972 | 0.021773 | 0.479931 | 0.225917 | 11
Z36 | 0.159753 | 0.003789 | 0.118671 | 0.103365 | 30
Z37 | 0.155953 | 0.021338 | 0.497559 | 0.231602 | 10
Z38 | 0.156181 | 0.027768 | 0.132485 | 0.113094 | 28
Z39 | 0.156253 | 0.096018 | 0.264068 | 0.175253 | 14
Z40 | 0.156250 | 0.202428 | 0.522285 | 0.289617 | 6
Z41 | 0.156250 | 0.179345 | 0.051005 | 0.127895 | 22
Z42 | 0.156256 | 0.085308 | 0.052022 | 0.102401 | 31
Z43 | 0.156250 | 0.112157 | 0.000000 | 0.092622 | 35
Z44 | 0.156350 | 0.035648 | 0.151286 | 0.121523 | 25
Z45 | 0.160081 | 1.000000 | 0.038522 | 0.350729 | 4

Feature selection in Wrapper stage

With the 45 preliminary features in Table 8 ranked, we now use the Wrapper algorithm to select the optimal feature subset. To ensure a reliable estimate of classification accuracy, we use ten-fold cross validation [55] and take the average classification accuracy over the ten predictions. Figure 4 shows the change in classification accuracy as the data dimension decreases. From Fig. 4, when the data dimension is 34, the classification accuracy of the RF classifier reaches its highest level; thereafter, the classification accuracy decreases. Thus, the first 34 features in the comprehensive ranking are chosen as the optimal feature subset, that is, they form the credit risk assessment indexes of the personal auto loans.

Credit risk assessment of personal auto loans using PSO-XGBoost model

Next, a credit risk assessment model of the personal auto loans based on the PSO-XGBoost model is formed. The XGBoost model [56] has good characteristics such as high prediction accuracy and fast runtime, while the Particle Swarm Optimization (PSO) algorithm [57–59] is used to optimize the parameters in the XGBoost model. Then, the PSO-XGBoost model, XGBoost model, RF model, and LR model [60] are used on the training data set. Prediction is made on the test data set to obtain their respective prediction outcomes, and the performance of the four models is evaluated and compared against performance evaluation indexes such as accuracy and precision, ROC curve, and AUC value.
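A hedged sketch of the PSO-tuned XGBoost idea is given below; the searched hyperparameters, their ranges, and the PSO settings are illustrative choices, since the paper's exact configuration is not reproduced here, and a recent xgboost release that accepts eval_metric in the constructor is assumed:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
LB = np.array([0.01, 3, 50])       # lower bounds: learning_rate, max_depth, n_estimators
UB = np.array([0.30, 10, 500])     # upper bounds

def fitness(p, X, y):
    """Cross-validated accuracy of XGBoost with the candidate hyperparameters p."""
    model = xgb.XGBClassifier(learning_rate=p[0], max_depth=int(p[1]),
                              n_estimators=int(p[2]), eval_metric="logloss")
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

def pso_xgboost(X, y, n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5):
    pos = rng.uniform(LB, UB, size=(n_particles, 3))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)   # PSO velocity update
        pos = np.clip(pos + vel, LB, UB)
        fit = np.array([fitness(p, X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmax()].copy()
    return gbest, pbest_fit.max()
```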

XGBoost model

XGBoost (eXtreme Gradient Boosting) [56] is a C++ implementation based on the Gradient Boosting Machine algorithm, which is a boosting algorithm. The XGBoost model constantly adds trees and splits the features to make the trees grow. As the data set is divided into a training set and a test set in a 7:3 ratio, 130,641 records of auto loan data are selected to train the XGBoost model. The training set D, containing 130,641 samples with 34 features, is expressed as \(D = \left\{ {\left( {x_{i} ,y_{i} } \right)} \right\}\left( {\left| D \right| = 130{,}641,x_{i} \in R^{34} ,y_{i} \in R} \right)\), where \(x_{i}\) represents the i-th sample, and \(y_{i} = \left\{ {\left. {0,1} \right\}} \right.\) represents the category of the default group, with 0 (1) being the non-default (default) group, respectively.
Now, suppose the total number of trees is K, then the predicted value of K to the sample is [56]
$$ \hat{y}_{i} = \phi \left( {x_{i} } \right) = \sum\limits_{k = 1}^{K} {f_{k} \left( {x_{i} } \right),f_{k} \in F} $$
(10)
$$ F = \left\{ {f\left( x \right) = w_{q\left( x \right)} } \right\}\left( {q:R^{34} \to T,w \in R^{T} } \right) $$
(11)
where \(\hat{y}_{i}\) is the predicted value of the model representing the predicted category label of sample \(i\), \(F\) is the set of classification and regression trees (CART), \(f\left( x \right)\) is a regression tree, and \(w_{q\left( x \right)}\) represents the set of all node scores of this tree, namely, the prediction of samples; q represents the classification of samples on the leaf node, that is, input a sample, map the sample to the predicted category output by the leaf node according to the model, and judge whether it is a non-defaulting or defaulting population; \(w\) is the leaf node score, and T is the number of leaf nodes in the tree.
From Eq. (10), we note that the predicted values of the XGBoost model are the sum of the predicted values of the K trees. To learn these K trees, we define the objective function, which contains a loss function and a regularization function [56], and this can be expressed as
$$ Obj = L\left( v \right) + \Omega \left( v \right) = \sum\limits_{i} {l\left( {\hat{y}_{i} ,y_{i} } \right)} + \sum\limits_{k} {\Omega \left( {f_{k} } \right)} $$
(12)
where \(L\left( v \right)\) is the loss function, which can evaluate the fitting degree of the model; \(\Omega \left( v \right)\) is the regularization function used to simplify the model and control its complexity; \(\hat{y}_{i}\) is the predicted value of the model representing the predicted category label of sample \(i\); \(y_{i}\) is the true category label of sample \(i\); \(l\left( {\hat{y}_{i} ,y_{i} } \right)\) is used to measure the deviation degree between the actual and the predicted values obtained by the credit risk assessment model, which is a non-negative real valued function; \(k\) is the number of trees, and \(f_{k}\) is the kth tree.
The term \(\Omega \left( {f_{k} } \right)\) in Eq. (12) is the regularization term [56], which is given by
$$ \Omega \left( {f_{k} } \right) = \gamma T + \frac{1}{2}\lambda \left\| w \right\|^{2} $$
(13)
where T is the number of leaf nodes in each tree, \(w\) is the vector of leaf node scores of the tree, \(\gamma\) is the penalty on the number of leaves, and \(\lambda\) is the regularization coefficient on the leaf scores. \(\gamma\) and \(\lambda\) jointly determine the model’s complexity.
According to the XGBoost model, each newly generated tree fits the residual left after the previous round. Therefore, after t trees have been generated, Eq. (10) can be written as
$$ \hat{y}_{i}^{\left( t \right)} = \hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right) $$
(14)
Substituting Eq. (14) into Eq. (12), the objective function can be rewritten as [56]
$$ Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} + f_{t} \left( {x_{i} } \right)} \right) + \Omega \left( {f_{t} } \right)} $$
(15)
The goal is to find the tree \(f_{t}\) that minimizes Eq. (15). In the Gradient Boosted Decision Tree (GBDT), only the first-order gradient is used. In contrast, the XGBoost model expands the objective function with a second-order Taylor series. Thus, Eq. (15) is approximated by
$$ Obj^{\left( t \right)} \approx \sum\limits_{i = 1}^{n} {\left[ {l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right) + g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} $$
(16)
where \(g_{i} = \partial_{{\hat{y}_{i}^{{\left( {t - 1} \right)}} }} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) and \(h_{i} = \partial_{{\hat{y}_{i}^{{\left( {t - 1} \right)}} }}^{2} l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) are the first- and second-order derivatives of the loss function with respect to \(\hat{y}_{i}^{{\left( {t - 1} \right)}}\). As \(l\left( {y_{i} ,\hat{y}_{i}^{{\left( {t - 1} \right)}} } \right)\) is a constant with respect to \(f_{t}\), it can be dropped, and Eq. (16) can be rewritten as
$$ Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} $$
(17)
Clearly, \(Obj^{\left( t \right)}\) depends only on the first- and second-order derivatives of the loss function at each data point. Since every sample falls into exactly one leaf, the sum over samples can be regrouped as a sum over leaf nodes, which gives the following result [56].
$$ \begin{gathered} Obj^{\left( t \right)} = \sum\limits_{i = 1}^{n} {\left[ {g_{i} f_{t} \left( {x_{i} } \right) + \frac{1}{2}h_{i} f_{t}^{2} \left( {x_{i} } \right)} \right] + \Omega \left( {f_{t} } \right)} \hfill \\ = \sum\limits_{i = 1}^{n} {\left[ {g_{i} w_{{q\left( {x_{i} } \right)}} + \frac{1}{2}h_{i} w_{{q\left( {x_{i} } \right)}}^{2} } \right]} + \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} {w_{j}^{2} } \hfill \\ = \sum\limits_{j = 1}^{T} {\left[ {\left( {\sum\limits_{{i \in I_{j} }} {g_{i} } } \right)w_{j} + \frac{1}{2}\left( {\sum\limits_{{i \in I_{j} }} {h_{i} } } \right)w_{j}^{2} } \right]} + \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} {w_{j}^{2} } \hfill \\ = \sum\limits_{j = 1}^{T} {\left[ {\left( {\sum\limits_{{i \in I_{j} }} {g_{i} } } \right)w_{j} + \frac{1}{2}\left( {\sum\limits_{{i \in I_{j} }} {h_{i} } + \lambda } \right)w_{j}^{2} } \right]} + \gamma T . \hfill \\ \end{gathered} $$
(18)
Therefore, letting \(G_{j} = \sum_{i \in I_{j} } g_{i}\) and \(H_{j} = \sum_{i \in I_{j} } h_{i}\), the problem is transformed into finding the extreme value of a quadratic function in each \(w_{j}\). That is, we must find the value of \(w_{j}\) that minimizes Eq. (18). Solving this quadratic minimization yields the optimal \(w_{j}^{*}\) and the minimum value of the objective function [56] as follows.
$$ w_{j}^{*} = - \frac{{G_{j} }}{{H_{j} + \lambda }},\;Obj = - \frac{1}{2}\sum\limits_{j = 1}^{T} {\frac{{G_{j}^{2} }}{{H_{j} + \lambda }}} + \gamma T $$
(19)
Compared to the GBDT model, the XGBoost model adds the regularization term to the objective function of the credit risk assessment model to prevent the model from overfitting. At the same time, Taylor expansion is used to optimize the objective function to find the best segmentation point in the CART regression tree. Therefore, the constructed credit risk assessment model has higher accuracy and better fitting performance than the other models.
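For illustration, the following minimal sketch (not from the paper; it assumes NumPy and the binary logistic loss) computes the gradients \(g_i\) and Hessians \(h_i\) used in Eq. (16) and the optimal leaf weight and objective contribution of Eq. (19) for the samples falling into one leaf.

```python
import numpy as np

def logistic_grad_hess(y_true, y_pred_raw):
    """First- and second-order derivatives of the binary logistic loss
    with respect to the raw (pre-sigmoid) prediction, as in Eq. (16)."""
    p = 1.0 / (1.0 + np.exp(-y_pred_raw))  # predicted default probability
    g = p - y_true                          # g_i
    h = p * (1.0 - p)                       # h_i
    return g, h

def optimal_leaf(g_leaf, h_leaf, lam=1.0, gamma=0.0):
    """Optimal score w_j* and its objective contribution, per Eq. (19)."""
    G, H = g_leaf.sum(), h_leaf.sum()
    w_star = -G / (H + lam)
    obj = -0.5 * G ** 2 / (H + lam) + gamma
    return w_star, obj

# Toy check: five samples currently assigned to one leaf
y = np.array([1, 0, 1, 1, 0])
raw = np.zeros(5)                  # raw predictions from the previous round
g, h = logistic_grad_hess(y, raw)
print(optimal_leaf(g, h, lam=1.0))
```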

PSO-XGBoost model

The parameters of the XGBoost model are often tuned manually, which lengthens the search time and raises the computational cost. When the PSO algorithm is used to optimize the parameters of the XGBoost model instead, each candidate parameter vector is encoded as a particle in the search space. The PSO algorithm then searches for the optimal XGBoost parameters within a fixed number of iterations, so as to find the best configuration of the XGBoost model. Thus, we integrate the PSO algorithm into the parameter optimization of the XGBoost model to form the PSO-XGBoost model, which converges quickly and achieves higher precision at lower cost. The steps of the PSO-XGBoost model are shown in Fig. 5.
We must first determine the parameters to be optimized for the XGBoost model. As the accuracy of the XGBoost model is important, three parameters are selected for optimization, i.e., the learning rate, maximum depth of the tree, and the sample weight of the minimum leaf node. Thus, the dimension of the particle swarm space in the PSO algorithm is 3. Next, the maximum number of iterations, learning factor, inertia weight, and the number of particles N in the PSO must be determined. Finally, the PSO-XGBoost model is constructed, and the predicted error rate is taken as the fitness of the PSO algorithm, that is, the calculated error rate function is taken as the fitness function.
The next steps are to initialize the entire particle swarm in the three-dimensional space (including each particle's position and velocity) and to compute the error rate of each particle according to the error rate function. The local and global optimal values of the entire particle swarm are obtained by comparison. We then check whether the termination condition is met (i.e., whether the maximum number of iterations has been reached). If it is not met, the velocity and position of each particle are updated, the error rate of each updated particle is computed with the error rate function, and each particle's error rate is compared with the current local and global optimal values. If the error rate is less than an optimal value, that optimal value is replaced with the current error rate; otherwise, the optimal value is kept. The iteration continues in this way until the termination condition is satisfied, at which point the minimum error rate and the corresponding parameter values are output. From these optimal values, the best parameter values of the model are known, and the PSO-XGBoost model can then be used to assess the credit risk of the personal auto loans.
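A minimal sketch of this search loop is given below. It is not the authors' implementation; it assumes the Python xgboost and scikit-learn packages, a pre-split X_train/y_train/X_test/y_test, and illustrative PSO settings (10 particles, 20 iterations, inertia weight 0.7, learning factors 1.5).

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Search space: learning_rate, max_depth, min_child_weight (illustrative bounds)
LOWER = np.array([0.01, 3.0, 1.0])
UPPER = np.array([0.30, 10.0, 10.0])

def error_rate(position, X_train, y_train, X_test, y_test):
    """Fitness function: test-set error rate of XGBoost with the encoded parameters."""
    lr, depth, mcw = position
    model = xgb.XGBClassifier(learning_rate=float(lr),
                              max_depth=int(round(depth)),
                              min_child_weight=float(mcw),
                              n_estimators=100)
    model.fit(X_train, y_train)
    return 1.0 - accuracy_score(y_test, model.predict(X_test))

def pso_xgboost(X_train, y_train, X_test, y_test,
                n_particles=10, n_iter=20, w=0.7, c1=1.5, c2=1.5):
    dim, rng = 3, np.random.default_rng(0)
    pos = rng.uniform(LOWER, UPPER, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest_pos = pos.copy()
    pbest_err = np.array([error_rate(p, X_train, y_train, X_test, y_test) for p in pos])
    gbest_pos = pbest_pos[pbest_err.argmin()].copy()
    gbest_err = pbest_err.min()

    for _ in range(n_iter):                     # termination: maximum number of iterations
        r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = np.clip(pos + vel, LOWER, UPPER)  # keep particles inside the search space
        for i in range(n_particles):
            err = error_rate(pos[i], X_train, y_train, X_test, y_test)
            if err < pbest_err[i]:              # update local optimum
                pbest_err[i], pbest_pos[i] = err, pos[i]
            if err < gbest_err:                 # update global optimum
                gbest_err, gbest_pos = err, pos[i].copy()
    return gbest_pos, gbest_err
```

The returned gbest_pos holds the tuned learning_rate, max_depth, and min_child_weight, which can then be used to train the final PSO-XGBoost classifier.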

Analysis of credit risk assessment of personal auto loans

(1) Data set partitioning
From Sect. “Data preparation and preprocessing”, 186,630 auto loan records are available for the empirical analysis; they are divided into a training set and a test set in a 7:3 ratio, as shown in Table 9.
Table 9
Information on data set

Data set   Number of features   Number of samples   Positive/negative ratio   Missing value
Training   34                   130,641             1.00                      NA
Test       34                   55,989              1.00                      NA
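As a minimal sketch of this 7:3 partition (assuming pandas and scikit-learn, with a hypothetical file name and a hypothetical 0/1 label column named "default" in the balanced data set):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and label-column names for the balanced data set
df = pd.read_csv("auto_loan_balanced.csv")
X = df.drop(columns=["default"])
y = df["default"]

# 7:3 stratified split, matching the Training/Test rows of Table 9
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(len(X_train), len(X_test))   # roughly 130,641 and 55,989 records
```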
(2) Parameter optimization
To improve the execution and classification performance of the XGBoost model, a parameter adjustment of the XGBoost model is required. For this, the PSO algorithm is used to optimize the parameters of the XGBoost model.
When forming the XGBoost model, three parameters need to be adjusted, i.e., the learning rate (learning_rate), the maximum depth of the tree (max_depth), and the sample weight of the minimum leaf node (min_child_weight), to improve the accuracy of the XGBoost model. The three parameters are described as follows:
(i)
Learning rate: the shrinkage step size applied during the updating process to prevent overfitting. After each boosting round, the contribution of the newly added tree is scaled down by the learning rate, which makes the update more conservative. The learning rate and the maximum number of iterations jointly determine the fitting effect of the algorithm, and shrinking the weights at each step also improves the robustness of the model.
 
(ii)
Maximum depth of the tree: if no specific value is entered for the maximum depth of the decision trees in the XGBoost model, a default value is assumed and the growth of the subtrees is not otherwise constrained when they are created. However, when the model sample has a large amount of data and many features, the depth needs to be limited so as to avoid overfitting.
 
(iii)
Sample weight of the minimum leaf node: this is similar to the parameter min_child_leaf in the gradient boosting tree algorithm. The parameter min_child_leaf in the gradient boosting tree algorithm counts the minimum number of samples in a leaf, while min_child_weight in the XGBoost model is the minimum sum of sample weights in a leaf; both are used to avoid overfitting. When the value of min_child_weight is large, the model avoids learning overly local patterns, so this parameter can be adjusted to prevent the model from overfitting.
 
The three parameters, learning rate (learning_rate), maximum depth of the tree (max_depth), and sample weight of the minimum leaf node (min_child_weight), are adjusted by the PSO algorithm on the 130,641 records of the training set, so as to optimize the model and improve the accuracy of its predictions. In the iterative optimization of the three parameters, the error rate of the XGBoost model is used as the fitness function of the PSO algorithm. Figure 6 shows how the error rate of the model varies with the number of iterations.
Figure 6 shows that the PSO algorithm continues to optimize the parameters as the number of iterations increases, with a decreasing error rate of the model. When a stationary value is reached, the optimal value of the parameter is found and the PSO-XGBoost model has a minimum error rate.

Performance evaluation of PSO-XGBoost model

To evaluate model performance, the PSO-XGBoost model is compared with the XGBoost, RF, and LR models [60]. As the problem studied is a binary (two-category) classification problem, we use evaluation indexes such as accuracy, precision, complexity, the ROC curve, and the AUC value to evaluate the models.
(1) Confusion matrix
Expanding on Table 5, we provide a confusion matrix to visualize the model’s outcome (see Table 10).
Table 10
Confusion matrix

Actual category                Predicted: Positive (Default: “1”)   Predicted: Negative (Non-default: “0”)
Positive (Default: “1”)        TP                                   FN
Negative (Non-default: “0”)    FP                                   TN
From the confusion matrix in Table 10, there are four possibilities for the results predicted by the model. The first is the true positive (TP): the borrower has defaulted, and the model also predicts the borrower to belong to the high-risk group that is likely to breach the contract, so the agency should be highly alert to such borrowers. The second is the false negative (FN): the customer has in fact defaulted, but the model wrongly predicts the customer to belong to the low-risk group; approving such customers causes large financial losses to the auto financing firms. The third is the false positive (FP): the borrower has no default record, yet the model predicts the customer to be a high-risk borrower likely to default on the loan; such borrowers are filtered out by the institution and potential revenue is lost. A similar argument applies to the fourth category, the true negative (TN).
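As a small illustration (assuming scikit-learn and a fitted classifier's predictions y_pred for the test labels y_test), the four cells of Table 10 can be extracted as follows.

```python
from sklearn.metrics import confusion_matrix

# labels=[1, 0] orders rows/columns as in Table 10: default first, non-default second
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])
tp, fn = cm[0]   # actual default:     predicted default / predicted non-default
fp, tn = cm[1]   # actual non-default: predicted default / predicted non-default
print(tp, fn, fp, tn)
```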
In this paper, the data set is divided into a training and a test set, and the PSO-XGBoost, XGBoost, RF, and LR models are used for training and prediction, as shown by the confusion matrix of Table 11.
Table 11
Model comparison by confusion matrix

Actual \ Predicted    PSO-XGBoost         XGBoost             RF                  LR
                      1        0          1        0          1        0          1        0
1 (Defaulted)         21,392   6,704      20,745   7,351      20,977   7,119      19,369   8,727
0 (No-default)        2,753    25,140     2,794    25,099     3,702    24,191     7,699    20,194
(2) Evaluation indexes of model performance
(i) Accuracy and error
Accuracy is the proportion of the number of correctly predicted samples in the total number of samples [24, 34, 61], expressed as:
$$ Accuracy = \frac{TP + TN}{{TP + FN + FP + TN}} $$
(20)
Error is the proportion of the number of incorrectly predicted samples in the total number of samples [24, 34, 61], expressed as:
$$ Error = \frac{FN + FP}{{TP + FN + FP + TN}} $$
(21)
The higher the accuracy and the smaller the error, the better the classifier model performs, and vice versa.
(ii) Precision and Recall
Precision refers to the proportion of true positive samples in the total positive samples judged by the model [24, 34, 61], that is,
$$ \, P = \frac{TP}{{TP + FP}} $$
(22)
Recall refers to the proportion of actual positive samples that are correctly judged as positive by the model [24, 34, 61], that is,
$$ R = \frac{TP}{{TP + FN}} $$
(23)
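As a quick check (plain Python, using the PSO-XGBoost cells of Table 11), plugging the confusion-matrix counts into Eqs. (20)–(23) reproduces the corresponding entries of Table 12 below.

```python
# PSO-XGBoost confusion-matrix counts from Table 11
TP, FN, FP, TN = 21392, 6704, 2753, 25140
total = TP + FN + FP + TN

accuracy  = (TP + TN) / total    # Eq. (20): 0.8311
error     = (FN + FP) / total    # Eq. (21)
precision = TP / (TP + FP)       # Eq. (22): 0.8860
recall    = TP / (TP + FN)       # Eq. (23): 0.7614
print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```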
Using Eqs. (20)-(23), the evaluation indexes of each model are obtained as shown in Table 12.
Table 12
Comparison of evaluation indexes of models

Evaluation index    PSO-XGBoost   XGBoost   RF        LR
Accuracy            0.8311        0.7888    0.8067    0.7066
Precision           0.8860        0.7613    0.8500    0.7156
Recall              0.7614        0.7384    0.7466    0.6894
Time complexity     9 s           5 s       6 s       3 s
Space complexity    77 M          74.3 M    66 M      36 M
It can be seen from Table 12 that the classification accuracy of the XGBoost model is 78.88%, while that of the PSO-XGBoost model is 83.11%, an improvement of 4.23 percentage points. At the same time, the Precision and Recall of the PSO-XGBoost model are better than those of the XGBoost model, indicating that the evaluating performance of the PSO-XGBoost model is superior. The Logistic Regression model has the worst performance among the four models, as all of its evaluation indexes are the lowest. The classification accuracy of the RF model is 80.67%, which the PSO-XGBoost model exceeds by 2.44 percentage points; in terms of Precision and Recall, the PSO-XGBoost model is also better than the RF model. In conclusion, among the four models, the PSO-XGBoost model offers the best performance for the credit risk evaluation of personal auto loans.
(3) Complexity.
The complexity of all the algorithms being compared (PSO-XGBoost, XGBoost, RF, and LR) is measured along two dimensions, time and space. The time dimension refers to the running time taken to execute the algorithm, i.e., its time complexity; the space dimension refers to the amount of memory required to run the algorithm, i.e., its space complexity. The measured time and space complexity of each algorithm are shown in Table 12. From these results, the memory required by the proposed PSO-XGBoost is only 77 M, which is slightly higher than that of the other methods but still modest in absolute terms. Similarly, the running time of the proposed PSO-XGBoost is only 9 s, which shows that the time complexity of the proposed algorithm is not high.
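One simple way to obtain such figures (a sketch only, and not necessarily the measurement protocol used for Table 12) is to time the training-and-prediction call and track peak Python memory allocations:

```python
import time
import tracemalloc

def measure(run):
    """Return (result, elapsed seconds, peak MiB) for a callable such as
    lambda: model.fit(X_train, y_train).predict(X_test)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = run()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024 ** 2
```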
(4) ROC curve and AUC value.
The ROC curve is obtained by ranking the samples according to the predicted scores of the learner and then moving the classification threshold through this ranking, treating each sample in turn as the cut-off for the positive class. At each threshold, the true positive rate (TPR, sensitivity) and the false positive rate (FPR) are computed [24, 34, 61] as
$$ TPR = \frac{TP}{{TP + FN}},\;FPR = \frac{FP}{{FP + TN}} $$
When plotting the ROC curve, the FPR is taken as the horizontal axis and the TPR as the vertical axis. The AUC is the area under the ROC curve. When the ROC curve of one learner completely encloses the ROC curve of another, it can safely be concluded that the first learner performs better than the second. When the two curves intersect, however, a more reasonable judgment is to compare their AUC values: the higher the AUC value, the better the performance of the learner.
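A minimal plotting sketch (assuming scikit-learn and matplotlib, with y_score holding the predicted default probabilities for the test set) is:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# e.g. y_score = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_score)   # FPR on the x-axis, TPR on the y-axis
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label=f"PSO-XGBoost (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```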
Figure 7 shows the ROC curves and AUC values of the PSO-XGBoost, XGBoost, RF, and LR models. According to the comparison rule for ROC curves [45–47], the closer the ROC curve is to the upper left corner, the better the evaluation performance. From Fig. 7, the ROC curve of the PSO-XGBoost model is the closest to the top left corner and covers the ROC curves of the other three models. Furthermore, the AUC value of the PSO-XGBoost model is 0.90, higher than the AUC values of the other three models. Hence, the PSO-XGBoost model has the best performance and the highest prediction accuracy for the credit risk evaluation of personal auto loans, which confirms the results of the earlier evaluation indexes.

Further analysis of model performance

In this subsection, to judge the performance of the proposed model more fully and to support its generalization ability, a further experiment is carried out as a comparative analysis. The data set selected for this experiment comes from a Chinese vehicle loan agency and is publicly available on the Kaggle platform at https://www.kaggle.com/xiaochou/auto-loan-default-risk.

Data processing and feature selection

The selected data set contains 199,717 customer loan records, of which 164,289 records belong to customers who have not defaulted and 35,428 to customers who have defaulted. The data set contains 54 indexes: 53 of them are information indexes used to predict customer loan default, known as independent variables, and mainly reflect the customer's basic personal information, economic status, and credit record; the remaining index, Loan_default, is the dependent variable marking whether a customer has defaulted. The decision-making task is to establish a risk identification model to predict which borrowers may default.
First, data processing and transformation are carried out for the categorical values, abnormal values, and missing values in the data set. The credit risk assessment indexes of auto loans in this data set are then preliminarily screened, and 42 independent variable indexes and 1 dependent variable index are retained. Because the default records differ greatly in number from the non-default records, with an imbalance ratio of nearly five, the data set must be rebalanced into a balanced auto loan data set. The Smote-Tomek Link algorithm proposed in Sect. “Smote-Tomek Link algorithm” is used for this imbalance processing, so as to improve the prediction accuracy for the minority class and the overall classification effect on the imbalanced data. Finally, the improved Filter-Wrapper feature selection method proposed in Sect. “Improved Filter-Wrapper feature selection method” is applied for feature selection, and 30 features are selected as the optimal feature subset, as shown in Table 13.
Table 13
The optimal feature subset

Index   Label                            Index   Label
Z1      main_account_loan_no             Z16     Driving_flag
Z2      main_account_active_loan_no      Z17     passport_flag
Z3      main_account_overdue_no          Z18     credit_score
Z4      main_account_outstanding_loan    Z19     main_account_monthly_payment
Z5      main_account_sanction_loan       Z20     sub_account_monthly_payment
Z6      main_account_disbursed_loan      Z21     last_six_month_new_loan_no
Z7      sub_account_loan_no              Z22     last_six_month_defaulted_no
Z8      sub_account_active_loan_no       Z23     average_age
Z9      sub_account_overdue_no           Z24     credit_history
Z10     sub_account_outstanding_loan     Z25     enquirie_no
Z11     sub_account_sanction_loan        Z26     loan_to_asset_ratio
Z12     sub_account_disbursed_loan       Z27     total_account_loan_no
Z13     disbursed_amount                 Z28     sub_account_inactive_loan_no
Z14     asset_cost                       Z29     total_inactive_loan_no
Z15     ltv                              Z30     main_account_inactive_loan_no
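For the rebalancing step described above, an off-the-shelf analogue of the integrated Smote-Tomek Link procedure is available in the imbalanced-learn package. The sketch below is an illustration rather than the exact implementation used in this paper; it assumes X holds the 42 retained independent variables and y the Loan_default label.

```python
from collections import Counter
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=42)        # SMOTE over-sampling followed by Tomek-link cleaning
X_res, y_res = smt.fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # roughly 164,289 / 35,428 -> near-balanced classes
```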

The decision-making process of risk assessment

(1) Data set partitioning
From Sect. “Data processing and feature selection”, 199,717 auto loan records can be used for the empirical analysis, and they are divided into a training set and test set in a 7:3 ratio, as shown in Table 14.
Table 14
Information on data set

Data set   Number of features   Number of samples   Positive/negative ratio   Missing value
Training   30                   139,802             1.00                      NA
Test       30                   59,915              1.00                      NA
(2) Parameter optimization
The PSO algorithm is used to optimize the three parameters of the XGBoost model, i.e., learning_rate, max_depth, and min_child_weight. In the iterative optimization of these three parameters, the error rate of the XGBoost model is used as the fitness function of the PSO algorithm, and the relationship between the number of iterations and the error rate of the model is shown in Fig. 8.
Figure 8 shows the variation trend of the error rate of the PSO-XGBoost model. As the number of iterations increases, the PSO algorithm continues to optimize the parameters and the error rate of the model decreases. When a stationary value is reached, the optimal parameter values are found and the PSO-XGBoost model attains its minimum error rate.
(3) Performance evaluation
To verify the performance of the PSO-XGBoost model on this data set, we compare the proposed PSO-XGBoost model with the comparable XGBoost, RF, and LR models, using the performance evaluation indexes of accuracy, precision, recall, complexity, ROC curve, and AUC value. The evaluation indexes of each model are shown in Table 15.
Table 15
Comparison of evaluation indexes of models

Evaluation index    PSO-XGBoost   XGBoost   RF        LR
Accuracy            0.7805        0.7458    0.7733    0.6527
Precision           0.7827        0.7498    0.7645    0.6418
Recall              0.7745        0.7353    0.7676    0.6853
Time complexity     24 s          12 s      13 s      4 s
Space complexity    116.2 M       110.4 M   111.7 M   54.4 M
According to the evaluation indexes in Table 15, the Accuracy, Precision, and Recall of the PSO-XGBoost model are all better than those of the other three models. Thus, the PSO-XGBoost model is again superior in classification performance and classification effect. In addition, the time and space complexity of the proposed PSO-XGBoost model are not high, which shows that the proposed model is effective and practical to operate.
In addition, the ROC curves and AUC values of the four compared models (PSO-XGBoost, XGBoost, RF, and LR) are plotted in the same figure, as shown in Fig. 9.
As can be seen from Fig. 9, the ROC curve of the PSO-XGBoost model is the closest to the upper left corner, followed by those of the RF, XGBoost, and LR models. The AUC value of the PSO-XGBoost model is 0.86, the highest of the four models. From both the ROC curves and the AUC values, it can be concluded that the PSO-XGBoost model presented in this paper has the best performance and the highest prediction accuracy for the credit risk evaluation of personal auto loans, which is consistent with the results in Sect. “Performance evaluation of PSO-XGBoost model”. Notably, Carrington et al. [63] pointed out that, in classification and diagnostic tests, the ROC curve and AUC describe how an adjustable threshold trades off two types of errors, false positives and false negatives, but that they are only partially meaningful when applied to imbalanced data. In this sense, if the ROC curve and AUC are used, it is best to first convert the imbalanced data set into a balanced one; otherwise, alternatives such as the concordant partial AUC and the partial c statistic for ROC data proposed by Carrington et al. [63] are good choices.

Conclusion

Seeking to address the problem of credit risk assessment for personal auto loans, this paper studies the feature selection method of credit risk assessment, and constructs a machine learning based credit risk assessment mechanism. Two data sets of personal auto loans on the Kaggle platform are selected as the research samples. Noting the imbalanced data set, the Smote-Tomek Link algorithm is proposed to achieve a balanced data set. An improved Filter-Wrapper feature selection method is then proposed to select the credit risk assessment indexes of the personal auto loans. A PSO-XGBoost model for the credit risk assessment is constructed and an empirical analysis is made.
Moreover, the proposed PSO-XGBoost model is compared with the XGBoost, RF, and LR models using performance evaluation indexes such as accuracy, precision, complexity, ROC curve, and AUC value. In the empirical analysis on the first data set given in Sect. “Data preparation and preprocessing”, the comparison results show that the classification accuracy of the PSO-XGBoost model is 83.11%, which is 4.23, 2.44, and 12.45 percentage points higher than that of the XGBoost, RF, and LR models, respectively; in terms of Precision and Recall, the PSO-XGBoost model is also better than the XGBoost, RF, and LR models; and its AUC value of 0.90 is higher than those of the three comparison models. The results of the second empirical analysis on the data set introduced in Sect. “Further analysis of model performance” likewise show the PSO-XGBoost model to be superior to the other models in classification performance and classification effect. This validates the choice of the PSO-XGBoost model for the credit risk assessment of personal auto loans.
Because the data set selected in this paper contains two-category data, the problem discussed here is a two-category credit risk assessment of personal auto loans. In the actual field of personal auto lending, however, loan customers can be classified into multiple credit levels so that auto financing institutions can apply differentiated strategies to different customers and thereby improve their core competitiveness. Therefore, collecting multi-category personal auto loan data and extending the two-category model established in this paper to a multi-category credit risk assessment model are directions for future research.

Acknowledgements

We would like to thank the editor and the anonymous reviewers for their helpful comments.

Declarations

Conflict of interest

The authors declare that they have no competing interests.
No ethical approval or patient consent to participate was required for this study.
The authors confirm that the final version of the manuscript has been reviewed, approved, and consented for publication by all authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
1. Chen Y, Lawell C, Wang YS (2020) The Chinese automobile industry and government policy. Research in Transportation Economics 100849
2. Walks A (2018) Driving the poor into debt? Automobile loans, transport disadvantage, and automobile dependence. Transp Policy 65:137–149
3. Kang YX, Mao SH, Zhang YH (2022) Fractional time-varying grey traffic flow model based on viscoelastic fluid and its application. Transportation Research Part B: Methodological 157:149–174
4. Wells P, Wang XB, Wang LQ, Liu HK, Orsato R (2020) More friends than foes? The impact of automobility-as-a-service on the incumbent automotive industry. Technol Forecast Soc Chang 154:119975
5. Gao MY, Yang HL, Xiao QZ, Goh M (2021) A novel method for carbon emission forecasting based on Gompertz's law and fractional grey model: Evidence from American industrial sector. Renewable Energy 181:803–819
7. Li B, Dong XJ, Wen JH (2022) Cooperative-driving control for mixed fleets at wireless charging sections for lane changing behaviour. Energy 243:122976
8. Wu DM, Fang M, Wang Q (2018) An empirical study of bank stress testing for auto loans. J Financ Stab 39:79–89
9. Xiao QZ, Chen L, Xie M, Wang C (2021) Optimal contract design in sustainable supply chain: Interactive impacts of fairness concern and overconfidence. Journal of the Operational Research Society 72(7):1505–1524
10. Chen L, Nan GF, Li MQ, Feng B, Liu QR (2021) Manufacturer's online selling strategies under spillovers from online to offline sales. Journal of the Operational Research Society, forthcoming
11. Duan HQ, Snyder T, Yuan WC (2018) Corruption, economic development, and auto loan delinquency: Evidence from China. J Econ Bus 99:28–38
12. Li P, Rao CJ, Goh M, Yang ZQ (2021) Pricing strategies and profit coordination under a double echelon green supply chain. J Clean Prod 278:123694
13. Thabtah F, Kamalov F, Hammoud S, Shahamiri SR (2020) Least loss: A simplified filter method for feature selection. Inf Sci 534:1–15
14. Aremu OO, Cody RA, Hyland-Wood D, McAree PR (2020) A relative entropy based feature selection framework for asset data in predictive maintenance. Comput Ind Eng 145:106536
15. Wei GF, Zhao J, Feng YL, He AX, Yu J (2020) A novel hybrid feature selection method based on dynamic feature importance. Appl Soft Comput 93:106337
16. Shah SMS, Shah FA, Hussain SA, Batool S (2020) Support vector machines-based heart disease diagnosis using feature subset, wrapping selection and extraction methods. Comput Electr Eng 84:106628
17. Lee J, Jeong JY, Jun CH (2020) Markov blanket-based universal feature selection for classification and regression of mixed-type data. Expert Syst Appl 158:113398
18. Gholami J, Pourpanah F, Wang XZ (2020) Feature selection based on improved binary global harmony search for data classification. Appl Soft Comput 93:106402
19. Kou G, Yang P, Peng Y, Xiao F, Chen Y, Alsaadi FE (2020) Evaluation of feature selection methods for text classification with small datasets using multiple criteria decision-making methods. Appl Soft Comput 86:105836
20. Huang H, Liu H (2020) Feature selection for hierarchical classification via joint semantic and structural information of labels. Knowl-Based Syst 195:105655
21. Wang XH, Zhang Y, Sun XY (2020) Multi-objective feature selection based on artificial bee colony: An acceleration approach with variable sample size. Appl Soft Comput 88:106041
22. Kira K, Rendell LA (1992) The feature selection problem: Traditional methods and a new algorithm. Proc. of 10th National Conference on Artificial Intelligence, Canada: AAAI Press, pp 129–134
23. Ma JB, Gao XY (2020) A filter-based feature construction and feature selection approach for classification using genetic programming. Knowl-Based Syst 196:105806
24. Gokalp O, Tasci E, Ugur A (2020) A novel wrapper feature selection algorithm based on iterated greedy metaheuristic for sentiment classification. Expert Syst Appl 146:113176
25. Khammassi C, Krichen S (2020) A NSGA2-LR wrapper approach for feature selection in network intrusion detection. Comput Netw 172:107183
26. González J, Ortega J, Damas M, Martín-Smith P, Gan JQ (2019) A new multi-objective wrapper method for feature selection – Accuracy and stability analysis for BCI. Neurocomputing 333:407–418
27. Mafarja M, Mirjalili S (2018) Whale optimization approaches for wrapper feature selection. Appl Soft Comput 62:441–453
28. Rajab KD (2017) New hybrid features selection method: a case study on websites phishing. Security & Communication Networks 2:1–10
29. Solorio-Fernández S, Carrasco-Ochoa JA, Martínez-Trinidad JF (2016) A new hybrid filter–wrapper feature selection method for clustering based on ranking. Neurocomputing 214:866–880
30. Rao CJ, Lin H, Liu M (2020) Design of comprehensive evaluation index system for P2P credit risk of "three rural" borrowers. Soft Comput 24(15):11493–11509
31. Lin YP, Chen LL, Zou JZ (2019) Application of hybrid feature selection algorithm based on particle swarm optimization in fatigue driving. Comput Eng 45(2):278–283
32. Durand D (1941) Risk elements in consumer instalment financing, technical edition. National Bureau of Economic Research 218(1):237
33. Yu LA, Wang SY (2009) A kernel principal component analysis based least squares fuzzy support vector machine methodology with variable penalty factors for credit classification. Journal of System Science and Mathematical Science 29(10):1311–1326
34. Rao CJ, Liu M, Goh M, Wen JH (2020) 2-stage modified random forest model for credit risk assessment of P2P network lending to "Three Rurals" borrowers. Appl Soft Comput 95:106570
35. Lanzarini LC, Monte AV, Bariviera AF, Santana PJ (2017) Simplifying credit scoring rules using LVQ + PSO. Kybernetes 46(1):8–16
38. Liu C, Xie J, Zhao Q, Xie QW, Liu CQ (2019) Novel evolutionary multi-objective soft subspace clustering algorithm for credit risk assessment. Expert Syst Appl 138:112827
39. Luo J, Yan X, Tian Y (2020) Unsupervised quadratic surface support vector machine with application to credit risk assessment. Eur J Oper Res 280:1008–1017
40. Blagus R, Lusa L (2013) SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 14(1):1–16
41. Roshan SE, Asadi S (2020) Improvement of bagging performance for classification of imbalanced datasets using evolutionary multi-objective optimization. Eng Appl Artif Intell 87:103319
42. Xie YX, Peng LZ, Chen ZX, Yang B, Zhang HL, Zhang HB (2019) Generative learning for imbalanced data using the Gaussian mixed model. Appl Soft Comput 79:439–451
43. Hong WH, Yap JH, Selvachandran G, Thong PH, Son LH (2021) Forecasting mortality rates using hybrid Lee-Carter model, artificial neural network and random forest. Complex & Intelligent Systems 7:163–189
45. Gan D, Shen J, An B, Xu M, Liu N (2020) Integrating TANBN with cost sensitive classification algorithm for imbalanced data in medical diagnosis. Comput Ind Eng 140:106266
46. Li YS, Chi H, Shao XY, Qi ML, Xu BG (2020) A novel random forest approach for imbalance problem in crime. Knowl-Based Syst 195:105738
47. Sharma D, Willy C, Bischoff J (2021) Optimal subset selection for causal inference using machine learning ensembles and particle swarm optimization. Complex & Intelligent Systems 7:41–59
48. He YY, Zhou JH, Lin YP, Zhu TF (2019) A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data. Comput Biol Chem 80:121–127
49. Liao K, Fu J, Yang W (2010) Modified relief algorithm for radar HRRP target recognition. Journal of Electronic Measurement and Instrument 24(9):831–836
50. Sun GL, Li JB, Dai J, Song ZC, Lang F (2018) Feature selection for IoT based on maximal information coefficient. Futur Gener Comput Syst 89:606–616
51. Zhang YS, Yang C, Yang AR, Xiong C, Zhou XG, Zhang ZG (2015) Feature selection for classification with class-separability strategy and data envelopment. Neurocomputing 166(10):172–184
52. Fu PH, Zhan ZG, Wu CJ (2013) Efficiency analysis of Chinese road systems with DEA and order relation analysis method: Externality concerned. Procedia Soc Behav Sci 966:1227–1238
53. Rao CJ, Gao Y (2022) Evaluation mechanism design for the development level of urban-rural integration based on an improved TOPSIS method. Mathematics 10:380
54. Mercadier M, Lardy JP (2019) Credit spread approximation and improvement using random forest regression. Eur J Oper Res 277(1):351–365
55. Wei J, Chen H (2020) Determining the number of factors in approximate factor models by twice K-fold cross validation. Econ Lett 191:109149
56. Nobre J, Neves RF (2019) Combining principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Syst Appl 125:181–194
57. Zou J, Deng Q, Zheng JH, Yang SX (2020) A close neighbor mobility method using particle swarm optimizer for solving multimodal optimization problems. Inf Sci 519:332–347
60. Zhang CX, Xu S, Zhang JS (2019) A novel variational Bayesian method for variable selection in logistic regression models. Comput Stat Data Anal 133:1–19
62. Rao CJ, He YW, Wang XL (2021) Comprehensive evaluation of non-waste cities based on two-tuple mixed correlation degree. Int J Fuzzy Syst 23:369–391
63. Carrington AM, Fieguth PW, Qazi H et al (2020) A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Med Inform Decis Mak 20:4