
Open Access 01.04.2024

Deep reinforcement learning based on balanced stratified prioritized experience replay for customer credit scoring in peer-to-peer lending

Authors: Yadong Wang, Yanlin Jia, Sha Fan, Jin Xiao

Published in: Artificial Intelligence Review | Issue 4/2024


Abstract

In recent years, deep reinforcement learning (DRL) models have been successfully utilised to solve various classification problems. However, these models have never been applied to customer credit scoring in peer-to-peer (P2P) lending. Moreover, the imbalanced class distribution in experience replay, which may affect the performance of DRL models, has rarely been considered. Therefore, this article proposes a novel DRL model, namely a deep Q-network based on a balanced stratified prioritized experience replay (DQN-BSPER) model, for customer credit scoring in P2P lending. Firstly, customer credit scoring is formulated as a discrete-time finite-Markov decision process. Subsequently, a balanced stratified prioritized experience replay technology is presented to optimize the loss function of the deep Q-network model. This technology can not only balance the numbers of minority and majority experience samples in the mini-batch by using stratified sampling technology but also select more important experience samples for replay based on the priority principle. To verify the model performance, four evaluation measures are introduced for the empirical analysis of two real-world customer credit scoring datasets in P2P lending. The experimental results show that the DQN-BSPER model can outperform four benchmark DRL models and seven traditional benchmark classification models. In addition, the DQN-BSPER model with a discount factor γ of 0.1 has excellent credit scoring performance.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

In recent years, P2P lending companies such as Lending Club, Kiva, and Zopa have developed rapidly, and P2P lending has gradually become an important channel for small loans and private financing. However, due to network virtualisation and imperfect monitoring, P2P lending entails greater credit risks than traditional bank lending (Du et al. 2020). Customer credit scoring (CCS) is an effective tool for assessing credit risk in P2P lending. Generally, CCS can be regarded as a binary classification problem in which customer credit is divided into two categories: ‘good’ and ‘bad’.
At present, there are three types of credit scoring methods: expert judgement, statistical analysis, and machine learning (Baesens et al. 2003; Dastile et al. 2020; Xia et al. 2020). Common expert judgement methods include 5C and 5P, which rely on expert experience to evaluate customer credit. Statistical analysis methods have been proposed to improve the efficiency of CCS, including linear discriminant analysis (LDA; Altman 1968) and logistic regression (LR; Hosmer et al. 2013) methods. Machine learning methods mainly include naive Bayes (NB; Rish 2001), decision tree (DT; Yeo and Grant 2018), k-nearest neighbour (KNN; Wauters and Vanhoucke 2017), support vector machine (SVM; Trafalis and Gilbert 2006), and deep neural network (DNN; Gunnarsson et al. 2021). The experimental results of some studies have shown that machine learning methods often achieve better credit scoring performance than statistical analysis methods (Blumenstock et al. 2022; Dumitrescu et al. 2022; Lessmann et al. 2015; Petrides et al. 2020; Serrano-Cinca and Gutiérrez-Nieto 2016).
With the deepening of research, it has been found that the class distribution of credit scoring datasets in P2P lending is often highly imbalanced; that is, there are far fewer samples of customers with bad credit than of customers with good credit, which may lead to poor classification accuracy in credit scoring for customers with bad credit (Crone and Finlay 2012; Veganzones and Séverin 2018; Xiao et al. 2021). To solve this problem, resampling methods [random oversampling (ROS), random undersampling (RUS), etc.] have been proposed to balance the class distribution of the training set before modelling (Marqués et al. 2013; Protopapadakis et al. 2019).
When using traditional classification models for CCS in P2P lending, existing studies assume that the samples in the dataset are independent and identically distributed (Borgonovo and Smith 2011; Lopez-Martin et al. 2020; Óskarsdóttir et al. 2019). However, for real-world credit scoring in P2P lending, a large number of samples are generated through the dynamic interaction between customers and financial institutions or platforms, and these samples are not strictly independent and identically distributed. Deep reinforcement learning (DRL) models provide a new way of addressing this problem: they are dynamic decision-making methods based on the Markov decision process (MDP; Mnih et al. 2015). Thus far, DRL models have been successfully used in various fields, including management strategy optimisation (Liu et al. 2020; Schnaubelt 2022; van Heeswijk 2022), energy management (Sun 2020), and autopilot technology (Wurman et al. 2022).
Among the various DRL models, the deep Q-network (DQN) model (Mnih et al. 2013) is the most commonly used. Therefore, some scholars have attempted to employ DQN models to solve classification problems (Chatterjee and Namin 2019; Ding et al. 2019; Li and Xu 2020; Martinez et al. 2020; Wang et al. 2022; Zhao et al. 2016). The core idea is firstly to formulate the classification problem as an MDP and then to use experience replay technology to build the mini-batch in the training set to optimize the loss function of the DQN model dynamically. Finally, the optimized DQN model is applied to classify the samples in the test set. In particular, DQN models have gradually shown their advantages in binary classification (Lin et al. 2020; Lopez-Martin et al. 2020; Martinez et al. 2020). The most commonly used experience replay technology in DQN models is random experience replay (RER), which uses a random sampling method to select experience samples from the buffer to build the mini-batch. However, RER technology has difficulty converging DQN models in complex scenarios. Therefore, stratified experience replay (SER) technology (Chen et al. 2018) and prioritized experience replay (PER) technology (Schaul et al. 2015) have been developed to improve the convergence performance of DQN models.
The previous studies have significantly contributed to the application of DRL models in classification tasks. However, the methods employed have limitations. First, mini-batches have mainly been constructed based on RER, SER, or PER to optimize the loss functions of DQN models in the previous research. Although these experience replay technologies can improve the convergence performance of DQN models to some extent, balancing the numbers of minority and majority experience samples in the mini-batch is difficult, which may affect the classification performance of DQN models, especially in imbalanced classification. Second, when applying DQN models for classification, most scholars have designed the value of the discount factor according to common DRL environments (such as autopilot, robot control, and computer games), without considering its effect on the credit scoring performance of DQN models. If an inappropriate discount factor value is designed, the DQN model performance may worsen. Third, the DQN-IRF model proposed by Lin et al. (2020) addresses classification tasks with imbalanced class distributions. However, this model is mainly applied to image and text classification. The features of these samples for classification are very different from those of samples for CCS, which can lead to the poor credit scoring performance of the DQN-IRF model.
To solve the above problems, we constructed a DQN based on the balanced stratified prioritized experience replay (DQN-BSPER) model. Firstly, we formulated CCS as a discrete-time finite MDP according to the characteristics of credit scoring. Then, we developed balanced stratified prioritized experience replay (BSPER) technology to improve the experience replay process of DQN models to optimize the model loss function. To verify the model performance, we introduced four evaluation measures (EMs) for empirical analysis on two real-world CCS datasets in P2P lending with an imbalanced class distribution. Firstly, the effects of the discount factor and proportion of minority and majority experience samples in the stratified priority mini-batch on the CCS performance of the DQN-BSPER model were analysed. Then, we compared and analysed the CCS performance of the DQN-BSPER model and four other benchmark DRL models, namely, a DQN, DQN based on stratified experience replay (DQN-SER), DQN based on prioritized experience replay (DQN-PER), and DQN-IRF, and further compared their convergence performance. Next, the CCS performance of the DQN-BSPER model was statistically compared with those of seven traditional benchmark classification models. Finally, the effects of the imbalanced class distribution on the CCS performance of the DQN-BSPER model were analysed.
The main contributions of this paper are as follows:
(1)
We propose the BSPER technology to improve the experience replay process of the DQN model for CCS in P2P lending. The proposed BSPER technology can not only reduce the impact of the highly imbalanced class distribution on the DQN model performance by balancing the numbers of minority and majority samples in the mini-batch, but also select more important experience samples according to the temporal difference (TD) error to improve the convergence performance of the DQN model.
 
(2)
The effects of the discount factor on the CCS performance of the DQN-BSPER model are analysed, compensating for the fact that previous scholars have often designed the discount factor according to common DRL environments (such as autopilot, robot control, and computer games), which may lead to poor CCS performance.
 
(3)
We verify that the proposed DQN-BSPER model exhibits excellent CCS performance in P2P lending, demonstrating that it could serve as a credit scoring tool in P2P lending for financial institutions.
 
The remainder of this paper is organised as follows. Section 2 provides a literature review. The details of the theoretical background are described in Sect. 3. Section 4 elaborates the discrete-time finite MDP for CCS. Section 5 introduces the process of the DQN-BSPER model in detail. The experimental design is elaborated in Sect. 6. Section 7 presents the experimental results. Finally, conclusions and future work are given in Sect. 8.

2 Literature review

2.1 Customer credit scoring

In actual CCS, banks or financial institutions determine whether to grant loans to customers based on customer credit. Even if customers have different credit grades, the final result is still ‘granting’ or ‘not granting’. Therefore, CCS can be regarded as a binary classification problem.
Currently, CCS models stemming from operations research and artificial intelligence have also become popular, including LR (Hosmer et al. 2013), NB (Rish 2001), DT (Yeo and Grant 2018), KNN (Wauters and Vanhoucke 2017), SVM (Trafalis and Gilbert 2006), and DNN (Gunnarsson et al. 2021) models. For instance, Lessmann et al. (2015) compared the performance of 41 classification models on eight CCS datasets and validated that the DNN model achieved excellent performance. Fernandes and Artes (2016) introduced a measure of the local default risk based on the application of ordinary kriging to logistic credit scoring models. These models achieved better performance on the Brazilian dataset. Li et al. (2020) proposed a recursive Bayes estimator to improve the precision of credit scoring by incorporating the dynamic interaction topology of customers. The experimental results showed that, under the proposed framework, the designed estimator achieved a higher precision than any efficient estimator. Xiao et al. (2021) compared the performance of DNN, LDA, LR, DT, and SVM models. The experimental results on seven CCS datasets showed that the LR model provided the best performance. Wang et al. (2022) proposed an innovative DQN model for CCS and compared the performance of the proposed DQN model with those of eight other classification models. The experimental results obtained using five CCS datasets showed that the proposed model performed significantly better than the other eight traditional classification models.
In recent years, P2P lending has developed rapidly with the rise of the Internet, and scholars have begun to focus on customer credit scoring in P2P lending. For instance, Guo et al. (2016) designed an instance-based credit scoring model that can assess the return and risk of loans. To verify the proposed model, the authors conducted extensive experiments on two actual CCS datasets in P2P lending. The experimental results showed that this model could effectively improve investment performance. Wang et al. (2021) proposed a misclassification cost matrix for P2P credit grading, using a set of equations and models to calculate costs. The results obtained on the Lending Club dataset showed that cost-sensitive classifiers could significantly reduce the total cost. Bastani et al. (2019) proposed a two-stage scoring method based on credit and profit. The first stage identifies non-performing loans (NPLs). The second stage predicts profitability based on the internal rate of return. In both stages, wide and deep learning were used to build the prediction models. The results obtained using the Lending Club dataset showed that the proposed model outperformed the existing credit scoring and profit scoring methods.
In summary, in real-world credit scoring, especially in P2P lending, data are generated in the dynamic interaction between customers and financial institutions, so a sequence correlation may exist between the samples, which can affect the CCS performance of traditional classification models. The DQN-BSPER model proposed in this paper is a sequential decision model, and our experiments verify whether such a sequence correlation exists between samples in the CCS datasets in P2P lending.

2.2 Deep reinforcement learning for classification problems

DRL has been widely used in various real-world fields (Moor et al. 2022; Fan et al. 2020; Liu et al. 2020; Patel et al. 2019; Silver et al. 2018), and the most popular model is the DQN model (Mnih et al. 2013, 2015). Over the years, increasingly many scholars have begun using the DQN model for supervised classification. For instance, Zhao et al. (2016) proposed a DQN model for image classification and experimentally proved that the proposed model was highly competitive on the vehicle classification datasets. Lin et al. (2020) proposed a DQN based on an improved reward function (DQN-IRF) model. The model provides more rewards to minority experience samples to make the classification strategy more inclined toward the minority, which effectively improves the classification performance of the model when applied to image and text datasets.
In binary classification, DQN models have shown excellent performance. For instance, Lopez-Martin et al. (2020) used four DRL models for intrusion detection and verified that their performance was better than that of traditional classification models. Ding et al. (2019) constructed a DQN model with RER technology to identify machinery running faults and achieved 100% recognition accuracy. Lim et al. (2021) used a DQN model with RER technology to intelligently predict hidden relationships in criminal networks. The experimental results showed that the proposed model was superior to random forest (RF) and SVM models. Lin et al. (2020) proposed a DQN model with RER technology for classification tasks that makes the results more inclined toward the minority class, and verified that the improved DQN model was significantly superior to the DNN model on image and text datasets. In addition, Chen et al. (2018) introduced stratified sampling technology into the experience replay process of the DQN model to improve its convergence performance. Schaul et al. (2015) introduced the degree of importance (referred to as priority) into the experience replay process and developed PER technology. This technology firstly determines the priority of each experience sample according to the TD error. Subsequently, it selects important experience samples to construct the mini-batch based on the priority to optimize the loss function. They proved that the improved process is very effective in improving the convergence of DRL models.
In summary, previous scholars have mainly constructed mini-batches based on RER, SER, or PER for DQN models. It is difficult to balance the numbers of minority and majority experience samples in the mini-batch, which may affect the DQN model performance in imbalanced classification. Our proposed BSPER technique fully considers the class-imbalanced characteristics of the CCS datasets in P2P lending and improves the convergence performance of DQN.

3 Theoretical background

3.1 Notations

For convenience and clarity, the main mathematical notations and definitions used in this article are presented in “Appendix 1”.

3.2 Reinforcement learning and Q-learning

Reinforcement learning (RL) is a subclass of machine learning that aims to optimize the action strategy of the agent continuously to maximise the expected cumulative reward in the process of interaction with the environment, that is, to maximise the Q-function (Sutton and Barto 1998). In previous studies, RL tasks have usually been formulated as MDPs (Cai et al. 2020; Zhang et al. 2021), and their basic elements can be expressed as a tuple \(\left({\varvec{S}},{\varvec{A}},P,R,\gamma \right)\), where \({\varvec{S}}\) indicates the state space, \({\varvec{A}}\) indicates the action space, \(P:{\varvec{S}}\times {\varvec{A}}\times {\varvec{S}}\to [\mathrm{0,1}]\) indicates the state transition probability, \(R:{\varvec{S}}\times {\varvec{A}}\to {\mathbb{R}}\) indicates the reward function, and \(\gamma \in [\mathrm{0,1}]\) indicates the discount factor that balances the importance of future rewards and current rewards. In particular, during the tth (\(t\in [0,T]\)) time step, the agent first performs an action \({a}_{t}\in {\varvec{A}}\) under the environment state \({s}_{t}\in {\varvec{S}}\). Then, the environment generates the reward \({r}_{t}\) and feeds it back to the agent. Finally, the environment is transferred to the next state \({s}_{t+1}\) according to probability \(P\). The cumulative reward from \({s}_{t}\) to \({s}_{T}\) can be expressed as \({R}_{t}={\sum }_{i=t}^{T}{\gamma }^{i-t}{r}_{i}\). In addition, the expected cumulative reward (usually represented by the Q-function) corresponding to the state–action pair \(\left({s}_{t},{a}_{t}\right)\) is expressed as \(Q\left({s}_{t},{a}_{t}\right)=E[{\sum }_{i=t}^{T}{\gamma }^{i-t}{r}_{i}]\) (Chen et al. 2018), and the optimal strategy \({\pi }^{*}\) represents the strategy that can maximise the Q-function. According to the Bellman equation (Gosavi 2009), the Q-function can be transformed into the following form:
$$Q\left({s}_{t},{a}_{t}\right)=E\left[{r}_{t}+\gamma \underset{{a}_{t+1}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{t+1},{a}_{t+1}\right)\right],$$
(1)
where \(\underset{{a}_{t+1}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{t+1},{a}_{t+1}\right)\) represents the maximum Q-value that the agent can obtain when the state is \({s}_{t+1}\).
Q-learning is a widely used model-free RL algorithm based on asynchronous dynamic programming that can quickly find the optimal strategy for the MDP (Watkins and Dayan 1992). The core idea is to find the optimal strategy \({\pi }^{*}\) that can maximise the Q-value using the Bellman equation (Gosavi 2009) to iterate the Q-table continuously. The general steps of Q-learning can be summarised as follows:
Step 1: Initialise the Q-values of all state-action pairs in the Q-table.
Step 2: According to the Q-table, the agent executes action \({a}_{t}\) in state \({s}_{t}\).
Step 3: When the state-action pair is \(\left({s}_{t},{a}_{t}\right)\), the agent obtains reward \({r}_{t}\) according to the reward function. Simultaneously, the environment generates the next state \({s}_{t+1}\), and the Q-value is iteratively updated using the Bellman equation (Gosavi 2009):
$$Q\left({s}_{t},{a}_{t}\right)\leftarrow Q\left({s}_{t},{a}_{t}\right)+\alpha \left[{r}_{t}+\gamma \underset{{a}_{t+1}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{t+1},{a}_{t+1}\right)-Q\left({s}_{t},{a}_{t}\right)\right],$$
(2)
where \(\alpha\) represents the learning rate.
Step 4: Repeat Steps 2 and 3 until none of the Q-values in the Q-table change. Then, the strategy corresponding to the Q-table is the optimal strategy \({\pi }^{*}\) (Sutton and Barto 1998).
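To make the iteration concrete, the sketch below implements the four steps in plain Python. It is a minimal illustration rather than the configuration used in this paper: the Gym-style environment interface (`reset`/`step` returning the next state, reward, and a done flag), the integer state encoding, and all hyperparameter values are assumptions introduced only for this example.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.9,
               epsilon=0.1, n_episodes=500):
    """Tabular Q-learning following Eq. (2); env is assumed to expose
    reset() -> state and step(a) -> (next_state, reward, done)."""
    Q = np.zeros((n_states, n_actions))          # Step 1: initialise the Q-table
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 2: epsilon-greedy action selection from the Q-table
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)        # Step 3: observe reward and next state
            # Bellman update of Eq. (2)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                           # Step 4: repeat until the Q-values settle
    return Q
```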

3.3 Deep Q-network

Q-learning algorithms have been successfully applied in many real-world fields, but they are primarily suitable for tasks with small state spaces. In the real world, the state spaces of tasks are typically large, and the number of states can even reach tens of millions. To solve this problem, Mnih et al. (2015) combined a DNN with Q-learning to develop a DQN model, which approximately expresses the Q-function by automatically extracting features from the state space. Specifically, experience samples are continuously obtained first according to the greedy strategy, and they are stored in a fixed-size replay memory buffer to form an experience sample set. In particular, the experience sample at the tth time step is represented as \({e}_{t}=({s}_{t},{a}_{t},{r}_{t},{s}_{t+1})\), and the experience sample set is represented as \({{\varvec{E}}}_{t}=\{{e}_{1},{e}_{2},\dots ,{e}_{t}\}\). Then, \(k\) experience samples are randomly selected from \({{\varvec{E}}}_{t}\) to form a mini-batch using RER technology. Finally, a DNN is used to fit the Q-function, so the network corresponding to the Q-function is called the Q-network, and its general expression is \(Q\left(s,a;\theta \right)\), and the loss function \(L({\theta }_{t}^{C})\) can be expressed as
$$L({\theta }_{t}^{C})=\frac{1}{k}\sum_{i=1}^{k}{\left({y}_{i}-Q\left({s}_{i},{a}_{i};{\theta }_{t}^{C}\right)\right)}^{2},$$
(3)
where \({{y}_{i}=r}_{i}+\gamma \underset{{a}_{i+1}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{i+1},{a}_{i+1};{\theta }_{t}^{T}\right)\) indicates the target Q-function; \({\theta }_{t}^{T}\) indicates the parameters of the target Q-network; \(Q\left({s}_{i},{a}_{i};{\theta }_{t}^{C}\right)\) indicates the current Q-function; \({\theta }_{t}^{C}\) indicates the parameters of the current Q-network. The gradient descent algorithm was used to update the parameters of the Q-network:
$$\frac{\nabla L\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}}=\frac{2}{k}\sum_{i=1}^{k}\left({y}_{i}-Q\left({s}_{i},{a}_{i};{\theta }_{t}^{C}\right)\right)\frac{\nabla L\left(Q\left({s}_{i},{a}_{i};{\theta }_{t}^{C}\right)\right)}{\nabla {\theta }_{t}^{C}},$$
(4)
$${\theta }_{t+1}^{C}={\theta }_{t}^{C}-\alpha \frac{\nabla L\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}},$$
(5)
where \(\alpha\) indicates the learning rate of the Q-network. After training, we can obtain the optimal strategy \({\pi }^{*}=\underset{a\in {\varvec{A}}}{argmax}{Q}^{*}\left(s,a;\theta \right)\), which can maximise the Q-function. In addition, iterative updating technology is used to reduce the correlation between the current Q-network and target Q-network. In other words, the parameter of the current Q-network \({\theta }_{t}^{C}\) is assigned to the parameter of the target Q-network \({\theta }_{t}^{T}\) after a certain number of steps for improving the convergence stability of the Q-network (Lopez-Martin et al. 2020; Mnih et al. 2015). For the detailed algorithm, see the literature (Mnih et al. 2015).
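A hedged PyTorch sketch of the loss in Eq. (3) and the periodic target-network synchronisation is given below. The network width, the tensor layout of the mini-batch, and the choice of PyTorch itself are assumptions for illustration only and do not reproduce the architecture or parameters used in this paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network mapping a state vector to one Q-value per action."""
    def __init__(self, n_features, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_loss(current_q, target_q, batch, gamma):
    """Mean squared TD error of Eq. (3) on a mini-batch of transitions."""
    s, a, r, s_next = batch                                     # a is an int64 action tensor
    q_sa = current_q(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_i, a_i; theta^C)
    with torch.no_grad():                                       # target uses the frozen network
        y = r + gamma * target_q(s_next).max(dim=1).values      # y_i with theta^T
    return nn.functional.mse_loss(q_sa, y)

# Iterative updating: every Z time steps the current parameters are copied into
# the target network, e.g. target_q.load_state_dict(current_q.state_dict()).
```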

3.4 Experience replay

In most DRL frameworks, the agent constantly receives new experience samples to update the parameters of the DQN model incrementally. The simplest update method uses only one experience sample in each time step to update the model parameters. However, the biggest drawback of this method is that the rare experiences that may be useful in the future will be quickly forgotten, resulting in low sampling efficiency. Experience replay can effectively solve this issue (Luo et al. 2018). In other words, experience samples are firstly stored in a fixed-size replay memory buffer; then, each time the parameters of the DQN model are updated, a fixed number of experience samples are sampled from the replay memory buffer to construct a mini-batch; and finally, the gradient descent algorithm is used to train and optimize the model.
Obviously, a complete experience replay process consists of storing and sampling the experience samples. Therefore, the selection of experience samples from the replay memory buffer is important in improving the DQN model performance. In particular, when an experience sample is stored in replay memory buffer \(D\), a new index label \(i\in \left\{\mathrm{1,2},\dots ,d\right\}\) is assigned to the experience. The priority of the ith experience sample is represented as \(P({e}_{i})\). The entire replay memory buffer can be regarded as a combination of experience samples and priority \(\left\{{e}_{i},P({e}_{i})\right\}\). The key to selecting the experience samples is to determine \(P({e}_{i})\) for each experience sample. The most commonly used experience replay technology in the DQN model is RER, and the priority of each experience sample is the same, that is, \(P\left({e}_{i}\right)=\frac{1}{d}\). This technology uses a random sampling method to select experience samples from the buffer to construct the mini-batch, which ignores the experience samples that play an important role in the parameter updating of the DQN model. Consequently, the DQN model may converge slowly in complex tasks. To solve this problem, Schaul et al. (2015) proposed PER technology. The core steps of this technology can be summarised as follows. Firstly, the TD error of the experience sample in the replay memory buffer can be calculated as follows:
$${\delta }_{i}=\left|{y}_{i}-Q\left({s}_{i},{a}_{i};{\theta }_{t}^{C}\right)\right|, i=1, 2, ..., d.$$
(6)
The priority of each experience sample is then determined according to the TD error; that is, the selected probability of the experience sample can be expressed as
$$P\left({e}_{i}\right)=\frac{{\delta }_{i}}{{\sum }_{i=1}^{d}{\delta }_{i}}.$$
(7)
Finally, \(k\) experience samples are selected from the replay memory buffer according to the probability \(P\left({e}_{i}\right)\) to construct a mini-batch. The higher the TD error, the higher the priority of the experience sample. Thus, a sample with a higher TD error is selected with a higher probability, which effectively improves the convergence performance of the DQN model (Schaul et al. 2015).
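A minimal numpy sketch of this proportional prioritized sampling (Eqs. (6)–(7)) follows. The small constant added to the TD errors and the use of a flat array instead of the SumTree structure of Schaul et al. (2015) are simplifying assumptions.

```python
import numpy as np

def per_sample(td_errors, k, eps=1e-6):
    """Select k buffer indices with probability proportional to the TD error (Eqs. 6-7)."""
    delta = np.abs(np.asarray(td_errors)) + eps      # |y_i - Q(s_i, a_i)|; eps avoids zero priority
    probs = delta / delta.sum()                      # P(e_i) of Eq. (7)
    return np.random.choice(len(delta), size=k, replace=False, p=probs)

# Example: draw a mini-batch of 32 transitions from a buffer holding 500 stored TD errors.
# idx = per_sample(stored_td_errors, k=32)
```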
Furthermore, Chen et al. (2018) developed an SER by introducing stratified sampling technology into the experience replay for the DQN model. The core idea of this technology is to use a stratified sampling method to select different classes of experience samples from the replay memory buffer to construct a mini-batch. The experimental results demonstrate that this technology can dramatically improve the classification performance of the DQN model.

4 Formulating customer credit scoring in P2P lending as discrete-time finite MDP

In the real world, CCS in P2P lending is an interactive process similar to the RL process. The number of environmental states is limited within a certain time step, and the time step is discrete. Therefore, we formulated CCS in P2P lending as a discrete-time finite MDP (CSDFMDP). Firstly, a limited number of customers enter the environment. The agent then identifies the environmental state and performs a credit scoring action according to the state. Next, the environment generates rewards based on the reward function and feeds them back to the agent. Finally, the agent optimizes the action strategy based on the feedback reward. This process is repeated until there are no more customers in the environment. The ultimate objective of the agent is to classify the customer samples as accurately as possible. More importantly, the agent can obtain different rewards when classifying the customer samples correctly or incorrectly. Therefore, the agent can optimize the credit scoring action by maximising the Q-function.
To describe the CSDFMDP more clearly, we use \({{\varvec{D}}}_{train}={\{\left({x}_{t},{y}_{t}\right)\}}_{1\le t\le T}\) to represent the CCS training set, where \(T\) indicates the number of customer samples, \({x}_{t}=\left({x}_{t}^{1},{x}_{t}^{2},\dots {,x}_{t}^{g}\right)\) indicates the tth customer sample, \(g\) indicates the number of features, and \({y}_{t}\in \{\mathrm{0,1}\}\) indicates the class label of \({x}_{t}\). The basic elements related to CSDFMDP can be described as follows:
(1)
Environment. In the real world, one of the most critical factors affecting the P2P CCS environment is the customer, which is the main object of our study. Therefore, we simplified the CSDFMDP environment to include only the customer.
 
(2)
Environment state space. The environment state space can be expressed as \({\varvec{S}}=\left\{{s}_{1},{s}_{2},\dots ,{s}_{T}\right\}\), where the environment state at the tth time step is defined as \({s}_{t}=\left({s}_{t}^{1},{s}_{t}^{2},\dots {,s}_{t}^{g}\right)\). In particular, \({s}_{1}\) indicates the initial environment state corresponding to \({x}_{1}\) in the training set, and \({s}_{T}\) indicates the terminal environment state corresponding to \({x}_{T}\) in the training set.
 
(3)
Agent. The agent represents a substitute for the bank loan approver, which classifies customer credit according to the environmental state.
 
(4)
Action space. The action space is represented as \({\varvec{A}}=\left\{\mathrm{0,1}\right\}\), where 0 and 1 indicate that the agent classifies the customer as having good credit or bad credit, respectively. Then, at the tth time step, the credit scoring action performed by the agent according to state \({s}_{t}\) is expressed as \({a}_{t}\in {\varvec{A}}\).
 
(5)
Reward. At the tth time step, the feedback reward from the environment is \({r}_{t}\), that is, if the agent classifies the customer credit correctly, then the environment feeds back a positive reward to the agent according to the reward function; otherwise, the environment feeds back a negative reward. Referring to the literature (Chatterjee and Namin 2019; Lopez-Martin et al. 2020), we set the reward function for CCS as follows:
$$R\left({a}_{t},{y}_{t}\right)=\left\{\begin{array}{c}1, {a}_{t}={y}_{t},\\ -1, {a}_{t}\ne {y}_{t},\end{array}\right.$$
(8)
where \(R\left({a}_{t},{y}_{t}\right)\) indicates that if the credit scoring action of the agent \({a}_{t}\) is the same as the class label of customer credit \({y}_{t}\), then the feedback reward from the environment is \({r}_{t}=1\); otherwise, it is \({r}_{t}=-1\).
 
(6)
State transition probabilities. According to the literature (So and Thomas 2011), the state transition probability of the CSDFMDP is as follows:
$$p\left({s}_{t+1}|{s}_{t},{a}_{t}\right)=p\left({s}_{t+1}|{{s}_{t},{a}_{t},s}_{t-1},{a}_{t-1},\dots ,{s}_{1},{a}_{1}\right),$$
(9)
where \(p\left({s}_{t+1}|{s}_{t},{a}_{t}\right)\) indicates the probability of the environment transferring to state \({s}_{t+1}\) when the state–action pair is \(\left({s}_{t},{a}_{t}\right)\). In particular, the order of the customer samples was fixed; therefore, the state transition probability was deterministic, that is, \(p\left({s}_{t+1}|{s}_{t},{a}_{t}\right)=1\).
 
(7)
Strategy. In the CSDFMDP, the classification strategy \(\pi \left({a}_{t}|{s}_{t}\right)\) indicates the probability of performing credit scoring action \({a}_{t}\) by the agent under environment state \({s}_{t}\), so the optimal credit scoring action according to the greedy strategy can be expressed as follows:
$${\pi }^{*}\left({a}_{t}|{s}_{t}\right)=\left\{\begin{array}{c}1,\, if\, {a}_{t}=\underset{{a}_{t}\in {\varvec{A}}}{argmax}Q\left({s}_{t},{a}_{t}\right),\\ 0, else,\end{array}\right.$$
(10)
where the greedy strategy means that the agent only selects the action that maximises \(Q\left({s}_{t},{a}_{t}\right)\) in environment state \({s}_{t}\).
 
Specifically, the process of the agent from \({s}_{t}\) to \({s}_{T}\) is as follows.
Step 1: When the CCS environment state is \({s}_{t}\), the agent performs a credit scoring action based on greedy policy \({a}_{t}\).
Step 2: According to the reward function \(R\left({a}_{t},{y}_{t}\right)\), the environment feeds back a reward to the agent; that is, if the credit scoring action of agent \({a}_{t}\) is the same as real class label \({y}_{t}\), then the feedback reward from the environment is \({r}_{t}=1\); otherwise, it is \({r}_{t}=-1\).
Step 3: According to the state–action pair \(({s}_{t},{a}_{t})\), the environment is transferred to the next state, \({s}_{t+1}\).
Step 4: Repeat Steps 1–3 until the environment reaches the terminal state, \({s}_{T}\).
The cumulative reward obtained by the agent from \({s}_{t}\) to \({s}_{T}\) is \({R}_{t}={\sum }_{i=t}^{T}{\gamma }^{i-t}{r}_{i}\). The Q-function corresponding to the state-action pair can be expressed as \(Q\left({s}_{t},{a}_{t}\right)=E\left[{r}_{t}+\gamma \underset{{a}_{t+1}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{t+1},{a}_{t+1}\right)\right]\), where \(\gamma \in [\mathrm{0,1}]\) indicates the discount factor. In particular, \(\gamma =0\) indicates that only the current credit scoring result is considered, whereas \(\gamma =1\) indicates that the future and current credit scoring results are equally important.
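As an illustration of the CSDFMDP described above, the following hypothetical Python environment walks through the training samples in order, returns the reward of Eq. (8), and transitions deterministically to the next sample as in Eq. (9). The class name and interface are ours, introduced only for this sketch.

```python
import numpy as np

class CreditScoringEnv:
    """Discrete-time finite MDP for CCS: states are customer feature vectors,
    actions are {0: good credit, 1: bad credit}, rewards follow Eq. (8)."""
    def __init__(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.X[self.t]                              # initial state s_1

    def step(self, action):
        reward = 1 if action == self.y[self.t] else -1     # reward function of Eq. (8)
        self.t += 1
        done = self.t >= len(self.X)                       # terminal state s_T reached
        next_state = None if done else self.X[self.t]      # deterministic transition, Eq. (9)
        return next_state, reward, done
```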

5 Proposed DQN based on balanced stratified prioritized experience replay for customer credit scoring in P2P lending

This section proposes the DQN-BSPER model for CCS in P2P lending. We firstly combined SER and PER to develop the BSPER technology and then designed the DQN-BSPER model; that is, we used the BSPER technology to improve the experience replay process of the DQN model to optimize its loss function. This section introduces the BSPER technology in detail and describes the process of using the DQN-BSPER model for CCS.

5.1 Balanced stratified prioritized experience replay

The most commonly used experience replay technology in the DQN model is RER, which randomly selects experience samples from the buffer to construct a mini-batch. The main advantage of this method is that it is easy to implement. However, the mini-batch constructed by RER technology has an imbalanced class distribution in CCS, and it is difficult to select experience samples that play important roles in updating the loss function parameters, which may affect the CCS performance of the DQN model. SER can reduce the effects of the imbalanced class distribution on the DQN model performance by adjusting the numbers of minority and majority experience samples in the mini-batch (Chen et al. 2018). However, it is difficult to select experience samples that play important roles in updating the loss function parameters. In contrast, PER can select more important experience samples to improve the convergence performance of the DQN model (Schaul et al. 2015), but the class distribution of the constructed mini-batch is still imbalanced. It can be observed that SER and PER are complementary in the experience replay process. Therefore, we combined SER and PER to develop the BSPER technology.
To express the sampling process of the BSPER technology more clearly, we provide definitions of the majority and minority experience samples. If the current state of an experience sample is a minority sample (a positive class sample), then the sample is a minority experience sample, which is represented as \({{\varvec{E}}}^{min}\), and the tth minority experience sample is represented as \({e}_{t}^{min}=({s}_{t}^{min},{a}_{t}^{min},{r}_{t}^{min},{s}_{t+1}^{min})\). If the current state of an experience sample is a majority class sample (a negative class sample), then the sample is a majority experience sample, which is represented as \({{\varvec{E}}}^{maj}\), and the tth majority experience sample is indicated as \({e}_{t}^{maj}=({s}_{t}^{maj},{a}_{t}^{maj},{r}_{t}^{maj},{s}_{t+1}^{maj})\), \({{\varvec{E}}}^{min}\cap {{\varvec{E}}}^{maj}=\varnothing\). The core steps of the BSPER technology can be described as follows.
Step 1: Store the experience samples in the minority and majority experience replay buffers \({D}^{min}\) and \({D}^{maj}\), respectively, according to the form of SumTree (Schaul et al. 2015), and obtain the replay memory sets of the minority and majority experience samples \({{\varvec{E}}}_{replay}^{min}={\left\{{e}_{i}^{min}\right\}}_{1\le i\le {d}_{1}}\subset {{\varvec{E}}}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}={\left\{{e}_{j}^{maj}\right\}}_{1\le j\le {d}_{2}}\subset {{\varvec{E}}}^{maj}\).
Step 2: Calculate the TD errors of the experience samples in \({{\varvec{E}}}_{replay}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}\), denoted as \({\delta }_{i}^{min}\) and \({\delta }_{j}^{maj}\), respectively, which can be expressed as follows:
$${\delta }_{i}^{min}=\left|{y}_{i}^{min}-Q\left({s}_{i}^{min},{a}_{i}^{min};{\theta }^{C}\right)\right|,\quad i=\mathrm{1,2},\dots ,{d}_{1},$$
(11)
$${\delta }_{j}^{maj}=\left|{y}_{j}^{maj}-Q\left({s}_{j}^{maj},{a}_{j}^{maj};{\theta }^{C}\right)\right|,\quad j=\mathrm{1,2},\dots ,{d}_{2},$$
(12)
where \({y}_{i}^{min}={r}_{i}^{min}+\gamma \cdot \underset{{a}_{i+1}^{\mathit{min}}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{i+1}^{min},{a}_{i+1}^{min};{\theta }^{T}\right)\) and \({y}_{j}^{maj}={r}_{j}^{maj}+\gamma \cdot \underset{{a}_{j+1}^{\mathit{maj}}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{j+1}^{maj},{a}_{j+1}^{maj};{\theta }^{T}\right)\) indicate the target Q-functions in the minority and majority experience samples, respectively; \(Q\left({s}_{i}^{min},{a}_{i}^{min};{\theta }^{C}\right)\) and \(Q\left({s}_{j}^{maj},{a}_{j}^{maj};{\theta }^{C}\right)\) indicate the current Q-functions in the minority and majority experience samples, respectively; \({\theta }^{T}\) and \({\theta }^{C}\) indicate the parameters of the target Q-network and current Q-network, respectively; and \({d}_{1}\) and \({d}_{2}\) indicate the numbers of minority and majority experience samples in \({{\varvec{E}}}_{replay}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}\), respectively.
Step 3: Calculate the probabilities of minority and majority experience samples selected from \({{\varvec{E}}}_{replay}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}\) according to the TD error, which are represented as \(P\left({e}_{i}^{min}\right)\) and \(P\left({e}_{j}^{maj}\right)\), respectively:
$$P\left({e}_{i}^{min}\right)=\frac{{\delta }_{i}^{min}}{{\sum }_{i=1}^{{d}_{1}}{\delta }_{i}^{min}},$$
(13)
$$P\left({e}_{j}^{maj}\right)=\frac{{\delta }_{j}^{maj}}{{\sum }_{j=1}^{{d}_{2}}{\delta }_{j}^{maj}}.$$
(14)
Step 4: Select \({k}_{1}\) and \({k}_{2}\) (\({k}_{1}={k}_{2}\)) minority and majority experience samples from \({{\varvec{E}}}_{replay}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}\) according to \(P\left({e}_{i}^{min}\right)\) and \(P\left({e}_{j}^{maj}\right)\), respectively, to construct the stratified prioritized mini-batch.
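The following sketch summarises Steps 1–4 under simplifying assumptions: plain Python lists stand in for the two SumTree buffers, and the TD errors are taken as already computed with Eqs. (11)–(12).

```python
import numpy as np

def bsper_minibatch(minority_buffer, minority_td, majority_buffer, majority_td,
                    k1, k2, eps=1e-6):
    """Build a stratified prioritized mini-batch with k1 minority and k2 majority
    experience samples (Steps 3-4); selection probabilities follow Eqs. (13)-(14)."""
    def sample(buffer, td, k):
        delta = np.abs(np.asarray(td)) + eps         # priorities from the TD errors
        probs = delta / delta.sum()                  # P(e_i^min) or P(e_j^maj)
        idx = np.random.choice(len(buffer), size=k, replace=False, p=probs)
        return [buffer[i] for i in idx]

    batch = sample(minority_buffer, minority_td, k1) + sample(majority_buffer, majority_td, k2)
    np.random.shuffle(batch)                         # mix minority and majority transitions
    return batch
```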

5.2 Procedure of the DQN-BSPER model

This section describes the introduction of the BSPER technology into the DQN model to construct the DQN-BSPER model. Figure 1 shows the Q-network structure of the DQN-BSPER model. The input layer of the network is the environmental state (the features of the customer sample) \({s}_{t}=\left({s}_{t}^{1},{s}_{t}^{2},\dots {,s}_{t}^{g}\right)\), where the number of nodes in the input layer is the feature number of environment state \(g\), and the output layer is set to two Q-values. According to the Q-value, the customer class label is mapped using a one-hot method, and the nodes in each layer are connected in a fully connected manner.
The core idea of the DQN-BSPER model for CCS is as follows. Firstly, in the training set, the BSPER technology is used to construct the stratified prioritized mini-batch to optimize the loss function, and then the gradient descent algorithm is used to update the Q-network parameters continuously to optimize the classification strategy of the agent in CCS. Finally, in the test set, the trained DQN-BSPER model is used to evaluate customer credit. The detailed modelling steps of the proposed model can be summarised as follows (the modelling process is shown in “Appendix 2”).
Step 1: Initialise the parameters of the current Q-network \({\theta }_{t}^{C}\), parameters of the target Q-network \({\theta }_{t}^{T}\), target network update frequency \(Z\), time step \(t\), maximum time step \(T\), episode \(c\), maximum episode \(C\), size of mini-batch \(k\), number of minority experience samples in mini-batch \({k}_{1}\), and number of majority experience samples in mini-batch \({k}_{2}\).
Step 2: Randomly sort the customer samples in the training set and input them into the CCS environment. Each environmental state corresponds to a customer sample; that is, \({s}_{t}\leftarrow {x}_{t}\).
Step 3: The agent obtains environment state \({s}_{t}\) and performs credit scoring actions \({a}_{t}\) according to the ε-greedy strategy:
$$\pi \left({a}_{t}|{s}_{t}\right)=\left\{\begin{array}{c}1-\varepsilon , if\, {a}_{t}=\underset{{a}_{t}\in {\varvec{A}}}{argmax}Q\left({s}_{t},{a}_{t};\,{\theta }_{t}^{C}\right),\\ \varepsilon , else,\end{array}\right.$$
(15)
where the ε-greedy strategy adds randomness to the greedy strategy. That is, the agent selects the action that maximises \(Q\left({s}_{t},{a}_{t};{\theta }_{t}^{C}\right)\) based on the probability \(1-\varepsilon\) under environment state \({s}_{t}\); otherwise, an action is randomly selected based on the probability \(\varepsilon\). Then, reward \({r}_{t}\) is obtained according to the reward function \(R\left({a}_{t},{y}_{t}\right)\) (see Eq. (8)), and the environment is transferred to the next state \({s}_{t+1}\).
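A short sketch of the ε-greedy selection in Eq. (15) is given below (PyTorch, matching the earlier Q-network sketch; the value of ε and the two-action setting are placeholders for illustration only).

```python
import torch

def epsilon_greedy_action(current_q, state, n_actions=2, epsilon=0.1):
    """Eq. (15): with probability 1 - epsilon take the greedy action, otherwise explore."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(n_actions, (1,)).item()            # random credit scoring action
    with torch.no_grad():
        q_values = current_q(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())                    # greedy action argmax Q(s_t, a_t)
```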
Step 4: According to Step 3, the agent continuously obtains the experience sample \({e}_{t}=\left({s}_{t},{a}_{t},{r}_{t},{s}_{t+1}\right)\), and obtains \({{\varvec{E}}}_{replay}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}\) by using the BSPER technology.
Step 5: If \(\left|{{\varvec{E}}}_{replay}^{min}\right|\le {k}_{1}\) or \(\left|{{\varvec{E}}}_{replay}^{maj}\right|\le {k}_{2}\), then go to Step 3 and let \(t\leftarrow t+1\); otherwise, use the BSPER technology to construct the stratified prioritized mini-batch. Then, the corresponding loss function of Q-network at the t-th time step \({L}_{STS}\left({\theta }_{t}^{C}\right)\) can be expressed as follows:
$${L}_{STS}\left({\theta }_{t}^{C}\right)=\frac{1}{{k}_{1}}\sum_{i=1}^{{k}_{1}}{\left({y}_{i}^{min}-Q\left({s}_{i}^{min},{a}_{i}^{min};{\theta }_{t}^{C}\right)\right)}^{2}+\frac{1}{{k}_{2}}\sum_{j}^{{k}_{2}}{\left({y}_{j}^{maj}-Q\left({s}_{j}^{maj},{a}_{j}^{maj};{\theta }_{t}^{C}\right)\right)}^{2},$$
(16)
$${y}_{i}^{min}=\left\{\begin{array}{ll}{r}_{i}^{min}, & \text{if } t = T,\\ {r}_{i}^{min}+\gamma \cdot \underset{{a}_{i+1}^{\mathit{min}}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{i+1}^{min},{a}_{i+1}^{min};{\theta }_{t}^{T}\right), & \text{if } t \ne T,\end{array}\right. \quad i=\mathrm{1,2},\dots ,{k}_{1}$$
(17)
$${y}_{j}^{maj}=\left\{\begin{array}{ll}{r}_{j}^{maj}, & \text{if } t = T,\\ {r}_{j}^{maj}+\gamma \cdot \underset{{a}_{j+1}^{\mathit{maj}}\in {\varvec{A}}}{{\text{max}}}Q\left({s}_{j+1}^{maj},{a}_{j+1}^{maj};{\theta }_{t}^{T}\right), & \text{if } t \ne T,\end{array}\right. \quad j=\mathrm{1,2},\dots ,{k}_{2}$$
(18)
where \({y}_{i}^{min}\) and \({y}_{j}^{maj}\) indicate the target Q-functions in the minority and majority experience samples, respectively; \(Q\left({s}_{i}^{min},{a}_{i}^{min};{\theta }_{t}^{C}\right)\) and \(Q\left({s}_{j}^{maj},{a}_{j}^{maj};{\theta }_{t}^{C}\right)\) indicate the current Q-functions in the minority and majority experience samples, respectively; \({\theta }_{t}^{T}\) and \({\theta }_{t}^{C}\) indicate the parameters of the target and current Q-networks, respectively; \({k}_{1}\) and \({k}_{2}\) (\({k}_{1}+{k}_{2}=k\)) indicate the numbers of minority and majority experience samples selected from \({{\varvec{E}}}_{replay}^{min}\) and \({{\varvec{E}}}_{replay}^{maj}\), respectively; and \(\gamma\) is the discount factor.
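A hedged PyTorch sketch of the loss in Eqs. (16)–(18) follows. It assumes that each half of the stratified prioritized mini-batch is supplied as tensors together with a terminal-state flag, which is how the t = T case of Eqs. (17)–(18) is handled here; it is not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

def stratified_loss(current_q, target_q, minority, majority, gamma):
    """L_STS of Eq. (16): the mean squared TD errors of the minority and
    majority halves of the stratified prioritized mini-batch are added."""
    def half_loss(batch):
        s, a, r, s_next, done = batch        # done marks the terminal state (t = T)
        q_sa = current_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            bootstrap = target_q(s_next).max(dim=1).values
            y = r + gamma * (1.0 - done.float()) * bootstrap    # targets of Eqs. (17)-(18)
        return nn.functional.mse_loss(q_sa, y)

    return half_loss(minority) + half_loss(majority)
```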
Step 6: Use the Adam optimization algorithm to update the parameters of the current Q-network (Schaul et al. 2015):
$$\frac{\nabla {L}_{STS}\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}}=\frac{2}{{k}_{1}}{\sum }_{i=1}^{{k}_{1}}\left({y}_{i}^{min}-Q\left({s}_{i}^{min},{a}_{i}^{min};\,{\theta }_{t}^{C}\right)\right)\frac{\nabla {L}_{STS}\left(Q\left({s}_{i}^{min},{a}_{i}^{min};\,{\theta }_{t}^{C}\right)\right)}{\nabla {\theta }_{t}^{C}}$$
$$+\frac{2}{{k}_{2}}{\sum }_{j=1}^{{k}_{2}}\left({y}_{j}^{maj}-Q\left({s}_{j}^{maj},{a}_{j}^{maj};\,{\theta }_{t}^{C}\right)\right)\frac{\nabla {L}_{STS}\left(Q\left({s}_{j}^{maj},{a}_{j}^{maj};\,{\theta }_{t}^{C}\right)\right)}{\nabla {\theta }_{t}^{C}},$$
(19)
$${m}_{t}\leftarrow {\beta }_{1}{m}_{t}+\left(1-{\beta }_{1}\right)\frac{\nabla {L}_{STS}\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}},$$
(20)
$${d}_{t}\leftarrow {\beta }_{2}{d}_{t}+\left(1-{\beta }_{2}\right){\left(\frac{\nabla {L}_{STS}\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}}\right)}^{2},$$
(21)
$${\widehat{m}}_{t}\leftarrow \frac{{m}_{t}}{1-{\beta }_{1}^{t}},$$
(22)
$${\widehat{d}}_{t}\leftarrow \frac{{d}_{t}}{1-{\beta }_{2}^{t}},$$
(23)
$${\theta }_{t}^{C}\leftarrow {\theta }_{t}^{C}-\frac{\alpha }{\sqrt{{\widehat{d}}_{t}}+\epsilon }{\widehat{m}}_{t},$$
(24)
where \(\frac{\nabla {L}_{STS}\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}}\) indicates the gradient of \({L}_{STS}\left({\theta }_{t}^{C}\right)\) at the tth time step; \({m}_{t}\) and \({d}_{t}\) indicate the first- and second-order moments of \(\frac{\nabla {L}_{STS}\left({\theta }_{t}^{C}\right)}{\nabla {\theta }_{t}^{C}}\), respectively; \({\beta }_{1}\) and \({\beta }_{2}\) indicate the exponential decay rates of the first- and second-order moments, respectively; \({\widehat{m}}_{t}\) and \({\widehat{d}}_{t}\) indicate the deviation correction values of \({m}_{t}\) and \({d}_{t}\), respectively; Eq. (24) indicates that the parameters of the Q-network \({\theta }_{t}^{C}\) are updated by combining Eqs. (22) and (23); \(\alpha\) indicates the learning rate of the current Q-network; and \(\epsilon\) indicates a constant value to prevent \(\sqrt{{\widehat{d}}_{t}}+\epsilon\) from generating the 0 value. Then, for each \(Z\) time step, the parameters of the current Q-network \({\theta }_{t}^{C}\) are assigned to the parameters of the target Q-network \({\theta }_{t}^{T}\); that is, let \({\theta }_{Z|t}^{T}\leftarrow {\theta }_{Z|t}^{C}\).
Step 7: If \(t\le T\), then proceed to Step 3 and let \(t\leftarrow t+1\); otherwise, proceed to Step 8.
Step 8: If \(c\le C\), then go to Step 2 and let \(c\leftarrow c+1\); otherwise, stop training, output the trained current Q-network, and use it to conduct CCS according to the greedy strategy in the test set.
The current Q-network obtained through the above steps is a parameterised optimal Q-function; therefore, the strategy corresponding to the network is also the optimal classification strategy. The pseudo-codes of the CCS environment simulation and DQN-BSPER model are shown in “Appendix 3”.

6 Experimental design

This section introduces the experimental design. Firstly, the two CCS datasets in P2P lending and their preprocessing are described; then, the experimental process and main parameter settings of the models are discussed in detail; finally, four EMs are introduced to evaluate the performance of the models.

6.1 Data set description and preprocessing

To analyse the CCS performance of the DQN-BSPER model, we selected two CCS datasets in P2P lending (IFCD and Lending Club). The IFCD dataset was obtained from an Internet financial company in China. The original dataset contained 1110 features. The Lending Club dataset was acquired from the first-quarter 2016 data of Lending Club, a P2P lending platform in the United States, and the original dataset contained 110 features. According to the regulations of the Basel Banking Regulatory Association, we labelled customers whose loans were overdue for more than 90 days as having bad credit and others as having good credit.
The two CCS datasets in P2P lending contained redundant features and missing values, which were processed as follows. First, we deleted some meaningless features through observation, such as loan number, zip code, and customer ID. Then, the features with missing rates of more than 30% were eliminated. Finally, the recursive feature elimination method Wrapper was used to eliminate the features with the minimum absolute weights continuously. The basic information about the two CCS datasets in P2P lending after preprocessing is shown in Table 1, including the dataset name (the abbreviation in parentheses), number of features, number of customer samples, number of customer samples with good credit, number of customer samples with bad credit, and imbalanced ratio (IR), which is defined as the proportion of the number of the majority samples (customers with good credit) relative to the number of minority samples (customers with bad credit). Obviously, the greater the \(IR\), the higher the class distribution imbalance. As shown in Table 1, the two CCS datasets are imbalanced.
Table 1
The basic information of the two CCS data sets in P2P lending after preprocessing

Data sets            Features   Samples    Good credit   Bad credit   IR
IFCD (IF)            50         23,319     21,583        1736         12.43
Lending Club (LE)    90         133,887    109,888       23,999       4.58

6.2 Experimental procedure and parameter settings

In this study, we performed fivefold cross validation in each dataset to obtain the calculation results for the models. Firstly, each dataset was divided equally into five subsets at random. In each experiment, one subset was used as the test set, \({{\varvec{D}}}_{test}\), and the remaining subsets were used as the training set, \({{\varvec{D}}}_{train}\). Then, the DQN-BSPER model and other models were trained in \({{\varvec{D}}}_{train}\). Finally, the trained models were used to classify the samples in \({{\varvec{D}}}_{test}\). This process was repeated five times to ensure that each subset was used once. The entire process constitutes fivefold cross validation. To obtain more stable and reliable experimental results, we repeated the fivefold cross validation 10 times and took the average value as the final calculation result. In addition, we recorded the network loss each time the model traversed the training set in the process of fivefold cross validation and averaged the training network loss at the end of training. All experiments in this study were run on Python 3.6, in a Windows 10 64-bit system equipped with an Intel(R) Core(TM) i9 processor.
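The evaluation protocol (fivefold cross validation repeated 10 times) can be sketched as follows. The use of scikit-learn's stratified splitter and the helper `train_and_score` are illustrative assumptions introduced for this example, since the paper only states that the subsets were drawn at random.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_score(X, y, train_and_score, n_splits=5, n_repeats=10, seed=0):
    """Average an evaluation measure over 10 repetitions of fivefold cross validation."""
    splitter = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                       random_state=seed)
    # train_and_score is a hypothetical helper: it trains a model on the training fold
    # and returns an EM (e.g. AUC) computed on the held-out fold.
    scores = [train_and_score(X[tr], y[tr], X[te], y[te]) for tr, te in splitter.split(X, y)]
    return float(np.mean(scores))
```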
We mainly referred to previous studies (Lin et al. 2020; Liu et al. 2020; Mnih et al. 2015) to set the parameters of the DRL models. “Appendix 4” shows the main parameter settings of the DQN-BSPER and other DRL models. In particular, the size of the replay memory buffers of the DQN-BSPER and DQN-SER models \({D}^{min}\) and \({D}^{maj}\) was 500. In addition, the current and target Q-networks of the DQN-BSPER model had the same structure. To ensure the fairness of the experiment, we adjusted the discount factors of the DQN, DQN-SER, DQN-PER, and DQN-IRF models many times and determined their optimal values. Specifically, the optimal discount factor of the DQN-SER model was 0.1, and the optimal discount factor of the DQN, DQN-PER, and DQN-IRF models was 0.3. In all experiments, the network parameters of the five DRL models were equal. In addition, we used the grid search and fivefold cross validation (Gunnarsson et al. 2021) to ensure the excellent performance of the other classification models.

6.3 Evaluation measures

The confusion matrix can be used to intuitively evaluate the performance of classification models (Batista et al. 2004). Table 2 shows the confusion matrix of CCS, from which many different EMs can be obtained.
Table 2
The confusion matrix of CCS

                                          Predicted positive   Predicted negative   Total
Actual positive (bad credit customer)     TP                   FN                   TP + FN
Actual negative (good credit customer)    FP                   TN                   FP + TN
Total                                     TP + FP              FN + TN              TP + FN + FP + TN
$$True\, positive\, rate\, (TPR)=\frac{TP}{ TP +FN},$$
(25)
$$True\, negative\, rate\, (TNR)=\frac{TN}{TN+FP},$$
(26)
$$False\, positive\, rate\, (FPR)=\frac{FP}{TN+FP},$$
(27)
$$Precision=\frac{TP}{TP+FP},$$
(28)
$$Accuracy=\frac{TP+TN}{TP+FP+TN+FN}.$$
(29)
However, there are some contradictions between these EMs, so some new comprehensive measures have been proposed, such as F1 (Tang et al. 2008) and AUC (Bradley 1997). F1 is a combination of precision and recall:
$$F1=\frac{(1+{\beta }^{2})\times recall\times precision}{{\beta }^{2}\times recall+precision},$$
(30)
where recall = TPR and \(\beta\) is the relative coefficient of recall to precision. In this study, we set \(\beta =1\).
In this research, we used the approximate representation of the AUC for the binary classification problem (Loyola-González et al. 2016):
$$AUC=\frac{TPR+TNR}{2}.$$
(31)
Therefore, we used four EMs in this study: TPR, accuracy (ACC), F1, and AUC. The larger their values, the better the performance of the CCS model.
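For reference, the four EMs can be computed from the confusion matrix counts as in the following sketch (Eqs. (25)–(31), with β = 1 so that F1 is the harmonic mean of precision and recall):

```python
def credit_scoring_metrics(tp, fn, fp, tn):
    """TPR, ACC, F1 and the approximate AUC of Eq. (31) from confusion matrix counts."""
    tpr = tp / (tp + fn)                              # recall for bad-credit customers, Eq. (25)
    tnr = tn / (tn + fp)                              # Eq. (26)
    acc = (tp + tn) / (tp + fn + fp + tn)             # Eq. (29)
    precision = tp / (tp + fp)                        # Eq. (28)
    f1 = 2 * precision * tpr / (precision + tpr)      # Eq. (30) with beta = 1
    auc = (tpr + tnr) / 2                             # Eq. (31)
    return {"TPR": tpr, "ACC": acc, "F1": f1, "AUC": auc}
```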

7 Experimental results and analysis

We empirically analysed the CCS performance of the DQN-BSPER model using two real-world P2P lending CCS datasets. Firstly, we assessed the effects of parameters \(\gamma\) and \({k}_{1}/{k}_{2}\) on the CCS performance of the DQN-BSPER model. Subsequently, the CCS performance of DQN-BSPER and the four benchmark DRL models DQN (Mnih et al. 2013), DQN-SER (Chen et al. 2018), DQN-PER (Schaul et al. 2015), and DQN-IRF (Lin et al. 2020) were statistically compared, and their convergence performance was compared. Next, the CCS performance of the DQN-BSPER model and seven traditional benchmark classification models was statistically compared. Finally, the impact of the imbalanced class distribution on the CCS performance of the DQN-BSPER model was analysed.

7.1 Parameter sensitivity analysis

To analyse the effects of the discount factor \(\gamma\) and the proportion of minority and majority experience samples in the stratified prioritized mini-batch \({k}_{1}/{k}_{2}\) on the CCS performance of the DQN-BSPER model, we set 11 different values of \(\gamma\); that is, we let \(\gamma =\{0, 0.1, 0.2, \dots, 1\}\). A large difference between \({k}_{1}\) and \({k}_{2}\) may significantly affect the DQN-BSPER model performance. Therefore, we set 11 different values of \({k}_{1}/{k}_{2}\): \(\{6/1, 5/1, 4/1, \dots, 1/6\}\). Owing to the lack of prior experience, we temporarily set \({k}_{1}/{k}_{2}=1\) when analysing the effects of the discount factor on the DQN-BSPER model performance.
Figure 2 depicts the average values of the EMs of the DQN-BSPER model under 11 values of \(\gamma\) (the average value of each EM is the average value of the classification results of the model on two CCS datasets in P2P lending). “Appendix 5” shows the CCS performance of the DQN-BSPER model under the 11 discount factors. The bold values indicate the best CCS performance in each column, the numbers in parentheses indicate the rankings of the DQN-BSPER model with different discount factors, and the last column indicates the average ranking value of the DQN-BSPER model performance with each discount factor. The smaller the ranking value, the better the model performance. As shown in Fig. 2, with respect to TPR, the CCS performance of the DQN-BSPER model shows a gradual upward trend with increasing \(\gamma\). According to ACC, F1, and AUC, the DQN-BSPER model performance shows a gradual downward trend with increasing \(\gamma\). To determine the most appropriate discount factor, we ranked the DQN-BSPER model performance under different discount factors. As can be seen in “Appendix 5”, the DQN-BSPER model has the smallest ranking value when \(\gamma =0.1\). The DQN-BSPER model has the largest ranking value when \(\gamma =0.8\).
Figure 3 presents the average values of the EMs of the DQN-BSPER model under 11 values of \({k}_{1}/{k}_{2}\). Furthermore, we ranked the CCS performance of the DQN-BSPER model under different \({k}_{1}/{k}_{2}\) values, and the results are shown in “Appendix 6”. As depicted in Fig. 3, with respect to TPR, the DQN-BSPER model performance shows a gradual downward trend with increasing \({k}_{1}/{k}_{2}\). With respect to ACC, the DQN-BSPER model performance shows a gradual upward trend with increasing \({k}_{1}/{k}_{2}\). With respect to F1 and AUC, the DQN-BSPER model performance firstly shows an upward trend and then a downward trend. As shown in “Appendix 6”, the DQN-BSPER model has the smallest ranking value when \({k}_{1}/{k}_{2}=1\).
It can be seen from the experimental results that the CCS performance of the DQN-BSPER model is the best when \(\gamma =0.1\) and \({k}_{1}/{k}_{2}=1\). Interestingly, with increasing \(\gamma\), the TPR gradually increases, whereas ACC, F1, and AUC gradually decrease. This observation shows that the greater \(\gamma\), the more majority samples are incorrectly classified as minority samples by the DQN-BSPER model, resulting in poor overall performance. The DQN-BSPER model performs best when \(\gamma =0.1\), which also indicates that there is a weak sequence correlation between the samples in the CCS dataset, and identifying this correlation may make the DQN-BSPER model more robust (Lei et al. 2020). In addition, setting \({k}_{1}/{k}_{2}\) too high or too low leads to poor CCS performance of the DQN-BSPER model. When \({k}_{1}/{k}_{2}=1\), the DQN-BSPER model has the highest F1 and AUC values, which verifies that balancing the class distribution of experience samples in the stratified prioritized mini-batch is the most effective way to improve the CCS performance of the DQN-BSPER model.

7.2 Credit scoring performance comparison of DQN-BSPER and four other benchmark DRL models

This section compares the CCS performance of DQN-BSPER and the other four benchmark DRL models, DQN, DQN-SER, DQN-PER, and DQN-IRF, on the IF and LE datasets according to four EMs. Table 3 presents the CCS performance of DQN-BSPER and the other four benchmark DRL models on the IF and LE datasets. The bold font indicates the model with the best CCS performance in each column, and the numbers in parentheses indicate the model performance rankings. The last column indicates the average ranking of the performance of each model.
Table 3
The CCS performance of DQN-BSPER and four other benchmark DRL models on the IF and LE datasets
Model        IF dataset                                      LE dataset                                      Average rank
             TPR        ACC        F1         AUC            TPR        ACC        F1         AUC
DQN-BSPER    0.7464(1)  0.6296(1)  0.2260(1)  0.6779(1)      0.9838(1)  0.9888(2)  0.9690(1)  0.9857(1)      1.13
DQN          0.0191(4)  0.6284(2)  0.0289(4)  0.5070(4)      0.9585(4)  0.9819(5)  0.9673(5)  0.9827(3)      3.88
DQN-SER      0.6887(2)  0.6202(4)  0.2211(2)  0.6680(2)      0.9829(2)  0.9852(3)  0.9683(2)  0.9836(2)      2.38
DQN-PER      0.0422(3)  0.6241(3)  0.0782(3)  0.5182(3)      0.9635(3)  0.9821(4)  0.9677(3)  0.9809(4)      3.25
DQN-IRF      0.0163(5)  0.6107(5)  0.0247(5)  0.5036(5)      0.9499(5)  0.9908(1)  0.9674(4)  0.9748(5)      4.36
The bold indicates the best ranked model in each column
Table 3 demonstrates that the DQN-BSPER model has the smallest average ranking, indicating that the model has the best CCS performance. To analyse whether there are statistically significant differences in CCS performance among DQN-BSPER and the other four benchmark DRL models, we used the nonparametric Wilcoxon rank sum test (Wilcoxon 1992). The null hypothesis is that the two compared models have the same CCS performance. In this paper, we set the significance level \(\alpha =0.05\). Then, at the 95% confidence level, when the number of paired comparisons is 8 (\(=2\times 4\)), the critical value (CV) is 3. The Wilcoxon rank sum test results of DQN-BSPER and the four other benchmark models are provided in Table 4. In Table 4, if T = Min(R+, R−) \(\le\) 3, the null hypothesis is rejected. In particular, if T = R− and T \(\le\) 3, the former model performs statistically significantly better than the latter; if T = R+ and T \(\le\) 3, the opposite is true.
Table 4
Wilcoxon rank sum test results for DQN-BSPER and four other benchmark DRL models on the IF and LE datasets
Comparison                 T = Min(R+, R−)    CV    p-value    Hypothesis
DQN-BSPER vs. DQN          Min(36, 0) = 0     3     0.012      Reject
DQN-BSPER vs. DQN-PER      Min(36, 0) = 0     3     0.012      Reject
DQN-BSPER vs. DQN-SER      Min(36, 0) = 0     3     0.012      Reject
DQN-BSPER vs. DQN-IRF      Min(34, 2) = 2     3     0.025      Reject
R+ indicates the sum of the ranks of all cases in which the performance of the former is better than that of the latter when comparing two models; R− indicates the sum of the ranks of all cases in which the performance of the former is worse than that of the latter
The test results in Table 4 indicate that when comparing the DQN-BSPER model with the DQN, DQN-SER, DQN-PER, and DQN-IRF models, T = Min(R+, R−) is less than the CV; in other words, at the 95% confidence level, the CCS performance of the DQN-BSPER model is statistically significantly better than those of the other four benchmark models. More importantly, the CCS performances of the DQN-SER and DQN-PER models are better than that of the DQN model, and the DQN-BSPER model is superior to the DQN-SER and DQN-PER models. Thus, although the CCS performance of the DQN model can be improved by using SER or PER technology alone, the effect of the improvement is very limited, whereas the combination of SER and PER technologies can further improve the performance of the DQN model. In addition, for the IF dataset, the CCS performance of the DQN-IRF model is very poor. This poor performance may occur because the DQN-IRF model (Lin et al. 2020) is mainly applied to classification tasks with unstructured data (image and text data), whereas the CCS data are structured, which may lead to the failure of the reward function and thus affect the CCS performance of the DQN-IRF model.
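For illustration, the paired comparison of DQN-BSPER and DQN over the eight EM values in Table 3 can be reproduced with SciPy, whose two-sided statistic corresponds to T = Min(R+, R−) as used in Table 4; the numbers below are copied from Table 3.

```python
from scipy.stats import wilcoxon

# Eight paired EM values (TPR, ACC, F1, AUC on IF, then on LE) from Table 3.
dqn_bsper = [0.7464, 0.6296, 0.2260, 0.6779, 0.9838, 0.9888, 0.9690, 0.9857]
dqn       = [0.0191, 0.6284, 0.0289, 0.5070, 0.9585, 0.9819, 0.9673, 0.9827]

stat, p_value = wilcoxon(dqn_bsper, dqn)   # two-sided paired test, statistic = Min(R+, R-)
print(stat)                                # T = 0 here, below the critical value of 3,
                                           # so the null hypothesis is rejected
```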

7.3 Convergence performance comparison of DQN-BSPER and four other benchmark DRL models

To analyse the effects of the SPER technology on the convergence performance of the DQN model, this section compares the convergence behaviour of the DQN-BSPER, DQN, DQN-SER, DQN-PER, and DQN-IRF models on the IF and LE datasets. Figure 4 shows the average training loss of the Q-networks of DQN-BSPER and the other four benchmark DRL models on the two CCS training sets.
Figure 4a shows the average training losses of the Q-networks of the five DQN models on the IF dataset. The average training loss of the Q-network of the DQN-BSPER model reaches a stable value when episode = 1, indicating that the DQN-BSPER model converges when episode = 1, and the value fluctuates slightly as the number of episodes increases. The average training loss of the Q-network of the DQN model is very large when episode = 1; then, the value decreases rapidly and reaches the local minimum when episode = 4, but the value fluctuates strongly as the number of episodes increases. The average training loss of the Q-network of the DQN-SER model is larger than those of the other models, and the fluctuation is also very strong. The average training losses of the Q-networks of DQN-PER and DQN-IRF models reach stable values when episode = 3; that is, the DQN-PER and DQN-IRF models reach a convergence state when episode = 3. Figure 4b shows the average training losses of the Q-networks of the five DQN models on the LE dataset. The average training loss of the Q-network of the DQN-BSPER model reaches a stable value when episode = 1, indicating that the DQN-BSPER model converges when episode = 1, and the value fluctuates slightly as the number of episodes increases. The average training loss of the Q-network of the DQN model is very large when episode = 1 and reaches the local minimum when episode = 9, but the value still fluctuates strongly as the number of episodes increases. The average training loss of the Q-network of the DQN-SER model also fluctuates significantly. The average training losses of the Q-networks of the DQN-PER and DQN-IRF models reach stable values when episode = 6 and episode = 3, respectively; that is, the DQN-PER and DQN-IRF models reach a convergence state when episode = 6 and episode = 3, respectively, and their average training losses of the Q-network are relatively small.
The experimental results show that the Q-network of the DQN-BSPER model has the fastest convergence speed and that the average training loss of the Q-network is the most stable. Thus, the SPER technology can effectively improve the convergence performance of the DQN model; that is, it can not only accelerate the convergence speed of the Q-network, but also make the Q-network more stable. In addition, the average training losses of the Q-networks of the DQN-PER and DQN-IRF models are very small, but the results in Sect. 7.2 show that their CCS performance is relatively poor, which indicates that over-fitting may occur. Interestingly, the fluctuation in the average training losses of the DQN and DQN-SER models is stronger than that of the DQN-BSPER and DQN-PER models. The main reason may be that the DQN-BSPER and DQN-PER models consider the priority of the experience samples when building the mini-batch. Thus, PER technology can improve the stability of the DQN model more effectively than SER technology.
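How the per-episode curves in Fig. 4 are aggregated from the recorded per-step losses is not spelled out in the text; a plain averaging sketch, assuming a fixed number of Q-network update steps per episode, is given below.

```python
import numpy as np

def average_episode_loss(step_losses, steps_per_episode):
    """Average the per-step Q-network training losses within each episode,
    yielding one loss value per episode as plotted in Fig. 4."""
    losses = np.asarray(step_losses, dtype=float)
    n_full = len(losses) // steps_per_episode
    return losses[: n_full * steps_per_episode].reshape(n_full, steps_per_episode).mean(axis=1)
```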

7.4 Credit scoring performance comparison of DQN-BSPER and seven traditional classification models

This section compares the CCS performance of DQN-BSPER and seven traditional benchmark classification models on the IF and LE datasets. The results are presented in “Appendix 7”, in which the bold font indicates the classification model with the best CCS performance in each line, and the numbers in parentheses indicate the performance rankings of the classification models. The last line indicates the average ranking of each classification model. To analyse whether there are statistically significant differences between DQN-BSPER and the seven traditional benchmark classification models, we used the nonparametric statistical test methods recommended by Demšar (2006), namely the Friedman test (Friedman 1940) and the Iman–Davenport test (Iman and Davenport 1980). If there were statistically significant differences, we further used the Nemenyi post hoc test to compare the eight classification models. In addition, to ensure fair comparisons, we used the resampling methods ROS and RUS to balance the class distribution of the two training sets to form four new training sets (the class distribution ratio of each balanced training set was 1:1) during each training process. Each dataset thus corresponded to two new cases; that is, the IF dataset corresponded to IF (ROS) and IF (RUS), whereas the LE dataset corresponded to LE (ROS) and LE (RUS).
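The ROS and RUS variants of the training sets can be produced, for example, with the imbalanced-learn resamplers; the synthetic data below are only a stand-in for the real CCS training sets.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for an imbalanced CCS training set (about 5% minority class).
X_train, y_train = make_classification(n_samples=2000, weights=[0.95], random_state=42)

ros = RandomOverSampler(sampling_strategy=1.0, random_state=42)    # ROS: duplicate minority samples
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)   # RUS: drop majority samples

X_ros, y_ros = ros.fit_resample(X_train, y_train)   # class distribution becomes 1:1
X_rus, y_rus = rus.fit_resample(X_train, y_train)   # class distribution becomes 1:1
```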
The results in “Appendix 7” show that the DQN-BSPER model has the smallest average ranking, indicating that it achieves the best CCS performance. We used the Friedman and Iman–Davenport tests to analyse further whether there were statistically significant differences between DQN-BSPER and the seven traditional benchmark classification models. The null hypothesis of both tests was that the CCS performances of the eight classification models were the same. We used the \({\chi }^{2}\) distribution with 7 degrees of freedom and the F distribution with 7 and 161 (\(=7\times 23\)) degrees of freedom. Table 5 lists the test results. If the test value was greater than the distribution value, the null hypothesis was rejected. Therefore, according to the results in Table 5, we rejected the null hypothesis and concluded that there were statistically significant differences in CCS performance among the eight classification models at the 95% confidence level.
Table 5
Friedman and Iman–Davenport test results for DQN-BSPER and seven traditional benchmark classification models
Test              Test value    Distribution value    Hypothesis
Friedman          98.58         14.07                 Reject
Iman–Davenport    32.66         2.07                  Reject
After the null hypothesis was rejected, we used the Nemenyi post hoc test. The judgement rule of this test is that if the difference between the average rank values of any two classification models is greater than the critical difference (CD), then the null hypothesis that the two models have the same performance is rejected at the 95% confidence level; that is, there is a statistically significant difference in CCS performance between the two classification models. When the number of classification models is 8, the CV (Demšar 2006) is 3.03, and the corresponding critical difference is \({\text{CD}}=3.03\sqrt{(8\times 9)/(6\times 24)}\approx 2.14\). The test results are presented in Fig. 5. If two classification models are connected by a segment, there is no statistically significant difference between them. The results in Fig. 5 show that the CCS performance of the DQN-BSPER model is statistically significantly better than those of the other seven traditional benchmark classification models at the 95% confidence level.
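The statistics in Table 5 and the critical difference used in Fig. 5 follow the standard formulas in Demšar (2006). The sketch below, fed with the rounded average ranks from “Appendix 7” (k = 8 models, N = 6 dataset variants × 4 EMs = 24), approximately reproduces these values.

```python
import math

def friedman_tests(avg_ranks, n_datasets):
    """Friedman chi-square, Iman-Davenport F statistic, and Nemenyi critical
    difference for k models compared over N datasets (Demsar 2006)."""
    k, n = len(avg_ranks), n_datasets
    chi2 = 12 * n / (k * (k + 1)) * (sum(r ** 2 for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_id = (n - 1) * chi2 / (n * (k - 1) - chi2)     # Iman-Davenport correction
    q_alpha = 3.031                                  # Nemenyi critical value for k = 8, alpha = 0.05
    cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))
    return chi2, f_id, cd

# Average ranks of the eight models from "Appendix 7".
chi2, f_id, cd = friedman_tests([1.33, 5.63, 3.63, 3.50, 5.63, 3.83, 4.83, 7.63], n_datasets=24)
print(round(chi2, 2), round(f_id, 2), round(cd, 2))  # about 99.2, 33.2, 2.14 with these rounded
                                                     # ranks (Table 5 reports 98.58 and 32.66)
```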

7.5 Impact of imbalanced class distribution on the customer credit scoring performance of the DQN-BSPER model

To analyse the effects of the imbalanced class distribution on the CCS performance of the DQN-BSPER model, we set 10 imbalance ratios for the IF and LE datasets. Specifically, we first randomly sampled majority samples from the training set in multiples of the number of minority samples and then combined the sampled majority samples with all minority samples to form a new training set. In this way, training sets with 10 imbalance ratios \((IR=\left\{1, 2, \dots , 10\right\})\) were obtained, and the DQN-BSPER model performance was analysed according to the four EMs. Figure 6 shows the CCS performance of the DQN-BSPER model under the 10 imbalance ratios on the IF and LE datasets. Furthermore, we used boxplots to analyse the dispersion degrees of the four EM values, as shown in Fig. 7.
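A brief sketch of how training sets with a controlled imbalance ratio can be constructed from the minority and majority subsets of the original training set (the array names are illustrative) is given below.

```python
import numpy as np

def build_ir_training_set(X_min, y_min, X_maj, y_maj, ir, seed=0):
    """Combine all minority samples with ir times as many randomly drawn majority samples."""
    rng = np.random.default_rng(seed)
    n_maj = min(ir * len(X_min), len(X_maj))
    idx = rng.choice(len(X_maj), size=n_maj, replace=False)
    X = np.vstack([X_min, X_maj[idx]])
    y = np.concatenate([y_min, y_maj[idx]])
    return X, y

# Training sets with IR = 1, 2, ..., 10, as analysed in Fig. 6:
# ir_sets = {ir: build_ir_training_set(X_min, y_min, X_maj, y_maj, ir) for ir in range(1, 11)}
```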
As can be seen from Fig. 6a, under the 10 IRs, the TPR, ACC, and AUC values of the DQN-BSPER model are higher than the F1 values. In Fig. 6b, under the 10 IRs, the TPR, ACC, and AUC values of the DQN-BSPER model are also higher than the F1 values. When \(IR=1\), the four evaluation values of the DQN-BSPER model are relatively low. In Fig. 7a, the box corresponding to the TPR is the longest, whereas the ACC, AUC, and F1 boxes are relatively short. In Fig. 7b, the box for F1 is the shortest, whereas the TPR, ACC, and AUC boxes are relatively long. Overall, for the IF and LE datasets, the boxes of the four EM values of the DQN-BSPER model are very narrow, which demonstrates that their dispersion degree is quite small. Thus, the fluctuations in the TPR, ACC, F1, and AUC values of the DQN-BSPER model are very weak under different imbalance ratios. This finding indicates that the imbalanced class distribution has a limited effect on the CCS performance of the DQN-BSPER model.

8 Conclusions and future works

This paper proposed a DRL model based on balanced stratified prioritized experience replay, namely the DQN-BSPER model. The model optimizes the loss function by combining SER and PER to improve its CCS performance. Further, we introduced four EMs to conduct an empirical analysis of two real-world CCS datasets in P2P lending. In the parameter analysis experiment, we found that the CCS performance of the DQN-BSPER model was the best when \(\gamma =0.1\) and \({k}_{1}/{k}_{2}=1\). This result shows that there is a relatively weak sequence correlation between the CCS dataset samples and that a balanced stratified prioritized mini-batch is more conducive to improving the CCS performance of the DQN-BSPER model. In the second experiment, we found that the performance of the DQN-BSPER model was statistically significantly better than those of the DQN, DQN-SER, DQN-PER, and DQN-IRF models. By analysing the convergence performance of the DQN-BSPER model, we found that SPER technology could not only accelerate the convergence speed of the Q-network but also improve the stability of the network. In addition, in the comparison experiment, we found that the DQN-BSPER model performance was statistically significantly better than those of the seven traditional benchmark classification models. Finally, in the experiment investigating the effects of the imbalanced class distribution on the DQN-BSPER model performance, we found that the imbalanced class distribution had a very limited effect on the CCS performance of the DQN-BSPER model.
The following aspects will be considered in future work. First, we will introduce decision-making methods (Kou et al. 2014, 2021a, b; Wang et al. 2021) into the reward function to improve DRL classification performance. Second, we will attempt to use a network structure that identifies feature importance (Xiao et al. 2020) to improve the interpretability of the deep reinforcement learning model, which is of great significance for enterprise management. Finally, we will combine relationship networks (Kou et al. 2021a, b) with DRL to develop a multi-agent classification system.

Acknowledgements

The financial support from the National Natural Science Foundation of China (Grant No. 72171160), the Guangdong Province Philosophy and Social Science Planning Project (Grant No. GD23YGL24), the Excellent Youth Foundation of Sichuan Province (Grant No. 2020JDJQ0021), the Tianfu Ten-Thousand Talents Program of Sichuan Province (Grant No. 0082204151153), the Humanities and Social Science Youth Foundation of the Ministry of Education of China (Grant No. 23YJCZH088), the Sichuan Science and Technology Program (Grant No. 2023YFQ0018), and the Scientific Research Starting Project of SWPU (Grant No. 2021QHZ020) is gratefully acknowledged.

Declarations

Conflict of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Notations and definitions

\({{\varvec{D}}}_{train}\): The customer credit scoring training set
\({{\varvec{D}}}_{test}\): The customer credit scoring testing set
\({{\varvec{D}}}_{min}\): The minority sample set
\({{\varvec{D}}}_{maj}\): The majority sample set
\({x}_{t}\): The \(t\)th customer sample
\({y}_{t}\): The class label of the \(t\)th customer sample
\({\varvec{S}}\): The state space of the environment
\({\varvec{A}}\): The action space of the agent
\({s}_{t}\): The environment state at the \(t\)th time step
\({a}_{t}\): The action of the agent at the \(t\)th time step
\({r}_{t}\): The reward obtained by the agent according to the reward function \(R\) at the \(t\)th time step
\(\gamma\): The discount factor
\({R}_{t}\): The cumulative reward obtained by the agent from time step \(t\) to the terminal step in each episode
\({e}_{t}\): The experience sample at the \(t\)th time step
\({{\varvec{E}}}_{t}\): The experience sample set at the \(t\)th time step
\({{\varvec{E}}}^{min}\): The minority experience sample set
\({{\varvec{E}}}^{maj}\): The majority experience sample set
\(D\): The replay memory buffer for storing experience samples
\({D}^{min}\): The replay memory buffer for storing minority experience samples
\({D}^{maj}\): The replay memory buffer for storing majority experience samples
\({{\varvec{E}}}_{replay}^{min}\): The experience sample set stored in \({D}^{min}\)
\({{\varvec{E}}}_{replay}^{maj}\): The experience sample set stored in \({D}^{maj}\)
\({d}_{1}\): The number of experience samples in \({{\varvec{E}}}_{replay}^{min}\)
\({d}_{2}\): The number of experience samples in \({{\varvec{E}}}_{replay}^{maj}\)
\({e}_{t}^{min}\): The minority experience sample at the \(t\)th time step
\({e}_{t}^{maj}\): The majority experience sample at the \(t\)th time step
\({\theta }_{t}^{C}\): The parameters of the current Q-network at the \(t\)th time step
\({\theta }_{t}^{T}\): The parameters of the target Q-network at the \(t\)th time step
\(k\): The size of the mini-batch
\({k}_{1}\): The number of experience samples sampled from \({{\varvec{E}}}_{replay}^{min}\)
\({k}_{2}\): The number of experience samples sampled from \({{\varvec{E}}}_{replay}^{maj}\)
\(Z\): The updating frequency of the target Q-network
\(T\): The maximum time step
\(C\): The maximum episode
\(\alpha\): The learning rate of the current Q-network
TP: The number of positive samples predicted as positive
TN: The number of negative samples predicted as negative
FP: The number of negative samples predicted as positive
FN: The number of positive samples predicted as negative

Appendix 2: The process of using the DQN-BSPER model for customer credit scoring in P2P lending


Appendix 3: The pseudo-codes

Algorithm 1 shows the pseudo-code of CCS environment simulation.
Algorithm 2 shows the pseudo-code of the DQN-BSPER model for CCS in P2P lending.
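The pseudo-code itself is provided as figures in the published article and is not reproduced here. For orientation only, the following is a minimal sketch of the replay step that the DQN-BSPER model is built around: two stratified buffers for minority and majority experiences, priority-based (|TD error|) sampling within each buffer, and a mini-batch of \({k}_{1}\) minority plus \({k}_{2}\) majority experiences. All implementation details beyond the parameters listed in Appendix 4 (for example, proportional rather than rank-based prioritization, and the omission of importance-sampling weights) are assumptions rather than the authors' exact algorithm.

```python
import numpy as np

class StratifiedPrioritizedReplay:
    """Sketch of balanced stratified prioritized experience replay: one buffer per class,
    priority-proportional sampling inside each buffer, and a k1 + k2 mini-batch."""

    def __init__(self, capacity=500, eps=1e-6):
        self.buffers = {"min": [], "maj": []}   # each entry: [experience, priority]
        self.capacity, self.eps = capacity, eps

    def add(self, experience, td_error, minority):
        buf = self.buffers["min" if minority else "maj"]
        if len(buf) >= self.capacity:
            buf.pop(0)                           # drop the oldest experience when full
        buf.append([experience, abs(td_error) + self.eps])

    def _sample(self, key, size):
        buf = self.buffers[key]
        priorities = np.array([p for _, p in buf], dtype=float)
        idx = np.random.choice(len(buf), size=min(size, len(buf)), p=priorities / priorities.sum())
        return [buf[i][0] for i in idx]

    def sample_batch(self, k1, k2):
        """Balanced stratified prioritized mini-batch: k1 minority plus k2 majority experiences."""
        return self._sample("min", k1) + self._sample("maj", k2)
```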

Appendix 4: The main parameter settings of DQN-BSPER and the other reinforcement learning models

DQN-BSPER: Initial exploration value = 0.1, final exploration value = 0.01, exploration attenuation = 10,000, \(\left|{D}^{min}\right|=\left|{D}^{maj}\right|=500\), \(C=50\), \(Z=500\), \(k=128\), \(\alpha =0.00025\), \({\beta }_{1}=0.9\), \({\beta }_{2}=0.999\), \(\tau =1e-8\), hidden layers of the neural network = 3, number of neurons in each hidden layer = 50, activation function in hidden layers = ReLU
DQN-SER: Initial exploration value = 0.1, final exploration value = 0.01, exploration attenuation = 10,000, \(\gamma =0.1\), \(\left|{D}^{min}\right|=\left|{D}^{maj}\right|=500\), \(C=50\), \(Z=500\), \(k=128\), \(\alpha =0.00025\), \({\beta }_{1}=0.9\), \({\beta }_{2}=0.999\), \(\tau =1e-8\), hidden layers of the neural network = 3, number of neurons in each hidden layer = 50, activation function in hidden layers = ReLU
DQN, DQN-PER, DQN-IRF: Initial exploration value = 0.1, final exploration value = 0.01, exploration attenuation = 10,000, \(\gamma =0.3\), \(\left|D\right|=1000\), \(C=50\), \(Z=500\), \(k=128\), \(\alpha =0.00025\), \({\beta }_{1}=0.9\), \({\beta }_{2}=0.999\), \(\tau =1e-8\), hidden layers of the neural network = 3, number of neurons in each hidden layer = 50, activation function in hidden layers = ReLU

Appendix 5: The CCS performance of the DQN-BSPER model under 11 values of the discount factor

 
\(\gamma\)    IF dataset                                         LE dataset                                         Average rank
              TPR         ACC         F1          AUC            TPR         ACC         F1          AUC
0             0.7176(11)  0.6196(2)   0.2192(4)   0.6646(7)      0.9817(8)   0.9886(2)   0.9686(3)   0.9859(2)      4.88
0.1           0.7464(8)   0.6296(1)   0.2260(1)   0.6779(1)      0.9838(4)   0.9888(1)   0.9690(1)   0.9857(3)      2.50
0.2           0.7291(10)  0.6106(3)   0.2179(6)   0.6651(6)      0.9798(11)  0.9871(5)   0.9645(5)   0.9842(5)      6.38
0.3           0.7378(9)   0.6071(4)   0.2184(5)   0.6672(5)      0.9827(6)   0.9881(3)   0.9689(2)   0.9856(4)      4.75
0.4           0.7752(6)   0.5786(7)   0.2149(7)   0.6690(3)      0.9846(3)   0.9872(4)   0.9651(4)   0.9862(1)      4.38
0.5           0.7637(7)   0.6009(5)   0.2217(3)   0.6758(2)      0.9838(5)   0.9841(6)   0.9568(6)   0.9840(6)      5.00
0.6           0.7810(4)   0.5621(8)   0.2098(8)   0.6627(8)      0.9802(10)  0.9733(8)   0.9294(8)   0.9760(9)      7.88
0.7           0.7781(5)   0.5938(6)   0.2219(2)   0.6686(4)      0.9879(2)   0.9712(9)   0.9248(9)   0.9677(11)     6.00
0.8           0.8012(2)   0.5421(9)   0.2066(9)   0.6612(9)      0.9821(7)   0.9653(11)  0.9102(11)  0.9718(10)     8.50
0.9           0.7954(3)   0.5316(10)  0.2018(10)  0.6529(11)     0.9885(1)   0.9681(10)  0.9175(10)  0.9761(8)      7.88
1             0.8271(1)   0.5095(11)  0.2006(11)  0.6556(10)     0.9806(9)   0.9765(7)   0.9374(7)   0.9781(9)      7.88

Appendix 6: The CCS performance of the DQN-BSPER model under 11 values of \({{\varvec{k}}}_{1}/{{\varvec{k}}}_{2}\)

 
\({k}_{1}/{k}_{2}\)    IF dataset                                            LE dataset                                            Average rank
                       TPR          ACC          F1           AUC            TPR          ACC          F1           AUC
6/1                    0.9422(1)    0.1829(11)   0.1461(7)    0.5321(6.5)    0.9885(2)    0.9636(11)   0.9069(10)   0.9734(11)    7.44
5/1                    0.9412(2)    0.2064(10)   0.1506(6)    0.5321(6.5)    0.9803(6)    0.9696(10)   0.9063(11)   0.9769(6)     7.19
4/1                    0.8947(3)    0.2362(9)    0.1581(3)    0.5519(5)      0.9825(5)    0.9739(8)    0.9293(9)    0.9812(4)     5.75
3/1                    0.8674(4)    0.3998(8)    0.1557(4)    0.5608(4)      0.9896(1)    0.9780(7)    0.9416(7)    0.9825(3)     4.75
2/1                    0.8096(5)    0.5477(7)    0.1547(5)    0.5827(2)      0.9841(3)    0.9829(6)    0.9644(5)    0.9828(2)     4.38
1                      0.7464(6)    0.6296(6)    0.2260(1)    0.6779(1)      0.9838(4)    0.9888(4)    0.9690(1)    0.9857(1)     3.00
1/2                    0.1830(7)    0.8276(5)    0.1928(2)    0.5694(3)      0.9702(8)    0.9883(5)    0.9617(6)    0.9783(5)     5.13
1/3                    0.0208(11)   0.9271(4)    0.0395(11)   0.5092(10)     0.9786(7)    0.9730(9)    0.9299(8)    0.9752(8)     8.50
1/4                    0.0519(8)    0.9294(3)    0.0874(8)    0.5205(8)      0.9601(9)    0.9908(3)    0.9678(2)    0.9747(9)     6.25
1/5                    0.0403(9)    0.9315(1.5)  0.0711(9)    0.5163(9)      0.9554(10)   0.9911(2)    0.9653(3)    0.9734(10)    6.69
1/6                    0.0317(10)   0.9315(1.5)  0.0576(10)   0.5079(11)     0.9544(11)   0.9913(1)    0.9649(4)    0.9768(7)     6.94

Appendix 7: The customer credit scoring performance of DQN-BSPER and seven traditional benchmark classification models on the IF and LE datasets

  
Dataset       EM     DQN-BSPER   DNN         LDA         LR          NB          DT          KNN         SVM
IF            TPR    0.7464(1)   0.1826(4)   0.3129(2)   0.0367(8)   0.0754(6)   0.1113(5)   0.2084(3)   0.0742(7)
              ACC    0.6296(5)   0.6213(6)   0.9229(3)   0.9256(2)   0.1012(7)   0.8717(4)   0.9287(1)   0.0752(8)
              F1     0.2260(1)   0.0237(7)   0.0555(6)   0.0214(8)   0.1401(2)   0.1070(4)   0.0567(5)   0.1381(3)
              AUC    0.6779(1)   0.5543(4)   0.6201(2)   0.5539(5)   0.5171(7)   0.5198(6)   0.5678(3)   0.5013(8)
IF (ROS)      TPR    0.6672(1)   0.0822(6)   0.1009(5)   0.1011(4)   0.0753(8)   0.1489(2)   0.1095(3)   0.0757(7)
              ACC    0.6343(3)   0.2365(6)   0.5921(4)   0.5883(5)   0.0985(8)   0.8446(1)   0.7719(2)   0.1200(7)
              F1     0.2127(1)   0.1507(5)   0.1890(2)   0.1597(4)   0.1399(8)   0.1806(3)   0.1435(6)   0.1403(7)
              AUC    0.6514(1)   0.5221(6)   0.5320(3)   0.5224(5)   0.5165(7)   0.5421(2)   0.5257(4)   0.5108(8)
IF (RUS)      TPR    0.7418(1)   0.0900(7)   0.1106(3)   0.1096(4)   0.0994(5)   0.0981(6)   0.1110(2)   0.0750(8)
              ACC    0.6243(1)   0.5270(6)   0.5883(4)   0.5917(3)   0.4795(7)   0.5699(5)   0.5919(2)   0.0985(8)
              F1     0.2199(1)   0.1561(6)   0.1958(2)   0.1867(3)   0.1615(5)   0.1679(4)   0.1435(7)   0.1394(8)
              AUC    0.6645(1)   0.5152(7)   0.5357(3)   0.5308(4)   0.5259(5)   0.5212(6)   0.5368(2)   0.5139(8)
LE            TPR    0.9838(1)   0.9824(2)   0.9814(4)   0.9823(3)   0.9745(5)   0.9642(6)   0.9487(7)   0.1943(8)
              ACC    0.9888(1)   0.9881(2)   0.9475(5)   0.9833(4)   0.9343(6)   0.9874(3)   0.9277(7)   0.2569(8)
              F1     0.9690(1)   0.9658(2)   0.8287(5)   0.9513(4)   0.7803(6)   0.9650(3)   0.7578(7)   0.3253(8)
              AUC    0.9857(1)   0.9810(3)   0.9692(5)   0.9815(2)   0.9517(6)   0.9783(4)   0.9368(7)   0.5966(8)
LE (ROS)      TPR    0.9827(1)   0.2322(7)   0.9783(3)   0.9825(2)   0.9691(4)   0.9648(5)   0.6091(6)   0.1972(8)
              ACC    0.9883(1)   0.4075(7)   0.9765(4)   0.9881(2)   0.9339(5)   0.9870(3)   0.8721(6)   0.2703(8)
              F1     0.9674(1)   0.3768(7)   0.9313(4)   0.9672(3)   0.7794(5)   0.9674(2)   0.6913(6)   0.3294(8)
              AUC    0.9860(1)   0.6158(7)   0.9772(4)   0.9852(2)   0.9491(5)   0.9784(3)   0.7810(6)   0.5981(8)
LE (RUS)      TPR    0.9790(2)   0.3266(7)   0.9771(3)   0.9938(1)   0.9720(4)   0.8969(5)   0.6278(6)   0.1932(8)
              ACC    0.9844(1)   0.6321(7)   0.9763(4)   0.9832(3)   0.9340(5)   0.9844(2)   0.8813(6)   0.2516(8)
              F1     0.9575(2)   0.4912(7)   0.9307(4)   0.9632(1)   0.7796(5)   0.9342(3)   0.6913(6)   0.3238(8)
              AUC    0.9823(1)   0.6615(7)   0.9766(3)   0.9818(2)   0.9504(4)   0.9456(5)   0.7939(6)   0.5962(8)
Average rank         1.33        5.63        3.63        3.50        5.63        3.83        4.83        7.63
References
Altman EI (1968) Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J Finance 23(4):589–609
Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J (2003) Benchmarking state-of-the-art classification algorithms for credit scoring. J Oper Res Soc 54(6):627–635
Bastani K, Asgari E, Namavari H (2019) Wide and deep learning for peer-to-peer lending. Expert Syst Appl 134:209–224
Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29
Blumenstock G, Lessmann S, Seow H-V (2022) Deep learning for survival and competing risk modelling. J Oper Res Soc 73(1):26–38
Borgonovo E, Smith CL (2011) A study of interactions in the risk assessment of complex engineering systems: an application to space PSA. Oper Res 59(6):1461–1476
Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159
Cai R, Li H, Wang S, Chen C, Kot A (2020) DRL-FAS: a novel framework based on deep reinforcement learning for face anti-spoofing. IEEE Trans Inf Forensics Secur 16:937–951
Chatterjee M, Namin A-S (2019) Detecting phishing websites through deep reinforcement learning. In: Proceedings of the IEEE 43rd annual computer software and applications conference, 2019. IEEE, pp 227–232
Chen S-Y, Yu Y, Da Q, Tan J, Huang H-K, Tang H-H (2018) Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, 2018. ACM, pp 1187–1196
Crone SF, Finlay S (2012) Instance sampling in credit scoring: an empirical study of sample size and balancing. Int J Forecast 28(1):224–238
Dastile X, Celik T, Potsane M (2020) Statistical and machine learning models in credit scoring: a systematic literature survey. Appl Soft Comput 91:106263
De Moor BJ, Gijsbrechts J, Boute RN (2022) Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management. Eur J Oper Res 301(2):535–545
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30
Ding Y, Ma L, Ma J, Suo M, Tao L, Cheng Y, Lu C (2019) Intelligent fault diagnosis for rotating machinery using deep Q-network based health state classification: a deep reinforcement learning approach. Adv Eng Inform 42:100977
Du N, Li L, Lu T, Lu X (2020) Prosocial compliance in P2P lending: a natural field experiment. Manag Sci 66(1):315–333
Dumitrescu E, Hue S, Hurlin C, Tokpavi S (2022) Machine learning for credit scoring: improving logistic regression with non-linear decision-tree effects. Eur J Oper Res 297(3):1178–1192
Fan C, Zeng L, Sun Y, Liu Y-Y (2020) Finding key players in complex networks through deep reinforcement learning. Nat Mach Intell 2(6):317–324
Fernandes GB, Artes R (2016) Spatial dependence in credit risk and its improvement in credit scoring. Eur J Oper Res 249(2):517–524
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86–92
Gosavi A (2009) Reinforcement learning: a tutorial survey and recent advances. INFORMS J Comput 21(2):178–192
Gunnarsson BR, Vanden Broucke S, Baesens B, Óskarsdóttir M, Lemahieu W (2021) Deep learning for credit scoring: do or don't? Eur J Oper Res 295(1):292–305
Guo Y, Zhou W, Luo C, Liu C, Xiong H (2016) Instance-based credit risk assessment for investment decisions in P2P lending. Eur J Oper Res 249(2):417–426
Hosmer DW Jr, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. Wiley, Hoboken
Iman RL, Davenport JM (1980) Approximations of the critical region of the Fbietkan statistic. Commun Stat Theory Methods 9(6):571–595
Kou G, Peng Y, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12
Kou G, Olgu Akdeniz Ö, Dinçer H, Yüksel S (2021a) Fintech investments in European banks: a hybrid IT2 fuzzy multidimensional decision-making approach. Financ Innov 7(1):39
Kou G, Xu Y, Peng Y, Shen F, Chen Y, Chang K, Kou S (2021b) Bankruptcy prediction for SMEs using transactional data and two-stage multiobjective feature selection. Decis Support Syst 140:113429
Lei K, Zhang B, Li Y, Yang M, Shen Y (2020) Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading. Expert Syst Appl 140:112872
Lessmann S, Baesens B, Seow H-V, Thomas LC (2015) Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. Eur J Oper Res 247(1):124–136
Li H, Xu H (2020) Deep reinforcement learning for robust emotional classification in facial expression recognition. Knowl Based Syst 204:106172
Li Y, Wang X, Djehiche B, Hu X (2020) Credit scoring by incorporating dynamic networked information. Eur J Oper Res 286(3):1103–1112
Lim M, Abdullah A, Jhanjhi N (2021) Performance optimization of criminal network hidden link prediction model with deep reinforcement learning. J King Saud Univ Comput Inf Sci 33(10):1202–1210
Lin E, Chen Q, Qi X (2020) Deep reinforcement learning for imbalanced classification. Appl Intell 5:1–15
Liu Y, Chen Y, Jiang T (2020) Dynamic selective maintenance optimization for multi-state systems over a finite horizon: a deep reinforcement learning approach. Eur J Oper Res 283(1):166–181
Lopez-Martin M, Carro B, Sanchez-Esguevillas A (2020) Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Syst Appl 141:112963
Loyola-González O, Martínez-Trinidad JF, Carrasco-Ochoa JA, García-Borroto M (2016) Study of the impact of resampling methods for contrast pattern based classifiers in imbalanced databases. Neurocomputing 175:935–947
Luo B, Yang Y, Liu D (2018) Adaptive Q-Learning for data-based optimal output regulation with experience replay. IEEE Trans Cybern 48(12):3337–3348
Marqués AI, García V, Sánchez JS (2013) On the suitability of resampling techniques for the class imbalance problem in credit scoring. J Oper Res Soc 64(7):1060–1070
Martinez C, Ramasso E, Perrin G, Rombaut M (2020) Adaptive early classification of temporal sequences using deep reinforcement learning. Knowl Based Syst 190:105290
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M (2013) Playing Atari with deep reinforcement learning. ArXiv preprint arXiv:1312.5602
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Óskarsdóttir M, Bravo C, Sarraute C, Vanthienen J, Baesens B (2019) The value of big data for credit scoring: enhancing financial inclusion using mobile phone data and social network analytics. Appl Soft Comput 74:26–39
Patel D, Hazan H, Saunders DJ, Siegelmann HT, Kozma R (2019) Improved robustness of reinforcement learning policies upon conversion to spiking neuronal network platforms applied to Atari Breakout game. Neural Netw 120:108–115
Petrides G, Moldovan D, Coenen L, Guns T, Verbeke W (2020) Cost-sensitive learning for profit-driven credit scoring. J Oper Res Soc 73(2):1–13
Protopapadakis E, Niklis D, Doumpos M, Doulamis A, Zopounidis C (2019) Sample selection algorithms for credit risk modelling through data mining techniques. Int J Data Min Model Manag 11(2):103–128
Rish I (2001) An empirical study of the naive Bayes classifier. Workshop Empir Methods Artif Intell 3(22):41–46
Schaul T, Quan J, Antonoglou I, Silver D (2015) Prioritized experience replay. ArXiv preprint arXiv:1511.05952
Schnaubelt M (2022) Deep reinforcement learning for the optimal placement of cryptocurrency limit orders. Eur J Oper Res 296(3):993–1006
Serrano-Cinca C, Gutiérrez-Nieto B (2016) The use of profit scoring as an alternative to credit scoring systems in peer-to-peer (P2P) lending. Decis Support Syst 89:113–122
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A et al (2018) A general reinforcement learning algorithm that masters chess, Shogi, and Go through self-play. Science 362(6419):1140–1144
So MM, Thomas LC (2011) Modelling the profitability of credit cards by Markov decision processes. Eur J Oper Res 212(1):123–130
Sun AY (2020) Optimal carbon storage reservoir management through deep reinforcement learning. Appl Energy 278:115660
Sutton R, Barto A (1998) Reinforcement learning: an introduction. MIT Press, Cambridge
Tang Y, Zhang Y-Q, Chawla NV, Krasser S (2008) SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B 39(1):281–288
Trafalis TB, Gilbert RC (2006) Robust classification and regression using support vector machines. Eur J Oper Res 173(3):893–909
Veganzones D, Séverin E (2018) An investigation of bankruptcy prediction in imbalanced datasets. Decis Support Syst 112:111–124
Wang H, Kou G, Peng Y (2021) Multi-class misclassification cost matrix for credit ratings in peer-to-peer lending. J Oper Res Soc 72(4):923–934
Wang Y, Jia Y, Tian Y, Xiao J (2022) Deep reinforcement learning with the confusion-matrix-based dynamic reward function for customer credit scoring. Expert Syst Appl 200:117013
Watkins CJ, Dayan P (1992) Q-learning. Mach Learn 8(3–4):279–292
Wauters M, Vanhoucke M (2017) A nearest neighbour extension to project duration forecasting with artificial intelligence. Eur J Oper Res 259(3):1097–1111
Wilcoxon F (1992) Individual comparisons by ranking methods. In: Breakthroughs in statistics. Springer, Berlin, pp 196–202
Wurman PR, Barrett S, Kawamoto K, MacGlashan J, Subramanian K, Walsh TJ et al (2022) Outracing champion Gran Turismo drivers with deep reinforcement learning. Nature 602(7896):223–228
Xia Y, Zhao J, He L, Li Y, Niu M (2020) A novel tree-based dynamic heterogeneous ensemble method for credit scoring. Expert Syst Appl 159:113615
Xiao J, Zhou X, Zhong Y, Xie L, Gu X, Liu D (2020) Cost-sensitive semi-supervised selective ensemble model for customer credit scoring. Knowl Based Syst 189:105118
Xiao J, Wang Y, Chen J, Xie L, Huang J (2021) Impact of resampling methods and classification models on the imbalanced credit scoring problems. Inf Sci 569:508–526
Yeo B, Grant D (2018) Predicting service industry performance using decision tree analysis. Int J Inf Manag 38(1):288–300
Zhang G, Hu W, Cao D, Liu W, Huang R, Huang Q et al (2021) Data-driven optimal energy management for a wind–solar–diesel-battery-reverse osmosis hybrid energy system using a deep reinforcement learning approach. Energy Convers Manag 227:113608
Zhao D, Chen Y, Lv L (2016) Deep reinforcement learning with visual attention for vehicle classification. IEEE Trans Cogn Dev Syst 9(4):356–367