Published in: Journal of Intelligent Information Systems 3/2023

Open Access 16.05.2023 | Research

A novel approach for software defect prediction using CNN and GRU based on SMOTE Tomek method

Authors: Nasraldeen Alnor Adam Khleel, Károly Nehéz



Abstract

Software defect prediction (SDP) plays a vital role in enhancing the quality of software projects and reducing maintenance-based risks through the ability to detect defective software components. SDP refers to using historical defect data to construct a relationship between software metrics and defects via diverse methodologies. Several prediction models, such as machine learning (ML) and deep learning (DL), have been developed and adopted to recognize software module defects, and many methodologies and frameworks have been presented. Class imbalance is one of the most challenging problems these models face in binary classification. However, when the distribution of classes is imbalanced, the accuracy may be high, but the models cannot recognize data instances in the minority class, leading to weak classifications. So far, little previous research has addressed the problem of class imbalance in SDP. In this study, a data sampling method is introduced to address the class imbalance problem and improve the performance of ML models in SDP. The proposed approach is based on a convolutional neural network (CNN) and gated recurrent unit (GRU) combined with the synthetic minority oversampling technique plus the Tomek link (SMOTE Tomek) to predict software defects. To establish the efficiency of the proposed models, the experiments have been conducted on benchmark datasets obtained from the PROMISE repository. The experimental results have been compared and evaluated in terms of accuracy, precision, recall, F-measure, Matthew's correlation coefficient (MCC), the area under the ROC curve (AUC), the area under the precision-recall curve (AUCPR), and mean square error (MSE). The experimental results showed that the proposed models predict the software defects more effectively on the balanced datasets than on the original datasets, with an improvement of up to 19% for the CNN model and 24% for the GRU model in terms of AUC. We compared our proposed approach with existing SDP approaches based on several standard performance measures. The comparison results demonstrated that the proposed approach significantly outperforms existing state-of-the-art SDP approaches on most datasets.
Notes
These authors contributed equally to this work: Nasraldeen Alnor Adam Khleel, Károly Nehéz.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Determining source code defects is usually tricky due to software projects' colossal code bases. The importance and challenges of defect prediction have made it an active research area in software engineering (Dam et al., 2018). Defects in software are often challenging to detect or identify, and developers spend significant time locating and fixing them. The software life cycle includes many activities to identify source code defects, such as design reviews, code inspections, integration tests, function tests, unit tests, etc. (Tong et al., 2018). Early detection of a defect in software projects during the development phase helps allocate testing resources reasonably, determine the testing priority of different software modules, and improve the effectiveness of the software development process (Kumar & Singh, 2021). SDP is a process for predicting source code defects using tools or techniques based on historical data. SDP approaches can be divided into within-project defect prediction (WPDP), cross-project defect prediction (CPDP) for a similar dataset, and cross-project defect prediction (CPDP) for a heterogeneous dataset (Kalaivani & Beena, 2018; Li et al., 2018).
In this study, we develop our models based on the within-project defect prediction (WPDP) approach. In the WPDP approach, a prediction model can be built by collecting historical data from a software project and predicting defects in the same project. WPDP performs best if there is enough historical data to train the model (Omri et al., 2020). There are two ways in which previous studies have attempted to build accurate SDP models: the first approach is to manually design new features or new sets of features to represent defects more effectively, while the second approach involves applying new and improved ML-based classifiers. Several models have been proposed for SDP based on the second approach (ML-based classifiers). However, there is still a need to develop accurate defect detection models or detectors and robust software metrics to distinguish between defective and non-defective software modules. The latest studies leverage manually designed software metrics such as Halstead features, McCabe features, C.K. features, MOOD features, etc., to build classifiers.
Recently, DL algorithms have been adopted to improve research tasks in software engineering, especially in SDP (Liang et al., 2019; Omri et al., 2020). DL algorithms differ from classical artificial neural networks in one critical aspect: they contain many hidden layers (Ferenc et al., 2020; Koay et al., 2022). DL is a type of ML that allows computational models consisting of multiple processing layers to learn data representations with various levels of abstraction. DL architecture has been widely applied in many fields to solve detection, classification, and prediction problems (Zhu et al., 2020). DL has drawn more and more attention because of its robust feature learning capability and has been successfully used in many domains, such as speech recognition, image classification, etc. The CNN and GRU models are the most popular DL architectures designed to solve the problem of long-term dependencies and gradient vanishing. These models can recognize longer sequences of time series data to provide high predictive performance in SDP (Tong et al., 2018).
Unfortunately, studies of SDP face a big challenge: the class imbalance problem. When there is an uneven distribution of classes in the training data set, the data is imbalanced. The class imbalance problem means that the number of non-defective modules (majority class) is much larger than that of defective modules (minority class). In binary classification, class imbalance biases performance towards the majority class. Most ML techniques predict better when the numbers of instances of each class are roughly equal. This problem severely hinders the efficiency of these models and produces imbalanced false-positive and false-negative results (Lango & Stefanowski, 2018). This study selects imbalanced datasets from the public PROMISE repository for experimental purposes (Chen et al., 2015; Deng et al., 2020a; Phan & Nguyen, 2017). Several experiments in previous studies (Deng et al., 2020a; Khuat & Le, 2020; Kumar & Sathyanarayana, 2015; Miholca et al., 2018) were conducted on these datasets using many ML models; most of the results were very poor because of the class imbalance problem. Very few of these studies are based on CNN and GRU models. Moreover, to our knowledge, there is no experiment using CNN and GRU combined with SMOTE Tomek in the literature.
To bridge these gaps, this study aims to apply data balancing methods to address the problem of class imbalance and investigate the impact of data balancing methods on the performance of ML models in detecting software defects. Firstly, we apply data balancing methods to balance the training set. Secondly, we train and test the proposed models using the balanced training set, and finally, we evaluate the results based on many performance measures. In summary, the goal and main contributions of our study are summarized as follows:
(i)
In this study, we propose a novel approach that combines CNN and GRU with SMOTE Tomek method to predict software defects.
 
(ii)
We evaluate the performance of the proposed approach and compare it with the traditional ML model (RF) as the baseline model and compare it with the existing approaches used in SDP.
 
(iii)
We show that the performance of ML models in SDP can be significantly improved when balancing the data set by applying data balancing methods.
 
The structure of this paper is organized as follows. Section 2 presents a discussion on related work. Section 3 presents background on software defect prediction, convolutional neural networks, and gated recurrent unit. Section 4 presents the hypothesis and research questions. Section 5 presents the motivation for our proposed work. After that, our research methodology is presented in Section 6. Section 7 presents the experimental results and discussion. Section 8 presents the implication of the findings. Section 9 presents threats to validity, followed by conclusions in the last section (Section 10).

2 Related work

The prediction of defects in software systems is significant, and there is great interest in developing novel high-performance software defect predictors. SDP models aim to improve the quality of software application systems (Khuat & Le, 2020). Many models have been constructed to recognize the defects in software modules using artificial intelligence and statistical methods (Cao, 2020; Dam et al., 2018; Deng et al., 2020a; Qiu et al., 2019; Pan et al., 2019; Tong et al., 2020; Munir et al., 2021). Tong et al. (2018) proposed a novel approach for SDP using deep representations combined with a two-stage ensemble to address the class imbalance problem. The experiments were performed on 12 NASA datasets, and results were evaluated based on F-measure, AUC, and MCC. The experimental results showed that (i) deep representations are promising for SDP, (ii) the two-stage ensemble is more effective for addressing the class imbalance problem in SDP compared with classic ensemble learning methods, and (iii) the proposed approach is significantly effective for SDP. Liang et al. (2019) proposed Seml, a novel framework that combines word embedding and LSTM for SDP. The model was evaluated based on eight open-source projects. The experimental results showed that Seml outperforms three state-of-the-art defect prediction approaches on most datasets for both within-project and cross-project defect prediction. Ferenc et al. (2020) proposed a methodology for adapting DNNs for bug prediction. The methodology was applied to a large bug dataset (containing 8780 bugged and 38,838 not bugged Java classes). The results demonstrate that DL with static metrics can indeed boost prediction accuracies. Zhu et al. (2020) proposed a novel just-in-time defect prediction model named DAECNN-JDP based on a denoising autoencoder and CNN. The model was evaluated based on six large open-source projects and compared with 11 baseline models. The experimental results showed that the proposed model outperforms these baseline models. Deng et al. (2020a) proposed a novel LSTM method to perform SDP; their method can automatically learn semantic and contextual information from the program's ASTs. The experiment was performed on several open-source projects, showing that the proposed LSTM method is superior to the state-of-the-art methods. Khuat and Le (2020) conducted an empirical study to evaluate the importance of sampling techniques in SDP. The experimental results indicated the positive effects of combining sampling techniques with ensemble learning models. This method addressed the class imbalance problem and achieved high prediction accuracy. Miholca et al. (2018) presented a supervised classification approach named HyGRAR, a nonlinear hybrid model that combines gradual relational association rule mining and artificial neural networks to predict software defects. The experiments were conducted using ten open-source datasets. The experimental results showed the excellent performance of the proposed classifier and better performance than most of the previously proposed classifiers. This method achieved high prediction accuracy. Kumar and Sathyanarayana (2015) developed a hybrid neural network model with object-oriented and C.K. metrics for software fault prediction; an adaptive genetic algorithm was used for ANN optimization. The proposed model was tested on PROMISE datasets. The experimental results showed better performance compared to major existing schemes.
Qiu et al. (2019) proposed a novel approach using a transfer CNN model to mine transferable semantic features for CPDP tasks. The experiments were conducted based on ten benchmark projects with 90 pairs of CPDP tasks. Their results showed that the proposed model is superior to the reference methods. Pan et al. (2019) proposed an improved CNN model for WPDP and compared the experimental results with those of existing CNN studies. An experiment was performed using a 30-repetition holdout validation and a 10 * 10 cross-validation. Their results showed that the CNN model significantly outperformed the state-of-the-art ML models for WPDP. Tong et al. (2020) proposed a novel credibility-based imbalance-boosting method to address the class imbalance problem in software defect proneness prediction. Experiments were performed on datasets obtained from the NASA and PROMISE repositories. The proposed method was compared with several approaches. The experimental results showed that the proposed method is a more promising alternative for addressing the class imbalance problem than previous methods. Munir et al. (2021) proposed a new framework based on GRU and LSTM for SDP. The experiments were evaluated based on 119,989 C/C++ programs in Code4Bench. The proposed method was compared with several approaches. The experimental results demonstrated that the proposed method performs better regarding recall, precision, accuracy, and F1 metrics. Li et al. (2017) proposed a framework based on the programs' abstract syntax trees called Defect Prediction via CNN. The model was evaluated based on seven open-source projects in terms of F-measure. The experimental results showed that the model improves on the state-of-the-art method by 12% on average. Kukkar et al. (2019) proposed a novel DL model for multiclass severity classification, called bug severity classification using a CNN and random forest with boosting, based on five open-source projects. Their results prove that the proposed model enhances the performance of bug severity classification over state-of-the-art techniques. Pandey et al. (2020) proposed a new method using deep representation and ensemble learning (BPDET) for software bug prediction; ensemble learning was applied to address the class imbalance problem. An experiment was performed based on 12 datasets from the PROMISE repository. The experimental results showed that the proposed method outperformed existing state-of-the-art techniques. This method addressed the class imbalance problem and achieved high prediction accuracy. Yang et al. (2018) proposed an ANN model with automated parameter tuning techniques to optimize defect prediction models. The model was evaluated based on 30 datasets downloaded from the Tera-PROMISE repository. Their results showed that the proposed model's performance improved after tuning parameter settings. The authors suggested that researchers should pay attention to tuning parameter settings by Caret for ANNs instead of using suboptimal default settings if they select ANNs for training models in future defect prediction studies. Zhao et al. (2019) proposed a novel SDP model called Siamese parallel fully-connected networks (SPFCNN), combining the advantages of Siamese networks and DL. The authors compared the proposed model with state-of-the-art SDP models using six datasets from the NASA repository. The experimental results showed that the proposed model achieves significantly higher performance than the benchmarked SDP approaches.
Farid et al. (2021) proposed a hybrid model using bidirectional long short-term memory and CNN to predict software defects. The proposed model was evaluated using seven open-source Java projects from the PROMISE datasets. Their results showed that the proposed model is accurate for predicting software defects. Fan et al. (2019) presented an SDP framework via an attention-based RNN. The models were evaluated based on an open-source Apache Java project, using F1-measure and AUC. The experimental results demonstrated that the proposed model improves the F1-measure by 14% and AUC by 7% compared with the state-of-the-art methods. Majd et al. (2020) proposed SLDeep, a technique for statement-level SDP that uses LSTM as the learning model, based on more than 100,000 C/C++ projects. The results showed that the proposed model is effective at statement-level SDP and can be adopted. Feng et al. (2021) investigated the role of SMOTE-based and stable SMOTE-based oversampling techniques in improving SDP. The approach was evaluated based on four common classifiers across 26 datasets from the PROMISE repository. This method addressed the class imbalance problem and achieved high prediction accuracy. The experimental analysis showed that the performance of stable SMOTE-based oversampling techniques is more stable and better than that of SMOTE-based oversampling techniques.
After reviewing previous studies in SDP, we noticed that most proposed methods ignore the class imbalance problem. According to studies that addressed the class imbalance problem (Feng et al., 2021; Khuat & Le, 2020; Pandey et al., 2020; Tong et al., 2018, 2020), the authors point out that data balancing methods are essential in improving SDP accuracy. So, the primary lesson from recent studies is that ML combined with data balancing methods can improve and increase prediction accuracy. Therefore, this work aims to address the class imbalance problem to enhance the efficiency of the proposed models.

3 Background

This section briefly introduces software defect prediction, convolutional neural networks, and gated recurrent unit.

3.1 Software defect prediction

Software defect prediction (SDP) is one of the most popular research areas in software engineering and a vital activity during software development and maintenance (Omri et al., 2020). The goal of SDP is to improve software quality and reduce the cost of software development by identifying and fixing defects early in the development cycle (Ferenc et al., 2020). Software defects are errors or bugs in software code that can cause the software to behave unexpectedly or unintentionally. These defects can result in software crashes, security vulnerabilities, data loss, and other negative consequences (Tong et al., 2018). Identifying and fixing defects early in the development process can save time and money by avoiding costly rework and reducing the risk of software failures. Bug reports are basic software development tools describing software defects, especially in open-source software (Pandey et al., 2020). To guarantee software quality, many projects use bug reports to gather and record the reported bugs. The defects are classified into two classes: intrinsic bugs, which are introduced by one or more specific changes to the source code, and extrinsic bugs, which are introduced by changes not recorded in the version control system. Several techniques are used in SDP, including statistical models, ML algorithms, and data mining techniques. These techniques use historical data on software defects, such as defect reports and code changes, to predict the likelihood of future defects (Kalaivani & Beena, 2018). Based on the type of data and the context of the prediction, SDP can be categorized into the following types:
I.
The within-project defect prediction (WPDP) approach involves using historical data to predict defects within a single project. WPDP approach uses data from the same project to train the prediction models, such as source code metrics, bug reports, and code reviews.
 
II.
Cross-project defect prediction (CPDP) approach for a similar dataset: This approach involves predicting defects in a new project using historical data from similar projects. The CPDP approach uses data from one or more similar projects to train the prediction models and then apply them to the new project.
 
III.
Cross-project defect prediction (CPDP) approach for a heterogeneous dataset: This approach involves predicting defects in a new project using historical data from projects that differ in their development context or characteristics. The CPDP approach uses data from one or more heterogeneous projects to train the prediction models and then apply them to the new project.
 
Each of these SDP approaches has its advantages and limitations. WPDP is usually more accurate since it is based on the specific context of the predicted project, but it requires a significant amount of historical data from the same project. CPDP for a similar dataset can be useful when there is not enough data for WPDP. Still, it assumes that the new project has a similar development context to the projects used for training. CPDP for a heterogeneous dataset can be challenging since the development contexts of the projects used for training and the new project may differ significantly. Still, it can be useful when there is insufficient data for WPDP or CPDP for a similar dataset (Li et al., 2018; Omri et al., 2020).

3.2 Convolutional neural network

A convolutional neural network (CNN) is a feedforward neural network that processes data with a known, grid-like topology. It can be used for both supervised and unsupervised learning. CNN was mainly designed for image processing but has achieved tremendous success in practical applications, including speech recognition, natural language processing, etc. (Cao et al., 2020; Zhu et al., 2020). Our CNN model is inspired by the typical CNN architecture used in image classification and consists of a feature extraction part and a classification part, as shown in Fig. 1. These parts consist of convolution, batch normalization, and max pooling layers, which constitute the hidden layers of the architecture. The convolutional layer performs convolution operations based on the specified filter and kernel parameters and passes the weighted activations to the next layer, while the max pooling layer reduces the dimension of the feature space. Batch normalization is used to mitigate the effect of different input distributions for each training mini-batch to improve training. Activation functions enable the CNN model to be trained quickly and accurately (Phan & Nguyen, 2017). There are many activation functions used in CNNs, such as the Sigmoid, Rectified Linear Unit (ReLU), and hyperbolic tangent (Tanh) functions (Li et al., 2017; Pan et al., 2019). In this model, we use two activation functions: the ReLU function for the input and hidden layers and the Sigmoid function for the output layer, as shown in the equations below.
$${{h}_{i}}^{m}=ReLU\left({{W}_{i}}^{m-1}\times {{V}_{i}}^{m-1}+{b}^{m-1}\right)$$
(1)
where \({{h}_{i}}^{m}\) represents the convolutional layer, \({{W}_{i}}^{m-1}\) represents the weights of neurons, \({{V}_{i}}^{m-1}\) represents the nodes, and \({b}^{m-1}\) represents the bias layer.
$$S\left(x\right)=\frac{1}{1+ {e}^{-\left({\sum }_{i}{W}_{i}{X}_{i}+b\right)}}$$
(2)
where \({X}_{i}\) is the input, \({W}_{i}\) is the weight of the input, e is Euler's number (approximately 2.718), and b is the bias.
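As a minimal illustrative sketch (not the exact architecture of Fig. 1), the feature-extraction and classification structure described above, with convolution, batch normalization, max pooling, ReLU, and a sigmoid output, could look as follows in Keras; the layer sizes and the treatment of the 20 static metrics as a one-dimensional sequence are assumptions.

```python
# Hypothetical CNN stack mirroring the description above (convolution + batch
# normalization + max pooling for feature extraction, dense layers for classification).
from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Input(shape=(20, 1)),                           # e.g., 20 static metrics as a 1-D sequence
    layers.Conv1D(32, kernel_size=3, activation="relu"),   # convolution with ReLU, cf. Eq. (1)
    layers.BatchNormalization(),                           # stabilizes per-mini-batch input distributions
    layers.MaxPooling1D(pool_size=2),                      # reduces the feature-space dimension
    layers.Flatten(),
    layers.Dense(10, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                 # sigmoid output, cf. Eq. (2)
])
cnn.summary()
```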

3.3 Gated recurrent unit

A gated recurrent unit (GRU) network is an optimized structure of the recurrent neural network (RNN). Due to the problem of long-term dependencies that arises when the input sequence is too long, an RNN cannot guarantee a long-term nonlinear relationship, meaning the learning sequence suffers from the gradient vanishing and gradient explosion phenomena. Many optimization theories and improved algorithms have been introduced to solve this problem, such as GRU networks, long short-term memory (LSTM) networks, bidirectional LSTM, echo state networks, and independent RNNs (Cao, 2020). The GRU network aims to solve the long-term dependence and gradient disappearance problems of RNNs. A GRU network is similar to an LSTM network with a forget gate but has fewer parameters than LSTM. The GRU network uses the update and reset gates to optimize the learning mechanism, as shown in Fig. 2. The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future, and the reset gate helps the model decide how much of the past information to forget (Li et al., 2019). The update gate in the GRU network is calculated as shown in the equation below.
$$\mathrm{z}\left(\mathrm{t}\right)=\upsigma \left(\mathrm{W}\left(\mathrm{z}\right).\left[\mathrm{h}\left(\mathrm{t}-1\right),\mathrm{ x}\left(\mathrm{t}\right)\right]\right)$$
(3)
the \(\mathrm{z}\left(\mathrm{t}\right)\) is the update gate function, \(\mathrm{h}\left(\mathrm{t}-1\right)\) is the output of the previous neuron, \(\mathrm{x}\left(\mathrm{t}\right)\) is the input of the current neuron, \(\mathrm{W}\left(\mathrm{z}\right)\) represents the weight of the update gate, and \(\upsigma\) represents the sigmoid function. The reset gate model in the GRU neural networks is calculated as shown in the equation below.
$$\mathrm{r}\left(\mathrm{t}\right)=\upsigma \left(\mathrm{W}\left(\mathrm{r}\right).\left[\mathrm{h}\left(\mathrm{t}-1\right),\mathrm{ x}\left(\mathrm{t}\right)\right]\right)$$
(4)
\(\mathrm{r}\left(\mathrm{t}\right)\) is the reset gate function, \(\mathrm{h}\left(\mathrm{t}-1\right)\) represents the output of the previous neuron, \(\mathrm{x}\left(\mathrm{t}\right)\) is the input of the current neuron, \(\mathrm{W}\left(\mathrm{r}\right)\) represents the weight of the reset gate, and \(\upsigma\) is the sigmoid function. The output value of the GRU hidden layer is shown in the equation below.
$$\tilde{\mathrm{h}}\left(\mathrm{t}\right)=\mathrm{tanh}\left(\mathrm{Wh}.\left[\mathrm{r}\left(\mathrm{t}\right)*\mathrm{h}\left(\mathrm{t}-1\right),\mathrm{ x}\left(\mathrm{t}\right)\right]\right)$$
(5)
\(\tilde{\mathrm{h}}\left(\mathrm{t}\right)\) is the candidate output value to be determined in this neuron, \(\mathrm{h}\left(\mathrm{t}-1\right)\) is the output of the previous neuron, \(\mathrm{x}\left(\mathrm{t}\right)\) represents the input of the current neuron, \(\mathrm{Wh}\) represents the weight of the candidate hidden state, and tanh() is the hyperbolic tangent function. \(\mathrm{r}\left(\mathrm{t}\right)\) is used to control how much memory needs to be retained. The hidden layer information of the final output is shown in the equation below.
$$\mathrm{h}\left(\mathrm{t}\right)=\left(1-\mathrm{z}\left(\mathrm{t}\right)\right)*\mathrm{h}\left(\mathrm{t}-1\right)+\mathrm{z}\left(\mathrm{t}\right)*\tilde{\mathrm{h}}\left(\mathrm{t}\right)$$
(6)
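To make Eqs. (3)-(6) concrete, the following toy NumPy sketch computes one GRU step; the weight shapes, random initialization, and omission of bias terms are illustrative assumptions, not the trained model's values.

```python
# Toy single-step GRU cell following Eqs. (3)-(6); no biases, random weights.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    concat = np.concatenate([h_prev, x_t])                        # [h(t-1), x(t)]
    z_t = sigmoid(Wz @ concat)                                    # update gate, Eq. (3)
    r_t = sigmoid(Wr @ concat)                                    # reset gate, Eq. (4)
    h_cand = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]))    # candidate state, Eq. (5)
    return (1 - z_t) * h_prev + z_t * h_cand                      # new hidden state, Eq. (6)

rng = np.random.default_rng(0)
hidden, features = 4, 3
x_t, h_prev = rng.normal(size=features), np.zeros(hidden)
Wz, Wr, Wh = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
h_t = gru_step(x_t, h_prev, Wz, Wr, Wh)
print(h_t)
```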

4 Hypothesis and research questions

In this section, we will discuss the hypothesis and motivation along with research questions and then mention some existing studies addressing the class imbalance problem.
Our hypothesis in this study is that if data balancing methods are applied to balance the original datasets, the classification performance of the proposed models in SDP will be better. To investigate our hypothesis, we used a paired t-test to find out whether there was a statistically significant difference between our models on the original and balanced datasets. The formula for the paired t-test is shown in Eq. 7 below. To statistically prove the impact of data balancing methods on the performance of ML algorithms, the hypotheses are formed as follows:
  • H0: There is no difference in the accuracy of models when there are no data balancing methods and when the data balancing methods are used.
  • H1: There is a difference in the accuracy of models when there are no data balancing methods and when the data balancing methods are used.
    $$t = \frac{m}{s/\sqrt{n}}$$
    (7)
where m is the mean of the differences, n is the sample size (i.e., the number of pairs), and s is the standard deviation of the differences.
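As a small illustration of Eq. (7), the paired t-test can be run with SciPy on per-dataset accuracies; the vectors below reuse the CNN accuracies reported later in Tables 5 and 6, and the function call is one common way to perform the test rather than the authors' exact script.

```python
# Paired (dependent) t-test on per-dataset accuracies, cf. Eq. (7).
import numpy as np
from scipy import stats

acc_original = np.array([0.83, 0.82, 0.90, 0.96, 0.95, 0.94])   # CNN, original datasets (Table 5)
acc_balanced = np.array([0.85, 0.84, 0.95, 0.97, 0.97, 0.95])   # CNN, balanced datasets (Table 6)

t_stat, p_value = stats.ttest_rel(acc_balanced, acc_original)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# H0 is rejected at the 0.05 level when p_value < 0.05.
```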
Based on our hypothesis, this study aims to understand the impact of data balancing methods on the performance of ML algorithms in SDP. In particular, we aim to address the following research questions.
  • RQ1: Do data balancing methods improve the accuracy of ML models in SDP?
This RQ investigates whether data balancing methods improve the accuracy of ML models in SDP.
  • RQ2: Does the proposed approach outperform the state-of-the-art approaches in SDP?
This RQ aims to investigate the performance of the proposed approach in SDP compared to state-of-the-art approaches.
The motivation for the above research questions relates to the importance of applying data balancing methods in SDP studies. According to the latest research on SDP, applying data balancing methods is important in predicting software defects using ML algorithms to ensure that the model is not biased toward the majority class and can accurately predict defective and non-defective modules. Some studies in SDP that applied data balancing methods to address the class imbalance problem revealed that data balancing methods have an important role in improving the accuracy of ML models in SDP. Tong et al. (2018, 2020) prove in their work that the two-stage ensemble and credibility-based imbalance-boosting methods are more effective for addressing the class imbalance problem in SDP than classic ensemble learning methods. Khuat and Le (2020) prove in their work that combining sampling techniques with ensemble learning models has positive effects on improving the accuracy of SDP. Pandey et al. (2020) prove in their work that combining deep representation with ensemble learning positively enhances the accuracy of software bug prediction. Feng et al. (2021) prove in their work that stable SMOTE-based oversampling techniques are more stable and better than SMOTE-based oversampling techniques.

5 Motivation

According to the literature, existing SDP studies suffer from the class imbalance problem. So, several reasons motivate us to apply data balancing methods in predicting software defects using ML algorithms:
i.
Improve the performance of the ML models: Imbalanced datasets can lead to biased ML models that perform poorly on the minority class. Data balancing methods such as the synthetic minority oversampling technique plus the Tomek link (SMOTE Tomek) can help improve the performance of the ML model on the minority class.
 
ii.
Better feature representation: Balancing the dataset can help the model learn better feature representations for the minority class. This can lead to better discrimination between defective and non-defective samples and improved model performance.
 
iii.
Reduce overfitting: Imbalanced datasets can lead to overfitting, where the model learns to over-emphasize the majority class and ignore the minority class. Data balancing methods can help reduce overfitting by balancing the dataset and making it easier for the model to learn from the minority class samples.
 

6 Proposed methodology

In this section, we present our proposed methodology for SDP using novel ML models (CNN and GRU) combined with a data sampling method (the SMOTE Tomek method). We acquired the datasets from the PROMISE repository. We applied data pre-processing techniques to deal with problems such as noise and unwanted outliers, missing values, feature type conversion, and normalization. We also used feature selection techniques to choose the features most relevant to the target class. Then, we applied data balancing methods to balance the training set. The datasets are split into training and testing sets to train and test the proposed models. Finally, we built and evaluated our models based on many standard performance measures. The steps taken are described below: the benchmark datasets used, the software metrics used, data pre-processing, feature selection, dataset balancing, and the performance measures used. Figure 3 illustrates the whole workflow of the proposed SDP methodology, where each step is described in the following sections.

6.1 Benchmark datasets and software metrics

To verify the validity of the proposed approach, we selected six open-source Java projects from the PROMISE dataset (www.kaggle.com/datasets/nazgolnikravesh/software-defect-prediction-dataset). All six projects' source code and the corresponding PROMISE data are public (Deng et al., 2020a; Farid et al., 2021; Xia et al., 2016). These projects cover applications such as XML parsers, text search engine libraries, and data transport adapters, and they have traditional static metrics for each Java file. The selection of projects was based on the percentage of data imbalance in them. To guarantee the generality of the evaluation results, the experimental datasets consist of projects with different sizes and defect rates (across the six projects, the maximum number of instances is 965 and the minimum is 205; the minimum defect rate is 2.23% and the maximum is 92.19%). The reasons for selecting these datasets are that (i) the PROMISE datasets are derived from common platform data, are publicly available for different domains, and are considered baseline datasets for SDP studies; (ii) these datasets are freely available (open source) and have public properties, which is beneficial for using them directly and verifying the performance of models, and researchers can use them to verify, compare, and iterate their studies; and (iii) these datasets are imbalanced, allowing us to apply and assess our proposed method for addressing the class imbalance problem (Feng et al., 2021; Khuat & Le, 2020). Table 1 shows the essential information of the selected projects, including project name, project version, number of instances, and defect rate (the percentage of defective instances).
Table 1  Description of the PROMISE datasets that we have chosen

Project Name    Project Version    # Of Instances    Defect Rate %
ant             1.7                745               22.28%
camel           1.6                965               19.48%
ivy             2.0                352               11.36%
jedit           4.3                492               2.23%
log4j           1.2                205               92.19%
xerces          1.4                588               74.31%
Software metrics play the most vital role in building a prediction model to improve software quality by predicting as many software defects as possible. Software metrics in the context of SDP are considered independent variables. Many previous researchers have pointed out that there is a relationship between software metrics and defect predictions (Kumar & Singh, 2021). Generally, the software metrics used in SDP can be divided into static code metrics and process metrics. Static code metrics represent how the source code is complex and include information about the software codes depending on the type of coding; process metrics represent the complex development process from some values such as developer count, time, effort, and cost (Li et al., 2018). In 1976, McCabe released the first static code metrics standard; in 1977, Halstead developed a new metric standard. Some practitioners use this metric as an indicator of defect-proneness level (Öztürk, 2017). The primary studies use software metrics as independent variables for measuring the quality of software modules. Several researchers used McCabe and Halstead metrics as independent variables in SDP. This study relies on the McCabe and Halstead metrics as independent variables. Table 2 shows the traditional static code metrics contained in the PROMISE repository, and for the descriptions, the readers are referred to (Xia et al., 2016).
Table 2  List of 20 traditional static metrics of PROMISE. Descriptions were given in (Xia et al., 2016)

dit: The maximum distance from a given class to the root of an inheritance tree
noc: Number of children of a given class in an inheritance tree
cbo: Number of classes that are coupled to a given class
rfc: Number of distinct methods invoked by code in a given class
lcom: Number of method pairs in a class that do not share access to any class attributes
lcom3: Another version of the lcom metric, proposed by Henderson–Sellers
npm: Number of public methods in a given class
loc: Number of lines of code in a given class
dam: The ratio of the number of private/protected attributes to the total number of attributes in a given class
moa: Number of attributes in a given class that are of user-defined types
mfa: Number of methods inherited by a given class divided by the total number of methods that can be accessed by the member methods of the given class
cam: The ratio of the sum of the number of different parameter types of every method in a given class to the product of the number of methods in the given class and the number of different method parameter types in the whole class
ic: Number of parent classes that a given class is coupled to
cbm: Total number of new or overwritten methods that all inherited methods in a given class are coupled to
amc: The average size of methods in a given class
ca: Afferent coupling, which measures the number of classes that depend on a given class
ce: Efferent coupling, which measures the number of classes that a given class depends on
max_cc: The maximum McCabe's cyclomatic complexity (CC) score of methods in a given class
avg_cc: The arithmetic mean of McCabe's cyclomatic complexity (CC) scores of methods in a given class

6.2 Data pre-processing and feature selection

Pre-processing the collected data is one of the critical stages before constructing the model. To generate a good model, data quality needs to be considered. Data pre-processing is a group of techniques applied to the data to remove noise and unwanted outliers from the data set, deal with missing values, convert feature types, etc., to improve data quality before building the model (Farid et al., 2021; Miholca et al., 2018; Zhao et al., 2019). In addition, normalization is necessary to convert the values into scaled values (scaling numeric variables into the range of 0 to 1) to increase the model's efficiency. Min–Max normalization is a simple and easy-to-implement technique. It can preserve the shape of the original distribution of the data because it scales and shifts the data without changing its relative position. Normalizing the data using Min–Max normalization can improve the convergence of ML models. It helps reduce the data's range and makes it easier for the optimization algorithms to find the optimal weights. Further, it reduces the impact of outliers by scaling the data to a fixed range (Qiao et al., 2020). Therefore, the data set was normalized using Min–Max normalization. The formula for calculating the normalized score is given in (8). Feature selection is crucial in selecting the most discriminative features from the feature list using appropriate feature selection methods (Agarwal & Tomar, 2014; Li et al., 2018). The goal of feature selection is to choose the features most relevant to the target class from high-dimensional features and remove redundant and uncorrelated features (Shippey et al., 2019; Zhao et al., 2018). Feature selection methods are categorized into three categories: filter methods, wrapper methods, and embedded methods. Each method has rules for selecting the most relevant features as independent variables for training ML models (Jain & Saha, 2021). In this study, our models were based on embedded methods because these methods integrate feature selection into the model training process.
$${x}_{i}^{\prime}=\frac{{x}_{i}-\mathrm{min}\left(x\right)}{\mathrm{max}\left(x\right)-\mathrm{min}\left(x\right)}$$
(8)
where max(x) and min(x) represent the maximum and minimum values of the attribute x, respectively, and \({x}_{i}^{\prime}\) is the normalized value of \({x}_{i}\).
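As a brief sketch of Eq. (8) in practice, Min–Max scaling can be applied with scikit-learn's MinMaxScaler; the toy values and the choice to fit on the training data only (to avoid information leakage) are illustrative assumptions rather than the authors' exact pipeline.

```python
# Min-Max normalization of metric values into [0, 1], cf. Eq. (8).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0, 200.0], [15.0, 120.0], [30.0, 400.0]])   # toy metric values
X_test = np.array([[12.0, 260.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)   # per column: (x - min) / (max - min)
X_test_scaled = scaler.transform(X_test)         # reuse the training min/max
print(X_train_scaled, X_test_scaled, sep="\n")
```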

6.3 Class imbalance and sampling techniques

Class imbalance in classification models refers to situations where the number of examples of one class is much smaller than that of the others (Bashir et al., 2018). If the model is trained on imbalanced datasets, the prediction results will be biased towards the majority class. So, the problem of imbalanced data often leads to the misclassification of cases in the minority class. The datasets used in our study suffer from a common problem in SDP studies: class imbalance (Chen et al., 2015; Deng et al., 2020a; Phan & Nguyen, 2017). The reference datasets are not correctly distributed, showing a skew in the actual distribution of learning instances (the number of defective cases is smaller than that of non-defective cases), as shown in Table 1. We manage this problem by modifying the original datasets to increase the realism of the data (Öztürk, 2017). Several data balancing methods have been developed to overcome the imbalanced classes problem; these techniques include subset methods, cost-sensitive learning, algorithm-level implementations, ensemble learning, feature selection methods, clustering methods, optimization methods, and data sampling techniques. Each method can be useful in different contexts, depending on the problem being addressed. Data sampling techniques are the most commonly known methods to address the distributions of imbalanced classes in datasets. Data sampling techniques are more prevalent in studies of software defect prediction due to their easy employment and independence (i.e., they can be applied to any prediction model) (Deng et al., 2020b; Tong et al., 2020). Data sampling techniques aim at modifying the dataset to be processed and obtaining a representative sample of the data. They tend to adjust the prior distribution of the majority and minority classes in the training data to obtain a balanced class distribution. Data sampling techniques can be an effective way to reduce the computational burden of analyzing large datasets, and they can help ensure that the analysis results apply to the broader population. Data sampling techniques can be divided into oversampling and under-sampling (Feng et al., 2021). Oversampling techniques add instances of the minority class to the dataset, while under-sampling techniques eliminate samples of the majority class to obtain a balanced dataset (Khuat & Le, 2020). SMOTE is a classic oversampling technique that increases the number of minority-class examples (Alsaeedi & Khan, 2019), while Tomek Link is an under-sampling technique that removes majority-class samples. SMOTE Tomek, applied here using the imbalanced-learn library, combines the synthetic minority oversampling technique (SMOTE) for oversampling and the Tomek Link function for under-sampling (Elhassan & Aljurf, 2016; Jonathan et al., 2020; Swana et al., 2022). This study used the SMOTE Tomek method to address the class imbalance problem. Figure 4 shows the distribution of learning instances over the original and balanced data sets.
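A minimal sketch of this balancing step with imbalanced-learn is shown below; the file name and the label column ("bug") are assumptions about the local copy of a PROMISE CSV, and, as in the study, only the training split is resampled.

```python
# Balance the training split with SMOTE Tomek (imbalanced-learn).
import pandas as pd
from imblearn.combine import SMOTETomek
from sklearn.model_selection import train_test_split

df = pd.read_csv("ant-1.7.csv")              # hypothetical local copy of a PROMISE dataset
X = df.drop(columns=["bug"])                 # assumed label column name
y = (df["bug"] > 0).astype(int)              # binarize defect counts: defective vs. non-defective

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

smt = SMOTETomek(random_state=42)
X_train_bal, y_train_bal = smt.fit_resample(X_train, y_train)   # only the training set is balanced
print(y_train.value_counts(), pd.Series(y_train_bal).value_counts(), sep="\n")
```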

6.4 Models building and evaluation

Most studies of SDP divide the data into two sets: a training set and a test set. The training set is used to train the model, whereas the testing set is used to evaluate the performance of the defect prediction model. Once a defect prediction model is built, its implementation must be considered (Nehéz & Khleel, 2022). We built our models using Keras as a high-level API based on TensorFlow. Training datasets comprise 80% of the dataset (selected randomly), while test datasets include the remaining 20%; each model was developed separately with different parameters, as shown in Table 3. We evaluate our proposed models' performance based on standard performance measures such as confusion matrices, MCC, AUC, AUCPR, and MSE as a loss function. MCC is used for model evaluation by measuring the difference and describing the correlation between the predicted and actual values (Chen et al., 2015). AUC plots the false positive rate on the x-axis and the true positive rate on the y-axis over all possible classification thresholds (Pandey et al., 2020). AUCPR is a curve that plots the precision versus the recall, or a single-number summary of the information in the precision-recall curve. MSE is a metric that measures the amount of error in the model. It assesses the average squared difference between the actual and predicted values (Nehéz & Khleel, 2022). A confusion matrix is a specific table used to measure the performance of a model. A confusion matrix summarizes the results of the testing algorithm. It presents a report of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) (Koay et al., 2022; Napierala & Stefanowski, 2012), as shown in Table 4.
$$\mathrm{Accuracy}=\left(\mathrm{TP}+\mathrm{TN}\right) / \left(\mathrm{TP}+\mathrm{FP}+\mathrm{FN}+\mathrm{TN}\right)$$
(9)
$$\mathrm{Precision}=\mathrm{TP }/ \left(\mathrm{TP}+\mathrm{FP}\right)$$
(10)
$$\mathrm{Recall}=\mathrm{TP }/ \left(\mathrm{TP}+\mathrm{FN}\right)$$
(11)
$$\mathrm{F}-\mathrm{Measure}=(2*\mathrm{Recall}*\mathrm{Precision}) / (\mathrm{Recall }+\mathrm{ Precision})$$
(12)
$$\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP}) \times (\text{TP} + \text{FN}) \times (\text{TN} + \text{FP}) \times (\text{TN} + \text{FN})}}$$
(13)
$$\mathrm{AUC}= \frac{{\sum }_{{ins}_{i}\; \in \;Positive\; Class}^{ }\mathrm{rank}\left({ins}_{i}\right)- \frac{\mathrm{M}(\mathrm{M}+1)}{2} }{\mathrm{M }.\mathrm{ N}}$$
(14)
where \({\sum }_{{ins}_{i}\; \in \;Positive \;Class}^{ }\mathrm{rank}\left({ins}_{i}\right)\) is the sum of the ranks of all positive samples, and M and N are the numbers of positive and negative examples, respectively.
$$\mathrm{AUCPR}={\int }_{0}^{1}\mathrm{Precision}\;(\mathrm{Recall })\;\mathrm{ d}\;(\mathrm{Recall})$$
(15)
$$\mathrm{MSE}=\frac{1}{\mathrm{n}}{\sum }_{\mathrm{i}=1}^{\mathrm{n}}{\left(\mathrm{x}(\mathrm{i})-\mathrm{y}(\mathrm{i}\right))}^{2}$$
(16)
where n is the number of observations, x(i) is the actual value, y(i) is the observed or predicted value for the \({\mathrm{i}}^{\mathrm{th}}\) observation.
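For reference, a hedged sketch of how these measures can be computed with scikit-learn is given below; the label and probability arrays are placeholders, and average_precision_score is used here as one common summary of the precision-recall curve (Eq. 15).

```python
# Computing the reported evaluation measures with scikit-learn, cf. Eqs. (9)-(16).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score, average_precision_score,
                             mean_squared_error, confusion_matrix)

y_test = np.array([0, 1, 1, 0, 1, 0])                  # placeholder true labels
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.1])      # placeholder predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                   # thresholded class predictions

scores = {
    "Accuracy":  accuracy_score(y_test, y_pred),       # Eq. (9)
    "Precision": precision_score(y_test, y_pred),      # Eq. (10)
    "Recall":    recall_score(y_test, y_pred),         # Eq. (11)
    "F-Measure": f1_score(y_test, y_pred),             # Eq. (12)
    "MCC":       matthews_corrcoef(y_test, y_pred),    # Eq. (13)
    "AUC":       roc_auc_score(y_test, y_prob),        # Eq. (14)
    "AUCPR":     average_precision_score(y_test, y_prob),  # PR-curve summary, Eq. (15)
    "MSE":       mean_squared_error(y_test, y_prob),   # Eq. (16)
}
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(scores, (tn, fp, fn, tp))
```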
Table 3  Parameter settings of the models

Parameters            CNN                         GRU
GRU layers            -                           100
Activation function   ReLU + Sigmoid              Tanh + Sigmoid
Dropouts              0.2                         0.2
Dense                 10, 1                       1
Optimizer             Adam                        Adam
Learning Rate         0.01                        0.01
Loss Function         Mean squared error (MSE)    Mean squared error (MSE)
Batch Size            25                          64
Epochs                100                         100
Validation Split      0.1                         0.1
Verbose               -                           1
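As a minimal sketch (not the authors' exact code), the GRU configuration of Table 3 could be assembled in Keras roughly as follows; reshaping the tabular PROMISE metrics to a single time step is an assumption about how the data are fed to the recurrent layer.

```python
# Keras GRU model following the parameter settings in Table 3.
from tensorflow import keras
from tensorflow.keras import layers

def build_gru_model(n_features: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(1, n_features)),       # one "time step" of static metrics (assumed shape)
        layers.GRU(100, activation="tanh"),        # 100 GRU units, Tanh activation (Table 3)
        layers.Dropout(0.2),                       # dropout rate from Table 3
        layers.Dense(1, activation="sigmoid"),     # binary defect-proneness output
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.01),
        loss="mse",                                # MSE used as the loss function
        metrics=["accuracy"],
    )
    return model

# Example usage with the training settings from Table 3:
# model = build_gru_model(n_features=20)
# history = model.fit(X_train_bal, y_train_bal, batch_size=64, epochs=100,
#                     validation_split=0.1, verbose=1)
```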
Table 4  Confusion matrix

                   Predicted
Actual             Defective    Non-defective
Defective          TN           FP
Not defective      FN           TP

7 Experimental results and discussion

In this section, we evaluate the efficiency of our proposed models. The experimental environment was based on Python and used data from the same project for training and testing. The study has considered six open-source datasets for empirical analysis using CNN and GRU. As part of our experimental analysis, we employed the traditional ML (RF) algorithm as a baseline model and compared it with our proposed models.
To answer research question RQ1, the performance of the prediction models is reported in Tables 5, 6, 7, 8, 9, 10, 11, and 12 and in Figs. 5 to 21 below.
Table 5  Performance analysis for proposed CNN Model - Original Datasets

Datasets    Accuracy    Precision    Recall    F-Measure    MCC     AUC     AUCPR    MSE
ant         0.83        0.67         0.33      0.44         0.38    0.82    0.57     0.131
camel       0.82        0.62         0.14      0.23         0.23    0.74    0.39     0.136
ivy         0.90        0.67         0.44      0.53         0.49    0.81    0.53     0.086
jedit       0.96        0.00         0.00      0.00         0.01    0.83    0.07     0.037
log4j       0.95        0.95         1.00      0.97         0.00    0.46    0.93     0.048
xerces      0.94        0.94         0.99      0.96         0.83    0.95    0.98     0.049
Averages    0.90        0.64         0.48      0.52         0.32    0.76    0.57     0.081

The averages of performance measures are shown in bold to compare models performance across all datasets
Table 6  Performance analysis for proposed CNN Model - Balanced Datasets

Datasets    Accuracy    Precision    Recall    F-Measure    MCC     AUC     AUCPR    MSE
ant         0.85        0.87         0.82      0.85         0.69    0.91    0.92     0.117
camel       0.84        0.81         0.90      0.85         0.69    0.90    0.89     0.132
ivy         0.95        0.92         0.98      0.95         0.90    0.98    0.96     0.051
jedit       0.97        0.94         1.00      0.97         0.93    0.96    0.88     0.027
log4j       0.97        0.98         0.98      0.98         0.94    0.99    0.99     0.028
xerces      0.95        0.93         0.98      0.95         0.90    0.98    0.98     0.043
Averages    0.92        0.90         0.94      0.92         0.84    0.95    0.93     0.066

The averages of performance measures are shown in bold to compare models performance across all datasets
Table 7  Performance analysis for proposed GRU Model - Original Datasets

Datasets    Accuracy    Precision    Recall    F-Measure    MCC     AUC     AUCPR    MSE
ant         0.81        0.52         0.47      0.49         0.37    0.73    0.47     0.152
camel       0.79        0.30         0.08      0.13         0.06    0.70    0.31     0.146
ivy         0.92        0.80         0.44      0.57         0.55    0.71    0.56     0.076
jedit       0.97        0.00         0.00      0.00         0.00    0.93    0.24     0.028
log4j       0.95        0.95         1.00      0.97         0.00    0.29    0.93     0.048
xerces      0.91        0.92         0.96      0.94         0.74    0.89    0.91     0.090
Averages    0.89        0.58         0.49      0.51         0.28    0.70    0.57     0.090

The averages of performance measures are shown in bold to compare models performance across all datasets
Table 8  Performance analysis for proposed GRU Model - Balanced Datasets

Datasets    Accuracy    Precision    Recall    F-Measure    MCC     AUC     AUCPR    MSE
ant         0.83        0.88         0.81      0.85         0.67    0.89    0.89     0.130
camel       0.82        0.82         0.82      0.82         0.63    0.87    0.84     0.144
ivy         0.95        0.95         0.95      0.95         0.90    0.98    0.99     0.055
jedit       0.99        0.98         1.00      0.99         0.97    1.00    1.00     0.026
log4j       0.96        0.98         0.95      0.96         0.91    0.98    0.98     0.073
xerces      0.93        0.92         0.94      0.93         0.85    0.97    0.98     0.064
Averages    0.91        0.92         0.91      0.91         0.82    0.94    0.94     0.082

The averages of performance measures are shown in bold to compare models performance across all datasets
Table 9  Performance analysis for proposed models based on precision and recall measures for CNN Model

Original Datasets
            Precision                              Recall
Datasets    Defective class    Non-defective class    Defective class    Non-defective class
ant         0.67               0.85                   0.33               0.96
camel       0.62               0.83                   0.14               0.98
ivy         0.67               0.92                   0.44               0.97
jedit       0.00               0.97                   0.00               0.99
log4j       0.95               0.00                   1.00               0.00
xerces      0.94               0.96                   0.99               0.79
Averages    0.64               0.75                   0.48               0.78

Balanced Datasets
            Precision                              Recall
Datasets    Defective class    Non-defective class    Defective class    Non-defective class
ant         0.87               0.82                   0.82               0.87
camel       0.81               0.89                   0.90               0.79
ivy         0.92               0.98                   0.98               0.91
jedit       0.94               1.00                   1.00               0.94
log4j       0.98               0.97                   0.98               0.97
xerces      0.93               0.98                   0.98               0.93
Averages    0.90               0.94                   0.94               0.90

The averages of performance measures are shown in bold to compare models performance across all datasets
Table 10  Performance analysis for proposed models based on precision and recall measures for GRU Model

Original Datasets
            Precision                              Recall
Datasets    Defective class    Non-defective class    Defective class    Non-defective class
ant         0.52               0.87                   0.47               0.89
camel       0.30               0.82                   0.08               0.96
ivy         0.80               0.92                   0.44               0.98
jedit       0.00               0.97                   0.00               1.00
log4j       0.95               0.00                   1.00               0.00
xerces      0.92               0.85                   0.96               0.76
Averages    0.58               0.73                   0.49               0.76

Balanced Datasets
            Precision                              Recall
Datasets    Defective class    Non-defective class    Defective class    Non-defective class
ant         0.88               0.79                   0.81               0.86
camel       0.82               0.82                   0.82               0.82
ivy         0.95               0.95                   0.95               0.95
jedit       0.98               1.00                   1.00               0.98
log4j       0.98               0.94                   0.95               0.97
xerces      0.92               0.94                   0.94               0.91
Averages    0.92               0.90                   0.91               0.91
Table 11  Summary of the range of measure values for the proposed models on the original and balanced datasets

Model                                Accuracy       Precision      Recall         F-measure      MCC            AUC            AUCPR          MSE
CNN model on the original datasets   0.82 to 0.96   0.00 to 0.95   0.00 to 1.00   0.00 to 0.97   0.00 to 0.83   0.46 to 0.95   0.07 to 0.98   0.037 to 0.136
CNN model on the balanced datasets   0.84 to 0.97   0.81 to 0.98   0.82 to 1.00   0.85 to 0.98   0.69 to 0.94   0.90 to 0.99   0.88 to 0.99   0.027 to 0.132
GRU model on the original datasets   0.79 to 0.97   0.00 to 0.95   0.00 to 1.00   0.00 to 0.97   0.00 to 0.74   0.29 to 0.93   0.24 to 0.93   0.028 to 0.152
GRU model on the balanced datasets   0.82 to 0.99   0.82 to 0.98   0.81 to 1.00   0.82 to 0.99   0.63 to 0.97   0.87 to 1.00   0.84 to 1.00   0.026 to 0.144
Table 12  Comparison of the proposed models in terms of accuracy using paired t-test

Paired t-test    CNN Model (Original / Balanced Datasets)    GRU Model (Original / Balanced Datasets)
Mean             0.90 / 0.92                                 0.89 / 0.91
STD              0.06 / 0.06                                 0.07 / 0.07
Min              0.82 / 0.84                                 0.79 / 0.82
Max              0.96 / 0.97                                 0.97 / 0.99
P value          0.015                                       0.000
Table 12 below presents the statistical analysis results (paired t-test) of proposed models on the original and balanced datasets regarding mean, Standard Deviation (STD), min, max, and P value. We notice that the mean values of the CNN model are 0.90 on the original datasets and 0.92 on the balanced datasets, while the mean values of the GRU model are 0.89 on the original datasets and 0.91 on the balanced datasets. The STD values of the CNN model are 0.06 on the original datasets and 0.06 on the balanced datasets, while the STD values of the GRU model are 0.07 on the original datasets and 0.07 on the balanced datasets. The Min values of the CNN model are 0.82 on the original datasets and 0.84 on the balanced datasets, while the Min values of the GRU model are 0.79 on the original datasets and 0.82 on the balanced datasets. The Max values of the CNN model are 0.96 on the original datasets and 0.97 on the balanced datasets, while the Max values of the GRU model are 0.97 on the original datasets and 0.99 on the balanced datasets. The P value of the CNN model is 0.015 based on the original and balanced datasets, while the P value of the GRU model is 0.000 based on the original and balanced datasets. Based on the P value of both models on the original and balanced data sets, we note that the P value is less than 0.05, indicating a difference between the results of the models on the original and balanced data sets.
Figure 5 below shows the Box plots of performance measures for the original and balanced datasets (Accuracy, Precision, Recall, F-measure, MCC, AUC, AUCPR, and MSE). The CNN model averages on the original datasets (Accuracy, Precision, Recall, F-measure, MCC, AUC, AUCPR, and MSE) are 0.90, 0.64, 0.48, 0.52, 0.32, 0.76, 0.57, and 0.081, respectively. The CNN model averages on the balanced data sets (Accuracy, Precision, Recall, F-measure, MCC, AUC, AUCPR, and MSE) are 0.92, 0.90, 0.94, 0.92, 0.84, 0.95, 0.93, and 0.066, respectively. The GRU model averages on the original datasets (Accuracy, Precision, Recall, F-measure, MCC, AUC, AUCPR, and MSE) are 0.89, 0.58, 0.49, 0.51, 0.28, 0.70, 0.57, and 0.090, respectively. The averages of (Accuracy, Precision, Recall, F-measure, MCC, AUC, AUCPR, and MSE) of the GRU model on the balanced data sets are 0.91, 0.92, 0.91, 0.91, 0.82, 0.94, 0.94, and 0.082, respectively.
Figures 6, 7, 8, 9, 10, 11, 12, 13 below show the training and validation accuracy and training and validation loss of the models on the original and balanced datasets. Figure 6 shows the accuracy values of the CNN model on the original data sets. The accuracy values are 0.83 on the ant data set, 0.82 on the camel data set, 0.90 on the ivy data set, 0.96 on the jedit data set, 0.95 on the log4j data set, and 0.94 on the xerces data set. Figure 7 shows the accuracy values of the CNN model on the balanced data sets. The accuracy values are 0.85 on the ant data set, 0.84 on the camel data set, 0.95 on the ivy data set, 0.97 on the jedit data set, 0.97 on the log4j data set, and 0.95 on the xerces data set. Figure 8 shows the accuracy values of the GRU model on the original data sets. The accuracy values are 0.81 on the ant data set, 0.79 on the camel data set, 0.92 on the ivy data set, 0.97 on the jedit data set, 0.95 on the log4j data set, and 0.91 on the xerces data set. Figure 9 shows the accuracy values of the GRU model on the balanced datasets. The accuracy values are 0.83 on the ant data set, 0.82 on the camel data set, 0.95 on the ivy data set, 0.99 on the jedit data set, 0.96 on the log4j data set, and 0.93 on the xerces data set. Figure 10 shows the loss values of the CNN model on the original data sets. The loss values are 0.131 on the ant data set, 0.136 on the camel data set, 0.086 on the ivy data set, 0.037 on the jedit data set, 0.048 on the log4j data set, and 0.049 on the xerces data set. Figure 11 shows the loss values of the CNN model on the balanced data sets. The loss values are 0.117 on the ant data set, 0.132 on the camel data set, 0.051 on the ivy data set, 0.027 on the jedit data set, 0.028 on the log4j data set, and 0.043 on the xerces data set. Figure 12 shows the loss values of the GRU model on the original data sets. The loss values are 0.152 on the ant data set, 0.146 on the camel data set, 0.076 on the ivy data set, 0.028 on the jedit data set, 0.048 on the log4j data set, and 0.090 on the xerces data set. Figure 13 shows the loss values of the GRU model on the balanced data sets. The loss values are 0.130 on the ant data set, 0.144 on the camel data set, 0.055 on the ivy data set, 0.026 on the jedit data set, 0.073 on the log4j data set, and 0.064 on the xerces data set.
As shown in the figures, the training and validation accuracy increases and the loss decreases as the number of epochs grows. Given the high accuracy and low loss obtained by the proposed models, the models appear to be well trained and validated.
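The curves in Figures 6-13 are standard training histories. The sketch below, using a small synthetic dataset and a deliberately simplified one-layer GRU classifier (not the paper's datasets, architecture, or hyperparameters), shows how such accuracy and loss curves are typically produced with Keras:

```python
# Sketch of producing training/validation accuracy and loss curves, as in
# Figures 6-13. The data and the one-layer GRU are synthetic stand-ins.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 20, 1)).astype("float32")   # 500 modules, 20 metrics
y = (rng.random(500) > 0.5).astype("float32")          # binary defect labels

model = models.Sequential([
    layers.Input(shape=(20, 1)),
    layers.GRU(32),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X, y, validation_split=0.2, epochs=20, batch_size=32, verbose=0)

# Plot training/validation accuracy (left) and loss (right) over the epochs.
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history["accuracy"], label="training accuracy")
ax_acc.plot(history.history["val_accuracy"], label="validation accuracy")
ax_acc.set_xlabel("Epoch"); ax_acc.set_ylabel("Accuracy"); ax_acc.legend()
ax_loss.plot(history.history["loss"], label="training loss")
ax_loss.plot(history.history["val_loss"], label="validation loss")
ax_loss.set_xlabel("Epoch"); ax_loss.set_ylabel("Loss"); ax_loss.legend()
plt.tight_layout()
plt.show()
```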
Figures 14, 15, 16, 17 below show the ROC curves of the models on the original and balanced datasets. Figure 14 shows the AUC values of the CNN model on the original datasets: the best AUC is 95% on the xerces dataset, while the worst is 46% on the log4j dataset. Figure 15 shows the AUC values of the CNN model on the balanced datasets: the best AUC is 99% on the log4j and xerces datasets, while the worst is 90% on the camel dataset. Figure 16 shows the AUC values of the GRU model on the original datasets: the best AUC is 93% on the jedit dataset, while the worst is 29% on the log4j dataset. Figure 17 shows the AUC values of the GRU model on the balanced datasets: the best AUC is 100% on the jedit dataset, while the worst is 87% on the camel dataset.
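For reference, the AUC values plotted in Figures 14-17 follow the standard ROC construction. The sketch below uses scikit-learn with small illustrative arrays in place of the models' actual test labels and predicted defect probabilities:

```python
# Sketch of an ROC curve and its AUC, as plotted in Figures 14-17.
# `y_test` and `y_score` are small illustrative arrays, not the paper's data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])                     # true defect labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])   # predicted probabilities

fpr, tpr, _ = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance level")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```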
Figures 18, 19, 20, 21 below show the AUCPR of the models on the original and balanced datasets. Figure 18 shows the AUCPR values of the CNN model on the original datasets: the best AUCPR is 98% on the xerces dataset, while the worst is 7% on the jedit dataset. Figure 19 shows the AUCPR values of the CNN model on the balanced datasets: the best AUCPR is 99% on the log4j and xerces datasets, while the worst is 88% on the jedit dataset. Figure 20 shows the AUCPR values of the GRU model on the original datasets: the best AUCPR is 93% on the log4j dataset, while the worst is 24% on the jedit dataset. Figure 21 shows the AUCPR values of the GRU model on the balanced datasets: the best AUCPR is 100% on the jedit dataset, while the worst is 84% on the camel dataset.
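A companion sketch for the AUCPR values in Figures 18-21: the precision-recall curve and the area under it, again computed with scikit-learn on the same illustrative arrays as the ROC sketch above.

```python
# Sketch of a precision-recall curve and its area (AUCPR), as in Figures 18-21.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, auc

y_test = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7])

precision, recall, _ = precision_recall_curve(y_test, y_score)
aucpr = auc(recall, precision)   # area under the PR curve

plt.plot(recall, precision, label=f"PR curve (AUCPR = {aucpr:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```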
Comparing the results obtained by the proposed models on the original datasets with those obtained on the balanced datasets, as shown in the tables and figures, the models achieve clearly better scores on the balanced datasets. This improvement indicates that the proposed models perform well and that data balancing methods play an important role in improving model accuracy.
To answer research question RQ2, we compared the results produced by our models with those obtained by the baseline model (RF) based on six performance measures: accuracy, precision, recall, F-measure, MCC, and AUC. Table 13 below compares our models with the baseline model (RF); according to it, our models outperform the baseline model on some datasets. We also compared our results with those reported in previous studies using the same six performance measures. Table 14 below lists the performance values obtained by our models alongside those reported in previous studies; a dash ("-") indicates that an approach did not report results on a particular dataset. According to Table 14, some of the results in the previous studies are better than ours, but in most cases our models outperform the state-of-the-art approaches and provide better predictive performance.
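As a concrete illustration of how a baseline of this kind can be trained and scored with the six measures used in Table 13, the sketch below fits a random forest after balancing the training split with SMOTE Tomek via imbalanced-learn. The synthetic data stand in for a PROMISE project, and the split, hyperparameters, and the decision to balance the baseline's training data are our assumptions, not the paper's stated protocol.

```python
# Illustrative sketch of a Table 13-style evaluation: balance the training
# split with SMOTE Tomek, fit a random forest, and report the six measures.
# Synthetic data replace the PROMISE static code metrics; settings are assumed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Resample only the training data so the test set keeps its real distribution.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=42).fit(X_res, y_res)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
```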
Table 13  Performance measures of the baseline model (RF) and proposed models

| Model | Dataset | Accuracy | Precision | Recall | F-measure | MCC | AUC |
|---|---|---|---|---|---|---|---|
| RF | ant | 0.83 | 0.57 | 0.57 | 0.57 | 0.45 | 0.72 |
| RF | camel | 0.82 | 0.56 | 0.28 | 0.37 | 0.30 | 0.61 |
| RF | ivy | 0.90 | 0.67 | 0.44 | 0.53 | 0.49 | 0.70 |
| RF | jedit | 0.97 | 0.00 | 0.00 | 0.00 | 0.00 | 0.50 |
| RF | log4j | 0.98 | 0.97 | 1.00 | 0.99 | 0.69 | 0.75 |
| RF | xerces | 0.95 | 0.95 | 0.99 | 0.97 | 0.86 | 0.90 |
| RF | Average | 0.90 | 0.62 | 0.54 | 0.57 | 0.46 | 0.69 |
| CNN with SMOTE Tomek | ant | 0.85 | 0.87 | 0.82 | 0.85 | 0.69 | 0.91 |
| CNN with SMOTE Tomek | camel | 0.84 | 0.81 | 0.90 | 0.85 | 0.69 | 0.90 |
| CNN with SMOTE Tomek | ivy | 0.95 | 0.92 | 0.98 | 0.95 | 0.90 | 0.98 |
| CNN with SMOTE Tomek | jedit | 0.97 | 0.94 | 1.00 | 0.97 | 0.93 | 0.96 |
| CNN with SMOTE Tomek | log4j | 0.97 | 0.98 | 0.98 | 0.98 | 0.94 | 0.99 |
| CNN with SMOTE Tomek | xerces | 0.95 | 0.93 | 0.98 | 0.95 | 0.90 | 0.98 |
| CNN with SMOTE Tomek | Average | 0.92 | 0.90 | 0.94 | 0.92 | 0.84 | 0.95 |
| GRU with SMOTE Tomek | ant | 0.83 | 0.88 | 0.81 | 0.85 | 0.67 | 0.89 |
| GRU with SMOTE Tomek | camel | 0.82 | 0.82 | 0.82 | 0.82 | 0.63 | 0.87 |
| GRU with SMOTE Tomek | ivy | 0.95 | 0.95 | 0.95 | 0.95 | 0.90 | 0.98 |
| GRU with SMOTE Tomek | jedit | 0.99 | 0.98 | 1.00 | 0.99 | 0.97 | 1.00 |
| GRU with SMOTE Tomek | log4j | 0.96 | 0.98 | 0.95 | 0.96 | 0.91 | 0.98 |
| GRU with SMOTE Tomek | xerces | 0.93 | 0.92 | 0.94 | 0.93 | 0.85 | 0.97 |
| GRU with SMOTE Tomek | Average | 0.91 | 0.92 | 0.91 | 0.91 | 0.82 | 0.94 |
Table 14  Comparison of the proposed models with other existing approaches

| Approach | Datasets | Accuracy | Precision | Recall | F-measure | MCC | AUC |
|---|---|---|---|---|---|---|---|
| LSTM (Liang et al., 2019) | Camel | - | 0.51 | 0.41 | 0.46 | - | - |
| DP-LSTM (Deng et al., 2020a) | Camel, Jedit, Log4j, Xerces | - | - | - | 0.37, 0.44, 0.52, 0.26 | - | - |
| LR (Khuat & Le, 2020) | Ant, Camel, IVY | - | - | - | 0.52, 0.34, 0.30 | - | - |
| K-NN (Khuat & Le, 2020) | Ant, Camel, IVY | - | - | - | 0.53, 0.37, 0.30 | - | - |
| MLP (Khuat & Le, 2020) | Ant, Camel, IVY | - | - | - | 0.50, 0.38, 0.25 | - | - |
| SVM (Khuat & Le, 2020) | Ant, Camel, IVY | - | - | - | 0.50, 0.084, 0.28 | - | - |
| HyGRAR (Miholca et al., 2018) | JEdit, Ant | 0.98, 0.96 | 0.70, 0.98 | 0.63, 0.85 | - | - | 0.81, 0.92 |
| Hybrid Neural Network model (Kumar & Sathyanarayana, 2015) | JEdit, IVY, Ant, Camel | 0.97, 0.88, 0.81, 0.81 | 1.00, 0.99, 0.93, 1.00 | 1.00, 0.88, 0.84, 0.81 | 0.98, 0.93, 0.88, 0.89 | - | - |
| CNN (Pan et al., 2019) | ant, camel, ivy, jedit, log4j | - | - | - | 0.39, 0.52, 0.31, 0.00, 0.97 | 0.30, 0.42, 0.25, 0.00, 0.00 | - |
| BPDET (Pandey et al., 2020) | CM1, JM1, KC1, MC1, PC1, MW1 | - | - | - | 0.84, 0.76, 0.83, 0.96, 0.92, 0.90 | 0.42, 0.23, 0.33, 0.14, 0.38, 0.33 | 0.75, 0.75, 0.81, 0.85, 0.88, 0.77 |
| SPFCNN (Zhao et al., 2019) | CM1, JM1, KC1, PC1, MW1 | - | - | - | - | 0.85, 0.74, 0.78, 0.87, 0.80 | 0.92, 0.87, 0.88, 0.93, 0.90 |
| CBIL (Farid et al., 2021) | Camel, JEdit, Xerces | - | - | - | 0.93, 0.85, 0.95 | - | 0.96, 0.91, 0.98 |
| DP-ARNN (Fan et al., 2019) | Camel, Xerces, JEdit | - | - | - | 0.51, 0.27, 0.56 | - | 0.79, 0.76, 0.82 |
| RF (Feng et al., 2021) | ant, camel, ivy, jedit | - | - | - | - | 0.42, 0.20, 0.24, 0.26 | - |
| DT (Feng et al., 2021) | ant, camel, ivy, jedit | - | - | - | - | 0.29, 0.18, 0.20, 0.12 | - |
| CNN with SMOTE Tomek | ant, camel, ivy, jedit, log4j, xerces | 0.85, 0.84, 0.95, 0.97, 0.97, 0.95 | 0.87, 0.81, 0.92, 0.94, 0.98, 0.93 | 0.82, 0.90, 0.98, 1.00, 0.98, 0.98 | 0.85, 0.85, 0.95, 0.97, 0.98, 0.95 | 0.69, 0.69, 0.90, 0.93, 0.94, 0.90 | 0.91, 0.90, 0.98, 0.96, 0.99, 0.98 |
| GRU with SMOTE Tomek | ant, camel, ivy, jedit, log4j, xerces | 0.83, 0.82, 0.95, 0.99, 0.96, 0.93 | 0.88, 0.82, 0.95, 0.98, 0.98, 0.92 | 0.81, 0.82, 0.95, 1.00, 0.95, 0.94 | 0.85, 0.82, 0.95, 0.99, 0.96, 0.93 | 0.67, 0.63, 0.90, 0.97, 0.91, 0.85 | 0.89, 0.87, 0.98, 1.00, 0.98, 0.97 |

8 The implication of the findings

The findings have implications for researchers who are interested in quantitatively understanding the effectiveness and efficiency of applying data balancing methods with ML techniques in SDP, as well as in the qualitative perspective of the results. Accordingly, the earlier sections (experimental results and discussion) reported the implications regarding effectiveness, efficiency, comparison, and the relation to previous work.

9 Threats to validity

This section discusses the threats to our study's validity and the limitations of the experiments, and how we mitigate them. It is vital to assess threats to construct, internal, and external validity, as well as the experiment's limitations, particularly constraints on the search process and deviations from standard practice.
Construct validity concerns the study's design and its ability to reflect the actual goal of the research. To avoid threats in the study design, we applied a systematic literature review procedure. To ensure that the researched area is relevant to the study goal, we cross-checked the research questions and adjusted them several times. Besides, the metrics considered may pose a threat to our study: we only adopt static code metrics to predict defects, so we cannot claim that our conclusions generalize to other metrics. However, static code metrics have also been widely adopted in many previous studies (Chen et al., 2015; Feng et al., 2021). Another threat is the construction of the ML models. We considered several aspects that could have influenced the study, i.e., data pre-processing, which features to consider, how to train the models, etc. However, the procedures followed in this respect are precise enough to ensure the study's validity.
Threats to internal validity relate to the correctness of the experiment's outcome and the study's process. The main threat to internal validity is the datasets. The reference datasets are imbalanced and do not reflect the actual distribution of defective and non-defective classes. We manage this threat by modifying the original datasets to increase the realism of the data with respect to the actual presence of defects in the software system; the class distribution is modified by applying two data sampling techniques (SMOTE oversampling and Tomek-link removal, combined as SMOTE Tomek). Another threat is that most of our datasets contain only a small number of defects, which makes it challenging to generate statistically significant results. We tried to minimize this threat by applying standard performance measures for SDP; however, we acknowledge that several statistical tests (Arcuri & Briand, 2014) can be used to verify the statistical significance of our conclusions. Therefore, we plan to conduct more statistical tests in our future work.
External validity relates to the generalizability of the study to a broader range of applications. We selected and gathered different types of datasets from various projects in the PROMISE repository to test our experiment. Our project-selection criterion was based on the ratio of defects: we chose projects with both high and low percentages of defects (projects with imbalanced classes) so that data balancing methods could be applied. We built our models to combine ML with balancing techniques in SDP, and we selected six open-source Java projects from the PROMISE repository as our evaluation datasets. However, we cannot claim that our results generalize; future replications are necessary to confirm the generalizability of our findings.
The limitations of the experiments are summarized as follows. First, the datasets used in our experiments are limited to only six open-source Java projects. Second, our findings may not be sufficient for generalization.

10 Conclusion

Various ML and DL techniques have recently been used to build SDP models. Software defects significantly impact the software development life cycle, and defect prevention plays a vital role in software quality assurance and effectively supports software maintenance. SDP is the process of generating models or tools to predict software defects based on historical data; early defect prediction helps prioritize and optimize effort and costs for inspection and testing. Historical software metrics that indicate defective data are the primary inputs to the models. To improve on existing state-of-the-art approaches, we proposed a novel approach based on CNN and GRU combined with SMOTE Tomek to predict defects in source code, using the data sampling method (SMOTE Tomek) to address the class imbalance problem. To evaluate the effectiveness of the proposed models, we performed a series of experiments on six public software defect datasets and compared the results with random forest (RF) as a baseline model. On the balanced datasets, the proposed models achieved an average precision of 90% for the CNN model and 92% for the GRU model, compared with 62% for the RF model; that is, the proposed models improve the average precision by 28 and 30 percentage points, respectively, which shows that they outperform the baseline model. The average accuracy, precision, recall, F-measure, and MCC of the proposed models on the original datasets were 90%, 64%, 48%, 52%, and 32% for the CNN model and 89%, 58%, 49%, 51%, and 28% for the GRU model. In comparison, the averages on the balanced datasets were 92%, 90%, 94%, 92%, and 84% for the CNN model and 91%, 92%, 91%, 91%, and 82% for the GRU model. The experimental results demonstrate that the proposed models perform better and that combining the CNN and GRU models with the SMOTE Tomek method has a positive effect on SDP performance for datasets with imbalanced class distributions, making our approach a promising alternative for addressing the class imbalance problem in SDP compared with previous methods. The robustness and accuracy of our proposed approach will be evaluated on further datasets in future work.

Acknowledgements

The authors gratefully acknowledge the financial assistance from the Institute of Information Science, Faculty of Mechanical Engineering and Informatics, University of Miskolc.

Declarations

Competing interests

The authors declare that they have no competing interests in this work and that they do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References
Yang, Z., & Qian, H. (2018). Automated Parameter Tuning of Artificial Neural Networks for Software Defect Prediction. In Proceedings of the 2nd International Conference on Advances in Image Processing (pp. 203–209). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3239576.3239622
Jonathan, B., Putra, P. H., & Ruldeviyani, Y. (2020). Observation imbalanced data text to predict users selling products on female daily with smote, tomek, and smote-tomek. In 2020 IEEE International Conference on Industry 4.0, Artificial Intelligence, and Communications Technology (IAICT) (pp. 81–85). Bali, Indonesia: IEEE. https://doi.org/10.1109/IAICT50021.2020.9172033
Kumar, R. S., & Sathyanarayana, B. (2015). Adaptive Genetic Algorithm Based Artificial Neural Network for Software Defect Prediction. Global Journal of Computer Science and Technology, 15(D1), 23–32.
Farid, A. B., Fathy, E. M., Eldin, A. S., et al. (2021). Software defect prediction using hybrid model (CBIL) of convolutional neural network (CNN) and bidirectional long short-term memory (Bi-LSTM). PeerJ Computer Science, 7, e739. https://doi.org/10.7717/peerj-cs.739
Kalaivani, N., & Beena, R. (2018). Overview of software defect prediction using machine learning algorithms. International Journal of Pure and Applied Mathematics, 118(20), 3863–3873.
Kumar, Y., & Singh, V. (2021). A Practitioner Approach of Deep Learning Based Software Defect Predictor. Annals of the Romanian Society for Cell Biology, 25(6), 14615–14635.
Omri, S., & Sinz, C. (2020). Deep learning for software defect prediction: A survey. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops (pp. 209–214). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3387940.3391463
Li, J., He, P., Zhu, J., et al. (2017). Software defect prediction via convolutional neural network. In 2017 IEEE International Conference on Software Quality, Reliability and Security (QRS) (pp. 318–328). Prague, Czech Republic: IEEE. https://doi.org/10.1109/QRS.2017.42