Elsevier

Information Systems

Volume 51, July 2015, Pages 62-71

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

https://doi.org/10.1016/j.is.2015.02.006

Highlights

  • SDP is short for Software Defect Prediction.

  • We show that there is no clear winner among the existing methods studied for SDP.

  • A cost-sensitive decision forest and voting technique are proposed.

  • The superiority of the proposed techniques is shown.

  • A framework for handling class imbalance is proposed for the forest algorithm.

Abstract

Software development projects inevitably accumulate defects throughout the development process. Due to the high cost that defects can incur, careful consideration is crucial when predicting which sections of code are likely to contain defects. Classification algorithms from machine learning can be used to build classifiers that predict defects. While traditional classification algorithms optimize for accuracy, cost-sensitive classification methods attempt to make predictions that incur the lowest classification cost. In this paper we propose a cost-sensitive classification technique called CSForest, which is an ensemble of decision trees. We also propose a cost-sensitive voting technique called CSVoting in order to take advantage of the set of decision trees in minimizing the classification cost. We then investigate a potential solution to class imbalance within our decision forest algorithm. We empirically evaluate the proposed techniques, comparing them with six (6) classifier algorithms on six (6) publicly available clean datasets that are commonly used in research on software defect prediction. Our initial experimental results indicate a clear superiority of the proposed techniques over the existing ones.

Introduction

Software defect prediction (SDP) is the process of predicting which sections of code are defective and which are not. Sections of software code are also referred to as modules, examples of which are functions in C/C++ programs and methods in Java programs. A module can be characterized through various measures, including the number of distinct operators and operands used, the total number of operators and operands used, Difficulty, Volume, Length, and Cyclomatic Complexity, as introduced in the Halstead [9] and McCabe [13], [14] measures. Considering the measures as attributes and the modules as records, a dataset DT can be prepared containing records for previous modules for which we already know whether or not each module was defective. Therefore, DT also has a class attribute that labels each module as defective or non-defective. Using DT as a training dataset, a classifier can be built and applied to future modules in order to predict whether each module is defective or non-defective. This is also known as the classification task in data mining.
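As a minimal sketch, such a training dataset DT might be laid out as follows; the metric values below are invented purely for illustration, and only the data layout is shown, not any particular learner:

```python
# Illustrative training dataset DT: each record describes one module by
# static-code metrics (in the style of Halstead/McCabe measures) plus a
# known class label. All metric values here are made up for illustration.
DT = [
    {"n1": 12, "N2": 40,  "volume": 310.5,  "v(g)": 4,  "defective": False},
    {"n1": 35, "N2": 210, "volume": 1850.0, "v(g)": 17, "defective": True},
    {"n1": 8,  "N2": 22,  "volume": 150.2,  "v(g)": 2,  "defective": False},
]

# A classifier would be trained on DT and then applied to future modules
# whose labels are unknown. The non-class attributes are the candidates
# for splitting in a decision tree.
attributes = [k for k in DT[0] if k != "defective"]
print(attributes)
print(sum(r["defective"] for r in DT), "of", len(DT), "modules defective")
```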

For the conventional classification task in data mining, a classifier is generally built with the aim of minimizing the number of misclassified records and thereby maximizing the prediction accuracy for future records. However, in SDP a classifier is often built with the aim of minimizing the classification cost, which is the cost associated with the classification made. That is, in SDP the classification cost is more important than the number of misclassified records. The cost of a false negative (i.e. a module being actually defective but predicted as non-defective) is generally several times higher than the cost of a false positive (i.e. a module being actually non-defective but predicted as defective). Therefore, it is often better to have several false positive predictions in order to avoid a single false negative prediction. In order to build SDP classifiers which aim to minimize the cost, a cost-sensitive learning algorithm is incorporated [5], [23], [12], [16], [19]. The utilization of cost-sensitive learning in SDP can result in monetary savings for a software development group, and as such can generally be seen as more useful than SDP systems which do not utilize cost-sensitive learning.
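The trade-off can be made concrete with a small numeric sketch. The cost figures below are assumptions for illustration only (a false negative costing five times a false positive); they show how a classifier that is worse on accuracy can still be better on cost:

```python
# Illustrative cost comparison between two classifiers on the same test
# modules. Assumed unit costs: a false negative costs 5 units, a false
# positive costs 1 unit; correct predictions cost nothing in this sketch.
C_FN, C_FP = 5.0, 1.0

def classification_cost(fp, fn):
    """Total cost incurred by a classifier's misclassifications."""
    return fp * C_FP + fn * C_FN

# Classifier A: very accurate (only 4 errors), but all are missed defects.
cost_a = classification_cost(fp=0, fn=4)   # 4 * 5 = 20 units
# Classifier B: less accurate (8 errors), but it flags defects aggressively.
cost_b = classification_cost(fp=7, fn=1)   # 7 * 1 + 1 * 5 = 12 units

# Despite making twice as many errors, B incurs the lower total cost.
print(cost_a, cost_b)
```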

This study extends our previous conference paper [22] which has never been published in a journal. In this paper we propose a cost-sensitive decision forest algorithm called CSForest and a suitable cost-sensitive voting technique called CSVoting in order to reduce the classification cost. We also empirically compare our proposed technique with two existing cost-sensitive classifiers called Weighting [23] and CSTree [16], [12] and two cost-insensitive classifiers called C4.5 [15] and SysFor [11]. We use six (6) publicly available real-world datasets available from the NASA MDP repository [20]. Our experimental results clearly indicate that the proposed cost-sensitive decision forest and cost-sensitive voting perform better than the existing techniques in minimizing the classification cost.

A problem that is often encountered in SDP is the class imbalance problem, where defective modules are heavily outnumbered by non-defective ones. This imbalance causes performance issues for data mining algorithms. A popular method of accounting for the class imbalance problem is oversampling. We investigate a potential technique for incorporating oversampling into CSForest. A comparison is then made with the original CSForest results in Section 6. The extensions of our conference paper which are made in this study are as follows:

  • A deeper explanation of the related work.

  • Extending the proposed CSForest to incorporate oversampling.

  • Further experimentation.

The rest of this paper is structured as follows: in Section 2 we outline the related work. We introduce the methods CSForest and CSVoting in Section 3. We present our experimental results for CSForest and CSVoting in Section 4. We then present an idea for combating class imbalance in CSForest in Section 5. The experimental results for combating the class imbalance are presented in Section 6. In Section 7 we present examples of knowledge discovered by the proposed cost-sensitive forest. Finally, in Section 8 we present the concluding remarks and our future work.

Section snippets

Related work

In this section we introduce a number of cost-insensitive and cost-sensitive classifiers that are either related to the proposed technique or empirically compared with it in Section 4. Examples of cost-insensitive classifiers are C4.5 [15] (a decision tree classifier) and SysFor [11] (a decision forest classifier); examples of cost-sensitive classifiers are CSTree [16], [12] and Weighting [23].

Like all other classifiers, a

Our method

In this study we propose a cost-sensitive voting technique called CSVoting (Cost-Sensitive Voting) for a decision forest, and a cost-sensitive decision forest algorithm called CSForest. Since we consider the scenario of Software Defect Prediction (SDP) we focus on two-class classification: defective and non-defective. However, our methods can easily be extended for multi-class cost-sensitive classification.

CSVoting classifies a record Ri as follows. A record Ri falls in (i.e. satisfies the logic rule
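The general idea of a cost-sensitive vote over a forest can be sketched as follows. This is an illustration of the concept rather than the paper's exact formulation: we assume each tree reports the class distribution (defective vs. non-defective training records) of the leaf that record Ri falls into, and the vote chooses the class whose expected misclassification cost, summed over all trees, is lowest. The cost values are illustrative:

```python
# Sketch of cost-sensitive voting over a decision forest (illustrative;
# not the paper's exact CSVoting formulas). Assumed costs: missing a
# defect (false negative) is 5 times worse than a false alarm.
C_FN, C_FP = 5.0, 1.0

def cs_vote(leaf_counts):
    """leaf_counts: one (n_defective, n_clean) tuple per tree, taken from
    the leaf that the record Ri falls into in that tree."""
    cost_pred_defective = 0.0   # expected cost of predicting 'defective'
    cost_pred_clean = 0.0       # expected cost of predicting 'non-defective'
    for n_def, n_clean in leaf_counts:
        p_def = n_def / (n_def + n_clean)   # this tree's P(defective) estimate
        cost_pred_defective += (1 - p_def) * C_FP  # wrong if truly clean
        cost_pred_clean += p_def * C_FN            # wrong if truly defective
    return "defective" if cost_pred_defective <= cost_pred_clean else "non-defective"

# Three trees whose leaves hold (defective, clean) training records.
# Even though every leaf estimates P(defective) below 0.5, the high cost
# of a false negative tips the vote towards 'defective'.
print(cs_vote([(2, 8), (1, 9), (3, 7)]))
```

Note how a plain majority vote on these leaves would predict non-defective; the asymmetric costs are what change the outcome.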

Experimental results

Following the common trend in Software Defect Prediction (SDP), in our experiments we also consider a false negative classification to be several times (in this instance 5 times) more costly than a true positive classification. We consider attempting to fix a non-defective module to be similar in cost to detecting and fixing a defective module. Our cost metric is shown in Table 1. Since the value of TCi is not clearly defined in the literature [16] and has been considered to be 0 [12], we also
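Since Table 1 is not reproduced in this excerpt, the cost metric can only be sketched from the description above, with assumed unit costs: fixing a truly defective module (true positive) and attempting to fix a clean one (false positive) each cost about one unit, a false negative costs five units, and a true negative costs 0 (the TCi = 0 convention):

```python
# Total-cost metric over a confusion matrix, with illustrative unit costs
# in the spirit of the setup described above: FN = 5 x TP, FP ~ TP, TN = 0.
COST = {"TP": 1.0, "FP": 1.0, "FN": 5.0, "TN": 0.0}

def total_cost(tp, fp, fn, tn):
    """Cost of a classifier's predictions given its confusion matrix."""
    return (tp * COST["TP"] + fp * COST["FP"]
            + fn * COST["FN"] + tn * COST["TN"])

# Example: 10 defective and 90 clean modules; the classifier finds 8 of
# the 10 defects and raises 5 false alarms.
print(total_cost(tp=8, fp=5, fn=2, tn=85))   # 8 + 5 + 10 + 0 = 23.0
```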

Potential solution to class imbalance

It is well known that software defect prediction datasets typically suffer from the class imbalance problem, where the datasets contain only a few records with defective modules compared to the number of defect-free modules. For example, the MC1 and PC2 datasets contain only 0.73% and 1.01% defective modules, respectively. The ability of a classifier to predict the records with the minority class value (such as the defective modules) can be challenged by the class imbalance problem in a dataset.

We investigate

Class imbalance experiments

We compare the cost of the original CSForest with a modified version of CSForest that incorporates an oversampling technique as described in the previous section. We use SLS to refer to Safe-Level-Smote. The number in parentheses in the following table refers to how many oversampled versions of the original dataset were created. To make references easier, we refer to the columns which represent CSForests which incorporate class imbalance handling as different setups. For example, Setup 2 is a CSForest

Examples of knowledge discovery by our cost-sensitive forest

A subject of our future research is knowledge extraction from decision forests built by the CSForest method. However, for the interested reader, we present a few takeaway insights discovered by the proposed cost-sensitive forest. Since these insights are described using attributes from the datasets, a few attributes are first explained in detail in the following subsection.

Conclusion

Software development projects almost inevitably accumulate defects during the development process. Software defect prediction systems may be used to advise software developers as to which modules may contain defects. Cost-sensitivity can be incorporated in an effort to optimize the predictions for the monetary cost to the developers. We present CSForest, a cost-sensitive decision forest algorithm, and a cost-sensitive voting technique. We show that when combined, CSForest and

Code Availability

The Java code for CSForest, CSVoting and BCSForest can be found online at "mikesiers.com/software/" and also at "http://csusap.csu.edu.au/~zislam/".

Acknowledgments

The second author of this paper would like to thank the Faculty of Business Compact Funding R4 P55 at Charles Sturt University, Australia. We are grateful for the constructive feedback from the reviewers.

References (23)

  • Y. Freund et al.

    A decision-theoretic generalization of on-line learning and an application to boosting

    J. Comput. Syst. Sci.

    (1997)
  • V.S. Sheng et al.

    Cost-sensitive learning for defect escalation

    Knowl.-Based Syst.

    (2014)
  • L. Breiman

    Bagging predictors

    Mach. Learn.

    (1996)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe-level-smote: safe-level-synthetic minority over-sampling...
  • N.V. Chawla et al.

    SMOTE: synthetic minority over-sampling technique

    J. Artif. Intell. Res.

    (2002)
  • P. Domingos, Metacost: a general method for making classifiers cost-sensitive, in: Proceedings of the Fifth ACM SIGKDD...
  • D. Gray, D. Bowes, N. Davey, Y. Sun, B. Christianson, The misuse of the NASA metrics data program data sets for...
  • D. Gray et al.

    Reflections on the NASA MDP data sets

    IET Softw.

    (2012)
  • M.H. Halstead

    Elements of Software Science

    (1977)
  • G. Holmes, A. Donkin, I.H. Witten, Weka: A machine learning workbench, in: Proceedings of the Second Australian and New...