Elsevier

Information Systems

Volume 51, July 2015, Pages 62-71

Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem

https://doi.org/10.1016/j.is.2015.02.006

Highlights

  • SDP is short for Software Defect Prediction.

  • We show that there is no clear winner among the existing methods studied for SDP.

  • A cost-sensitive decision forest and voting technique are proposed.

  • The superiority of the proposed techniques is shown.

  • A framework for handling class imbalance is proposed for the forest algorithm.

Abstract

Software development projects inevitably accumulate defects throughout the development process. Due to the high cost that defects can incur, careful consideration is crucial when predicting which sections of code are likely to contain defects. Classification algorithms from machine learning can be used to build classifiers that predict defects. While traditional classification algorithms optimize for accuracy, cost-sensitive classification methods attempt to make predictions that incur the lowest classification cost. In this paper we propose a cost-sensitive classification technique called CSForest, which is an ensemble of decision trees. We also propose a cost-sensitive voting technique called CSVoting in order to take advantage of the set of decision trees in minimizing the classification cost. We then investigate a potential solution to class imbalance within our decision forest algorithm. We empirically evaluate the proposed techniques, comparing them with six (6) classifier algorithms on six (6) publicly available clean datasets that are commonly used in research on software defect prediction. Our initial experimental results indicate a clear superiority of the proposed techniques over the existing ones.

Introduction

Software defect prediction (SDP) is the process of predicting which sections of code are defective and which are not. Sections of software code are also referred to as modules, examples of which are functions in C/C++ programs and methods in Java programs. A module can be characterized through various measures, including the number of distinct operators and operands used, the total number of operators and operands used, Difficulty, Volume, Length, and Cyclomatic Complexity, as introduced in the Halstead [9] and McCabe [13], [14] measures. Considering the measures as attributes and the modules as records, a dataset DT can be prepared containing records for previous modules for which we already know whether or not each module was defective. Therefore, DT also has a class attribute that labels each module as defective or non-defective. Using DT as a training dataset, a classifier can be built and applied to future modules in order to predict whether each module is defective or non-defective. This is also known as the classification task in data mining.
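As a minimal sketch, such a training dataset DT might be laid out as follows; the metric values below are invented purely for illustration, and only the data layout is shown, not any particular learner:

```python
# Illustrative training dataset DT: each record describes one module by
# static-code metrics (in the style of Halstead/McCabe measures) plus a
# known class label. All metric values here are made up for illustration.
DT = [
    {"n1": 12, "N2": 40,  "volume": 310.5,  "v(g)": 4,  "defective": False},
    {"n1": 35, "N2": 210, "volume": 1850.0, "v(g)": 17, "defective": True},
    {"n1": 8,  "N2": 22,  "volume": 150.2,  "v(g)": 2,  "defective": False},
]

# A classifier would be trained on DT and then applied to future modules
# whose labels are unknown. The non-class attributes are the candidates
# for splitting in a decision tree.
attributes = [k for k in DT[0] if k != "defective"]
print(attributes)
print(sum(r["defective"] for r in DT), "of", len(DT), "modules defective")
```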

For the conventional classification task in data mining, a classifier is generally built with the aim of minimizing the number of misclassified records and thereby maximizing the prediction accuracy for future records. However, in SDP a classifier is often built with the aim of minimizing the classification cost, which is the cost associated with the classification made. That is, in SDP the classification cost is more important than the number of misclassified records. The cost of a false negative (i.e. a module being actually defective but predicted as non-defective) is generally several times higher than the cost of a false positive (i.e. a module being actually non-defective but predicted as defective). Therefore, it is often better to have several false positive predictions in order to avoid a single false negative prediction. In order to build SDP classifiers which aim to minimize the cost, a cost-sensitive learning algorithm is incorporated [5], [23], [12], [16], [19]. The utilization of cost-sensitive learning in SDP can result in monetary savings for a software development group, and as such can generally be seen as more useful than SDP systems which do not utilize cost-sensitive learning.
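The trade-off can be made concrete with a small numeric sketch. The cost figures below are assumptions for illustration only (a false negative costing five times a false positive); they show how a classifier that is worse on accuracy can still be better on cost:

```python
# Illustrative cost comparison between two classifiers on the same test
# modules. Assumed unit costs: a false negative costs 5 units, a false
# positive costs 1 unit; correct predictions cost nothing in this sketch.
C_FN, C_FP = 5.0, 1.0

def classification_cost(fp, fn):
    """Total cost incurred by a classifier's misclassifications."""
    return fp * C_FP + fn * C_FN

# Classifier A: very accurate (only 4 errors), but all are missed defects.
cost_a = classification_cost(fp=0, fn=4)   # 4 * 5 = 20 units
# Classifier B: less accurate (8 errors), but it flags defects aggressively.
cost_b = classification_cost(fp=7, fn=1)   # 7 * 1 + 1 * 5 = 12 units

# Despite making twice as many errors, B incurs the lower total cost.
print(cost_a, cost_b)
```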

This study extends our previous conference paper [22] which has never been published in a journal. In this paper we propose a cost-sensitive decision forest algorithm called CSForest and a suitable cost-sensitive voting technique called CSVoting in order to reduce the classification cost. We also empirically compare our proposed technique with two existing cost-sensitive classifiers called Weighting [23] and CSTree [16], [12] and two cost-insensitive classifiers called C4.5 [15] and SysFor [11]. We use six (6) publicly available real-world datasets available from the NASA MDP repository [20]. Our experimental results clearly indicate that the proposed cost-sensitive decision forest and cost-sensitive voting perform better than the existing techniques in minimizing the classification cost.

A problem that is often encountered in SDP is the class imbalance problem, where defective modules are heavily outnumbered by non-defective ones. This imbalance causes performance issues for data mining algorithms. A popular method of accounting for the class imbalance problem is oversampling. We investigate a potential technique for incorporating oversampling into CSForest. A comparison is then made with the original CSForest results in Section 6. The extensions of our conference paper which are made in this study are as follows:

  • A deeper explanation of the related work.

  • Extending the proposed CSForest to incorporate oversampling.

  • Further experimentation.

The rest of this paper is structured as follows: in Section 2 we outline the related work. We introduce the methods CSForest and CSVoting in Section 3. We present our experimental results for CSForest and CSVoting in Section 4. We then present an idea for combating class imbalance in CSForest in Section 5. The experimental results for combating the class imbalance are presented in Section 6. In Section 7 we present examples of knowledge discovered by the proposed cost-sensitive forest. Finally, in Section 8 we present the concluding remarks and our future work.

Section snippets

Related work

In this section we introduce a number of cost-insensitive and cost-sensitive classifiers that are either related to the proposed technique or empirically compared with it in Section 4. Examples of cost-insensitive classifiers are C4.5 [15] (a decision tree classifier) and SysFor [11] (a decision forest classifier); examples of cost-sensitive classifiers are CSTree [16], [12] and Weighting [23].

Like all other classifiers, a

Our method

In this study we propose a cost-sensitive voting technique called CSVoting (Cost-Sensitive Voting) for a decision forest, and a cost-sensitive decision forest algorithm called CSForest. Since we consider the scenario of Software Defect Prediction (SDP) we focus on two-class classification: defective and non-defective. However, our methods can easily be extended for multi-class cost-sensitive classification.

CSVoting classifies a record Ri as follows. A record Ri falls in (i.e. satisfies the logic rule
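The general idea of a cost-sensitive vote over a forest can be sketched as follows. This is an illustration of the concept rather than the paper's exact formulation: we assume each tree reports the class distribution (defective vs. non-defective training records) of the leaf that record Ri falls into, and the vote chooses the class whose expected misclassification cost, summed over all trees, is lowest. The cost values are illustrative:

```python
# Sketch of cost-sensitive voting over a decision forest (illustrative;
# not the paper's exact CSVoting formulas). Assumed costs: missing a
# defect (false negative) is 5 times worse than a false alarm.
C_FN, C_FP = 5.0, 1.0

def cs_vote(leaf_counts):
    """leaf_counts: one (n_defective, n_clean) tuple per tree, taken from
    the leaf that the record Ri falls into in that tree."""
    cost_pred_defective = 0.0   # expected cost of predicting 'defective'
    cost_pred_clean = 0.0       # expected cost of predicting 'non-defective'
    for n_def, n_clean in leaf_counts:
        p_def = n_def / (n_def + n_clean)   # this tree's P(defective) estimate
        cost_pred_defective += (1 - p_def) * C_FP  # wrong if truly clean
        cost_pred_clean += p_def * C_FN            # wrong if truly defective
    return "defective" if cost_pred_defective <= cost_pred_clean else "non-defective"

# Three trees whose leaves hold (defective, clean) training records.
# Even though every leaf estimates P(defective) below 0.5, the high cost
# of a false negative tips the vote towards 'defective'.
print(cs_vote([(2, 8), (1, 9), (3, 7)]))
```

Note how a plain majority vote on these leaves would predict non-defective; the asymmetric costs are what change the outcome.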

Experimental results

Following the common trend in Software Defect Prediction (SDP), in our experiments we also consider a false negative classification to be several times (in this instance 5 times) more costly than a true positive classification. We consider attempting to fix a non-defective module to be similar in cost to detecting and fixing a defective module. Our cost metric is shown in Table 1. Since the value of TCi is not clearly defined in the literature [16] and has been considered to be 0 [12], we also
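Since Table 1 is not reproduced in this excerpt, the cost metric can only be sketched from the description above, with assumed unit costs: fixing a truly defective module (true positive) and attempting to fix a clean one (false positive) each cost about one unit, a false negative costs five units, and a true negative costs 0 (the TCi = 0 convention):

```python
# Total-cost metric over a confusion matrix, with illustrative unit costs
# in the spirit of the setup described above: FN = 5 x TP, FP ~ TP, TN = 0.
COST = {"TP": 1.0, "FP": 1.0, "FN": 5.0, "TN": 0.0}

def total_cost(tp, fp, fn, tn):
    """Cost of a classifier's predictions given its confusion matrix."""
    return (tp * COST["TP"] + fp * COST["FP"]
            + fn * COST["FN"] + tn * COST["TN"])

# Example: 10 defective and 90 clean modules; the classifier finds 8 of
# the 10 defects and raises 5 false alarms.
print(total_cost(tp=8, fp=5, fn=2, tn=85))   # 8 + 5 + 10 + 0 = 23.0
```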

Potential solution to class imbalance

It is well known that software defect prediction datasets typically suffer from the class imbalance problem, where the datasets contain only a few records with defective modules compared to the number of defect-free modules. For example, the MC1 and PC2 datasets contain only 0.73% and 1.01% defective modules, respectively. The ability of a classifier to predict the records with the minority class value (such as the defective modules) can be challenged by the class imbalance problem in a dataset.

We investigate

Class imbalance experiments

We compare the cost of the original CSForest with a modified version of CSForest that incorporates an oversampling technique as described in the previous section. We use SLS to refer to Safe-Level-Smote. The number in parentheses in the following table refers to how many oversampled versions of the original dataset were created. To make references easier, we refer to the columns which represent CSForests which incorporate class imbalance handling as different setups. For example, Setup 2 is a CSForest

Examples of knowledge discovery by our cost-sensitive forest

A subject of our future research is knowledge extraction from decision forests built by the CSForest method. However, for the interested reader, we present a few takeaway insights discovered by the proposed cost-sensitive forest. Since these insights are described using attributes from the datasets, a few attributes are first explained in detail in the following subsection.

Conclusion

Software development projects almost inevitably accumulate defects during the development process. Software defect prediction systems may be used to advise software developers as to which modules may contain defects. Cost-sensitivity can be incorporated in an effort to optimize the predictions for the monetary cost to the developers. We present CSForest, a cost-sensitive decision forest algorithm, and a cost-sensitive voting technique. We show that when combined, CSForest and

Code Availability

The Java code for CSForest, CSVoting and BCSForest can be found online at "mikesiers.com/software/" and also at "http://csusap.csu.edu.au/~zislam/".

Acknowledgments

The second author of this paper would like to thank the Faculty of Business Compact Funding R4 P55 at Charles Sturt University, Australia. We are grateful for the constructive feedback from the reviewers.

References (23)

  • Y. Freund et al.

    A decision-theoretic generalization of on-line learning and an application to boosting

    J. Comput. Syst. Sci.

    (1997)
  • V.S. Sheng et al.

    Cost-sensitive learning for defect escalation

    Knowl.-Based Syst.

    (2014)
  • L. Breiman

    Bagging predictors

    Mach. Learn.

    (1996)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe-level-smote: safe-level-synthetic minority over-sampling...
  • N.V. Chawla et al.

    SMOTE: synthetic minority over-sampling technique

    J. Artif. Intell. Res.

    (2002)
  • P. Domingos, Metacost: a general method for making classifiers cost-sensitive, in: Proceedings of the Fifth ACM SIGKDD...
  • D. Gray, D. Bowes, N. Davey, Y. Sun, B. Christianson, The misuse of the NASA metrics data program data sets for...
  • D. Gray et al.

    Reflections on the NASA MDP data sets

    IET Softw.

    (2012)
  • M.H. Halstead

    Elements of Software Science

    (1977)
  • G. Holmes, A. Donkin, I.H. Witten, Weka: A machine learning workbench, in: Proceedings of the Second Australian and New...