1 Introduction
-
RQ Which performance metrics are good indicators for cost saving potential of defect prediction models?
-
Our empirical study did not find a generalizable relationship between performance metrics and the cost saving potential of software defect prediction. We provide a mathematical explanation based on the formulation of the costs and the metrics. The analysis revealed that the likely reason for the missing relationship is that a small proportion of very large software artifacts is the main driver of the costs, i.e., a small fraction of the data dominates the cost calculation.
-
We suggest that future research always considers costs directly if the economic performance of defect prediction models is relevant. This means that all studies aiming to find the “best” defect prediction model should consider this criterion. Otherwise, such studies should not claim superiority for use cases that involve predicting defects within a company to guide quality assurance.
-
We find that the chance that release-level defect prediction models save costs at all is mediocre: even for the best model we observed, it was only 63%. This means that for more than one third of the cases, using release-level defect prediction can never make economic sense. We note that while this finding is restricted to the approaches we used, these are good models from prior benchmark studies.
2 Related Work
3 Research Protocol
3.1 Notation
-
S is the set of software artifacts for which defects are predicted. Examples for software artifacts are files, classes, methods, or changes to any of the aforementioned.
-
h : S → {0,1} is the defect prediction model, where h(s) = 1 means that the model predicts a defect in an artifact s ∈ S. Alternatively, when performance metrics require scores for instances, we use the notation h′ : S → [0,1] with a threshold t such that h(s) = 1 if and only if h′(s) > t.
-
D is the set of defects, where each defect d ∈ D is a subset d ⊆ S. Thus, a defect is defined by the set of software artifacts that are affected by the defect.
-
DPRED = {d ∈ D : ∀s ∈ d | h(s) = 1} is the set of predicted defects. Hence, a defect is only predicted successfully if the defect prediction model predicts all artifacts affected by the defect. Through this, we account for the n-to-m relationship between artifacts and defects, i.e., one file may be affected by multiple bugs and one bug may affect multiple files.
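As a small sketch (with hypothetical file names), the definitions of DPRED and DMISS translate directly into code: a defect counts as predicted only if the model flags every artifact it affects.

```python
def predicted_defects(defects, h):
    """Split D into D_PRED (all affected artifacts flagged) and D_MISS."""
    predicted = [d for d in defects if all(h[s] == 1 for s in d)]
    missed = [d for d in defects if any(h[s] == 0 for s in d)]
    return predicted, missed

# Hypothetical example: one file can host several defects and one defect
# can span several files (the n-to-m relationship).
h = {"A.java": 1, "B.java": 1, "C.java": 0}
defects = [{"A.java"}, {"A.java", "B.java"}, {"B.java", "C.java"}]

pred, miss = predicted_defects(defects, h)
print(len(pred), len(miss))  # 2 1
```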
-
DMISS = {d ∈ D : ∃s ∈ d | h(s) = 0} = D ∖ DPRED is the set of missed defects.
-
SDEF = {s ∈ S : ∃d ∈ D | s ∈ d} is the set of software artifacts that are defective.
-
SCLEAN = {s ∈ S : ∄d ∈ D | s ∈ d} is the set of software artifacts that are clean, i.e., not defective.
-
h∗ : S → {0,1} is the target model, where each prediction is correct, i.e., h∗(s) = 1 if s ∈ SDEF and h∗(s) = 0 if s ∈ SCLEAN.
-
tp = |{s ∈ S : h∗(s) = 1 and h(s) = 1}| are the artifacts that are affected by any defect and correctly predicted as defective.
-
fn = |{s ∈ S : h∗(s) = 1 and h(s) = 0}| are the artifacts that are affected by any defect and are missed by the prediction model.
-
tn = |{s ∈ S : h∗(s) = 0 and h(s) = 0}| are the artifacts that are clean and correctly predicted as clean by the prediction model.
-
fp = |{s ∈ S : h∗(s) = 0 and h(s) = 1}| are the artifacts that are clean and wrongly predicted as defective by the prediction model.
-
size(s) is the size of the software artifact s ∈ S. Within this study, we use size(s) = LLOC(s), where LLOC are the logical lines of code of the artifact.
-
C = CDEF/CQA is the ratio between the expected costs of a post-release defect CDEF and the expected costs for quality assurance per size unit CQA. Costs for quality assurance for an artifact qa(s) are, therefore, qa(s) = size(s) ⋅ CQA.
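To make the role of C and size(s) concrete, a minimal sketch of a cost calculation, assuming a simple formulation in which QA is applied to every flagged artifact and every missed defect incurs the post-release cost CDEF (the study's exact cost model may differ; all numbers are hypothetical and costs are expressed in units of CQA):

```python
def relative_cost(artifacts, size, h, missed_defects, C):
    """Total cost of using model h, measured in units of C_QA."""
    qa_cost = sum(size[s] for s in artifacts if h[s] == 1)  # qa(s) = size(s) * C_QA
    defect_cost = C * len(missed_defects)  # each missed defect costs C_DEF = C * C_QA
    return qa_cost + defect_cost

size = {"A.java": 100, "B.java": 2000}
h = {"A.java": 1, "B.java": 0}
missed = [{"B.java"}]  # one defect slips through

print(relative_cost(["A.java", "B.java"], size, h, missed, C=500))  # 600
```

Note how the large artifact dominates: flagging B.java instead would cost 2000 size units of QA, more than the 500 the missed defect costs here.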
3.2 Research Question
-
RQ: Which performance metrics are good indicators for cost saving potential of defect prediction models?
3.3 Variables
3.3.1 Dependent Variable
3.3.2 Independent Variables
3.3.3 Confounding Variables
Project | #Releases | #Files | Defect Ratio (%) |
---|---|---|---|
Ant-ivy | 6 (6) | 240-474 | 2.9-14.2 |
Archiva | 7 (6) | 430-467 | 0.7-6.9 |
Calcite | 16 (16) | 1075-1415 | 3.3-16.6 |
Cayenne | 2 (2) | 1578-1708 | 1.2-2.6 |
Commons-bcel | 6 (3) | 325-378 | 0.0-3.5 |
Commons-beanutils | 10 (1) | 3-104 | 0.0-8.0 |
Commons-codec | 11 (0) | 14-64 | 0.0-12.5 |
Commons-collections | 9 (5) | 26-301 | 0.3-5.2 |
Commons-compress | 17 (12) | 61-201 | 3.0-27.9 |
Commons-configuration | 14 (7) | 29-240 | 2.9-28.6 |
Commons-dbcp | 11 (0) | 32-56 | 1.8-35.9 |
Commons-digester | 14 (3) | 14-157 | 0.0-6.4 |
Commons-io | 11 (5) | 34-115 | 3.5-22.2 |
Commons-jcs | 6 (3) | 213-368 | 0.5-4.4 |
Commons-jexl | 6 (0) | 52-85 | 0.0-9.6 |
Commons-lang | 16 (5) | 26-138 | 0.0-22.2 |
Commons-math | 13 (12) | 106-914 | 0.0-7.2 |
Commons-net | 15 (13) | 95-270 | 1.9-19.8 |
Commons-scxml | 5 (0) | 72-79 | 6.3-19.0 |
Commons-validator | 7 (0) | 17-63 | 0.0-31.8 |
Commons-vfs | 4 (2) | 236-262 | 0.8-11.0 |
Deltaspike | 16 (15) | 56-725 | 1.1-10.5 |
Eagle | 3 (1) | 682-1388 | 0.0-2.6 |
Giraph | 3 (2) | 105-753 | 0.0-6.3 |
Gora | 8 (5) | 97-210 | 0.0-11.3 |
Jspwiki | 12 (2) | 13-730 | 0.5-7.7 |
Knox | 13 (13) | 388-763 | 1.1-12.7 |
Kylin | 11 (11) | 379-1006 | 2.1-8.1 |
Lens | 2 (1) | 584-629 | 0.6-2.1 |
Mahout | 13 (10) | 264-906 | 0.0-10.6 |
Manifoldcf | 28 (28) | 430-1058 | 1.0-10.3 |
Nutch | 22 (22) | 245-500 | 1.7-12.5 |
Opennlp | 2 (2) | 626-632 | 0.8-1.7 |
Parquet-mr | 10 (10) | 221-429 | 2.1-9.0 |
Santuario-java | 6 (2) | 165-463 | 0.0-6.8 |
Systemml | 10 (9) | 851-1073 | 0.0-9.2 |
Tika | 28 (27) | 61-651 | 2.5-19.8 |
Wss4j | 5 (4) | 135-500 | 0.0-9.6 |
Total | 398 (265) | 154,271 | 5.2 |
3.4 Datasets
3.5 Execution Plan
3.5.1 Bootstrapping the Relationships
Parameter | Type | Min | Max |
---|---|---|---|
Ratio of features available for each tree | float | 0 | 1 |
Minimal instances for a decision | integer | 2 | 20 |
Minimal instances for a leaf | integer | 1 | 20 |
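If the random forest is implemented with scikit-learn, the table's parameters map naturally onto that library's names (this mapping, and the library itself, are assumptions not stated in the table). A minimal uniform sampler over the search space:

```python
import random

# Assumed scikit-learn-style names for the table's parameters:
# ratio of features per tree   -> max_features
# minimal instances, decision  -> min_samples_split
# minimal instances, leaf      -> min_samples_leaf
SEARCH_SPACE = {
    "max_features": (0.0, 1.0, float),
    "min_samples_split": (2, 20, int),
    "min_samples_leaf": (1, 20, int),
}

def sample_params(rng):
    """Draw one candidate uniformly from the hyperparameter search space."""
    return {
        name: rng.uniform(lo, hi) if kind is float else rng.randint(lo, hi)
        for name, (lo, hi, kind) in SEARCH_SPACE.items()
    }

rng = random.Random(0)
candidate = sample_params(rng)
print(candidate)
```

An optimizer such as differential evolution would iterate over candidates like these instead of sampling them independently.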
-
The correctly predicted instances for each class.
-
The instances that are predicted in the upper neighbor of a class to check for a tendency of moderate overprediction. We also check how many instances are predicted in any of the classes with more cost saving potential to check for the overall overprediction.
-
The instances that are predicted in the lower neighbor of a class to check for a tendency of moderate underprediction. We also check how many instances are predicted in any of the classes with less cost saving potential, to check for the overall underpredictions of that class.
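The three checks above can be read directly off an ordinal confusion matrix; a sketch, assuming classes are indexed from least (0) to most cost saving potential and that the matrix values are hypothetical:

```python
def ordinal_prediction_profile(cm):
    """cm[i][j]: number of instances of true class i predicted as class j,
    with classes ordered by increasing cost saving potential."""
    n = len(cm)
    correct = [cm[i][i] for i in range(n)]
    upper_neighbor = [cm[i][i + 1] if i + 1 < n else 0 for i in range(n)]  # moderate overprediction
    lower_neighbor = [cm[i][i - 1] if i > 0 else 0 for i in range(n)]      # moderate underprediction
    overpredicted = [sum(cm[i][i + 1:]) for i in range(n)]  # any class with more saving potential
    underpredicted = [sum(cm[i][:i]) for i in range(n)]     # any class with less saving potential
    return correct, upper_neighbor, lower_neighbor, overpredicted, underpredicted

cm = [
    [90, 8, 2, 0],
    [5, 80, 10, 5],
    [0, 10, 85, 5],
    [0, 0, 15, 85],
]
correct, up, low, over, under = ordinal_prediction_profile(cm)
print(correct)  # [90, 80, 85, 85]
print(over)     # [10, 15, 5, 0]
```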
3.5.2 Generalization to Realistic Settings
-
Cross-version defect prediction:
-
Cross-project defect prediction:
-
Watanabe et al. (2008) suggest standardizing the training data based on the mean value of the target project. Naive Bayes is used as a classifier.
-
An approach suggested by Camargo Cruz and Ochimizu (2009) that applies the logarithm to all features and then standardizes the training data based on the median of the target project. Naive Bayes is used as a classifier.
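A sketch of the second transformation for a single feature column (the +1 offset to guard zero-valued features and the concrete numbers are assumptions of this sketch, not taken from the original approach):

```python
import math
from statistics import median

def log_median_transform(train_col, target_col):
    """Log-transform one feature column and shift the training data so its
    median matches the target project's median."""
    log_train = [math.log(x + 1) for x in train_col]
    log_target = [math.log(x + 1) for x in target_col]
    shift = median(log_target) - median(log_train)
    return [x + shift for x in log_train]

train = [10, 100, 1000]   # hypothetical feature values, training project
target = [5, 50, 500]     # hypothetical feature values, target project
transformed = log_median_transform(train, target)
# median of the transformed training data now equals log(50 + 1)
print(abs(median(transformed) - math.log(51)) < 1e-9)  # True
```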
3.5.3 Interpretation and Theory Building
-
We cannot establish a strong relationship between our variables. We observe this through the confusion matrices in both the bootstrap experiment and the generalization to other defect prediction data. We analyze the results in detail to determine why we could not establish such a theory. In case we find a significant flaw in our methodology through this analysis, we outline how future experiments could avoid this issue and, thereby, at least contribute to the body of knowledge regarding case study guidelines. However, due to the rigorous review of our experiment protocol through the registration, we believe that it is unlikely that we find such a flaw. In case we find no flaw in the methodology, we try to determine the reasons why the metrics are not suitable proxies and try to infer if similar problems may affect other machine learning applications in software engineering. We look for reasons for this lack of a relationship both through analytic considerations of the relationship between the performance metrics and the costs, as well as due to possible explanations directly within the data.
-
We can establish a strong relationship between our variables. We observe this at least through the confusion matrices in the bootstrap experiment, but possibly not when we evaluate the generalization to other models. We use the insights from the multinomial logit, decision tree, and random forest regarding the importance of the independent and confounding variables and how they contribute to the result. We combine these insights to understand which combination of variables is suited for the prediction of the cost saving potential and can, therefore, be used as suitable proxy. We derive a theory regarding suitable proxies from these insights and how they should be used. This theory includes how the proxies are mathematically related to the cost to understand the causal relationships that lead to the criteria being good proxies. The theory may also indicate that there are no suitable proxies, in case the confounding variables are key drivers of the prediction of cost saving potential. We would interpret this as strong indication that cost saving potential depends on the structure of the training and/or test data and cannot be extrapolated from performance metrics.
-
Cost saving potential classification possible:
-
At least 90% of instances that are not cost saving are predicted correctly (level none).
-
At least 90% of instances that have cost saving potential are predicted correctly (not in level none).
-
-
Cost saving potential categorization possible (weak):
-
The two criteria above are fulfilled.
-
At least 90% of the instances with cost saving potential are either in the correct level or in a neighboring cost saving level. For example, 90% of instances with medium cost saving potential are either predicted as medium or as large.
-
-
Cost saving potential categorization possible (strong):
-
At least 90% of instances of each level are predicted correctly.
-
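The criteria above can be checked mechanically on an ordinal confusion matrix; a sketch, assuming the 90% thresholds are evaluated per level (the pooled reading of the second criterion would also be possible) and that level 0 is none:

```python
def saving_classification(cm):
    """Evaluate the three criteria on an ordinal confusion matrix.
    cm[i][j]: instances of true level i predicted as level j; level 0 is none."""
    n = len(cm)

    def row_frac(i, cols):
        total = sum(cm[i])
        return sum(cm[i][j] for j in cols) / total if total else 1.0

    # Classification: none predicted as none, saving levels as any saving level.
    none_ok = row_frac(0, [0]) >= 0.9
    saving_ok = all(row_frac(i, range(1, n)) >= 0.9 for i in range(1, n))
    classification = none_ok and saving_ok
    # Weak categorization: additionally correct or neighboring level.
    weak = classification and all(
        row_frac(i, [j for j in (i - 1, i, i + 1) if 0 <= j < n]) >= 0.9
        for i in range(1, n))
    # Strong categorization: every level at least 90% exactly correct.
    strong = all(row_frac(i, [i]) >= 0.9 for i in range(n))
    return classification, weak, strong

cm = [  # hypothetical matrix that satisfies all three criteria
    [95, 5, 0, 0],
    [2, 93, 5, 0],
    [0, 4, 92, 4],
    [0, 0, 6, 94],
]
print(saving_classification(cm))  # (True, True, True)
```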
3.5.4 Sensitivity Analysis
3.6 Summary of Deviations
-
We modified our dependent variable to only have four levels instead of six, due to data scarcity in the other two levels.
-
We strengthened the requirements on the data for the bootstrap experiment to enforce that there are at least two defective artifacts to ensure that SMOTUNED (Agrawal and Menzies 2018) is always usable.
-
We calculate Spearman’s rank correlation (Spearman 1987) between all independent and confounding variables to enable the consideration of the interactions between variables in our theory building.
-
We first train a multinomial logit model with normalized variables to select the relevant variables through regularization and subsequently train the model on the selected variables without regularization for the analysis.
-
We specified that we used differential evolution (Qing 2009) for the hyperparameter tuning of the random forests and a grid search for the multinomial logit model.
-
We specified that we use McFadden’s adjusted R2 (McFadden 1974) for the selection of the best hyperparameters of the multinomial logit model because we have an ordinal dependent variable.
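McFadden's adjusted R² penalizes the log-likelihood of the fitted model by the number of estimated parameters k: R²adj = 1 − (ln L_model − k) / ln L_null. A small sketch with hypothetical log-likelihood values:

```python
def mcfadden_adjusted_r2(ll_model, ll_null, n_params):
    """McFadden's adjusted pseudo R^2: 1 - (lnL_model - k) / lnL_null,
    where lnL_null is the log-likelihood of the intercept-only model."""
    return 1 - (ll_model - n_params) / ll_null

# Hypothetical log-likelihoods of a fitted and an intercept-only model:
print(mcfadden_adjusted_r2(-120.0, -200.0, 5))  # 0.375
```

Unlike the unadjusted variant, this value decreases when additional parameters do not improve the likelihood, which makes it suitable for comparing hyperparameter settings of different complexity.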
4 Results
4.1 Bootstrap Experiment
-
the recall group with the metrics recall, F-measure, G-measure, balance, MCC, and consistency;
-
the fpr group with the metrics fpr and errorTypeI;
-
the accuracy group with the metrics accuracy, error, errorTypeII, NECM10, NECM25, and biastest; and
-
the N group with the metrics Ntrain, \(N^{\prime }_{\textit {train}}\), and Ntest.
4.2 Generalization
-
The recall is still strongly correlated with consistency. However, the other correlations are weaker now, especially with F-measure and MCC. Instead, the recall now seems to be associated with fpr, errorTypeI, accuracy, error, and errorTypeII.
-
The F-measure is now correlated with the precision instead of recall.
-
The accuracy was correlated with the errorTypeII in the bootstrap experiment. Now it is associated with the errorTypeI as well. The correlation with the biastest is now low.
4.3 Sensitivity Analysis
5 Discussion
5.1 Relationship between Variables
5.2 Mathematical Explanation of the Results
5.3 Defect Prediction Performance
-
Whether defect prediction can be cost saving at all is essentially a coin flip that slightly favors you: the six approaches we used for the generalization experiment were not cost saving in a median of 46% of the cases. The best result was not cost saving in only 37% of the cases (cross-version defect prediction with the approach by Kawata et al. (2015)), the worst result in 68% of the cases (cross-project defect prediction with the approach by Watanabe et al. (2008)).
-
If you win the coin flip (cost saving is possible), the range of values of C for which costs are actually saved is log-normally distributed with a mean of roughly 1,000. This means you require an estimate of the ratio between the cost of a post-release defect and the cost of quality assurance per size unit that is accurate to within a single KLOC (kilo lines of code).
5.4 Consequences for Defect Prediction Researchers
-
A new defect prediction model is proposed with the intent to demonstrate a better performance than the state of the art. Since a better prediction performance is directly related to the intent to be better from an economic point of view, such studies should use cost saving potential, or a similar criterion that directly measures costs, as main criterion for the comparison of approaches. Other metrics, e.g., recall, may be used to augment such studies to provide insights into the behavior of the prediction model.
-
A defect prediction model is used to study the relationship between a property (e.g., changes, static analysis warnings) and defects. This relationship is not only studied by pure prediction performance, but also by studying the inner workings of the defect prediction, to understand if and how the considered property is related to defects. Such studies do not need to consider the economic side of defect prediction and should instead follow the guidance by Yao and Shepperd (2021) and use MCC.