1 Introduction
- openjpa-jdbc/src/main/java/org/apache/openjpa/jdbc/kernel/FinderCacheImpl.java
- openjpa-kernel/src/main/java/org/apache/openjpa/kernel/DelegatingFetchConfiguration.java
- openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfiguration.java
- openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java
- openjpa-persistence-jdbc/src/test/java/org/apache/openjpa/persistence/fetchgroups/TestFetchGroups.java
Starting from the defect-fixing commit ee6f4acc3ff9ac43ea4e98579b478e55767aef24, we retrieve a list of commits that may have caused the defect (i.e., the defect-inducing commits). Specifically, RA-SZZ retrieved two defect-inducing commits, which touched 773 LOC and 139 classes. These two defect-inducing commits have the IDs 1fede626e2cad16f7bb4d77dd9fc3270a8b6b331 and 979d2340e93eaaa9f273a100dbe78e42ea9ed400, respectively (both in release 0.9.0). However, not all statements in these two defect-inducing commits are defective. Specifically, only the class openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java is deemed defective among the classes retrieved by RA-SZZ; i.e., the two defect-inducing commits touched the same class. Afterwards, the class openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java is labeled as defective from release 0.9.0 to release 2.3.0 (excluded). Since we have two defect-inducing commits that may have contributed to the defectiveness of the class openjpa-kernel/src/main/java/org/apache/openjpa/kernel/FetchConfigurationImpl.java, we believe that the information embedded in these commits can contribute to both method- and class-defectiveness predictions.

2 Study Design
2.1 Subject Projects
Dataset | CMT | D. CMT | Meth | D. Meth | CLS | D. CLS | Defects | Linkage | Releases | V. CMT |
---|---|---|---|---|---|---|---|---|---|---|
ARTEMIS | 589 | 34% | 83880 | 0.14% | 9425 | 0.41% | 133 | 82% | 1.0.0;1.0.1;1.2.0 | 21 |
DIRSERVER | 367 | 6% | 19335 | 0.11% | 3367 | 1.85% | 14 | 83% | 1.01;1.02;1.03;1.04 | 10 |
GROOVY | 701 | 2% | 14310 | 0.12% | 1496 | 1.54% | 20 | 77% | 1.0-1;1.0-2;1.0-3;1.0-4;1.0-5 | 12 |
MNG | 1835 | 39% | 14519 | 2.33% | 2778 | 8.8% | 582 | 51% | 2.0a1;2.0a2;2.0a3;2.0-1;2.0-2 | 24 |
NUTCH | 223 | 65% | 6934 | 4.86% | 1078 | 13.22% | 135 | 70% | 0.7;0.7.1;0.7.2 | 21 |
OPENJPA | 538 | 54% | 44997 | 0.48% | 3119 | 2.65% | 263 | 92% | 0.9.0;0.9.6;0.9.7 | 23 |
QPID | 1443 | 44% | 43049 | 3.42% | 5423 | 12.36% | 617 | 82% | M1;M2;M2.1;M3;M4 | 24 |
TIKA | 246 | 29% | 1844 | 0.64% | 369 | 6.32% | 69 | 75% | 0.1-incubating;0.2;0.3 | 19 |
ZOOKEEPER | 174 | 21% | 6249 | 0.49% | 838 | 7.01% | 27 | 76% | 3.0.0;3.0.1;3.1.0;3.1.1 | 13 |
2.2 RQ1: Do Methods and Classes vary in Defectiveness?
- H10: the number of defective methods is equivalent to the number of defective classes.
- H20: the proportion of defective methods is equivalent to the proportion of defective classes.
Effect size | d |
---|---|
Very small | < 0.01 |
Small | < 0.20 |
Medium | < 0.50 |
Large | < 0.80 |
Very Large | ≥ 0.80 |
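The effect-size thresholds above can be captured in a small helper. The sketch below is a hypothetical illustration (the study's actual statistical tooling is not shown here); `cohens_d` uses the standard pooled-standard-deviation formulation.

```python
import math

def cohens_d(a, b):
    """Cohen's d for two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return abs(ma - mb) / pooled

def magnitude(d):
    """Map |d| to the labels of the effect-size table."""
    if d < 0.01:
        return "Very small"
    if d < 0.20:
        return "Small"
    if d < 0.50:
        return "Medium"
    if d < 0.80:
        return "Large"
    return "Very Large"
```

For example, `magnitude(0.598)` yields "Large", matching how a d of 0.598 is interpreted in Section 3.1.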
2.3 RQ2: Does Leveraging JIT Information Increase the Accuracy of MDP?
- H30: leveraging JIT does not improve the accuracy of MDP.
- Single: it uses the state-of-the-art approach for MDP (Giger et al. 2012; Pascarella et al. 2020). Specifically, we used the following set of features as input to a machine learning classifier:
  - size: LOC of a method.
  - methodHistories: number of times a method was changed.
  - authors: number of distinct authors that changed a method.
  - stmtAdded: sum of all source code statements added to a method body over all method histories.
  - maxStmtAdded: maximum number of source code statements added to a method body over all method histories.
  - avgStmtAdded: average number of source code statements added to a method body per method history.
  - stmtDeleted: sum of all source code statements deleted from a method body over all method histories.
  - maxStmtDeleted: maximum number of source code statements deleted from a method body over all method histories.
  - avgStmtDeleted: average number of source code statements deleted from a method body per method history.
  - churn: sum of stmtAdded and stmtDeleted over all method histories.
  - maxChurn: maximum churn over all method histories.
  - avgChurn: average churn per method history.
  - cond: number of condition expression changes in a method body over all revisions.
  - elseAdded: number of added else-parts in a method body over all revisions.
  - elseDeleted: number of deleted else-parts from a method body over all revisions.
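The change-history features above can be derived mechanically from a method's sequence of revisions. A minimal sketch, assuming each revision is represented as a pair of statements added and deleted (the function name and input format are hypothetical, not the study's actual tooling):

```python
from statistics import mean

def method_history_features(revisions):
    """Derive a subset of the change-history features listed above for one
    method. `revisions` is a list of (statements_added, statements_deleted)
    pairs, one per change to the method."""
    added = [a for a, _ in revisions]
    deleted = [d for _, d in revisions]
    # churn per revision is added plus deleted statements, as defined above
    churn = [a + d for a, d in revisions]
    return {
        "methodHistories": len(revisions),
        "stmtAdded": sum(added),
        "maxStmtAdded": max(added),
        "avgStmtAdded": mean(added),
        "stmtDeleted": sum(deleted),
        "maxStmtDeleted": max(deleted),
        "avgStmtDeleted": mean(deleted),
        "churn": sum(churn),
        "maxChurn": max(churn),
        "avgChurn": mean(churn),
    }
```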
- Combined: it takes the median to combine the previously mentioned Single approach with two other scores. The rationale of the Combined approach is that a defective commit incurs defective methods (i.e., the methods touched by the commit). We use the median as the combination mechanism because it is a simple way to combine several probabilities.
  - SumC: the sum of the defectiveness probabilities of the commits touching the method. The rationale is that the probability that a method is defective is related to the sum of the probabilities of the commits touching it.
  - MaxC: the maximum of the defectiveness probabilities of the commits touching the method. The rationale is that the probability that a method is defective is related to the maximum of the probabilities of the commits touching it.
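The median-based combination of Single, SumC, and MaxC can be sketched as follows (an illustrative reading of the description above; the function name is hypothetical and SumC is left unbounded, as the text does not specify any normalization):

```python
from statistics import median

def combined_score(single_prob, commit_probs):
    """Combine the method-level Single probability with SumC and MaxC,
    the sum and maximum of the defectiveness probabilities of the
    commits touching the method, by taking their median."""
    sum_c = sum(commit_probs)
    max_c = max(commit_probs)
    return median([single_prob, sum_c, max_c])
```

For instance, a method with a Single probability of 0.3, touched by two commits with probabilities 0.2 and 0.4, gets SumC = 0.6 and MaxC = 0.4, so the Combined score is the median, 0.4.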
We use a standard JIT approach (Kamei et al. 2012) to obtain the defectiveness probabilities of the commits that touch the methods. Specifically, we used the following set of features as input to a machine learning classifier:
  - Size: lines of code modified.
  - Number of modified subsystems (NS): changes modifying many subsystems are more likely to be defect-prone.
  - Number of modified directories (ND): changes that modify many directories are more likely to be defect-prone.
  - Number of modified files (NF): changes touching many files are more likely to be defect-prone.
  - Distribution of modified code across each file (Entropy): changes with high entropy are more likely to be defect-prone, because a developer will have to recall and track larger numbers of scattered changes across each file.
  - Lines of code added (LA): the more lines of code added, the more likely a defect is introduced.
  - Lines of code deleted (LD): the more lines of code deleted, the higher the chance that a defect occurs.
  - Lines of code in a file before the change (LT): the larger a file, the more likely a change introduces a defect.
  - Whether or not the change is a defect fix (FIX): fixing a defect means that an error was made in an earlier implementation; it may therefore indicate an area where errors are more likely.
  - Number of developers that changed the modified files (NDEV): the larger the NDEV, the more likely a defect is introduced, because files revised by many developers often contain different thoughts and coding styles.
  - Average time interval between the last and the current change (AGE): the lower the AGE, the more likely a defect will be introduced.
  - Number of unique changes to the modified files (NUC): the larger the NUC, the more likely a defect is introduced, because a developer will have to recall and track many previous changes.
  - Developer experience (EXP): more experienced developers are less likely to introduce a defect.
  - Recent developer experience (REXP): developers that have often modified the files in recent months are less likely to introduce a defect, because they are more familiar with the recent developments in the system.
  - Developer experience on a subsystem (SEXP): developers that are familiar with the subsystems modified by a change are less likely to introduce a defect.
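Of the JIT features above, Entropy is the least self-explanatory. A minimal sketch of how it could be computed from a commit's per-file modified-line counts (the normalization by the maximum entropy log2(n) is an assumption of this sketch, not necessarily the study's exact implementation):

```python
import math

def change_entropy(lines_modified_per_file):
    """Shannon entropy of a change's modified lines across its files,
    normalized to [0, 1]. A change concentrated in one file scores 0;
    a change spread evenly across files scores 1."""
    total = sum(lines_modified_per_file)
    probs = [m / total for m in lines_modified_per_file if m > 0]
    h = -sum(p * math.log2(p) for p in probs)
    n = len(probs)
    return h / math.log2(n) if n > 1 else 0.0
```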
- AUC: the Area Under the Receiver Operating Characteristic Curve (Powers 2007) is the area under the curve of the true positive rate versus the false positive rate, obtained by setting multiple thresholds. A positive instance is a defective entity, whereas a negative instance is a defect-free entity. AUC has the advantage of being threshold-independent and is therefore recommended for evaluating defect prediction techniques (Lessmann et al. 2008). We decided to avoid metrics such as Precision, Recall, and F1, since they are threshold-dependent.
- PofBx: as the effort-aware metric, we used PofBx (Chen et al. 2017; Wang et al. 2020; Xia et al. 2016; Tu et al. 2020). PofBx is defined as the proportion of defective entities identified by analyzing the first x% of the code base. For instance, a PofB10 of 30% signifies that 30% of defective entities have been found by analyzing 10% of the code base. We explored PofBx with x in the range [10, 50]. While previous studies focused only on x = 20, we investigated a wider range to obtain more informative results. Note that PofB differs from Popt (Kamei et al. 2010, 2013; Mende and Koschke 2010) in two aspects: normalization and the range of x. Regarding normalization, while Popt normalizes the value against a random approach, PofB does not; this aligns with our goals for two reasons: (1) the comparison against a random approach is already provided by AUC, since an AUC higher than 0.5 indicates that a classifier performed better than a random classifier; (2) in our study, we are interested in comparing classifiers that rank entities at different levels of granularity. Specifically, since methods and classes have different defectiveness proportions (see Table 1), a random ranking would perform differently across methods and classes. Regarding the value of x, Popt averages over the complete spectrum of x, but we decided to neglect high values of x, as we believe they would be unrealistic for practitioners deciding which code should be inspected during testing. Specifically, the lower the amount of code tested, the higher the impact of the ranking approach; i.e., if 100% of the code needs to be inspected, the ranking approach is effectively useless. Thus, we envisioned a metric that expresses the return of investing a specific amount of time in testing x% of the code as suggested by the ranking from a classifier. For all these reasons, PofB is a better match for our needs than Popt.
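The PofBx computation described above can be sketched directly from its definition (an illustrative implementation under the assumption that entities are ranked by the classifier and inspection effort is counted in LOC; names are hypothetical):

```python
def pofb(ranked_entities, x_percent):
    """PofBx: proportion of all defective entities found when inspecting
    the first x% of the code base in the classifier's ranked order.
    `ranked_entities` is a list of (loc, is_defective), best-ranked first."""
    total_loc = sum(loc for loc, _ in ranked_entities)
    total_defective = sum(1 for _, d in ranked_entities if d)
    budget = total_loc * x_percent / 100.0
    inspected_loc, found = 0, 0
    for loc, defective in ranked_entities:
        if inspected_loc + loc > budget:
            break  # next entity would exceed the x% inspection budget
        inspected_loc += loc
        found += defective
    return found / total_defective
```

For example, with four entities of 10, 10, 10, and 70 LOC where the first and third are defective, PofB20 is 0.5 (one of two defects found within the first 20% of LOC).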
- Feature Selection: we filter the independent variables described above using correlation-based feature subset selection (Hall 1998; Ghotra et al. 2017; Kondo et al. 2019). The approach evaluates the worth of a subset of features by considering the individual predictive capability of each feature, as well as the degree of redundancy between features. It searches the space of feature subsets with greedy hill-climbing augmented with a backtracking facility, starting from an empty set of features and performing a forward search that considers all possible single-feature additions and deletions at a given point.
- Random Forest: it generates a number of separate, randomized decision trees and outputs as its classification the mode of their classifications. It has proven to be highly accurate and robust against noise (Breiman 2001). However, it can be computationally expensive, as it requires building several trees.
- Logistic Regression: it estimates the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. The estimation is performed through the logistic distribution function (Cessie and Houwelingen 1992).
- Naïve Bayes: it applies Bayes' theorem, i.e., it assumes that the contribution of an individual feature towards deciding the probability of a particular class is independent of the other features in that dataset instance (Mccallum and Nigam 2001).
- HyperPipes: it constructs a hyper-rectangle for each label, recording the bounds of each numeric feature and the values that occur for nominal features. When the classifier is applied, the chosen label is the one whose hyper-rectangle best contains the instance (i.e., within whose bounds the highest number of the test instance's feature values fall).
- IBK: also known as the k-nearest neighbors algorithm (k-NN), a non-parametric method. The classification is based on a majority vote of the instance's neighbors, with the object being assigned to the class most common among its k nearest neighbors (Altman 1992).
- IB1: a special case of IBK with k = 1, i.e., it uses the closest neighbor (Altman 1992).
- J48: it builds decision trees from a set of training data. It extends the Iterative Dichotomiser 3 classifier (Quinlan 1986) by accounting for missing values, decision-tree pruning, continuous feature value ranges, and the derivation of rules.
- VFI: also known as voting feature intervals (Demiröz and Güvenir 1997). A set of feature intervals represents a concept on each feature dimension separately; each feature then votes by distributing its vote among the classes, and the predicted class is the one receiving the highest vote (Demiröz and Güvenir 1997).
- Voted Perceptron: it trains a new perceptron every time an example is wrongly classified, initializing the weight vector with the final weights of the previous perceptron. Each perceptron is also given a weight corresponding to how many examples it correctly classifies before wrongly classifying one, and the final output is a weighted vote over all perceptrons (Freund and Schapire 1999).
2.4 RQ3: Does Leveraging JIT Information Increase the Accuracy of CDP?
- H40: leveraging JIT does not improve the accuracy of CDP.
- Single: it uses the state-of-the-art approach for CDP (Falessi et al. 2021). Specifically, we used the following set of features as input to a machine learning classifier:
  - Size (LOC): lines of code.
  - LOC Touched: sum over revisions of LOC added and deleted.
  - NR: number of revisions.
  - Nfix: number of defect fixes.
  - Nauth: number of authors.
  - LOC Added: sum over revisions of LOC added.
  - Max LOC Added: maximum over revisions of LOC added.
  - Average LOC Added: average LOC added per revision.
  - Churn: sum over revisions of added minus deleted LOC.
  - Max Churn: maximum churn over revisions.
  - Average Churn: average churn over revisions.
  - Change Set Size: number of files committed together.
  - Max Change Set: maximum change set size over revisions.
  - Average Change Set: average change set size over revisions.
  - Age: age of the release.
  - Weighted Age: age of the release weighted by LOC touched.
- Combined: similar to RQ2, it takes the median to combine the previously described Single approach with JIT information.
2.5 RQ4: Are we more Accurate in MDP or CDP?
- H50: MDP is as accurate as CDP.
2.6 Replication Package
3 Study Results
3.1 RQ1: Do Methods and Classes vary in Defectiveness?
Entity | Pvalue | Cohen’s d |
---|---|---|
Class Vs Method | 0.0488 | 0.598 |
Entity | Pvalue | Cohen’s d |
---|---|---|
Class Vs Method | 0.0488 | 0.359 |
3.2 RQ2: Does Leveraging JIT Information Increase the Accuracy of MDP?
- Combined is better than Direct in all PofBs in GROOVY, MNG, NUTCH, OPENJPA, QPID and TIKA.
- Combined is worse than Direct in all PofBs in DIRSERVER.
- Similar to the previous results for AUC, the distribution of values from Combined is substantially narrower than the distribution of values from Direct. This result indicates that the choice of classifier is less important when using Combined.
- MaxC or SumC have been selected in all projects except ZOOKEEPER.
- Direct has been selected in seven out of the nine projects.
- MaxC or SumC have been selected more often than Direct in five out of the nine projects.
 | AUC | PofB10 | PofB15 | PofB20 | PofB25 | PofB30 | PofB35 | PofB40 | PofB45 | PofB50 |
---|---|---|---|---|---|---|---|---|---|---|
Pvalue | 0.0001* | 0.1082 | 0.5813 | 0.0275* | 0.0001* | 0.0001* | 0.0001* | 0.0001* | 0.0001* | 0.0001* |
Cohen’s d | 1.1746 | 0.4052 | 0.2263 | 0.4900 | 0.7130 | 0.7970 | 0.8241 | 0.5896 | 0.5567 | 0.3949 |
3.3 RQ3: Does Leveraging JIT Information Increase the Accuracy of CDP?
- The median of Combined is more accurate than that of Direct in all nine projects.
- As in RQ2, the distribution of values across classifiers for Combined is much narrower than the distribution for Direct. Therefore, the choice of classifier is less important when using the Combined approach.
- Combined is better than Direct in all PofB values in seven out of nine projects: ARTEMIS, DIRSERVER, MNG, NUTCH, OPENJPA, QPID and TIKA.
- There is no dataset where Combined is worse than Direct in all PofB values.
- Similar to the results for AUC, the distribution of values across classifiers for Combined is much narrower than the distribution for Direct. Therefore, the choice of classifier is less important when using Combined.
- MaxC or SumC have been selected in all nine projects.
- Direct has not been selected in two out of the nine projects.
- MaxC or SumC were selected more often than Direct in five out of the nine projects.
 | AUC | PofB10 | PofB15 | PofB20 | PofB25 | PofB30 | PofB35 | PofB40 | PofB45 | PofB50 |
---|---|---|---|---|---|---|---|---|---|---|
Pvalue | 0.0001* | 0.2223 | 0.0913 | 0.0001* | 0.0001* | 0.0001* | 0.0001* | 0.0001* | 0.0001* | 0.0001* |
Cohen’s d | 1.5737 | 0.0943 | 0.1787 | 0.5269 | 0.7842 | 0.7590 | 0.7295 | 0.9890 | 1.2621 | 1.2941 |
3.4 RQ4: Are we more Accurate in MDP or CDP?
- MDP is more accurate than CDP in all nine projects.
- The distribution of values across classifiers for MDP is much narrower than the distribution for CDP. Therefore, the choice of classifier is less important in MDP than in CDP.
- MDP is more accurate than CDP with all nine classifiers.
- The distribution of values across projects for MDP is much narrower than the distribution for CDP. Thus, classifiers are more stable in MDP than in CDP.
- MDP is better than CDP in all PofB values in four projects.
- MDP is worse than CDP in all PofB values only in the GROOVY project.
 | AUC | PofB10 | PofB15 | PofB20 | PofB25 | PofB30 | PofB35 | PofB40 | PofB45 | PofB50 |
---|---|---|---|---|---|---|---|---|---|---|
Pvalue | 0.0001* | 0.0001* | 0.0001* | 0.0002* | 0.0001* | 0.0001* | 0.0001* | 0.0003* | 0.1381 | 0.9942 |
Cohen’s d | 1.7579 | 0.5661 | 0.4334 | 0.3894 | 0.4737 | 0.5959 | 0.6106 | 0.2784 | 0.0367 | 0.1196 |
4 Discussion
4.1 Main Results and Possible Explanations
4.2 Implications
5 Threats to Validity
5.1 Conclusion
5.2 Internal
5.3 Construct
5.4 External
6 Related Work
6.1 Combining Heterogeneous Predictions
6.2 Method Defectiveness Prediction
- The use of process metrics as features for MDP. Specifically, “The addition of alternative features based on textual, code smells, and developer-related factors improve the performance of the existing models only marginally, if at all.” (Pascarella et al. 2020)
- The use of a realistic validation procedure. They performed a walk-forward procedure, whereas we performed a simple split; both procedures are realistic since they preserve the order of the data (Falessi et al. 2020).
- We use an advanced SZZ implementation (RA-SZZ), whereas they use ReLink (Wu et al. 2011).
- The use of a different definition of a defective entity. In our research, an entity is defective from when the defect was injected until the last release before the defect was fixed.
- We use a different set of classifiers.
- We use effort-aware metrics such as PofB.
- We selected a different set of projects from which we derived the datasets. The change was due to the fact that we needed the same dataset to produce commit, method, and class data.
7 Conclusion
- MDP is significantly more accurate than CDP (+5% AUC and +62% PofB10). Thus, from a practitioner’s perspective, it is better to predict and rank defective methods than defective classes. From a researcher’s perspective, given the scarce number of MDP studies, there is a high potential for improving MDP accuracy.
- Leveraging JIT by using a simple median approach increases the accuracy of MDP by an average of 17% in AUC and 46% in PofB10, and increases the accuracy of CDP by an average of 28% in AUC and 31% in PofB20. However, in a few cases, leveraging JIT decreased the accuracy of MDP and CDP.
- Since many defective commits were only partially defective, only a small percentage of the methods touched by defective commits were actually defective. Therefore, we expect that leveraging statement-defectiveness prediction (Pornprasit and Tantithamthavorn 2021) would enhance MDP more than JIT does.
- Propose and evaluate new approaches to improve MDP by leveraging JIT. Specifically, instead of using a static approach such as the median, we could use a machine learning approach to combine MDP with JIT information.
- Leverage statement-level defect prediction (Pornprasit and Tantithamthavorn 2021) to augment MDP and CDP.
- Use multi-level features in a single prediction model. While in this work we evaluated the benefits of combining two predictions (e.g., commits with methods), in the future we plan to investigate the benefits of performing a single prediction that uses features at different levels (i.e., features at the commit and method levels).