1 Introduction
Just-In-Time Software Defect Prediction (JIT-SDP) is concerned with predicting whether software changes are defect-inducing or clean at commit time (i.e., just-in-time) based on machine learning approaches (Kamei et al. 2013). In practice, JIT-SDP operates in an online learning scenario, where software changes are produced and labeled over time for the purpose of training and evaluating JIT-SDP models. In particular, each software change must be predicted as defect-inducing or clean at commit time. Then, only when its label (defect-inducing or clean) becomes available can this software change be used as a data example to evaluate and update (train) JIT-SDP models.
It takes time for the true labels of software changes to be revealed in the real-world process of JIT-SDP (Song and Minku 2023; Ditzler et al. 2015; Cabral et al. 2019). As a result, examples need to be produced based on observed labels rather than the true labels of software changes. Specifically, a software change is labeled to produce a defect-inducing example when a defect is found to have been induced by it; in contrast, it is labeled as clean when no defect has yet been found to be induced by it and enough time has passed for one to be confident that this software change is really clean. This length of time is referred to as the waiting time (Cabral et al. 2019; Song and Minku 2023) and can be considered a pre-defined parameter W of the data collection process. The observed clean label resulting from such a waiting time may or may not be the same as the true label of the software change. Whenever it is not the same, a noisy example is produced. Such label noise caused by the waiting time may affect not only the training of JIT-SDP models, but also the validity of procedures used to evaluate them.
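As a minimal illustration (not the exact procedure of any specific study, and using hypothetical field names), this labeling rule based on a waiting time W can be sketched as follows:

```python
from typing import Optional

def observed_label(commit_time: float, defect_found_time: Optional[float],
                   label_time: float, W: float) -> Optional[str]:
    """Observed label of a software change at `label_time`, given waiting time W (days).

    - 'defect-inducing' if a defect induced by the change has been found by `label_time`;
    - 'clean' if no such defect has been found and at least W days have passed since commit;
    - None if the change cannot be labeled yet.
    An observed 'clean' label may be noisy: a defect may still be found later.
    """
    if defect_found_time is not None and defect_found_time <= label_time:
        return "defect-inducing"
    if label_time - commit_time >= W:
        return "clean"
    return None
```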
Song and Minku (2023) discussed how to evaluate predictive performance continuously over time during the software development process. The purpose of the continuous performance evaluation procedure is to track the most recent performance status of JIT-SDP models during software development. Therefore, in this evaluation procedure, each software change is used to update the predictive performance as soon as it can be labeled, given a waiting time W. This is necessary in practice because the predictive performance of JIT-SDP models may fluctuate over time as a result of changes in the underlying defect generating process (McIntosh and Kamei 2018; Cabral et al. 2019; Tabassum et al. 2020; Cabral and Minku 2022), and it is important for practitioners to be alerted to any performance deterioration as early as possible. The study found that the waiting time had a significant impact on the validity of this kind of evaluation procedure. In particular, if inappropriate waiting times are used, the results of the evaluation procedure become invalid.
Another kind of evaluation procedure is the retrospective performance evaluation procedure, where software changes are collected and labeled retrospectively rather than continuously over time. The purpose of this evaluation procedure is to check how well JIT-SDP models would have performed in practice if they had been predicting (and potentially learning from) those labeled software changes over time. Such a procedure can be used to help practitioners decide which kind of JIT-SDP approach to adopt in their company, rather than to monitor the performance of a currently adopted JIT-SDP model during the software development process. For instance, research papers typically collect and label software changes to retrospectively evaluate how well different JIT-SDP approaches would have performed on those past software changes, rather than monitoring the predictive performance of such models on software changes that are currently being developed in a project. The results of such an evaluation procedure are used to determine which kind of JIT-SDP approach is more promising to adopt in practice. Once adopted in practice, the corresponding JIT-SDP model should then have its predictive performance monitored continuously over time based on continuous evaluation procedures such as the one proposed in Song and Minku (2023), to alert software engineers if/when its performance starts deteriorating.
Retrospective performance evaluation procedures do not need to collect the label of a software change as soon as possible after this software change is committed. Instead, all labels can be collected at the same moment, when one decides to trigger this evaluation procedure. Such a labeling process also relies on a waiting time parameter. However, this waiting time refers to the minimum amount of time we wait to label a software change as clean, rather than the exact amount of time used in continuous evaluation procedures. In other words, it corresponds to the age of the newest software change that can be labeled as clean to produce an example. All other clean labeled examples are produced from older software changes. The older the software change, the more time we will have waited to observe its clean label, potentially leading to a more reliable label. Due to these differences between the waiting time used in continuous and retrospective performance evaluation procedures, it is unknown whether the validity issues found to affect continuous performance evaluation procedures (Song and Minku 2023) also affect retrospective performance evaluation procedures.
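To make this difference concrete, the following sketch (again with hypothetical field names and an assumed data layout) labels an entire change history at a single collection moment T: W acts as a minimum waiting time, since only changes committed at least W days before T can yield clean examples, and all other clean examples come from even older changes.

```python
def retrospective_labels(changes, T: float, W: float):
    """Label software changes retrospectively at collection moment T (times in days).

    Here W is a MINIMUM waiting time: only changes committed on or before T - W
    are eligible to be labeled clean. Defect-inducing labels require no waiting.
    """
    labeled = []
    for c in changes:  # each c: dict with 'commit_time' and optional 'defect_found_time'
        found = c.get("defect_found_time")
        if found is not None and found <= T:
            labeled.append((c, "defect-inducing"))
        elif T - c["commit_time"] >= W:
            # this change has effectively waited T - commit_time >= W days,
            # so older changes yield more reliable clean labels
            labeled.append((c, "clean"))
        # changes younger than W days are not used as evaluation examples
    return labeled
```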
If the impact of waiting time on the validity of retrospective performance evaluation procedures is significant, it could seriously affect the validity of a large number of existing research studies in JIT-SDP, especially considering that many of them implicitly assume that label noise is non-existent for evaluation purposes. If such impact is not significant, it would mean that the predictive performances obtained in existing research studies are likely reliable in view of different choices of waiting time, and could potentially be used to inform practitioners about which kind of JIT-SDP approach is more promising to adopt in practice.
Therefore, the aim of this paper is to systematically investigate whether and to what extent the conclusions of JIT-SDP retrospective performance evaluation procedures (and thus also the conclusions of a large body of JIT-SDP research studies) are (in)valid in view of the fact that observed labels rather than the true labels of software changes are being used for performance evaluation. This would not only lead to an insight into the validity of the conclusions drawn in existing work that overlooks the role of waiting time on evaluation procedures in JIT-SDP, but also inform future JIT-SDP work on how waiting time should be considered for evaluation purposes.
This study can be seen as a conceptual replication of Song and Minku (2023), aiming to check whether the findings obtained for continuous evaluation scenarios also occur in retrospective evaluation scenarios. For this, some adjustments had to be made to the methodology, which was otherwise kept as similar as possible to that of Song and Minku (2023). The datasets investigated in this work are also the same as in Song and Minku (2023), but their processing also had to be adjusted for the retrospective evaluation scenario. We answer three of the Research Questions (RQs) from Song and Minku (2023), but in the context of retrospective rather than continuous performance evaluation procedures:
[RQ1] How large is the amount of label noise caused by different waiting times in retrospective JIT-SDP data collection? The effect of waiting time on label noise may be reduced when evaluating JIT-SDP through retrospective evaluation procedures compared to the continuous evaluation procedure required during the software development process. This is because the waiting time is only used to determine what is the most recent software change that can be used in the retrospective performance evaluation procedure. All other changes will be older than this one, such that more time would have passed to detect their true labels, potentially reducing the amount of label noise. However, it is unknown how large the amount of label noise caused by different waiting times in retrospective data collection is.
[RQ2] To what extent is the validity of retrospective performance evaluation procedures impacted by label noise resulting from waiting time? The label noise resulting from waiting time investigated in RQ1 may or may not be large enough to have a significant impact on the validity of retrospective performance evaluation procedures. This investigation will enable us to check how reliable the estimated performance of current JIT-SDP studies is in view of the label noise caused by waiting time. In other words, it will determine whether conclusions in terms of how well different JIT-SDP approaches perform (and thus which ones are recommended for adoption in practice) are reliable in view of the label noise caused by waiting time.
[RQ3] To what extent is the validity of retrospective performance evaluation procedures impacted by different waiting times? As in Song and Minku (2023), part of RQ3 can be answered by combining the conclusions of RQ1 and RQ2. If waiting time has a significant impact on label noise (RQ1) and label noise has a significant impact on the validity of retrospective performance evaluation procedures (RQ2), waiting time may have a significant impact on the validity through the label noise it generates. However, waiting time could potentially have further impact on the validity of retrospective performance evaluation procedures that cannot be captured by label noise on its own, possibly intensifying or moderating the impact mediated by label noise. RQ3 complements the study to check whether the choice of waiting time as a whole has an impact on the validity.
To answer these RQs, we conduct experimental studies based on the same 13 GitHub software projects and statistical methodologies as in Song and Minku (2023). We find that different waiting times used in retrospective performance evaluation procedures can cause significantly different amounts of label noise (RQ1). Similar to Song and Minku (2023)'s results in the continuous evaluation scenario, we find that such amounts of label noise also have a statistically significant impact on the validity of retrospective performance evaluation procedures in JIT-SDP (RQ2). However, the differences between estimated and true predictive performance are smaller than those found in Song and Minku (2023), being always smaller than 3% and having a median of less than 1% across datasets. Different from Song and Minku (2023)'s results in the continuous evaluation scenario, when investigating the direct impact of waiting time on the validity (RQ3), we found that such impact is moderated and becomes insignificant for retrospective performance evaluation procedures. Therefore, different waiting times are unlikely to change the conclusions on whether JIT-SDP is accurate enough to be worthy of adoption in practice, especially when conducting studies using multiple datasets.
Our results also show that even waiting times as small as 15 days led to high validity of retrospective performance evaluation procedures. This is very encouraging, as it means that research studies can evaluate JIT-SDP models not only on large projects (with more than 5k software changes) but also on smaller projects (with 1k software changes). Accordingly, it is not necessary to remove a large portion of the most recent software changes to increase the validity of retrospective performance evaluation procedures. This result is particularly relevant given that many software companies develop projects that are much shorter than many of the existing open source projects that have been running for many years.
The remainder of this paper is organized as follows. Section 2 motivates and briefly explains two evaluation scenarios of JIT-SDP: the continuous evaluation scenario as in Song and Minku (2023) and the retrospective evaluation scenario, which is investigated in this paper. Section 3 discusses background and related work. Section 4 explains our notation system and formulates the validity of retrospective performance evaluation procedures. Section 5 describes the design of our experiments, and our RQs are answered in Section 6 by analyzing the experimental results. Threats to validity are discussed in Section 7, and Section 8 concludes the paper.
5 Experimental Setup
Being a conceptual replication of Song and Minku (2023), we adopt the same 13 GitHub open source projects as that work to investigate the three research questions of this paper. These projects were chosen for having more than 4 years of duration (most with more than 8 years), a rich history (>10k commits) and a wide range of defect-inducing change ratios (from 2% to 45%). The datasets were collected using Commit Guru (Rosen et al. 2015), which implements the original and most popular B-SZZ algorithm (Śliwerski et al. 2005) when an issue tracking system is available and its approximation otherwise. A statistical summary of the projects is shown in Table 2. Each research question is investigated by performing the corresponding statistical analysis across these 13 datasets.
Table 2
Summary of the datasets investigated in this work
Project | #Commits | %Defect-inducing | #Changes with confident labels | Period | Language |
Brackets | 11,601 | 34.02 | 5,997 | 12/2011 - 12/2017 | JavaScript |
Broadleaf | 12,336 | 20.28 | 5,190 | 11/2008 - 12/2017 | Java |
Camel | 30,229 | 20.67 | 9,850 | 03/2007 - 12/2017 | Java |
Fabric | 12,495 | 20.65 | 9,310 | 12/2011 - 12/2017 | Java |
jGroup | 18,003 | 17.48 | 13,028 | 09/2003 - 12/2017 | Java |
Nova | 26,313 | 44.34 | 14,900 | 08/2010 - 01/2018 | Python |
Django | 26,360 | 42.64 | 14,236 | 07/2005 - 09/2019 | Python |
Rails | 57,949 | 25.64 | 28,421 | 11/2004 - 09/2019 | JavaScript |
Corefx | 26,627 | 6.91 | 7,611 | 11/2014 - 10/2019 | Python |
Rust | 73,876 | 2.02 | 35,766 | 06/2010 - 10/2019 | Python |
Tensorflow | 65,034 | 24.85 | 21,466 | 11/2015 - 11/2019 | Python |
VScode | 51,846 | 2.28 | 19,413 | 11/2015 - 10/2019 | JavaScript |
wp-Calypso | 31,206 | 22.75 | 8,708 | 11/2015 - 10/2019 | JavaScript |
The large duration enables us to calculate a measure of predictive performance that reflects the true performance with high confidence, so that we can compute Eqs. (2) and (4). Following Song and Minku (2023), the 99%-quantile of the time it takes to find the labels of defect-inducing software changes was calculated, and software changes committed more recently than this 99%-quantile were eliminated. For instance, if this quantile is two years, all software changes committed within the past two years were eliminated. As a result, the remaining software changes were committed at least two years before the data collection, giving us at least 99% confidence that they are really clean if no defect has been found to be induced by them so far. Defect-inducing labels are always noise-free in this study, as they cannot involve label noise due to inadequate waiting time. As discussed in Section 3.4, the main aim of this paper is to systematically investigate whether and to what extent waiting time and the label noise resulting from it affect the validity of retrospective performance evaluation procedures. Label noise that is not induced by waiting time is out of the scope of this study; it may have different effects on the validity of performance evaluation procedures and could be investigated as future work.
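A minimal pandas sketch of this filtering step is given below (column names are hypothetical; the actual preprocessing is part of the released code):

```python
import pandas as pd

def drop_recent_changes(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only changes old enough for clean labels to be ~99% reliable.

    df has one row per software change, with Unix timestamps in
    'commit_ts' (commit time) and 'defect_found_ts' (time the induced
    defect was found; NaN if no defect has been found so far).
    """
    fix_delay = df["defect_found_ts"] - df["commit_ts"]   # time to find an induced defect
    cutoff = fix_delay.dropna().quantile(0.99)            # 99%-quantile of that delay
    collection_ts = df["commit_ts"].max()                 # moment of data collection
    return df[df["commit_ts"] <= collection_ts - cutoff]
```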
All projects have at least 5,000 software changes for which we are confident of their labeling. As in Song and Minku (2023), we retain the first 5,000 time steps of each project to answer our research questions, so that all projects investigated in this paper have the same data stream length. This is because the impact of the data stream length will also be investigated in our analyses. Since most projects actually contain considerably more than 5,000 software changes, the confidence in their true labels is higher than 99%. When computing the predictive performance using the retrospective evaluation procedure, we consider the moment of data collection T to be the Unix timestamp of the 5,000th example. We then contrast the performance estimated based on the labels obtained through this procedure against the true performance by calculating the validity of the retrospective performance evaluation procedure using Eq. (4).
As in Song and Minku (2023), G-mean is adopted to implement the performance metric \(||\cdot ||_G\). However, the equations to evaluate the predictive performance based on the G-mean (Eqs. (2) and (3)) are different from those in Song and Minku (2023), as our paper analyzes the validity of retrospective rather than continuous evaluation procedures. G-mean is the geometric mean of sensitivity (a.k.a. recall) and specificity (one minus the false positive rate) (Kubat et al. 1997). Unlike performance metrics such as F-measure (Yao and Shepperd 2021), G-mean was adopted for being robust against class imbalance, which is particularly important for JIT-SDP, where class imbalance often takes place (Cabral et al. 2019; Wang et al. 2018; He and Garcia 2009). Larger G-mean values represent better predictive performance.
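For illustration (not the authors' implementation), treating defect-inducing changes as the positive class, G-mean can be computed from confusion matrix counts as follows:

```python
import math

def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity (recall on defect-inducing) and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Example: g_mean(tp=40, fn=10, tn=900, fp=50) ≈ 0.87
```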
Being a conceptual replication of Song and Minku (2023), we adopt the same machine learning algorithm, Oversampling Online Bagging (OOB) with Hoeffding trees (Wang et al. 2015), to update/train the JIT-SDP model whenever a training example is produced, without requiring retraining on past examples. This machine learning algorithm has been shown to work well for JIT-SDP due to its ability to tackle class imbalance evolution (Cabral et al. 2019; Tabassum et al. 2020). We conducted a grid search based on the first 500 (out of the total 5,000) software changes in the data stream of a software project for parameter tuning based on G-mean. As in Song and Minku (2023), the parameters consisted of the decay factor \(\in \{0.9, 0.99\}\) and the ensemble size \(\in \{5,10,20\}\). Given a software project, the parameter setting achieving the best G-mean (calculated in Eq. (2)) over the first 500 time steps across 30 runs was chosen. The predictive performance of the JIT-SDP model was then calculated based on the whole data stream using the best parameter setting. Hoeffding trees adopted the default parameter settings provided by the Python package scikit-multiflow (Yao and Shepperd 2021), following previous studies in JIT-SDP (Song and Minku 2023; Cabral et al. 2019; Tabassum et al. 2020). All analyses and statistical tests were conducted based on the mean performance across 30 runs with the chosen parameter setting. The code and data used for our experiments are released as open source at https://github.com/sunnysong14/jit-sdp-retrospective-pf-validity.
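For readers unfamiliar with OOB, the sketch below illustrates the core idea in the spirit of Wang et al. (2015), using scikit-multiflow Hoeffding trees: class proportions are tracked with a time decay factor, and incoming minority-class examples are Poisson-oversampled when updating each ensemble member. This is a simplified illustration, not the released implementation.

```python
import numpy as np
from skmultiflow.trees import HoeffdingTreeClassifier

class OOBSketch:
    """Simplified Oversampling Online Bagging (Wang et al. 2015) for binary labels {0, 1}."""

    def __init__(self, n_estimators=10, decay=0.99, seed=0):
        self.trees = [HoeffdingTreeClassifier() for _ in range(n_estimators)]
        self.decay = decay                      # time decay factor for class-size estimates
        self.size = np.array([0.5, 0.5])        # decayed class proportions
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y):
        # update the time-decayed class proportions with the new example
        self.size = self.decay * self.size + (1.0 - self.decay) * np.array([1 - y, y], dtype=float)
        # oversample only when the incoming example belongs to the current minority class
        lam = self.size.max() / self.size[y] if self.size[y] > 0 else 1.0
        for tree in self.trees:
            for _ in range(self.rng.poisson(lam)):
                tree.partial_fit(np.asarray(x).reshape(1, -1), [y], classes=[0, 1])

    def predict(self, x):
        votes = [tree.predict(np.asarray(x).reshape(1, -1))[0] for tree in self.trees]
        return int(np.mean(votes) >= 0.5)
```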
Table 3
Summary of the statistical methodology adopted for answering each RQ
RQ | Statistical test | Independent variables | Dependent variable |
RQ1 | ANOVA | 1) Waiting time W; 2) Length of data stream t | The amount of label noise \(\eta _W\) in Eq. (1) |
RQ2 | Linear regression analysis | 1) Evaluation label noise; 2) Training label noise | The validity of performance evaluation in Eq. (4) |
RQ3 | Linear regression analysis | 1) Evaluation waiting time; 2) Training waiting time | The validity of performance evaluation in Eq. (4) |
5.1 Statistical Methodology for RQ1
RQ1 investigates the impact of waiting time on the amount of label noise. The waiting time W varied among four levels (15, 30, 60 and 90 days), following previous work (Song and Minku 2023).
The investigation for RQ1 will also take into account different lengths of the data stream (1000, 2000, 3000, 4000 and 5000 evaluation time steps), where the moment of the data collection T used by the retrospective performance evaluation procedure corresponds to the Unix timestamp of the 1000th, 2000th, 3000th, 4000th and 5000th example, respectively. The data stream length is investigated as the proportion of noisy examples could be relative to the size of the data stream. In particular, the “tail” of the data stream could potentially contain more noise than the rest of the data stream because it is composed of more recent software changes (closer to the moment of data collection T), for which less time has passed to find defects. Therefore, for instance, if we have a larger stream length such as 5000 commits, the proportion of noisy examples is likely to be smaller, as the “tail” of the data stream is relatively small compared to the size of the data stream as a whole. Conversely, if we have a smaller stream length such as 1000 commits, the proportion of noisy examples is likely to be larger. Therefore, the impact of the length of the data stream is investigated as part of the analysis in RQ1.
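As an assumed reading of Eq. (1), which is not reproduced here, the response variable can be thought of as the fraction of labeled evaluation examples whose observed label differs from the true label. A sketch of how it could be computed for each combination of waiting time and stream length (hypothetical column names):

```python
import pandas as pd

def noise_fraction(df: pd.DataFrame, T: float, W_days: float) -> float:
    """Fraction of labeled evaluation examples whose observed label is noisy.

    Only clean labels can be noisy here: a truly defect-inducing change is
    observed as clean when its defect has not been found by collection moment T.
    Times are Unix timestamps; W_days is the minimum waiting time in days.
    """
    W = W_days * 24 * 3600
    eligible = df[df["commit_ts"] <= T - W]                # changes old enough to label
    observed_clean = eligible["defect_found_ts"].isna() | (eligible["defect_found_ts"] > T)
    noisy = observed_clean & (eligible["true_label"] == 1)  # truly defect-inducing
    return noisy.mean() if len(eligible) else 0.0

# One value per ANOVA cell, e.g.:
# for t in (1000, 2000, 3000, 4000, 5000):
#     T = df["commit_ts"].iloc[t - 1]
#     for W in (15, 30, 60, 90):
#         noise_fraction(df.iloc[:t], T, W)
```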
We will perform Analysis of Variance (ANOVA) (Montgomery 2017) with a significance level of 0.05 to analyze the impact of waiting time and data stream length on the amount of label noise in the evaluation data stream \(\mathbb {D}_W^*\), following the prior work (Song and Minku 2023). The null hypothesis states that there is no difference among group means and is rejected when the p-value is smaller than the significance level of 0.05. ANOVA is used instead of non-parametric statistical tests such as the Friedman test because it enables us to investigate multiple factors. Sphericity is an important assumption made by the repeated measures ANOVA design. Mauchly's test (Mauchly 1940) is adopted to assess the statistical assumption of sphericity when using ANOVA. When the test yields a p-value less than the significance level of 0.05, we consider that the assumption has been violated. The Greenhouse-Geisser correction is then used to correct for this violation.
As shown in Table 3, the within-subject factors under investigation include the waiting time W and the data stream length t. The response variable is the amount of label noise \(\eta _W\) in Eq. (1).
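One possible way to run this analysis in Python is with the pingouin package (column names are hypothetical; this is only a sketch of the design, not the exact analysis scripts):

```python
import pingouin as pg

# long-format data: one row per (project, waiting time W, stream length t) combination,
# with the amount of label noise in column 'noise'
def rq1_analysis(df):
    # Mauchly's test of sphericity for the waiting-time factor
    spher = pg.sphericity(df, dv="noise", within="W", subject="project")
    # Repeated-measures ANOVA with within-subject factors W and t;
    # a Greenhouse-Geisser correction can be applied when sphericity is violated
    aov = pg.rm_anova(data=df, dv="noise", within=["W", "t"], subject="project", detailed=True)
    return spher, aov
```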
5.2 Statistical Methodology for RQ2
RQ2 investigates the impact of label noise on the validity of retrospective performance evaluation procedures. Following the prior work (Song and Minku 2023), we will perform linear regression analyses with a significance level of 0.05 for this purpose. The linear regression approach adopted Ordinary Least Squares (OLS) to learn the model. As shown in Table 3, in addition to the label noise of the evaluation examples (evaluation label noise), we will also consider the label noise of the training examples (training label noise) as an independent variable, for a more thorough analysis. Different from the waiting time used for evaluation purposes (the main concern of this paper), the training waiting time is used to produce the data stream for training JIT-SDP models. Different training waiting times can lead to different levels of noise in the training data and consequently produce different JIT-SDP models. Both the training and evaluation waiting times used to compute the amount of noise varied among 15, 30, 60 and 90 days. As the waiting time used for evaluation purposes is the main topic of this work, whenever we use the term "waiting time" on its own, we mean the evaluation waiting time; the training waiting time will always be explicitly referred to as "training waiting time".
Including training label noise enables the analysis to consider to what extent different JIT-SDP models could impact the conclusions of this study. The dependent variable is the validity of retrospective performance evaluation procedures formulated in Eq. (4). The p-value of each independent variable tests the null hypothesis that the corresponding coefficient equals zero (no effect on the dependent variable). The linear regression statistical test is considered significant if its p-value is smaller than the significance level of 0.05. ANOVA, which was used to answer RQ1, is not viable for answering RQ2. This is because the independent variables are continuous rather than ordinal, so one cannot set up the levels of within-subject factors (Montgomery 2017).
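A sketch of such a regression with statsmodels OLS (the data below are made-up placeholder values; in the actual analysis, each observation corresponds to one combination of evaluation and training label noise):

```python
import pandas as pd
import statsmodels.formula.api as smf

# made-up placeholder observations (validity vs. evaluation/training label noise)
df = pd.DataFrame({
    "validity":    [0.99, 0.98, 0.97, 0.99, 0.96, 0.98],
    "eval_noise":  [0.01, 0.03, 0.05, 0.02, 0.06, 0.03],
    "train_noise": [0.02, 0.02, 0.04, 0.01, 0.05, 0.03],
})

model = smf.ols("validity ~ eval_noise + train_noise", data=df).fit()
print(model.summary())   # each coefficient's p-value tests H0: coefficient == 0
```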
5.3 Statistical Methodology for RQ3
RQ3 investigates the impact of waiting time on the validity of retrospective performance evaluation procedures. Following the prior work (Song and Minku 2023), we will perform linear regression analyses with a significance level of 0.05 for this purpose. The linear regression approach adopted Ordinary Least Squares to learn the model. As shown in Table 3, we will consider the evaluation waiting time and the training waiting time as the two independent variables in the linear regression analyses, enabling a more thorough analysis of the validity. Both have values varying among 15, 30, 60 and 90 days. The dependent variable is the validity of retrospective performance evaluation procedures in Eq. (4). The ANOVA adopted for answering RQ1 is not viable for RQ3, because there is a constraint between the two independent variables: the evaluation waiting time should be no larger than the training one to follow the principles of the online learning procedure, as explained in Song and Minku (2023).
7 Threats to Validity
This section discusses the threats to validity of our study, which are similar to those of Song and Minku (2023).
Construct Validity. We carefully chose G-mean as the evaluation metric whenever the performance of JIT-SDP needed to be computed in the analyses of this study. Adopting G-mean is adequate due to its insensitivity to class imbalance (Wang et al. 2018), which is particularly important for JIT-SDP, which typically suffers from class imbalance (Cabral et al. 2019). G-mean is the most widely used metric in online class imbalance learning studies (Wang et al. 2018). We used grid search based on an initial portion of the data stream to tune the parameters of the machine learning algorithms used in this study. Random search might find better parameter values than grid search (Bergstra and Bengio 2012). However, whether or not this is the case in data stream learning is still an open question, as the best values for the initial portion of the data stream are not necessarily the best for the remainder of the data stream due to concept drift, which frequently occurs in JIT-SDP (Cabral et al. 2019; Cabral and Minku 2022; McIntosh and Kamei 2018). Moreover, this paper is concerned with investigating the validity of performance evaluation procedures rather than with improving the predictive performance of JIT-SDP. The specific choice of parameter tuning method is less relevant in this context than in studies targeted at improving the predictive performance of JIT-SDP models.
Internal Validity. A potential threat to internal validity is that the true labels of some defect-inducing software changes may never be accessible if the defects induced by them have not been found by the end of the data stream due to very large verification latency. To mitigate this threat, we used open source projects covering a period of at least four years and eliminated software changes from the latter periods of the data streams.
External Validity. We have investigated 13 open source projects, with 4 levels of waiting time and 5 data stream lengths, covering a range of different characteristics from previous JIT-SDP studies. However, as with any study involving machine learning, results may not generalize to other contexts. Moreover, our study focuses on OOB with Hoeffding trees, which has been previously adopted for online JIT-SDP (Song and Minku 2023; Cabral et al. 2019; Tabassum et al. 2020). Being a conceptual replication of Song and Minku (2023), we adopt the same machine learning approach as in that paper. Other types of machine learning approaches could be investigated in future work, following the same investigation procedures and statistical methodologies for answering the same RQs. The conclusions of our study are in the context of noise resulting from waiting time when using SZZ for data collection. Label noise not caused by waiting time may have different effects on the validity and could be investigated as future work. Similarly, different conclusions may be obtained regarding the impact of waiting time if an algorithm other than SZZ is adopted for data collection.
8 Conclusion
We conducted the first analysis of the extent to which the conclusions of JIT-SDP research studies are (in)valid in view of the fact that observed labels rather than the true labels of software changes are used when conducting retrospective performance evaluation procedures. We conducted our investigation by answering three research questions, as follows.
RQ1. How large is the amount of label noise caused by different waiting times in retrospective JIT-SDP data collection? We found that smaller waiting times were associated with significantly larger amounts of label noise. The proportion of noisy defect-inducing examples labeled as clean increased by up to 45.86% as a result of smaller waiting times.
RQ2. To what extent is the validity of retrospective performance evaluation procedures impacted by label noise resulting from waiting time? We found that both the evaluation and the training label noise had a significant negative impact on the validity of retrospective performance evaluation procedures. However, the magnitude of the changes in the validity was typically small, varying up to around 2% for evaluation label noise and 3% for training label noise, but most of the time being less than 1%.
RQ3. To what extent is the validity of retrospective performance evaluation procedures impacted by different waiting times? No significant impact of the evaluation waiting time was found on the validity of retrospective performance evaluation procedures. Training waiting time had a significant impact on the validity, meaning that the validity of performance evaluation procedures may be better or worse depending on the actual JIT-SDP model being evaluated. However, the changes in validity were small (up to around 2%), and so this impact is unlikely to be relevant.
Besides the investigation of the three research questions, our results also show that the validity of retrospective performance evaluation procedures was high even when using small evaluation waiting times. This is an encouraging result, as it means that future studies can make use of not only larger (with 5k+ software changes) but also smaller (with 1k software changes) software projects for evaluating the predictive performance of JIT-SDP models. This is particularly important in terms of obtaining a valid performance evaluation, as many software companies have projects of short duration compared to some of the existing open source projects that have run for several years. With this in mind, one can trust the estimated performance even for smaller software projects in the retrospective performance evaluation scenario.
As future work, other performance metrics, machine learning algorithms, and sources of label noise can be investigated. The impact of waiting time on the predictive performance of JIT-SDP models could also be investigated.