1 Introduction
2 Background and related work
2.1 Terminology
2.2 Related work
3 Research questions and context selection
3.1 Research questions
3.2 Context of the study
http requests and responses, others to container orchestration. A full report of the domains of the considered projects is available in our online appendix (Pontillo et al. 2022). Nonetheless, the domain observations were already insightful to understand that test code flakiness is a widespread problem that affects projects regardless of their domain. In terms of testing activities, all the projects make use of a continuous integration pipeline that allows code changes to be verified against a test suite. With the use of a build tool, e.g., Maven, developers can configure the test cases that must be run when new changes are pushed onto the repository. While we cannot know whether the developers of the considered projects defined a test plan document before configuring the tests to run, it is important to notice that all projects establish contribution guidelines that contributors must follow and that include indications on how to conduct testing activities. As such, the testing activities are not left to the developer’s willingness to perform them, but are defined and updated over time. This increases our confidence in the quality assurance procedures adopted by the considered projects.
4 Empirical study variables
4.1 Dependent variable
4.2 Independent variables
Name | Description | Computed on ... |
---|---|---|
Production and test code metrics | ||
CBO | Coupling Between Objects, i.e., the number of dependencies a class has with other classes (Chidamber and Kemerer 1994). | Production Class |
Halstead Length | The total number of operator occurrences and the total number of operand occurrences. | Production Class |
Halstead Vocabulary | The total number of distinct operators and operands in a function. | Production Class |
Halstead Volume | Proportional to program size, represents the size, in bits, of space necessary for storing the program. | Production Class |
LOC | Lines of Code, counting both source and comment lines. | Production Class |
LCOM2 | Lack of Cohesion of Methods version 2, i.e., the percentage of methods that do not access a specific attribute averaged over all attributes in the class. | Production Class |
LCOM5 | Lack of Cohesion of Methods version 5, i.e., the density of accesses to attributes by methods. | Production Class |
McCabe | The number of linearly independent paths through a program’s source code (McCabe 1976). | Test Class |
MPC | Message Passing Coupling, i.e., the number of messages passed among objects of the class. | Production Class |
RFC | Response For a Class, i.e., the number of methods (including inherited ones) that can potentially be called by other classes (Chidamber and Kemerer 1994). | Production Class |
TLOC | Number of lines of code of the Test Suite. | Test Class |
WMC | Weighted Methods per Class, i.e., the sum of the complexities of all the methods in a class (Chidamber and Kemerer 1994). Chidamber and Kemerer (1994) did not prescribe a specific complexity metric for the computation of WMC; we opted for McCabe’s Cyclomatic Complexity to account for the individual complexity of methods. | Production Class |
Code Smells | ||
Class Data Should Be Private | When a class exposes its attributes, violating the information hiding principle. | Production Class |
Complex Class | When a class has a high cyclomatic complexity. | Production Class |
Functional Decomposition | When inheritance and polymorphism are poorly used in a class. | Production Class |
God Class | When a class is very large and implements different responsibilities. | Production Class |
Spaghetti Code | When a class has no structure and declares long methods without parameters. | Production Class |
Test Smells | ||
Assertion Density | Percentage of assertion statements in the test code. | Test Class |
Assertion Roulette | When a test method has multiple non-documented assertions. | Test Class |
Conditional Test Logic | When a test method contains conditional code, which hinders comprehension by developers. | Test Class |
Eager Test | When a test method invokes several methods of the production object. | Test Class |
Fire and Forget | A test that is at risk of exiting prematurely because it does not properly wait for the results of external calls. | Test Class |
Mystery Guest | When a test method utilizes external resources (e.g., files, database, etc.). | Test Class |
Resource Optimism | When a test method makes an optimistic assumption that the external resource (e.g., File), utilized by the test method, exists. | Test Class |
Sensitive Equal. | When the toString method is used within a test method. | Test Class |
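The three Halstead metrics in the table can be illustrated with a minimal sketch. It assumes the operators and operands have already been extracted from the source code into two token lists; the function name `halstead_metrics` and the sample tokens are hypothetical, and the volume follows the usual definition, length multiplied by the base-2 logarithm of the vocabulary.

```python
from math import log2


def halstead_metrics(operators, operands):
    """Return (length, vocabulary, volume) from pre-extracted token lists."""
    # Length: total occurrences of operators plus total occurrences of operands.
    length = len(operators) + len(operands)
    # Vocabulary: distinct operators plus distinct operands.
    vocabulary = len(set(operators)) + len(set(operands))
    # Volume: size in bits needed to store the program, N * log2(n).
    volume = length * log2(vocabulary) if vocabulary > 0 else 0.0
    return length, vocabulary, volume


# Hypothetical tokens for the statements "x = a + b" and "y = x * 2".
length, vocabulary, volume = halstead_metrics(
    operators=["=", "+", "=", "*"],
    operands=["x", "a", "b", "y", "x", "2"],
)
print(length, vocabulary, volume)
```

How tokens are classified as operators or operands depends on the tool used to compute the metrics, so absolute values may differ across tools.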
5 RQ1 - The individual effects of metrics on test flakiness
5.1 Research methodology
5.2 Analysis of the results
assert statements (Eck et al. 2019). Finally, we observe that the severity of the Eager Test smell differs between the two sets in terms of distribution but not in terms of median. This smell measures how focused a test is, namely whether it exercises multiple methods of the production code. Based on our results, we may conjecture that the lack of focus of tests does not allow them to properly set the environment needed to exercise the production code: as a consequence, their outcome may depend on the order of execution of test methods, i.e., the outcome may change if the environment is (not) set before calling the smelly test.
Statistical tests | |||||
---|---|---|---|---|---|
Metric | p-value | δ | Metric | p-value | δ |
CBO | 1.34e−13 | S | Complex Class | 9.85e−11 | N |
Halstead Length | 1.17e−06 | S | FD | 0.03 | N |
Halstead Vocab. | 4.70e−09 | S | God Class | 0.38 | N |
Halstead Volume | 3.78e−07 | S | Spaghetti Code | 8.47e−11 | N |
LOC | 7.84e−11 | S | Assertion Density | 1.69e−8 | S |
LCOM2 | < 2.2e−16 | S | Assertion Roulette | 3.81e−10 | S |
LCOM5 | 1.63e−14 | S | Cond. Test Logic | 0.10 | N |
McCabe | 0.20 | N | Eager Test | 2.03e−13 | S |
MPC | 1.04e−7 | S | Fire and forget | 0.74 | N |
RFC | 6.56e−11 | S | Mystery Guest | 0.40 | N |
TLOC | 1.16e−8 | S | Resource optimism | 0.12 | N |
WMC | 1.80e−12 | S | Sensitive equality | 0.17 | N |
CDSBP | 1.30e−9 | N |
Statistical tests | |||||
---|---|---|---|---|---|
Metric | p-value | δ | Metric | p-value | δ |
CBO | < 2.2e−16 | S | Complex Class | < 2.2e−16 | N |
Halstead Length | < 2.2e−16 | S | FD | 0.049 | N |
Halstead Vocab. | < 2.2e−16 | S | God Class | 7.7e−4 | N |
Halstead Volume | < 2.2e−16 | S | Spaghetti Code | < 2.2e−16 | N |
LOC | < 2.2e−16 | S | Assertion Density | 5.09e−4 | N |
LCOM2 | < 2.2e−16 | S | Assertion Roulette | 4.28e−3 | N |
LCOM5 | < 2.2e−16 | N | Cond. Test Logic | 3.91e−7 | N |
McCabe | < 2.2e−16 | S | Eager Test | 0.93 | N |
MPC | < 2.2e−16 | S | Fire And Forget | 8.73e−14 | N |
RFC | < 2.2e−16 | S | Mystery Guest | < 2.2e−16 | S |
TLOC | < 2.2e−16 | S | Resource Optimism | 0.10 | N |
WMC | < 2.2e−16 | S | Sensitive Equality | 1.5e−2 | N |
CDSBP | 0.3887 | N |
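As an illustration of how an effect size such as the δ column can be obtained, the sketch below computes Cliff's delta for two samples. It rests on two assumptions not stated in the tables themselves: that δ denotes Cliff's delta, and that N and S stand for negligible and small magnitudes under the commonly used thresholds 0.147, 0.33, and 0.474. The function names and sample values are hypothetical.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (|xs| * |ys|)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    smaller = sum(1 for x in xs for y in ys if x < y)
    return (greater - smaller) / (len(xs) * len(ys))


def magnitude(delta):
    """Map |delta| to a label using commonly used thresholds (an assumption)."""
    size = abs(delta)
    if size < 0.147:
        return "N"  # negligible
    if size < 0.33:
        return "S"  # small
    if size < 0.474:
        return "M"  # medium
    return "L"      # large


# Hypothetical metric values for flaky and non-flaky tests.
flaky = [12, 15, 20, 22]
non_flaky = [12, 15, 20, 22]
d = cliffs_delta(flaky, non_flaky)
print(d, magnitude(d))
```

The quadratic pairwise comparison keeps the sketch short; on large samples a rank-based computation would be preferable.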
6 RQ2 - The combined effects of metrics on test flakiness
6.1 Research methodology
glm function available in the R toolkit. Moreover, to avoid multi-collinearity, we used the vif (Variance Inflation Factors) function implemented in R to discard highly correlated variables, setting the threshold to 5 (O’brien 2007). The interested reader can find additional information on the correlation between the independent variables in our online appendix (Pontillo et al. 2022). In particular, we conducted correlation analyses using the non-parametric Spearman’s rank correlation coefficient (Myers and Sirois 2004) with the aim of providing further insights into the relations between the considered variables. This correlation analysis reinforced the results obtained with the vif function, making us more confident about the decisions made when discarding variables.
6.2 Analysis of the results
vif analysis. Similarly, Table 6 reports the results of the Logistic Regression Model on the FlakeFlagger dataset, in which only 16 of the independent variables are shown; the other nine factors, i.e., Complex Class, Halstead Length, Halstead Volume, LCOM2, LOC, MPC, RFC, WMC, and Spaghetti Code, were excluded as a consequence of the multi-collinearity checks.
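The multi-collinearity check can be sketched in a few lines of Python. This is not the R procedure used in the study: it is a minimal illustration that computes each variance inflation factor as the corresponding diagonal entry of the inverse correlation matrix, and then drops the variable with the highest factor until all remaining ones are at or below the threshold of 5. The stepwise dropping strategy, the function names, and the sample data are assumptions made for the example.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Invert a square matrix with Gauss-Jordan elimination and partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


def drop_collinear(columns, threshold=5.0):
    """Repeatedly drop the variable with the largest VIF above the threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return sorted(kept)


# Hypothetical data: halstead_length is almost a linear function of loc.
data = {
    "loc": [10, 20, 30, 40, 50, 60],
    "halstead_length": [21, 39, 61, 79, 101, 119],
    "cbo": [3, 4, 1, 1, 5, 2],
}
print(drop_collinear(data))
```

In the example, one of the two nearly collinear size metrics is removed while the unrelated coupling metric is kept, which mirrors why several size-related variables are absent from the regression tables.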
Generalized linear model | |||||||
---|---|---|---|---|---|---|---|
Variable | Estimate | S.E. | Sig. | Variable | Estimate | S.E. | Sig. |
Intercept | -4.06 | 2.09 | . | Cond. Test Logic | -44.82 | 13.15 | *** |
TLOC | 6.59 | 2.34 | ** | Fire and Forget | 0.88 | 1.98 | |
McCabe | 1.06 | 0.67 | | LCOM5 | -1.71 | 1.15 | |
Assertion Density | 1.41 | 0.57 | * | CBO | 0.34 | 0.77 | |
Assertion Roulette | -23.64 | 9.03 | ** | Halstead Voc. | 3.69 | 0.97 | *** |
Mystery Guest | -1.04 | 2.69 | | CDSBP | 1.99 | 1.71 | |
Eager Test | 4.91 | 0.97 | *** | Complex Class | 1.11 | 0.63 | . |
Sensitive Equality | -7.42 | 7.53 | | FD | -0.57 | 0.41 | |
Resource Optimism | -4.18 | 4.51 | | God Class | -1196.50 | 1867.19 | |
Generalized linear model | |||||||
---|---|---|---|---|---|---|---|
Variable | Estimate | S.E. | Sig. | Variable | Estimate | S.E. | Sig. |
Intercept | -11.63 | 168.77 | | Cond. Test Logic | -2.22 | 1.14 | . |
TLOC | 4.95 | 0.78 | *** | Fire and Forget | 3.10 | 0.97 | ** |
McCabe | 2.58 | 0.40 | *** | LCOM5 | -19.08 | 2.78 | *** |
Assertion Density | 0.53 | 0.44 | | CBO | 0.61 | 0.26 | * |
Assertion Roulette | 0.29 | 0.85 | | Halstead Voc. | 5.58 | 0.57 | *** |
Mystery Guest | 6.55 | 0.55 | *** | CDSBP | -1.74 | 0.84 | * |
Eager Test | -7.16 | 1.12 | *** | FD | -0.16 | 0.20 | |
Sensitive Equality | -1.13 | 1.13 | | God Class | 176.33 | 3657.57 | |
Resource Optimism | -6.63 | 1.42 | *** | | | | |
‘***’ indicates p < 0.001, ‘**’ indicates p < 0.01, ‘*’ indicates p < 0.05, and ‘.’ indicates p < 0.1.
6.2.1 Results for production and test code metrics
6.2.2 Results for code smells
6.2.3 Results for test smells
7 RQ3 - An approach to predict test flakiness statically
7.1 Research methodology
vif analysis to discard highly correlated variables (O’brien 2007); and (2) quantifying the predictive power of each metric in terms of information gain (Quinlan 1986). While the former analysis allowed us to limit the scope of our investigation to the actually relevant features, the latter is a measure of how much a model would benefit from the presence of a certain predictor. More formally, let P be the flaky test predictor and let F = \(\left \{f_{1}, f_{2}, ..., f_{n} \right \}\) be the set of features composing P; an information gain algorithm (Quinlan 1986) computes the difference from before to after splitting P on an attribute \(f_{i}\) in terms of entropy. It specifically applies the following formula:
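Under the standard definition of information gain, and writing H for Shannon entropy (a notation assumed here), the difference described above is \(InfoGain(P, f_{i}) = H(P) - H(P \mid f_{i})\). A minimal sketch of this computation for a discrete feature follows. The function names and the sample data are hypothetical, and continuous metrics would first need to be discretized, a step the sketch does not cover.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions induced by the feature.
    conditional = sum(len(group) / total * entropy(group)
                      for group in groups.values())
    return entropy(labels) - conditional


# Hypothetical example: a discretized metric that perfectly separates the classes.
metric = ["high", "high", "low", "low"]
is_flaky = [True, True, False, False]
print(information_gain(metric, is_flaky))
```

A feature whose gain is zero leaves the uncertainty about flakiness unchanged, which is the rationale for discarding it.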
nemenyi function available in the R toolkit.
7.2 Analysis of the results
iDFlakies dataset | FlakeFlagger dataset | ||
---|---|---|---|
Features | IG | Features | IG |
Halstead Vocabulary | 0.0338 | Halstead Vocabulary | 0.1727 |
CBO | 0.0166 | Assertion Density | 0.0539 |
LCOM5 | 0.0089 | CBO | 0.0359 |
Complex Class | 0.0059 | TLOC | 0.0284 |
Eager Test | 0.0059 | Mystery Guest | 0.0157 |
TLOC | 0.0049 | McCabe | 0.0133 |
Class Data Should Be Private | 0.0021 | LCOM5 | 0.0128 |
Assertion Roulette | 0.0019 | Assertion Roulette | 0.0107 |
Assertion Density | 0.0010 | Conditional Test Logic | 0.0076 |
McCabe | 0.0010 | Eager Test | 0.0066 |
| | | Fire and Forget | 0.0013 |
| | | Functional Decomposition | 0.0011 |
Project | Tests | Flaky tests | TP | TN | FP | FN | Pr | R | F |
---|---|---|---|---|---|---|---|---|---|
iDFlakies | Random forest | ||||||||
activiti | 221 | 20 | 18 | 195 | 6 | 2 | 83% | 90% | 82% |
admiral | 2,082 | 5 | 3 | 2,066 | 11 | 3 | 21% | 60% | 31% |
aletheia | 46 | 3 | 3 | 40 | 3 | 0 | 50% | 100% | 66% |
elastic-job-lite | 564 | 3 | 2 | 554 | 7 | 1 | 22% | 66% | 33% |
fastjson | 544 | 12 | 8 | 530 | 2 | 4 | 75% | 70% | 70% |
hadoop | 12,838 | 58 | 36 | 12,766 | 14 | 22 | 77% | 62% | 66% |
http-request | 309 | 28 | 25 | 280 | 1 | 3 | 96% | 90% | 91% |
incubator-dubbo | 1,768 | 20 | 8 | 1,736 | 12 | 12 | 41% | 40% | 37% |
java-websocket | 135 | 27 | 26 | 92 | 16 | 1 | 63% | 96% | 75% |
pippo | 240 | 5 | 5 | 230 | 5 | 0 | 90% | 100% | 93% |
querydsl | 1,926 | 3 | 0 | 1,920 | 3 | 3 | 0% | 0% | 0% |
struts | 2,577 | 4 | 4 | 2,571 | 2 | 0 | 87% | 100% | 91% |
wildfly | 982 | 38 | 30 | 937 | 7 | 8 | 86% | 79% | 80% |
Total | 24,233 | 226 | 156 | 23,937 | 69 | 70 | 69% | 69% | 68% |
FlakeFlagger | Random forest | ||||||||
achilles | 1,053 | 4 | 2 | 1,049 | 0 | 2 | 100% | 50% | 66% |
activiti | 169 | 16 | 5 | 141 | 12 | 11 | 25% | 25% | 23% |
alluxio | 186 | 122 | 117 | 60 | 4 | 5 | 97% | 96% | 97% |
ambari | 294 | 52 | 47 | 241 | 1 | 5 | 98% | 90% | 93% |
elastic-job-lite | 521 | 3 | 0 | 518 | 3 | 1 | 0% | 0% | 0% |
hbase | 368 | 121 | 105 | 233 | 14 | 16 | 89% | 87% | 87% |
hector | 121 | 33 | 26 | 75 | 13 | 7 | 76% | 81% | 74% |
httpcore | 524 | 15 | 8 | 503 | 6 | 7 | 50% | 60% | 53% |
http-request | 161 | 18 | 13 | 132 | 11 | 5 | 55% | 75% | 61% |
incubator-dubbo | 1,681 | 18 | 11 | 1,658 | 5 | 7 | 76% | 65% | 68% |
java-websocket | 107 | 21 | 20 | 86 | 0 | 1 | 100% | 96% | 98% |
logback | 655 | 15 | 3 | 637 | 3 | 12 | 50% | 20% | 28% |
ninja | 352 | 16 | 16 | 330 | 6 | 0 | 81% | 100% | 88% |
okhttp | 782 | 108 | 70 | 565 | 109 | 38 | 39% | 65% | 48% |
orbit | 26 | 4 | 2 | 20 | 2 | 2 | 50% | 50% | 50% |
spring-boot | 1,634 | 82 | 61 | 1,542 | 10 | 21 | 87% | 74% | 79% |
undertow | 48 | 6 | 2 | 39 | 3 | 4 | 40% | 33% | 26% |
wro4j | 1,103 | 16 | 3 | 1,084 | 3 | 13 | 14% | 15% | 12% |
Total | 9,785 | 670 | 446 | 8,957 | 158 | 224 | 74% | 66% | 70% |
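The evaluation metrics in the table derive from the confusion-matrix counts. The sketch below computes precision, recall, and F-Measure from TP, FP, and FN, using the iDFlakies Total row as input; the function name is hypothetical. Values computed from pooled counts need not coincide with every reported percentage (for instance, if the reported figures were averaged over validation runs, which is an assumption), so small differences with the table are to be expected.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-Measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-Measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the iDFlakies "Total" row: TP = 156, FP = 69, FN = 70.
precision, recall, f1 = precision_recall_f1(tp=156, fp=69, fn=70)
print(f"{precision:.1%} {recall:.1%} {f1:.1%}")
```

The guards against empty denominators cover projects such as querydsl, where no flaky test is correctly identified and all three metrics are zero.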
8 RQ4 - Comparing the performance of the static approach with existing baselines
8.1 Research methodology
vif function and computed the information gain (Quinlan 1986) to discard metrics not providing any gain. Afterwards, we trained a Random Forest algorithm. This choice was the result of a benchmark study in which we experimented with multiple learning algorithms and under-/over-sampling strategies against the baseline data, finding that Random Forest combined with SMOTE was the best option to train the baselines. We then executed the models, collecting their performance and comparing them with our approach in terms of the same evaluation metrics employed in RQ3, i.e., precision, recall, and F-Measure. Finally, the Nemenyi test was applied to assess the statistical significance of the results achieved.
8.2 Analysis of the results
Static approach | FlakeFlagger | |||
---|---|---|---|---|
Features | IG | Features | Type | IG |
Halstead Vocabulary | 0.1727 | Execution Time | FlakeFlagger | 0.1414 |
Assertion Density | 0.0539 | Project Source Lines Covered | FlakeFlagger | 0.0869 |
CBO | 0.0359 | Project Source Classes Covered | FlakeFlagger | 0.0790 |
TLOC | 0.0284 | Covered Lines | FlakeFlagger | 0.0400 |
Mystery Guest | 0.0157 | Covered Changes (past 500 commits) | FlakeFlagger | 0.0328 |
McCabe | 0.0133 | Test Length | FlakeFlagger | 0.0299 |
LCOM5 | 0.0128 | Covered Changes (past 10000 commits) | FlakeFlagger | 0.0258 |
Assertion Roulette | 0.0107 | Covered Changes (past 75 commits) | FlakeFlagger | 0.0253 |
Conditional Test Logic | 0.0076 | Covered Changes (past 100 commits) | FlakeFlagger | 0.0249 |
Eager Test | 0.0066 | Covered Changes (past 50 commits) | FlakeFlagger | 0.0231 |
Fire and Forget | 0.0013 | mtfs | Token | 0.0227 |
Functional Decomposition | 0.0011 | tfs | Token | 0.0217 |
| | | External Library | FlakeFlagger | 0.0188 |
| | | tachyon | Token | 0.1716 |
| | | for | Token | 0.0162 |
| | | Covered Changes (past 10 commits) | FlakeFlagger | 0.0148 |
| | | fileid | Token | 0.0132 |
| | | create | Token | 0.0128 |
| | | int | Token | 0.0128 |
| | | ioexception | Token | 0.0126 |
| | | master | Token | 0.0124 |
| | | writetype | Token | 0.0120 |
| | | testutils | Token | 0.0117 |
| | | assertthat | Token | 0.0112 |
| | | tachyonfile | Token | 0.0110 |
| | | throws | Token | 0.016 |
| | | createbytefile | Token | 0.0101 |
| | | Fire and Forget | FlakeFlagger | 0.0101 |
| | | client | Token | 0.0099 |
| | | Number of Assertions | FlakeFlagger | 0.0097 |
| | | invalidpathexception | Token | 0.0095 |
| | | testfile | Token | 0.0094 |
| | | that | Token | 0.0088 |
| | | Covered Changes (past 5 commits) | FlakeFlagger | 0.0087 |
| | | filealreadyexistexception | Token | 0.0085 |
| | | file | Token | 0.0083 |
| | | should | Token | 0.0081 |
| | | cluster | Token | 0.0081 |
| | | createfile | Token | 0.0079 |
| | | Mystery Guest | FlakeFlagger | 0.0078 |
| | | Resource Optimism | Token | 0.0077 |
| | | new | Token | 0.0071 |
| | | return | Token | 0.0071 |
| | | asserttrue | Token | 0.0069 |
| | | increasing | Token | 0.0068 |
| | | null | Token | 0.0067 |
| | | then | Token | 0.0065 |
| | | throws | Token | 0.0064 |
| | | thenreturn | Token | 0.0064 |
| | | already | Token | 0.0063 |
| | | true | Token | 0.0063 |
| | | mkdir | Token | 0.0061 |
| | | cli | Token | 0.0060 |
| | | conf | Token | 0.0060 |
| | | if | Token | 0.0060 |
| | | Covered Changes (past 25 commits) | FlakeFlagger | 0.0058 |
for or cli (the command line interface) suggest that a test performing complex tasks is indicative of flakiness. In addition, the most informative terms are connected to the management of files. As the reader might notice, the vast majority of the textual features in Table 9 pertain to exceptions (e.g., throws, ioexception, invalidpathexception, etc.) or to the creation of files (e.g., mkdir, createfile, createbytefile, etc.). Elaborating on the relevance of file-related terms, it may be reasonable to believe that an approach based on vocabulary is particularly suitable to identify flaky tests whose root cause depends on the sub-optimal management of files; this aspect might be interesting to consider in further experiments on root cause classification.
Project | TP | TN | FP | FN | Pr | R | F | TP | TN | FP | FN | Pr | R | F |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FlakeFlagger | Vocabulary approach | |||||||||||||
achilles | 2 | 1,049 | 0 | 2 | 100% | 50% | 66% | 2 | 1,049 | 0 | 2 | 100% | 50% | 66% |
activiti | 5 | 143 | 10 | 10 | 31% | 30% | 29% | 11 | 146 | 7 | 5 | 54% | 70% | 59% |
alluxio | 122 | 63 | 1 | 0 | 99% | 100% | 99% | 121 | 64 | 0 | 1 | 100% | 99% | 99% |
ambari | 44 | 237 | 5 | 8 | 92% | 84% | 87% | 43 | 241 | 1 | 9 | 97% | 83% | 89% |
elastic-job-lite | 1 | 515 | 3 | 2 | 25% | 33% | 27% | 0 | 518 | 0 | 3 | 0% | 0% | 0% |
hbase | 110 | 236 | 11 | 11 | 91% | 90% | 90% | 95 | 223 | 24 | 26 | 79% | 78% | 78% |
hector | 27 | 79 | 9 | 36 | 73% | 81% | 76% | 26 | 83 | 5 | 7 | 87% | 80% | 81% |
httpcore | 12 | 496 | 13 | 3 | 48% | 80% | 58% | 10 | 502 | 7 | 5 | 59% | 75% | 64% |
http-request | 11 | 127 | 16 | 7 | 39% | 65% | 45% | 6 | 140 | 3 | 12 | 45% | 30% | 35% |
incubator-dubbo | 9 | 1,662 | 1 | 9 | 76% | 50% | 58% | 10 | 1,661 | 2 | 8 | 71% | 55% | 59% |
java-websocket | 19 | 85 | 1 | 2 | 96% | 91% | 92% | 20 | 86 | 0 | 1 | 100% | 96% | 98% |
logback | 1 | 636 | 4 | 14 | 10% | 10% | 10% | 0 | 636 | 4 | 15 | 0% | 0% | 0% |
ninja | 16 | 336 | 0 | 0 | 100% | 100% | 100% | 16 | 336 | 0 | 0 | 100% | 100% | 100% |
okhttp | 45 | 603 | 70 | 64 | 41% | 41% | 39% | 33 | 650 | 23 | 76 | 58% | 30% | 38% |
orbit | 3 | 19 | 3 | 1 | 25% | 30% | 26% | 2 | 21 | 1 | 2 | 15% | 20% | 16% |
spring-boot | 61 | 1,544 | 8 | 21 | 90% | 74% | 80% | 59 | 1,544 | 8 | 23 | 88% | 72% | 78% |
undertow | 2 | 40 | 2 | 4 | 20% | 50% | 20% | 1 | 40 | 2 | 5 | 33% | 14% | 19% |
wro4j | 1 | 1,086 | 1 | 15 | 50% | 50% | 66% | 4 | 1,087 | 0 | 72 | 40% | 25% | 29% |
Total | 448 | 9,002 | 112 | 222 | 80% | 66% | 72% | 428 | 9,006 | 108 | 242 | 80% | 63% | 70% |
Combined approach | Static approach | |||||||||||||
achilles | 0 | 1,049 | 0 | 0 | 0% | 0% | 0% | 2 | 1,049 | 0 | 2 | 100% | 50% | 66% |
activiti | 11 | 147 | 6 | 5 | 56% | 70% | 61% | 5 | 141 | 12 | 11 | 25% | 25% | 23% |
alluxio | 122 | 64 | 0 | 0 | 100% | 100% | 100% | 117 | 60 | 4 | 5 | 98% | 90% | 93% |
ambari | 47 | 242 | 0 | 5 | 100% | 90% | 94% | 47 | 241 | 1 | 5 | 98% | 90% | 93% |
elastic-job-lite | 0 | 518 | 0 | 3 | 0% | 0% | 0% | 0 | 518 | 3 | 1 | 0% | 0% | 0% |
hbase | 112 | 238 | 9 | 9 | 92% | 92% | 92% | 105 | 233 | 14 | 16 | 89% | 87% | 87% |
hector | 28 | 85 | 3 | 5 | 92% | 86% | 88% | 26 | 75 | 13 | 7 | 76% | 81% | 74% |
httpcore | 9 | 501 | 8 | 6 | 44% | 65% | 50% | 8 | 503 | 6 | 7 | 50% | 60% | 53% |
http-request | 10 | 140 | 3 | 8 | 70% | 55% | 58% | 13 | 132 | 11 | 5 | 55% | 75% | 61% |
incubator-dubbo | 12 | 1,661 | 2 | 6 | 91% | 70% | 76% | 11 | 1,658 | 5 | 7 | 76% | 65% | 68% |
java-websocket | 20 | 86 | 0 | 1 | 100% | 96% | 98% | 20 | 86 | 0 | 1 | 100% | 96% | 98% |
logback | 2 | 638 | 2 | 13 | 50% | 13% | 20% | 3 | 637 | 3 | 12 | 50% | 20% | 28% |
ninja | 16 | 336 | 0 | 0 | 100% | 100% | 100% | 16 | 330 | 6 | 0 | 81% | 100% | 88% |
okhttp | 35 | 660 | 13 | 74 | 74% | 31% | 43% | 70 | 565 | 109 | 38 | 39% | 65% | 48% |
orbit | 2 | 21 | 1 | 2 | 66% | 50% | 56% | 2 | 20 | 2 | 2 | 50% | 50% | 50% |
spring-boot | 62 | 1,544 | 8 | 20 | 89% | 75% | 81% | 61 | 1,542 | 10 | 21 | 87% | 74% | 79% |
undertow | 1 | 40 | 2 | 5 | 33% | 16% | 22% | 2 | 39 | 3 | 4 | 40% | 33% | 26% |
wro4j | 3 | 1,087 | 0 | 13 | 100% | 18% | 30% | 3 | 1,084 | 3 | 13 | 14% | 15% | 12% |
Total | 463 | 9,057 | 57 | 207 | 89% | 68% | 77% | 446 | 8,957 | 158 | 224 | 74% | 66% | 70% |
file and resilientfileoutputstream, file_header, file_footer, and others. While the terms actually refer to various specific properties or actions performed on files, a fully textual approach might not properly assess the likelihood of test flakiness because of the many different terms associated with the same potential issue arising with the management of files. In this sense, further improvements of the Vocabulary approach that take text normalization into account might be worth exploring.
Static vs. FlakeFlagger | ||
Static corr ∩ FlakeFlagger corr | Static corr ∖ FlakeFlagger corr | FlakeFlagger corr ∖ Static corr |
72% | 14% | 14% |
Static vs. Vocabulary | ||
Static corr ∩ Vocabulary corr | Static corr ∖ Vocabulary corr | Vocabulary corr ∖ Static corr |
72% | 16% | 12% |
Static vs. Combined | ||
Static corr ∩ Combined corr | Static corr ∖ Combined corr | Combined corr ∖ Static corr |
72% | 14% | 14% |
FlakeFlagger vs. Vocabulary | ||
FlakeFlagger corr ∩ Vocabulary corr | FlakeFlagger corr ∖ Vocabulary corr | Vocabulary corr ∖ FlakeFlagger corr |
70.7% | 16.4% | 12.9% |
FlakeFlagger vs. Combined | ||
FlakeFlagger corr ∩ Combined corr | FlakeFlagger corr ∖ Combined corr | Combined corr ∖ FlakeFlagger corr |
78.8% | 8.6% | 12.7% |
Vocabulary vs. Combined | ||
Vocabulary corr ∩ Combined corr | Vocabulary corr ∖ Combined corr | Combined corr ∖ Vocabulary corr |
82.6% | 5.1% | 12.3% |
Static corr ∖ (FlakeFlagger corr ∪ Vocabulary corr ∪ Combined corr) | FlakeFlagger corr ∖ (Static corr ∪ Vocabulary corr ∪ Combined corr) | |
15.5% | 15.7% | |
Vocabulary corr ∖ (Static corr ∪ FlakeFlagger corr ∪ Combined corr) | Combined corr∖ (Static corr ∪ FlakeFlagger corr ∪ Vocabulary corr) | |
13.2% | 17.4% | |
(Static corr ∩ FlakeFlagger corr ∩ Vocabulary corr ∩ Combined corr) ∖ (Static corr ∪ FlakeFlagger corr ∪ Vocabulary corr ∪ Combined corr) | ||
38.2% |
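The pairwise overlap figures can be reproduced with plain set operations. The sketch below assumes that each approach is represented by the set of test identifiers it classified correctly and that the percentages are computed over the union of the two sets, which is consistent with the pairwise rows summing to 100%. The function name and the test identifiers are hypothetical.

```python
def overlap(correct_a, correct_b):
    """Return the percentages (A ∩ B, A ∖ B, B ∖ A) over the union of both sets."""
    union = correct_a | correct_b
    if not union:
        return 0.0, 0.0, 0.0
    size = len(union)
    both = 100 * len(correct_a & correct_b) / size
    only_a = 100 * len(correct_a - correct_b) / size
    only_b = 100 * len(correct_b - correct_a) / size
    return both, only_a, only_b


# Hypothetical sets of correctly classified tests for two approaches.
static_correct = {"t1", "t2", "t3"}
vocabulary_correct = {"t2", "t3", "t4"}
print(overlap(static_correct, vocabulary_correct))
```

A large share of tests identified by only one approach indicates complementarity between the two, which is what motivates combining them.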
9 Threats to validity
9.1 Construct validity
9.2 Conclusion validity
vif function, aimed at discarding non-relevant independent variables. These procedures followed established guidelines (O’brien 2007), making us confident of the validity of the conclusions drawn.
7e3801e19fb43183c59607663ebd53c27a95cf77 of the WRO4J project, where the test case named testbourboncssprocessor.shouldbethreadsafe was detected as flaky. By analyzing this case further, we found that the commit modified neither the test nor the associated production class (i.e., the class named bourboncssprocessor). In addition, the modified classes did not have any structural relation with either the production or the test class. Yet, the flakiness of the test emerged. In other terms, the flakiness affecting the test manifested itself independently from the actions performed by developers within the commit. This implies that the test might have been flaky even in previous commits of the project, despite not being detected. The example has two main implications. First, novel strategies to identify flakiness-inducing commits should be devised: they should not rely only on the information coming from an individual commit of the change history (as the flakiness might have emerged previously), but should also look at the specific change history of tests (e.g., starting from the emergence of a flaky test, they may traverse the commits in reverse order until the last modification of the test). Second, the information available in current datasets might lead to biased observations when flaky test prediction models are evaluated in a time-sensitive fashion, as the data were not collected by explicitly considering the many perils of mining flaky test data. For these reasons, we believe that such an analysis would require a brand new set of research questions, methodology, and analyses, and is, therefore, out of the scope of our current submission.
9.3 External validity
10 Conclusion, discussion, and future work
- Code complexity metrics are the ones that differ the most between flaky and non-flaky tests. This result was confirmed not only on both the considered datasets, but also when looking at the most relevant features employed by the fully static approach. This has two main implications. On the one hand, practitioners might use our findings to justify the adoption of instruments to keep code complexity under control. On the other hand, more research on code complexity and how it affects test code quality might be worthwhile to further develop instruments that support developers.
- When analyzing the value of the features used by our approach and by the baselines, we observed that some of them have a different weight. Particularly, while test smells were not deemed relevant for FlakeFlagger, they contributed to our approach in a manner comparable to other features. This opens up new research opportunities into the relation between test smells and flakiness. Some research on the matter has been recently proposed (Camara et al. 2021a), yet we argue that more empirical investigations might be conducted to further understand how test code quality impacts the likelihood of test flakiness.
- A fully static approach to test flakiness prediction reaches results comparable to the baselines: the F-Measures ranged from 17% to 99% on the two considered datasets. Perhaps more importantly, our approach has higher precision, hence representing a more practical solution for developers. While additional investigations into the matter are already part of our future research agenda, our results already have implications for researchers and practitioners. The former are called to devise and study novel, more powerful metrics that could contribute to the improvement of the flakiness prediction capabilities. The latter may rely on an approach that does not need dynamic computations to verify the quality and reliability of the test cases developed within their own organization. From a practical standpoint, the static nature of the experimented model would let it be run among the other continuous checks that developers normally perform to verify the presence of regressions in newly committed code (Vassallo et al. 2020).
- Our study revealed some peculiarities of the flakiness data that might lead machine learning approaches to work differently. In particular, we identified the diversity of test cases as a relevant factor to even allow a machine learner to work. In addition, we also found some interesting complementarity between our approach and the baselines, which suggests that improvements are still possible. On the basis of these conclusions, we argue that the results of this paper might lead to further research on novel software engineering practices for flaky test prediction, namely instruments and methodologies that are aware of the flakiness data properties and may act accordingly, for instance by dynamically selecting the approach to use or the pre-processing steps to apply.