1 Introduction
Consider the test case testAdd, which belongs to the SparseGradientTest test suite of the Apache Commons Math 3.3 system, one of the projects considered in our study. The test aims at verifying that the add method of the corresponding production class SparseGradient correctly sums a set of numbers; to this aim, it instantiates the variables to be added (lines #3, #4, and #5 of Listing 1) and sums them using the add method (line #6). Finally, it calls the method checkF0F1, implemented in the same test suite, which verifies the sum and checks the first-order derivatives passed as additional parameters (line #9). The test case entirely covers the corresponding production method (line coverage = 100%) and, similarly, the entire test suite has a line coverage of 98%. In the subsequent release of Apache Commons Math, the class SparseGradient did not exhibit any defect: a possible reason lies in the ability of the corresponding test suite to provide developers with an effective instrument to verify the presence of defects. As a matter of fact, previous work has reported a correlation between the code coverage of tests and post-release defects in production code, i.e., the higher the coverage, the lower the number of defects in subsequent releases of the system (Cai and Lyu 2007; Chen et al. 2001).
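Listing 1 is not reproduced here; the following is a minimal sketch of what a test such as testAdd may look like, written against the real SparseGradient API. The specific values and the checkF0F1 signature are our assumptions, not the original code of the test suite.

```java
import static org.junit.Assert.assertEquals;

import org.apache.commons.math3.analysis.differentiation.SparseGradient;
import org.junit.Test;

public class SparseGradientTest {

    @Test
    public void testAdd() {
        // Instantiate the variables to be added (cf. lines #3-#5 of Listing 1).
        SparseGradient x = SparseGradient.createVariable(0, 1.0);
        SparseGradient y = SparseGradient.createVariable(1, 2.0);
        SparseGradient z = SparseGradient.createVariable(2, 3.0);

        // Sum them using the add method under test (cf. line #6).
        SparseGradient sum = x.add(y).add(z);

        // Verify the sum and the first-order derivatives (cf. line #9);
        // checkF0F1 is a helper implemented in the same test suite.
        checkF0F1(sum, 6.0, 1.0, 1.0, 1.0);
    }

    // Hypothetical helper: checks the function value and the partial derivatives.
    private void checkF0F1(SparseGradient sg, double value, double... derivatives) {
        assertEquals(value, sg.getValue(), 1.0e-13);
        for (int i = 0; i < derivatives.length; i++) {
            assertEquals(derivatives[i], sg.getDerivative(i), 1.0e-13);
        }
    }
}
```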
However, the fact that SparseGradient was actually defect-free in the subsequent release of Apache Commons Math might or might not have been due to the high code coverage of SparseGradientTest. Other factors, for instance the low amount of maintenance activity performed on the production class, may have played a role. This is what actually happened to SparseGradient: it did not undergo any modification in Apache Commons Math 3.4 and, therefore, this is what made it defect-free, independently from the high code coverage of the corresponding test suite. Should this example be generalizable, it would mean that the findings reported in the literature do not depict a clear picture of the relation between the characteristics of tests and their ability to foresee post-release defects.

2 Related Work
3 Collecting Test-Related Factors: A Multivocal Literature Review
3.1 Research Methodology
3.1.1 Research Question
3.1.2 Search Query Definition
3.1.3 Selecting the Source Engines
3.1.4 Exclusion and Inclusion Criteria Definition
We excluded:

- Articles that were not focused on investigating the relation between test-related factors and post-release defects, e.g., papers studying how test smells relate to mutation coverage;
- Articles that have later been extended; in particular, when a conference paper had been extended to a journal article, we only considered the journal article as it is more complete;
- Articles not reporting any empirical validation of the relation between test-related factors and post-release defects, e.g., non-validated conjectures of the existence of a relation between test smells and code coverage;
- Articles that were not written in English;
- Articles whose full text was not available;
- Articles that did not undergo a peer-review process, e.g., M.Sc. theses;
- Duplicate papers retrieved by multiple databases.

We included:

- Articles reporting an empirical validation of the relation between test-related factors and post-release defects, e.g., papers studying how test smells relate to post-release defects.
To assess the credibility of grey-literature sources, we relied on the following criteria:

- Author. Information on the Internet with a listed author is an indication of a credible site. An author who is willing to stand behind the information presented (and, in some cases, to include his or her contact information) is a good indication that the information is reliable.
- Date. The date of any research information is important, including information found on the Internet. By including a date, the website allows readers to make decisions about whether that information is recent enough for their purposes.
- Sources. Credible websites, like books and scholarly articles, should cite the source of the information presented.
- Domain. Some domains such as .com, .org, and .net can be purchased and used by any individual. However, the domain .edu is reserved for colleges and universities, while .gov denotes a government website. These two are usually credible sources of information. Websites using the domain .org usually refer to non-profit organizations, which may have an agenda of persuasion rather than education.
- Site Design. This can be very subjective, but a well-designed site can be an indication of more reliable information. Good design helps make information more easily accessible.
- Writing Style. Poor spelling and grammar are an indication that the site may not be credible. In an effort to make the information presented easy to understand, credible sites watch writing style closely.

In addition, a grey-literature resource was included only if it satisfied the following:

- The resource must report on practitioners' experiences and/or discussions of using test-related factors to establish the likelihood of having defects in production code;
- The resource must describe the test-related factor(s) it refers to, e.g., it must clearly mention that test executability represents an important factor to assess post-release defects.
3.1.5 Execution of the Multivocal Literature Review
3.1.6 Quality Assessment and Data Extraction Process
3.2 Analysis of the Results
Group | Name | Description |
---|---|---|
Presence and Executability | Availability of test classes | The availability of a test suite for a production class. |
 | Executability of test classes | The ability to run a test case for a given production class. |
Static factors | TLOC | Number of lines of code of the test suite. |
 | TWMC | Weighted Method Count of the test suite. |
 | TEC | Efferent coupling of the test suite. |
 | Assertion Density | Percentage of assertion statements in the test code (i.e., number of assertions / TLOC). |
Test smells | Assertion Roulette | A test containing several assertions with no explanation. |
 | Eager Test | A test case testing more than one method of the production target. |
 | Indirect Testing | A test interacting with the target via another object. |
 | Resource Optimism | A test that makes optimistic assumptions on the existence of external resources. |
 | Mystery Guest | A test that uses external resources (e.g., files or databases). |
Dynamic factors | Line Coverage | Percentage of statements in the production class that are covered by the test. |
 | Branch Coverage | Percentage of branches in the production class that are covered by the test. |
 | Mutation Coverage | Percentage of mutants of the production class that are killed by the test. |
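As a rough illustration of how a static factor such as Assertion Density can be computed, the sketch below counts JUnit assertion invocations over the non-blank lines of a test file. The regular expression and the simplified notion of "line of code" (comment lines are not excluded here) are our assumptions, not the exact tooling used in the study.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;

public class AssertionDensity {

    // Matches common JUnit assertion calls (assertEquals, assertTrue, fail, ...).
    private static final Pattern ASSERTION =
            Pattern.compile("\\b(assert[A-Z]\\w*|fail)\\s*\\(");

    // Assertion Density = number of assertions / TLOC.
    public static double compute(Path testFile) throws IOException {
        List<String> lines = Files.readAllLines(testFile);
        long tloc = lines.stream().filter(l -> !l.trim().isEmpty()).count();
        long assertions = lines.stream()
                .filter(l -> ASSERTION.matcher(l).find())
                .count();
        return tloc == 0 ? 0.0 : (double) assertions / tloc;
    }
}
```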
4 Studying the Relation Between Test-Related Factors and Post-Release Defects
4.1 Research Methodology
4.1.1 Research Questions and Methodological Sketch
4.1.2 Context selection
Name | # Commits | # Releases | # Contributors | # Production classes | # Test classes | # Defects |
---|---|---|---|---|---|---|
Codec | 1,792 | 45 | 39 | 26 | 31 | 134 |
Collections | 3,091 | 49 | 51 | 270 | 139 | 341 |
DBCP | 1,983 | 62 | 34 | 53 | 21 | 367 |
DbUtils | 656 | 29 | 21 | 25 | 20 | 56 |
IO | 5,400 | 54 | 58 | 100 | 47 | 281 |
Lang | 2,141 | 89 | 140 | 119 | 98 | 634 |
Math | 6,395 | 65 | 34 | 804 | 404 | 684 |
Pool | 1,879 | 168 | 38 | 41 | 14 | 205 |
4.1.3 Dependent Variable
- For each file \(f_i\), \(i=1{\dots}m_k\), involved in a defect-fix \(k\) (\(m_k\) is the number of files changed in the defect-fix \(k\)) and fixed in its revision \(\textit{rel-fix}_{i,k}\), we extracted the file revision just before the defect fixing (\(\textit{rel-fix}_{i,k}-1\)).
- Starting from the revision \(\textit{rel-fix}_{i,k}-1\), for each source line in \(f_i\) changed to fix the defect \(k\), we identified the production class \(C_j\) to which the changed line belongs. Furthermore, the blame feature of Git is used to identify the revision where the last change to that line occurred (see the sketch after this list). In doing that, blank lines and lines that only contain comments are identified and excluded using an island grammar parser (Moonen 2001). This produces, for each production class \(C_j\), a set of \(n_{i,k}\) defect-inducing revisions \(\textit{rel-defect}_{i,j,k}\), \(j=1{\dots}n_{i,k}\). Thus, more than one commit can be indicated by the SZZ algorithm as responsible for inducing a bug.
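Our pipeline relies on the SZZ implementation provided by PyDriller (Spadini et al. 2018a); the following Java sketch only illustrates the blame step using the JGit library, and the repository path, commit SHA, file path, and changed-line indices are placeholders.

```java
import java.io.File;

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.blame.BlameResult;
import org.eclipse.jgit.lib.ObjectId;
import org.eclipse.jgit.revwalk.RevCommit;

public class BlameStep {

    // For one file touched by a defect-fixing commit, blame the revision just
    // before the fix (rel-fix - 1) to find candidate defect-inducing revisions.
    public static void blameChangedLines(File repoDir, String fixSha,
                                         String filePath, int[] changedLines)
            throws Exception {
        try (Git git = Git.open(repoDir)) {
            // rel-fix - 1: the parent of the defect-fixing commit.
            ObjectId beforeFix = git.getRepository().resolve(fixSha + "^");
            BlameResult blame = git.blame()
                    .setFilePath(filePath)
                    .setStartCommit(beforeFix)
                    .call();
            for (int line : changedLines) { // 0-based line indices
                RevCommit inducing = blame.getSourceCommit(line);
                System.out.println("candidate defect-inducing: " + inducing.getName());
                // Blank and comment-only lines would be filtered out here;
                // the paper uses an island grammar parser for this step.
            }
        }
    }
}
```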
4.1.4 Independent Variables
To determine whether a production class is tested, we first parsed each project's pom file, which contains the rules to identify the test classes to execute when the projects need to be built or packaged. In particular, we identified all production and test classes by scanning the pom file and looking for the sourceDirectory and testSourceDirectory fields, which indicate the location of production and test code, respectively. When the fields were not reported explicitly, we considered the default source and test directories. Afterwards, we used a pattern matching approach based on naming conventions to find the production class related to a certain test class, as done in previous work (Grano et al. 2019; Lubsen et al. 2009; Tufano et al. 2016): given the name of a production class (e.g., ‘ClassName’) belonging to the sourceDirectory folder, it checks for the presence of a test class having the same name as the production class but with the prefix or postfix “Test” in the testSourceDirectory (e.g., ‘ClassNameTest’ or ‘TestClassName’). In case the approach cannot identify a test for a certain class, the variable ‘is-tested’ for the considered production class is “false”, “true” otherwise. The accuracy of this linking approach has been previously assessed (Van Rompaey and Demeyer 2009): it is close to 85% and comparable with more sophisticated (but less scalable) techniques (e.g., slicing-based approaches (Qusef et al. 2014)).
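A minimal sketch of the naming-convention matcher follows; the pom parsing is omitted, and the class and method names are ours.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class TestToCodeLinker {

    // 'is-tested' is true if a test class named 'ClassNameTest' or
    // 'TestClassName' exists under the testSourceDirectory.
    public static boolean isTested(String className, Path testSourceDirectory)
            throws IOException {
        try (Stream<Path> files = Files.walk(testSourceDirectory)) {
            return files
                    .map(p -> p.getFileName().toString())
                    .anyMatch(n -> n.equals(className + "Test.java")
                            || n.equals("Test" + className + ".java"));
        }
    }
}
```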
To determine whether the tests of a production class are executable, we ran the mvn verify command, which executes all the available tests. If the execution of a test proceeds without errors,15 then we considered the test as executable. Otherwise, we marked it as non-executable.
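The executability check itself can be sketched as follows, under the simplifying assumption that the exit code of mvn verify is the success signal; timeouts and per-test log parsing are omitted.

```java
import java.io.File;
import java.io.IOException;

public class ExecutabilityCheck {

    // Runs 'mvn verify' in the project directory; a zero exit code means the
    // build and all tests completed without errors, i.e., tests are executable.
    public static boolean areTestsExecutable(File projectDir)
            throws IOException, InterruptedException {
        Process mvn = new ProcessBuilder("mvn", "verify")
                .directory(projectDir)
                .inheritIO()
                .start();
        return mvn.waitFor() == 0;
    }
}
```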
4.1.5 Confounding Factors

To avoid attributing to test-related factors effects that are due to known characteristics of production code, we considered the following confounding factors:

- We considered the Lines Of Code (PLOC) metric, which measures the size of production classes. According to previous findings (Koru et al. 2009; Palomba et al. 2017b; Zazworka et al. 2011), the larger a class, the higher its fault-proneness. As such, the number of post-release defects might be a reflection of the production code size and, therefore, we computed PLOC to control our findings on the impact of the presence of test suites. To measure PLOC, we used the tool devised by Spinellis (2005).
- We computed Weighted Method per Class (PWMC) (Chidamber and Kemerer 1994) as a way to measure the complexity of production code (see the formula after this list). A number of previous studies have shown the metric to be related to the number of defects a production class will incur (Di Nucci et al. 2018; Nagappan et al. 2010; Zimmermann et al. 2007). The tool by Spinellis (2005) was used to compute the metric on our dataset.
- We measured the Efferent Coupling (PEC) of production classes because, as reported by previous research (Basili et al. 1996; D’Ambros et al. 2009; Fregnan et al. 2019; Knab et al. 2006; Shihab et al. 2010), the higher the coupling of a class, the higher its fault-proneness. Also in this case, we employed the tool by Spinellis (2005) to compute PEC.
- We considered code smells, i.e., symptoms of poor implementation choices (Brown et al. 1998; Fowler 2018), since they are reported to be connected to the fault-proneness of production code (D’Ambros et al. 2010; Hall et al. 2014; Khomh et al. 2012; Palomba et al. 2017c, 2018a, 2018b; Pecorelli et al. 2019b; Tufano et al. 2017b). We considered five code smells from the catalog by Fowler (2018) that have different characteristics, namely God Class, Class Data Should Be Private, Complex Class, Functional Decomposition, and Spaghetti Code. We provide a complete definition of those smells in our Online Appendix (Pecorelli et al. 2020). These code smells have been analyzed by previous work studying their effect on source code defect-proneness (Khomh et al. 2012; Palomba et al. 2018b); our selection was driven by these findings. As for the actual detection of these code smells, we relied on Decor (Moha et al. 2010), a state-of-the-art detection tool which has shown an accuracy close to 80% (Moha et al. 2010). In our work we re-evaluated the precision of Decor. The two authors previously involved in the validation of the test smells also conducted this analysis: they manually validated all the 137 code smell instances output by the tool. The task was to understand whether a certain code smell candidate given by Decor actually revealed the existence of a design problem in source code. After the first assessment, the two inspectors compared their evaluations, reaching an agreement of 95%. The remaining 5% of cases (i.e., seven code smell candidates) were discussed and, finally, four of them turned out to be real code smells. Following this validation, we (i) confirmed the good accuracy and suitability of Decor in our context and (ii) excluded the false positive smells from our analysis.
- We computed the number of pre-release changes and pre-release defects because metrics capturing the previous history of production classes can reveal relevant evolution aspects (Hassan 2009; Rahman and Devanbu 2013). To compute the number of pre-release changes, we mined the change log of the considered projects and counted how many times a certain production class had been modified. As for the pre-release defects, we relied again on the SZZ algorithm implemented in PyDriller (Spadini et al. 2018a).
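As a reference for the complexity metric mentioned above, and assuming, as is common, that each method is weighted by its cyclomatic complexity, PWMC can be written as:

\[ \mathrm{PWMC}(C) = \sum_{m \in \mathrm{methods}(C)} \mathrm{CC}(m) \]

where \(\mathrm{CC}(m)\) is the cyclomatic complexity of method \(m\) of production class \(C\); TWMC is defined analogously on the test suite.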
Group | Name | Description |
---|---|---|
Static factors | PLOC | Number of lines of code of the Production class. |
 | PWMC | Weighted Method Count of the Production class. |
 | PEC | Efferent coupling of the Production class. |
Code smells | God Class | A class having a large size, poor cohesion, and several dependencies with other data classes of the system. |
 | Class Data Should Be Private | A class exposing its attributes, thus violating the information hiding principle. |
 | Complex Class | A class presenting an overly high cyclomatic complexity. |
 | Functional Decomposition | A class implemented as a function. |
 | Spaghetti Code | A class that exhibits a procedural-style structure, declaring a number of long methods without parameters. |
Process metrics | Pre-release Changes | Number of changes involving the Production class before the release date of the considered snapshot. |
 | Pre-release Defects | Number of defects involving the Production class before the release date of the considered snapshot. |
4.1.6 Statistical Modeling and Data Analysis
We formulated the following null hypotheses, each paired with the corresponding alternative hypothesis:

- H0₁: There is no correlation between the presence of test classes and software quality, as measured by post-release defects.
- H1₁: There is a correlation between the presence of test classes and software quality, as measured by post-release defects.
- H0₂: There is no correlation between the executability of test classes and software quality, as measured by post-release defects.
- H1₂: There is a correlation between the executability of test classes and software quality, as measured by post-release defects.
- H0₃: There is no correlation between test code metrics and software quality, as measured by post-release defects.
- H1₃: There is a correlation between test code metrics and software quality, as measured by post-release defects.
- H0₄: There is no correlation between test smells and software quality, as measured by post-release defects.
- H1₄: There is a correlation between test smells and software quality, as measured by post-release defects.
- H0₅: There is no correlation between code coverage metrics and software quality, as measured by post-release defects.
- H1₅: There is a correlation between code coverage metrics and software quality, as measured by post-release defects.
- H0₆: There is no correlation between mutation coverage and software quality, as measured by post-release defects.
- H1₆: There is a correlation between mutation coverage and software quality, as measured by post-release defects.
To avoid multicollinearity among the independent variables, we computed their pairwise correlations (using the varclus function available in the R statistical toolkit18); then, if two variables had a correlation higher than 0.6, we excluded the more complex one from the model.
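In the study this pruning was done with varclus in R; purely as an illustration of the rule, the Java sketch below drops the later (assumed more complex) variable of any pair whose Spearman correlation exceeds 0.6, using the SpearmansCorrelation implementation that Apache Commons Math itself provides. The variable ordering and data layout are our assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

public class CollinearityFilter {

    // columns[i] holds the observations of variable i; variables are assumed
    // to be ordered from simplest to most complex, so the later (more complex)
    // member of a correlated pair is the one dropped.
    public static List<Integer> selectVariables(double[][] columns) {
        SpearmansCorrelation rho = new SpearmansCorrelation();
        List<Integer> kept = new ArrayList<>();
        for (int j = 0; j < columns.length; j++) {
            boolean redundant = false;
            for (int i : kept) {
                if (Math.abs(rho.correlation(columns[i], columns[j])) > 0.6) {
                    redundant = true; // correlated with an already-kept variable
                    break;
                }
            }
            if (!redundant) {
                kept.add(j);
            }
        }
        return kept;
    }
}
```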
4.2 Analysis of the Results

4.2.1 RQ1. The presence and executability of tests
Variable | Estimate (Test) | S.E. (Test) | Sig. (Test) | Estimate (Test + Prod.) | S.E. (Test + Prod.) | Sig. (Test + Prod.) | Estimate (Full) | S.E. (Full) | Sig. (Full) |
---|---|---|---|---|---|---|---|---|---|
Intercept | 0.11 | 0.05 | * | − 0.03 | 0.05 | | − 0.17 | 0.04 | *** |
is-tested | 0.57 | 0.12 | *** | 0.14 | 0.12 | | − 0.09 | 0.11 | |
are-tests-executable | − 0.49 | 0.13 | *** | − 0.23 | 0.12 | . | − 0.05 | 0.11 | |
PLOC | | | | 0.00 | 0.00 | *** | 0.00 | 0.00 | |
isGodClass | | | | 0.24 | 0.16 | | 0.17 | 0.15 | |
isClassDataShouldBePrivate | | | | 0.39 | 0.38 | | 0.70 | 0.34 | |
isComplexClass | | | | − 1.12 | 0.30 | *** | − 0.29 | 0.27 | |
pre-release changes | | | | | | | 0.05 | 0.00 | *** |
pre-release defects | | | | | | | 0.10 | 0.01 | *** |
Variable name | Minimum | Maximum | Mean | SD |
---|---|---|---|---|
PLOC | 2.00 | 6291.00 | 211.00 | 359.90 |
isGodClass | 0.00 | 1.00 | 0.10 | 0.30 |
isClassDataShouldBePrivate | 0.00 | 1.00 | 0.01 | 0.09 |
isComplexClass | 0.00 | 1.00 | 0.02 | 0.14 |
is-tested | 0.00 | 1.00 | 0.49 | 0.50 |
are-tests-executable | 0.00 | 1.00 | 0.38 | 0.49 |
pre-release changes | 0.00 | 201.00 | 7.32 | 12.26 |
pre-release defects | 0.00 | 35.00 | 1.03 | 3.37 |
post-release defects | 0.00 | 23.00 | 0.20 | 1.33 |
4.2.2 RQ2. The impact of static test code indicators
Variable name | Minimum | Maximum | Mean | SD |
---|---|---|---|---|
PLOC | 13.00 | 6291.00 | 311.00 | 460.00 |
isGodClass | 0.00 | 1.00 | 0.17 | 0.38 |
isClassDataShouldBePrivate | 0.00 | 1.00 | 0.01 | 0.12 |
isComplexClass | 0.00 | 1.00 | 0.04 | 0.19 |
TLOC | 5.00 | 2210.00 | 168.00 | 248.10 |
Assertion Density | 0.00 | 0.83 | 0.19 | 0.15 |
isAssertionRoulette | 0.00 | 1.00 | 0.87 | 0.34 |
isEagerTest | 0.00 | 1.00 | 0.61 | 0.49 |
isMysteryGuest | 0.00 | 1.00 | 0.07 | 0.26 |
isResourceOptimism | 0.00 | 1.00 | 0.02 | 0.14 |
isIndirectTesting | 0.00 | 1.00 | 0.06 | 0.24 |
pre-release changes | 1.00 | 201.00 | 10.00 | 16.21 |
pre-release defects | 0.00 | 35.00 | 1.73 | 4.54 |
post-release defects | 0.00 | 23.00 | 0.29 | 1.65 |
Variable | Estimate (Test) | S.E. (Test) | Sig. (Test) | Estimate (Test + Prod.) | S.E. (Test + Prod.) | Sig. (Test + Prod.) | Estimate (Full) | S.E. (Full) | Sig. (Full) |
---|---|---|---|---|---|---|---|---|---|
Intercept | − 0.02 | 0.18 | | − 0.08 | 0.18 | | − 0.19 | 0.16 | |
TLOC | 0.00 | 0.00 | *** | 0.00 | 0.00 | | − 0.00 | 0.00 | |
Assertion Density | 0.16 | 0.44 | | 0.02 | 0.43 | | − 0.38 | 0.39 | |
isAssertionRoulette | − 0.06 | 0.20 | | − 0.04 | 0.19 | | 0.05 | 0.17 | |
isEagerTest | 0.12 | 0.14 | | 0.03 | 0.14 | | − 0.05 | 0.13 | |
isMysteryGuest | − 0.19 | 0.28 | | − 0.20 | 0.28 | | − 0.54 | 0.25 | * |
isResourceOptimism | 0.73 | 0.53 | | 0.54 | 0.52 | | − 0.22 | 0.47 | |
isIndirectTesting | − 0.01 | 0.27 | | − 0.04 | 0.27 | | 0.09 | 0.24 | |
PLOC | | | | 0.00 | 0.00 | *** | 0.00 | 0.00 | |
isGodClass | | | | 0.36 | 0.23 | | 0.32 | 0.21 | |
isClassDataShouldBePrivate | | | | 0.27 | 0.56 | | 0.93 | 0.50 | . |
isComplexClass | | | | − 0.97 | 0.41 | * | − 0.22 | 0.38 | |
pre-release changes | | | | | | | 0.03 | 0.00 | *** |
pre-release defects | | | | | | | 0.09 | 0.02 | *** |
4.2.3 RQ3. The impact of dynamic test code indicators
Variable name | Minimum | Maximum | Mean | SD |
---|---|---|---|---|
PLOC | 13.00 | 5077.00 | 259.00 | 342.20 |
isGodClass | 0.00 | 1.00 | 0.12 | 0.38 |
isClassDataShouldBePrivate | 0.00 | 1.00 | 0.01 | 0.11 |
isComplexClass | 0.00 | 1.00 | 0.02 | 0.14 |
TLOC | 5.00 | 2210.00 | 135.90 | 194.40 |
Assertion Density | 0.00 | 0.83 | 0.20 | 0.16 |
isAssertionRoulette | 0.00 | 1.00 | 0.86 | 0.34 |
isEagerTest | 0.00 | 1.00 | 0.62 | 0.49 |
isMysteryGuest | 0.00 | 1.00 | 0.06 | 0.23 |
isResourceOptimism | 0.00 | 1.00 | 0.02 | 0.12 |
isIndirectTesting | 0.00 | 1.00 | 0.04 | 0.20 |
Line Coverage | 0.00 | 1.00 | 0.90 | 0.14 |
Branch Coverage | 0.00 | 1.00 | 0.75 | 0.32 |
Mutation Coverage | 0.00 | 1.00 | 0.70 | 0.32 |
pre-release changes | 1.00 | 139.00 | 8.49 | 10.91 |
pre-release defects | 0.00 | 29.00 | 1.37 | 3.66 |
post-release defects | 0.00 | 23.00 | 0.19 | 1.23 |
Variable | Estimate (Test) | S.E. (Test) | Sig. (Test) | Estimate (Test + Prod.) | S.E. (Test + Prod.) | Sig. (Test + Prod.) | Estimate (Full) | S.E. (Full) | Sig. (Full) |
---|---|---|---|---|---|---|---|---|---|
Intercept | 0.17 | 0.38 | | − 0.16 | 0.37 | | − 0.18 | 0.36 | |
Line Coverage | 0.27 | 0.45 | | 0.41 | 0.44 | | 0.21 | 0.43 | |
Branch Coverage | − 0.08 | 0.17 | | − 0.08 | 0.17 | | − 0.17 | 0.16 | |
Mutation Coverage | − 0.55 | 0.18 | ** | − 0.35 | 0.19 | . | − 0.12 | 0.18 | |
LOC (test suite) | 0.00 | 0.00 | *** | − 0.00 | 0.00 | | − 0.00 | 0.00 | * |
Assertion Density | 0.38 | 0.36 | | 0.15 | 0.35 | | − 0.12 | 0.34 | |
isAssertionRoulette | − 0.10 | 0.17 | | − 0.04 | 0.16 | | 0.01 | 0.16 | |
isEagerTest | 0.09 | 0.12 | | 0.03 | 0.12 | | − 0.04 | 0.11 | |
isMysteryGuest | − 0.14 | 0.27 | | − 0.02 | 0.27 | | − 0.22 | 0.26 | |
isResourceOptimism | 0.67 | 0.50 | | 0.54 | 0.49 | | 0.32 | 0.47 | |
isIndirectTesting | 0.23 | 0.26 | | 0.16 | 0.26 | | 0.18 | 0.25 | |
LOC (production class) | | | | 0.00 | 0.00 | *** | 0.00 | 0.00 | ** |
isGodClass | | | | − 0.08 | 0.23 | | − 0.14 | 0.22 | |
isClassDataShouldBePrivate | | | | 1.22 | 0.54 | * | 1.49 | 0.52 | ** |
isComplexClass | | | | 0.26 | 0.45 | | 0.37 | 0.43 | |
pre-release changes | | | | | | | 0.03 | 0.01 | *** |
pre-release defects | | | | | | | 0.04 | 0.02 | * |
A first example is the class DerivativeStructure, belonging to the Commons-Math project: it has a high PLOC (i.e., 1,011) but, at the same time, a high TLOC (i.e., 1,172). This class has no post-release defects, perhaps due to the robustness of its test suite. The second example is the class BaseGenericObjectPool in the Commons-Pool project; like the previous one, it is characterized by a high PLOC (i.e., 849) but the low TLOC (i.e., 43) may have led to 23 post-release defects (the maximum number of defects among the instances we analyzed). These two examples, together with the results of the statistical model, suggest that the size of test suites can be a proxy metric to assess how robust a test is. Differently from the other research questions, the considered variables remain significant across the models: even when adding the process metrics, PLOC and ‘isClassDataShouldBePrivate’ are still significant.
5 Discussion, Implications, and Threats to Validity
5.1 Discussion
For instance, the class BaseGenericObjectPool had 849 lines of code, while the corresponding test suite BaseGenericObjectPoolTest had just 43 lines of code. After release 2.3, the class BaseGenericObjectPool had 23 defects. At first sight, this may suggest that the test suite was not robust enough to prevent or diagnose the introduction of defects. However, the developer found that considering only the test suite BaseGenericObjectPoolTest could be not enough. He pointed out to us that, when code is refactored, the tests are left in the original test suites to help detect regressions during the refactoring. So, there could exist a subset of tests in other classes, which we did not consider, that exercise the production class BaseGenericObjectPool; in other words, the test-to-code traceability technique exploited in the study may have underestimated the number of tests connected to the production class.

5.2 Implications
5.3 Threats to Validity
To link tests to production classes, we relied on a pattern matching approach based on naming conventions, i.e., matching test classes whose name equals the production class name plus the prefix or postfix ‘Test’. While the accuracy of the technique has been previously assessed (Van Rompaey and Demeyer 2009), showing a good compromise between accuracy and scalability, the linking procedure may have introduced some bias in cases where the tests exercising a production class are not all included in the test suite retrieved by the technique but are placed in other test suites. In our case, this may have happened, as mentioned by the interviewed developer of Apache Commons-Pool who commented on our findings in the context of our additional qualitative analysis (see Section 5.1).