1 Introduction
2 Background
2.1 Test Smells
2.2 Test Case Generation
2.3 Limitations of Prior Work
3 Methodology
3.1 Research Questions
- RQ1: How widespread are test smells in automatically generated test cases?
- RQ2: How accurate are automated tools in detecting test smells in automatically generated tests?
- RQ3: How well do test smells reflect real problems in automatically generated test suites?
- RQ4: How does test smell diffusion in manually written tests compare to automatically generated tests?
- RQ5: How well do test smells capture real problems in manually written tests?
3.2 Test Class Selection
SourceForge.net. The selected classes are non-trivial, as identified using a well-established triviality test (Panichella et al. 2017), which filters out classes whose methods all have a McCabe's cyclomatic complexity lower than three. This helps ignore classes that can be fully covered by simple method invocations. As a consequence of these selection criteria, there are no precisely equivalent hand-written test suites for the classes selected in the original study (Grano et al. 2019) that we could use to answer RQ4 and RQ5. In particular, only eight out of the 100 classes selected by Grano et al. have a manually written test suite. Hence, we extended the benchmark used to answer RQ4 and RQ5 by selecting 41 additional test suites from the top 10 most popular Java projects (Fraser and Arcuri 2014) in the SF110 dataset that meet similar complexity criteria. We manually validated the suites in this extended benchmark to confirm the (eventual) test smells.

3.3 Test Case Generation
GitHub. EvoSuite also applies several post-processing optimizations to reduce the size of the test cases, identify and remove flaky tests, and minimize the number of generated assertions (see Section 2.2). JTExpert is a well-known testing framework for Java programs that was ranked second in the 2017 SBST tool competition (Sakti et al. 2017). JTExpert uses a random strategy combined with static analyses to automatically generate a complete test suite for a branch coverage criterion, and is publicly available online. Unlike EvoSuite, JTExpert does not apply any post-processing optimizations. A limitation of JTExpert is that it does not always generate test cases: in the 2017 SBST tool competition, it generated no test cases for 415 out of 1,450 runs (27%).

3.3.1 EvoSuite
- Test minimization: test cases are first minimized by removing spurious statements that do not contribute to any coverage criterion. Crossover and mutation may add statements to a test case that do not lead to covering any additional coverage targets (e.g., branches). Removing these unnecessary statements can reduce the oracle cost (i.e., the time needed to manually inspect the test) and mitigate test smells.
- Assertion minimization: the generated test cases are first enriched with assertions that check the output values returned by method calls and get methods, as well as the values of public and protected attributes. The assertions are then filtered based on their ability to strongly kill mutants. A mutant is strongly killed by a test case t if t passes on the original program but fails when executed against the mutant. Assertions that do not contribute to killing new mutants are removed during this phase.
- Flakiness detection and removal: the final test suite is re-executed to detect potential flaky tests, i.e., tests with non-deterministic behavior. Flaky test cases are removed in this step.
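The first of these steps can be sketched as a greedy pass over the test's statements. The sketch below is an illustration under simplified assumptions, not EvoSuite's actual implementation: a "test" is a list of statement identifiers, and a hypothetical `coverage` function maps a test to the set of goals (e.g., branches) it covers.

```java
import java.util.*;
import java.util.function.Function;

public class MinimizationSketch {
    // Greedily drop statements whose removal does not lose any covered goal.
    static List<String> minimize(List<String> test,
                                 Function<List<String>, Set<String>> coverage) {
        Set<String> target = coverage.apply(test);
        List<String> result = new ArrayList<>(test);
        // Iterate backwards so removals do not shift pending indices.
        for (int i = result.size() - 1; i >= 0; i--) {
            List<String> candidate = new ArrayList<>(result);
            candidate.remove(i);
            if (coverage.apply(candidate).containsAll(target)) {
                result = candidate;  // statement was spurious: drop it
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy coverage: "s2" covers branch "b1", "s4" covers "b2";
        // "s1" and "s3" contribute nothing.
        Function<List<String>, Set<String>> cov = t -> {
            Set<String> goals = new HashSet<>();
            if (t.contains("s2")) goals.add("b1");
            if (t.contains("s4")) goals.add("b2");
            return goals;
        };
        List<String> minimized =
            minimize(Arrays.asList("s1", "s2", "s3", "s4"), cov);
        System.out.println(minimized);  // [s2, s4]
    }
}
```

Note that a single greedy pass suffices here only because the toy coverage function is monotone; real tools must also re-check that the shortened test still compiles and executes.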
3.3.2 JTExpert
JTExpert first collects the primitive constants (e.g., `int` values) and strings that appear in the source code and stores them in a seeding pool. Then, with a given probability, input data are generated at random or selected from the seeding pool. In addition, JTExpert seeds the null value with a constant probability while generating instances of classes (Sakti et al. 2017).

3.3.3 Parameter Setting
| Criterion | Settings by Grano et al. (2019) | | | Our settings | | |
| --- | --- | --- | --- | --- | --- | --- |
| | M | IQR | CI | M | IQR | CI |
| Branch Coverage | 0.69 | 0.71 | [0.66, 0.72] | 0.74 | 0.70 | [0.71, 0.76] |
| Overall Coverage | 0.67 | 0.66 | [0.65, 0.70] | 0.74 | 0.65 | [0.71, 0.76] |
| # of Test Cases | 14 | 23 | [13.86, 14.13] | 15 | 26 | [14.83, 15.17] |
| Total Test Length | 50 | 135 | [47.25, 52.60] | 46 | 110 | [43.63, 48.24] |
3.4 Detection Tool Selection
GitHub. Recently, Spadini et al. (2020) calibrated the detection rules in tsDetect based on developers' perception and classification of test smell severity, resulting in thresholds that are better aligned with what developers consider actual bad test design choices.

| Test smell | Definition by Deursen et al. (2001) | Rules for interpretation |
| --- | --- | --- |
| Mystery Guest | Test case that accesses external resources such as files and databases, so that it is no longer self-contained. | Discarded for automatically generated tests, since tools like EvoSuite use runners that by definition mock out all accesses to external resources. |
| Eager Test | A test that checks multiple different functionalities in one case, which makes it hard to read or understand. | (1) The test must have more than one assertion and (2) at least one assertion is not on the result of a get method. |
| Assertion Roulette | A test that has multiple assertion statements that do not provide any description of why they failed. | A test must have two or more assertions, none of which has an explanatory message accompanying it. |
| Indirect Testing | Tests the class under test using methods from other classes. | The presence of any assert that uses a method that is not part of the class under test. |
| Sensitive Equality | When a test checks for equality through the use of the toString method. | Any assert that checks the exact value of a String returned through a toString call is said to be sensitive. |
| Resource Optimism | A test that makes optimistic assumptions about the state/existence of external resources. | Discarded for automatically generated tests, since tools like EvoSuite use runners that by definition mock out all accesses to external resources. |
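The interpretation rules in the table above are essentially syntactic predicates over a test's assertions. As a minimal sketch (our own illustration, not the implementation of either detection tool), a test case can be modeled as a list of hypothetical assertion records:

```java
import java.util.*;

public class SmellRules {
    static class Assertion {
        final String assertedMethod;  // method whose result is asserted
        final boolean hasMessage;     // explanatory message present?
        Assertion(String m, boolean msg) { assertedMethod = m; hasMessage = msg; }
    }

    // Eager Test: more than one assertion, at least one not on a getter.
    static boolean isEager(List<Assertion> asserts) {
        return asserts.size() > 1
            && asserts.stream().anyMatch(a -> !a.assertedMethod.startsWith("get"));
    }

    // Assertion Roulette: two or more assertions, none with a message.
    static boolean isRoulette(List<Assertion> asserts) {
        return asserts.size() >= 2
            && asserts.stream().noneMatch(a -> a.hasMessage);
    }

    // Sensitive Equality: any assert on the exact value of a toString call.
    static boolean isSensitive(List<Assertion> asserts) {
        return asserts.stream().anyMatch(a -> a.assertedMethod.equals("toString"));
    }

    public static void main(String[] args) {
        List<Assertion> test = Arrays.asList(
            new Assertion("getName", false),
            new Assertion("isBusy", false));
        System.out.println(isEager(test));     // true: two asserts, one non-getter
        System.out.println(isRoulette(test));  // true: no messages
        System.out.println(isSensitive(test)); // false
    }
}
```

As later sections discuss, predicates of this purely syntactic kind are exactly what makes rule-based detectors brittle.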
3.5 Manual Validation
- Step 1. Each test suite was independently inspected by two authors of this paper. For each test suite, the analysis was done across the dimensions corresponding to the selected test smells (five for generated tests, seven for manually written ones). Since we only look for the presence of a smell, we use a binary marker for each dimension. As a guideline, each author adhered to the detection rules listed in Table 2. We note that the EvoSuite analysis involved four authors annotating 50 suites each; this annotation was part of our original conference paper (Panichella et al. 2020b) and served to calibrate our annotation protocol. The other two sets were annotated in full by two authors.
- Step 2. For each set of test suites, the two authors responsible for the analysis discussed their findings and any disputed cases to reach a resolution. We generally encountered disagreement levels between 10% and 20%, which could mostly be resolved through discussion with reference to the guidelines.
- Step 3. Any remaining controversial cases that could not be resolved between the two annotators were discussed by all authors to reach a final agreement and improve the protocol. This discussion involved ten cases in total (seven of which were in the first dataset, from EvoSuite) and led to slight refinements of the guidelines for corner cases, as test smells manifest in many complex ways. Furthermore, during this phase, test cases that are not smelly but still exhibit interesting anomalies were also discussed. At the end of this phase, the classification of all test suites was finalized.
4 Empirical Results
4.1 RQ1: How Widespread are Test Smells in Automatically Generated Test Cases?
| Smell | Manually validated | | Reported by Grano et al. (2019) | |
| --- | --- | --- | --- | --- |
| | EvoSuite | JTExpert | EvoSuite | JTExpert |
| Eager Test | 21% | 61% | 57% | 62% |
| Assertion Roulette | 17% | 64% | 74% | 74% |
| Indirect Testing | 32% | 47% | – | – |
| Sensitive Equality | 19% | 53% | 7% | 66% |
| Mystery Guest^a | 0% | 0% | 11% | 15% |
| Resource Optimism^a | 0% | 0% | 3% | – |
4.2 RQ2: How Accurate are Automated Tools in Detecting Test Smells in Automatically Generated Tests?
| Test smell | Tool used by Grano et al. (2019) | | | | | tsDetect calibrated by Spadini et al. (2020) | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | FPR | FNR | Precision | Recall | F-measure | FPR | FNR | Precision | Recall | F-measure |
| Assertion Roulette | 0.72 | 0.00 | 0.22 | 1.00 | 0.36 | 0.05 | 0.50 | 0.67 | 0.50 | 0.57 |
| Eager Test | 0.53 | 0.05 | 0.33 | 0.95 | 0.49 | 0.05 | 0.45 | 0.73 | 0.55 | 0.63 |
| Mystery Guest | 0.12 | – | – | – | – | 0.03 | – | – | – | – |
| Sensitive Equality | 0.00 | 0.67 | 1.00 | 0.33 | 0.50 | 0.00 | 0.67 | 1.00 | 0.33 | 0.50 |
| Resource Optimism | 0.02 | – | – | – | – | 0.02 | – | – | – | – |
| Indirect Testing | 0.00 | 1.00 | – | 0.00 | – | – | – | – | – | – |
| Test smell | Tool used by Grano et al. (2019) | | | | | tsDetect calibrated by Spadini et al. (2020) | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | FPR | FNR | Prec. | Rec. | F-measure | FPR | FNR | Prec. | Rec. | F-measure |
| Assertion Roulette | 0.03 | 0.03 | 0.98 | 0.97 | 0.97 | 0.06 | 0.11 | 0.96 | 0.89 | 0.92 |
| Eager Test | 0.00 | 0.34 | 1.00 | 0.66 | 0.79 | 0.05 | 0.38 | 0.95 | 0.62 | 0.75 |
| Mystery Guest | 0.11 | – | – | – | – | 0.01 | – | – | – | – |
| Sensitive Equality | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 0.04 | 0.04 | 0.96 | 0.96 | 0.96 |
| Resource Optimism | 0.01 | – | – | – | – | 0.01 | – | – | – | – |
| Indirect Testing | 0.00 | 1.00 | – | 0.00 | – | – | – | – | – | – |
An example is the generated test `s1`, which invokes two methods on the class under test. The first method sets the private attribute `markedForMunging` of the class to `false` (it is `true` by default). The method `munge` manipulates symbols in the global scope of the class if and only if the attribute `markedForMunging` is set to `true`. Testing this scenario requires both method invocations; otherwise, one of the branches inside the method `munge` cannot be tested.
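The dependency between the two calls can be sketched with a hypothetical stand-in class (not the actual subject program): the branch guarded by the flag is reachable only after the first call flips it, so a test exercising both branches necessarily contains two method invocations without being eager.

```java
public class MungeSketch {
    private boolean markedForMunging = true;

    // First call in the test: flips the internal flag.
    void disableMunging() { markedForMunging = false; }

    // Second call: which branch runs depends on the flag.
    String munge(String symbol) {
        if (markedForMunging) {
            return symbol.toUpperCase();  // branch A: flag still true
        }
        return symbol;                    // branch B: needs disableMunging() first
    }

    public static void main(String[] args) {
        MungeSketch s = new MungeSketch();
        System.out.println(s.munge("x"));  // X  (branch A)
        s.disableMunging();
        System.out.println(s.munge("x"));  // x  (branch B)
    }
}
```

A purely call-counting heuristic would flag such a test as eager even though both invocations serve a single scenario.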
The tool used by Grano et al. flags tests containing strings such as “`File`”, “`FileOutputStream`”, “`DB`”, or “`HttpClient`” as smelly; however, EvoSuite separates the test code from environmental dependencies (e.g., external files) in a fully automated fashion through bytecode instrumentation (Arcuri et al. 2014). In particular, it uses two mechanisms: (1) mocking and (2) customized test runners. For one, classes that access the filesystem (e.g., `java.io.File`) have all their methods (and constructors) mocked (Arcuri et al. 2014). EvoSuite also replaces general calls to the Java Virtual Machine (e.g., `System.currentTimeMillis`) with mock classes/methods with deterministic behavior. Finally, the test runner used by EvoSuite replaces occurrences of console inputs (e.g., `java.io.InputStream`) in all instrumented classes with a customized console. Notice that EvoSuite resets all mock objects before every test execution. Figure 3 shows an example of such a false positive. While tsDetect avoids misclassifying mocked file access by checking for the string “Mock” (which helps it achieve a lower FPR), it does not inspect whether a customized test runner is used.

A similar issue arises for JTExpert: statements such as `File f = new File();` were not executed, triggering run-time errors. Even if a few test cases created with JTExpert statically include file manipulation statements, they must be considered false positives, since no file was created dynamically at run-time. This scenario further highlights the limitations of static test smell detection tools.

A test exhibits sensitive equality when the `toString` method is used for (equality-related) assertions. Surprisingly, both test smell detection tools detect only a small portion of this test smell's instances for EvoSuite. Through manual analysis, we discovered that these tools successfully detect sensitive equality if and only if the method `toString` directly appears within an assertion. However, both detection tools can be easily fooled by first storing the result of `toString` in a local variable and then asserting its value against the target, a common pattern with EvoSuite. When the `toString` method is directly used inside the assertions, test smell detection tools correctly identify this type of smell. The rule sets are also incomplete: developers may use other method names (e.g., `toText()`, `prettyPrint()`) for methods that print objects as `String` instances or implement string-based equality checks.
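The false-negative pattern can be made concrete with a toy line-level rule (our own simplification of the syntactic checks described above, not either tool's actual rule set): flag a line only if it contains both an assert and a `toString` call.

```java
public class SensitiveEqualityGap {
    // Hypothetical rule: flag lines containing an assert AND a toString call.
    static boolean flagged(String line) {
        return line.contains("assert") && line.contains("toString");
    }

    public static void main(String[] args) {
        // Pattern 1: direct call inside the assertion -- detected.
        System.out.println(flagged("assertEquals(\"[]\", list0.toString());"));  // true

        // Pattern 2: result stored in a local variable first (the common
        // EvoSuite pattern) -- missed, although the test is just as
        // sensitive to the string representation.
        System.out.println(flagged("String s = list0.toString();"));             // false
        System.out.println(flagged("assertEquals(\"[]\", s);"));                 // false
    }
}
```

Detecting the second pattern would require tracking data flow from the `toString` call to the asserted variable, which neither a line-level nor a statement-level syntactic rule can do.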
4.3 RQ3: How Well do Test Smells Reflect Real Problems in Automatically Generated Test Suites?
One such case is an eager test generated for the class `SubstringLabeler`. We observe that two of the asserts check the result of a getter on the entity, whereas the other checks whether the object is busy. Cases such as these were quite common (21% frequency, as reported in Table 3) and reflect a lack of singular purpose in test cases, which indeed risks maintainability issues.

Another generated test combines an `assertTrue` on the result of a (tautological) equality test and an unrelated `assertEquals` on an attribute of the same object.
In another example, the class under test is `PhotoController`, but the time set on the `Camera` class is asserted. In this test, the call to `home0.setCamera` leads to coverage on the class under test (the `PhotoController` is an observer of `home0`), so the statement survives EvoSuite's minimization. When EvoSuite's regular mutation-based assertion minimization does not succeed in retaining any relevant assertions, as a last resort EvoSuite adds an assertion on the last return value produced in the test case. In this case, however, the time value set on the `Camera` has nothing to do with the `PhotoController`. Support for more advanced assertions could have avoided this problem.
In a further case, the class under test is `TeamFinderImpl`, but the ultimate assert checks a `LinkedHashMap` for emptiness to confirm some aspect of the behavior of `setJoin` (to which the `LinkedHashMap` is passed). Although we marked this as smelly, in accordance with the pre-established definition, it is debatable whether this is actually an issue: there may not be a direct way to test this map's value through `TeamFinderImpl` (e.g., through a getter), so the tester faces the choice of either incurring this smell or not testing this property. Nor is this endemic to automatically generated test suites; questions regarding the testing of hidden (or ‘private’) properties are abundant on, e.g., StackOverflow, and no consensus exists on what is appropriate.

The premise of sensitive equality is that checking values through the string representation, as produced by the `toString` method, is non-robust: that representation is prone to changing in trivial ways, like adding or removing punctuation, which would cause a spurious test failure. We find that test generation tools do produce some tests (19% frequency for EvoSuite, as reported in Table 3) that rely on the value returned by `toString` methods. Oddly enough, the invocation of `toString` is rarely done directly in the assert; rather, its result is often stored in a local variable, which is then compared to the expected value in the assert (as seen in Fig. 7). Whether these uses of `toString` constitute a real problem is debatable; for any such test, EvoSuite also generated many test cases that explicitly check for equality (to equivalent objects) and/or the values returned by all ‘getter’ methods. Tests such as this seemed to genuinely test the current implementation of the `toString` method; we very rarely found cases where the string representation was used specifically to confirm program state after some call or to test equality to another object.
4.4 RQ4: How does Test Smell Diffusion in Manually Written Tests Compare to Automatically Generated Tests?
Sensitive Equality is comparatively rare among manually written tests, which seldom rely on the `toString` method. On the other hand, the EvoSuite and JTExpert test suites did not contain any instances of the Resource Optimism test smell, whereas among developer-written suites we do observe some that make optimistic assumptions about file-system availability and run-time performance.
| Smell | Manual tests |
| --- | --- |
| Eager Test | 80% |
| Assertion Roulette | 82% |
| Indirect Testing | 20% |
| Sensitive Equality | 10% |
| Mystery Guest | 0% |
| Resource Optimism | 10% |
In manually written tests, the `toString` method is almost never directly invoked inside an `Assert` method, which the Grano et al. tool cannot handle. We also note that both tools, each of which claims to detect Resource Optimism, perform poorly in detecting actual cases of this smell. This may be because their definitions and detection strategies for this smell are overly narrow. Finally, the tool used by Grano et al. performs very poorly at detecting Indirect Testing, finding zero instances of it. This matches the tool's performance on the automatically generated test suites; indeed, to the best of our knowledge, it is incapable of detecting this smell, despite claiming to.
| Test smell | Tool used by Grano et al. (2019) | | | | | tsDetect calibrated by Spadini et al. (2020) | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | FPR | FNR | Prec. | Rec. | F1 | FPR | FNR | Prec. | Rec. | F1 |
| Assertion Roulette | 0.74 | 0.23 | 0.62 | 0.76 | 0.68 | 0.00 | 0.20 | 1.00 | 0.80 | 0.89 |
| Eager Test | 0.60 | 0.31 | 0.81 | 0.69 | 0.75 | 0.10 | 0.61 | 0.94 | 0.39 | 0.55 |
| Mystery Guest | 0.11 | – | – | – | – | 0.01 | – | – | – | – |
| Sensitive Equality | 0.09 | 0.80 | 0.20 | 0.20 | 0.20 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| Resource Optimism | 0.05 | 0.80 | 0.33 | 0.20 | 0.25 | 0.05 | 0.80 | 0.33 | 0.20 | 0.25 |
| Indirect Testing | 0.00 | 1.00 | – | 0.00 | – | – | – | – | – | – |
4.5 RQ5: How Well do Test Smells Capture Real Problems in Manually Written Tests?
One such test contains two method invocations: one on the `Counter` class (the class under test) and one that creates an initial state for the class by invoking a method on `ServiceTestUtil`. These method invocations have to co-occur for the purpose of the test scenario and cannot be refactored into separate tests; as such, this test should not be considered smelly.
In another case, several different inputs to the `tokenize` method are tested with no object-state relation between them. Here, the test would be better off split into two distinct tests, or into a parameterized test. We see similar traits in the other three test suites that are both eager and incoherent. This might be because the developers did not deem it worthwhile to split tests into multiple tests, for fear of introducing clones; or these test suites may still be using JUnit 3 (as in the case of Fig. 9), where test parameterization is hard.
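The suggested refactoring can be sketched in a table-driven style, a lightweight substitute for framework-level parameterization. The `tokenize` stand-in below is hypothetical (the original method belongs to the studied project); each entry in the case table is an independent scenario, so a failure identifies exactly one input.

```java
import java.util.*;

public class TokenizeCases {
    // Stand-in implementation: whitespace-splitting tokenizer.
    static List<String> tokenize(String s) {
        return s.isEmpty() ? List.of() : Arrays.asList(s.split("\\s+"));
    }

    public static void main(String[] args) {
        // Each entry is an independent case: input -> expected tokens.
        Map<String, List<String>> cases = new LinkedHashMap<>();
        cases.put("a b", Arrays.asList("a", "b"));
        cases.put("", List.of());

        for (Map.Entry<String, List<String>> c : cases.entrySet()) {
            List<String> actual = tokenize(c.getKey());
            if (!actual.equals(c.getValue()))
                throw new AssertionError("tokenize(\"" + c.getKey() + "\")");
        }
        System.out.println("all cases pass");
    }
}
```

In JUnit 4 and later the same structure maps directly onto parameterized tests, avoiding both the eager single test and the feared clones.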
Many of the flagged asserts are `assertNull` or `assertNotNull`, whose failure is rather self-explanatory. Meanwhile, the other 20 test suites are based on JUnit 4, where an assert failure is explained by the JUnit runtime, and the very notion of the lack of documentation as a “smell” is thus rather doubtful.

Other flagged asserts involve core Java API calls such as `List.size()`. In one parser test, the `hasNext` invocation is on the iterator and indirectly tests it. In the context of this test case, we see that the assert on this invocation is required and ensures that a part of the parser works correctly. Under the strictest definition of indirect testing, this would be considered problematic. However, the usefulness of such a standard is dubious: the Java core API is surely the most thoroughly vetted Java code in existence, so requiring mocking of basic `List` or `Iterator` methods is virtually pointless. The developers who wrote these tests evidently agree on this front; our findings thus suggest that invocations of the Java API should be exempt.
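A hypothetical reconstruction of the pattern (the parser here is a stand-in, not the studied code) shows why the strict rule misfires: the asserted method belongs to `java.util.Iterator`, yet the check is clearly about the parser's output.

```java
import java.util.*;

public class IteratorAssertExample {
    // Stand-in for the class under test: yields parsed tokens.
    static Iterator<String> parse(String input) {
        return Arrays.asList(input.split(",")).iterator();
    }

    public static void main(String[] args) {
        Iterator<String> it = parse("a,b");
        // The asserted method (hasNext) is declared on Iterator, not on
        // the parser, so a strict Indirect Testing rule flags it; yet
        // the assertion meaningfully checks that parsing produced output.
        if (!it.hasNext()) throw new AssertionError("expected a token");
        System.out.println(it.next());  // a
    }
}
```

Exempting core JDK types from the indirect-testing check would eliminate this class of false positives without weakening the smell's intent.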
In some cases, the `toString` method is invoked as part of a test that explicitly means to test the `toString` method itself, immediately rendering the smell incorrect. In the other three cases, a cursory investigation suggested that the only way to get the value of the object under test was through its string representation. While this is arguably indicative of a poor implementation of the underlying object, it should hardly qualify as a smell in the testing code.

The Resource Optimism cases hinge on assumptions that fail when, for example, the assumed path (`tmp`) is incorrect or the operating system on which the test is being executed is non-Unix based. One especially strange case, shown in Fig. 11, encodes the strong assumption that enough processing power is available to generate 100,000 passwords in 2 s.
5 Qualitative Reflection
5.1 Manually Written Tests: are Smells a Problem?
In generated test suites, `toString`-based comparisons were far more rampant and nearly always problematic, whereas we found them to be both rare and essentially harmless in developer-written tests. In contrast, the bulk of these test suites contained eager tests, more so than the generated ones; yet nearly none of the former were semantically incoherent, whereas poorly related assertions were abundant in the latter. Anecdotally, the test smells were simply less useful on human-written test suites.

5.2 On Rule-Based Detection of Test Smells
Generated tests rarely explain their assertions (e.g., via a message passed to the `fail` method); automatically generating failure-related messages is out of the scope of current tools. This results in automated tools marking many of their tests as smelly, in many cases incorrectly so. For one, test cases with just a single assert, even if not explained, should never involve this confusion. Furthermore, it is debatable whether, e.g., `assertNull` needs an explanatory message at all, as the expected behavior is encoded in its name. More generally, advances in the JUnit framework have removed the traceability confound entirely. We still annotated some cases with this smell based on a strict adherence to its definition, but suggest that this smell has become obsolete, which is further reinforced by its high degree of overlap with Eager Test.

When a test checks values through the `toString` method, it is considered “sensitive”. Grano et al. interpret this as the presence of a `toString` call specifically in an assert statement (Grano et al. 2019). However, we found that EvoSuite often stores the result of a `toString` in a local variable before checking its value, so this detection rule has many false negatives. This pattern suggests a disconnect between human-written and automatically generated test suites; the proposed rule may work well on regular tests, but falls short on those automatically generated by EvoSuite.

5.3 On Issues in Automatically Generated Tests Not Included in Test Smells
In one example, the parameter `ISession i0` helps to cover the elaborate initialization code of the class `ObjectTreeCellRenderer`; yet eventually the constructor throws a `NullPointerException`. This is again indicative of a mismatch between covering code and meeting actual requirements: while EvoSuite succeeded in achieving high coverage through this setup, the resulting test is unlikely to be helpful for finding faults, besides being hard to maintain.
6 Threats to Validity
7 Discussion, Lessons Learned, and Future Directions
7.1 Lessons Learned
- Lesson 1. A non-trivial portion of the generated test cases contains at least one test smell. However, the occurrence is much less frequent than reported in prior studies (Grano et al. 2019; Palomba et al. 2016). For example, mystery guest and resource optimism have been reported in prior studies as frequent in automatically generated tests. However, these smells cannot occur, due to the mechanisms tools like EvoSuite and JTExpert use to prevent creating random files that could damage the machine on which experiments are performed. The substantial differences between our results and those reported in prior studies stem from the different evaluation processes: Grano et al. (2019) and Palomba et al. (2016) did not manually validate the warnings raised by test smell detection tools, under the very optimistic (and unrealistic) assumption that these tools are highly accurate.
- Lesson 2. Test smell detection tools are very inaccurate in detecting test smells in automatically generated test cases. The two state-of-the-art detection tools largely overestimate the occurrences of assertion roulette, eager test, resource optimism, and mystery guest. The root causes of the low accuracy differ depending on the type of test smell under analysis. We can summarize our findings as follows:
  - For assertion roulette and eager test, existing tools rely on rule sets that count the number of assertions and method calls in a test case as a proxy for the number of functionalities under test and asserted. Our results suggest that such simple heuristics are highly inaccurate.
  - Test smell detection tools based on purely static rules cannot adequately determine whether test cases actually access external files and resources, leading to a very large false-positive rate for mystery guest and resource optimism.
  - None of the test smell detection tools could detect indirect testing instances, which constitute the most frequent test smell in the test cases generated by EvoSuite.
  - Many instances of sensitive equality went undetected due to rule sets that fail to cover (many) corner-case scenarios and patterns.
- Lesson 3. Given the results of our first two research questions, in RQ3 we investigated whether the existing catalog of test smells reflects real maintenance problems in automatically generated test cases. Our results indicate that generated test cases are affected by eager tests and multiple assertions, but their severity is debatable. As shown by Spadini et al. (2018), developers consider these types of smells non-problematic, with a very low severity or priority for test fixing operations. Meanwhile, mystery guest and resource optimism do not represent real problems for modern test case generation tools, which use mocks or bytecode instrumentation techniques.
- Lesson 4. Since test smell detection tools have been designed for manually written tests, one could argue that the low accuracy observed in RQ2 is due to intrinsic differences between test cases written by developers and those generated by automated techniques. The results of RQ4 showed that the distribution of test smells and their occurrences differ between manually written and generated tests. Eager test, assertion roulette, mystery guest, and resource optimism are more frequent among test cases written by developers than among those automatically generated. Conversely, sensitive equality and indirect testing are more common in automatically generated test cases.
- Lesson 5. The accuracy of test smell detection tools is higher for manually written test cases than for those generated automatically. This is partially because these tools have been designed and tuned for test cases written by humans. Even so, the detection tools assessed in our study perform poorly for indirect testing, resource optimism, and mystery guest.
- Lesson 6. After carefully validating the test smell instances among manually written test cases, we conclude that test smells do not reflect real concerns with respect to test maintainability. Our results lead to conclusions similar to those of Spadini et al. (2018), whose study reported a large misalignment between test smell instances and what developers consider actual test maintainability concerns. Tufano et al. (2016) reported that developers did not recognize any problems in the test code snippets they were presented with, although, in theory, those tests were affected by test smells. While prior studies questioned the developers' ability to recognize test smells, our results and analysis lead to a completely different conclusion: many test smells (as currently defined and detected) do not reflect real concerns.
7.2 The Path Forward Concerning Test Smells
Modern testing frameworks (e.g., `unittest` in Python) explicitly indicate which assertion failed and the reason for it, making Assertion Roulette outdated as a potential test smell.
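The point can be illustrated with a plain-Java stand-in that mimics the self-describing failure messages modern frameworks produce (JUnit's `assertEquals` behaves similarly; the helper below is our own, not a framework API):

```java
public class SelfExplainingAssert {
    // Minimal assertion helper: the failure message itself reports
    // expected vs. actual, as modern framework assertions do.
    static void assertEquals(Object expected, Object actual) {
        if (!expected.equals(actual))
            throw new AssertionError(
                "expected: <" + expected + "> but was: <" + actual + ">");
    }

    public static void main(String[] args) {
        try {
            assertEquals(4, 2 + 3);
        } catch (AssertionError e) {
            // The failure already identifies the offending values, making a
            // hand-written explanatory message largely redundant.
            System.out.println(e.getMessage()); // expected: <4> but was: <5>
        }
    }
}
```

Because each failed assertion identifies itself, a test with several unexplained assertions no longer leaves the developer guessing which one fired, which is precisely the concern Assertion Roulette was meant to capture.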