1 Introduction
2 Related Work
3 Goals and Research Questions
4 Dataset Construction
4.1 Projects Selection
4.2 Test Smell Selection
`toString` method of an object, while the detector by Bavota et al. (2012) verifies that a `toString` method of an object is called within an assertion. According to the performance reported in these previous papers, TsDetect (Peruma et al. 2020) reaches an F-Measure of 90%, while Bavota et al. (2012) reported an F-Measure of 100%. Based on these considerations, we decided to discard these two test smells, resulting in a final set of four test smells, reported in Table 4 together with their definitions.

4.3 Test Smell Data Collection
Validation step | Inspector #1 | Inspector #2 | 200 external practitioners
---|---|---|---
#1 Initial Validation | 963 test cases | |
#2 Internal Validation | 4,335 test cases | 4,335 test cases |
#3 External Validation | | | 480 test cases
# | Section 1: Participant's background | Type
---|---|---
#1 | What kind of developer are you? | Multiple choice (Industrial, Open-source, Startup, Student, Researcher)
#2 | How many years of experience do you have with the Java programming language? | Paragraph
#3 | Please rate your level of expertise with the Java programming language. | 5-point Likert scale
#4 | How many years of experience do you have in Software Testing? | Paragraph
#5 | To what extent do you perform each of the following types of testing in your projects? | Multiple-choice grid (Unit, Integration, System, Acceptance, Usability testing; from "Never" to "Frequently")
#6 | How familiar are you with the concept of test smells, i.e., symptoms of sub-optimal design choices adopted when developing test cases? | 5-point Likert scale
Distribution of smell co-occurrences (1 = test affected by the smell), in the externally validated subset and in the entire dataset:

Test red. | Res. opt. | Mystery guest | Eager test | Total (ext. valid.) | Total (entire dataset)
---|---|---|---|---|---
0 | 0 | 0 | 0 | 307 | 5,976
0 | 0 | 0 | 1 | 103 | 2,082
0 | 0 | 1 | 0 | 22 | 413
0 | 0 | 1 | 1 | 14 | 391
0 | 1 | 0 | 0 | 0 | 3
0 | 1 | 0 | 1 | 0 | 0
0 | 1 | 1 | 0 | 23 | 513
0 | 1 | 1 | 1 | 8 | 207
1 | 0 | 0 | 0 | 0 | 17
1 | 0 | 0 | 1 | 1 | 13
1 | 0 | 1 | 0 | 0 | 0
1 | 0 | 1 | 1 | 1 | 3
1 | 1 | 0 | 0 | 0 | 0
1 | 1 | 0 | 1 | 1 | 0
1 | 1 | 1 | 0 | 0 | 7
1 | 1 | 1 | 1 | 0 | 0
5 Machine Learning-based Test Smell Detection
Test smell | Definition | Metric | Description | Structural/textual
---|---|---|---|---
Eager test | A test method involving many methods of the object being tested. | NMC | Number of method calls | Structural
| | TMC | Test method cohesion, i.e., the average textual similarity between all pairs of methods called by a test method | Textual
| | TS | Textual scattering, i.e., the extent to which the text within the method body is conceptually scattered | Textual
Mystery guest | A test that uses external resources (e.g., databases or files). | NRF | Number of references to files | Structural
| | NRDB | Number of references to databases | Structural
Resource optimism | A test that uses external resources without checking their state. | ERNC | State of external resources (other than files) not checked | Structural
| | FRNC | State of file resources not checked | Structural
Test redundancy | A test that could be removed without impacting the test suite. | PR | Pair redundancy, i.e., the ratio between the items covered by a test and those covered by another test | Structural
| | SR | Suite redundancy, i.e., the ratio between the items covered by a test and those covered by all other tests in the test suite | Structural
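To make the two redundancy metrics concrete, the following minimal sketch shows how PR and SR could be computed, assuming each test's coverage is available as a set of covered items (e.g., statement or branch identifiers); class and method names are ours, not the paper's implementation:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of the coverage-based redundancy metrics (PR and SR) described above. */
public class RedundancyMetrics {

    /** Pair Redundancy: fraction of items covered by `test` that are also covered by `other`. */
    public static double pairRedundancy(Set<String> test, Set<String> other) {
        if (test.isEmpty()) return 0.0;
        Set<String> shared = new HashSet<>(test);
        shared.retainAll(other);
        return (double) shared.size() / test.size();
    }

    /** Suite Redundancy: fraction of items covered by `testName` that are also covered by any other test. */
    public static double suiteRedundancy(String testName, Map<String, Set<String>> coverage) {
        Set<String> test = coverage.get(testName);
        if (test == null || test.isEmpty()) return 0.0;
        Set<String> others = new HashSet<>();
        coverage.forEach((name, covered) -> {
            if (!name.equals(testName)) others.addAll(covered);
        });
        Set<String> shared = new HashSet<>(test);
        shared.retainAll(others);
        return (double) shared.size() / test.size();
    }
}
```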
6 Research Method and Results
6.1 RQ\(_1\) - In Search of Suitable Metrics for Machine Learning-Based Test Smell Detection
Test smell | Metric | Within-project | Cross-project
---|---|---|---
Eager test | NMC: Number of Method Calls | 0.037 | 0.007
| TMC: Test Method Cohesion, i.e., the average textual similarity between all pairs of methods called by a test method | 0.428 | 0.559
| TS: Textual Scattering, i.e., the extent to which the text within the method body is conceptually scattered | 0.428 | 0.559
Mystery guest | NRF: Number of References to Files | 0.661 | 0.042
| NRDB: Number of References to Database | 0.015 | 0.001
Resource optimism | ERNC: state of External Resources (other than files) Not Checked | 0.012 | 0.007
| FRNC: state of File Resources Not Checked | 0.052 | 0.022
Test redundancy | PR: Pair Redundancy, i.e., the ratio between the items covered by a test and those covered by another test | 0.001 | 0.000
| SR: Suite Redundancy, i.e., the ratio between the items covered by a test and those covered by all other tests in the test suite | 0.001 | 0.001
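For reference, the information gain reported above follows the standard formulation: for a smelliness label \(Y\) and a metric \(X\) (the paper's exact operationalization, e.g., the discretization of \(X\), may differ):

\[
IG(Y;X) \;=\; H(Y) - H(Y \mid X), \qquad H(Y) \;=\; -\sum_{y} p(y)\,\log_2 p(y).
\]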
6.2 RQ\(_2\) - Assessing the Performance of our Machine Learning-Based Test Smell Detector
`nemenyi` function available in the R toolkit.

Within-project performance of the ML-based detector, without (w/o) and with (w/) hyper-parameter tuning (HT):

Test smell | Precision w/o HT | Precision w/ HT | Recall w/o HT | Recall w/ HT | Accuracy w/o HT | Accuracy w/ HT
---|---|---|---|---|---|---
Eager test | 0.47 | 0.48 | 0.53 | 0.54 | 0.68 | 0.68
Mystery guest | 0.64 | 0.64 | 0.34 | 0.34 | 0.83 | 0.84
Resource opt. | 0.33 | 0.33 | 0.31 | 0.36 | 0.85 | 0.84
Test red. | 0.08 | 0.01 | 1.00 | 0.97 | 0.05 | 0.03

Test smell | F-Measure w/o HT | F-Measure w/ HT | MCC w/o HT | MCC w/ HT | AUC-PR w/o HT | AUC-PR w/ HT
---|---|---|---|---|---|---
Eager test | 0.50 | 0.51 | 0.27 | 0.28 | 0.49 | 0.50
Mystery guest | 0.45 | 0.44 | 0.39 | 0.38 | 0.59 | 0.55
Resource opt. | 0.32 | 0.34 | 0.24 | 0.25 | 0.53 | 0.53
Test red. | 0.01 | 0.01 | 0.01 | 0.01 | 0.52 | 0.52
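As a reminder, the Matthews Correlation Coefficient (MCC) reported in these tables follows the standard definition:

\[
MCC \;=\; \frac{TP \cdot TN \;-\; FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}.
\]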
Cross-project performance of the ML-based detector, without (w/o) and with (w/) hyper-parameter tuning (HT):

Test smell | Precision w/o HT | Precision w/ HT | Recall w/o HT | Recall w/ HT | Accuracy w/o HT | Accuracy w/ HT
---|---|---|---|---|---|---
Eager test | 0.27 | 0.30 | 0.64 | 0.54 | 0.42 | 0.53
Mystery guest | 0.44 | 0.44 | 0.37 | 0.37 | 0.82 | 0.82
Resource opt. | 0.25 | 0.24 | 0.32 | 0.30 | 0.87 | 0.87
Test red. | 0.004 | 0.01 | 0.97 | 0.97 | 0.05 | 0.03

Test smell | F-Measure w/o HT | F-Measure w/ HT | MCC w/o HT | MCC w/ HT | AUC-PR w/o HT | AUC-PR w/ HT
---|---|---|---|---|---|---
Eager test | 0.38 | 0.39 | -0.01 | 0.06 | 0.32 | 0.33
Mystery guest | 0.40 | 0.40 | 0.30 | 0.30 | 0.46 | 0.41
Resource opt. | 0.28 | 0.26 | 0.22 | 0.20 | 0.27 | 0.28
Test red. | 0.01 | 0.01 | 0.01 | 0.00 | 0.41 | 0.13
6.3 RQ\(_3\) - Comparing Machine Learning- and Heuristic-Based Techniques for Test Smell Detection
`File` instance without calling any of the methods `exists()`, `isFile()`, or `notExists()`.
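As an illustration of the pattern being checked, consider a small, hypothetical JUnit 4 example (file path and names are ours): the first test is optimistic about the file's existence, while the second verifies the state of the resource before using it:

```java
import java.io.File;
import java.nio.file.Files;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

/** Hypothetical example of Resource Optimism and its defensive counterpart. */
public class ConfigLoaderTest {

    private final File config = new File("src/test/resources/config.properties");

    // Smelly (Resource Optimism): the test assumes the external file exists.
    @Test
    public void readsConfigOptimistically() throws Exception {
        String content = new String(Files.readAllBytes(config.toPath()));
        assertTrue(content.contains("timeout"));
    }

    // Safer: the state of the external resource is checked before it is used.
    @Test
    public void readsConfigDefensively() throws Exception {
        assertTrue("missing fixture: " + config, config.exists() && config.isFile());
        String content = new String(Files.readAllBytes(config.toPath()));
        assertTrue(content.contains("timeout"));
    }
}
```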
Within-project comparison between the ML-based detector and TsDetect:

Test smell | ML within (Precision) | TsDetect (Precision) | ML within (Recall) | TsDetect (Recall)
---|---|---|---|---
Eager test | 0.47 | 0.37 | 0.53 | 0.17
Mystery guest | 0.64 | 0.42 | 0.34 | 0.44
Resource opt. | 0.33 | 0.21 | 0.31 | 0.37

Test smell | ML within (F-Measure) | TsDetect (F-Measure) | ML within (MCC) | TsDetect (MCC)
---|---|---|---|---
Eager test | 0.50 | 0.23 | 0.27 | 0.06
Mystery guest | 0.45 | 0.43 | 0.39 | 0.29
Resource opt. | 0.32 | 0.27 | 0.24 | 0.15
Cross-project comparison between the ML-based detector and TsDetect:

Test smell | ML cross (Precision) | TsDetect (Precision) | ML cross (Recall) | TsDetect (Recall)
---|---|---|---|---
Eager test | 0.27 | 0.35 | 0.64 | 0.16
Mystery guest | 0.44 | 0.40 | 0.37 | 0.40
Resource opt. | 0.25 | 0.18 | 0.32 | 0.37

Test smell | ML cross (F-Measure) | TsDetect (F-Measure) | ML cross (MCC) | TsDetect (MCC)
---|---|---|---|---
Eager test | 0.38 | 0.22 | -0.01 | 0.06
Mystery guest | 0.40 | 0.40 | 0.30 | 0.29
Resource opt. | 0.28 | 0.25 | 0.22 | 0.17
Within-project comparison between the ML-based detector and Darts (Eager Test only):

Test smell | ML within (Precision) | Darts (Precision) | ML within (Recall) | Darts (Recall)
---|---|---|---|---
Eager test | 0.47 | 0.33 | 0.53 | 0.31

Test smell | ML within (F-Measure) | Darts (F-Measure) | ML within (MCC) | Darts (MCC)
---|---|---|---|---
Eager test | 0.50 | 0.32 | 0.27 | 0.04
Cross-project comparison between the ML-based detector and Darts (Eager Test only):

Test smell | ML cross (Precision) | Darts (Precision) | ML cross (Recall) | Darts (Recall)
---|---|---|---|---
Eager test | 0.27 | 0.30 | 0.64 | 0.31

Test smell | ML cross (F-Measure) | Darts (F-Measure) | ML cross (MCC) | Darts (MCC)
---|---|---|---|---
Eager test | 0.38 | 0.30 | -0.01 | 0.03
Within-project comparison between the ML-based detector and TeReDetect (Test Redundancy only):

Test smell | ML within (Precision) | TeReDetect (Precision) | ML within (Recall) | TeReDetect (Recall)
---|---|---|---|---
Test red. | 0.01 | 0.00 | 1.00 | 0.00

Test smell | ML within (F-Measure) | TeReDetect (F-Measure) | ML within (MCC) | TeReDetect (MCC)
---|---|---|---|---
Test red. | 0.01 | 0.00 | 0.01 | -0.01
Cross-project comparison between the ML-based detector and TeReDetect (Test Redundancy only):

Test smell | ML cross (Precision) | TeReDetect (Precision) | ML cross (Recall) | TeReDetect (Recall)
---|---|---|---|---
Test red. | 0.01 | 0.00 | 0.97 | 0.00

Test smell | ML cross (F-Measure) | TeReDetect (F-Measure) | ML cross (MCC) | TeReDetect (MCC)
---|---|---|---|---
Test red. | 0.01 | 0.00 | 0.01 | -0.01
Overlap analysis (within-project) between the sets of instances correctly classified (\(_{corr}\)) by the ML-based approach and by each heuristic tool:

Eager test

\(\text{ML}_{corr} \cap \text{Darts}_{corr}\) | \(\text{ML}_{corr} \setminus \text{Darts}_{corr}\) | \(\text{Darts}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
26% | 53% | 21%

\(\text{ML}_{corr} \cap \text{TsDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TsDetect}_{corr}\) | \(\text{TsDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
12% | 76% | 12%

Mystery guest

\(\text{ML}_{corr} \cap \text{TsDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TsDetect}_{corr}\) | \(\text{TsDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
72% | 5% | 23%

Resource optimism

\(\text{ML}_{corr} \cap \text{TsDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TsDetect}_{corr}\) | \(\text{TsDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
60% | 13% | 27%

Test redundancy

\(\text{ML}_{corr} \cap \text{TeReDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TeReDetect}_{corr}\) | \(\text{TeReDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
0% | 100% | 0%
Overlap analysis (cross-project):

Eager test

\(\text{ML}_{corr} \cap \text{Darts}_{corr}\) | \(\text{ML}_{corr} \setminus \text{Darts}_{corr}\) | \(\text{Darts}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
26% | 60% | 14%

\(\text{ML}_{corr} \cap \text{TsDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TsDetect}_{corr}\) | \(\text{TsDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
15% | 78% | 7%

Mystery guest

\(\text{ML}_{corr} \cap \text{TsDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TsDetect}_{corr}\) | \(\text{TsDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
14% | 67% | 19%

Resource optimism

\(\text{ML}_{corr} \cap \text{TsDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TsDetect}_{corr}\) | \(\text{TsDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
11% | 71% | 18%

Test redundancy

\(\text{ML}_{corr} \cap \text{TeReDetect}_{corr}\) | \(\text{ML}_{corr} \setminus \text{TeReDetect}_{corr}\) | \(\text{TeReDetect}_{corr} \setminus \text{ML}_{corr}\)
---|---|---
0% | 100% | 0%
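Since each row sums to 100%, the percentages are computed over the union of the instances correctly classified by the two approaches. A minimal sketch of that computation (names are ours):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the overlap metrics above, as fractions of the union of the two sets. */
public class OverlapMetrics {

    static double percent(Set<String> part, Set<String> union) {
        return union.isEmpty() ? 0.0 : 100.0 * part.size() / union.size();
    }

    /** Prints the three overlap figures for two sets of correctly classified instances. */
    public static void report(Set<String> mlCorr, Set<String> toolCorr) {
        Set<String> union = new HashSet<>(mlCorr);
        union.addAll(toolCorr);

        Set<String> both = new HashSet<>(mlCorr);
        both.retainAll(toolCorr);                 // ML ∩ tool

        Set<String> onlyMl = new HashSet<>(mlCorr);
        onlyMl.removeAll(toolCorr);               // ML \ tool

        Set<String> onlyTool = new HashSet<>(toolCorr);
        onlyTool.removeAll(mlCorr);               // tool \ ML

        System.out.printf("both=%.0f%% onlyML=%.0f%% onlyTool=%.0f%%%n",
                percent(both, union), percent(onlyMl, union), percent(onlyTool, union));
    }
}
```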
7 Discussion, Further Analysis, and Qualitative Insights
7.1 Machine Learning-based Test Smell Detection: How Bad Is It?
Within-project Type I (false positive) and Type II (false negative) errors, comparing the ML-based approach with the baseline classifiers:

Test smell | ML Type I | ML Type II | Optimistic constant Type I | Optimistic constant Type II
---|---|---|---|---
Eager test | 1,524 (17%) | 1,240 (14%) | 6,079 (70%) | 0 (0%)
Mystery guest | 239 (3%) | 817 (10%) | 6,118 (80%) | 0 (0%)
Resource opt. | 445 (7%) | 481 (8%) | 5,576 (89%) | 0 (0%)
Test red. | 2,302 (72%) | 0 (0%) | 3,169 (99%) | 0 (0%)

Test smell | Pessimistic constant Type I | Pessimistic constant Type II | Random constant Type I | Random constant Type II
---|---|---|---|---
Eager test | 0 (0%) | 2,648 (30%) | 3,246 (37%) | 1,293 (15%)
Mystery guest | 0 (0%) | 1,487 (20%) | 3,409 (45%) | 764 (10%)
Resource opt. | 0 (0%) | 688 (11%) | 2,995 (48%) | 346 (6%)
Test red. | 0 (0%) | 40 (1%) | 2,089 (65%) | 26 (1%)
Cross-project Type I and Type II errors:

Test smell | ML Type I | ML Type II | Optimistic constant Type I | Optimistic constant Type II
---|---|---|---|---
Eager test | 4,578 (48%) | 942 (10%) | 6,934 (72%) | 0 (0%)
Mystery guest | 723 (8%) | 955 (10%) | 8,099 (84%) | 0 (0%)
Resource opt. | 691 (7%) | 492 (5%) | 8,903 (92%) | 0 (0%)
Test red. | 9,105 (95%) | 1 (0.01%) | 9,593 (99%) | 0 (0%)

Test smell | Pessimistic constant Type I | Pessimistic constant Type II | Random constant Type I | Random constant Type II
---|---|---|---|---
Eager test | 0 (0%) | 2,699 (28%) | 3,485 (36%) | 1,388 (14%)
Mystery guest | 0 (0%) | 1,534 (16%) | 4,121 (43%) | 780 (8%)
Resource opt. | 0 (0%) | 730 (8%) | 4,462 (46%) | 364 (4%)
Test red. | 0 (0%) | 40 (0.4%) | 4,822 (50%) | 27 (0.3%)
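For clarity, the three baselines amount to trivial predictors, as the error patterns above suggest: the optimistic constant never misses a smell but flags everything, while the pessimistic constant never raises false alarms but misses everything. A minimal sketch, assuming the random baseline is an unbiased coin flip (the paper's exact randomization may differ):

```java
import java.util.Random;

/** Sketch of the three baseline classifiers used for comparison above. */
public class BaselineClassifiers {
    private static final Random RND = new Random(42);

    static boolean optimistic()  { return true;  }              // always "smelly": only Type I errors
    static boolean pessimistic() { return false; }              // always "clean": only Type II errors
    static boolean random()      { return RND.nextBoolean(); }  // coin flip: both error types
}
```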
7.2 Test Smell Detection: A Research Field to Revisit?
Consider the test suite `ZKUtil` of the `HBase` project, a utility class built around ZooKeeper, i.e., a centralized service to maintain configuration information and provide distributed synchronization. The production method under test is named `setData` and is responsible for storing version data within an internal data structure. The test exercises an individual production method, i.e., `setData`, yet it calls various methods of the same production class, i.e., `createWithParents` and `multiOrSequential`. All the detectors under study classified this instance as smelly. However, this is a false positive: the calls to the other production class methods are required to exercise the `setData` method under different configurations, covering an execution path that could not be reached without performing those calls. For this reason, the test cannot be considered an Eager Test. Based on the arguments above, we argue that the definition of this smell should be revisited to account for the levels of granularity that should be preserved in unit testing.
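To visualize the shape of this false positive, here is a condensed, hypothetical rendition (the stand-in class and data are ours, not the actual HBase code):

```java
import static org.junit.Assert.assertArrayEquals;
import org.junit.Test;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, condensed rendition of the pattern discussed above. */
public class VersionedStoreTest {

    /** Minimal stand-in for the production class under test. */
    static class VersionedStore {
        private final Map<String, byte[]> nodes = new HashMap<>();
        void createWithParents(String path) { nodes.putIfAbsent(path, new byte[0]); }
        void setData(String path, byte[] value) {
            if (!nodes.containsKey(path)) throw new IllegalStateException("missing node: " + path);
            nodes.put(path, value);
        }
        byte[] getData(String path) { return nodes.get(path); }
    }

    // Structurally this looks like an Eager Test (several production calls), yet only
    // setData is under test: the other call is setup needed to reach the
    // "node already exists" execution path.
    @Test
    public void setDataOverwritesExistingNode() {
        VersionedStore store = new VersionedStore();
        store.createWithParents("/conf/a");       // setup, not a tested behavior
        byte[] payload = {1, 2, 3};
        store.setData("/conf/a", payload);        // the method actually under test
        assertArrayEquals(payload, store.getData("/conf/a"));
    }
}
```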
The test `testWhenValidPreProcessorsSet` leverages the Mockito framework, a well-known instrument to enable mocking, to simulate the behavior of the `ConfigurableProcessorsFactory` class and obtain parameters to use within the test. In this case, all detectors failed, as they mistakenly accounted for this call. As such, the definition of mocking-aware metrics would boost test smell detection capabilities.

When detecting Eager Test instances, the tools under study rely on a naming-convention-based test-to-code traceability technique that takes the name of a test class (e.g., `DoubleConverterTest.java`
) and looks for the production class having the same name as the test class after removing the `Test` suffix or prefix (e.g., `DoubleConverter.java`). If the search succeeds, the test class is associated with the production class and, in a subsequent information-gathering phase, the individual test methods of the test suite are linked to production methods through the same traceability technique. If the search fails, the linking is not performed and, therefore, the Eager Test detection fails. In this respect, there are two considerations to make. First, the traceability technique employed by the tools is well known in the literature and has been evaluated multiple times (Qusef et al. 2014; Van Rompaey and Demeyer 2009; Parizi et al. 2014), showing an accuracy close to 85%, comparable with more sophisticated but less scalable techniques (e.g., the slicing-based approach proposed by Qusef et al. (2014)). Of course, the overall accuracy of the test smell detection process is bounded by the accuracy of the linking process; as such, improvements in the field of traceability recovery might provide insights for test smell detection. Second, it is worth discussing the sneakiest failure motivation, where the linking is correctly performed but the information available in the production class is not sufficient to perform the detection. To reason about this motivation, let us consider the example shown in Listing 3.

The test suite is named `EmbeddedJSPResultTest`
and was classified as an Eager Test instance. According to the outcome of the information-gathering phase, the test suite was linked to the `EmbeddedJSPResult` production class. Nonetheless, that production class was only an interface for another class, i.e., `JSPRuntime`, which was responsible for the actual operations exercised by the `testCacheInstanceWithManyThreads` method. More specifically, the code of the `EmbeddedJSPResult` class is shown in Listing 4. `EmbeddedJSPResult` contains just one method, i.e., `doExecute`, which delegates its operations to the `handle` method of the `JSPRuntime` class. Because of that, `EmbeddedJSPResult` does not contain any method that could be linked to the `testCacheInstanceWithManyThreads` test and, for this reason, the test smell detectors could not compute the metrics that would have allowed its detection. In other words, we may consider this example a case of conceptual false positive produced by the traceability technique: the link is technically correct, yet the linked class is not the actual production class under test. On the one hand, the use of more advanced test-to-code traceability techniques (e.g., Qusef et al. 2014; Parizi et al. 2014) might boost the overall test smell detection capabilities. On the other hand, the example may inform possible improvements to test-to-code traceability based on pattern matching and naming conventions. As a final point of discussion, we may argue that the `EmbeddedJSPResult` class (Listing 4) could be affected by the so-called Middle Man code smell (Fowler and Beck 1999), which arises when a class delegates all its operations to other classes, hence uselessly increasing the complexity and computational costs of the system. In other terms, our analysis suggests that the presence of code smells in production code may affect test smell detection capabilities: the intrinsic relation between code and test smells is something we plan to explore as part of our future research agenda.
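Listing 4 is not reproduced here; based on the description above, its shape is roughly the following (a paraphrase with a stub `JSPRuntime`, not the verbatim listing):

```java
/** Paraphrase of the delegation described above (not the verbatim Listing 4). */
public class EmbeddedJSPResult {

    /** Stub standing in for the runtime class that performs the actual work. */
    static class JSPRuntime {
        static void handle(String location) {
            // actual JSP handling happens here in the real project
        }
    }

    // The class exposes a single method and delegates everything to JSPRuntime: a
    // "Middle Man" shape that leaves no method for test-to-code linking to attach to.
    public void doExecute(String location) throws Exception {
        JSPRuntime.handle(location);
    }
}
```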
The test `shouldReturnNullValueFromSession` of the project `Pippo`, a micro web framework for Java, makes significant use of mock objects to simulate navigation session values. Such a dependency was therefore interpreted as a Mystery Guest instance. At the same time, the code does not check the status of the mock; therefore, it was also erroneously classified as a Resource Optimism instance. In conclusion, we emphasize that mocking practices notably impact the performance of test smell detectors and that, therefore, novel mocking-aware detection strategies may provide significant contributions to the field.
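For illustration, a hypothetical test with a similar shape (names are ours, not Pippo's actual code); a structural detector that counts references to session-like resources can misread the mock as an external dependency:

```java
import static org.junit.Assert.assertNull;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import org.junit.Test;

/** Hypothetical shape of the case discussed above. */
public class SessionLookupTest {

    /** Stand-in for the framework's session abstraction. */
    interface Session {
        Object get(String key);
    }

    @Test
    public void shouldReturnNullValueFromSession() {
        // The "external resource" is only a mock: no database or navigation state is
        // touched, yet structural detectors may count this as a Mystery Guest cue.
        Session session = mock(Session.class);
        when(session.get("missing-key")).thenReturn(null);
        assertNull(session.get("missing-key"));
    }
}
```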
Consider also the `shouldFindValidWebjar` test of the `Wro4J` project. The test checks whether external JavaScript pages exist. None of the detectors identified the external resource, thus overlooking this potential test smell. In conclusion, we argue that better detectors might be built by devising novel taxonomies to systematically collect comprehensive knowledge on how Mystery Guest and Resource Optimism instances may arise.

Finally, consider the `shouldParseSingular`
and `shouldParseNonLowerCase` test cases of the `Riptide` project. These tests were identified as smelly by both TeReDetect and the machine learning-based approach. The test cases seem to exercise the same execution path, yet they do so in different manners. More specifically, both aim at verifying the behavior of the `valueOf` method of the production class when it is supplied with timestamps expressed in seconds. While this may look like an instance of Test Redundancy, the values passed to the `valueOf` method have two very different meanings: `shouldParseSingular` exercises the production method with an extreme input (time cannot be negative, hence one second represents an extreme value of the method's input range), while `shouldParseNonLowerCase` uses an in-range input (17 seconds). As such, the two methods cannot be considered redundant: neither can be removed without impacting the test suite; otherwise, developers would lose a relevant piece of information about the adequacy of the production code. Unfortunately, the pair redundancy metric exploited by the detectors only considers whether two test cases cover the same path, without accounting for the rationale behind them. Therefore, we argue the need for more advanced metrics that combine dynamic and semantic analysis to correctly discriminate redundancy cases.
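To make the argument concrete, here is a hypothetical pair of tests in the same spirit (our stand-in parser, not Riptide's code): both cover the same execution path, so the pair redundancy metric flags them, yet they probe semantically different inputs:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

/** Hypothetical illustration of the Riptide-like case discussed above. */
public class DurationParserTest {

    /** Stand-in production method: extracts the number of seconds from a string. */
    static int parseSeconds(String value) {
        return Integer.parseInt(value.replaceAll("[^0-9]", ""));
    }

    // Both tests cover the same execution path, so pair redundancy is maximal;
    // semantically, though, this one probes the boundary of the input range...
    @Test
    public void shouldParseSingular() {
        assertEquals(1, parseSeconds("1 second"));
    }

    // ...while this one probes an in-range value. Neither can be removed without
    // losing information about the adequacy of the production code.
    @Test
    public void shouldParseNonLowerCase() {
        assertEquals(17, parseSeconds("17 Seconds"));
    }
}
```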