1 Introduction
2 Background
2.1 Software Test Code Quality
This approach can improve the coverage of the resulting test suites, and the use of more realistic input strings could improve the readability of the test code. Various studies (Fraser et al. 2013; Ceccato et al. 2015; Shamshiri et al. 2018) investigate the usefulness of test code generators in debugging activities and also highlight shortcomings of generated test code, which relate to the high number of assertions, the absence of explanatory comments or documentation, the quality of identifiers, and generally unrealistic test scenarios. Hence, similar to the quality of automatically generated code (Yetistiren et al. 2022; Al Madi 2022), the quality of generated test code is a critical aspect that requires additional consideration and investigation.
2.2 Readability of Source Code
3 Research Questions
3.1 Influence Factors in Academia
3.2 Influence Factors in Practice
3.3 Investigating Influence Factors in a Controlled Experiment
4 Systematic Mapping Study
4.1 Study Protocol and Process
Database | Search string |
---|---|
Scopus | SUBJAREA (COMP) TITLE-ABS-KEY(((code) AND (test* OR model) AND (readability OR understandability OR legibility)) OR ((“test” OR “code”) AND (smell) AND (readab* OR understandab* OR legib*))) |
IEEE | ((“All Metadata”: code) AND (“All Metadata”: test* OR “All Metadata”: model) AND (“All Metadata”: readability OR “All Metadata”: understandability OR “All Metadata”: legibility)) OR ((“All Metadata”: “test” OR “All Metadata”: “code”) AND (“All Metadata”: smell) AND (“All Metadata”: readab* OR “All Metadata”: understandab* OR “All Metadata”: legib*)) |
ACM | ((Title:(code) AND Title:(test* model) AND Title:(readability understandability legibility)) OR (Keyword:(code) AND Keyword:(test* model) AND Keyword:(readability understandability legibility)) OR (Abstract:(code) AND Abstract:(test* model) AND Abstract:(readability “understandability” legibility))) OR ((Abstract:(“test” “code”) AND Abstract:(smell) AND Abstract:(readab* understandab* legib*)) OR (Keyword:(“test” “code”) AND Keyword:(smell) AND Keyword:(readab* understandab* legib*)) OR (Title:(“test” “code”) AND Title:(smell) AND Title:(readab* understandab* legib*))) |
- Conference papers, journal/magazine articles, or PhD theses (returned by ACM)
- Readability, understandability, or legibility of test code is an object of the study
- Not written in English
- Conference summaries, talks, books, master theses
- Duplicate or superseded studies
- Studies not identifying factors that influence test code readability
- Step 3: Backward & Forward Snowballing. Since relevant literature might refer to further important studies, we used the references included in the 11 studies for backward snowballing via Scopus. The 11 studies might also be cited by other relevant studies, hence we also performed forward snowballing, using Scopus to find studies that cite one of the initial 11 studies. This increased the result set by 330 studies from backward snowballing and 174 from forward snowballing, to a total of 515 studies.
- Step 4: Deduplicate & Filter Results. By comparing these 515 studies with the initial result set, we found and removed 83 duplicates. Similar to step 2, one of the authors of this paper applied the inclusion and exclusion criteria. Additionally, after a full-text reading, all studies were discussed and reevaluated by the author team. With this, we reduced the result set by 496 and obtained a final set of 19 studies.
Idx | Title | Authors | Venue | Year | Study Type |
---|---|---|---|---|---|
[A17] | Developer’s Perspectives on Unit Test Cases Understandability | Setiani N. et al. | ICSESS | 2021 | Experiment + Survey (hum) |
[A16] | DeepTC-Enhancer: Improving the Readability of Automatically Generated Tests | Roy D. et al. | ASE | 2020 | Experiment + Survey (hum) |
[A18] | Test case understandability model | Setiani N. et al. | IEEE Access | 2020 | Experiment (hum) |
[A13] | On the quality of identifiers in test code | Lin B. et al. | SCAM | 2019 | Survey (hum) |
[A3] | What Factors Make SQL Test Cases Understandable for Testers? A Human Study of Automated Test Data Generation Techniques | Alsharif A. et al. | ICSME | 2019 | Experiment + Survey (hum) |
[A10] | Fluent vs basic assertions in Java: An empirical study | Leotta M. et al. | QUATIC | 2018 | Experiment (hum) |
[A9] | An empirical investigation on the readability of manual and generated test cases | Grano G. et al. | ICPC | 2018 | Experiment |
[A12] | Aiding comprehension of unit test cases and test suites with stereotype-based tagging | Li B. et al. | ICPC | 2018 | Experiment + User Study (hum) |
[A7] | Specification-Based testing in software engineering courses | Fisher G. and Johnson C. | SIGCSE | 2018 | Experiment + Survey (hum) |
[A2] | An industrial evaluation of unit test generation: Finding real faults in a financial application | Almasi M. et al. | ICSE-SEIP | 2017 | Experiment + Survey (hum) |
[A6] | Generating unit tests with descriptive names or: Would you name your children thing1 and thing2? | Daka E. et al. | ISSTA | 2017 | Experiment + Survey (hum) |
[A4] | How Good Are My Tests? | Bowes D. et al. | WETSoM | 2017 | Concept paper (hum) |
[A14] | Automatic test case generation: What if test code quality matters? | Palomba F. et al. | ISSTA | 2016 | Experiment |
[A11] | Automatically Documenting Unit Test Cases | Li B. et al. | ICST | 2016 | User study (hum) |
[A15] | The impact of test case summaries on bug fixing performance: An empirical investigation | Panichella S. et al. | ICSE | 2016 | Experiment (hum) |
[A19] | Towards automatically generating descriptive names for unit tests | Zhang B. et al. | ASE | 2016 | Prototype and User Study (hum) |
[A5] | Modeling readability to improve unit tests | Daka E. et al. | ESEC/FSE | 2015 | Experiment + Survey (hum) |
[A1] | Evolving readable string test inputs using a natural language model to reduce human oracle cost | Afshan S. et al. | ICST | 2013 | Experiment (hum) |
[A8] | Exploiting common object usage in test case generation | Fraser G. and Zeller A. | ICST | 2011 | Experiment |
4.2 Systematic Mapping Study Results
4.2.1 Which influence factors are analyzed in scientific literature (RQ1.1)?
4.2.2 Which research methods are used in scientific studies (RQ1.2)?
5 Grey Literature Review
5.1 Study Protocol and Process
- Readability or understandability of test code is a relevant part of the source. This is the case if the length of the content on readability is sufficient and if the source contains concrete examples of factors influencing readability.
- Not written in English
- Literature indexed by ACM, Scopus, or IEEE
- Duplicates, videos, dead links
5.2 Grey Literature Analysis Results
5.2.1 Which influence factors are discussed in grey literature (RQ2.1)?
5.2.2 What is the difference between influence factors in scientific literature and grey literature (RQ2.2)?
6 Evaluation of Influence Factors
6.1 Experiment Setup and Procedure
6.1.1 Select Tests
6.1.2 Apply Best Practices
6.1.3 Create Survey
6.1.4 Execute A/B Experiment
6.1.5 Analysis
(a) General Software Development Experience [years]

Years | Absolute | Percentage
---|---|---
0 | 0 | 0%
1-2 | 2 | 2.6%
2-5 | 41 | 53.2%
>5 | 34 | 44.2%
Sum: | 77 | 100.0%

(b) Professional Software Development Experience [years]

Years | Absolute | Percentage
---|---|---
0 | 20 | 26.0%
1-2 | 24 | 31.2%
2-5 | 25 | 32.5%
>5 | 8 | 10.3%
Sum: | 77 | 100.0%
6.2 Experiment Results
6.2.1 Do factors discussed in practice show an influence on readability when scientific methods are used (RQ3.1)?
7 Summary, Threats to Validity, and Future Work
7.1 Summary
7.2 Limitations and Threats to Validity
- In the context of the systematic mapping study, the keywords, search strings, analysis items, and the data extraction and analysis were handled by one of the authors and intensively reviewed and discussed within the author team and with external experts.
- The controlled experiment setup was initially executed in a pilot run to ensure consistency of the experiment material. We used a cross-over design of test case samples to avoid any bias of the experiment participants.
- Three unmodified test cases were used as control groups in A/B testing. The Wilcoxon Rank Sum test does not suggest a significant difference between ratings provided by participants when comparing groups with the same questionnaire. However, there is a significant effect when comparing control groups of different questionnaires. These results confirm a consistent rating behavior within groups, and the significant differences between groups are expected due to the independent ratings of participants from different groups.
- We conducted a literature review based on the guidelines of Petersen et al. (2015), complemented by a systematic analysis of grey literature (Garousi et al. 2019). Therefore, the analysis results identified the most prominent research directions in scientific literature, complemented by practical discussions in non-academic sources (such as blogs). This approach enabled us to identify similar and/or different key topics in academia and industry.
- Experiment participants were recruited on a voluntary basis from three classes of a master course on software testing at TU Wien. We captured background knowledge of the participants to identify participant experience. Most of the participants work in industry and can be considered “junior professionals”. Therefore, the results are applicable to industrial settings.
- We used real-world test cases from open source projects as well as results from software testing exercises to ensure test cases close to industrial practice.
- For the controlled experiment, we captured individual test case assessments for A/B tests (i.e., original tests taken from existing projects and slightly modified test cases) based on a 5-point Likert scale.
- To avoid a bias introduced by the order of questions in the experiment, we reversed the question ordering for half of the experiment groups.
- To avoid random readability ratings, we asked participants to give reasons for their ratings as free text. Furthermore, the participants were told that their reward (bonus points) was coupled with active participation in the challenge.
- For A/B testing in our experiment, we tried to select test cases that could be clearly related to individual influence factors. Since the test cases we used were retrieved from real-world projects instead of constructed examples, we only covered 7 out of the 14 influence factors identified in the literature search, which could limit the relevance of our results. Nevertheless, a certain amount of fuzziness with respect to influence factors may still be present, e.g., as discussed in the results for the modification Try Catch vs. AssertThrows (see Section 6.2.1).
- We used the Shapiro-Wilk test to test for normality, which would have allowed us to use a parametric statistical test. This approach is also used by Roy et al. (2020a), whose methodology is similar to ours.
- We used the non-parametric Wilcoxon Rank Sum test because our groups are unpaired and the Shapiro-Wilk test does not suggest a normal distribution of our result data.
- We report the effect size with Cliff’s Delta because it allows an interpretation of the magnitude of the difference between two groups. It is also used by other studies in this field, such as Grano et al. (2018a).
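The statistical procedure described above (the Wilcoxon Rank Sum test for unpaired groups and Cliff's Delta for effect size) can be sketched in plain Python. The implementation below is a simplified illustration, not the study's actual analysis script: it uses a normal approximation without a tie correction for the variance, and the sample ratings are hypothetical.

```python
from itertools import product
from statistics import NormalDist

def cliffs_delta(a, b):
    """Cliff's Delta: fraction of pairs where a value from `a` exceeds
    one from `b`, minus the reverse; ranges from -1 to +1."""
    gt = sum(x > y for x, y in product(a, b))
    lt = sum(x < y for x, y in product(a, b))
    return (gt - lt) / (len(a) * len(b))

def rank_sum_test(a, b):
    """Two-sided Wilcoxon Rank Sum test using the normal approximation.
    Ties receive average ranks; the variance tie correction is omitted
    for brevity."""
    combined = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # positions i..j-1 hold ties: assign the average 1-based rank
        ranks[combined[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(a), len(b)
    w = sum(ranks[x] for x in a)            # rank sum of the first group
    mu = n1 * (n1 + n2 + 1) / 2             # expected rank sum under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p

# Hypothetical 5-point Likert ratings for an A/B test pair.
group_a = [5, 5, 4, 4, 3]   # modified test case
group_b = [2, 2, 1, 3, 1]   # original test case
```

In practice one would use a vetted implementation such as `scipy.stats.ranksums`; the point here is only to make the reported measures concrete.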
7.3 Implications for Research and Practitioners
- For the Software Testing community, we identified influencing factors observed only in grey literature; these could initiate additional research initiatives focusing on topics that are of interest to practitioners but have received limited attention from researchers.
- Researchers in Software Engineering and/or Software Testing can take up the results from the literature reviews with a focus on replicating and extending the presented research work.
- The Empirical Research community can build on the SMS protocol, the grey literature protocol, and the study design to replicate and extend the study protocol in different contexts.
- We selected a representative set of test cases that could be used by researchers to (i) design and develop a method and/or tool to semi-automatically assess the readability of test code and (ii) apply the test code set for evaluation purposes in different contexts.
- In the Software Testing communities, factors such as Setup Methods/Fixtures, Helper Methods, and DRYness are widely discussed in the domain of practitioners. Considering these in test code generation could be useful for generating more readable tests. In a recent study, Panichella et al. (2022) also suggest including capabilities for complex object instantiation in test suite generators.
- Finally, the findings of the study can be used as input for researchers from Software Engineering communities to improve software maintenance tasks that benefit from readability assessments.
- For Software and System Engineering organizations, the results of this work can support software testers and developers in improving test code readability based on guidelines and identified influencing factors.
- Project and Quality Managers can use the results to set up organization-specific development guidelines to support software development, software testing, and software maintenance and evolution by a team of software experts. Applied best practices might help to improve the quality of test cases and reduce the effort and cost of maintenance activities.
- Factors with similar views from practitioners and academia include Test Names, Identifier Names, and Test Data. For test and identifier names, both domains agree on the use of naming patterns in order to achieve consistency across the test suite. For test data, both domains likewise agree on the use of realistic and simple values and on avoiding magic values.
- However, the experiment results show that applying best practices is no guarantee of improved readability.
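Several of the factors named above (setup methods/fixtures, helper-style DRY construction, descriptive test names following a pattern, realistic test data, and assertion-style exception checks analogous to JUnit's assertThrows) can be illustrated in a single small example. The sketch below uses Python's unittest as a stand-in for the study's test material; the `Account` class is a hypothetical domain object introduced purely for illustration.

```python
import unittest

class Account:
    """Hypothetical domain object used only for this illustration."""
    def __init__(self, owner, balance=0):
        self.owner, self.balance = owner, balance

    def withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount

class AccountWithdrawalTest(unittest.TestCase):
    def setUp(self):
        # Setup method / fixture: shared, DRY construction instead of
        # repeating object creation in every test; realistic test data
        # ("Alice", 100) rather than magic values.
        self.account = Account(owner="Alice", balance=100)

    # Descriptive test names following a <scenario>_<outcome> pattern.
    def test_withdraw_reduces_balance(self):
        self.account.withdraw(30)
        self.assertEqual(self.account.balance, 70)

    def test_withdraw_more_than_balance_raises_error(self):
        # assertRaises instead of a manual try/except block, mirroring
        # the Try Catch vs. AssertThrows modification from the experiment.
        with self.assertRaises(ValueError):
            self.account.withdraw(150)
```

As the experiment results caution, applying such conventions does not by itself guarantee higher readability ratings; the example only makes the discussed factors concrete.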