
Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study

Published: 02 September 2015

Abstract

Work on automated test generation has produced several tools capable of generating test data which achieves high structural coverage over a program. In the absence of a specification, developers are expected to manually construct or verify the test oracle for each test input. Nevertheless, it is assumed that these generated tests ease the task of testing for the developer, as testing is reduced to checking the results of tests. While this assumption has persisted for decades, there has been no conclusive evidence to date confirming it. However, the limited adoption in industry indicates this assumption may not be correct, and calls into question the practical value of test generation tools. To investigate this issue, we performed two controlled experiments comparing a total of 97 subjects split between writing tests manually and writing tests with the aid of an automated unit test generation tool, EvoSuite. We found that, on one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to 300% increase). However, on the other hand, there was no measurable improvement in the number of bugs actually found by developers. Our results not only cast some doubt on how the research community evaluates test generation tools, but also point to improvements and future work necessary before automated test generation tools will be widely adopted by practitioners.
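
To make the oracle problem described above concrete, the following sketch shows the kind of JUnit test that unit test generation tools typically produce: the inputs are chosen automatically (for example, to cover a particular branch), but the assertion merely records the value the current implementation returns, so a tester must still judge whether that value is correct. The class, method, and values below are hypothetical and purely illustrative; this is not actual EvoSuite output.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    // Hypothetical class under test (not one of the study's subject classes).
    class PriceCalculator {
        int discountedPrice(int price, int percent) {
            return price - (price * percent) / 100;
        }
    }

    public class PriceCalculatorGeneratedTest {

        // A generated test: the inputs (100, 10) were picked automatically to
        // exercise the method, and the expected value 90 was simply captured
        // from the current implementation. The tester must still decide
        // whether 90 is in fact the intended result.
        @Test
        public void testDiscountedPrice() {
            PriceCalculator calculator = new PriceCalculator();
            assertEquals(90, calculator.discountedPrice(100, 10));
        }
    }

If the captured value happened to encode a defect, the test would pass on the faulty code; this is why tool-generated assertions reduce the work of writing tests but not the work of checking them.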


    Reviews

    Julia Yousefi

    If you are a developer of an automated test generation tool, you may want to know how this type of software impacts the testing process. In general, automated test generation tools are evaluated by the percentage of code covered by the resulting test cases, and one of the questions the authors wanted to answer is whether code coverage is a good metric to use. They designed a study comparing how well users could detect faults using EvoSuite (an automated unit test generator) versus manually developed unit tests. It turned out that even though EvoSuite generated tests with higher code coverage, each group found about the same number of defects.

    The purpose of the paper is to describe the study (and its results) and how it relates to research in the area of automated test generation. It contains nine sections: "Introduction," "Study Design," "Results: Initial Study," "Results: Replication Study," "Discussion" (interpreting the results), "Background and Exit Questionnaires," "Implications for Future Work," "Related Work," and "Conclusions." Overall, it is well organized and extremely detailed; the most interesting parts are the sections that interpret the results and provide directions for future research.

    In addition to examining code coverage during testing, the researchers also wanted to understand how automated test generation affects testers' ability to detect faults, how many tests mismatched the intended behavior of the class, and how well the produced test suites detect regression faults. From the exit questionnaires, they learned that most users in the EvoSuite group wanted to use the generated tests even when the tests were bad. One conclusion was that a combination of manual and automated tests is needed, and that the manual tests should somehow inform the automated ones using a technique that has yet to be developed. Furthermore, test generation tools should produce tests that users can easily understand and trust: the time saved in generating tests was used up by analyzing the tests the tool produced.

    For researchers, this paper is relevant for the questions it raises about the current state of automated test generation and for its suggestions for future research. General readers may find the introduction and conclusion enough to give them a basic understanding of the intent of this research.

    Online Computing Reviews Service
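
    The review's point about generated tests that mismatch the intended behavior of the class can be illustrated with a second hedged sketch (again using a hypothetical class and values, not material from the study): when the implementation under test is faulty, a capture-based oracle records the faulty output as the expected value, so the generated test passes on the bug and only becomes fault-revealing once the tester corrects the assertion.

        import static org.junit.Assert.assertEquals;

        import org.junit.Test;

        // Hypothetical faulty class: the intended behavior is that zero elements
        // still get a capacity of 1, but the implementation returns 0.
        class BufferSizer {
            int capacityFor(int elements) {
                return elements * 2; // bug: should be Math.max(1, elements * 2)
            }
        }

        public class BufferSizerGeneratedTest {

            // The generated oracle records what the faulty code returns, so this
            // test passes on the buggy implementation. The tester must spot the
            // mismatch and change the expected value from 0 to the intended 1.
            @Test
            public void testCapacityForZeroElements() {
                BufferSizer sizer = new BufferSizer();
                assertEquals(0, sizer.capacityFor(0));
            }
        }

    Spotting such mismatches is exactly the analysis effort the reviewer notes, which consumed much of the time saved by generating the tests automatically.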

    • Published in

      ACM Transactions on Software Engineering and Methodology, Volume 24, Issue 4
      Special Issue on ISSTA 2013
      August 2015, 177 pages
      ISSN: 1049-331X
      EISSN: 1557-7392
      DOI: 10.1145/2820114

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 2 September 2015
      • Accepted: 1 September 2014
      • Revised: 1 July 2014
      • Received: 1 January 2014
      Published in TOSEM Volume 24, Issue 4

      Qualifiers

      • research-article
      • Research
      • Refereed
