Abstract
Work on automated test generation has produced several tools capable of generating test data that achieves high structural coverage of a program. In the absence of a specification, developers must manually construct or verify the test oracle for each test input. Nevertheless, these generated tests are assumed to ease the developer's testing task, since testing is reduced to checking the results of tests. Although this assumption has persisted for decades, no conclusive evidence has confirmed it, and the limited adoption of such tools in industry suggests it may not hold, calling into question the practical value of test generation tools. To investigate this issue, we performed two controlled experiments with a total of 97 subjects, split between writing tests manually and writing tests with the aid of an automated unit test generation tool, EvoSuite. We found that, on the one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to a 300% increase). On the other hand, there was no measurable improvement in the number of bugs actually found by developers. Our results not only cast doubt on how the research community evaluates test generation tools, but also point to improvements and future work necessary before automated test generation tools will be widely adopted by practitioners.
Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study