
Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study

Published: 02 September 2015

Abstract

Work on automated test generation has produced several tools capable of generating test data which achieves high structural coverage over a program. In the absence of a specification, developers are expected to manually construct or verify the test oracle for each test input. Nevertheless, it is assumed that these generated tests ease the task of testing for the developer, as testing is reduced to checking the results of tests. While this assumption has persisted for decades, there has been no conclusive evidence to date confirming it. However, the limited adoption in industry indicates this assumption may not be correct, and calls into question the practical value of test generation tools. To investigate this issue, we performed two controlled experiments comparing a total of 97 subjects split between writing tests manually and writing tests with the aid of an automated unit test generation tool, EvoSuite. We found that, on one hand, tool support leads to clear improvements in commonly applied quality metrics such as code coverage (up to 300% increase). However, on the other hand, there was no measurable improvement in the number of bugs actually found by developers. Our results not only cast some doubt on how the research community evaluates test generation tools, but also point to improvements and future work necessary before automated test generation tools will be widely adopted by practitioners.
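
To make the oracle problem described above concrete, the following sketch shows the kind of JUnit test that unit test generation tools typically produce: the inputs are chosen automatically (for example, to cover a particular branch), but the assertion merely records the value the current implementation returns, so a tester must still judge whether that value is correct. The class, method, and values below are hypothetical and purely illustrative; this is not actual EvoSuite output.

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;

    // Hypothetical class under test (not one of the study's subject classes).
    class PriceCalculator {
        int discountedPrice(int price, int percent) {
            return price - (price * percent) / 100;
        }
    }

    public class PriceCalculatorGeneratedTest {

        // A generated test: the inputs (100, 10) were picked automatically to
        // exercise the method, and the expected value 90 was simply captured
        // from the current implementation. The tester must still decide
        // whether 90 is in fact the intended result.
        @Test
        public void testDiscountedPrice() {
            PriceCalculator calculator = new PriceCalculator();
            assertEquals(90, calculator.discountedPrice(100, 10));
        }
    }

If the captured value happened to encode a defect, the test would pass on the faulty code; this is why tool-generated assertions reduce the work of writing tests but not the work of checking them.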


    Reviews

    Julia Yousefi

    If you are a developer of an automated test generation tool, you may want to know how this type of software impacts the testing process. In general, automated test generation tools are evaluated by the percentage of code covered by the resulting test cases, and one of the questions the authors wanted to answer is whether code coverage is a good metric to use. They designed a study comparing how well users could detect faults using EvoSuite (an automated unit test generator) versus manually developed unit tests. It turned out that even though EvoSuite generated tests with higher code coverage, each group found about the same number of defects.

    The purpose of the paper is to describe the study (and its results) and how it relates to research in the area of automated test generation. It contains nine sections: "Introduction," "Study Design," "Results: Initial Study," "Results: Replication Study," "Discussion" (interpreting the results), "Background and Exit Questionnaires," "Implications for Future Work," "Related Work," and "Conclusions." Overall, it is well organized and extremely detailed; the most interesting parts are the sections that interpret the results and provide directions for future research.

    In addition to examining code coverage during testing, the researchers also wanted to understand how automated test generation affects testers' ability to detect faults, how many tests mismatched the intended behavior of the class, and how well the produced test suites detect regression faults. From the exit questionnaires, they learned that most users in the EvoSuite group wanted to use the generated tests even when the tests were bad. One conclusion was that a combination of manual and automated tests is needed, and that the manual tests should somehow inform the automated ones using a technique that has yet to be developed. Furthermore, test generation tools should produce tests that users can easily understand and trust: the time saved in generating tests was used up by analyzing the tests the tool produced.

    For researchers, this paper is relevant for the questions it raises about the current state of automated test generation and for its suggestions for future research. General readers may find the introduction and conclusion enough to give them a basic understanding of the intent of this research.

    Online Computing Reviews Service
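
    The review's point about generated tests that mismatch the intended behavior of the class can be illustrated with a second hedged sketch (again using a hypothetical class and values, not material from the study): when the implementation under test is faulty, a capture-based oracle records the faulty output as the expected value, so the generated test passes on the bug and only becomes fault-revealing once the tester corrects the assertion.

        import static org.junit.Assert.assertEquals;

        import org.junit.Test;

        // Hypothetical faulty class: the intended behavior is that zero elements
        // still get a capacity of 1, but the implementation returns 0.
        class BufferSizer {
            int capacityFor(int elements) {
                return elements * 2; // bug: should be Math.max(1, elements * 2)
            }
        }

        public class BufferSizerGeneratedTest {

            // The generated oracle records what the faulty code returns, so this
            // test passes on the buggy implementation. The tester must spot the
            // mismatch and change the expected value from 0 to the intended 1.
            @Test
            public void testCapacityForZeroElements() {
                BufferSizer sizer = new BufferSizer();
                assertEquals(0, sizer.capacityFor(0));
            }
        }

    Spotting such mismatches is exactly the analysis effort the reviewer notes, which consumed much of the time saved by generating the tests automatically.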

    • Published in

      ACM Transactions on Software Engineering and Methodology, Volume 24, Issue 4
      Special Issue on ISSTA 2013
      August 2015, 177 pages
      ISSN: 1049-331X
      EISSN: 1557-7392
      DOI: 10.1145/2820114

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      • Published: 2 September 2015
      • Accepted: 1 September 2014
      • Revised: 1 July 2014
      • Received: 1 January 2014
      Published in TOSEM Volume 24, Issue 4

      Qualifiers

      • research-article
      • Research
      • Refereed
