Abstract
A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible.
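The notions above (a criterion as a set of test requirements, coverage as the fraction of feasible requirements a suite satisfies, and C-adequacy as 100% of feasible requirements) can be made concrete with a minimal sketch. All names and data here are hypothetical illustrations, not the study's tooling:

```python
# Minimal model (hypothetical): a criterion C is the set of feasible test
# requirements; each test in a suite contributes the set of requirements it
# satisfies.

def coverage(feasible_requirements, suite):
    """Fraction of feasible requirements satisfied by the suite."""
    satisfied = set().union(*suite) & feasible_requirements if suite else set()
    return len(satisfied) / len(feasible_requirements)

def is_adequate(feasible_requirements, suite):
    """A suite is C-adequate if it satisfies 100% of the feasible requirements."""
    return coverage(feasible_requirements, suite) == 1.0

# Example: statement coverage over statements {1..5}, with statement 5 infeasible.
feasible = {1, 2, 3, 4}
suite = [{1, 2}, {2, 3}]             # each test covers a set of statements
print(coverage(feasible, suite))     # 0.75 -- a non-adequate suite
print(is_adequate(feasible, suite))  # False
```

The common case the article targets is exactly the last situation: suites that stop well short of 1.0 and must still be compared.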
This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2,…,Tn have coverage values c1, c2,…,cn for C and c1′, c2′,…,cn′ for C′, is it better to compare suites based on c1, c2,…,cn or based on c1′, c2′,…,cn′? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines for testing researchers aiming to evaluate test suites using coverage criteria as well as for other researchers evaluating coverage criteria for research use.
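The comparison the study performs can be sketched as follows: given coverage values c1,…,cn under criterion C and c1′,…,cn′ under C′ for suites T1,…,Tn, plus an independent effectiveness measure (e.g., mutation score), the better criterion is the one whose coverage ranking correlates more strongly with effectiveness. The data below is invented for illustration; Kendall's tau (one of the rank-correlation measures such studies use) is implemented directly to keep the sketch self-contained:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two equal-length sequences."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # pair ordered the same way in both sequences
        elif s < 0:
            discordant += 1   # pair ordered oppositely
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical non-adequate suites: coverage under two criteria and their
# measured effectiveness (fraction of mutants killed).
branch_cov = [0.40, 0.55, 0.62, 0.71, 0.80]
path_cov   = [0.20, 0.35, 0.30, 0.50, 0.65]
mut_score  = [0.30, 0.45, 0.50, 0.60, 0.72]

print(kendall_tau(branch_cov, mut_score))  # 1.0 (rankings fully agree)
print(kendall_tau(path_cov, mut_score))    # 0.8 (one discordant pair)
```

On this made-up data, branch coverage would rank the suites exactly as effectiveness does, so it would be the better criterion to use for comparison; the study performs this kind of analysis at scale with real suites and several correlation measures.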
Index Terms
- Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites