Abstract
A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible.
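The notions above (a criterion as a set of test requirements, coverage as the fraction of feasible requirements a suite satisfies, and C-adequacy as 100% of feasible requirements) can be made concrete with a minimal sketch. All names and data here are hypothetical illustrations, not the study's tooling:

```python
# Minimal model (hypothetical): a criterion C is the set of feasible test
# requirements; each test in a suite contributes the set of requirements it
# satisfies.

def coverage(feasible_requirements, suite):
    """Fraction of feasible requirements satisfied by the suite."""
    satisfied = set().union(*suite) & feasible_requirements if suite else set()
    return len(satisfied) / len(feasible_requirements)

def is_adequate(feasible_requirements, suite):
    """A suite is C-adequate if it satisfies 100% of the feasible requirements."""
    return coverage(feasible_requirements, suite) == 1.0

# Example: statement coverage over statements {1..5}, with statement 5 infeasible.
feasible = {1, 2, 3, 4}
suite = [{1, 2}, {2, 3}]             # each test covers a set of statements
print(coverage(feasible, suite))     # 0.75 -- a non-adequate suite
print(is_adequate(feasible, suite))  # False
```

The common case the article targets is exactly the last situation: suites that stop well short of 1.0 and must still be compared.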
This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2,…,Tn have coverage values c1, c2,…,cn for C and c1′, c2′,…,cn′ for C′, is it better to compare suites based on c1, c2,…,cn or based on c1′, c2′,…,cn′? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines for testing researchers aiming to evaluate test suites using coverage criteria as well as for other researchers evaluating coverage criteria for research use.
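The comparison the study performs can be sketched as follows: given coverage values c1,…,cn under criterion C and c1′,…,cn′ under C′ for suites T1,…,Tn, plus an independent effectiveness measure (e.g., mutation score), the better criterion is the one whose coverage ranking correlates more strongly with effectiveness. The data below is invented for illustration; Kendall's tau (one of the rank-correlation measures such studies use) is implemented directly to keep the sketch self-contained:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two equal-length sequences."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1   # pair ordered the same way in both sequences
        elif s < 0:
            discordant += 1   # pair ordered oppositely
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical non-adequate suites: coverage under two criteria and their
# measured effectiveness (fraction of mutants killed).
branch_cov = [0.40, 0.55, 0.62, 0.71, 0.80]
path_cov   = [0.20, 0.35, 0.30, 0.50, 0.65]
mut_score  = [0.30, 0.45, 0.50, 0.60, 0.72]

print(kendall_tau(branch_cov, mut_score))  # 1.0 (rankings fully agree)
print(kendall_tau(path_cov, mut_score))    # 0.8 (one discordant pair)
```

On this made-up data, branch coverage would rank the suites exactly as effectiveness does, so it would be the better criterion to use for comparison; the study performs this kind of analysis at scale with real suites and several correlation measures.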
Index Terms
- Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites