
Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites

Published: 02 September 2015

Abstract

A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible.

This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2,…,Tn have coverage values c1, c2,…,cn for C and c1′, c2′,…,cn′ for C′, is it better to compare suites based on c1, c2,…,cn or based on c1′, c2′,…,cn′? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines for testing researchers aiming to evaluate test suites using coverage criteria as well as for other researchers evaluating coverage criteria for research use.
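The comparison the abstract describes can be sketched as a rank-correlation computation: for each criterion, correlate the suites' coverage values with an independent measure of effectiveness (e.g., mutation score), and prefer the criterion whose coverage ranks suites more consistently with effectiveness. The sketch below uses Kendall's τ, one of the standard rank-correlation measures; all suite names and numbers are hypothetical, invented purely for illustration, and this is not the paper's actual code or data.

```python
# Sketch (hypothetical data): which coverage criterion better predicts
# the effectiveness ranking of non-adequate suites T1..Tn?

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1   # both measures order the pair the same way
            elif s < 0:
                discordant += 1   # the measures disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)

# Made-up measurements for five non-adequate suites T1..T5:
cov_c  = [0.42, 0.55, 0.61, 0.70, 0.81]   # coverage under criterion C
cov_c2 = [0.60, 0.58, 0.72, 0.66, 0.85]   # coverage under criterion C'
effect = [0.30, 0.41, 0.45, 0.52, 0.60]   # effectiveness proxy (e.g., mutation score)

tau_c  = kendall_tau(cov_c, effect)
tau_c2 = kendall_tau(cov_c2, effect)
print(tau_c, tau_c2)  # prints 1.0 0.6 for this made-up data
```

Here C's coverage orders the suites exactly as the effectiveness proxy does (τ = 1.0), while C′ misorders two pairs (τ = 0.6), so on this toy data C would be the better criterion for comparing suites.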



• Published in

  ACM Transactions on Software Engineering and Methodology, Volume 24, Issue 4
  Special Issue on ISSTA 2013
  August 2015
  177 pages
  ISSN: 1049-331X
  EISSN: 1557-7392
  DOI: 10.1145/2820114

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 2 September 2015
      • Accepted: 1 August 2014
      • Revised: 1 May 2014
      • Received: 1 January 2014
Published in TOSEM Volume 24, Issue 4

      Qualifiers

      • research-article
      • Research
      • Refereed
