Does choice of mutation tool matter?

Authors: Rahul Gopinath, Iftekhar Ahmed, Mohammad Amin Alipour, Carlos Jensen, Alex Groce

Published in: Software Quality Journal | Issue 3/2017

Publication date: 10-05-2016

Abstract

Though mutation analysis is the primary means of evaluating the quality of test suites, it suffers from inadequate standardization. Mutation analysis tools vary based on language, when mutants are generated (phase of compilation), and target audience. Mutation tools rarely implement the complete set of operators proposed in the literature and mostly implement at least a few domain-specific mutation operators. Thus, different tools may not always agree on the mutant kills of a test suite. Few criteria exist to guide a practitioner in choosing the right tool for either evaluating the effectiveness of a test suite or comparing different testing techniques. We investigate an ensemble of measures for evaluating the efficacy of mutants produced by different tools. These include the traditional difficulty of detection, strength of minimal sets, and the diversity of mutants, as well as the information carried by the mutants produced. We find that mutation tools rarely agree. The disagreement between scores can be large, and the variation due to characteristics of the project (even after accounting for differences due to test suites) is a significant factor. However, the mean difference between tools is very small, indicating that no single tool consistently skews mutation scores high or low for all projects. These results suggest that experiments yielding small differences in mutation score, especially those using a single tool or a small number of projects, may not be reliable. There is a clear need for greater standardization of mutation analysis. We propose one approach for such standardization.

Footnotes

Very often, a single high-level statement is implemented as multiple lower-level instructions. Hence, a simple change in assembly may not have an equivalent source representation. See the Pit switch mutator (Coles 2016b) for an example that does not have a direct source equivalent.

See the Pit return values mutator (Coles 2016b) for an example where first-order source changes imply much larger bytecode changes.

By semantics, we mean the actual behavior (in contrast to the static syntax) of the mutants. That is, some mutants, while syntactically different, are actually indistinguishable in their behavior. Similarly, mutants may be hard or easy to detect, and one set of mutants may encode more difference in behavior than another set. We use measures such as mutual information and entropy to measure the ability of a set of mutants to provide a diverse set of behaviors.
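As a rough illustration of how entropy can summarize the behavioral variety of a mutant set, the sketch below computes the Shannon entropy (in bits) of the distribution of kill sets over mutants. The kill data, the mutant and test names, and the helper `kill_set_entropy` are hypothetical, and this is not necessarily the exact formulation used in the paper.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet


def kill_set_entropy(kills: Dict[str, FrozenSet[str]]) -> float:
    """Shannon entropy (bits) of the distribution of kill sets across mutants."""
    counts = Counter(kills.values())
    total = sum(counts.values())
    # p * log2(1 / p), summed over each distinct kill set
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical data: each mutant maps to the set of tests that kill it.
kills = {
    "m1": frozenset({"t1"}),
    "m2": frozenset({"t1"}),  # behaves like m1 under this test suite
    "m3": frozenset({"t2"}),
    "m4": frozenset({"t3"}),
}

entropy_bits = kill_set_entropy(kills)
print(f"{entropy_bits:.2f} bits")
```

Under this reading, a set in which every mutant is killed by the same tests has zero entropy, while a set whose mutants are spread evenly over many distinct kill sets has high entropy.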

For any set of mutants, the strength of a test suite required to detect them depends on the number of non-redundant mutants within that set. Thus, for this paper, we define the strength of a set of mutants as the number of non-redundant mutants within that set.
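A minimal sketch of this count follows, under a simplifying assumption that is ours rather than the paper's: two mutants are treated as redundant when exactly the same tests kill them, and mutants that no test kills are ignored. The paper's own notion of redundancy may be stricter (for example, based on subsumption). The kill matrix and the function name `count_non_redundant` are hypothetical.

```python
from typing import Dict, FrozenSet, Set


def count_non_redundant(kills: Dict[str, Set[str]]) -> int:
    """Count distinct, non-empty kill sets (assumed proxy for non-redundant mutants)."""
    distinct: Set[FrozenSet[str]] = {
        frozenset(tests) for tests in kills.values() if tests
    }
    return len(distinct)


# Hypothetical kill matrix: mutant -> tests that kill it.
kill_matrix = {
    "m1": {"t1"},
    "m2": {"t1"},        # same kill set as m1, so redundant under this assumption
    "m3": {"t1", "t2"},
    "m4": set(),         # never killed, so it contributes nothing
}

strength = count_non_redundant(kill_matrix)
print(strength)
```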

Diversity of a set of mutants refers to how different one can expect any two mutants from the set to be, in terms of the tests that kill them. For example, suppose mutant set A has mutant and killing test pairs given by \(\{(m_1,t_1), (m_2,t_2)\}\), and mutant set B has pairs given by \(\{(m_1,t_1), (m_2,t_2), (m_3,t_3)\}\). A and B have similar diversity, because in both sets every mutant is killed by a different test. Another set C, given by \(\{(m_1,t_1), (m_2,t_1)\}\), has a different diversity, because both of its mutants are killed by the same test.
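The example sets A, B, and C can be checked with a small sketch. It uses the mean pairwise Jaccard distance between kill sets as a stand-in for diversity; that choice of measure is our assumption for illustration and is not taken from the paper.

```python
from itertools import combinations
from typing import Dict, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(kills: Dict[str, Set[str]]) -> float:
    """Average Jaccard distance over all pairs of mutants in the set."""
    pairs = list(combinations(kills.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# The three sets from the footnote: mutant -> tests that kill it.
set_a = {"m1": {"t1"}, "m2": {"t2"}}
set_b = {"m1": {"t1"}, "m2": {"t2"}, "m3": {"t3"}}
set_c = {"m1": {"t1"}, "m2": {"t1"}}

diversity_a = mean_pairwise_distance(set_a)
diversity_b = mean_pairwise_distance(set_b)
diversity_c = mean_pairwise_distance(set_c)
print(diversity_a, diversity_b, diversity_c)
```

With this measure, A and B score the same even though B is larger, while C scores lower because its two mutants are indistinguishable by the tests.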

Note that the LOC given by Delahaye et al. is ambiguous. The text suggests that the LOC is that of the program. However, checking the LOC of some of the programs such as jopt-simple and commons-lang suggests that the given LOC is that of the test suite (and it is reported in the table as details of the test suite). Hence we do not include LOC details here.

The Siemens test suite is a test suite curated by researchers (Untch 2009), and is at best a questionable representative of real-world test suites.

Even though a script mode is available, it still requires a GUI to be present, and communication with its authors did not produce any assistance on this point.

In the case of Pit, we extended Pit to provide a more complete set of mutants, a modification which was later accepted into the mainline (Pit 1.0).

Statistical significance is the confidence we have in our estimates. It says nothing about the effect size. That is, we can be highly confident of a small consistent difference, but it may not be practically relevant.
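The distinction can be made concrete with synthetic numbers. The sketch below builds hypothetical paired differences in mutation score between two unnamed tools, then computes a paired t statistic (how confident we can be that the mean difference is nonzero) and Cohen's d for paired data (mean difference divided by the standard deviation of the differences, a standardized effect size). The data are invented purely for illustration and are not results from the paper.

```python
import math
from statistics import mean, stdev

# Synthetic paired differences (tool X score minus tool Y score), one per
# hypothetical project: a tiny average difference with a modest spread.
diffs = [0.01 + (0.1 if i % 2 == 0 else -0.1) for i in range(10_000)]

mean_diff = mean(diffs)
sd_diff = stdev(diffs)

# Paired t statistic: grows with the number of observations.
t_statistic = mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Cohen's d for paired data: does not grow with the number of observations.
cohens_d = mean_diff / sd_diff

print(f"mean difference = {mean_diff:.4f}")
print(f"t statistic     = {t_statistic:.2f}")
print(f"Cohen's d       = {cohens_d:.3f}")
```

With enough observations the t statistic becomes large, so the difference is statistically significant, yet the effect size stays small.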

Analysis of variance (ANOVA) is a statistical procedure used to compare the goodness of fit of statistical models. It can tell us whether a variable contributes significantly (in the statistical sense) to the variation in the dependent variable by comparing against a model that does not contain that variable. If the p value (given in tables as \(Pr({>}F)\)) is not statistically significant, it is an indication that the variable contributes little to the model fit. Note that the \(R^2\) reported is the adjusted \(R^2\), which accounts for the complexity of the model due to the number of variables considered.
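For a single categorical variable, the comparison reduces to a one-way ANOVA, which can be sketched in a few lines. The scores and tool names below are invented for illustration, and the helper `one_way_anova` is hypothetical. It compares a model with one mean per group against a model with a single overall mean, and also reports \(R^2\) and adjusted \(R^2\). Turning the F statistic into \(Pr({>}F)\) requires the F distribution, which is omitted here.

```python
from statistics import mean
from typing import Dict, List, Tuple


def one_way_anova(groups: Dict[str, List[float]]) -> Tuple[float, float, float]:
    """Return (F statistic, R^2, adjusted R^2) for a one-way layout."""
    values = [v for g in groups.values() for v in g]
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Residual variation of the model without the variable (grand mean only).
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    # Residual variation of the model with the variable (one mean per group).
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    ss_between = ss_total - ss_within
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r2 = ss_between / ss_total
    # Penalize R^2 for the k - 1 parameters the grouping variable adds.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return f_stat, r2, adj_r2


# Hypothetical mutation scores for the same projects under two tools.
scores = {
    "tool_a": [0.60, 0.62, 0.64],
    "tool_b": [0.70, 0.72, 0.74],
}

f_stat, r2, adj_r2 = one_way_anova(scores)
print(f"F = {f_stat:.2f}, R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```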

References

Acree, A. T., Jr. (1980). On mutation. Ph.D. dissertation, Georgia Institute of Technology, Atlanta, GA, USA.

Ammann, P. (2015a). Problems with jester. https://sites.google.com/site/mutationworkshop2015/program/MutationKeynote.

Ammann, P. (2015b). Transforming mutation testing from the technology of the future into the technology of the present. In International conference on software testing, verification and validation workshops. IEEE.

Ammann, P., Delamaro, M. E., & Offutt, J. (2014). Establishing theoretical minimal sets of mutants. In International conference on software testing, verification and validation (pp. 21–30). Washington, DC, USA: IEEE Computer Society.

Andrews, J. H., Briand, L. C., & Labiche, Y. (2005). Is mutation an appropriate tool for testing experiments? In International conference on software engineering (pp. 402–411). IEEE.

Andrews, J. H., Briand, L. C., Labiche, Y., & Namin, A. S. (2006). Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Transactions on Software Engineering, 32(8), 608–624.

Apache Software Foundation. (2016). Apache commons. http://commons.apache.org/.

Baldwin, D., & Sayward, F. (1979). Heuristics for determining equivalence of program mutations. Tech. rep., DTIC Document.

Barbosa, E. F., Maldonado, J. C., & Vincenzi, A. M. R. (2001). Toward the determination of sufficient mutant operators for c. Software Testing, Verification and Reliability, 11(2), 113–136.

Budd, T. A. (1980). Mutation analysis of program test data. Ph.D. dissertation, Yale University, New Haven, CT, USA.

Budd, T. A., DeMillo, R. A., Lipton, R. J., & Sayward, F. G. (1980). Theoretical and empirical studies on using program mutation to test the functional correctness of programs. In ACM SIGPLAN-SIGACT symposium on principles of programming languages (pp. 220–233). ACM.

Budd, T. A., Lipton, R. J., DeMillo, R. A., & Sayward, F. G. (1979). Mutation analysis. Yale University, Department of Computer Science.

Budd, T. A., & Gopal, A. S. (1985). Program testing by specification mutation. Computer Languages, 10(1), 63–73.

Cai, X., & Lyu, M. R. (2005). The effect of code coverage on fault detection under different testing profiles. In ACM SIGSOFT software engineering notes (Vol. 30, no. 4, pp. 1–7). ACM.

Chevalley, P., & Thévenod-Fosse, P. (2003). A mutation analysis tool for java programs. International Journal on Software Tools for Technology Transfer, 5(1), 90–103.

Coles, H. (2016). Pit mutation testing. http://pitest.org/.

Coles, H. (2016a). Mutation testing systems for java compared. http://pitest.org/java_mutation_testing_systems/.

Coles, H. (2016b). Pit mutators. http://pitest.org/quickstart/mutators/.

Daran, M., & Thévenod-Fosse, P. (1996). Software error analysis: A real case study involving real faults and mutations. In ACM SIGSOFT international symposium on software testing and analysis (pp. 158–171). ACM.

Delahaye, M., & Du Bousquet, L. (2013). A comparison of mutation analysis tools for java. In Quality software (QSIC), 2013 13th international conference on (pp. 187–195). IEEE.

DeMillo, R. A., Guindi, D. S., McCracken, W., Offutt, A., & King, K. (1988). An extended overview of the mothra software testing environment. In International conference on software testing, verification and validation workshops (pp. 142–151). IEEE.

DeMillo, R. A., Lipton, R. J., & Sayward, F. G. (1978). Hints on test data selection: Help for the practicing programmer. Computer, 11(4), 34–41.

Derezińska, A., & Hałas, K. (2014). Analysis of mutation operators for the python language. In International conference on dependability and complex systems, ser. Advances in Intelligent Systems and Computing (Vol. 286, pp. 155–164). Springer.

Do, H., & Rothermel, G. (2006). On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Transactions on Software Engineering, 32(9), 733–752.

Duraes, J., & Madeira, H. (2002). Emulation of software faults by educated mutations at machine-code level. In International symposium on software reliability engineering (pp. 329–340).

GitHub Inc. (2016). Software repository. http://www.github.com.

Gligoric, M., Groce, A., Zhang, C., Sharma, R., Alipour, M. A., & Marinov, D. (2013). Comparing non-adequate test suites using coverage criteria. In ACM SIGSOFT international symposium on software testing and analysis. ACM.

Gligoric, M., Jagannath, V., & Marinov, D. (2010). Mutmut: Efficient exploration for mutation testing of multithreaded code. In Software testing, verification and validation (ICST), 2010 third international conference on (pp. 55–64). IEEE.

Gopinath, R. (2015). Replication data for: Does choice of mutation tool matter?. http://eecs.osuosl.org/rahul/sqj2015.

Gopinath, R., Alipour, A., Ahmed, I., Jensen, C., & Groce, A. (2015). Do mutation reduction strategies matter? Oregon State University, tech. rep., August 2015, under review for Software Quality Journal. http://hdl.handle.net/1957/56917.

Gopinath, R., Alipour, A., Ahmed, I., Jensen, C., & Groce, A. (2016). On the limits of mutation reduction strategies. In Proceedings of the 38th international conference on software engineering. ACM.

Gopinath, R., Alipour, A., Iftekhar, A., Jensen, C., & Groce, A. (2015). How hard does mutation analysis have to be, anyway? In International symposium on software reliability engineering. IEEE.

Gopinath, R., Jensen, C., & Groce, A. (2014). Code coverage for suite evaluation by developers. In International conference on software engineering. IEEE.

Gopinath, R., Jensen, C., & Groce, A. (2014). Mutations: How close are they to real faults? In Software reliability engineering (ISSRE), 2014 IEEE 25th international symposium on (pp. 189–200), November 2014.

Harder, M., Mellen, J., & Ernst, M.D. (2003). Improving test suites via operational abstraction. In International conference on software engineering (pp. 60–71). IEEE Computer Society.

Harder, M., Morse, B., & Ernst, M. D. (2001). Specification coverage as a measure of test suite quality. Tech. rep., MIT Lab for Computer Science.

Irvine, S. A., Pavlinic, T., Trigg, L., Cleary, J. G., Inglis, S., & Utting, M. (2007). Jumble java byte code to measure the effectiveness of unit tests. In Testing: Academic and industrial conference practice and research techniques-MUTATION, 2007. TAICPART-MUTATION 2007 (pp. 169–175). IEEE.

Jia, Y., & Harman, M. (2008). Milu: A customizable, runtime-optimized higher order mutation testing tool for the full c language. In Practice and Research Techniques, 2008. TAIC PART’08. Testing: Academic & industrial conference (pp. 94–98). IEEE.

Jia, Y., & Harman, M. (2011). An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, 37(5), 649–678.

Just, R. (2014). The major mutation framework: Efficient and scalable mutation analysis for java. In Proceedings of the 2014 international symposium on software testing and analysis, ser. ISSTA 2014 (pp. 433–436). New York, NY: ACM.

Just, R., Kapfhammer, G. M., & Schweiggert, F. (2012). Do redundant mutants affect the effectiveness and efficiency of mutation analysis? In Software testing, verification and validation (ICST), 2012 IEEE fifth international conference on (pp. 720–725). IEEE.

Just, R., Jalali, D., Inozemtseva, L., Ernst, M. D., Holmes, R., & Fraser, G. (2014). Are mutants a valid substitute for real faults in software testing? ACM SIGSOFT symposium on the foundations of software engineering (pp. 654–665). Hong Kong: ACM.

Kintis, M., Papadakis, M., & Malevris, N. (2010). Evaluating mutation testing alternatives: A collateral experiment. In Asia Pacific software engineering conference (APSEC) (pp. 300–309). IEEE.

Kurtz, B., Ammann, P., Delamaro, M. E., Offutt, J., & Deng, L. (2014). Mutant subsumption graphs. In Software testing, verification and validation workshops (ICSTW), 2014 IEEE seventh international conference on (pp. 176–185). IEEE.

Kusano, M., & Wang, C. (2013). Ccmutator: A mutation generator for concurrency constructs in multithreaded c/c++ applications. In Automated software engineering (ASE), 2013 IEEE/ACM 28th international conference on (pp. 722–725). IEEE.

Langdon, W. B., Harman, M., & Jia, Y. (2010). Efficient multi-objective higher order mutation testing with genetic programming. Journal of Systems and Software, 83(12), 2416–2430.

Le, D., Alipour, M. A., Gopinath, R., & Groce, A. (2014). Mucheck: An extensible tool for mutation testing of haskell programs. In Proceedings of the 2014 international symposium on software testing and analysis (pp. 429–432). ACM.

Lipton, R. J. (1971). Fault diagnosis of computer programs. Carnegie Mellon University, Tech. rep.

Ma, Y.-S., Kwon, Y.-R., & Offutt, J. (2002). Inter-class mutation operators for java. In International symposium on software reliability engineering (pp. 352–363). IEEE.

Ma, Y.-S., Offutt, J., & Kwon, Y.-R. (2006). Mujava: A mutation system for java. In Proceedings of the 28th international conference on software engineering, ser. ICSE’06 (pp. 827–830). New York, NY: ACM.

Macedo, M. G. (2016). Mutator. http://ortask.com/mutator/.

Madeyski, L., & Radyk, N. (2010). Judy – a mutation testing tool for java. IET Software, 4(1), 32–42.

Ma, Y.-S., Offutt, J., & Kwon, Y. R. (2005). Mujava: An automated class mutation system. Software Testing, Verification and Reliability, 15(2), 97–133.

Mathur, A. (1991). Performance, effectiveness, and reliability issues in software testing. In Annual international computer software and applications conference, COMPSAC (pp. 604–605).

Mathur, A. P., & Wong, W. E. (1994). An empirical comparison of data flow and mutation-based test adequacy criteria. Software Testing, Verification and Reliability, 4(1), 9–31.

Moore, I. (2001). Jester—a junit test tester. In International conference on extreme programming (pp. 84–87).

Namin, A. S., & Andrews, J. H. (2009). The influence of size and coverage on test suite effectiveness. In ACM SIGSOFT international symposium on software testing and analysis (pp. 57–68). ACM.

Namin, A. S., Andrews, J. H., & Murdoch, D. J. (2008). Sufficient mutation operators for measuring test effectiveness. In International conference on software engineering (pp. 351–360). ACM.

Nanavati, J., Wu, F., Harman, M., Jia, Y., & Krinke, J. (2015). Mutation testing of memory-related operators. In Software testing, verification and validation workshops (ICSTW), 2015 IEEE eighth international conference on (pp. 1–10). IEEE.

Nica, S., & Wotawa, F. (2012). Using constraints for equivalent mutant detection. In Workshop on formal methods in the development of software, WS-FMDS (pp. 1–8).

Nimmer, J. W., & Ernst, M. D. (2002). Automatic generation of program specifications. ACM SIGSOFT Software Engineering Notes, 27(4), 229–239.

Offutt, J. (2016a). Problems with jester. https://cs.gmu.edu/offutt/documents/personal/jester-anal.html.

Offutt, J. (2016b). Problems with parasoft insure++. https://cs.gmu.edu/offutt/documents/handouts/parasoft-anal.html.

Offutt, J. (2016). Insure++ critique. https://cs.gmu.edu/offutt/documents/handouts/parasoft-anal.html.

Offutt, A. J., & Untch, R. H. (2000). Mutation, uniting the orthogonal. In Mutation testing for the new century (pp. 34–44). Springer, 2001.

Offutt, A. J., & Voas, J. M. (1996). Subsumption of condition coverage techniques by mutation testing. Technical report ISSE-TR-96-01, Information and Software Systems Engineering, George Mason University.

Offutt, A. J., Rothermel, G., & Zapf, C. (1993). An experimental evaluation of selective mutation. In International conference on software engineering (pp. 100–107). IEEE Computer Society Press.

Offutt, A. J. (1989). The coupling effect: Fact or fiction? ACM SIGSOFT Software Engineering Notes, 14(8), 131–140.

Offutt, A. J. (1992). Investigations of the software testing coupling effect. ACM Transactions on Software Engineering and Methodology, 1(1), 5–20.

Offutt, A. J., & Craft, W. M. (1994). Using compiler optimization techniques to detect equivalent mutants. Software Testing, Verification and Reliability, 4(3), 131–154.

Offutt, A. J., Lee, A., Rothermel, G., Untch, R. H., & Zapf, C. (1996). An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering and Methodology, 5(2), 99–118.

Offutt, A. J., & Pan, J. (1997). Automatically detecting equivalent mutants and infeasible paths. Software Testing, Verification and Reliability, 7(3), 165–192.

Okun, V. (2004). Specification mutation for test generation and analysis. Ph.D. dissertation, University of Maryland Baltimore County.

Papadakis, M., Jia, Y., Harman, M., & Traon, Y. L. (2015). Trivial compiler equivalence: A large scale empirical study of a simple, fast and effective equivalent mutant detection technique. In International conference on software engineering.

Parasoft. (2014). Insure++. www.parasoft.com/products/insure/papers/tech_mut.htm.

Parasoft. (2015). Insure++ mutation analysis. http://www.parasoft.com/jsp/products/article.jsp?articleId=291&product=Insure.

Schuler, D., & Zeller, A. (2009). Javalanche: Efficient mutation testing for java. In ACM SIGSOFT symposium on the foundations of software engineering (pp. 297–298). August, 2009.

Schuler, D., Dallmeier, V., & Zeller, A. (2009). Efficient mutation testing by checking invariant violations. In ACM SIGSOFT international symposium on software testing and analysis (pp. 69–80). ACM.

Schuler, D., & Zeller, A. (2013). Covering and uncovering equivalent mutants. Software Testing, Verification and Reliability, 23(5), 353–374.

Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3–55.

Singh, P. K., Sangwan, O. P., & Sharma, A. (2014). A study and review on the development of mutation testing tools for java and aspect-j programs. International Journal of Modern Education and Computer Science (IJMECS), 6(11), 1.

Smith, B. H., & Williams, L. (2007). An empirical evaluation of the mujava mutation operators. In Testing: academic and industrial conference practice and research techniques-MUTATION, 2007. TAICPART-MUTATION 2007 (pp. 193–202). IEEE.

Sridharan, M., & Namin, A. S. (2010). Prioritizing mutation operators based on importance sampling. In International symposium on software reliability engineering (pp. 378–387). IEEE.

Untch, R. H. (2009). On reduced neighborhood mutation analysis using a single mutagenic operator. In Annual southeast regional conference, ser. ACM-SE 47 (pp. 71:1–71:4). New York, NY: ACM.

Usaola, M. P., & Mateo, P. R. (2012). Bacterio: Java mutation testing tool: A framework to evaluate quality of tests cases. In Proceedings of the 2012 IEEE international conference on software maintenance (ICSM), ser. ICSM’12 (pp. 646–649). Washington, DC: IEEE Computer Society.

Wah, K. S. H. T. (2000). A theoretical study of fault coupling. Software Testing, Verification and Reliability, 10(1), 3–45.

Wah, K. S. H. T. (2003). An analysis of the coupling effect i: Single test data. Science of Computer Programming, 48(2), 119–161.

Watanabe, S. (1960). Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1), 66–82.

Wong, W. E. (1993). On mutation and data flow. Ph.D. dissertation, Purdue University, West Lafayette, IN, USA, UMI Order No. GAX94-20921.

Wong, W., & Mathur, A. P. (1995). Reducing the cost of mutation testing: An empirical study. Journal of Systems and Software, 31(3), 185–196.

Yao, X., Harman, M., & Jia, Y. (2014). A study of equivalent and stubborn mutation operators using human analysis of equivalence. In International conference on software engineering (pp. 919–930).

Zhang, L., Gligoric, M., Marinov, D., & Khurshid, S. (2013). Operator-based and random mutant selection: Better together. In IEEE/ACM automated software engineering. ACM.

Zhang, L., Hou, S.-S., Hu, J.-J., Xie, T., & Mei, H. (2010). Is operator-based mutant selection superior to random mutant selection? In International conference on software engineering (pp. 435–444). New York, NY: ACM.

Zhang, J., Zhu, M., Hao, D., & Zhang, L. (2014). An empirical study on the scalability of selective mutation testing. In International symposium on software reliability engineering. ACM.

Zhou, C., & Frankl, P. (2009). Mutation testing for java database applications. In Software testing verification and validation, ICST’09. International conference on (pp. 396–405). IEEE.

Title: Does choice of mutation tool matter?
Authors: Rahul Gopinath, Iftekhar Ahmed, Mohammad Amin Alipour, Carlos Jensen, Alex Groce
Publication date: 10-05-2016
Publisher: Springer US
Published in: Software Quality Journal / Issue 3/2017
Print ISSN: 0963-9314
Electronic ISSN: 1573-1367
DOI: https://doi.org/10.1007/s11219-016-9317-7
