DOI: 10.1145/3383219.3383232

Assessing software defection prediction performance: why using the Matthews correlation coefficient matters

Published: 17 April 2020

ABSTRACT

Context: There is considerable diversity in the range and design of computational experiments to assess classifiers for software defect prediction. This is particularly so regarding the choice of classifier performance metrics. Unfortunately, some widely used metrics are known to be biased, in particular F1.
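To make the bias concrete, here is a minimal sketch (not taken from the paper; all counts are invented) that computes F1 and MCC from the same 2 × 2 confusion matrix. Because F1 ignores true negatives, a trivial classifier that labels every case positive on a deliberately extreme, positive-heavy test set still earns a high F1, while MCC reports chance-level performance.

    import math

    def f1(tp, fp, fn):
        # Harmonic mean of precision and recall; true negatives never appear.
        return 2 * tp / (2 * tp + fp + fn)

    def mcc(tp, fp, fn, tn):
        # Matthews correlation coefficient; uses all four confusion-matrix cells.
        denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return (tp * tn - fp * fn) / denom if denom else 0.0

    # Invented, extreme test set: 90% of cases are positive and the classifier
    # simply predicts "positive" for everything.
    tp, fp, fn, tn = 900, 100, 0, 0
    print(f"F1  = {f1(tp, fp, fn):.2f}")        # 0.95 -- looks excellent
    print(f"MCC = {mcc(tp, fp, fn, tn):.2f}")   # 0.00 -- no better than chance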

Objective: We want to understand the extent to which the widespread use of F1 renders empirical results in software defect prediction unreliable.

Method: We searched for defect prediction studies that report both F1 and the Matthews correlation coefficient (MCC). This enabled us to determine the proportion of results that are consistent between the two metrics and the proportion whose direction changes.
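As an illustration of this consistency check, the sketch below uses invented numbers (not data from the review): it takes paired F1 and MCC values for comparisons of two treatments A and B and counts how often the two metrics disagree about which treatment wins.

    # Each tuple holds (F1_A, F1_B, MCC_A, MCC_B) for one pairwise comparison
    # of two treatments A and B; the real review extracts such values from the
    # published primary studies.
    pairs = [
        (0.62, 0.58, 0.31, 0.35),   # F1 prefers A, MCC prefers B -> direction changes
        (0.70, 0.55, 0.44, 0.29),   # both metrics prefer A       -> consistent
        (0.48, 0.51, 0.22, 0.27),   # both metrics prefer B       -> consistent
    ]

    def same_direction(f1_a, f1_b, mcc_a, mcc_b):
        # Consistent when both metrics rank the treatments in the same order
        # (ties counted as agreement for simplicity).
        return (f1_a - f1_b) * (mcc_a - mcc_b) >= 0

    changed = sum(not same_direction(*p) for p in pairs)
    print(f"{changed} of {len(pairs)} comparisons change direction "
          f"({100 * changed / len(pairs):.0f}%)")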

Results: Our systematic review identifies 8 studies comprising 4017 pairwise results. Of these results, the direction of the comparison changes in 23% of the cases when the unbiased MCC metric is employed.

Conclusion: We find compelling reasons why the choice of classification performance metric matters: specifically, the biased and misleading F1 metric should be deprecated.


Published in

EASE '20: Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering
April 2020, 544 pages
ISBN: 9781450377317
DOI: 10.1145/3383219
General Chairs: Jingyue Li, Letizia Jaccheri
Program Chairs: Torgeir Dingsøyr, Ruzanna Chitchyan
Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 71 of 232 submissions, 31%
