Assessing software defection prediction performance: why using the Matthews correlation coefficient matters

ABSTRACT
Context: There is considerable diversity in the range and design of computational experiments to assess classifiers for software defect prediction. This is particularly so regarding the choice of classifier performance metric. Unfortunately, some widely used metrics are known to be biased, in particular F1.
Objective: We want to understand the extent to which the widespread use of F1 renders empirical results in software defect prediction unreliable.
Method: We searched for defect prediction studies that report both F1 and the Matthews correlation coefficient (MCC). This enabled us to determine the proportion of pairwise results that are consistent between the two metrics and the proportion whose direction changes.
Results: Our systematic review identifies 8 studies comprising 4017 pairwise results. Of these results, the direction of the comparison changes in 23% of cases when the unbiased MCC metric is used instead of F1.
Conclusion: We find compelling reasons why the choice of classification performance metric matters; specifically, the biased and misleading F1 metric should be deprecated. The worked example below illustrates how the two metrics can rank classifiers differently.
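To make the bias concrete, the following minimal sketch computes F1 and MCC from two hypothetical confusion matrices on the same imbalanced test set. The figures are invented for illustration and are not drawn from the reviewed studies. F1 never uses the true negatives, whereas MCC uses all four cells of the confusion matrix, so the two metrics can rank the same pair of classifiers in opposite directions.

```python
import math

def f1_score(tp, fp, fn, tn):
    # F1 = 2*TP / (2*TP + FP + FN); note that TN does not appear at all.
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); uses all four cells.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den

# Hypothetical test set: 100 defective and 900 clean modules (about 10% prevalence).
a = dict(tp=30, fp=0,  fn=70, tn=900)   # conservative classifier: perfect precision, low recall
b = dict(tp=55, fp=60, fn=45, tn=840)   # liberal classifier: higher recall, many false alarms

f1_a, f1_b = f1_score(**a), f1_score(**b)
mcc_a, mcc_b = mcc(**a), mcc(**b)
print(f"A: F1={f1_a:.3f}  MCC={mcc_a:.3f}")   # F1 ~ 0.462, MCC ~ 0.528
print(f"B: F1={f1_b:.3f}  MCC={mcc_b:.3f}")   # F1 ~ 0.512, MCC ~ 0.455

# The direction of the pairwise comparison is consistent only if both metrics
# prefer the same classifier; here F1 prefers B while MCC prefers A.
print("consistent direction:", (f1_a > f1_b) == (mcc_a > mcc_b))   # False
```

The final check mirrors the direction-of-comparison idea described in the Method: a pairwise result counts as changed when F1 and MCC disagree about which classifier performs better.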