Assessing software defection prediction performance: why using the Matthews correlation coefficient matters

ABSTRACT
Context: There is considerable diversity in the range and design of computational experiments to assess classifiers for software defect prediction. This is particularly so regarding the choice of classifier performance metric. Unfortunately, some widely used metrics are known to be biased, in particular F1.
Objective: We want to understand the extent to which the widespread use of F1 renders empirical results in software defect prediction unreliable.
Method: We searched for defect prediction studies that report both F1 and the Matthews correlation coefficient (MCC). This enabled us to determine the proportion of pairwise results that are consistent between the two metrics and the proportion whose direction changes.
Results: Our systematic review identifies 8 studies comprising 4017 pairwise results. Of these results, the direction of the comparison changes in 23% of cases when the unbiased MCC metric is used instead of F1.
Conclusion: We find compelling reasons why the choice of classification performance metric matters; specifically, the biased and misleading F1 metric should be deprecated. The worked example below illustrates how the two metrics can rank classifiers differently.
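To make the bias concrete, the following minimal sketch computes F1 and MCC from two hypothetical confusion matrices on the same imbalanced test set. The figures are invented for illustration and are not drawn from the reviewed studies. F1 never uses the true negatives, whereas MCC uses all four cells of the confusion matrix, so the two metrics can rank the same pair of classifiers in opposite directions.

```python
import math

def f1_score(tp, fp, fn, tn):
    # F1 = 2*TP / (2*TP + FP + FN); note that TN does not appear at all.
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); uses all four cells.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den

# Hypothetical test set: 100 defective and 900 clean modules (about 10% prevalence).
a = dict(tp=30, fp=0,  fn=70, tn=900)   # conservative classifier: perfect precision, low recall
b = dict(tp=55, fp=60, fn=45, tn=840)   # liberal classifier: higher recall, many false alarms

f1_a, f1_b = f1_score(**a), f1_score(**b)
mcc_a, mcc_b = mcc(**a), mcc(**b)
print(f"A: F1={f1_a:.3f}  MCC={mcc_a:.3f}")   # F1 ~ 0.462, MCC ~ 0.528
print(f"B: F1={f1_b:.3f}  MCC={mcc_b:.3f}")   # F1 ~ 0.512, MCC ~ 0.455

# The direction of the pairwise comparison is consistent only if both metrics
# prefer the same classifier; here F1 prefers B while MCC prefers A.
print("consistent direction:", (f1_a > f1_b) == (mcc_a > mcc_b))   # False
```

The final check mirrors the direction-of-comparison idea described in the Method: a pairwise result counts as changed when F1 and MCC disagree about which classifier performs better.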