ABSTRACT
Imbalanced data is a common problem in data mining classification tasks: samples of one class vastly outnumber those of the other classes. In this situation, many learning algorithms produce poor models because they optimize overall accuracy and therefore perform badly on the classes with few samples. Software engineering data in general, and defect prediction datasets in particular, are no exception. In this paper we compare different approaches to the imbalance problem in defect prediction, namely sampling, cost-sensitive, ensemble and hybrid approaches, applied to datasets with different levels of preprocessing. We use the well-known NASA datasets as curated by Shepperd et al. The results differ depending on the characteristics of the dataset and on the evaluation metric, especially when duplicates and inconsistencies are removed as a preprocessing step.
Further results and replication package: http://www.cc.uah.es/drg/ease14
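To make the compared families concrete, the sketch below contrasts a plain learner with a sampling approach (SMOTE), a cost-sensitive learner, and a hybrid ensemble (RUSBoost) on an imbalanced classification task. This is a minimal illustration, not the paper's experimental pipeline: the synthetic data (a self-contained stand-in for the NASA datasets), the random forest base learner, and the scikit-learn/imbalanced-learn libraries are all assumptions made for the example.

```python
# Minimal sketch comparing families of techniques for class imbalance;
# assumes scikit-learn and imbalanced-learn are installed. Synthetic data
# stands in for a defect dataset (~10% "defective" modules).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import RUSBoostClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

def report(name, clf, X_fit, y_fit):
    """Train on (X_fit, y_fit) and print test-set AUC and MCC."""
    clf.fit(X_fit, y_fit)
    pred = clf.predict(X_te)
    prob = clf.predict_proba(X_te)[:, 1]
    print(f"{name:>14}: AUC={roc_auc_score(y_te, prob):.3f}  "
          f"MCC={matthews_corrcoef(y_te, pred):.3f}")

# 1. Baseline: plain learner trained on the imbalanced data as-is.
report("baseline", RandomForestClassifier(random_state=42), X_tr, y_tr)

# 2. Sampling: SMOTE synthesizes minority-class samples before training.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
report("SMOTE", RandomForestClassifier(random_state=42), X_sm, y_sm)

# 3. Cost-sensitive: reweight misclassification costs instead of resampling.
report("cost-sensitive",
       RandomForestClassifier(class_weight="balanced", random_state=42),
       X_tr, y_tr)

# 4. Hybrid ensemble: RUSBoost combines random undersampling with boosting.
report("RUSBoost", RUSBoostClassifier(random_state=42), X_tr, y_tr)
```

Reporting AUC and the Matthews correlation coefficient, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), rather than plain accuracy matters here: on data that is 90% non-defective, a trivial classifier that predicts "no defect" for every module already achieves 90% accuracy while detecting nothing.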
REFERENCES
- E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2--17, 2010.
- S. Bibi, G. Tsoumakas, I. Stamelos, and I. Vlahavas. Software defect prediction using regression via classification. In IEEE International Conference on Computer Systems and Applications (AICCSA 2006), pages 330--336, Aug. 2006.
- L. Breiman. Bagging predictors. Machine Learning, 24(2):123--140, 1996.
- L. Breiman. Random forests. Machine Learning, 45(1):5--32, 2001.
- C. Catal and B. Diri. A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4):7346--7354, 2009.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321--357, 2002.
- N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), pages 107--119, 2003.
- J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML'06), pages 233--240, New York, NY, USA, 2006. ACM.
- J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1--30, Dec. 2006.
- P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 155--164, New York, NY, USA, 1999. ACM.
- K. O. Elish and M. O. Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81(5):649--660, 2008.
- T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861--874, June 2006.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139, 1997.
- Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Thirteenth International Conference on Machine Learning, pages 148--156, San Francisco, 1996. Morgan Kaufmann.
- M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4):463--484, 2012.
- M. Galar, A. Fernández, E. Barrenechea, and F. Herrera. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition, 2013.
- S. García, R. Aler, and I. M. Galván. Using evolutionary multiobjective techniques for imbalanced classification data. In K. Diamantaras, W. Duch, and L. S. Iliadis, editors, Artificial Neural Networks -- ICANN 2010, volume 6352 of Lecture Notes in Computer Science, pages 422--427. Springer Berlin Heidelberg, 2010.
- T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, in press, 2011.
- M. Halstead. Elements of Software Science. Elsevier, New York, 1977.
- J. Van Hulse and T. Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering, 68(12):1513--1542, 2009.
- N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429--449, Oct. 2002.
- T. M. Khoshgoftaar, E. Allen, and J. Deng. Using regression trees to classify fault-prone software modules. IEEE Transactions on Reliability, 51(4):455--462, 2002.
- T. M. Khoshgoftaar, E. Allen, J. Hudepohl, and S. Aud. Application of neural networks to software quality modeling of a very large telecommunications system. IEEE Transactions on Neural Networks, 8(4):902--909, 1997.
- T. M. Khoshgoftaar and N. Seliya. Analogy-based practical classification rules for software quality estimation. Empirical Software Engineering, 8(4):325--350, 2003.
- S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485--496, July--Aug. 2008.
- V. López, A. Fernández, and F. Herrera. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed. Information Sciences, 257:1--13, 2014.
- V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: Open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7):6585--6608, June 2012.
- B. W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 405(2):442--451, Oct. 1975.
- T. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308--320, Dec. 1976.
- T. Mende and R. Koschke. Revisiting the evaluation of defect prediction models. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering (PROMISE'09), pages 1--10, New York, NY, USA, 2009. ACM.
- T. Mende and R. Koschke. Effort-aware defect prediction models. In Proceedings of the 14th European Conference on Software Maintenance and Reengineering (CSMR'10), pages 107--116, Washington, DC, USA, 2010. IEEE Computer Society.
- T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.
- T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald. Problems with precision: A response to "Comments on 'Data mining static code attributes to learn defect predictors'". IEEE Transactions on Software Engineering, 33(9):637--640, 2007.
- T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2--13, 2007.
- T. Mitchell. Machine Learning. McGraw Hill, 1997.
- Y. Peng, G. Kou, G. Wang, H. Wang, and F. Ko. Empirical evaluation of classifiers for software risk management. International Journal of Information Technology & Decision Making (IJITDM), 8(4):749--767, 2009.
- Y. Peng, G. Wang, and H. Wang. User preferences based software defect detection algorithms selection using MCDM. Information Sciences, in press, 2010.
- J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
- J. Rodriguez, L. Kuncheva, and C. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619--1630, Oct. 2006.
- C. Seiffert, T. Khoshgoftaar, and J. Van Hulse. Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 39(6):1283--1294, 2009.
- C. Seiffert, T. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(1):185--197, 2010.
- M. Shepperd, Q. Song, Z. Sun, and C. Mair. Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on Software Engineering, 39(9):1208--1215, 2013.
- J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (ICML'07), pages 935--942, New York, NY, USA, 2007. ACM.
- O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer, and R. Haesen. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software, 81(5):823--839, 2008.
- D. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, SMC-2(3):408--421, 1972.
- I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition, 2011.
- H. Zhang and X. Zhang. Comments on "Data mining static code attributes to learn defect predictors". IEEE Transactions on Software Engineering, 33(9):635--637, 2007.