DOI: 10.1145/2601248.2601294
Research Article

Preliminary comparison of techniques for dealing with imbalance in software defect prediction

Published: 13 May 2014

ABSTRACT

Imbalanced data is a common problem in data mining when dealing with classification tasks: samples of one class vastly outnumber those of the other classes. In this situation, many data mining algorithms generate poor models because they try to optimize overall accuracy and therefore perform badly on the classes with very few samples. Software engineering data in general, and defect prediction datasets in particular, are no exception. In this paper we compare different approaches to the imbalance problem in defect prediction, namely sampling, cost-sensitive, ensemble, and hybrid techniques, across datasets prepared with different preprocessing steps. We use the well-known NASA datasets curated by Shepperd et al. The results vary with the characteristics of the dataset and the evaluation metric used, especially when duplicates and inconsistencies are removed as a preprocessing step.
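The cleaning step referred to above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' actual pipeline: it presumes the dataset is loaded as a pandas DataFrame and that the class label is stored in a hypothetical column named "defective". In the spirit of Shepperd et al.'s curation, it first drops exact duplicate rows and then drops inconsistent rows, i.e. rows whose metric values are identical but whose labels conflict.

    # Minimal cleaning sketch (assumptions: pandas DataFrame input,
    # a label column named "defective"); not the authors' pipeline.
    import pandas as pd

    def clean(df: pd.DataFrame, label: str = "defective") -> pd.DataFrame:
        features = [c for c in df.columns if c != label]
        df = df.drop_duplicates()  # remove exact duplicate rows
        # A group of identical feature vectors is inconsistent if it
        # maps to more than one distinct label; drop all such rows.
        n_labels = df.groupby(features)[label].transform("nunique")
        return df[n_labels == 1]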

Further results and a replication package are available at http://www.cc.uah.es/drg/ease14
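To make the four families of techniques concrete, the sketch below trains one representative of each family on synthetic imbalanced data and reports AUC and the Matthews correlation coefficient. It is only an illustration under assumptions, not the paper's experimental setup: it uses scikit-learn and imbalanced-learn rather than the tools in the replication package, a synthetic dataset with a roughly 10% minority class instead of the NASA datasets, and a class-weighted tree as a simple stand-in for MetaCost.

    # Hedged sketch of the four technique families (sampling,
    # cost-sensitive, ensemble, hybrid); synthetic data stands in
    # for the NASA defect datasets.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.ensemble import RUSBoostClassifier
    from imblearn.over_sampling import SMOTE

    # Assumption: ~10% minority ("defective") class.
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.9, 0.1], random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                              random_state=42)

    # Sampling: oversample the minority class, training split only.
    X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_tr, y_tr)

    models = {
        "SMOTE + tree (sampling)":
            (DecisionTreeClassifier(random_state=42), X_sm, y_sm),
        # Cost-sensitive: reweight errors on the minority class
        # (a simple stand-in for MetaCost).
        "weighted tree (cost-sensitive)":
            (DecisionTreeClassifier(class_weight="balanced",
                                    random_state=42), X_tr, y_tr),
        "random forest (ensemble)":
            (RandomForestClassifier(n_estimators=100,
                                    random_state=42), X_tr, y_tr),
        # Hybrid: random undersampling inside each boosting round.
        "RUSBoost (hybrid)":
            (RUSBoostClassifier(n_estimators=50,
                                random_state=42), X_tr, y_tr),
    }

    for name, (clf, Xf, yf) in models.items():
        clf.fit(Xf, yf)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        mcc = matthews_corrcoef(y_te, clf.predict(X_te))
        print(f"{name}: AUC={auc:.3f}  MCC={mcc:.3f}")

Note that SMOTE is applied after the train/test split; resampling before splitting would leak synthetic copies of test-set neighbors into training and inflate the reported metrics.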

References

  1. E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2–17, 2010.
  2. S. Bibi, G. Tsoumakas, I. Stamelos, and I. Vlahavas. Software defect prediction using regression via classification. In IEEE International Conference on Computer Systems and Applications (AICCSA 2006), pages 330–336, Aug. 2006.
  3. L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
  4. L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
  5. C. Catal and B. Diri. A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4):7346–7354, 2009.
  6. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
  7. N. Chawla, A. Lazarevic, L. Hall, and K. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), pages 107–119, 2003.
  8. J. Davis and M. Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML'06), pages 233–240, New York, NY, USA, 2006. ACM.
  9. J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, Dec. 2006.
  10. P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '99), pages 155–164, New York, NY, USA, 1999. ACM.
  11. K. O. Elish and M. O. Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81(5):649–660, 2008.
  12. T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
  13. Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
  14. Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Thirteenth International Conference on Machine Learning, pages 148–156, San Francisco, 1996. Morgan Kaufmann.
  15. M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4):463–484, 2012.
  16. M. Galar, A. Fernández, E. Barrenechea, and F. Herrera. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition, 2013.
  17. S. García, R. Aler, and I. M. Galván. Using evolutionary multiobjective techniques for imbalanced classification data. In K. Diamantaras, W. Duch, and L. S. Iliadis, editors, Artificial Neural Networks – ICANN 2010, volume 6352 of Lecture Notes in Computer Science, pages 422–427. Springer Berlin Heidelberg, 2010.
  18. T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, 38(6):1276–1304, 2012.
  19. M. Halstead. Elements of Software Science. Elsevier, New York, 1977.
  20. J. Van Hulse and T. Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering, 68(12):1513–1542, 2009.
  21. N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, Oct. 2002.
  22. T. M. Khoshgoftaar, E. Allen, and J. Deng. Using regression trees to classify fault-prone software modules. IEEE Transactions on Reliability, 51(4):455–462, 2002.
  23. T. M. Khoshgoftaar, E. Allen, J. Hudepohl, and S. Aud. Application of neural networks to software quality modeling of a very large telecommunications system. IEEE Transactions on Neural Networks, 8(4):902–909, 1997.
  24. T. M. Khoshgoftaar and N. Seliya. Analogy-based practical classification rules for software quality estimation. Empirical Software Engineering, 8(4):325–350, 2003.
  25. S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485–496, July–Aug. 2008.
  26. V. López, A. Fernández, and F. Herrera. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed. Information Sciences, 257:1–13, 2014.
  27. V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: Open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7):6585–6608, June 2012.
  28. B. W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 405(2):442–451, Oct. 1975.
  29. T. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, Dec. 1976.
  30. T. Mende and R. Koschke. Revisiting the evaluation of defect prediction models. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering (PROMISE'09), pages 1–10, New York, NY, USA, 2009. ACM.
  31. T. Mende and R. Koschke. Effort-aware defect prediction models. In Proceedings of the 14th European Conference on Software Maintenance and Reengineering (CSMR'10), pages 107–116, Washington, DC, USA, 2010. IEEE Computer Society.
  32. T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.
  33. T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald. Problems with precision: A response to "Comments on 'Data mining static code attributes to learn defect predictors'". IEEE Transactions on Software Engineering, 33(9):637–640, 2007.
  34. T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2–13, 2007.
  35. T. Mitchell. Machine Learning. McGraw-Hill, 1997.
  36. Y. Peng, G. Kou, G. Wang, H. Wang, and F. Ko. Empirical evaluation of classifiers for software risk management. International Journal of Information Technology & Decision Making, 8(4):749–767, 2009.
  37. Y. Peng, G. Wang, and H. Wang. User preferences based software defect detection algorithms selection using MCDM. Information Sciences, in press, 2010.
  38. J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
  39. J. Rodriguez, L. Kuncheva, and C. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619–1630, Oct. 2006.
  40. C. Seiffert, T. Khoshgoftaar, and J. Van Hulse. Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 39(6):1283–1294, 2009.
  41. C. Seiffert, T. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(1):185–197, 2010.
  42. M. Shepperd, Q. Song, Z. Sun, and C. Mair. Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on Software Engineering, 39(9):1208–1215, 2013.
  43. J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (ICML'07), pages 935–942, New York, NY, USA, 2007. ACM.
  44. O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer, and R. Haesen. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software, 81(5):823–839, 2008.
  45. D. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, 2(3):408–421, 1972.
  46. I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, third edition, 2011.
  47. H. Zhang and X. Zhang. Comments on "Data mining static code attributes to learn defect predictors". IEEE Transactions on Software Engineering, 33(9):635–637, 2007.

Published in

EASE '14: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering
May 2014, 486 pages
ISBN: 9781450324762
DOI: 10.1145/2601248
Copyright © 2014 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Overall acceptance rate: 71 of 232 submissions, 31%
