ABSTRACT
Imbalanced data is a common problem in data mining classification tasks: samples of one class vastly outnumber those of the other classes. In this situation, many learning algorithms produce poor models because they optimize overall accuracy and therefore perform badly on the classes with few samples. Software engineering data in general, and defect prediction datasets in particular, are no exception. In this paper we compare different approaches to the imbalance problem in defect prediction, namely sampling, cost-sensitive, ensemble and hybrid approaches, applied to datasets with different levels of preprocessing. We use the well-known NASA datasets as curated by Shepperd et al. The results differ depending on the characteristics of the dataset and on the evaluation metric, especially when duplicates and inconsistencies are removed as a preprocessing step.
Further results and replication package: http://www.cc.uah.es/drg/ease14
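To make the compared families concrete, the sketch below contrasts a plain learner with a sampling approach (SMOTE), a cost-sensitive learner, and a hybrid ensemble (RUSBoost) on an imbalanced classification task. This is a minimal illustration, not the paper's experimental pipeline: the synthetic data (a self-contained stand-in for the NASA datasets), the random forest base learner, and the scikit-learn/imbalanced-learn libraries are all assumptions made for the example.

```python
# Minimal sketch comparing families of techniques for class imbalance;
# assumes scikit-learn and imbalanced-learn are installed. Synthetic data
# stands in for a defect dataset (~10% "defective" modules).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import RUSBoostClassifier

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

def report(name, clf, X_fit, y_fit):
    """Train on (X_fit, y_fit) and print test-set AUC and MCC."""
    clf.fit(X_fit, y_fit)
    pred = clf.predict(X_te)
    prob = clf.predict_proba(X_te)[:, 1]
    print(f"{name:>14}: AUC={roc_auc_score(y_te, prob):.3f}  "
          f"MCC={matthews_corrcoef(y_te, pred):.3f}")

# 1. Baseline: plain learner trained on the imbalanced data as-is.
report("baseline", RandomForestClassifier(random_state=42), X_tr, y_tr)

# 2. Sampling: SMOTE synthesizes minority-class samples before training.
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
report("SMOTE", RandomForestClassifier(random_state=42), X_sm, y_sm)

# 3. Cost-sensitive: reweight misclassification costs instead of resampling.
report("cost-sensitive",
       RandomForestClassifier(class_weight="balanced", random_state=42),
       X_tr, y_tr)

# 4. Hybrid ensemble: RUSBoost combines random undersampling with boosting.
report("RUSBoost", RUSBoostClassifier(random_state=42), X_tr, y_tr)
```

Reporting AUC and the Matthews correlation coefficient, MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), rather than plain accuracy matters here: on data that is 90% non-defective, a trivial classifier that predicts "no defect" for every module already achieves 90% accuracy while detecting nothing.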
REFERENCES
- E. Arisholm, L. C. Briand, and E. B. Johannessen. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software, 83(1):2--17, 2010.
- S. Bibi, G. Tsoumakas, I. Stamelos, and I. Vlahavas. Software defect prediction using regression via classification. In IEEE International Conference on Computer Systems and Applications (AICCSA 2006), pages 330--336, Aug. 2006.
- L. Breiman. Bagging predictors. Machine Learning, 24(2):123--140, 1996.
- L. Breiman. Random forests. Machine Learning, 45(1):5--32, 2001.
- C. Catal and B. Diri. A systematic review of software fault prediction studies. Expert Systems with Applications, 36(4):7346--7354, 2009.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321--357, 2002.
- N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer. SMOTEBoost: Improving prediction of the minority class in boosting. In 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2003), pages 107--119, 2003.
- J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML'06), pages 233--240, New York, NY, USA, 2006. ACM.
- J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1--30, Dec. 2006.
- P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 155--164, New York, NY, USA, 1999. ACM.
- K. O. Elish and M. O. Elish. Predicting defect-prone software modules using support vector machines. Journal of Systems and Software, 81(5):649--660, 2008.
- T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861--874, June 2006.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119--139, 1997.
- Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Thirteenth International Conference on Machine Learning, pages 148--156, San Francisco, 1996. Morgan Kaufmann.
- M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4):463--484, 2012.
- M. Galar, A. Fernández, E. Barrenechea, and F. Herrera. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition, 2013.
- S. García, R. Aler, and I. M. Galván. Using evolutionary multiobjective techniques for imbalanced classification data. In K. Diamantaras, W. Duch, and L. S. Iliadis, editors, Artificial Neural Networks -- ICANN 2010, volume 6352 of Lecture Notes in Computer Science, pages 422--427. Springer Berlin Heidelberg, 2010.
- T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, in press, 2011.
- M. Halstead. Elements of Software Science. Elsevier, New York, 1977.
- J. Van Hulse and T. Khoshgoftaar. Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering, 68(12):1513--1542, 2009.
- N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429--449, Oct. 2002.
- T. M. Khoshgoftaar, E. Allen, and J. Deng. Using regression trees to classify fault-prone software modules. IEEE Transactions on Reliability, 51(4):455--462, 2002.
- T. M. Khoshgoftaar, E. Allen, J. Hudepohl, and S. Aud. Application of neural networks to software quality modeling of a very large telecommunications system. IEEE Transactions on Neural Networks, 8(4):902--909, 1997.
- T. M. Khoshgoftaar and N. Seliya. Analogy-based practical classification rules for software quality estimation. Empirical Software Engineering, 8(4):325--350, 2003.
- S. Lessmann, B. Baesens, C. Mues, and S. Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485--496, July--Aug. 2008.
- V. López, A. Fernández, and F. Herrera. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed. Information Sciences, 257:1--13, 2014.
- V. López, A. Fernández, J. G. Moreno-Torres, and F. Herrera. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification: Open problems on intrinsic data characteristics. Expert Systems with Applications, 39(7):6585--6608, June 2012.
- B. W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 405(2):442--451, Oct. 1975.
- T. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308--320, Dec. 1976.
- T. Mende and R. Koschke. Revisiting the evaluation of defect prediction models. In Proceedings of the 5th International Conference on Predictor Models in Software Engineering (PROMISE'09), pages 1--10, New York, NY, USA, 2009. ACM.
- T. Mende and R. Koschke. Effort-aware defect prediction models. In Proceedings of the 14th European Conference on Software Maintenance and Reengineering (CSMR'10), pages 107--116, Washington, DC, USA, 2010. IEEE Computer Society.
- T. Menzies, B. Caglayan, E. Kocaguneli, J. Krall, F. Peters, and B. Turhan. The PROMISE repository of empirical software engineering data, June 2012.
- T. Menzies, A. Dekhtyar, J. Distefano, and J. Greenwald. Problems with precision: A response to "Comments on 'Data mining static code attributes to learn defect predictors'". IEEE Transactions on Software Engineering, 33(9):637--640, 2007.
- T. Menzies, J. Greenwald, and A. Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1):2--13, 2007.
- T. Mitchell. Machine Learning. McGraw Hill, 1997.
- Y. Peng, G. Kou, G. Wang, H. Wang, and F. Ko. Empirical evaluation of classifiers for software risk management. International Journal of Information Technology & Decision Making (IJITDM), 8(4):749--767, 2009.
- Y. Peng, G. Wang, and H. Wang. User preferences based software defect detection algorithms selection using MCDM. Information Sciences, in press, 2010.
- J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
- J. Rodriguez, L. Kuncheva, and C. Alonso. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619--1630, Oct. 2006.
- C. Seiffert, T. Khoshgoftaar, and J. Van Hulse. Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 39(6):1283--1294, 2009.
- C. Seiffert, T. Khoshgoftaar, J. Van Hulse, and A. Napolitano. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(1):185--197, 2010.
- M. Shepperd, Q. Song, Z. Sun, and C. Mair. Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on Software Engineering, 39(9):1208--1215, 2013.
- J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (ICML'07), pages 935--942, New York, NY, USA, 2007. ACM.
- O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer, and R. Haesen. Mining software repositories for comprehensible software fault prediction models. Journal of Systems and Software, 81(5):823--839, 2008.
- D. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, SMC-2(3):408--421, 1972.
- I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd edition, 2011.
- H. Zhang and X. Zhang. Comments on "Data mining static code attributes to learn defect predictors". IEEE Transactions on Software Engineering, 33(9):635--637, 2007.