Published in: Innovations in Systems and Software Engineering 3/2023

1 June 2022 | Original Article

Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data

Authors: Tirimula Rao Benala, Karunya Tantati

Abstract

Software defect prediction (SDP) helps to analyze and identify defects in a software model in the early stages of software development. Identifying and removing these defects early makes the software more cost-efficient. Machine learning (ML) techniques have been used successfully to develop defect prediction models. However, these techniques deliver off-target results when applied to imbalanced datasets, that is, datasets with an unequal class distribution. ML techniques trained on such data produce biased predictions for minority class instances, which are more important than majority class instances. Therefore, the imbalanced data problem must be resolved to develop an efficient SDP model. In this study, we evaluated the prediction capability of ML classifiers for SDP on nine imbalanced NASA datasets by applying oversampling methods. We considered five oversampling methods, which replicate or synthesize minority class instances to balance the datasets. Once the datasets were balanced, the ML classifiers were used to develop a defect prediction model. The experimental results obtained by applying the ML classifiers to the imbalanced and the balanced data showed that sampling techniques enhance the learning capability of ML techniques: oversampling methods considerably improved the prediction performance of the ML classifiers.
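The abstract distinguishes oversampling methods that replicate minority class instances from those that synthesize new ones. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. The abstract does not name the five methods used in the study, so the function names `random_oversample` and `interpolate_oversample`, the parameters `k` and `seed`, and the toy data are all assumptions made for illustration. The second function follows the general SMOTE-style idea of interpolating between a minority instance and one of its nearest minority neighbours.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Replicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(list(rng.choice(rows)))  # copy of an existing row
            y_out.append(label)
    return X_out, y_out


def interpolate_oversample(X, y, minority, k=2, seed=0):
    """SMOTE-style synthesis (illustrative): each new row lies on the line
    segment between a minority row and one of its k nearest minority
    neighbours. Requires at least two minority rows."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    rows = [x for x, lab in zip(X, y) if lab == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(target - counts[minority]):
        base = rng.choice(rows)
        # Rank the other minority rows by squared Euclidean distance to base.
        others = [r for r in rows if r is not base]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        X_out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
        y_out.append(minority)
    return X_out, y_out


# Toy data (invented): six non-defective modules (0), two defective (1).
X_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
         [0.5, 0.5], [0.2, 0.8], [5.0, 5.0], [6.0, 7.0]]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

_, y_replicated = random_oversample(X_toy, y_toy)
_, y_synthesized = interpolate_oversample(X_toy, y_toy, minority=1)
print(Counter(y_replicated), Counter(y_synthesized))
```

In a real experiment, oversampling would normally be applied to the training split only, so that replicated or synthetic instances do not leak into the data used to evaluate the classifier.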

Metadata
Title
Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data
Authors
Tirimula Rao Benala
Karunya Tantati
Publication date
1 June 2022
Publisher
Springer London
Published in
Innovations in Systems and Software Engineering / Issue 3/2023
Print ISSN: 1614-5046
Electronic ISSN: 1614-5054
DOI
https://doi.org/10.1007/s11334-022-00457-3
