Published in: Artificial Intelligence Review 3/2022

14 August 2021

Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction

Author: Somya Goyal


Abstract

Software Defect Prediction (SDP) is a crucial task in the software development process: it forecasts which modules are more prone to errors and faults before the testing phase begins. It aims to reduce the development cost of the software by focusing testing effort on the modules predicted to be faulty. Although SDP ensures on-time delivery of a good-quality end product, class imbalance in the datasets is a major hindrance to it. This paper proposes a novel Neighbourhood-based Under-Sampling (N-US) algorithm to handle the class-imbalance issue and demonstrates its effectiveness in attaining high accuracy when predicting defective modules. N-US under-samples the dataset to maximize the visibility of minority data points while restricting excessive elimination of majority data points, so as to avoid information loss. To assess the applicability of N-US, it is compared with three standard under-sampling techniques. Further, this study investigates the performance of N-US as a trusted ally for SDP classifiers. Extensive experiments are conducted on benchmark datasets from the NASA repository: CM1, JM1, KC1, KC2 and PC1. The proposed SDP classifier with N-US is statistically compared with baseline models to assess the effectiveness of the N-US algorithm for SDP. The proposed model outperforms the other candidate SDP models, with the highest AUC score (95.6%), the maximum accuracy (96.9%) and the ROC curve closest to the top-left corner. Its prediction power is statistically the best at a confidence level of 95%.
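
The abstract does not spell out the steps of N-US, so the following is only a rough sketch of the general idea it describes: remove majority points that crowd the neighbourhood of minority points, but cap how many are removed to limit information loss. It is not the author's algorithm. The function name `neighbourhood_undersample` and the parameters `k`, `minority_label` and `max_removal` are assumptions made for illustration.

```python
import math


def neighbourhood_undersample(X, y, k=3, minority_label=1, max_removal=0.5):
    """Drop majority points that fall inside minority k-neighbourhoods.

    Illustrative sketch only; NOT the N-US algorithm from the paper.
    X is a list of numeric feature tuples, y the matching class labels.
    max_removal caps the fraction of majority points that may be dropped.
    """
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]

    # For each majority point, count how many minority neighbourhoods contain it.
    hits = {j: 0 for j in majority}
    for i in minority:
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        for j in others[:k]:
            if j in hits:
                hits[j] += 1

    # Remove the most intrusive majority points first, within the removal budget.
    budget = int(max_removal * len(majority))
    candidates = sorted((j for j in majority if hits[j] > 0),
                        key=lambda j: -hits[j])
    removed = set(candidates[:budget])

    keep = [i for i in range(len(X)) if i not in removed]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): one defective module (label 1) surrounded by clean ones.
X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0),
     (5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]
y = [1, 0, 0, 0, 0, 0, 0]
X_res, y_res = neighbourhood_undersample(X, y, k=2)
```

In this toy run the two majority points nearest the single minority point are dropped and every minority point is kept. Lowering `max_removal` trades minority visibility for retaining more majority data, which mirrors the balance the abstract describes.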


Metadata
Title
Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction
Author
Somya Goyal
Publication date
14 August 2021
Publisher
Springer Netherlands
Published in
Artificial Intelligence Review / Issue 3/2022
Print ISSN: 0269-2821
Electronic ISSN: 1573-7462
DOI
https://doi.org/10.1007/s10462-021-10044-w
