Innovations in Systems and Software Engineering

1 June 2022 | Original Article

Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data

Authors: Tirimula Rao Benala, Karunya Tantati

Published in: Innovations in Systems and Software Engineering | Issue 3/2023

Abstract

Software defect prediction (SDP) is essential for analyzing and identifying defects in a software model in the early stages of software development. Identifying and removing these defects early yields cost-efficient software. Machine learning (ML) techniques have been successfully used for developing defect prediction models. However, these techniques deliver off-target results when applied to imbalanced datasets, that is, datasets with an unequal class distribution. ML techniques trained on such data produce biased predictions for minority class instances, which are more important than majority class instances. Therefore, the imbalanced data problem must be resolved to successfully develop an efficient SDP model. In this study, we evaluated the prediction capability of ML classifiers for software defect prediction on nine imbalanced NASA datasets by applying oversampling methods. We considered five oversampling methods, which replicate or synthesize minority class instances, to eliminate the imbalance in the datasets. Once the datasets were balanced, the ML classifiers were used to develop a defect prediction model. Experimental results from applying the ML classifiers to both the imbalanced and the balanced data showed that sampling enhanced the learning capability of the ML techniques. Oversampling methods considerably improved the prediction performance of the ML classifiers.
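
To make the idea of replicating or synthesizing minority class instances concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not name the five methods here, so the two functions (`random_oversample`, `interpolate_oversample`), the parameter `k`, and the toy data are all illustrative assumptions. The first function replicates existing minority rows; the second synthesizes new rows by interpolating between a minority row and one of its nearest minority neighbours, which is the general idea behind SMOTE-style methods.

```python
import random


def random_oversample(minority, target_size, rng):
    """Replicate randomly chosen minority rows until target_size rows exist."""
    missing = target_size - len(minority)
    extra = [list(rng.choice(minority)) for _ in range(missing)]
    return [list(row) for row in minority] + extra


def interpolate_oversample(minority, target_size, rng, k=3):
    """Synthesize rows between a minority row and a near minority neighbour.

    SMOTE-style sketch (an assumption, not the paper's exact method).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        base = rng.choice(minority)
        # k nearest minority rows to the chosen base row, excluding itself
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sq_dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return [list(row) for row in minority] + synthetic


# Hypothetical toy data: two code metrics per module (not the NASA datasets).
clean = [[10, 1], [12, 2], [11, 1], [9, 1], [13, 2], [10, 2], [12, 1], [11, 2]]
defective = [[40, 7], [45, 9], [38, 6]]

rng = random.Random(0)
replicated = random_oversample(defective, len(clean), rng)
synthesized = interpolate_oversample(defective, len(clean), rng)
print(len(clean), len(replicated), len(synthesized))  # 8 8 8
```

In either case the balanced minority set would then be combined with the majority rows before a classifier is trained. In a real evaluation, oversampling should typically be applied to the training split only, so that replicated or synthetic rows do not leak into the test data.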

Title: Efficiency of oversampling methods for enhancing software defect prediction by using imbalanced data
Authors: Tirimula Rao Benala, Karunya Tantati
Publication date: 1 June 2022
Publisher: Springer London
Published in: Innovations in Systems and Software Engineering / Issue 3/2023
Print ISSN: 1614-5046
Electronic ISSN: 1614-5054
DOI: https://doi.org/10.1007/s11334-022-00457-3
