09.04.2024 | Regular Paper

Noise-free sampling with majority framework for an imbalanced classification problem

Published in: Knowledge and Information Systems

Abstract

Class imbalance is widely accepted as a significant factor that degrades a machine learning classifier's performance. One way to mitigate the problem is to balance the data distribution with sampling-based approaches, in which synthetic data are generated from the probability distribution of the classes. However, this process is sensitive to noise in the data, which blurs the boundary between the majority and minority classes and shifts the algorithm's decision boundary away from the ideal one. In this work, we propose a hybrid framework with two primary objectives: first, to address class distribution imbalance by synthetically increasing the minority-class data; and second, to devise an efficient noise reduction technique that improves the class balancing algorithm. The proposed framework focuses on removing noisy elements from the majority class, thereby providing more accurate information to the subsequent synthetic data generation algorithm. To evaluate the effectiveness of our framework, we use the geometric mean (G-mean) as the evaluation metric. Experimental results show that our framework improves the prediction G-mean of eight classifiers across eleven datasets, with improvements ranging from 7.78% on the Loan dataset to 67.45% on the Abalone19_vs_10-11-12-13 dataset.
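The abstract describes a two-stage pipeline: first remove noisy samples from the majority class, then generate synthetic minority samples, and finally score the classifier with the G-mean. The sketch below illustrates that general pattern with off-the-shelf components from scikit-learn and imbalanced-learn (Edited Nearest Neighbours as the cleaning step, SMOTE as the synthetic data generator); these are illustrative stand-ins, not the authors' proposed framework, and the dataset and classifier are placeholders.

```python
# Illustrative sketch only: a generic "clean the majority class, then
# oversample the minority class" pipeline, scored with the G-mean.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score

# Placeholder data: a synthetic binary problem with roughly 9:1 imbalance.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

pipe = Pipeline(steps=[
    # Step 1: remove noisy majority-class samples near the class boundary.
    ("clean", EditedNearestNeighbours(sampling_strategy="majority")),
    # Step 2: generate synthetic minority samples from the cleaned data.
    ("oversample", SMOTE(random_state=0)),
    # Step 3: train any downstream classifier on the rebalanced data.
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_train, y_train)

# G-mean = sqrt(sensitivity * specificity); it penalises classifiers that
# ignore the minority class, which is why it is used as the metric here.
y_pred = pipe.predict(X_test)
print(f"G-mean on the test set: {geometric_mean_score(y_test, y_pred):.3f}")
```

In the paper's framework, the cleaning step is the proposed majority-class noise removal procedure and the oversampler is its synthetic data generator, but the overall flow (clean, oversample, fit, evaluate by G-mean) follows the same shape as this sketch.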

Metadata
Title
Noise-free sampling with majority framework for an imbalanced classification problem
Publication date
09.04.2024
Published in
Knowledge and Information Systems
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-024-02079-6
