Computing

21 October 2020 | Special Issue Article

Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic

Authors: Seunghyun Park, Hyunhee Park

Published in: Computing | Issue 3/2021

Abstract

Network traffic data typically consist of a large majority of normal traffic and a small minority of attack traffic. This class imbalance degrades prediction performance, for example through prediction bias on the minority data and misclassification of normal data as outliers. Representative sampling methods for addressing the imbalance include various oversampling-based models that synthesize minority data. However, because oversampling involves repeatedly learning the same data, the classification model can overfit the training data. Undersampling methods, in turn, can cause information loss because they remove data. To improve on these oversampling and undersampling approaches, we propose an oversampling ensemble method based on the slow-start algorithm. The proposed combined oversampling and undersampling method based on the slow-start algorithm (COUSS) is modeled on the congestion control algorithm of the Transmission Control Protocol (TCP). Starting from a minimally undersampled dataset, COUSS oversamples the imbalanced dataset until overfitting occurs. Simulation results on the KDD99 dataset show that the proposed COUSS method improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% compared with the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling, and generative adversarial network oversampling algorithms, respectively. The COUSS method can therefore be regarded as a practical solution for data analysis applications.
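
The slow-start idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the COUSS procedure in detail, so everything below is an assumption: the function name `slow_start_oversampling`, the doubling schedule (borrowed from TCP slow start, where the congestion window doubles each round trip), and the stopping rule that treats the first drop in a validation score as the onset of overfitting. The `evaluate` callback is hypothetical and stands in for "train on the undersampled majority plus `n` synthetic minority samples and return a validation F1 score".

```python
from typing import Callable, List, Tuple


def slow_start_oversampling(
    evaluate: Callable[[int], float],
    initial_step: int = 1,
    max_synthetic: int = 1024,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Double the number of synthetic minority samples each round and stop
    at the first drop in validation score (assumed here to signal overfitting).

    Returns the best sample count, its score, and the full (count, score) history.
    """
    # Baseline: no synthetic samples added to the undersampled dataset.
    best_n, best_score = 0, evaluate(0)
    history: List[Tuple[int, float]] = [(0, best_score)]

    n = initial_step
    while n <= max_synthetic:
        score = evaluate(n)
        history.append((n, score))
        if score < best_score:
            break  # score fell: keep the previous amount of oversampling
        best_n, best_score = n, score
        n *= 2  # exponential growth, as in TCP slow start
    return best_n, best_score, history


def toy_validation_f1(n_synthetic: int) -> float:
    """Hypothetical stand-in for a real train/validate cycle: peaks at 64 samples."""
    return 1.0 - abs(n_synthetic - 64) / 1000


best_n, best_score, history = slow_start_oversampling(toy_validation_f1)
print(best_n, best_score)
```

In a real experiment, `evaluate` would wrap an oversampler such as SMOTE and a classifier; the toy function here only makes the control flow visible and says nothing about the results reported in the article.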

Title: Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic
Authors: Seunghyun Park, Hyunhee Park
Publication date: 21 October 2020
Publisher: Springer Vienna
Published in: Computing / Issue 3/2021
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI: https://doi.org/10.1007/s00607-020-00854-1
