Published in: International Journal of Machine Learning and Cybernetics 10/2023

10.05.2023 | Original Article

OUBoost: boosting based over and under sampling technique for handling imbalanced data

By: Sahar Hassanzadeh Mostafaei, Jafar Tanha


Abstract

Most real-world datasets contain imbalanced data. Learning from datasets where the number of samples in one class (minority) is much smaller than in another class (majority) produces classifiers biased toward the majority class. On such datasets, overall prediction accuracy can exceed 90% even while accuracy on the minority class remains comparatively poor. In this paper, we first propose a new under-sampling technique for the majority class of imbalanced datasets, based on the Peak clustering method. We then propose a novel boosting-based algorithm for learning from imbalanced datasets, named OUBoost, which combines the proposed Peak under-sampling algorithm with an over-sampling technique (SMOTE) inside the boosting procedure. In OUBoost, misclassified examples are not given equal weights: the algorithm selects informative examples from the majority class and creates synthetic examples for the minority class, thereby updating the sample weights indirectly. We designed experiments using several evaluation metrics, such as Recall, MCC, G-mean, and F-score, on 30 real-world imbalanced datasets. The results show improved prediction performance on the minority class for most of the datasets when using OUBoost. We further report time comparisons and statistical tests to analyze the proposed algorithm in more detail.
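The two resampling steps the abstract combines can be illustrated with a toy sketch. The following is not the authors' implementation: `peak_undersample` is a rough stand-in for the paper's density-peak-based selection (it simply keeps the majority points of highest local density), and `smote` is a minimal, standard SMOTE-style interpolation; all function names, the density cutoff, and the toy data are assumptions for illustration. In OUBoost these steps would run inside each boosting round rather than once up front.

```python
import math
import random

def local_density(points, cutoff):
    """Count neighbours within `cutoff` of each point (density-peak style)."""
    return [sum(1 for q in points if q is not p and math.dist(p, q) < cutoff)
            for p in points]

def peak_undersample(majority, n_keep, cutoff=1.0):
    """Keep the n_keep majority points of highest local density
    (a crude stand-in for density-peak representative selection)."""
    dens = local_density(majority, cutoff)
    order = sorted(range(len(majority)), key=lambda i: dens[i], reverse=True)
    return [majority[i] for i in order[:n_keep]]

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic

# Toy imbalanced data: 20 majority vs 5 minority points in 2-D.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(5)]

balanced_majority = peak_undersample(majority, n_keep=10)
balanced_minority = minority + smote(minority, n_new=5)
print(len(balanced_majority), len(balanced_minority))  # 10 10
```

Combining both directions, as the abstract argues, avoids the information loss of pure under-sampling and the overfitting risk of pure over-sampling.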

Metadata
Title
OUBoost: boosting based over and under sampling technique for handling imbalanced data
By
Sahar Hassanzadeh Mostafaei
Jafar Tanha
Publication date
10.05.2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 10/2023
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01839-0
