Published in: Neural Computing and Applications 22/2021

21.06.2021 | Original Article

ODBOT: Outlier detection-based oversampling technique for imbalanced datasets learning

Written by: Mohammed H. IBRAHIM


Abstract

In many real-world problems, datasets are imbalanced: the samples of the majority classes far outnumber the samples of the minority classes. In general, machine learning and data mining classification algorithms perform poorly on imbalanced datasets. In recent years, various oversampling techniques have been developed in the literature to solve the class imbalance problem. Unfortunately, few of these oversampling techniques can be extended to account for the relationship between the classes or to exploit the correlation between attributes. Moreover, in most cases, the existing oversampling techniques do not handle multi-class imbalanced datasets. To this end, this paper proposes a simple but effective outlier detection-based oversampling technique (ODBOT) to handle the multi-class imbalance problem. In ODBOT, outlier samples are detected by clustering within the minority class(es), and synthetic samples are then generated with these outlier samples taken into account. ODBOT generates efficient and consistent synthetic samples for the minority class(es) by analyzing the dissimilarity relationships among the attribute values of all classes. Moreover, ODBOT can reduce the risk of overlap among different class regions and can build a better classification model. The performance of ODBOT is evaluated in extensive experiments using 60 commonly used imbalanced datasets and five classification algorithms. The experimental results show that ODBOT consistently outperformed other common and state-of-the-art techniques in terms of various evaluation criteria.
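The abstract gives only the outline of the method (cluster the minority class, detect outliers, generate synthetic samples with those outliers in mind), so the following is a minimal sketch of that general idea and not the ODBOT algorithm itself. Everything specific here is an assumption made for illustration: plain k-means as the clustering step, a "mean plus z standard deviations" distance cutoff as the outlier rule, SMOTE-style linear interpolation between non-outlier samples of the same cluster as the generation step, and the hypothetical names `kmeans`, `oversample_minority`, `k`, and `z`.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (illustrative helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance of every point to every centroid, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def oversample_minority(X_min, n_new, k=2, z=2.0, seed=0):
    """Assumed stand-in for outlier-aware oversampling of one minority class.

    Returns (synthetic_samples, outlier_mask).
    """
    rng = np.random.default_rng(seed)
    centroids, labels = kmeans(X_min, k, seed=seed)
    dist = np.linalg.norm(X_min - centroids[labels], axis=1)

    # Assumed outlier rule: unusually far from the own cluster centroid.
    outlier = np.zeros(len(X_min), dtype=bool)
    for j in range(k):
        in_j = labels == j
        if in_j.sum() > 1:
            cutoff = dist[in_j].mean() + z * dist[in_j].std()
            outlier[in_j] = dist[in_j] > cutoff

    # Interpolate only between non-outliers that share a cluster, so the
    # synthetic points stay inside regions the minority class already occupies.
    inliers = np.flatnonzero(~outlier)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.choice(inliers)
        same_cluster = inliers[labels[inliers] == labels[a]]
        b = rng.choice(same_cluster)
        t = rng.random()
        synthetic[i] = X_min[a] + t * (X_min[b] - X_min[a])
    return synthetic, outlier
```

Calling `oversample_minority(X_min, n_new=50)` once per minority class would extend the sketch to the multi-class setting. Excluding the flagged samples from interpolation is one plausible way to limit overlap between class regions; the paper may treat outliers differently.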

Metadata
Title
ODBOT: Outlier detection-based oversampling technique for imbalanced datasets learning
Written by
Mohammed H. IBRAHIM
Publication date
21.06.2021
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 22/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-021-06198-x
