Published in: Neural Computing and Applications 2/2023

26 September 2022 | Original Article

Distance-based arranging oversampling technique for imbalanced data

Authors: Qi Dai, Jian-wei Liu, Jia-Liang Zhao

Abstract

Class-imbalanced data sets are common across a wide variety of real-world application areas. The synthetic minority oversampling technique (SMOTE) is an important method for processing imbalanced data sets. SMOTE requires the user to preset the number of nearest-neighbor instances before synthesizing instances, and this value is often difficult to choose accurately. Moreover, SMOTE tends to synthesize minority instances in majority regions, which degrades classifier performance. To address these issues, this paper proposes a novel distance-based arranging oversampling (DAO) technique. DAO effectively avoids the problem of users selecting inaccurate hyperparameters, and it can serve as an alternative to SMOTE-based oversampling techniques. We further filter the synthesized instances by setting appropriate conditions to avoid generating minority instances in the majority domain. In our experiments, we collect 25 public benchmark data sets from the KEEL and HDDT databases and apply the CART and ID3 classification models to the oversampled training set of each data set to assess DAO. Under two evaluation metrics, F-measure and kappa, the proposed method is superior or partially superior to state-of-the-art oversampling techniques.
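
To make the two SMOTE limitations named in the abstract concrete, the following is a minimal sketch of classic SMOTE-style interpolation. It is not the DAO method and not the authors' code. The function name `smote_sketch` and its parameters `n_new`, `k`, and `seed` are hypothetical names chosen for illustration, and Euclidean distance is assumed.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper, not DAO).

    X_min : array of minority-class instances, shape (n, d)
    n_new : number of synthetic instances to generate
    k     : number of nearest minority neighbours, preset by the user
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority instances")
    k = min(k, n - 1)  # a point has at most n - 1 other minority neighbours

    # Pairwise Euclidean distances between minority instances.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base point and one of its k neighbours for each new instance.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment between the two points.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=10, k=2)
```

In this sketch, `k` is the user-set neighbour count that the abstract describes as difficult to choose. The interpolation step never consults the majority class, which is why a synthetic point can land in a majority region when the chosen neighbour lies across one.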

Metadata
Title
Distance-based arranging oversampling technique for imbalanced data
Authors
Qi Dai
Jian-wei Liu
Jia-Liang Zhao
Publication date
26.09.2022
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 2/2023
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-022-07828-8
