Skip to main content
Top
Published in: Data Mining and Knowledge Discovery 4/2021

18-05-2021

An overlap sensitive neural network for class imbalanced data

Authors: Shaukat Ali Shahee, Usha Ananthakumar

Published in: Data Mining and Knowledge Discovery | Issue 4/2021

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Class imbalance is one of the well-known challenges in machine learning. Class imbalance occurs when one class dominates the other class in terms of the number of observations. Due to this imbalance, conventional classifiers fail to classify the minority class correctly. The challenges become even more severe when class overlap occurs in imbalanced data. Though literature is available to sequentially deal with class imbalance and class overlap, these methods are quite complex and not so efficient. In this paper, we propose an overlap-sensitive artificial neural network that can handle the problem of class overlapping and class imbalance simultaneously, along with noisy and outlier observations. The strength of this method lies in identifying the overlapping observations rather than the region and in not using multiple classifiers unlike the other existing methods. The key idea of the proposed method is in weighing the observations based on its location in the feature space before training the neural network. The performance of the proposed method is evaluated on 12 simulated data sets and 23 real-life data sets and compared with other well known methods.The results clearly indicate the strength and ability of the proposed method for a wide variety of imbalance ratio and levels of overlapping. Also, it is shown that the proposed method is statistically superior to the other methods in terms of different performance measures.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
go back to reference Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multiple Valued Logic Soft Comput 17:255–287 Alcalá-Fdez J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2011) Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Multiple Valued Logic Soft Comput 17:255–287
go back to reference Alibeigi M, Hashemi S, Hamzeh A (2012) DBFS: an effective density based feature selection scheme for small sample size and high dimensional imbalanced data sets. Data Knowl Eng 81:67–103CrossRef Alibeigi M, Hashemi S, Hamzeh A (2012) DBFS: an effective density based feature selection scheme for small sample size and high dimensional imbalanced data sets. Data Knowl Eng 81:67–103CrossRef
go back to reference Alshomrani S, Bawakid A, Shim S-O, Fernández A, Herrera F (2015) A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl-Based Syst 73:1–17CrossRef Alshomrani S, Bawakid A, Shim S-O, Fernández A, Herrera F (2015) A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl-Based Syst 73:1–17CrossRef
go back to reference Barua S, Islam MM, Yao X, Murase K (2012) Mwmote-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425CrossRef Barua S, Islam MM, Yao X, Murase K (2012) Mwmote-majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowl Data Eng 26(2):405–425CrossRef
go back to reference Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29CrossRef Batista GE, Prati RC, Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 6(1):20–29CrossRef
go back to reference Batista GE, Prati RC, Monard MC (2005) Balancing strategies and class overlapping. In: International symposium on intelligent data analysis. Springer, Berlin, pp 24–35 Batista GE, Prati RC, Monard MC (2005) Balancing strategies and class overlapping. In: International symposium on intelligent data analysis. Springer, Berlin, pp 24–35
go back to reference Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159CrossRef Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159CrossRef
go back to reference Burez J, Van den Poel D (2009) Handling class imbalance in customer churn prediction. Expert Syst Appl 36(3):4626–4636CrossRef Burez J, Van den Poel D (2009) Handling class imbalance in customer churn prediction. Expert Syst Appl 36(3):4626–4636CrossRef
go back to reference Ceci M, Pio G, Kuzmanovski V, Džeroski S (2015) Semi-supervised multi-view learning for gene network reconstruction. PLoS ONE 10(12):e0144031CrossRef Ceci M, Pio G, Kuzmanovski V, Džeroski S (2015) Semi-supervised multi-view learning for gene network reconstruction. PLoS ONE 10(12):e0144031CrossRef
go back to reference Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357CrossRef Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357CrossRef
go back to reference Chawla NV, Japkowicz N, Kotcz A (2004) Special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl 6(1):1–6CrossRef Chawla NV, Japkowicz N, Kotcz A (2004) Special issue on learning from imbalanced data sets. ACM SIGKDD Explor Newsl 6(1):1–6CrossRef
go back to reference Cleofas-Sánchez L, García V, Marqués A, Sánchez JS (2016) Financial distress prediction using the hybrid associative memory with translation. Appl Soft Comput 44:144–152CrossRef Cleofas-Sánchez L, García V, Marqués A, Sánchez JS (2016) Financial distress prediction using the hybrid associative memory with translation. Appl Soft Comput 44:144–152CrossRef
go back to reference Cui Y, Jia M, Lin T-Y, Song Y, Belongie S (2019) Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 9268–9277 Cui Y, Jia M, Lin T-Y, Song Y, Belongie S (2019) Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 9268–9277
go back to reference Das B, Krishnan NC, Cook DJ (2013) Handling class overlap and imbalance to detect prompt situations in smart homes. In: 2013 IEEE 13th international conference on data mining workshops. IEEE, pp 266–273 Das B, Krishnan NC, Cook DJ (2013) Handling class overlap and imbalance to detect prompt situations in smart homes. In: 2013 IEEE 13th international conference on data mining workshops. IEEE, pp 266–273
go back to reference Elkan C (2001) The foundations of cost-sensitive learning. In: International joint conference on artificial intelligence. vol 17. Lawrence Erlbaum Associates Ltd, pp 973–978 Elkan C (2001) The foundations of cost-sensitive learning. In: International joint conference on artificial intelligence. vol 17. Lawrence Erlbaum Associates Ltd, pp 973–978
go back to reference Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36MathSciNetCrossRef Estabrooks A, Jo T, Japkowicz N (2004) A multiple resampling method for learning from imbalanced data sets. Comput Intell 20(1):18–36MathSciNetCrossRef
go back to reference Guo H, Viktor HL (2004a) Boosting with data generation: improving the classification of hard to learn examples. In: International conference on industrial, engineering and other applications of applied intelligent systems. Springer Berlin, pp 1082–1091 Guo H, Viktor HL (2004a) Boosting with data generation: improving the classification of hard to learn examples. In: International conference on industrial, engineering and other applications of applied intelligent systems. Springer Berlin, pp 1082–1091
go back to reference Guo H, Viktor HL (2004b) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explor Newsl 6(1):30–39CrossRef Guo H, Viktor HL (2004b) Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explor Newsl 6(1):30–39CrossRef
go back to reference Han H, Wang W-Y, Mao B-H (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing. Springer, Berlin, pp 878–887 Han H, Wang W-Y, Mao B-H (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International conference on intelligent computing. Springer, Berlin, pp 878–887
go back to reference He H, Garcia EA (2008) Learning from imbalanced data. IEEE Trans Knowl Data Eng 9:1263–1284 He H, Garcia EA (2008) Learning from imbalanced data. IEEE Trans Knowl Data Eng 9:1263–1284
go back to reference He H, Bai Y, Garcia EA, Li S (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In: IEEE international joint conference on neural networks, 2008. IJCNN 2008. IEEE world congress on computational intelligence. IEEE, pp 1322–1328 He H, Bai Y, Garcia EA, Li S (2008) Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In: IEEE international joint conference on neural networks, 2008. IJCNN 2008. IEEE world congress on computational intelligence. IEEE, pp 1322–1328
go back to reference Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310CrossRef Huang J, Ling CX (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299–310CrossRef
go back to reference Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intelligent data analysis 6(5):429–449CrossRef Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intelligent data analysis 6(5):429–449CrossRef
go back to reference Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49CrossRef Jo T, Japkowicz N (2004) Class imbalances versus small disjuncts. ACM SIGKDD Explor Newsl 6(1):40–49CrossRef
go back to reference Lee HK, Kim SB (2018) An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst Appl 98:72–83CrossRef Lee HK, Kim SB (2018) An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst Appl 98:72–83CrossRef
go back to reference Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp 2980–2988 Lin T-Y, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp 2980–2988
go back to reference López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141CrossRef López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141CrossRef
go back to reference McClelland JL, Rumelhart DE, Hinton GE (1988) The appeal of parallel distributed processing. Morgan Kaufmann, BurlingtonCrossRef McClelland JL, Rumelhart DE, Hinton GE (1988) The appeal of parallel distributed processing. Morgan Kaufmann, BurlingtonCrossRef
go back to reference Piras L, Giacinto G (2012) Synthetic pattern generation for imbalanced learning in image retrieval. Pattern Recognit Lett 33(16):2198–2205CrossRef Piras L, Giacinto G (2012) Synthetic pattern generation for imbalanced learning in image retrieval. Pattern Recognit Lett 33(16):2198–2205CrossRef
go back to reference Prati RC, Batista GE, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Mexican international conference on artificial intelligence. Springer, Berlin, pp 312–321 Prati RC, Batista GE, Monard MC (2004) Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Mexican international conference on artificial intelligence. Springer, Berlin, pp 312–321
go back to reference Provost FJ, Fawcett T et al (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: KDD-97 Proceedings, vol. 97. American Association for Artificial Intelligence, pp 43–48 Provost FJ, Fawcett T et al (1997) Analysis and visualization of classifier performance: comparison under imprecise class and cost distributions. In: KDD-97 Proceedings, vol. 97. American Association for Artificial Intelligence, pp 43–48
go back to reference Qu Y, Su H, Guo L, Chu J (2011) A novel SVM modeling approach for highly imbalanced and overlapping classification. Intell Data Anal 15(3):319–341CrossRef Qu Y, Su H, Guo L, Chu J (2011) A novel SVM modeling approach for highly imbalanced and overlapping classification. Intell Data Anal 15(3):319–341CrossRef
go back to reference Richardson A (2010) Nonparametric statistics for non-statisticians: a step-by-step approach by Gregory W. Corder, Dale I. Foreman. Int Stat Rev 78(3):451–452CrossRef Richardson A (2010) Nonparametric statistics for non-statisticians: a step-by-step approach by Gregory W. Corder, Dale I. Foreman. Int Stat Rev 78(3):451–452CrossRef
go back to reference Shahee SA, Ananthakumar U (2018a) An adaptive oversampling technique for imbalanced datasets. In: Industrial conference on data mining. Springer, Berlin, pp 1–16 Shahee SA, Ananthakumar U (2018a) An adaptive oversampling technique for imbalanced datasets. In: Industrial conference on data mining. Springer, Berlin, pp 1–16
go back to reference Shahee SA, Ananthakumar U (2018b) Synthetic sampling approach based on model-based clustering for imbalanced data. Int J Artif Intell Soft Comput 6(4):348–364CrossRef Shahee SA, Ananthakumar U (2018b) Synthetic sampling approach based on model-based clustering for imbalanced data. Int J Artif Intell Soft Comput 6(4):348–364CrossRef
go back to reference Shahee SA, Ananthakumar U (2019) An effective distance based feature selection approach for imbalanced data. Appl Intell 5:1–29 Shahee SA, Ananthakumar U (2019) An effective distance based feature selection approach for imbalanced data. Appl Intell 5:1–29
go back to reference Simard PY, Steinkraus D, Platt JC et al (2003) Best practices for convolutional neural networks applied to visual document analysis. In: Icdar. vol 3 Simard PY, Steinkraus D, Platt JC et al (2003) Best practices for convolutional neural networks applied to visual document analysis. In: Icdar. vol 3
go back to reference Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378CrossRef Sun Y, Kamel MS, Wong AK, Wang Y (2007) Cost-sensitive boosting for classification of imbalanced data. Pattern Recognit 40(12):3358–3378CrossRef
go back to reference Tang Y, Gao J (2007) Improved classification for problem involving overlapping patterns. IEICE Trans Inf Syst 90(11):1787–1795CrossRef Tang Y, Gao J (2007) Improved classification for problem involving overlapping patterns. IEICE Trans Inf Syst 90(11):1787–1795CrossRef
go back to reference Tang W, Mao K, Mak LO, Ng GW (2010) Classification for overlapping classes using optimized overlapping region detection and soft decision. In: 2010 13th international conference on information fusion. IEEE, pp 1–8 Tang W, Mao K, Mak LO, Ng GW (2010) Classification for overlapping classes using optimized overlapping region detection and soft decision. In: 2010 13th international conference on information fusion. IEEE, pp 1–8
go back to reference Tax DM, Duin RP (2004) Support vector data description. Mach Learn 54(1):45–66CrossRef Tax DM, Duin RP (2004) Support vector data description. Mach Learn 54(1):45–66CrossRef
go back to reference Thanathamathee P, Lursinsap C (2013) Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and adaboost techniques. Pattern Recogn Lett 34(12):1339–1347CrossRef Thanathamathee P, Lursinsap C (2013) Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and adaboost techniques. Pattern Recogn Lett 34(12):1339–1347CrossRef
go back to reference Tharwat A (2018) Classification assessment methods. Appl Comput Inform 17(1):168–192 Tharwat A (2018) Classification assessment methods. Appl Comput Inform 17(1):168–192
go back to reference Ting KM (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data Eng 3:659–665CrossRef Ting KM (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data Eng 3:659–665CrossRef
go back to reference Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybernet 3:408–421MathSciNetCrossRef Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybernet 3:408–421MathSciNetCrossRef
go back to reference Xiong H, Wu J, Liu L (2010) Classification with classoverlapping: a systematic study. In: Proceedings of the 1st international conference on E-business intelligence (ICEBI2010). pp Atlantis Press Xiong H, Wu J, Liu L (2010) Classification with classoverlapping: a systematic study. In: Proceedings of the 1st international conference on E-business intelligence (ICEBI2010). pp Atlantis Press
go back to reference Yin L, Ge Y, Xiao K, Wang X, Quan X (2013) Feature selection for high-dimensional imbalanced data. Neurocomputing 105:3–11CrossRef Yin L, Ge Y, Xiao K, Wang X, Quan X (2013) Feature selection for high-dimensional imbalanced data. Neurocomputing 105:3–11CrossRef
go back to reference Zhou L (2013) Performance of corporate bankruptcy prediction models on imbalanced dataset: the effect of sampling methods. Knowl-Based Syst 41:16–25CrossRef Zhou L (2013) Performance of corporate bankruptcy prediction models on imbalanced dataset: the effect of sampling methods. Knowl-Based Syst 41:16–25CrossRef
go back to reference Zikeba M, Tomczak SK, Tomczak JM (2016) Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Syst Appl 58:93–101CrossRef Zikeba M, Tomczak SK, Tomczak JM (2016) Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Syst Appl 58:93–101CrossRef
Metadata
Title
An overlap sensitive neural network for class imbalanced data
Authors
Shaukat Ali Shahee
Usha Ananthakumar
Publication date
18-05-2021
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 4/2021
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-021-00766-4

Other articles of this Issue 4/2021

Data Mining and Knowledge Discovery 4/2021 Go to the issue

Premium Partner