Published in: Soft Computing 20/2020

11 April 2020 | Methodologies and Application

Robust hybrid data-level sampling approach to handle imbalanced data during classification

Authors: Prabhjot Kaur, Anjana Gosain

Abstract

Classification plays a significant role in finding patterns in data. Classifier performance is strongly affected by data impurities such as class imbalance, noise, class overlap and differing data distributions within classes, and real-world data are often corrupted by several of these impurities at once. To address this, the paper proposes a hybrid data-level method that handles class imbalance, noise and differing within-class data distributions together. The approach works in phases. In the first phase, it identifies and removes noise from the data and then detects minority and majority clusters using a kernel-based fuzzy clustering approach with a radial basis kernel. In the next phase, the minority and majority clusters are processed to balance the data: radial basis kernel fuzzy membership and an \(\alpha \)-cut reduce the size of the majority cluster, and a firefly-based SMOTE method generates synthetic data within the minority cluster. After these data impurities have been removed, a traditional classifier (Decision Tree) is used to classify the balanced data. The method is tested on 3 synthetic data-sets and 44 UCI real-world data-sets with imbalance ratios ranging from 1.82 to 129.44. Area under the ROC curve is used to assess the proposed method and compare it with 20 other data-level methods. The experimental results confirmed that the proposed method outperformed every other method, especially on highly imbalanced data-sets.
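
The balancing phase described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a single majority cluster whose center is the sample mean, uses the raw radial basis kernel value as a stand-in for the kernel fuzzy membership, and replaces the firefly-based SMOTE with plain SMOTE-style interpolation to the nearest minority neighbour. All function names and the parameters `alpha` and `sigma` are hypothetical.

```python
import numpy as np


def rbf_kernel(X, center, sigma=1.0):
    """Radial basis (Gaussian) kernel value between each row of X and one center."""
    sq_dist = np.sum((X - center) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def alpha_cut_undersample(X_major, alpha=0.5, sigma=1.0):
    """Keep majority points whose kernel value to the cluster mean is >= alpha.

    Assumption: the kernel value itself plays the role of the membership degree.
    """
    center = X_major.mean(axis=0)
    membership = rbf_kernel(X_major, center, sigma)
    return X_major[membership >= alpha]


def smote_like_oversample(X_minor, n_new, rng):
    """Plain SMOTE-style interpolation (not the firefly-based variant).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and its nearest minority neighbour. Needs >= 2 points.
    """
    synthetic = np.empty((n_new, X_minor.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_minor))
        dist = np.sum((X_minor - X_minor[j]) ** 2, axis=1)
        dist[j] = np.inf  # exclude the point itself
        k = int(np.argmin(dist))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minor[j] + gap * (X_minor[k] - X_minor[j])
    return synthetic


def imbalance_ratio(n_major, n_minor):
    """Imbalance ratio: majority count divided by minority count."""
    return n_major / n_minor


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_major = rng.normal(0.0, 1.0, size=(200, 2))  # toy majority class
    X_minor = rng.normal(3.0, 0.5, size=(20, 2))   # toy minority class

    X_major_kept = alpha_cut_undersample(X_major, alpha=0.5)
    n_new = max(len(X_major_kept) - len(X_minor), 0)
    X_minor_new = smote_like_oversample(X_minor, n_new, rng)

    print("imbalance ratio before:", imbalance_ratio(len(X_major), len(X_minor)))
    print("majority kept:", len(X_major_kept), "synthetic minority:", len(X_minor_new))
```

Raising `alpha` discards more of the majority cluster, so under these assumptions it sets how aggressively the majority class is reduced before the minority class is topped up.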

Metadata
Title
Robust hybrid data-level sampling approach to handle imbalanced data during classification
Authors
Prabhjot Kaur
Anjana Gosain
Publication date
11 April 2020
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 20/2020
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-020-04901-z
