Published in: Soft Computing 21/2019

20-11-2018 | Methodologies and Application

TLUSBoost algorithm: a boosting solution for class imbalance problem

Authors: Sujit Kumar, Saroj Kr. Biswas, Debashree Devi

Abstract

It is commonly assumed that the training sets used for learning are balanced. This assumption does not always hold in real-world applications, and traditional data mining algorithms trained on imbalanced data tend to build suboptimal classification models that are biased towards the overrepresented class. This class imbalance problem is common to many application domains, such as data mining, machine learning and pattern recognition, and several techniques have been proposed to alleviate it. RUSBoost is an ensemble learning approach that addresses class imbalance by using random undersampling (RUS) for data resampling and the AdaBoost technique for boosting. However, RUS may discard significant information from the dataset. This paper therefore proposes the Tomek-link undersampling-based boosting (TLUSBoost) algorithm, which uses Tomek-linked and redundancy-based undersampling (TLRUS) for data resampling and the AdaBoost technique for boosting. TLRUS finds outliers using the Tomek-link concept and then eliminates some of the probably redundant instances among those outliers. The algorithm thus reduces the loss of information and preserves the characteristics of the dataset, which helps the classifier to be trained appropriately. TLUSBoost is validated on 16 benchmark datasets and compared with the EasyEnsemble, BalanceCascade, SMOTEBoost and RUSBoost algorithms. Ten-fold cross-validation is used to measure the overall accuracy and F-measure of the models. Experimental results show that the proposed model outperforms EasyEnsemble, BalanceCascade, SMOTEBoost and RUSBoost in both overall accuracy and F-measure.
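The Tomek-link concept mentioned in the abstract can be illustrated with a minimal Python sketch. It uses the classic definition only: two instances of opposite classes form a Tomek link when each is the other's nearest neighbour. The abstract does not specify the redundancy-based step of TLRUS or how it is combined with AdaBoost, so neither is reproduced here. The function names (`nearest_neighbor`, `tomek_links`, `drop_linked_majority`), the Euclidean distance and the choice to drop every linked majority instance are assumptions made for illustration, not the authors' method.

```python
import math


def nearest_neighbor(points, i):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    best, best_dist = None, math.inf
    for j, p in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], p)  # requires Python 3.8+
        if d < best_dist:
            best, best_dist = j, d
    return best


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    nn = [nearest_neighbor(points, i) for i in range(len(points))]
    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


def drop_linked_majority(points, labels, majority_label):
    """Illustrative undersampling step (an assumption, not TLRUS itself):
    remove every majority-class instance that takes part in a Tomek link."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority_label}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy data: class 0 is the majority; instance 3 sits right next to a minority point.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5.2, 5), (9, 9)]
labels = [0, 0, 0, 0, 1, 1]

links = tomek_links(points, labels)
new_points, new_labels = drop_linked_majority(points, labels, majority_label=0)
print(links)
print(new_labels)
```

In this toy set only the majority instance that borders a minority instance is removed, whereas random undersampling could just as well discard an instance from the dense majority cluster. That contrast is the motivation the abstract gives for replacing RUS.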

Metadata
Title
TLUSBoost algorithm: a boosting solution for class imbalance problem
Authors
Sujit Kumar
Saroj Kr. Biswas
Debashree Devi
Publication date
20-11-2018
Publisher
Springer Berlin Heidelberg
Published in
Soft Computing / Issue 21/2019
Print ISSN: 1432-7643
Electronic ISSN: 1433-7479
DOI
https://doi.org/10.1007/s00500-018-3629-4