Skip to main content
Top

2020 | OriginalPaper | Chapter

Performance Analysis of Binarization Strategies for Multi-class Imbalanced Data Classification

Authors : Michał Żak, Michał Woźniak

Published in: Computational Science – ICCS 2020

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Multi-class imbalanced classification tasks are characterized by the skewed distribution of examples among the classes and, usually, strong overlapping between class regions in the feature space. Furthermore, frequently the goal of the final system is to obtain very high precision for each of the concepts. All of these factors contribute to the complexity of the task and increase the difficulty of building a quality data model by learning algorithms. One of the ways of addressing these challenges are so-called binarization strategies, which allow for decomposition of the multi-class problem into several binary tasks with lower complexity. Because of the different decomposition schemes used by each of those methods, some of them are considered to be better suited for handling imbalanced data than the others. In this study, we focus on the well-known binary approaches, namely One-Vs-All, One-Vs-One, and Error-Correcting Output Codes, and their effectiveness in multi-class imbalanced data classification, with respect to the base classifiers and various aggregation schemes for each of the strategies. We compare the performance of these approaches and try to boost the performance of seemingly weaker methods by sampling algorithms. The detailed comparative experimental study of the considered methods, supported by the statistical analysis, is presented. The results show the differences among various binarization strategies. We show how one can mitigate those differences using simple oversampling methods.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Alcalá-Fdez, J., et al.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2011) Alcalá-Fdez, J., et al.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Logic Soft Comput. 17, 255–287 (2011)
2.
go back to reference Allwein, E., Schapire, R., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2000)MathSciNetMATH Allwein, E., Schapire, R., Singer, Y.: Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2000)MathSciNetMATH
3.
go back to reference Anand, A., Suganthan, P.: Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates. J. Theor. Biol. 259(3), 533–540 (2009)CrossRef Anand, A., Suganthan, P.: Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates. J. Theor. Biol. 259(3), 533–540 (2009)CrossRef
4.
go back to reference Batista, G., Bazzan, B., Monard, M.: Balancing training data for automated annotation of keywords: a case study. In: WOB, pp. 10–18 (2003) Batista, G., Bazzan, B., Monard, M.: Balancing training data for automated annotation of keywords: a case study. In: WOB, pp. 10–18 (2003)
5.
go back to reference Batista, G., Prati, R., Monard, M.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)CrossRef Batista, G., Prati, R., Monard, M.: A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor. 6(1), 20–29 (2004)CrossRef
6.
go back to reference Chan, P., Stolfo, S.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164–168 (1998) Chan, P., Stolfo, S.: Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 164–168 (1998)
7.
go back to reference Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. arXiv e-prints arXiv:1106.1813 (2011) Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: Smote: synthetic minority over-sampling technique. arXiv e-prints arXiv:​1106.​1813 (2011)
8.
go back to reference Demšar, J.: Statistical comparisons ofclassifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)MathSciNetMATH Demšar, J.: Statistical comparisons ofclassifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)MathSciNetMATH
9.
go back to reference Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)CrossRef Dietterich, T., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)CrossRef
10.
go back to reference Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley-Interscience, Hoboken (2000)MATH Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley-Interscience, Hoboken (2000)MATH
11.
go back to reference Fernández, A., López, V., Galar, M., Jesus, M.D., Herrera, F.: Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches. Knowl.-Based Syst. 42, 97–110 (2013)CrossRef Fernández, A., López, V., Galar, M., Jesus, M.D., Herrera, F.: Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches. Knowl.-Based Syst. 42, 97–110 (2013)CrossRef
13.
go back to reference Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes. Pattern Recogn. 44(8), 1761–1776 (2011)CrossRef Galar, M., Fernández, A., Barrenechea, E., Bustince, H., Herrera, F.: An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes. Pattern Recogn. 44(8), 1761–1776 (2011)CrossRef
14.
go back to reference Galar, M., Fernández, A., Barrenechea, E., Herrera, F.: Empowering difficult classes with a similarity-based aggregation in multi-class classification problems. Inf. Sci. 264, 135–157 (2014)MathSciNetCrossRef Galar, M., Fernández, A., Barrenechea, E., Herrera, F.: Empowering difficult classes with a similarity-based aggregation in multi-class classification problems. Inf. Sci. 264, 135–157 (2014)MathSciNetCrossRef
18.
go back to reference Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017) Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017)
19.
go back to reference Liu, K., Xu, C.: A genetic programming-based approach to the classification of multiclass microarray datasets. Bioinformatics 25(3), 331–337 (2009)CrossRef Liu, K., Xu, C.: A genetic programming-based approach to the classification of multiclass microarray datasets. Bioinformatics 25(3), 331–337 (2009)CrossRef
20.
go back to reference Luque, A., Carrasco, A., Martin, A., Heras, A.: The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn. 91, 216–231 (2019)CrossRef Luque, A., Carrasco, A., Martin, A., Heras, A.: The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recogn. 91, 216–231 (2019)CrossRef
21.
go back to reference Mosley, L.: A balanced approach to the multi-class imbalance problem. Graduate theses and dissertations, Iowa State University (2013) Mosley, L.: A balanced approach to the multi-class imbalance problem. Graduate theses and dissertations, Iowa State University (2013)
22.
go back to reference Murphy, K.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2012)MATH Murphy, K.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2012)MATH
23.
go back to reference Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetMATH Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)MathSciNetMATH
24.
go back to reference Phua, C., Lee, V., Smith, K., Gayler, R.: A comprehensive survey of data mining-based fraud detection research. arXiv e-prints arXiv:1009.6119 (2010) Phua, C., Lee, V., Smith, K., Gayler, R.: A comprehensive survey of data mining-based fraud detection research. arXiv e-prints arXiv:​1009.​6119 (2010)
25.
go back to reference Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)MathSciNetMATH Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)MathSciNetMATH
26.
go back to reference Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv e-prints arXiv:1409.1556 (2014) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv e-prints arXiv:​1409.​1556 (2014)
27.
go back to reference Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, Hoboken (1998)MATH Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, Hoboken (1998)MATH
29.
go back to reference Zhang, Z., Krawczyk, B., Garcìa, S., Rosales-Pérez, A., Herrera, F.: Empowering one-vs-one decomposition with ensemble learning for multi-class imbalanced data. Knowl.-Based Syst. 106, 251–263 (2016) Zhang, Z., Krawczyk, B., Garcìa, S., Rosales-Pérez, A., Herrera, F.: Empowering one-vs-one decomposition with ensemble learning for multi-class imbalanced data. Knowl.-Based Syst. 106, 251–263 (2016)
Metadata
Title
Performance Analysis of Binarization Strategies for Multi-class Imbalanced Data Classification
Authors
Michał Żak
Michał Woźniak
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-50423-6_11

Premium Partner