Published in: Knowledge and Information Systems 1/2015

01.10.2015 | Regular Paper

Class imbalance revisited: a new experimental setup to assess the performance of treatment methods

Authors: Ronaldo C. Prati, Gustavo E. A. P. A. Batista, Diego F. Silva



Abstract

In the last decade, class imbalance has attracted a huge amount of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications; therefore, these research communities have responded to such interest with dozens of methods and techniques. Surprisingly, many fundamental questions remain open, such as "Are all learning paradigms equally affected by class imbalance?", "What is the expected performance loss for different imbalance degrees?" and "How much of the performance loss can be recovered by the treatment methods?". In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This setup uses real data sets with artificially modified class distributions to evaluate classifiers over a wide range of class imbalance. We apply this experimental design in a large-scale evaluation with 22 data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure, based on confidence intervals, aimed at evaluating the relative performance degradation and recovery. This procedure allows a simple yet insightful visualization of the results and provides the basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5 %) for distributions with 10 % or more minority-class examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20 % at 1 % of minority-class examples. Support Vector Machine is the classifier paradigm least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses: on average, they recover about 30 % or less of the performance lost due to class imbalance.
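The experimental setup and the degradation/recovery measures described in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names and the exact formulas for relative loss and recovery are assumptions inferred from the abstract's description ("loss as a percentage of the performance obtained with the balanced distribution", "how much of the loss is recovered").

```python
import numpy as np

def resample_to_distribution(X, y, minority_frac, n_total, rng=None):
    """Subsample a real binary data set so the positive (minority) class
    makes up `minority_frac` of an n_total-instance sample (hypothetical
    helper mirroring the paper's artificially modified class distributions)."""
    rng = np.random.default_rng(rng)
    n_pos = int(round(minority_frac * n_total))
    n_neg = n_total - n_pos
    # Draw without replacement from each class's pool of real examples.
    pos_idx = rng.choice(np.flatnonzero(y == 1), n_pos, replace=False)
    neg_idx = rng.choice(np.flatnonzero(y == 0), n_neg, replace=False)
    idx = rng.permutation(np.concatenate([pos_idx, neg_idx]))
    return X[idx], y[idx]

def relative_loss(perf_balanced, perf_skewed):
    """Performance loss as a percentage of the balanced-distribution score."""
    return 100.0 * (perf_balanced - perf_skewed) / perf_balanced

def recovery(perf_balanced, perf_skewed, perf_treated):
    """Share of the lost performance that a treatment method recovers, in percent."""
    return 100.0 * (perf_treated - perf_skewed) / (perf_balanced - perf_skewed)
```

For example, with an assumed balanced score of 0.90 dropping to 0.72 under imbalance, `relative_loss` gives the 20 % figure quoted for 1 % minority examples, and a treated score of 0.774 corresponds to a 30 % recovery.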


Footnotes
1
We use the notation \(X/Y\), with \(X+Y=100\) to denote that for a set of 100 instances, \(X\) belongs to the positive class and \(Y\) belongs to the negative class.
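As an illustration of this notation, a small hypothetical helper (the name and the idea of drawing from a labeled pool are assumptions for illustration, not code from the paper) that produces a 100-instance sample with \(X\) positives and \(Y = 100 - X\) negatives:

```python
import numpy as np

def draw_xy_sample(y, x_pos, rng=None):
    """Return indices of a 100-instance sample in X/Y notation:
    x_pos positive examples and (100 - x_pos) negative examples,
    drawn without replacement from a labeled pool y."""
    rng = np.random.default_rng(rng)
    pos = rng.choice(np.flatnonzero(y == 1), x_pos, replace=False)
    neg = rng.choice(np.flatnonzero(y == 0), 100 - x_pos, replace=False)
    return np.concatenate([pos, neg])
```

For instance, `draw_xy_sample(y, 5)` yields a 5/95 sample, i.e., 5 % minority-class examples.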
 
2
CRAN (http://cran.r-project.org) is a network of Web servers distributed around the world that store versions of code and documentation for the statistical software R, as well as community-contributed packages.
 
Metadata
Title
Class imbalance revisited: a new experimental setup to assess the performance of treatment methods
Authors
Ronaldo C. Prati
Gustavo E. A. P. A. Batista
Diego F. Silva
Publication date
01.10.2015
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 1/2015
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-014-0794-3
