2015 | OriginalPaper | Book chapter

Learning from Imbalanced Data Using Ensemble Methods and Cluster-Based Undersampling

Written by: Parinaz Sobhani, Herna Viktor, Stan Matwin

Published in: New Frontiers in Mining Complex Patterns

Publisher: Springer International Publishing

Abstract

Imbalanced data, where one class has far more instances than the others, are common in many domains such as fraud detection, telecommunications management, oil spill detection, and text classification. Traditional classifiers do not perform well on data that are subject to both within-class and between-class imbalances. In this paper, we propose the ClusFirstClass algorithm, which employs cluster analysis to help classifiers build accurate models from such imbalanced datasets. In order to work with balanced classes, all minority instances are used together with the same number of majority instances. To further reduce the impact of within-class imbalance, majority instances are clustered into different groups and at least one instance is selected from each cluster. Experimental results demonstrate that our proposed ClusFirstClass algorithm yields promising results compared to state-of-the-art classification approaches when evaluated against a number of highly imbalanced datasets.
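
The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' ClusFirstClass implementation: the function name `select_majority_indices`, its parameters, and the rule for filling the remaining slots (uniformly at random from the leftover majority instances) are assumptions made for illustration. The sketch also assumes the majority instances have already been clustered by some method, so it takes the cluster labels as input rather than computing them.

```python
import random
from collections import defaultdict


def select_majority_indices(cluster_labels, n_minority, seed=0):
    """Pick n_minority majority-class indices, at least one per cluster.

    cluster_labels[i] is the (precomputed) cluster of majority instance i.
    Hypothetical helper: the fill rule below is an assumption, not taken
    from the paper.
    """
    rng = random.Random(seed)

    # Group majority instance indices by their cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        clusters[label].append(idx)

    if n_minority < len(clusters):
        raise ValueError("fewer slots than clusters; cannot cover every cluster")
    if n_minority > len(cluster_labels):
        raise ValueError("not enough majority instances to match the minority size")

    chosen, leftover = [], []
    for members in clusters.values():
        members = members[:]          # copy so the grouping is not mutated
        rng.shuffle(members)
        chosen.append(members[0])     # guarantee one instance per cluster
        leftover.extend(members[1:])

    # Assumption: fill the remaining slots uniformly at random.
    rng.shuffle(leftover)
    chosen.extend(leftover[: n_minority - len(chosen)])
    return sorted(chosen)


# Usage: 9 majority instances in 3 clusters, 4 minority instances.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2]
picked = select_majority_indices(labels, n_minority=4, seed=1)
print(picked)
```

The selected majority instances would then be combined with all minority instances to form one balanced training set. Because the small clusters are always represented, sparse regions of the majority class are not lost to undersampling, which is the within-class concern the abstract raises.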

Metadata
Title
Learning from Imbalanced Data Using Ensemble Methods and Cluster-Based Undersampling
Written by
Parinaz Sobhani
Herna Viktor
Stan Matwin
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-17876-9_5