Skip to main content

2016 | OriginalPaper | Buchkapitel

Beyond the Boundaries of SMOTE

A Framework for Manifold-Based Synthetically Oversampling

verfasst von : Colin Bellinger, Christopher Drummond, Nathalie Japkowicz

Erschienen in: Machine Learning and Knowledge Discovery in Databases

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Problems of class imbalance appear in diverse domains, ranging from gene function annotation to spectra and medical classification. On such problems, the classifier becomes biased in favour of the majority class. This leads to inaccuracy on the important minority classes, such as specific diseases and gene functions. Synthetic oversampling mitigates this by balancing the training set, whilst avoiding the pitfalls of random under and oversampling. The existing methods are primarily based on the SMOTE algorithm, which employs a bias of randomly generating points between nearest neighbours. The relationship between the generative bias and the latent distribution has a significant impact on the performance of the induced classifier. Our research into gamma-ray spectra classification has shown that the generative bias applied by SMOTE is inappropriate for domains that conform to the manifold property, such as spectra, text, image and climate change classification. To this end, we propose a framework for manifold-based synthetic oversampling, and demonstrate its superiority in terms of robustness to the manifold with respect to the AUC on three spectra classification tasks and 16 UCI datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data generating distribution. J. Mach. Learn. Res. 15(1), 3563–3593 (2014)MathSciNetMATH Alain, G., Bengio, Y.: What regularized auto-encoders learn from the data generating distribution. J. Mach. Learn. Res. 15(1), 3563–3593 (2014)MathSciNetMATH
2.
Zurück zum Zitat Batista, G.E.A.P.A., Bazzan, A.L.C., Monard, M.C.: Balancing training data for automated annotation of keywords: a case study. In: Brazilian Workshop on Bioinformatics, pp. 10–18 (2003) Batista, G.E.A.P.A., Bazzan, A.L.C., Monard, M.C.: Balancing training data for automated annotation of keywords: a case study. In: Brazilian Workshop on Bioinformatics, pp. 10–18 (2003)
4.
Zurück zum Zitat Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)CrossRefMATH Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15(6), 1373–1396 (2003)CrossRefMATH
5.
Zurück zum Zitat Bellinger, C., Japkowicz, N., Drummond, C.: Synthetic oversampling for advanced radioactive threat detection. In: International Conference on Machine Learning and Applications (2015) Bellinger, C., Japkowicz, N., Drummond, C.: Synthetic oversampling for advanced radioactive threat detection. In: International Conference on Machine Learning and Applications (2015)
7.
Zurück zum Zitat Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised Learning. MIT Press (2006) Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised Learning. MIT Press (2006)
8.
Zurück zum Zitat Chawla, N.V., Japkowicz, N., Drive, P.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKD Explor. Newsl. 6(1), 2000–2004 (2004). Special Issue on Learning from Imbalanced Datasets Chawla, N.V., Japkowicz, N., Drive, P.: Editorial: special issue on learning from imbalanced data sets. ACM SIGKD Explor. Newsl. 6(1), 2000–2004 (2004). Special Issue on Learning from Imbalanced Datasets
9.
Zurück zum Zitat Chawla, N.V., Japkowicz, N., Kolcz, A. (eds.) In: ICML 2003 Workshop on Learning from Imbalanced Data Sets (2003) Chawla, N.V., Japkowicz, N., Kolcz, A. (eds.) In: ICML 2003 Workshop on Learning from Imbalanced Data Sets (2003)
10.
Zurück zum Zitat Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH
11.
Zurück zum Zitat Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7), 1895–1923 (1998)CrossRef Dietterich, T.G.: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10(7), 1895–1923 (1998)CrossRef
12.
Zurück zum Zitat Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78 (2012)CrossRef Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78 (2012)CrossRef
14.
Zurück zum Zitat Han, H., Wang, W., Mao, B.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D., Zhang, X., Huang, G. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005)CrossRef Han, H., Wang, W., Mao, B.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D., Zhang, X., Huang, G. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005)CrossRef
15.
Zurück zum Zitat He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef
16.
Zurück zum Zitat Japkowicz, N.: Supervised versus unsupervised binary-learning by feedforward neural networks. Mach. Learn. 42(1), 97–122 (2001)CrossRefMATH Japkowicz, N.: Supervised versus unsupervised binary-learning by feedforward neural networks. Mach. Learn. 42(1), 97–122 (2001)CrossRefMATH
17.
Zurück zum Zitat Ma, Y., Fu, Y.: Manifold Learning Theory and Applications (2011) Ma, Y., Fu, Y.: Manifold Learning Theory and Applications (2011)
18.
Zurück zum Zitat Nguwi, Y.Y., Cho, S.Y.: Support vector self-organizing learning for imbalanced medical data. In: 2009 International Joint Conference on Neural Networks, pp. 2250–2255. IEEE, June 2009 Nguwi, Y.Y., Cho, S.Y.: Support vector self-organizing learning for imbalanced medical data. In: 2009 International Joint Conference on Neural Networks, pp. 2250–2255. IEEE, June 2009
19.
Zurück zum Zitat Sharma, A., Paliwal, K.K.: Fast principal component analysis using fixed-point algorithm. Pattern Recogn. Lett. 28(10), 1151–1155 (2007)CrossRef Sharma, A., Paliwal, K.K.: Fast principal component analysis using fixed-point algorithm. Pattern Recogn. Lett. 28(10), 1151–1155 (2007)CrossRef
20.
Zurück zum Zitat Stefanowski, J., Wilk, S.: Improving rule-based classifiers induced by MODLEM by selective pre-processing of imbalanced data. In: ECML/PKDD International Workshop on Rough Sets in Knowledge Discovery (RSKD 2007), pp. 54–65 (2007) Stefanowski, J., Wilk, S.: Improving rule-based classifiers induced by MODLEM by selective pre-processing of imbalanced data. In: ECML/PKDD International Workshop on Rough Sets in Knowledge Discovery (RSKD 2007), pp. 54–65 (2007)
23.
Zurück zum Zitat Tuzel, O., Porikli, F., Mee, P.: Human detection via classification on Riemannian manifolds. In: Computer Vision and Pattern Recognition, pp. 1–8 (2007) Tuzel, O., Porikli, F., Mee, P.: Human detection via classification on Riemannian manifolds. In: Computer Vision and Pattern Recognition, pp. 1–8 (2007)
24.
Zurück zum Zitat Vincent, P.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)MathSciNetMATH Vincent, P.: Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)MathSciNetMATH
25.
Zurück zum Zitat Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6(1), 7–19 (2004)CrossRef Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6(1), 7–19 (2004)CrossRef
26.
Zurück zum Zitat Yang, Q., Wu, X., Elkan, C., Gehrke, J., Han, J., Heckerman, D., Keim, D., Liu, J., Madigan, D., Piatetsky-shapiro, G., Raghavan, V.V., Rastogi, R., Stolfo, S.J., Tuzhilin, A., Wah, B.W.: 10 challenging problems in data mining research. Int. J. Inf. Technol. Decis. Making 5(4), 597–604 (2006)CrossRef Yang, Q., Wu, X., Elkan, C., Gehrke, J., Han, J., Heckerman, D., Keim, D., Liu, J., Madigan, D., Piatetsky-shapiro, G., Raghavan, V.V., Rastogi, R., Stolfo, S.J., Tuzhilin, A., Wah, B.W.: 10 challenging problems in data mining research. Int. J. Inf. Technol. Decis. Making 5(4), 597–604 (2006)CrossRef
27.
Zurück zum Zitat Zhang, D., Chen, X.: Text classification with kernels on the multinomial manifold. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 266–273 (2005) Zhang, D., Chen, X.: Text classification with kernels on the multinomial manifold. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 266–273 (2005)
Metadaten
Titel
Beyond the Boundaries of SMOTE
verfasst von
Colin Bellinger
Christopher Drummond
Nathalie Japkowicz
Copyright-Jahr
2016
DOI
https://doi.org/10.1007/978-3-319-46128-1_16

Premium Partner