Progress in Artificial Intelligence

1 December 2012 | Regular Paper

Surrounding neighborhood-based SMOTE for learning from imbalanced data sets

Authors: V. García, J. S. Sánchez, R. Martín-Félez, R. A. Mollineda

Published in: Progress in Artificial Intelligence | Issue 4/2012

Abstract

Many traditional approaches to pattern classification assume that the problem classes share similar prior probabilities. In many real-life applications, however, this assumption is grossly violated: the ratios of prior probabilities between classes are often extremely skewed. This situation is known as the class imbalance problem. One strategy for tackling it is to balance the classes by resampling the original data set. The SMOTE algorithm is probably the most popular technique for increasing the size of the minority class by generating synthetic instances. Building on the idea of the original SMOTE, we propose three surrounding-neighborhood approaches for generating artificial minority instances that take into account both the proximity and the spatial distribution of the examples. Experiments over a large collection of databases, using three different classifiers, demonstrate that the new surrounding neighborhood-based SMOTE procedures significantly outperform other existing over-sampling algorithms.
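For readers unfamiliar with the baseline, the following is a minimal sketch of the interpolation step of the original SMOTE (Chawla et al., 2002). It is not the surrounding-neighborhood variant proposed in this paper. The function name `smote`, its parameters, and the toy data are illustrative assumptions, and the sketch simplifies the original procedure by picking base samples at random instead of cycling through every minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Sketch of classic SMOTE: interpolate between a minority sample
    and one of its k nearest minority-class neighbors.

    `minority` is a list of equal-length numeric tuples (assumed format).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Simplification: choose the base sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority class with two features (made-up values for illustration).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because each synthetic point lies on a segment between two existing minority samples, plain SMOTE relies on proximity alone; the abstract's proposal differs in that neighbor selection also considers how the examples are distributed spatially around the base sample.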

Title: Surrounding neighborhood-based SMOTE for learning from imbalanced data sets
Authors: V. García, J. S. Sánchez, R. Martín-Félez, R. A. Mollineda
Publication date: 1 December 2012
Publisher: Springer-Verlag
Published in: Progress in Artificial Intelligence / Issue 4/2012
Print ISSN: 2192-6352
Electronic ISSN: 2192-6360
DOI: https://doi.org/10.1007/s13748-012-0027-5
