2017 | OriginalPaper | Chapter

Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data

Authors : Mateusz Lango, Dariusz Brzezinski, Sebastian Firlik, Jerzy Stefanowski

Published in: Discovery Science

Publisher: Springer International Publishing

Abstract

Learning classifiers from imbalanced data is particularly challenging when class imbalance is accompanied by local data difficulty factors, such as outliers, rare cases, class overlapping, or minority class decomposition. Although these issues have been highlighted in previous research, there have been no proposals of algorithms that simultaneously detect all the aforementioned difficulties in a dataset. In this paper, we put forward two extensions to popular clustering algorithms, ImKmeans and ImScan, and one novel algorithm, ImGrid, that attempt to detect minority sub-clusters, outliers, rare cases, and class overlapping. Experiments with artificial datasets show that ImGrid, which uses a Bayesian test to join similar neighboring regions, is able to re-discover simulated clusters and types of minority examples on par with competing methods, while being the least sensitive to parameter tuning.
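The types of minority examples mentioned in the abstract are commonly assigned with the k-nearest-neighbor analysis of Napierala and Stefanowski [11, 12]: with k = 5, a minority example is called safe if 4 or 5 of its neighbors are also minority examples, borderline if 2 or 3 are, rare if exactly 1 is, and an outlier if none are. The sketch below illustrates only that labeling step under those assumed thresholds; it is not an implementation of ImGrid, ImKmeans, or ImScan, and the function name and toy data are hypothetical.

```python
from math import dist


def label_minority_examples(X, y, minority_label, k=5):
    """Label each minority example by the class mix of its k nearest neighbors.

    Assumed thresholds for k = 5 (after [11, 12]):
    4-5 minority neighbors -> safe, 2-3 -> borderline, 1 -> rare, 0 -> outlier.
    """
    if k != 5:
        raise ValueError("the thresholds in this sketch are defined for k = 5 only")
    labels = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Brute-force neighbor search; adequate for a small illustrative dataset.
        neighbors = sorted((dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        same = sum(1 for _, j in neighbors[:k] if y[j] == minority_label)
        if same >= 4:
            labels[i] = "safe"
        elif same >= 2:
            labels[i] = "borderline"
        elif same == 1:
            labels[i] = "rare"
        else:
            labels[i] = "outlier"
    return labels


# Hypothetical toy data: a compact minority cluster, a distant majority
# region, and one minority example placed inside the majority region.
X = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.5, 0),   # minority cluster
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),   # majority
    (12, 12), (9, 9),                                        # majority
    (10.4, 10.4),                                            # isolated minority
]
y = [1] * 6 + [0] * 7 + [1]

labels = label_minority_examples(X, y, minority_label=1)
print(labels)
```

The result maps the index of each minority example to its type. A k-NN neighborhood is used here for brevity; a kernel-based neighborhood is the alternative discussed in [12].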

Details on tuning the size of the neighborhood and a comparison between the k-NN and kernel-based approach can be found in [12].

Source code, datasets, and reproducible test scripts are available at: https://github.com/langus0/imgrid.

http://www.cs.put.poznan.pl/dbrzezinski/software/MinorityAnalysis.html.

1. Barnard, G.: A new test for 2×2 tables. Nature 156, 177 (1945)

2. Branco, P., Torgo, L., Ribeiro, R.P.: A survey of predictive modeling on imbalanced domains. ACM Comput. Surv. 49(2), 31:1–31:50 (2016)

3. Cheng, W., Wang, W., Batista, S.: Grid-based clustering. In: Aggarwal, C.C., Reddy, C.K. (eds.) Data Clustering: Algorithms and Applications, pp. 127–148. CRC Press, London (2013)

4. Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231 (1996)

5. García, V., Sánchez, J., Mollineda, R.: An empirical study of the behavior of classifiers on imbalanced and overlapped data sets. In: Rueda, L., Mery, D., Kittler, J. (eds.) CIARP 2007. LNCS, vol. 4756, pp. 397–406. Springer, Heidelberg (2007). doi:10.1007/978-3-540-76725-1_42

6. He, H., Ma, Y. (eds.): Imbalanced Learning: Foundations, Algorithms, and Applications. Wiley-IEEE Press, Hoboken (2013)

7. Japkowicz, N., Stephen, S.: The class imbalance problem: a systematic study. Intell. Data Anal. 6(5), 429–449 (2002)

8. Jeffreys, H.: Some tests of significance, treated by the theory of probability. Proc. Camb. Philos. Soc. 31, 203–222 (1935)

9. Jo, T., Japkowicz, N.: Class imbalances versus small disjuncts. SIGKDD Explor. 6(1), 40–49 (2004)

10. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the International Conference on Machine Learning, pp. 179–186 (1997)

11. Napierala, K., Stefanowski, J.: Identification of different types of minority class examples in imbalanced data. In: Corchado, E., Snášel, V., Abraham, A., Woźniak, M., Graña, M., Cho, S.-B. (eds.) HAIS 2012. LNCS, vol. 7209, pp. 139–150. Springer, Heidelberg (2012). doi:10.1007/978-3-642-28931-6_14

12. Napierala, K., Stefanowski, J.: Types of minority class examples and their influence on learning classifiers from imbalanced data. J. Intell. Inf. Syst. 46(3), 563–597 (2016)

13. Napierała, K., Stefanowski, J., Wilk, S.: Learning from imbalanced data in presence of noisy and borderline examples. In: Szczuka, M., Kryszkiewicz, M., Ramanna, S., Jensen, R., Hu, Q. (eds.) RSCTC 2010. LNCS, vol. 6086, pp. 158–167. Springer, Heidelberg (2010). doi:10.1007/978-3-642-13529-3_18

14. Nickerson, A., Japkowicz, N., Milios, E.E.: Using unsupervised learning to guide resampling in imbalanced data sets. In: Proceedings of the 8th International Conference on Artificial Intelligence and Statistics, pp. 261–265. Society for Artificial Intelligence and Statistics (2001)

15. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

16. Prati, R.C., Batista, G.E.A.P.A., Monard, M.C.: Class imbalances versus class overlapping: an analysis of a learning system behavior. In: Monroy, R., Arroyo-Figueroa, G., Sucar, L.E., Sossa, H. (eds.) MICAI 2004. LNCS, vol. 2972, pp. 312–321. Springer, Heidelberg (2004). doi:10.1007/978-3-540-24694-7_32

17. Romano, S., Vinh, N.X., Bailey, J., Verspoor, K.: Adjusting for chance clustering comparison measures. J. Mach. Learn. Res. 17(134), 1–32 (2016)

18. Sobhani, P., Viktor, H., Matwin, S.: Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z.W. (eds.) NFMCP 2014. LNCS, vol. 8983, pp. 69–83. Springer, Cham (2015). doi:10.1007/978-3-319-17876-9_5

19. Stefanowski, J.: Dealing with data difficulty factors while learning from imbalanced data. In: Matwin, S., Mielniczuk, J. (eds.) Challenges in Computational Statistics and Data Mining. SCI, vol. 605, pp. 333–363. Springer, Cham (2016). doi:10.1007/978-3-319-18781-5_17

20. Wojciechowski, S., Wilk, S.: Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data. Found. Comput. Decis. Sci. 42(2), 149–176 (2017)

Title: Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data
Authors: Mateusz Lango, Dariusz Brzezinski, Sebastian Firlik, Jerzy Stefanowski
Publisher: Springer International Publishing
Book: Discovery Science
Print ISBN: 978-3-319-67785-9
Electronic ISBN: 978-3-319-67786-6
Copyright Year: 2017
DOI: https://doi.org/10.1007/978-3-319-67786-6_23
