nach oben

Progress in Artificial Intelligence

Erschienen in:

01.08.2014 | Regular Paper

Undersampled \(K\)-means approach for handling imbalanced distributed data

verfasst von: N. Santhosh Kumar, K. Nageswara Rao, A. Govardhan, K. Sudheer Reddy, Ali Mirza Mahmood

Erschienen in: Progress in Artificial Intelligence | Ausgabe 1/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

\(K\)-means is a partitional clustering technique that is well known and widely used for its low computational cost. However, the performance of \(K\)-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. They often produce clusters of relatively uniform sizes, even if input data have varied cluster size, which is called the “uniform effect”. In this paper, we analyze the causes of this effect and illustrate that it probably occurs more in the \(K\)-means clustering process. As the minority class decreases in size, the “uniform effect” becomes evident. To prevent the effect of the “uniform effect”, we revisit the well-known \(K\)-means algorithm and provide a general method to properly cluster imbalance distributed data. The proposed algorithm consists of a novel undersampling technique implemented by intelligently removing noisy and weak instances from majority class. We conduct experiments using twelve UCI datasets from various application domains using five algorithms for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.

Vorheriger Artikel Constructing fading histograms from data streams

Nächster Artikel A novel hybrid image segmentation method

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Xiong, H., Wu, J.J., Chen, J.: K-means clustering versus validation measures: a data-distribution perspective. IEEE Trans. Syst. Man Cybern. B Cybern. 39(2), 318–331 (2009)CrossRef

Lu, W.-Z., Wang, D.: Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Sci. Total. Environ. 395(2–3), 109–116 (2008)CrossRef

Huang, Y.-M., Hung, C.-M., Jiau, H.C.: Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Anal. R. World Appl. 7(4), 720–747 (2006)CrossRefMATHMathSciNet

Cieslak, D., Chawla, N., Striegel, A.: Combating imbalance in network intrusion datasets. In: IEEE International Conference on Granular Computing, pp. 732–737 (2006)

Mazurowski, M.A., Habas, P.A., Zurada, J.M., Lo, J.Y., Baker, J.A., Tourassi, G.D.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw. 21(2–3), 427–436 (2008)CrossRef

Freitas, A., Costa-Pereira, A., Brazdil, P.: Cost-sensitive decision trees applied to medical data. In: Song, I., Eder, J., Nguyen, T. (eds.) Data Warehousing Knowledge Discovery (Lecture Notes Series in Computer Science)

Kiliç, K., Uncu, Ö., Türksen, I.B.: Comparison of different strategies of utilizing fuzzy clustering in structure identification. Inf. Sci. 177(23), 5153–5162 (2007)CrossRefMATH

Celebi, M.E., Kingravi, H.A., Uddin, B., Iyatomi, H., Aslandogan, Y.A., Stoecker, W.V., Moss, R.H.: A methodological approach to the classification of dermoscopy images. Comput. Med. Imaging Graph 31(6), 362–373 (2007)CrossRef

Mahmood, A.M., Kuppa, M.R.: Early detection of clinical parameters in heart disease using improved decision tree algorithm. In: IEEE 2\(^{nd}\) Vaagdevi International Conference on Information Technology for Real World Problems (VCON’10) Acceptance Rate less than 6 %, pp. 24–29, Dec 9–11 Warangal. Archived in IEEE Computer Society Digital Library, India (2010)

10.

Mahmood, A.M., Kuppa, M.R.: A novel pruning approach using expert knowledge for data specific pruning. In: Engineering with Computers, vol. 28, pp. 21–30. Springer-Verlag, London (2011)

11.

Peng, X., King, I.: Robust BMPM training based on second-order cone programming and its application in medical diagnosis. Neural Netw. 21(2—-3), 450–457 (2008)CrossRef

12.

Chawla, N., Bowyer, K., Kegelmeyer, P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH

13.

Weiss, G.: Mining with rarity: a unifying framework. SIGKDD Explor. Newslett. 6(1), 7–19 (2004)CrossRef

14.

Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of International Conference on ACM Special Interest Group on Management of Data, pp. 73–84 (1998)

15.

Liu, M.H., Jiang, X.D., Kot, A.C.: A multi-prototype clustering algorithm. Pattern Recogn. 42, 689–698 (2009)CrossRefMATH

16.

Xiang, H., Yang, Y., Zhao, S.: Local clustering ensemble learning method based on improved AdaBoost for rare class analysis. J. Comput. Inform. Syst. 8(4), 1783–1790 (2012)

17.

Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)CrossRef

18.

de Amorim Cordeiro, R.: Minkowski metric, feature weighting and anomalous cluster initializing in K-means clustering. Pattern Recogn. 45, 1061–1075 (2012)

19.

Kiranyaz, S., Ince, T., Pulkkinen, J., Gabbouj, M.: Personalized long-term ecg classification: a systematic approach. Expert Syst. Appl. 38, 3220–3226 (2011)CrossRef

20.

Muniyandi, A.P., Rajeswari, R., Rajaram, R.: Network anomaly detection by cascading K-means clustering and C4.5 decision tree algorithm. International Conference on Communication Technology and System Design 2011. Procedia Eng. 30, 174–182 (2012)CrossRef

21.

Xuan, l., Zhigang, C., Fan, Y.: Exploring of clustering algorithm on class-imbalanced data (2013)

22.

Bouras, C., Tsogkas, V.: A clustering technique for news articles using WordNet, Knowl. Based Syst. (2012). doi:10.1016/j.knosys.2012.06.015

23.

Mok, P.Y., Huang, H.Q., Kwok, Y.L., Au, J.S.: A robust adaptive clustering analysis method for automatic identification of clusters. Pattern Recogn. 45, 3017–3033 (2012)CrossRef

24.

Leiva, L.A., Vidal, E.: Warped K-means: an algorithm to cluster sequentially-distributed data. Inform. Sci. 237, 196–210 (2013)CrossRefMathSciNet

25.

Jaing, M.F., Tseng, S.S., Su, C.M.: Two phase clustering process for outlier detection. Pattern Recogn. Lett. 22, 691–700 (2001)CrossRef

26.

Cao, J., Wu, Z., Wu, J., Liu, W.: Towards information-theoretic k-means clustering for image indexing. Sign. Proces. 93, 2026–2037 (2013)CrossRef

27.

Mignotte, M.: A de-texturing and spatially constrained k-means approach for image segmentation. Pattern Recogn. Lett. 32, 359–367 (2010). doi:10.1016/j.patrec.2010.09.016 CrossRef

28.

López, V., Fernandez, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inform. Sci. 250, 113–141 (2013)CrossRef

29.

He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef

30.

Galar, M., Fernandez, A., Barrenechea, E., Herrera, F.: Eusboost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn. 46(12), 3460–3471 (2013)CrossRef

31.

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for class imbalance problem: bagging, boosting and hybrid based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)CrossRef

32.

Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learning Res. 7, 1–30 (2006)MATHMathSciNet

33.

García, S., Fernandez, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inform. Sci. 180, 2044–2064 (2010)

34.

Maimon, O., Rokach, L.: Data Mining And Knowledge Discovery Handbook. Springer, Berlin (2010)CrossRefMATH

35.

Hall, M.A.: Correlation-Based Feature Subset Selection For Machine Learning. Hamilton, New Zealand (1998)

36.

Quinlan, J.R.: C4.5: Programs for Machine Learning, 1st edn. Morgan Kaufmann Publishers, San Mateo (1993)

37.

http://www.keel.es/

38.

Blake, C., Merz, C.J.: UCI repository of machine learning databases. Machine-readable data repository. Department of Information and Computer Science, University of California at Irvine, Irvine. http://www.ics.uci.edu/mlearn/MLRepository.html (2000)

39.

Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)

40.

Dasgupta, S.: Performance guarantees for hierarchical clustering. In: 15th Annual Conference on Computational Learning Theory, pp. 351–363 (2002)

Titel: Undersampled -means approach for handling imbalanced distributed data
verfasst von: N. Santhosh Kumar
K. Nageswara Rao
A. Govardhan
K. Sudheer Reddy
Ali Mirza Mahmood
Publikationsdatum: 01.08.2014
Verlag: Springer Berlin Heidelberg
Erschienen in: Progress in Artificial Intelligence / Ausgabe 1/2014
Print ISSN: 2192-6352
Elektronische ISSN: 2192-6360
DOI: https://doi.org/10.1007/s13748-014-0045-6

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Weitere Artikel der Ausgabe 1/2014

Constructing fading histograms from data streams

A novel hybrid image segmentation method

On the statistical analysis of the parameters’ trend in a machine learning algorithm

Unbalanced linguistic approach for venture investment evaluation with risk attitudes