Skip to main content
Erschienen in: Progress in Artificial Intelligence 1/2014

01.08.2014 | Regular Paper

Undersampled \(K\)-means approach for handling imbalanced distributed data

verfasst von: N. Santhosh Kumar, K. Nageswara Rao, A. Govardhan, K. Sudheer Reddy, Ali Mirza Mahmood

Erschienen in: Progress in Artificial Intelligence | Ausgabe 1/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

\(K\)-means is a partitional clustering technique that is well known and widely used for its low computational cost. However, the performance of \(K\)-means algorithm tends to be affected by skewed data distributions, i.e., imbalanced data. They often produce clusters of relatively uniform sizes, even if input data have varied cluster size, which is called the “uniform effect”. In this paper, we analyze the causes of this effect and illustrate that it probably occurs more in the \(K\)-means clustering process. As the minority class decreases in size, the “uniform effect” becomes evident. To prevent the effect of the “uniform effect”, we revisit the well-known \(K\)-means algorithm and provide a general method to properly cluster imbalance distributed data. The proposed algorithm consists of a novel undersampling technique implemented by intelligently removing noisy and weak instances from majority class. We conduct experiments using twelve UCI datasets from various application domains using five algorithms for comparison on eight evaluation metrics. Experimental results show the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Xiong, H., Wu, J.J., Chen, J.: K-means clustering versus validation measures: a data-distribution perspective. IEEE Trans. Syst. Man Cybern. B Cybern. 39(2), 318–331 (2009)CrossRef Xiong, H., Wu, J.J., Chen, J.: K-means clustering versus validation measures: a data-distribution perspective. IEEE Trans. Syst. Man Cybern. B Cybern. 39(2), 318–331 (2009)CrossRef
2.
Zurück zum Zitat Lu, W.-Z., Wang, D.: Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Sci. Total. Environ. 395(2–3), 109–116 (2008)CrossRef Lu, W.-Z., Wang, D.: Ground-level ozone prediction by support vector machine approach with a cost-sensitive classification scheme. Sci. Total. Environ. 395(2–3), 109–116 (2008)CrossRef
3.
Zurück zum Zitat Huang, Y.-M., Hung, C.-M., Jiau, H.C.: Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Anal. R. World Appl. 7(4), 720–747 (2006)CrossRefMATHMathSciNet Huang, Y.-M., Hung, C.-M., Jiau, H.C.: Evaluation of neural networks and data mining methods on a credit assessment task for class imbalance problem. Nonlinear Anal. R. World Appl. 7(4), 720–747 (2006)CrossRefMATHMathSciNet
4.
Zurück zum Zitat Cieslak, D., Chawla, N., Striegel, A.: Combating imbalance in network intrusion datasets. In: IEEE International Conference on Granular Computing, pp. 732–737 (2006) Cieslak, D., Chawla, N., Striegel, A.: Combating imbalance in network intrusion datasets. In: IEEE International Conference on Granular Computing, pp. 732–737 (2006)
5.
Zurück zum Zitat Mazurowski, M.A., Habas, P.A., Zurada, J.M., Lo, J.Y., Baker, J.A., Tourassi, G.D.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw. 21(2–3), 427–436 (2008)CrossRef Mazurowski, M.A., Habas, P.A., Zurada, J.M., Lo, J.Y., Baker, J.A., Tourassi, G.D.: Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance. Neural Netw. 21(2–3), 427–436 (2008)CrossRef
6.
Zurück zum Zitat Freitas, A., Costa-Pereira, A., Brazdil, P.: Cost-sensitive decision trees applied to medical data. In: Song, I., Eder, J., Nguyen, T. (eds.) Data Warehousing Knowledge Discovery (Lecture Notes Series in Computer Science) Freitas, A., Costa-Pereira, A., Brazdil, P.: Cost-sensitive decision trees applied to medical data. In: Song, I., Eder, J., Nguyen, T. (eds.) Data Warehousing Knowledge Discovery (Lecture Notes Series in Computer Science)
7.
Zurück zum Zitat Kiliç, K., Uncu, Ö., Türksen, I.B.: Comparison of different strategies of utilizing fuzzy clustering in structure identification. Inf. Sci. 177(23), 5153–5162 (2007)CrossRefMATH Kiliç, K., Uncu, Ö., Türksen, I.B.: Comparison of different strategies of utilizing fuzzy clustering in structure identification. Inf. Sci. 177(23), 5153–5162 (2007)CrossRefMATH
8.
Zurück zum Zitat Celebi, M.E., Kingravi, H.A., Uddin, B., Iyatomi, H., Aslandogan, Y.A., Stoecker, W.V., Moss, R.H.: A methodological approach to the classification of dermoscopy images. Comput. Med. Imaging Graph 31(6), 362–373 (2007)CrossRef Celebi, M.E., Kingravi, H.A., Uddin, B., Iyatomi, H., Aslandogan, Y.A., Stoecker, W.V., Moss, R.H.: A methodological approach to the classification of dermoscopy images. Comput. Med. Imaging Graph 31(6), 362–373 (2007)CrossRef
9.
Zurück zum Zitat Mahmood, A.M., Kuppa, M.R.: Early detection of clinical parameters in heart disease using improved decision tree algorithm. In: IEEE 2\(^{nd}\) Vaagdevi International Conference on Information Technology for Real World Problems (VCON’10) Acceptance Rate less than 6 %, pp. 24–29, Dec 9–11 Warangal. Archived in IEEE Computer Society Digital Library, India (2010) Mahmood, A.M., Kuppa, M.R.: Early detection of clinical parameters in heart disease using improved decision tree algorithm. In: IEEE 2\(^{nd}\) Vaagdevi International Conference on Information Technology for Real World Problems (VCON’10) Acceptance Rate less than 6 %, pp. 24–29, Dec 9–11 Warangal. Archived in IEEE Computer Society Digital Library, India (2010)
10.
Zurück zum Zitat Mahmood, A.M., Kuppa, M.R.: A novel pruning approach using expert knowledge for data specific pruning. In: Engineering with Computers, vol. 28, pp. 21–30. Springer-Verlag, London (2011) Mahmood, A.M., Kuppa, M.R.: A novel pruning approach using expert knowledge for data specific pruning. In: Engineering with Computers, vol. 28, pp. 21–30. Springer-Verlag, London (2011)
11.
Zurück zum Zitat Peng, X., King, I.: Robust BMPM training based on second-order cone programming and its application in medical diagnosis. Neural Netw. 21(2—-3), 450–457 (2008)CrossRef Peng, X., King, I.: Robust BMPM training based on second-order cone programming and its application in medical diagnosis. Neural Netw. 21(2—-3), 450–457 (2008)CrossRef
12.
Zurück zum Zitat Chawla, N., Bowyer, K., Kegelmeyer, P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH Chawla, N., Bowyer, K., Kegelmeyer, P.: Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)MATH
13.
Zurück zum Zitat Weiss, G.: Mining with rarity: a unifying framework. SIGKDD Explor. Newslett. 6(1), 7–19 (2004)CrossRef Weiss, G.: Mining with rarity: a unifying framework. SIGKDD Explor. Newslett. 6(1), 7–19 (2004)CrossRef
14.
Zurück zum Zitat Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of International Conference on ACM Special Interest Group on Management of Data, pp. 73–84 (1998) Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of International Conference on ACM Special Interest Group on Management of Data, pp. 73–84 (1998)
15.
Zurück zum Zitat Liu, M.H., Jiang, X.D., Kot, A.C.: A multi-prototype clustering algorithm. Pattern Recogn. 42, 689–698 (2009)CrossRefMATH Liu, M.H., Jiang, X.D., Kot, A.C.: A multi-prototype clustering algorithm. Pattern Recogn. 42, 689–698 (2009)CrossRefMATH
16.
Zurück zum Zitat Xiang, H., Yang, Y., Zhao, S.: Local clustering ensemble learning method based on improved AdaBoost for rare class analysis. J. Comput. Inform. Syst. 8(4), 1783–1790 (2012) Xiang, H., Yang, Y., Zhao, S.: Local clustering ensemble learning method based on improved AdaBoost for rare class analysis. J. Comput. Inform. Syst. 8(4), 1783–1790 (2012)
17.
Zurück zum Zitat Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)CrossRef Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)CrossRef
18.
Zurück zum Zitat de Amorim Cordeiro, R.: Minkowski metric, feature weighting and anomalous cluster initializing in K-means clustering. Pattern Recogn. 45, 1061–1075 (2012) de Amorim Cordeiro, R.: Minkowski metric, feature weighting and anomalous cluster initializing in K-means clustering. Pattern Recogn. 45, 1061–1075 (2012)
19.
Zurück zum Zitat Kiranyaz, S., Ince, T., Pulkkinen, J., Gabbouj, M.: Personalized long-term ecg classification: a systematic approach. Expert Syst. Appl. 38, 3220–3226 (2011)CrossRef Kiranyaz, S., Ince, T., Pulkkinen, J., Gabbouj, M.: Personalized long-term ecg classification: a systematic approach. Expert Syst. Appl. 38, 3220–3226 (2011)CrossRef
20.
Zurück zum Zitat Muniyandi, A.P., Rajeswari, R., Rajaram, R.: Network anomaly detection by cascading K-means clustering and C4.5 decision tree algorithm. International Conference on Communication Technology and System Design 2011. Procedia Eng. 30, 174–182 (2012)CrossRef Muniyandi, A.P., Rajeswari, R., Rajaram, R.: Network anomaly detection by cascading K-means clustering and C4.5 decision tree algorithm. International Conference on Communication Technology and System Design 2011. Procedia Eng. 30, 174–182 (2012)CrossRef
21.
Zurück zum Zitat Xuan, l., Zhigang, C., Fan, Y.: Exploring of clustering algorithm on class-imbalanced data (2013) Xuan, l., Zhigang, C., Fan, Y.: Exploring of clustering algorithm on class-imbalanced data (2013)
23.
Zurück zum Zitat Mok, P.Y., Huang, H.Q., Kwok, Y.L., Au, J.S.: A robust adaptive clustering analysis method for automatic identification of clusters. Pattern Recogn. 45, 3017–3033 (2012)CrossRef Mok, P.Y., Huang, H.Q., Kwok, Y.L., Au, J.S.: A robust adaptive clustering analysis method for automatic identification of clusters. Pattern Recogn. 45, 3017–3033 (2012)CrossRef
24.
Zurück zum Zitat Leiva, L.A., Vidal, E.: Warped K-means: an algorithm to cluster sequentially-distributed data. Inform. Sci. 237, 196–210 (2013)CrossRefMathSciNet Leiva, L.A., Vidal, E.: Warped K-means: an algorithm to cluster sequentially-distributed data. Inform. Sci. 237, 196–210 (2013)CrossRefMathSciNet
25.
Zurück zum Zitat Jaing, M.F., Tseng, S.S., Su, C.M.: Two phase clustering process for outlier detection. Pattern Recogn. Lett. 22, 691–700 (2001)CrossRef Jaing, M.F., Tseng, S.S., Su, C.M.: Two phase clustering process for outlier detection. Pattern Recogn. Lett. 22, 691–700 (2001)CrossRef
26.
Zurück zum Zitat Cao, J., Wu, Z., Wu, J., Liu, W.: Towards information-theoretic k-means clustering for image indexing. Sign. Proces. 93, 2026–2037 (2013)CrossRef Cao, J., Wu, Z., Wu, J., Liu, W.: Towards information-theoretic k-means clustering for image indexing. Sign. Proces. 93, 2026–2037 (2013)CrossRef
28.
Zurück zum Zitat López, V., Fernandez, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inform. Sci. 250, 113–141 (2013)CrossRef López, V., Fernandez, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inform. Sci. 250, 113–141 (2013)CrossRef
29.
Zurück zum Zitat He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)CrossRef
30.
Zurück zum Zitat Galar, M., Fernandez, A., Barrenechea, E., Herrera, F.: Eusboost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn. 46(12), 3460–3471 (2013)CrossRef Galar, M., Fernandez, A., Barrenechea, E., Herrera, F.: Eusboost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn. 46(12), 3460–3471 (2013)CrossRef
31.
Zurück zum Zitat Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for class imbalance problem: bagging, boosting and hybrid based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)CrossRef Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for class imbalance problem: bagging, boosting and hybrid based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(4), 463–484 (2012)CrossRef
32.
Zurück zum Zitat Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learning Res. 7, 1–30 (2006)MATHMathSciNet Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learning Res. 7, 1–30 (2006)MATHMathSciNet
33.
Zurück zum Zitat García, S., Fernandez, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inform. Sci. 180, 2044–2064 (2010) García, S., Fernandez, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inform. Sci. 180, 2044–2064 (2010)
34.
Zurück zum Zitat Maimon, O., Rokach, L.: Data Mining And Knowledge Discovery Handbook. Springer, Berlin (2010)CrossRefMATH Maimon, O., Rokach, L.: Data Mining And Knowledge Discovery Handbook. Springer, Berlin (2010)CrossRefMATH
35.
Zurück zum Zitat Hall, M.A.: Correlation-Based Feature Subset Selection For Machine Learning. Hamilton, New Zealand (1998) Hall, M.A.: Correlation-Based Feature Subset Selection For Machine Learning. Hamilton, New Zealand (1998)
36.
Zurück zum Zitat Quinlan, J.R.: C4.5: Programs for Machine Learning, 1st edn. Morgan Kaufmann Publishers, San Mateo (1993) Quinlan, J.R.: C4.5: Programs for Machine Learning, 1st edn. Morgan Kaufmann Publishers, San Mateo (1993)
39.
Zurück zum Zitat Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005) Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
40.
Zurück zum Zitat Dasgupta, S.: Performance guarantees for hierarchical clustering. In: 15th Annual Conference on Computational Learning Theory, pp. 351–363 (2002) Dasgupta, S.: Performance guarantees for hierarchical clustering. In: 15th Annual Conference on Computational Learning Theory, pp. 351–363 (2002)
Metadaten
Titel
Undersampled -means approach for handling imbalanced distributed data
verfasst von
N. Santhosh Kumar
K. Nageswara Rao
A. Govardhan
K. Sudheer Reddy
Ali Mirza Mahmood
Publikationsdatum
01.08.2014
Verlag
Springer Berlin Heidelberg
Erschienen in
Progress in Artificial Intelligence / Ausgabe 1/2014
Print ISSN: 2192-6352
Elektronische ISSN: 2192-6360
DOI
https://doi.org/10.1007/s13748-014-0045-6

Weitere Artikel der Ausgabe 1/2014

Progress in Artificial Intelligence 1/2014 Zur Ausgabe