Skip to main content
Erschienen in: Neural Computing and Applications 12/2021

10.11.2020 | Original Article

An entropy-based initialization method of K-means clustering on the optimal number of clusters

verfasst von: Kuntal Chowdhury, Debasis Chaudhuri, Arup Kumar Pal

Erschienen in: Neural Computing and Applications | Ausgabe 12/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Clustering is an unsupervised learning approach used to group similar features using specific mathematical criteria. This mathematical criterion is known as the objective function. Any clustering is done depending on some objective function. K-means is one of the widely used partitional clustering algorithms whose performance depends on the initial point and the value of K. In this paper, we have combined both these parameters. We have defined an entropy-based objective function for the initialization process, which is better than other existing initialization methods of K-means clustering. Here, we have also designed an algorithm to calculate the correct number of clusters of datasets using some cluster validity indexes. In this paper, the entropy-based initialization algorithm has been proposed and applied to different 2D and 3D data sets. The comparison with other existing initialization methods has been represented in this paper.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Fußnoten
1
COUNT() function gives the count of no. of cluster validity indexes support for a particular K value (no. of clusters). Here, 2, 3, ..., c denote the number of clusters starting from K = 2.
 
Literatur
1.
Zurück zum Zitat Askari G, Li Y, MoezziNasab R (2014) An adaptive polygonal centroidal voronoi tessellation algorithm for segmentation of noisy sar images. Int Arch Photogram Remote Sens Spatial Inf Sci 40(2):65CrossRef Askari G, Li Y, MoezziNasab R (2014) An adaptive polygonal centroidal voronoi tessellation algorithm for segmentation of noisy sar images. Int Arch Photogram Remote Sens Spatial Inf Sci 40(2):65CrossRef
2.
Zurück zum Zitat Astrahan M (1970) Speech analysis by clustering, or the hyperphoneme method. STANFORD UNIV CA DEPT OF COMPUTER SCIENCE, Tech. rep Astrahan M (1970) Speech analysis by clustering, or the hyperphoneme method. STANFORD UNIV CA DEPT OF COMPUTER SCIENCE, Tech. rep
3.
Zurück zum Zitat Bai L, Liang J, Dang C, Cao F (2012) A cluster centers initialization method for clustering categorical data. Expert Syst Appl 39(9):8022–8029CrossRef Bai L, Liang J, Dang C, Cao F (2012) A cluster centers initialization method for clustering categorical data. Expert Syst Appl 39(9):8022–8029CrossRef
4.
Zurück zum Zitat Ball GH, Hall DJ (1965) Isodata, a novel method of data analysis and pattern classification. Tech. rep, Stanford research inst Menlo Park CA Ball GH, Hall DJ (1965) Isodata, a novel method of data analysis and pattern classification. Tech. rep, Stanford research inst Menlo Park CA
5.
Zurück zum Zitat Bordogna G, Pasi G (2011) Soft clustering for information retrieval applications. Wiley Interdiscip Rev Data Min Knowl Discov 1(2):138–146CrossRef Bordogna G, Pasi G (2011) Soft clustering for information retrieval applications. Wiley Interdiscip Rev Data Min Knowl Discov 1(2):138–146CrossRef
6.
Zurück zum Zitat Bozdogan H (1994) Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In: Proceedings of the first US/Japan conference on the frontiers of statistical modeling: an informational approach, Springer, pp 69–113 Bozdogan H (1994) Mixture-model cluster analysis using model selection criteria and a new informational measure of complexity. In: Proceedings of the first US/Japan conference on the frontiers of statistical modeling: an informational approach, Springer, pp 69–113
7.
Zurück zum Zitat Cao F, Liang J, Jiang G (2009) An initialization method for the k-means algorithm using neighborhood model. Comput Math Appl 58(3):474–483MathSciNetCrossRef Cao F, Liang J, Jiang G (2009) An initialization method for the k-means algorithm using neighborhood model. Comput Math Appl 58(3):474–483MathSciNetCrossRef
8.
Zurück zum Zitat Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210CrossRef Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210CrossRef
9.
Zurück zum Zitat Chaudhuri D, Murthy C, Chaudhuri B (1994) Finding a subset of representative points in a data set. IEEE Trans Syst Man Cybern 24(9):1416–1424CrossRef Chaudhuri D, Murthy C, Chaudhuri B (1994) Finding a subset of representative points in a data set. IEEE Trans Syst Man Cybern 24(9):1416–1424CrossRef
10.
Zurück zum Zitat Chen K, Liu L (2005) The “best k” for entropy-based categorical data clustering Chen K, Liu L (2005) The “best k” for entropy-based categorical data clustering
11.
Zurück zum Zitat Chowdhury K, Chaudhuri D, Pal AK (2018) Seed point selection algorithm in clustering of image data. In: Progress in Intelligent Computing Techniques: Theory, Practice, and Applications, Springer, pp 119–126 Chowdhury K, Chaudhuri D, Pal AK (2018) Seed point selection algorithm in clustering of image data. In: Progress in Intelligent Computing Techniques: Theory, Practice, and Applications, Springer, pp 119–126
12.
Zurück zum Zitat Dalhatu K, Sim ATH (2016) Density base k-mean’s cluster centroid initialization algorithm Dalhatu K, Sim ATH (2016) Density base k-mean’s cluster centroid initialization algorithm
13.
Zurück zum Zitat Dey L, Chakraborty S (2014) Canonical pso based-means clustering approach for real datasets. International scholarly research notices 2014 Dey L, Chakraborty S (2014) Canonical pso based-means clustering approach for real datasets. International scholarly research notices 2014
14.
Zurück zum Zitat Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software structure: a simulation study. Mol Ecol 14(8):2611–2620CrossRef Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software structure: a simulation study. Mol Ecol 14(8):2611–2620CrossRef
15.
Zurück zum Zitat Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768–769 Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768–769
16.
17.
Zurück zum Zitat Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recognit Lett 31(8):651–666CrossRef
18.
Zurück zum Zitat Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Inc Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Inc
19.
Zurück zum Zitat Jin Z, Kim DY, Cho J, Lee B (2015) An analysis on optimal cluster ratio in cluster-based wireless sensor networks. IEEE Sens J 15(11):6413–6423CrossRef Jin Z, Kim DY, Cho J, Lee B (2015) An analysis on optimal cluster ratio in cluster-based wireless sensor networks. IEEE Sens J 15(11):6413–6423CrossRef
20.
Zurück zum Zitat Lu JF, Tang J, Tang ZM, Yang JY (2008) Hierarchical initialization approach for k-means clustering. Pattern Recognit Lett 29(6):787–795CrossRef Lu JF, Tang J, Tang ZM, Yang JY (2008) Hierarchical initialization approach for k-means clustering. Pattern Recognit Lett 29(6):787–795CrossRef
21.
Zurück zum Zitat MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp Math Stat Prob Oakland CA USA 1:281–297MathSciNetMATH MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. Proc Fifth Berkeley Symp Math Stat Prob Oakland CA USA 1:281–297MathSciNetMATH
22.
Zurück zum Zitat Mahmud MS, Rahman MM, Akhtar MN (2012) Improvement of k-means clustering algorithm with better initial centroids based on weighted average. In: Electrical & Computer Engineering (ICECE), 2012 7th International Conference on, IEEE, pp 647–650 Mahmud MS, Rahman MM, Akhtar MN (2012) Improvement of k-means clustering algorithm with better initial centroids based on weighted average. In: Electrical & Computer Engineering (ICECE), 2012 7th International Conference on, IEEE, pp 647–650
23.
Zurück zum Zitat Milligan GW (1981) A monte carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46(2):187–199CrossRef Milligan GW (1981) A monte carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46(2):187–199CrossRef
24.
Zurück zum Zitat Nazeer KA, Sebastian M (2009) Improving the accuracy and efficiency of the k-means clustering algorithm. Proc World Congress Eng 1:1–3 Nazeer KA, Sebastian M (2009) Improving the accuracy and efficiency of the k-means clustering algorithm. Proc World Congress Eng 1:1–3
25.
Zurück zum Zitat Pakhira MK et al (2009) A modified k-means algorithm to avoid empty clusters. Int J Recent Trends Eng 1(1) Pakhira MK et al (2009) A modified k-means algorithm to avoid empty clusters. Int J Recent Trends Eng 1(1)
26.
Zurück zum Zitat Pal SK, Pramanik P (1986) Fuzzy measures in determining seed points in clustering. Pattern Recognit Lett 4(3):159–164CrossRef Pal SK, Pramanik P (1986) Fuzzy measures in determining seed points in clustering. Pattern Recognit Lett 4(3):159–164CrossRef
27.
Zurück zum Zitat Reddy D, Jana PK, Member IS (2012) Initialization for k-means clustering using voronoi diagram. Proc Technol 4:395–400CrossRef Reddy D, Jana PK, Member IS (2012) Initialization for k-means clustering using voronoi diagram. Proc Technol 4:395–400CrossRef
28.
Zurück zum Zitat Smyth P (1996) Clustering using monte carlo cross-validation. Kdd 1:26–133 Smyth P (1996) Clustering using monte carlo cross-validation. Kdd 1:26–133
29.
Zurück zum Zitat Sugar CA, James GM (2003) Finding the number of clusters in a dataset: an information-theoretic approach. J Am Stat Assoc 98(463):750–763MathSciNetCrossRef Sugar CA, James GM (2003) Finding the number of clusters in a dataset: an information-theoretic approach. J Am Stat Assoc 98(463):750–763MathSciNetCrossRef
30.
Zurück zum Zitat Suryawanshi R, Puthran S (2016) Review of various enhancement for clustering algorithms in big data mining. Int J Adv Res Comput Sci Softw Eng Suryawanshi R, Puthran S (2016) Review of various enhancement for clustering algorithms in big data mining. Int J Adv Res Comput Sci Softw Eng
31.
Zurück zum Zitat Thakare Y, Bagal S (2015) Performance evaluation of k-means clustering algorithm with various distance metrics. Int J Comput Appl 110(11) Thakare Y, Bagal S (2015) Performance evaluation of k-means clustering algorithm with various distance metrics. Int J Comput Appl 110(11)
32.
Zurück zum Zitat Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol 63(2):411–423MathSciNetCrossRef Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B Stat Methodol 63(2):411–423MathSciNetCrossRef
33.
Zurück zum Zitat Tzortzis G, Likas A (2014) The minmax k-means clustering algorithm. Pattern Recognit 47(7):2505–2516CrossRef Tzortzis G, Likas A (2014) The minmax k-means clustering algorithm. Pattern Recognit 47(7):2505–2516CrossRef
34.
Zurück zum Zitat Virmani D, Taneja S, Malhotra G (2015) Normalization based k means clustering algorithm. arXiv preprint arXiv:150300900 Virmani D, Taneja S, Malhotra G (2015) Normalization based k means clustering algorithm. arXiv preprint arXiv:​150300900
35.
36.
Zurück zum Zitat Wang X, Bai Y (2016) A modified minmax-means algorithm based on pso. Comput Intell Neurosci Wang X, Bai Y (2016) A modified minmax-means algorithm based on pso. Comput Intell Neurosci
37.
Zurück zum Zitat Wang Y, Li Y, Zhao Q (2016) Coupling regular tessellation with rjmcmc algorithm to segment sar image with unknown number of classes. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, p 7 Wang Y, Li Y, Zhao Q (2016) Coupling regular tessellation with rjmcmc algorithm to segment sar image with unknown number of classes. International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, p 7
38.
Zurück zum Zitat Xu L (2002) Byy harmony learning, structural rpcl, and topological self-organizing on mixture models. Neural Netw 15(8):1125–1151CrossRef Xu L (2002) Byy harmony learning, structural rpcl, and topological self-organizing on mixture models. Neural Netw 15(8):1125–1151CrossRef
39.
Zurück zum Zitat Xu S, Qiao X, Zhu L, Zheng H (2010) Deep analysis on mining frequent & maximal reference sequences with generalized suffix tree. J Comput Inf Syst 6(7):2187–2197 Xu S, Qiao X, Zhu L, Zheng H (2010) Deep analysis on mining frequent & maximal reference sequences with generalized suffix tree. J Comput Inf Syst 6(7):2187–2197
40.
Zurück zum Zitat Yadav J, Sharma M (2013) Automatic k-detection algorithm. In: 2013 International Conference on Machine Intelligence and Research Advancement (ICMIRA), IEEE, pp 269–273 Yadav J, Sharma M (2013) Automatic k-detection algorithm. In: 2013 International Conference on Machine Intelligence and Research Advancement (ICMIRA), IEEE, pp 269–273
Metadaten
Titel
An entropy-based initialization method of K-means clustering on the optimal number of clusters
verfasst von
Kuntal Chowdhury
Debasis Chaudhuri
Arup Kumar Pal
Publikationsdatum
10.11.2020
Verlag
Springer London
Erschienen in
Neural Computing and Applications / Ausgabe 12/2021
Print ISSN: 0941-0643
Elektronische ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-020-05471-9

Weitere Artikel der Ausgabe 12/2021

Neural Computing and Applications 12/2021 Zur Ausgabe

Premium Partner