Neural Computing and Applications

10-11-2020 | Original Article

An entropy-based initialization method of K-means clustering on the optimal number of clusters

Authors: Kuntal Chowdhury, Debasis Chaudhuri, Arup Kumar Pal

Published in: Neural Computing and Applications | Issue 12/2021

Abstract

Clustering is an unsupervised learning approach that groups similar features according to a specific mathematical criterion, known as the objective function; every clustering method depends on some objective function. K-means is one of the most widely used partitional clustering algorithms, and its performance depends on the initial points and on the value of K. In this paper, we address both of these parameters together. We define an entropy-based objective function for the initialization process, which performs better than other existing initialization methods for K-means clustering. We also design an algorithm that determines the correct number of clusters in a dataset using several cluster validity indexes. The proposed entropy-based initialization algorithm is applied to different 2D and 3D datasets, and a comparison with other existing initialization methods is presented.
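
To illustrate why the choice of initial points matters, the following is a minimal sketch of the standard K-means (Lloyd) iteration started from two different sets of initial centers. It is not the entropy-based initialization proposed in the paper, whose details are not given in this abstract; the function names `sq_dist` and `kmeans` and the four-point toy dataset are assumptions made purely for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centers, max_iter=100):
    """Standard Lloyd iteration from caller-supplied initial centers.

    Returns (centers, labels, sse), where sse is the sum of squared
    distances from each point to its assigned center.
    """
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Hypothetical toy data: two tight pairs that are far apart along x.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Initial centers stacked in the middle split the data along y.
_, labels_bad, sse_bad = kmeans(points, [(2, 0), (2, 1)])

# Initial centers placed on the two pairs recover the natural grouping.
_, labels_good, sse_good = kmeans(points, [(0, 0), (4, 0)])

print("poor initialization SSE:", sse_bad)
print("good initialization SSE:", sse_good)
```

Both runs converge, but the first settles in a local optimum with a much larger sum of squared errors than the second. This sensitivity is what initialization methods for K-means aim to reduce.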

The COUNT() function returns the number of cluster validity indexes that support a particular value of K (the number of clusters). Here, 2, 3, ..., c denote the candidate numbers of clusters, starting from K = 2.
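
The counting step described above can be sketched as a simple vote across validity indexes. This is only an illustration under stated assumptions: the function name `select_k_by_votes`, the index names, the K value each index prefers, and the rule of breaking ties in favor of the smaller K are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def select_k_by_votes(preferred_k):
    """Pick the K supported by the most validity indexes.

    preferred_k maps an index name to the K that index prefers.
    Returns (chosen_k, counts), where counts[k] plays the role of
    COUNT(k): the number of indexes supporting k.
    Ties go to the smallest K (an assumption of this sketch).
    """
    counts = Counter(preferred_k.values())
    top = max(counts.values())
    chosen_k = min(k for k, c in counts.items() if c == top)
    return chosen_k, dict(counts)


# Hypothetical votes: the K each index prefers on some dataset.
votes = {
    "silhouette": 3,
    "davies_bouldin": 3,
    "calinski_harabasz": 4,
    "dunn": 3,
}

chosen_k, counts = select_k_by_votes(votes)
print("chosen K:", chosen_k)
print("support per K:", counts)
```

Here three of the four hypothetical indexes support K = 3, so it is selected.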

Title: An entropy-based initialization method of K-means clustering on the optimal number of clusters
Authors: Kuntal Chowdhury, Debasis Chaudhuri, Arup Kumar Pal
Publication date: 10-11-2020
Publisher: Springer London
Published in: Neural Computing and Applications / Issue 12/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI: https://doi.org/10.1007/s00521-020-05471-9
