Skip to main content
Top
Published in:

25-06-2022

An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling

Authors: Md. Zubair, MD. Asif Iqbal, Avijeet Shil, M. J. M. Chowdhury, Mohammad Ali Moni, Iqbal H. Sarker

Published in: Annals of Data Science | Issue 5/2024

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

K-means algorithm is one of the well-known unsupervised machine learning algorithms. The algorithm typically finds out distinct non-overlapping clusters in which each point is assigned to a group. The minimum squared distance technique distributes each point to the nearest clusters or subgroups. One of the K-means algorithm’s main concerns is to find out the initial optimal centroids of clusters. It is the most challenging task to determine the optimum position of the initial clusters’ centroids at the very first iteration. This paper proposes an approach to find the optimal initial centroids efficiently to reduce the number of iterations and execution time. To analyze the effectiveness of our proposed method, we have utilized different real-world datasets to conduct experiments. We have first analyzed COVID-19 and patient datasets to show our proposed method’s efficiency. A synthetic dataset of 10M instances with 8 dimensions is also used to estimate the performance of the proposed algorithm. Experimental results show that our proposed method outperforms traditional kmeans++ and random centroids initialization methods regarding the computation time and the number of iterations.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
It is one of the matrices used by Oxford COVID-19 Government Response Tracker [43]. It delivers a picture of the country’s enforced strongest measures.
 
Literature
1.
go back to reference Sarker IH (2022) Ai-based modeling: Techniques, applications and research issues towards automation, intelligent and smart systems. SN Computer Science 3(2):1–20CrossRef Sarker IH (2022) Ai-based modeling: Techniques, applications and research issues towards automation, intelligent and smart systems. SN Computer Science 3(2):1–20CrossRef
2.
go back to reference Bonaccorso G (2017) Machine learning algorithms Bonaccorso G (2017) Machine learning algorithms
3.
go back to reference Sarker IH (2021) Data science and analytics: an overview from data-driven smart computing, decision-making and applications perspective. SN Computer Science 2(5):1–22CrossRef Sarker IH (2021) Data science and analytics: an overview from data-driven smart computing, decision-making and applications perspective. SN Computer Science 2(5):1–22CrossRef
4.
go back to reference Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques
5.
go back to reference Olson DL, Shi Y, Shi Y (2007) Introduction to business data mining, vol 10. McGraw-Hill/Irwin, New York Olson DL, Shi Y, Shi Y (2007) Introduction to business data mining, vol 10. McGraw-Hill/Irwin, New York
6.
go back to reference Sarker IH, Colman A, Han J, Watters PA (2021) Context-aware machine learning and mobile data analytics: automated rule-based services with intelligent decision-making. Springer Nature, Switzerland Sarker IH, Colman A, Han J, Watters PA (2021) Context-aware machine learning and mobile data analytics: automated rule-based services with intelligent decision-making. Springer Nature, Switzerland
8.
go back to reference Pham DT, Dimov SS, Nguyen CD (2004) An incremental k-means algorithm. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 218(7):783–795 Pham DT, Dimov SS, Nguyen CD (2004) An incremental k-means algorithm. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science 218(7):783–795
9.
go back to reference Shi Y, Tian Y, Kou G, Peng Y, Li J (2011) Optimization based data mining: theory and applications. Springer, LondonCrossRef Shi Y, Tian Y, Kou G, Peng Y, Li J (2011) Optimization based data mining: theory and applications. Springer, LondonCrossRef
10.
go back to reference Zubair Md, Iqbal A, Shil A, Haque E, Moshiul Hoque M, Sarker IH (2020) An efficient k-means clustering algorithm for analysing covid-19. In International Conference on Hybrid Intelligent Systems, pages 422–432. Springer Zubair Md, Iqbal A, Shil A, Haque E, Moshiul Hoque M, Sarker IH (2020) An efficient k-means clustering algorithm for analysing covid-19. In International Conference on Hybrid Intelligent Systems, pages 422–432. Springer
11.
go back to reference Rahim MdS, Ahmed T (2017) An initial centroid selection method based on radial and angular coordinates for k-means algorithm. In 2017 20th International Conference of Computer and Information Technology (ICCIT), 1–6. IEEE Rahim MdS, Ahmed T (2017) An initial centroid selection method based on radial and angular coordinates for k-means algorithm. In 2017 20th International Conference of Computer and Information Technology (ICCIT), 1–6. IEEE
12.
go back to reference Kumar A, Gupta SC (2015) A new initial centroid finding method based on dissimilarity tree for k-means algorithm. arXiv preprint arXiv:1509.03200 Kumar A, Gupta SC (2015) A new initial centroid finding method based on dissimilarity tree for k-means algorithm. arXiv preprint arXiv:​1509.​03200
13.
go back to reference Mahmud MdS, Rahman MdM, Akhtar MdN (2012) Improvement of k-means clustering algorithm with better initial centroids based on weighted average. In 2012 7th International Conference on Electrical and Computer Engineering, 647–650. IEEE Mahmud MdS, Rahman MdM, Akhtar MdN (2012) Improvement of k-means clustering algorithm with better initial centroids based on weighted average. In 2012 7th International Conference on Electrical and Computer Engineering, 647–650. IEEE
14.
go back to reference Goyal M, Kumar S (2014) Improving the initial centroids of k-means clustering algorithm to generalize its applicability. Journal of The Institution of Engineers (India): Series B 95(4):345–350 Goyal M, Kumar S (2014) Improving the initial centroids of k-means clustering algorithm to generalize its applicability. Journal of The Institution of Engineers (India): Series B 95(4):345–350
15.
go back to reference Lakshmi MA, Daniel GV, Rao DS (2019) Initial centroids for k-means using nearest neighbors and feature means. In Wang J, Reddy GRM, Prasad VK, Reddy VS (eds), Soft Computing and Signal Processing, 27–34, Singapore. Springer Singapore Lakshmi MA, Daniel GV, Rao DS (2019) Initial centroids for k-means using nearest neighbors and feature means. In Wang J, Reddy GRM, Prasad VK, Reddy VS (eds), Soft Computing and Signal Processing, 27–34, Singapore. Springer Singapore
16.
go back to reference Sawant KB (2015) Efficient determination of clusters in k-mean algorithm using neighborhood distance. The International Journal of Emerging Engineering Research and Technology 3(1):22–27 Sawant KB (2015) Efficient determination of clusters in k-mean algorithm using neighborhood distance. The International Journal of Emerging Engineering Research and Technology 3(1):22–27
17.
go back to reference Fahim AM, Salem AM, Torkey FAf, Ramadan MA (2006) An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-Science A 7(10):1626–1633CrossRef Fahim AM, Salem AM, Torkey FAf, Ramadan MA (2006) An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-Science A 7(10):1626–1633CrossRef
18.
go back to reference Motwani M, Arora N, Gupta A (2019) A study on initial centroids selection for partitional clustering algorithms. In Hoda MN, Chauhan N, Quadri SMK, Srivastava PR (eds), Software Engineering, pages 211–220, Singapore. Springer Singapore Motwani M, Arora N, Gupta A (2019) A study on initial centroids selection for partitional clustering algorithms. In Hoda MN, Chauhan N, Quadri SMK, Srivastava PR (eds), Software Engineering, pages 211–220, Singapore. Springer Singapore
19.
go back to reference Yedla M, Pathakota SR, Srinivasa TM (2010) Enhancing k-means clustering algorithm with improved initial center. International Journal of computer science and information technologies 1(2):121–125 Yedla M, Pathakota SR, Srinivasa TM (2010) Enhancing k-means clustering algorithm with improved initial center. International Journal of computer science and information technologies 1(2):121–125
20.
go back to reference Vadyala SR, Betgeri SN, Sherer EA, Amritphale A (2020) Prediction of the number of covid-19 confirmed cases based on k-means-lstm. arXiv preprint arXiv:2006.14752 Vadyala SR, Betgeri SN, Sherer EA, Amritphale A (2020) Prediction of the number of covid-19 confirmed cases based on k-means-lstm. arXiv preprint arXiv:​2006.​14752
21.
go back to reference Poompaavai A, Manimannan G (2019) Clustering study of indian states and union territories affected by coronavirus (covid-19) using k-means algorithm. International Journal of Data Mining And Emerging Technologies 9(2):43–51CrossRef Poompaavai A, Manimannan G (2019) Clustering study of indian states and union territories affected by coronavirus (covid-19) using k-means algorithm. International Journal of Data Mining And Emerging Technologies 9(2):43–51CrossRef
22.
go back to reference Sonbhadra SK, Agarwal S, Nagabhushan P (2020) Target specific mining of covid-19 scholarly articles using one-class approach. arXiv preprint arXiv:2004.11706 Sonbhadra SK, Agarwal S, Nagabhushan P (2020) Target specific mining of covid-19 scholarly articles using one-class approach. arXiv preprint arXiv:​2004.​11706
23.
go back to reference Chinchorkar S (2020) Defining covid 19 containment zones using k-means dynamically Chinchorkar S (2020) Defining covid 19 containment zones using k-means dynamically
24.
go back to reference Aydin N, Yurdakul G (2020) Assessing countries’ performances against covid-19 via wsidea and machine learning algorithms. Applied Soft Computing 97:106792CrossRef Aydin N, Yurdakul G (2020) Assessing countries’ performances against covid-19 via wsidea and machine learning algorithms. Applied Soft Computing 97:106792CrossRef
25.
go back to reference KUCUKEFE B (2020) Clustering macroeconomic impact of covid-19 in oecd countries and china. Ekonomi Politika ve Finans Araştırmaları Dergisi, 5(Özel Sayı):280–291 KUCUKEFE B (2020) Clustering macroeconomic impact of covid-19 in oecd countries and china. Ekonomi Politika ve Finans Araştırmaları Dergisi, 5(Özel Sayı):280–291
26.
go back to reference Zhang T, Lin G (2020) Generalized k-means in glms with applications to the outbreak of covid-19 in the united states. arXiv preprint arXiv:2008.03838 Zhang T, Lin G (2020) Generalized k-means in glms with applications to the outbreak of covid-19 in the united states. arXiv preprint arXiv:​2008.​03838
27.
go back to reference de la Fuente-Tomas L, Arranz B, Safont G, Sierra P, Sanchez-Autet M, Garcia-Blanco A, Garcia-Portilla MP (2019) Classification of patients with bipolar disorder using k-means clustering. PloS one 14(1):e0210314CrossRef de la Fuente-Tomas L, Arranz B, Safont G, Sierra P, Sanchez-Autet M, Garcia-Blanco A, Garcia-Portilla MP (2019) Classification of patients with bipolar disorder using k-means clustering. PloS one 14(1):e0210314CrossRef
28.
go back to reference Silitonga P (2017) Clustering of patient disease data by using k-means clustering. International Journal of Computer Science and Information Security (IJCSIS) 15(7):219–221 Silitonga P (2017) Clustering of patient disease data by using k-means clustering. International Journal of Computer Science and Information Security (IJCSIS) 15(7):219–221
29.
go back to reference Das N, Iqbal MDA (2020) Nearest blood & plasma donor finding: A machine learning approach. In 2020 23rd International Conference on Computer and Information Technology (ICCIT), 1–6. IEEE Das N, Iqbal MDA (2020) Nearest blood & plasma donor finding: A machine learning approach. In 2020 23rd International Conference on Computer and Information Technology (ICCIT), 1–6. IEEE
30.
go back to reference Alam MdS, Rahman MdM, Hossain MA, Islam MdK, Ahmed KM, Ahmed KT, Singh BC, Miah MdS (2019) Automatic human brain tumor detection in mri image using template-based k means and improved fuzzy c means clustering algorithm. Big Data and Cognitive Computing 3(2):27CrossRef Alam MdS, Rahman MdM, Hossain MA, Islam MdK, Ahmed KM, Ahmed KT, Singh BC, Miah MdS (2019) Automatic human brain tumor detection in mri image using template-based k means and improved fuzzy c means clustering algorithm. Big Data and Cognitive Computing 3(2):27CrossRef
31.
go back to reference Shi Y (2022) Advances in big data analytics: theory, algorithms and practices. Springer, SingaporeCrossRef Shi Y (2022) Advances in big data analytics: theory, algorithms and practices. Springer, SingaporeCrossRef
32.
go back to reference Arthur D, Vassilvitskii S (2006) k-means++: The advantages of careful seeding. Technical report, Stanford Arthur D, Vassilvitskii S (2006) k-means++: The advantages of careful seeding. Technical report, Stanford
33.
go back to reference Aloise D, Deshpande A, Hansen P, Popat P (2009) Np-hardness of euclidean sum-of-squares clustering. Machine learning 75(2):245–248CrossRef Aloise D, Deshpande A, Hansen P, Popat P (2009) Np-hardness of euclidean sum-of-squares clustering. Machine learning 75(2):245–248CrossRef
34.
go back to reference Berkhin P (2006) A Survey of Clustering Data Mining Techniques, 25–71. Springer Berlin Heidelberg, Berlin, Heidelberg Berkhin P (2006) A Survey of Clustering Data Mining Techniques, 25–71. Springer Berlin Heidelberg, Berlin, Heidelberg
35.
go back to reference Abdi H, Williams LJ (2010) Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2(4):433–459CrossRef Abdi H, Williams LJ (2010) Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2(4):433–459CrossRef
36.
go back to reference Sehgal S, Singh H, Agarwal M, Bhasker V et al (2014) Data analysis using principal component analysis. In International Conference on Medical Imaging, m-Health and Emerging Communication Systems (MedCom), 45–48. IEEE Sehgal S, Singh H, Agarwal M, Bhasker V et al (2014) Data analysis using principal component analysis. In International Conference on Medical Imaging, m-Health and Emerging Communication Systems (MedCom), 45–48. IEEE
37.
go back to reference Altman DG, Bland JM (1994) Statistics notes: quartiles, quintiles, centiles, and other quantiles. Bmj 309(6960):996CrossRef Altman DG, Bland JM (1994) Statistics notes: quartiles, quintiles, centiles, and other quantiles. Bmj 309(6960):996CrossRef
39.
go back to reference Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830 Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
44.
go back to reference Kodinariya TM, Makwana PR (2013) Review on determining number of cluster in k-means clustering. International Journal 1(6):90–95 Kodinariya TM, Makwana PR (2013) Review on determining number of cluster in k-means clustering. International Journal 1(6):90–95
45.
go back to reference Sarker IH (2022) Smart city data science: Towards data-driven smart cities with open research issues. Internet of Things, 100528 Sarker IH (2022) Smart city data science: Towards data-driven smart cities with open research issues. Internet of Things, 100528
46.
go back to reference Tien JM (2017) Internet of things, real-time decision making, and artificial intelligence. Annals of Data Science 4(2):149–178CrossRef Tien JM (2017) Internet of things, real-time decision making, and artificial intelligence. Annals of Data Science 4(2):149–178CrossRef
Metadata
Title
An Improved K-means Clustering Algorithm Towards an Efficient Data-Driven Modeling
Authors
Md. Zubair
MD. Asif Iqbal
Avijeet Shil
M. J. M. Chowdhury
Mohammad Ali Moni
Iqbal H. Sarker
Publication date
25-06-2022
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 5/2024
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-022-00428-2

Premium Partner