Skip to main content

2024 | OriginalPaper | Buchkapitel

A K-Means Variation Based on Careful Seeding and Constrained Silhouette Coefficients

verfasst von : Libero Nigro, Franco Cicirelli, Francesco Pupo

Erschienen in: Advances in Data-Driven Computing and Intelligent Systems

Verlag: Springer Nature Singapore

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

K-Means is well-known clustering algorithm very often used for its simplicity and efficiency. Its properties have been thoroughly investigated. It is emerged that K-Means heavily depends on the seeding method used to initialize the cluster centroids and that, besides the seeding procedure, it mainly acts as a local refiner of the centroids and can easily become stuck around a local sub-optimal solution of the objective function cost. As a consequence, K-Means is often repeated many times, always starting with a different centroids’ configuration, to increase the likelihood of finding a clustering solution near the optimal one. In this paper, the Hartigan and Wong variation of K-Means (HWKM) is chosen because of its increased probability to ending up near the optimal solution. HWKM is then enhanced with the use of careful seeding methods and by an incremental technique which constrains the movement of points among clusters according to their Silhouette coefficients. The result is HWKM+ which, through a small number of restarts, is capable of generating a careful clustering solution with compact and well-separated clusters. The current implementation of HWKM+ rests on Java parallel streams. The paper describes the design and development of HWKM+ and demonstrates its abilities through a series of benchmark and real-world datasets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Garey MR, Johnson DS, Witsenhausen HS (1982) The complexity of the generalized Lloyd-Max problem. IEEE Trans Inf Theory 28:255–256MathSciNetCrossRef Garey MR, Johnson DS, Witsenhausen HS (1982) The complexity of the generalized Lloyd-Max problem. IEEE Trans Inf Theory 28:255–256MathSciNetCrossRef
2.
Zurück zum Zitat Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666CrossRef Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666CrossRef
4.
Zurück zum Zitat Nigro L (2022) Performance of parallel K-means algorithms in Java. Algorithms 15(4):117CrossRef Nigro L (2022) Performance of parallel K-means algorithms in Java. Algorithms 15(4):117CrossRef
5.
Zurück zum Zitat Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759CrossRef Fränti P, Sieranoja S (2018) K-means properties on six clustering benchmark datasets. Appl Intell 48(12):4743–4759CrossRef
6.
Zurück zum Zitat Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recogn 93:95–112CrossRef Fränti P, Sieranoja S (2019) How much can k-means be improved by using better initialization and repeats? Pattern Recogn 93:95–112CrossRef
7.
Zurück zum Zitat Vouros A, Langdell S, Croucher M, Vasilaki E (2021) An empirical comparison between stochastic and deterministic centroid initialization for K-means variations. Mach Learn 110:1975–2003MathSciNetCrossRef Vouros A, Langdell S, Croucher M, Vasilaki E (2021) An empirical comparison between stochastic and deterministic centroid initialization for K-means variations. Mach Learn 110:1975–2003MathSciNetCrossRef
8.
Zurück zum Zitat Baldassi C (2020) Recombinator K-means: a population-based algorithm that exploits k-means++ for recombination. arXiv:1905.00531v3, Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, via Sarfatti 25, 20135 Milan, Italy Baldassi C (2020) Recombinator K-means: a population-based algorithm that exploits k-means++ for recombination. arXiv:​1905.​00531v3, Artificial Intelligence Lab, Institute for Data Science and Analytics, Bocconi University, via Sarfatti 25, 20135 Milan, Italy
11.
Zurück zum Zitat Nigro L, Cicirelli F (2023) Performance of a K-means algorithm driven by careful seeding. In: Proceedings of the 13th international conference on simulation and modeling methodologies, technologies and applications, pp 27–36. ISBN 978-989-758-668-2, ISSN 2184-2841 Nigro L, Cicirelli F (2023) Performance of a K-means algorithm driven by careful seeding. In: Proceedings of the 13th international conference on simulation and modeling methodologies, technologies and applications, pp 27–36. ISBN 978-989-758-668-2, ISSN 2184-2841
12.
Zurück zum Zitat Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65CrossRef Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65CrossRef
14.
Zurück zum Zitat Fränti P, Rezaei M, Zhao Q (2014) Centroid index: cluster level similarity measure. Pattern Recogn 47(9):3034–3045CrossRef Fränti P, Rezaei M, Zhao Q (2014) Centroid index: cluster level similarity measure. Pattern Recogn 47(9):3034–3045CrossRef
15.
Zurück zum Zitat Fränti P, Rezaei M (2016) Generalizing centroid index to different clustering models. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), pp 285–296, Springer, Berlin Fränti P, Rezaei M (2016) Generalizing centroid index to different clustering models. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), pp 285–296, Springer, Berlin
16.
Zurück zum Zitat Nigro L, Cicirelli F, Fränti P (2023) Parallel random swap: an efficient and reliable clustering algorithm in Java. Simul Model Pract Theory 124:102712CrossRef Nigro L, Cicirelli F, Fränti P (2023) Parallel random swap: an efficient and reliable clustering algorithm in Java. Simul Model Pract Theory 124:102712CrossRef
17.
Zurück zum Zitat Hartigan JA, Wong MA (1979) Algorithm as 136: A k-means clustering algorithm. J Roy Stat Soc: Ser C (Appl Stat) 28(1):100–108 Hartigan JA, Wong MA (1979) Algorithm as 136: A k-means clustering algorithm. J Roy Stat Soc: Ser C (Appl Stat) 28(1):100–108
18.
Zurück zum Zitat Slonim N, Aharoni E, Crammer K (2013) Hartigan’s k-means versus Lloyd’s k-means-is it time for a change? IJCAI 1677–1684 Slonim N, Aharoni E, Crammer K (2013) Hartigan’s k-means versus Lloyd’s k-means-is it time for a change? IJCAI 1677–1684
19.
Zurück zum Zitat Urma RG, Fusco M, Mycroft A (2019) Modern Java in action. Manning, Shelter Island Urma RG, Fusco M, Mycroft A (2019) Modern Java in action. Manning, Shelter Island
21.
Zurück zum Zitat Rodriguez R, Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):14.92–14.96 Rodriguez R, Laio A (2014) Clustering by fast search and find of density peaks. Science 344(6191):14.92–14.96
22.
Zurück zum Zitat Nigro L (2021) Parallel Theatre: a Java actor-framework for high-performance computing. Simul Model Pract Theory 106:102189CrossRef Nigro L (2021) Parallel Theatre: a Java actor-framework for high-performance computing. Simul Model Pract Theory 106:102189CrossRef
Metadaten
Titel
A K-Means Variation Based on Careful Seeding and Constrained Silhouette Coefficients
verfasst von
Libero Nigro
Franco Cicirelli
Francesco Pupo
Copyright-Jahr
2024
Verlag
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-9521-9_17