Published in: Advances in Data Analysis and Classification 1/2022

9 January 2022 | Regular Article

An empirical comparison and characterisation of nine popular clustering methods

Author: Christian Hennig


Abstract

Nine popular clustering methods are applied to 42 real data sets. The aim is to characterise the methods in detail by means of several cluster validation indexes that each measure an individual aspect of the resulting clusters, such as small within-cluster distances, separation of clusters, or closeness to a Gaussian distribution, as introduced in Hennig (2019). Thirty of the data sets come with a “true” clustering. On these data sets, the similarity of the clusterings from the nine methods to the “true” clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the “true” clusterings, which is unobservable in real clustering problems. The study gives new insight not only into the ability of the methods to discover “true” clusterings, but also into the properties of clusterings that can be expected from each method, which is crucial for choosing a method in a real situation where no “true” clustering is given.
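As an illustration of what "similarity to the true clustering" can mean in practice, the sketch below computes the adjusted Rand index between two label vectors. The abstract does not name the similarity measure used in the study, so the choice of the adjusted Rand index here is an assumption, and the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same objects.

    Assumption: this is one common similarity measure for clusterings,
    not necessarily the one used in the study.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both label vectors must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs of objects placed together in both clusterings.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each clustering separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # value expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings have a single cluster
    return (together_both - expected) / (maximum - expected)


true_labels = [0, 0, 0, 1, 1, 1]
found_labels = [1, 1, 1, 0, 0, 0]  # same partition, cluster names swapped
print(adjusted_rand_index(true_labels, found_labels))  # 1.0
```

The index depends only on which objects are grouped together, not on the cluster names, and it is corrected so that random labelings score around zero. Such a measure needs the "true" labels, which is why it cannot be computed in a real clustering problem.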

References
Ackerman M, Ben-David S (2008) Measures of clustering quality: a working set of axioms for clustering. Adv Neural Inf Process Syst NIPS 22:121–128
Ackerman M, Ben-David S, Branzei S, Loker D (2012) Weighted clustering. In: Proceedings of the 26th AAAI conference on artificial intelligence, pp 858–863
Ackerman M, Ben-David S, Loker D (2010) Towards property-based classification of clustering paradigms. In: Advances in neural information processing systems (NIPS), pp 10–18
Adolfsson A, Ackerman M, Brownstein NC (2019) To cluster, or not to cluster: an analysis of clusterability methods. Pattern Recognit 88:13–26
Akhanli SE, Hennig C (2020) Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. Stat Comput 30(5):1523–1544
Amigo E, Gonzalo J, Artiles J, Verdejo F (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr 12:461–486
Anderlucci L, Hennig C (2014) Clustering of categorical data: a comparison of a model-based and a distance-based approach. Commun Stat Theory Methods 43:704–721
Andrews JL, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Stat Comput 22(5):1021–1029
Andrews JL, Wickins JR, Boers NM, McNicholas PD (2018) teigen: an R package for model-based clustering and classification via the multivariate \(t\) distribution. J Stat Softw 83(7):1–32
Arbelaitz O, Gurrutxaga I, Muguerza J, Pérez JM, Perona I (2013) An extensive comparative study of cluster validity indices. Pattern Recognit 46(1):243–256
Bagga A, Baldwin B (1998) Entity-based cross-document coreferencing using the vector space model. In: Proceedings of the 36th annual meeting of the association for computational linguistics and the 17th international conference on computational linguistics (COLING-ACL 98). ACL, Stroudsburg PE, pp 79–85
Boulesteix AL, Hatz M (2017) Benchmarking for clustering methods based on real data: a statistical view. In: Data science: innovative developments in data analysis and clustering. Springer, Berlin, pp 73–82
Boulesteix AL (2015) Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11:e1004191
Boulesteix AL, Lauer S, Eugster MJA (2013) A plea for neutral comparison studies in computational sciences. PLoS ONE 8:e61562
Brusco MJ, Steinley D (2007) A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72:583–600
Coretto P, Hennig C (2016) Robust improper maximum likelihood: tuning, computation, and a comparison with other methods for robust Gaussian clustering. J Am Stat Assoc 111:1648–1659
Correa-Morris J (2013) An indication of unification for different clustering approaches. Pattern Recognit 46:2548–2561
de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinform 9:497
Dimitriadou E, Barth M, Windischberger C, Hornik K, Moser E (2004) A quantitative comparison of functional MRI cluster analysis. Artif Intell Med 31:57–71
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad UM (eds) KDD 96: proceedings of the second international conference on knowledge discovery and data mining. AAAI Press, Menlo Park, pp 226–231
Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley, New York
Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc 97:611–631
Halkidi M, Vazirgiannis M, Hennig C (2015) Method-independent indices for cluster validation and estimating the number of clusters. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 595–618
Hartigan JA, Wong MA (1979) Algorithm as 136: a k-means clustering algorithm. Appl Stat 28:100–108
Hennig C (2020) FPC: flexible procedures for clustering. R package version 2.2-8
Hennig C (2015) Clustering strategy and method selection. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 703–730
Hennig C (2018) Some thoughts on simulation studies to compare clustering methods. Arch Data Sci Ser A 5(1):1–21
Hennig C (2019) Cluster validation by measurement of clustering characteristics relevant to the user. In: Skiadas CH, Bozeman JR (eds) Data analysis and applications 1: clustering and regression, modeling—estimating, forecasting and data mining. ISTE Ltd., London, pp 1–24
Hennig C, Meila M (2015) Cluster analysis: an overview. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 1–19
Jain AK, Topchy A, Law MHC, Buhmann JM (2004) Landscape of clustering algorithms. In: Proceedings of the 17th international conference on pattern recognition (ICPR04). IEEE Computer Society Washington, vol 1, pp 260–263
Jardine N, Sibson R (1971) Mathematical taxonomy. Wiley, London
Javed A, Lee BS, Rizzo DM (2020) A benchmark study on time series clustering. Mach Learn Appl 1:100001
Karatzoglou A, Smola A, Hornik K, Zeileis A (2004) kernlab—an S4 package for kernel methods in R. J Stat Softw 11(9):1–20
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis, vol 344. Wiley, New York
Kleinberg J (2002) An impossibility theorem for clustering. Adv Neural Inf Process Syst NIPS 15:463–470
Kou G, Peng Y, Wang G (2014) Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Inf Sci 275:1–12
Liu X, Song W, Wong BY, Zhang T, Yu S, Lin GN, Di X (2019) A comparison framework and guideline of clustering methods for mass cytometry data. Genome Biol 20:297
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2019) cluster: cluster analysis basics and extensions. R package version 2.1.0
Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654
Meila M (2015) Criteria for comparing clusterings. In: Hennig C, Meila M, Murtagh F, Rocci R (eds) Handbook of cluster analysis. CRC Press, Boca Raton, pp 619–635
Meila M, Heckerman D (2001) An experimental comparison of model-based clustering methods. Mach Learn 42:9–29
Milligan GW (1980) An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika 45:325–342
Milligan GW (1981) A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46:187–199
Milligan GW (1996) Clustering validation: results and implications for applied analyses. In: Arabie P, Hubert LJ, Soete GD (eds) Clustering and classification. World Scientific, Singapore, pp 341–375
Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Dietterich T, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14 (NIPS 2001). NIPS, pp 1–8
Pinheiro JC, Bates DM (2000) Mixed-effects models in S and S-PLUS. Springer, New York
Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa L, Rodrigues FA (2019) Clustering algorithms: a comparative approach. PLoS ONE 14:e0210236
Saracli S, Dogan N, Dogan I (2013) Comparison of hierarchical cluster analysis methods by cophenetic correlation. J Inequal Appl 203:89
Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):289–317
Steinley D, Brusco MJ (2011) Evaluating the performance of model-based clustering: recommendations and cautions. Psychol Methods 16:63–79
Van Mechelen I, Boulesteix AL, Dangl R, Dean N, Guyon I, Hennig C, Leisch F, Steinley D (2018) Benchmarking in cluster analysis: a white paper. arXiv:1809.10496 [stat]
von Luxburg U, Williamson R, Guyon I (2012) Clustering: science or art? JMLR Workshop Conf Proc 27:65–79
Wang K, Ng A, McLachlan G (2018) EMMIXskew: the EM algorithm and skew mixture distribution. R package version 1.0.3
Metadata
Title
An empirical comparison and characterisation of nine popular clustering methods
Author
Christian Hennig
Publication date
9 January 2022
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 1/2022
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-021-00478-z
