Journal of Classification

16-07-2019

Effects of Resampling in Determining the Number of Clusters in a Data Set

Authors: Rainer Dangl, Friedrich Leisch

Published in: Journal of Classification | Issue 3/2020

Abstract

Cluster validation indices are widely used to detect the number of groups in a data set, which makes them a crucial step in model validation for clustering. This study demonstrates that the accuracy of certain indices can be significantly improved when they are calculated many times on data sets resampled from the original data. Of the many ways to resample data, three very common options are used here: bootstrapping, data splitting (two subsamples without overlap), and random subsetting (two subsamples with overlap). Index values calculated on the resampled data sets are compared to the values obtained from the original data partition. The primary hypothesis states that resampling generally improves index accuracy. It rests on the notion of cluster stability: if a data set contains stable clusters, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it does hold for external validation measures. The secondary hypothesis states that the resampling strategy itself does not play a significant role. This was also confirmed, although slight deviations between the resampling schemes suggest that splitting yields slightly better results.
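
The three resampling schemes and the stability idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the 1-D k-means, the 80% subset fraction, the 20 repetitions, the nearest-centre extension to the full data set, and all function names are assumptions made for illustration. The Rand index stands in for the external validation measures the abstract mentions.

```python
import random
from itertools import combinations

def bootstrap_pair(n, rng):
    """Bootstrapping: two samples of n indices, each drawn with replacement."""
    return ([rng.randrange(n) for _ in range(n)],
            [rng.randrange(n) for _ in range(n)])

def split_pair(n, rng):
    """Data splitting: two disjoint halves, so the subsamples never overlap."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:2 * half]

def subset_pair(n, rng, frac=0.8):
    """Random subsetting: two subsets drawn independently without
    replacement, so they may overlap (frac is an assumed value)."""
    m = int(frac * n)
    return rng.sample(range(n), m), rng.sample(range(n), m)

def rand_index(a, b):
    """Rand index: share of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def assign(x, centers):
    """Index of the centre nearest to x."""
    return min(range(len(centers)), key=lambda c: abs(x - centers[c]))

def kmeans_1d(xs, k, rng, iters=20):
    """Plain Lloyd-style k-means on 1-D data (illustrative only)."""
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[assign(x, centers)].append(x)
        # An empty group keeps its previous centre.
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers

def stability(data, k, scheme, rng, reps=20):
    """Mean Rand index between the partitions of the full data set that
    are induced by clustering two resamples separately."""
    scores = []
    for _ in range(reps):
        i1, i2 = scheme(len(data), rng)
        c1 = kmeans_1d([data[i] for i in i1], k, rng)
        c2 = kmeans_1d([data[i] for i in i2], k, rng)
        lab1 = [assign(x, c1) for x in data]
        lab2 = [assign(x, c2) for x in data]
        scores.append(rand_index(lab1, lab2))
    return sum(scores) / reps

if __name__ == "__main__":
    # Toy data: two well-separated groups of 20 points each.
    data = [i / 20 for i in range(20)] + [100 + i / 20 for i in range(20)]
    rng = random.Random(1)
    for scheme in (bootstrap_pair, split_pair, subset_pair):
        for k in (2, 3, 4):
            print(scheme.__name__, k, round(stability(data, k, scheme, rng), 3))
```

Under the stability notion, the number of clusters whose resampled partitions agree most consistently is the candidate for the number of groups in the data; on toy data like this, k = 2 would be expected to score highest under all three schemes.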

Title: Effects of Resampling in Determining the Number of Clusters in a Data Set
Authors: Rainer Dangl, Friedrich Leisch
Publication date: 16-07-2019
Publisher: Springer US
Published in: Journal of Classification / Issue 3/2020
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI: https://doi.org/10.1007/s00357-019-09328-2
