Published in: Journal of Classification 3/2020

16-07-2019

Effects of Resampling in Determining the Number of Clusters in a Data Set

Authors: Rainer Dangl, Friedrich Leisch



Abstract

Cluster validation indices are widely used to estimate the number of groups in a data set, which makes them a crucial step in validating a clustering model. This study demonstrates that the accuracy of certain indices improves significantly when they are calculated repeatedly on data sets resampled from the original data. Of the many ways to resample data, three very common options are used here: bootstrapping, data splitting (without overlap between the two subsamples), and random subsetting (with overlap between the two subsamples). Index values calculated on resampled data sets are compared with the values obtained from the partition of the original data. The primary hypothesis is that resampling generally improves index accuracy. It rests on the notion of cluster stability: if a data set contains stable clusters, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it holds for external validation measures. The secondary hypothesis is that the resampling strategy itself does not play a significant role. This was also confirmed, although slight deviations between the resampling schemes suggest that splitting yields slightly better results.
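The three resampling schemes and the stability idea can be illustrated with a minimal sketch. It rests on assumptions the abstract does not state: k-means is used as the clustering algorithm, the Rand index serves as the external agreement measure, and each pair of resampled partitions is compared by assigning all original points to the nearest centroid of each solution. All function names (`resample_pair`, `stability`, `toy_data`, and so on) are hypothetical and are not taken from the paper.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm (assumed base clusterer); returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
                   for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def rand_index(a, b):
    """Share of point pairs on which two partitions agree (same/different cluster)."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

def resample_pair(n, scheme, rng):
    """Return two lists of row indices drawn according to the chosen scheme."""
    idx = list(range(n))
    if scheme == "bootstrap":   # two samples of size n, drawn with replacement
        return [rng.choices(idx, k=n) for _ in range(2)]
    if scheme == "split":       # two disjoint halves, no overlap
        rng.shuffle(idx)
        return [idx[: n // 2], idx[n // 2:]]
    if scheme == "subset":      # two independent half-size subsets, overlap possible
        return [rng.sample(idx, n // 2) for _ in range(2)]
    raise ValueError(f"unknown scheme: {scheme}")

def stability(points, k, scheme, reps=20, seed=0):
    """Mean Rand index between pairs of partitions fitted on resampled data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(reps):
        partitions = []
        for sample in resample_pair(len(points), scheme, rng):
            centroids = kmeans([points[i] for i in sample], k, rng)
            # Compare both solutions on the full original data set.
            partitions.append([nearest(p, centroids) for p in points])
        scores.append(rand_index(partitions[0], partitions[1]))
    return sum(scores) / reps

def toy_data(seed=0, per_cluster=20):
    """Three Gaussian blobs in the plane; purely illustrative data."""
    rng = random.Random(seed)
    centres = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in centres for _ in range(per_cluster)]

if __name__ == "__main__":
    data = toy_data()
    for scheme in ("bootstrap", "split", "subset"):
        for k in range(2, 6):
            print(scheme, k, round(stability(data, k, scheme), 3))
```

Under this reading, a number of clusters that matches stable structure in the data should tend to produce a higher mean agreement than its neighbours, and the three schemes can be compared by running the same loop with a different `scheme` argument. The pairwise Rand index is quadratic in the number of points, so the sketch is only suited to small data sets.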


Metadata
Title: Effects of Resampling in Determining the Number of Clusters in a Data Set
Authors: Rainer Dangl, Friedrich Leisch
Publication date: 16-07-2019
Publisher: Springer US
Published in: Journal of Classification, Issue 3/2020
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI: https://doi.org/10.1007/s00357-019-09328-2
