Published in: Journal of Classification 3/2020

16.07.2019

Effects of Resampling in Determining the Number of Clusters in a Data Set

Authors: Rainer Dangl, Friedrich Leisch

Abstract

Cluster validation indices are widely used to determine the number of groups in a data set, which makes them a crucial step in model validation for clustering. The study presented in this paper demonstrates that the accuracy of certain indices can be significantly improved when they are calculated many times on data sets resampled from the original data. Of the many ways to resample data, this study uses three very common options: bootstrapping, data splitting (two subsamples without overlap), and random subsetting (two subsamples with overlap). Index values calculated on the resampled data sets are compared with the values obtained from the original data partition. The primary hypothesis of the study is that resampling generally improves index accuracy. It rests on the notion of cluster stability: if a data set contains stable clusters, a clustering algorithm should produce consistent results for data sampled or resampled from the same source. The primary hypothesis was partly confirmed: it holds for external validation measures. The secondary hypothesis is that the resampling strategy itself does not play a significant role. This was also confirmed, although slight deviations between the resampling schemes suggest that splitting yields slightly better results.
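
The three resampling schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the subset fraction of 0.8, and the choice of the Rand index as an example of an external validation measure are assumptions made here for illustration only. Each scheme returns a pair of index lists into a data set of size n; the Rand index then measures how consistently two partitions group the same points.

```python
import random
from itertools import combinations


def bootstrap_pair(n, rng):
    """Bootstrapping: two samples of n indices, each drawn with replacement."""
    first = [rng.randrange(n) for _ in range(n)]
    second = [rng.randrange(n) for _ in range(n)]
    return first, second


def split_pair(n, rng):
    """Data splitting: two disjoint halves, so the subsamples never overlap."""
    idx = list(range(n))
    rng.shuffle(idx)
    half = n // 2
    return idx[:half], idx[half:2 * half]


def subset_pair(n, rng, frac=0.8):
    """Random subsetting: two subsets drawn independently without
    replacement; they may share points. frac=0.8 is an assumed value."""
    size = int(frac * n)
    return rng.sample(range(n), size), rng.sample(range(n), size)


def rand_index(labels_a, labels_b):
    """Share of point pairs on which two partitions of the same points
    agree (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

In a stability analysis along these lines, one would cluster both subsamples of a pair for each candidate number of clusters, compare the resulting partitions on common or predicted points with a measure such as `rand_index`, and repeat over many pairs; the clustering step itself is left out of this sketch.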

Metadata
Title
Effects of Resampling in Determining the Number of Clusters in a Data Set
Authors
Rainer Dangl
Friedrich Leisch
Publication date
16.07.2019
Publisher
Springer US
Published in
Journal of Classification / Issue 3/2020
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-019-09328-2
