
On some significance tests in cluster analysis

Journal of Classification

Abstract

We investigate the properties of several significance tests for distinguishing between the hypothesis H of a “homogeneous” population and an alternative A involving “clustering” or “heterogeneity,” with emphasis on the case of multidimensional observations x_1, ..., x_n ∈ ℝ^p. Four types of test statistics are considered: the (s-th) largest gap between observations, their mean distance (or similarity), the minimum within-cluster sum of squares resulting from a k-means algorithm, and the resulting maximum F statistic. The asymptotic distributions under H are given for n → ∞, and the asymptotic power of the tests is derived for neighboring alternatives.
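As a rough illustration of the four statistics named in the abstract, the following Python sketch computes each of them for a small sample. It is not the paper's procedure: the function names are hypothetical, the gap statistic is shown only for one-dimensional data, the k-means minimum is approximated by restarted Lloyd iterations (which need not reach the global minimum), and the F statistic is assumed to take the common between/within form [(T - W)/(k - 1)] / [W/(n - k)], where T is the total sum of squares and W the within-cluster sum of squares. The abstract does not spell out any of these definitions, and no null distribution or critical value is computed here.

```python
import itertools
import math
import random


def sth_largest_gap(values, s=1):
    """s-th largest spacing between consecutive order statistics (1-D data)."""
    ordered = sorted(values)
    gaps = sorted((b - a for a, b in zip(ordered, ordered[1:])), reverse=True)
    return gaps[s - 1]


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def _centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def _sum_of_squares(cluster, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return sum(math.dist(x, center) ** 2 for x in cluster)


def kmeans_wss(points, k, restarts=20, iterations=100, seed=0):
    """Smallest within-cluster sum of squares found by restarted Lloyd iterations.

    Heuristic only: the result is an upper bound on the true minimum.
    """
    rng = random.Random(seed)
    best = math.inf
    for _ in range(restarts):
        centers = rng.sample(points, k)  # start from k distinct sample points
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for x in points:
                nearest = min(range(k), key=lambda j: math.dist(x, centers[j]))
                clusters[nearest].append(x)
            # Keep the old center if a cluster happens to be empty.
            new_centers = [
                _centroid(c) if c else centers[j] for j, c in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(_sum_of_squares(c, centers[j]) for j, c in enumerate(clusters))
        best = min(best, wss)
    return best


def max_f_statistic(points, k, **kwargs):
    """Assumed pseudo-F form: [(T - W) / (k - 1)] / [W / (n - k)].

    Undefined (division by zero) if the clusters fit the data exactly, W = 0.
    """
    n = len(points)
    total = _sum_of_squares(points, _centroid(points))
    within = kmeans_wss(points, k, **kwargs)
    return ((total - within) / (k - 1)) / (within / (n - k))
```

Points are passed as a list of equal-length tuples, for example `max_f_statistic([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)], k=2)`. Large values of the gap or F statistic, or small values of the within-cluster sum of squares, would point toward clustering; judging significance requires the null distributions derived in the paper.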




Cite this article

Bock, H.H. On some significance tests in cluster analysis. Journal of Classification 2, 77–108 (1985). https://doi.org/10.1007/BF01908065
