
Model selection for probabilistic clustering using cross-validated likelihood

Statistics and Computing

Abstract

Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modeling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is straightforward in the sense that models are judged directly on their estimated out-of-sample predictive performance. The cross-validation approach, penalized likelihood, and McLachlan's bootstrap method are applied to two data sets, and the results from all three methods are in close agreement. The second data set involves a well-known clustering problem from the atmospheric science literature using historical records of upper-atmosphere geopotential height in the Northern Hemisphere. Cross-validated likelihood provides an interpretable and objective solution to the atmospheric clustering problem. The clusters found are in agreement with prior analyses of the same data based on non-probabilistic clustering techniques.
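The idea of judging each candidate number of components by its held-out log-likelihood can be illustrated with a minimal sketch. This is not the article's implementation: it assumes one-dimensional Gaussian components, a plain EM fit from a single random start, and repeated random train/test splits. The names `fit_mixture`, `mean_log_likelihood`, and `cv_select`, and the parameters `n_splits` and `test_frac`, are hypothetical and chosen only for illustration.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_mixture(data, k, rng, n_iter=100, min_var=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative only)."""
    n = len(data)
    center = sum(data) / n
    spread = max(sum((x - center) ** 2 for x in data) / n, min_var)
    means = rng.sample(data, k)          # start from k random data points
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every point.
        resp = []
        for x in data:
            logp = [math.log(w) + log_normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            norm = logsumexp(logp)
            resp.append([math.exp(lp - norm) for lp in logp])
        # M-step: re-estimate weights, means, and variances.
        counts = [sum(r[j] for r in resp) + 1e-10 for j in range(k)]
        total = sum(counts)
        for j in range(k):
            weights[j] = counts[j] / total
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / counts[j]
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / counts[j]
            variances[j] = max(var, min_var)  # guard against collapse
    return weights, means, variances


def mean_log_likelihood(data, params):
    """Average log-likelihood per point of data under a fitted mixture."""
    weights, means, variances = params
    total = 0.0
    for x in data:
        total += logsumexp([math.log(w) + log_normal_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)])
    return total / len(data)


def cv_select(data, k_values, n_splits=10, test_frac=0.5, seed=0):
    """Average held-out log-likelihood for each k over random splits."""
    rng = random.Random(seed)
    scores = {k: 0.0 for k in k_values}
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_frac)
        test, train = shuffled[:n_test], shuffled[n_test:]
        for k in k_values:
            params = fit_mixture(train, k, rng)
            scores[k] += mean_log_likelihood(test, params) / n_splits
    return scores


if __name__ == "__main__":
    demo_rng = random.Random(1)
    sample = ([demo_rng.gauss(0, 1) for _ in range(100)]
              + [demo_rng.gauss(10, 1) for _ in range(100)])
    for k, score in sorted(cv_select(sample, [1, 2, 3]).items()):
        print(k, round(score, 3))
```

Under these assumptions, the preferred number of components would be the `k` with the highest average held-out score. A practical version would likely use several EM restarts per fit and multivariate components.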




Cite this article

Smyth, P. Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 10, 63–72 (2000). https://doi.org/10.1023/A:1008940618127


  • DOI: https://doi.org/10.1023/A:1008940618127
