Dimension reduction for model-based clustering

Statistics and Computing

Abstract

We introduce a dimension reduction method for visualizing the clustering structure obtained from a finite mixture of Gaussian densities. Information on the dimension reduction subspace is obtained from the variation in group means and, depending on the estimated mixture model, from the variation in group covariances. The proposed method aims to reduce dimensionality by identifying a set of linear combinations of the original features, ordered by importance as quantified by the associated eigenvalues, which capture most of the cluster structure contained in the data. Observations may then be projected onto this reduced subspace, providing summary plots that help to visualize the clustering structure. These plots can be particularly appealing in the case of high-dimensional data and noisy structure. The newly constructed variables capture most of the clustering information available in the data, and they can be further reduced to improve clustering performance. We illustrate the approach on both simulated and real data sets.
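
The abstract does not state the estimator itself, so the following Python sketch only illustrates the general idea of eigenvalue-ordered linear combinations derived from the variation in group means. It is a simplified stand-in under stated assumptions, not the paper's procedure: it uses hard cluster labels (assumed to come from an already fitted Gaussian mixture), ignores the variation in group covariances, and solves a generalized eigenproblem of the between-group matrix against the marginal covariance. The function name `mean_variation_directions` and the simulated data are hypothetical.

```python
import numpy as np

def mean_variation_directions(X, labels):
    """Hypothetical helper: directions ordered by eigenvalue, built from
    the variation in group means only (an assumption of this sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, p = X.shape
    mu = X.mean(axis=0)
    # Marginal covariance of the features (1/n normalization).
    Sigma = np.cov(X, rowvar=False, bias=True)

    # Between-group matrix: weighted outer products of centered group means.
    M = np.zeros((p, p))
    for k in np.unique(labels):
        Xk = X[labels == k]
        d = Xk.mean(axis=0) - mu
        M += (len(Xk) / n) * np.outer(d, d)

    # Solve M v = lambda * Sigma v by whitening with Sigma^(-1/2).
    w, U = np.linalg.eigh(Sigma)
    S_inv_half = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
    evals, evecs = np.linalg.eigh(S_inv_half @ M @ S_inv_half)

    # Order directions by decreasing eigenvalue and map back to feature space.
    order = np.argsort(evals)[::-1]
    evals = evals[order]
    V = S_inv_half @ evecs[:, order]
    V /= np.linalg.norm(V, axis=0)
    return evals, V

# Simulated example: three groups separated in the first two of four features.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
labels = np.repeat(np.arange(3), 100)
X = centers[labels] + rng.normal(size=(300, 4))

evals, V = mean_variation_directions(X, labels)
Z = X @ V[:, :2]  # observations projected onto the two leading directions
```

With three groups the between-group matrix has rank at most two, so only the first two eigenvalues are meaningfully positive, and a scatter plot of the columns of `Z` would serve as the kind of summary plot the abstract describes.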


Author information

Corresponding author

Correspondence to Luca Scrucca.

About this article

Cite this article

Scrucca, L. Dimension reduction for model-based clustering. Stat Comput 20, 471–484 (2010). https://doi.org/10.1007/s11222-009-9138-7

  • DOI: https://doi.org/10.1007/s11222-009-9138-7
