Abstract
Variable selection in cluster analysis is important yet challenging. It can be achieved by regularization methods, which realize a trade-off between the clustering accuracy and the number of selected variables by using a lasso-type penalty. However, the calibration of the penalty term can suffer from criticisms. Model selection methods are an efficient alternative, yet they require a difficult optimization of an information criterion which involves combinatorial problems. First, most of these optimization algorithms are based on a suboptimal procedure (e.g. stepwise method). Second, the algorithms are often computationally expensive because they need multiple calls of EM algorithms. Here we propose to use a new information criterion based on the integrated complete-data likelihood. It does not require the maximum likelihood estimate and its maximization appears to be simple and computationally efficient. The original contribution of our approach is to perform the model selection without requiring any parameter estimation. Then, parameter inference is needed only for the unique selected model. This approach is used for the variable selection of a Gaussian mixture model with conditional independence assumed. The numerical experiments on simulated and benchmark datasets show that the proposed method often outperforms two classical approaches for variable selection. The proposed approach is implemented in the R package VarSelLCM available on CRAN.
Acknowledgments
The authors are grateful to Gilles Celeux, Paul Diver and Jean-Michel Marin for their leading comments.
Appendices
Appendix 1: Consistency of the MICL criterion
This section is devoted to the proof of consistency of our \(\text {MICL}\) criterion with a fixed number of components. The first part deals with non-nested models and requires a bias-entropy compensation assumption. The second part covers nested models, i.e., the case where the competing model contains the true model. In what follows, we consider the true model \(\varvec{m}^{(0)} = \big (g^{(0)}, \varvec{\omega }^{(0)}\big )\), whose set of relevant variables is \({\varOmega }^{(0)} = \left\{ j : \omega ^{(0)}_j = 1 \right\} \) and whose parameter is \(\varvec{\theta }^{(0)}\).
Case of non-nested model We need to introduce the entropy notation given by
\[\mathcal {E}_n\big (\varvec{\theta }\mid \varvec{m}\big ) = -\sum _{i=1}^{n}\sum _{k=1}^{g} \tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ) \ln \tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ),\]
where \(\tau _{ik}\big (\varvec{\theta }\mid \varvec{m}\big ) = \dfrac{\tau _k \phi \big (\varvec{x}_i \mid \theta _k, \varvec{m}\big )}{\sum _{h=1}^{g}\tau _h \phi \big (\varvec{x}_i \mid \theta _h, \varvec{m}\big )}.\)
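For intuition, the conditional class probabilities \(\tau _{ik}\) above can be computed as follows. This is an illustrative sketch, not the authors' implementation: it assumes a Gaussian mixture with conditional independence (diagonal covariances), so each component density \(\phi \) factorizes over the variables, and the function names and array shapes are our own conventions.

```python
import numpy as np
from scipy.stats import norm

def responsibilities(X, pi, mu, sigma):
    """Posterior class probabilities tau_ik for a Gaussian mixture with
    conditional independence (each component has a diagonal covariance).

    X: (n, d) data; pi: (g,) mixture proportions;
    mu, sigma: (g, d) per-component means and standard deviations.
    """
    n, _ = X.shape
    g = len(pi)
    # ln[ tau_k * phi(x_i | theta_k) ]; independence makes phi a product
    # of univariate Gaussian densities, i.e. a sum of log densities.
    log_num = np.zeros((n, g))
    for k in range(g):
        log_num[:, k] = np.log(pi[k]) + norm.logpdf(X, mu[k], sigma[k]).sum(axis=1)
    # Normalize in log space for numerical stability before exponentiating.
    log_num -= log_num.max(axis=1, keepdims=True)
    tau = np.exp(log_num)
    tau /= tau.sum(axis=1, keepdims=True)
    return tau
```

Each row of the returned matrix sums to one, giving the responsibilities used in the entropy term above.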
Proposition 1
Assume that \(\varvec{m}^{(1)}\) is a model within which \(\varvec{m}^{(0)}\) is not nested. Assume that
where \(\mathbf {KL}\Big [\varvec{m}^{(0)}||\varvec{m}^{(1)}\Big ]\) is the Kullback-Leibler divergence of \(p\big (\cdot \mid \varvec{\theta }^{(0)},\varvec{m}^{(0)}\big )\) from \(p\big (\cdot \mid \varvec{\theta }^{(1)},\varvec{m}^{(1)}\big )\) and
When \(n \rightarrow \infty \), we have
Proof
For any model \(\varvec{m}\), we have the following inequalities,
It follows that
Now set \({\varDelta }\nu = \nu ^{(1)} - \nu ^{(0)}\) where \(\nu ^{(1)}\) and \(\nu ^{(0)}\) are the numbers of free parameters in the models \(\varvec{m}^{(1)}\) and \(\varvec{m}^{(0)}\) respectively. Using Laplace’s approximation, we have
where \(\widehat{\varvec{\theta }}^{(0)}\) and \(\widehat{\mathbf {z}}^{(0)}\) are respectively the MLE and the partition given by the corresponding MAP rule. In the same way, we have
where \(\widehat{\varvec{\theta }}^{(1)}\) is the MLE of \(\varvec{\theta }^{(1)}\). Note that
where
and
When \(n \rightarrow \infty \), we have \(A_n \rightarrow \chi ^2_{{\varDelta }\nu }\) in distribution and \(B_n\) tends to
in probability. Thus, under the assumption (22), \(\text {MICL}\) is consistent since when \(n \rightarrow \infty \), we have
Case of nested model Recall that \(\text {MICL}\big (\varvec{m}^{(0)}\big ) = \ln p\big (\mathbf {x}, \mathbf {z}^{(0)}\mid \varvec{m}^{(0)}\big )\), where \(\mathbf {z}^{(0)} = \underset{\mathbf {z}}{\text {argmax}} \ln p\big (\mathbf {x}, \mathbf {z}\mid \varvec{m}^{(0)}\big )\). We have
where \({\varOmega }_0 = \left\{ j : \omega ^{(0)}_j = 1\right\} \). Let \(\varvec{m}^{(1)} = \left( g^{(0)}, {\varOmega }_1\right) \) where \({\varOmega }_1 = {\varOmega }_0 \cup {\varOmega }_{01}\) and \( {\varOmega }_{01} = \left\{ j : \omega _j^{(1)} = 1, \omega _j^{(0)} = 0 \right\} \). Then, in the same way, we have \(\text {MICL}\big (\varvec{m}^{(1)}\big ) = \ln p\big (\mathbf {x}, \mathbf {z}^{(1)}\!\mid \varvec{m}^{(1)}\big )\), where
Let \(j \in {\varOmega }_{01}\), Laplace’s approximation gives us,
where
Proposition 2
Assume that \(\varvec{m}^{(1)}\) is a model such that \(g^{(1)} = g^{(0)}\) and \({\varOmega }_1 = {\varOmega }_0 \cup {\varOmega }_{01}\) where \({\varOmega }_{01} \ne \emptyset \), i.e., the model \(\varvec{m}^{(0)}\) is nested within the model \(\varvec{m}^{(1)}\) with the same number of components. When \(n \rightarrow \infty \),
Proof
We have
And for each \(j \in {\varOmega }_{01}\), when \(n \rightarrow \infty \)
We have
Appendix 2: Details on the partition step
At iteration [r], the partition \(\mathbf {z}^{[r]}\) is defined as a partition which increases the value of the integrated complete-data likelihood for the current model \(\varvec{m}^{[r]}\). This partition is obtained by an iterative method initialized with the partition \(\mathbf {z}^{[r-1]}\). Each iteration consists of sampling one individual uniformly at random and assigning it to the class that maximizes the integrated complete-data likelihood, while all other class memberships are kept unchanged.
At iteration [r] of the global algorithm, the algorithm used to obtain \(\mathbf {z}^{[r]}\) is initialized at the partition \(\mathbf {z}^{(0)}=\mathbf {z}^{[r-1]}\). It performs S iterations, where iteration (s) is defined as follows:
Individual sampling \(i^{(s)} \sim \mathcal {U}\{1,\ldots ,n\}\)
Partition optimization define the set of partitions \(\mathcal {Z}^{(s)}=\{\mathbf {z}: \varvec{z}_i=\varvec{z}_i^{(s-1)},\; \forall i\ne i^{(s)}\}\) and compute the optimized partition \(\mathbf {z}^{(s)}\) defined by
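The two steps above can be sketched as follows. This is a minimal illustration of the single-individual update scheme, not the VarSelLCM implementation: the callable `score` stands in for \(\ln p(\mathbf {x}, \mathbf {z}\mid \varvec{m})\), and we assume (as the paper's efficiency argument requires) that it is cheap to evaluate, e.g. via sufficient-statistic updates.

```python
import random

def optimize_partition(z, g, score, S, rng=None):
    """Run S single-individual updates of the partition z.

    z: initial class labels (list of ints in 0..g-1);
    g: number of classes;
    score: callable z -> float, a stand-in for the integrated
           complete-data log-likelihood ln p(x, z | m).
    """
    rng = rng or random.Random(0)
    z = list(z)
    for _ in range(S):
        i = rng.randrange(len(z))            # individual sampling: i ~ U{1,...,n}
        best_k, best_val = z[i], float("-inf")
        for k in range(g):                   # try each class for individual i,
            z[i] = k                         # all other memberships unchanged
            val = score(z)
            if val > best_val:
                best_k, best_val = k, val
        z[i] = best_k                        # keep the maximizing class
    return z
```

Because each update can only increase the score, the sequence of partitions is monotone in the integrated complete-data likelihood, which is what makes the partition step a simple ascent procedure.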
Marbac, M., Sedki, M. Variable selection for model-based clustering using the integrated complete-data likelihood. Stat Comput 27, 1049–1063 (2017). https://doi.org/10.1007/s11222-016-9670-1