Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering

Bdiri, Taoufik; Bouguila, Nizar; Ziou, Djemel

doi:10.1007/s10489-015-0714-6

Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering

Published: 06 October 2015

Volume 44, pages 507–525, (2016)
Cite this article

Applied Intelligence Aims and scope Submit manuscript

Abstract

We developed a variational Bayesian learning framework for the infinite generalized Dirichlet mixture model (i.e. a weighted mixture of Dirichlet process priors based on the generalized inverted Dirichlet distribution) that has proven its capability to model complex multidimensional data. We also integrate a “feature selection” approach to highlight the features that are most informative in order to construct an appropriate model in terms of clustering accuracy. Experiments on synthetic data as well as real data generated from visual scenes and handwritten digits datasets illustrate and validate the proposed approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Density-Based Clustering Based on Hierarchical Density Estimates

A Systematic Review of Hidden Markov Models and Their Applications

Article 12 May 2020

Mixture Models: Latent Profile and Latent Class Analysis

References

Jain AK, Murty M, Flynn P (1999) Data clustering: a Review. ACM Comput Surv 31(3):264–323
Article Google Scholar
Rui X, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Article Google Scholar
Bargary N, Hinde J, Garcia AF (2014) Finite mixture model clustering of snp data. In: MacKenziet G, Peng D (eds) Statistical Modelling in Biostatistics and Bioinformatics, Contributions to Statistics. Springer International Publishing, pp 139–157
Koestler DC, Marsit CJ, Christensen BC, Kelsey KT, Houseman EA (2014) A recursively partitioned mixture model for clustering time-course gene expression data. Translational Cancer Research 3(3)
Prabhakaran S, Rey M, Zagordi O, Beerenwinkel N, Roth V (2014) Hiv haplotype inference using a propagating dirichlet process mixture model. IEEE/ACM Trans Comput Biol Bioinform 11(1):182–191
Article Google Scholar
Tran KA, Vo NQ, Lee G (2014) A novel clustering algorithm based gaussian mixture model for image segmentation. In: Proc. of the 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’14, pp 97:1–97:5 ACM
Topkaya IS, Erdogan H, Porikli F (2014) Counting people by clustering person detector outputs. In: Proc. of the 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp 313–318
Zhou B, Tang X, Wang X (2015) Learning collective crowd behaviors with dynamic pedestrian-agents. Int J Comput Vis 111(1):50–68
Article Google Scholar
Boutemedjet S, Ziou D (2012) Predictive approach for user long-term needs in content-based image suggestion. IEEE Transactions on Neural Networks and Learning Systems 23(8):1242–1253
Article Google Scholar
Beutel A, Murray K, Faloutsos C, Smola AJ (2014) Cobafi: Collaborative bayesian filtering. In: Proc. of the 23rd International Conference on World Wide Web, WWW ’14, pages 97–108. ACM
Yin H, Cui B, Chen L, Hu Z, Huang Z (2014) A temporal context-aware model for user behavior modeling in social media systems. In: Proc. of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pp 1543–1554. ACM
Handcock MS, Raftery AE, Tantrum JM (2007) Model-based clustering for social networks. J R Stat Soc: Series A (Statistics in Society) 170(2):301–354
Article MathSciNet Google Scholar
Couronne T, Stoica A, Beuscart JS (2010) Online social network popularity evolution: An additive mixture model. In: Proc. of International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp 346–350
Xu D, Yang S (2014) Location prediction in social media based on contents and graphs. In: Proc. of Fourth International Conference on Communication Systems and Network Technologies (CSNT), pp 1177–1181
Bdiri T, Bouguila N (2011) Learning inverted dirichlet mixtures for positive data clustering . In: Proc. of the 13th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing (RSFDGrC), pp 265–272
Bdiri T, Bouguila N (2012) Positive vectors clustering using inverted dirichlet finite mixture models. Expert Systems With Applications 39(2):1869–1882
Article Google Scholar
Bdiri T, Bouguila N, Ziou D (2014) Object clustering and recognition using multi-finite mixtures for semantic classes and hierarchy modeling. Expert Systems with Applications 41(4, Part 1):1218–1235
Article Google Scholar
Bdiri T, Bouguila N, Ziou D (2015) A statistical framework for online learning using adjustable model selection criteria. Technical report, Concordia Institute for Information Systems Engineering. Concordia University, Montreal
Google Scholar
Bdiri T, Bouguila N, Ziou D (2013) Visual scenes categorization using a flexible hierarchical mixture model supporting users ontology. In: IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI), pp 262–267
Wallace CS (2005) Statistical and inductive inference by minimum message length. Springer-Verlag
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723
Article MathSciNet MATH Google Scholar
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471
Article MATH Google Scholar
Figueiredo MAT, Leit ao JMN, Jain A (1999) On fitting mixture models. In: Proc. of the Second International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition. Springer-Verlag, pp 54–69
McLachlan GJ, Peel D (2000) Finite Mixture Models. Wiley, New York
Book MATH Google Scholar
McLachlan GJ, Krishnan T (1997) The EM Algorithm and Extensions. John Wiley and Sons. Inc.
Winn J, Bishop CM (2005) Variational Message Passing. J Mach Learn Res 6:661–694
MathSciNet MATH Google Scholar
Dimitris K, Evdokia X (2003) Choosing initial values for the {EM} algorithm for finite mixtures. Comput Stat Data Anal 41(34):577–590
MathSciNet MATH Google Scholar
Robert CP (2007) The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, 2nd edn. Springer
Bouguila N, Elguebaly T (2012) A fully bayesian model based on reversible jump {MCMC} and finite beta mixtures for clustering. Expert Systems with Applications 39(5):5946–5959
Article Google Scholar
Pereyra M, Dobigeon N, Batatia H, Tourneret J (2013) Estimating the granularity coefficient of a potts-markov random field within a markov chain monte carlo algorithm. IEEE Trans Image Process 22(6):2385–2397
Article MathSciNet Google Scholar
Bouguila N, Ziou D (2008) A dirichlet process mixture of dirichlet distributions for classification and prediction. In: IEEE Workshop on Machine Learning for Signal Processing (MLSP), pp 297–302
Cowles MK, Carlin BP (1996) Markov chain monte carlo convergence diagnostics: A comparative review. J Am Stat Assoc 91(434):883–904
Article MathSciNet MATH Google Scholar
Bhatnagar N, Bogdanov A, Mossel E (2011) The computational complexity of estimating mcmc convergence time. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, volume 6845 of Lecture Notes in Computer Science. Springer, Berlin Heidelberg, pp 424–435
Google Scholar
Corduneanu A, Bishop CM (2001) Variational bayesian model selection for mixture distributions. In: Proc. of the Eighth International Conference on Artificial Intelligence and Statistics, p 2734. Morgan Kaufmann
Tan SL, Nott DJ (2014) Variational approximation for mixtures of linear mixed models. J Comput Graph Stat 23(2):564–585
Article MathSciNet Google Scholar
Thanh MN, Wu QMJ (2014) Asymmetric mixture model with variational bayesian learning. In: Proc. of International Joint Conference on Neural Networks (IJCNN), pp 285–290
Zhanyu M, Leijon A (2011) Bayesian estimation of beta mixture models with variational inference. IEEE Trans Pattern Anal Mach Intell 33(11):2160–2173
Article Google Scholar
Boutemedjet S, Bouguila N, Ziou D (2009) A hybrid feature extraction selection approach for high-dimensional non-gaussian data clustering. IEEE Trans Pattern Anal Mach Intell 31(8):1429–1443
Article Google Scholar
Wang H, Zha H, Qin H (2007) Dirichlet aggregation: unsupervised learning towards an optimal metric for proportional data. In: Proceedings of the 24th international conference on Machine learning, pp 959–966. ACM
Johnson NL, Kotz S, Balakrishnan N (1995) Continuous Univariate Distributions: Vol.: 2. Wiley series in probability and mathematical statistics. Applied probability and statistics
Sethuraman J. (1994) A constructive definition of Dirichlet priors. Stat Sin 4:639–650
MathSciNet MATH Google Scholar
Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer
Fei-Fei L, Fergus R, Perona P (2004) Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: Proc. of conference on Computer Vision and Pattern Recognition Workshop (CVPRW), pp 178–178
Constantinopoulos C, Titsias MK, Likas A (2006) Bayesian feature and model selection for gaussian mixture models. IEEE Trans Pattern Anal Mach Intell 28(6):1013–1018
Article Google Scholar
Fan W, Bouguila N (2013) Variational learning of a dirichlet process of generalized dirichlet distributions for clustering, simultaneous feature selection. Pattern Recogn 46(10):2754–2769
Article MATH Google Scholar
Blei DM, Jordan MI (2006) Variational inference for dirichlet process mixtures. Bayesian Analysis 1 (1):121–143
Article MathSciNet MATH Google Scholar
Jordan M, Ghahramani Z, Jaakkola T, Saul L (1999) An introduction to variational methods for graphical models. Mach Learn 37(2):183–233
Article MATH Google Scholar
Opper M, Saad D (2001) Advanced mean field methods: theory and practice. Neural Information Processing. Massachusetts Institute of Technology Press (MIT Press)
Bishop CM (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, New York Inc.
MATH Google Scholar
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press
Ishwaran H, James LF (2001) Gibbs sampling methods for stick-breaking priors. J Am Stat Assoc 96(453)
Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396
Article Google Scholar
Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166
Article Google Scholar
Salter MT, Murphy TB (2012) Variational bayesian inference for the latent position cluster model for network data. Comput Stat Data Anal 57(1):661–671
Article MathSciNet Google Scholar
Nasios N, Bors AG (2006) Variational learning for gaussian mixture models . IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 36(4):849–862
Article Google Scholar
Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. Int J Comput Vis 42:145–175
Article MATH Google Scholar
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 886–893 IEEE
Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc. of the IEEE 86(11):2278–2324
Article Google Scholar

Download references

Acknowledgments

The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC); and Concordia University via a Research Chair in Management, Analysis, and Modeling of Big Multimodal Data and Applications. The authors would like to thank the anonymous referees and the associate editor for their helpful comments.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Concordia University, Montreal, QC, H3G 1T7, Canada
Taoufik Bdiri
The Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, QC, H3G 1T7, Canada
Nizar Bouguila
DI, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC, J1K 2R1, Canada
Djemel Ziou

Authors

Taoufik Bdiri
View author publications
You can also search for this author in PubMed Google Scholar
Nizar Bouguila
View author publications
You can also search for this author in PubMed Google Scholar
Djemel Ziou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nizar Bouguila.

Appendices

Appendix A: Conditional independence in the transformed space

We know that the posterior probability is p(j|Y _i)∝π _j p(Y _i|α _j, β _j), so every vector Y _i is assigned to its cluster j such as j = arg maxj p(j|Y _i)= arg maxj π _j p(Y _i|α _j, β _j). We have:

$$ p(\mathbf{Y}_{i}|\boldsymbol{\alpha}_{j},\boldsymbol{\beta}_{j}) = \prod\limits_{d=1}^{D}\frac{\varGamma{(\alpha_{jd}+\beta_{jd})}}{{ \varGamma(\alpha_{jd})\varGamma(\beta_{jd})}} \frac{Y_{id}^{\alpha_{jd}-1}}{ (1+{\sum}_{l=1}^{D} Y_{il})^{\gamma_{jd}}} $$

(58)

For GID, it is possible to compute the posterior probability by examining the form of the product in (58) and considering every feature separately, so if we want to consider the feature D, (58) becomes for a specific vector Y _i=(Y _i1, Y _i2,..., Y _{i
D}):

$$\begin{array}{@{}rcl@{}} \frac{1}{B(\alpha_{jD},\beta_{jD})} Y_{iD}^{\alpha_{jD}-1} \left(1+\sum\limits_{l=1}^{D} Y_{il}\right)^{- \beta_{jD} - \alpha_{jD} + \beta_{j(D+1)} } \\ \times \prod\limits_{l=1}^{D-1}\frac{1}{B(\alpha_{jl},\beta_{jl})} Y_{iD}^{\alpha_{jl}-1} \left(1+\sum\limits_{k=1}^{l} Y_{ik}\right)^{- \beta_{jl} - \alpha_{jl} + \beta_{j(l+1)} } \end{array} $$

(59)

where B(α _{j
l}, β _{j
l}) is the beta function such that $B(\alpha _{jl},\beta _{jl}) = \frac {\varGamma (\alpha _{jl})\varGamma (\alpha _{jl})}{\varGamma (\alpha _{jl} + \beta _{jl})}$. As β _j(D+1)=0, (59) becomes:

$$\begin{array}{@{}rcl@{}} &&\frac{1}{B(\alpha_{jD},\beta_{jD})} Y_{iD}^{\alpha_{jD}-1} \left(1+\sum\limits_{l=1}^{D} Y_{il}\right)^{- \beta_{jD} - \alpha_{jD} } \prod\limits_{l=1}^{D-1}\\ &&\times\frac{1}{B(\alpha_{jl},\beta_{jl})} Y_{iD}^{\alpha_{jl}-1} \left(1+\sum\limits_{k=1}^{l} Y_{ik}\right)^{- \beta_{jl} - \alpha_{jl} + \beta_{j(l+1)} } \end{array} $$

(60)

by multiplying (60) by the constant $\left (1+{\sum }_{l=1}^{D-1} Y_{il}\right )^{\beta _{jD} + \alpha _{jD} - \alpha _{jD} +1} = \left (1+{\sum }_{l=1}^{D-1} Y_{il}\right )^{\beta _{jD} + 1}$, (60) becomes proportional to:

$$\begin{array}{@{}rcl@{}} \frac{1}{B(\alpha_{jD},\beta_{jD})} \!\!\left(\frac{Y_{iD}}{1+{\sum}_{l=1}^{D-1} Y_{il}}\right)^{\alpha_{jD}-1}\!\! \left(\!1\,+\,\frac{Y_{iD}}{1\,+\,{\sum}_{l=1}^{D-1} Y_{il}}\!\right)^{\!- \beta_{jD} - \alpha_{jD} } \\ \times \! \prod\limits_{l=1}^{D-1}\!\!\frac{1}{B(\alpha_{jl},\beta_{jl})} Y_{iD}^{\alpha_{jl}-1} \!\left(\!1\,+\,\sum\limits_{k=1}^{l} Y_{ik}\!\right)^{\!- \beta_{jl} - \alpha_{jl} + \beta_{j(l+1)} }\\ \end{array} $$

(61)

We know that:

$$\begin{array}{@{}rcl@{}} \frac{1}{B(\alpha_{jD},\beta_{jD})}\! \left(\!\frac{Y_{iD}}{1+{\sum}_{l=1}^{D-1} Y_{il}}\!\right)^{\alpha_{jD}-1} \!\!\left(\!1\,+\,\frac{Y_{iD}}{1+{\sum}_{l=1}^{D-1} Y_{il}}\!\right)^{\!- \beta_{jD} - \alpha_{jD}} \,=\, \\ p_{iBeta}\!\left(\!\frac{Y_{iD}}{1+{\sum}_{l=1}^{D-1} Y_{il}} | \alpha_{jD},\beta_{jD}\!\right)\\ \end{array} $$

(62)

so (60) becomes:

$$\begin{array}{@{}rcl@{}} p_{iBeta}\!\left(\!\frac{Y_{iD}}{1+{\sum}_{l=1}^{D-1} Y_{il}} | \alpha_{jD},\beta_{jD}\!\right) \!\prod\limits_{l=1}^{D-1}\!\frac{1}{B(\alpha_{jl},\beta_{jl})} Y_{iD}^{\alpha_{jl}-1}\\ \times\!\left(\!1\,+\,\sum\limits_{k=1}^{l} Y_{ik}\!\right)^{\!- \beta_{jl} - \alpha_{jl} + \beta_{j(l+1)} }\\ \end{array} $$

(63)

For every remaining feature l in the product from 1 to D−1 we mutiply (63) by the constant $\left (1+{\sum }_{k=1}^{l-1} Y_{ik}\right )^{\beta _{jl} + \alpha _{jl} - \alpha _{jl} +1} \quad \left (1+{\sum }_{k=1}^{l} Y_{ik}\right )^{- \beta _{j(l+1)}}=\left (1+{\sum }_{k=1}^{l-1} Y_{ik}\right )^{\beta _{jl} +1}\left (1+{\sum }_{k=1}^{l} Y_{ik}\right )^{- \beta _{j(l+1)} } $ so (63) will be proportional to:

$$ \prod\limits_{l=1}^{D} p_{iBeta}\left(\frac{Y_{il}}{1+{\sum}_{k=1}^{l-1} Y_{ik}} | \alpha_{jl},\beta_{jl}\right) $$

(64)

the first term of the product in (64) is : p _{i
B
e
t
a}(Y _i1|α _{j
l}, β _{j
l}) so we finally have:

$$\begin{array}{@{}rcl@{}} p(j|\mathbf{Y}_{i})\propto \pi_{j}p(\mathbf{Y}_{i}|\boldsymbol{\alpha}_{j},\boldsymbol{\beta}_{j})\propto \pi_{j} p_{iBeta}(Y_{i1} | \alpha_{jl},\beta_{jl})\\ \times\prod\limits_{l=2}^{D} p_{iBeta}\left(\frac{Y_{il}}{1+{\sum}_{k=1}^{l-1} Y_{ik}} | \alpha_{jl},\beta_{jl}\right) \end{array} $$

(65)

Appendix B: Proof of equations

Equation (21) shows that the terms that are independent of Q _s(Θ _s) are absorbed into an additive constant. In order to make use of (21) we need to calculate the logarithm of (18) with the truncation of number of components of the GID mixture M, and the number of components of the irrelevant features to K. We also know that $\left \{ {\sum }_{j=1}^{M} \langle Z_{ij} \rangle \right \} = 1$, so this term will be discarded when it is factorized in the variational factors.

1.1 Variational solution to Q(ϕ)

We compute the logarithm of the variational factor Q(ϕ _{i
l}) as

$$\begin{array}{@{}rcl@{}} \ln{Q(\phi_{il})} &=& \phi_{il} \left\{ \langle \ln{\epsilon_{l1}}\rangle+\sum\limits_{j=1}^{M} \langle Z_{ij} \rangle \left[ \mathcal{R}_{jl} +(\bar{\alpha}_{jl} -1 ) \ln{X_{il}}\right.\right.\\ &&-\!\left.\left. (\bar{\alpha}_{jl} + \bar{\beta}_{jl})\ln{(1 + X_{il})} \right]\vphantom{{\sum}_{j=1}^{M}}\right\} \\ &&+\,(1\,-\, \phi_{il}) \left\{ \langle \ln{\epsilon_{l2}}\rangle + \left\{\sum\limits_{k=1}^{K} \langle W_{ikl}\rangle\left[\mathcal{F}_{kl} \,+\,(\bar{\sigma}_{kl} \,-\,1 ) \ln{X_{il}} \right.\right.\right. \\ &&-\!\left. \left. \left. (\bar{\sigma}_{kl} + \bar{\tau}_{kl})\ln{(1 + X_{il})} \right]\vphantom{{\sum}_{j=1}^{M}} \right\} \right\} \,+\, const \end{array} $$

(66)

where

$$ \bar{\alpha} = \langle \alpha \rangle, \;\;\; \bar{\beta} = \langle \beta \rangle, \;\;\; \bar{\tau} = \langle \tau \rangle $$

(67)

and

$$ \mathcal{R}_{jl} = \langle\ln{\frac{ \varGamma(\alpha_{jl} + \beta_{jl} ) }{\varGamma(\alpha_{jl})\varGamma(\beta_{jl})}}\rangle, \;\;\; \mathcal{F}_{kl} = \langle\ln{\frac{ \varGamma(\sigma_{kl} + \tau_{kl} ) }{\varGamma(\sigma_{kl})\varGamma(\tau_{kl})}}\rangle $$

(68)

The expectations in (68) are analytically intractable, thus, we apply the second-order Taylor series expansion in order to obtain a closed-form expression such as in [37]. The approximation of $\mathcal {R}_{jl}$ and $\mathcal {F}_{kl}$ are given by $\tilde {\mathcal {R}}_{jl}$ (31) and $\tilde {\mathcal {F}}_{kl}$ (32), respectively. We substitute the lower bound of (31) and (32) in (66) we obtain

$$\begin{array}{@{}rcl@{}} \ln{Q(\phi_{il})} &=& \phi_{il} \left\{ \langle \ln{\epsilon_{l1}}\rangle+\sum\limits_{j=1}^{M} \langle Z_{ij} \rangle \left[ \tilde{\mathcal{R}}_{jl} +(\bar{\alpha}_{jl} -1 ) \ln{X_{il}}\right.\right. \\ &&- \left.\left.(\bar{\alpha}_{jl} + \bar{\beta}_{jl})\ln{(1 + X_{il})} \right]\right\} \\ &&+(1\,-\, \phi_{il}) \left\{ \langle \ln{\epsilon_{l2}}\rangle \,+\, \left\{\sum\limits_{k=1}^{K} \langle W_{ikl}\rangle\left[\tilde{\mathcal{F}}_{kl} \,+\,(\bar{\sigma}_{kl} \,-\,1 ) \ln{X_{il}}\right. \right.\right. \\ &&- \left.\left.\left.(\bar{\sigma}_{kl} + \bar{\tau}_{kl})\ln{(1 + X_{il})} \right] \right\} \right\} + const \end{array} $$

(69)

From (69) we can deduce the variational solution of Q(ϕ) as a Bernoulli distribution such that

$$ Q(\boldsymbol{\phi}) = \prod\limits_{i=1}^{N}\prod\limits_{l=1}^{D} f_{il}^{\phi_{il}} (1-f_{il})^{(1-\phi_{il})} $$

(70)

where f _{i
l} is defined in (26), and from the Bernoulli distribution it is straightforward to have

$$ \langle \phi_{ij} \rangle = f_{ij}, \;\;\;\; \langle 1-\phi_{ij} \rangle = 1-f_{ij} $$

(71)

1.2 Variational solution to Q(Z)

The logarithm of the variational factor of Q(Z _{i
j}) is calculated as

$$\begin{array}{@{}rcl@{}} \ln{Q(Z_{ij})} &=& Z_{ij} \left\{ \sum\limits_{l=1}^{D} \left(\langle\phi_{il}\rangle \left[ \tilde{\mathcal{R}}_{jl} +(\bar{\alpha}_{jl} -1 ) \ln{X_{il}}- (\bar{\alpha}_{jl} + \bar{\beta}_{jl})\ln{(1 + X_{il})} \right]\right. \right. \\ &+&\left. \langle 1 - \phi_{il}\rangle\sum\limits_{k=1}^{K} \langle w_{ikl}\rangle \left[ \tilde{\mathcal{F}}_{kl} +(\bar{\sigma}_{kl} \,-\,1 ) \ln{X_{il}} \,-\, (\bar{\sigma}_{kl} \,+\, \bar{\tau}_{kl})\ln{(1 \,+\, X_{il})} \right] \right) \\ &+&\left.\langle \ln\lambda_{j}\rangle + \sum\limits_{s=1}^{j-1}\langle\ln{(1-\lambda_{s})}\rangle \right\} + const \end{array} $$

(72)

By analyzing the form of (72) we can write lnQ(Z) as

$$ \ln{Q(Z)} = \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{M} Z_{ij} \ln{\tilde{r}_{ij}} + const $$

(73)

where $\tilde {r}_{ij}$ is defined in (27). Thus we have

$$ Q(Z) \propto \prod\limits_{i=1}^{N}\prod\limits_{j=1}^{M} \tilde{r}_{ij}^{Z_{ij}} $$

(74)

We know that Z _{i
j} are binary and we have ${\sum }_{j=1}^{M} Z_{ij} = 1$, so we can normalize (74) such that

$$ Q(Z) = \prod\limits_{i=1}^{N}\prod\limits_{j=1}^{M} r_{ij}^{Z_{ij}} $$

(75)

where r _{i
j} is defined in (26). We can obtain 〈Z _{i
j}〉 from the multinomial distribution of Q(Z) such that

$$ \langle Z_{ij}\rangle = r_{ij} $$

(76)

1.3 Variational solution to Q(λ)

The logarithm for of the variational factor Q(λ) is given by

$$\begin{array}{@{}rcl@{}} \ln{Q(\lambda_{j})} &=& \ln{\lambda_{j}} \sum\limits_{i=1}^{N}\langle Z_{ij}\rangle + \ln{(1-\lambda_{j})}\\ &&\times\left({\sum\limits_{1}^{N}} \sum\limits_{s=j+1}^{M} \langle Z_{is} \rangle + \langle {\Psi}_{j}\rangle - 1 \right) + const \end{array} $$

(77)

Equation (77) has the logarithm form of the Beta distribution, by taking the exponential we obtain

$$ Q(\boldsymbol{\lambda}) = \prod\limits_{j=1}^{M} Beta(\lambda_{j}|\theta_{j},\vartheta_{j}) $$

(78)

where 𝜃 _j and 𝜗 _j are defined in (33). As γ has the Beta prior distribution, Q(γ) can be derived in a similar way as for Q(λ). Following the same steps we define ρ _k and ϖ _k in (35)

1.4 Variational solution to Q(ψ)

The logarithm form of Q(ψ) is given by

$$ \ln{Q(\psi_{j})} = \ln{\psi_{j}} a_{j}+ \psi_{j} (\langle \ln{(1-\lambda_{j})} \rangle - b_{j}) + const $$

(79)

by taking the exponential in “Variational solution to Q(ϕ)” we obtain

$$ Q(\boldsymbol{\psi}) = \prod\limits_{j=1}^{M} \mathcal{G}(\psi_{j}|a_{j}^{*},b_{j}^{*}) $$

(80)

where $a_{j}^{*}$ and $b_{j}^{*}$ are defined in (34). As φ has the Gamma prior distribution, Q(φ) can be derived in a similar way as for Q(ψ). Following the same steps we define $c^{*}_{k} $ and $d^{*}_{k} $ in (36).

1.5 Variational solution to Q(W)

The logarithm of the variational factor Q(W _{i
k
l}) is given by

$$\begin{array}{@{}rcl@{}} \ln{Q(W_{ikl})} &=& W_{ikl} \left\{ \langle1 - \phi_{il} \rangle \left(\tilde{\mathcal{F}}_{kl} +(\bar{\sigma}_{kl} -1 )\ln{X_{il}}\right.\right.\\ &-&\left. (\bar{\sigma}_{kl} + \bar{\tau}_{kl})\ln{(1 + X_{il})} {\vphantom{\tilde{\mathcal{F}}_{kl}}}\right) \\ &+& \left.\langle \ln\gamma_{k}\rangle +\sum\limits_{s=1}^{k-1} \langle \ln{(1 - \gamma_{s})}\rangle \right\} + const \end{array} $$

(81)

By taking the exponential in (81) we obtain

$$ Q(W) = \prod\limits_{i=1}^{N}\prod\limits_{k=1}^{K}\prod\limits_{l=1}^{D} m_{ikl}^{W_{ikl}} $$

(82)

where m _{i
k
l} is given by (26).

1.6 Variational solution to Q(𝜖)

The logarithm of the variational factor Q(𝜖 _l) is given by

$$\begin{array}{@{}rcl@{}} \ln{Q(\boldsymbol{\epsilon}_{l})} &=& \ln{\epsilon_{l_{1}}} \left(\sum\limits_{i=1}^{N} \langle \phi_{il} \rangle + \xi_{1} - 1 \right)\\ &&+\ln{\epsilon_{l_{2}}} \left(\sum\limits_{i=1}^{N} \langle 1 - \phi_{il} \rangle + \xi_{2} - 1 \right) + const \end{array} $$

(83)

Equation (83) has a logarithmic form similar to the logarithm form of a Dirichlet distribution. The variational solution to Q(𝜖 _l) can be obtained by

$$ Q(\boldsymbol{\epsilon}_{l}) = \prod\limits_{l=1}^{D} Dir(\boldsymbol{\epsilon}_{l} | \boldsymbol{\xi}^{*}) $$

(84)

where $\boldsymbol {\xi }^{*} = (\xi ^{*}_{1}, \xi ^{*}_{2})$ in (37).

1.7 Variational solution to Q(α), Q(β), Q(σ), and Q(τ)

The logarithm of the variational factor Q(α _{j
l}) can be calculated as

$$\begin{array}{@{}rcl@{}} \ln{Q(\alpha_{jl})} &=& {\langle \ln{p(\mathcal{X},\varTheta)}\rangle}_{\varTheta \neq \alpha_{jl}} \\ &=& \sum\limits_{i=1}^{N}\langle Z_{ij}\rangle \langle \phi_{il} \rangle \left[ \mathcal{D}(\alpha_{jl}) + \alpha_{jl} \ln{\frac{X_{il}}{1+X_{il}}}\right]\\ &&+ (u_{jl} - 1) \ln{\alpha_{jl}} - v_{jl}\alpha_{jl} + const \end{array} $$

(85)

where

$$ \mathcal{D} = \langle \ln \frac{\varGamma(\alpha_{jl} + \beta_{jl})}{\varGamma(\alpha_{jl})\varGamma(\beta_{jl})} \rangle_{\beta_{jl}} $$

(86)

using a non-linear approximation as proposed in [37] we have

$$ \mathcal{D}(\alpha) \geq \ln \alpha \left\{ \psi(\bar{\alpha} + \bar{\beta}) - \psi(\bar{\alpha}) +\left. \bar{\beta}\psi^{\prime}(\bar{\alpha} + \bar{\beta}) \right) (\langle \ln \beta \rangle - \ln \bar{\beta})\right\}\bar{\alpha} $$

(87)

We substitute the lower bound in (87) into the (85) we have

$$\begin{array}{@{}rcl@{}} \ln{Q(\alpha_{jl})} &=& \ln{\alpha_{jl}} \left\{ \sum\limits_{i=1}^{N}\langle Z_{ij}\rangle \langle \phi_{il} \rangle \left[ \psi(\bar{\alpha}_{jl} + \bar{\beta}_{jl})\right.\right. \\ &-& \left.\left.\psi(\bar{\alpha}_{jl}) + \bar{\beta}_{jl}\psi^{\prime}(\bar{\alpha}_{jl} + \bar{\beta}_{jl}) \right) (\langle \ln \beta_{jl} \rangle - \ln \bar{\beta}_{jl})\right]\bar{\alpha}_{jl}\\ &&+\left. u_{jl} - 1\vphantom{\sum\limits_{i=1}^{N}}\right\} \\ &+& \alpha_{jl} \left\{ \sum\limits_{i=1}^{N}\langle Z_{ij}\rangle \langle \phi_{il} \rangle \ln{\frac{X_{il}}{1+X_{il}}} - v_{jl} \right\} + const \end{array} $$

(88)

Equation (88) has the form of a Gamma distribution. By taking it to the exponential we obtain

$$ Q(\boldsymbol{\alpha}) = \prod\limits_{j=1}^{M}\prod\limits_{l=1}^{D} \mathcal{G}(\alpha_{jl}|u_{jl}^{*},v_{jl}^{*}) $$

(89)

The hyperparameters $u_{jl}^{*}$ and $v_{jl}^{*}$ can be estimated by (38) and (39), respectively. Since β, σ and τ have the Gamma prior, we obtain the variational solutions to Q(β), Q(σ), and Q(τ) in the same way as for Q(α).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Bdiri, T., Bouguila, N. & Ziou, D. Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering. Appl Intell 44, 507–525 (2016). https://doi.org/10.1007/s10489-015-0714-6

Download citation

Published: 06 October 2015
Issue Date: April 2016
DOI: https://doi.org/10.1007/s10489-015-0714-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering

Abstract

Access this article

Similar content being viewed by others

Density-Based Clustering Based on Hierarchical Density Estimates

A Systematic Review of Hidden Markov Models and Their Applications

Mixture Models: Latent Profile and Latent Class Analysis

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Conditional independence in the transformed space

Appendix B: Proof of equations

1.1 Variational solution to Q(ϕ)

1.2 Variational solution to Q(Z)

1.3 Variational solution to Q(λ)

1.4 Variational solution to Q(ψ)

1.5 Variational solution to Q(W)

1.6 Variational solution to Q(𝜖)

1.7 Variational solution to Q(α), Q(β), Q(σ), and Q(τ)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Variational Bayesian inference for infinite generalized inverted Dirichlet mixtures with feature selection and its application to clustering

Abstract

Access this article

Similar content being viewed by others

Density-Based Clustering Based on Hierarchical Density Estimates

A Systematic Review of Hidden Markov Models and Their Applications

Mixture Models: Latent Profile and Latent Class Analysis

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Appendices

Appendix A: Conditional independence in the transformed space

Appendix B: Proof of equations

1.1 Variational solution to Q(ϕ)

1.2 Variational solution to Q(Z)

1.3 Variational solution to Q(λ)

1.4 Variational solution to Q(ψ)

1.5 Variational solution to Q(W)

1.6 Variational solution to Q(𝜖)

1.7 Variational solution to Q(α), Q(β), Q(σ), and Q(τ)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation