Abstract
We present a probabilistic multiple cause model for the analysis of binary (0–1) data. A distinctive feature of the aspect Bernoulli (AB) model is its ability to automatically detect and distinguish between “true absences” and “false absences” (both of which are coded as 0 in the data), and similarly, between “true presences” and “false presences” (both of which are coded as 1). This is accomplished by specific additive noise components which explicitly account for such non-content bearing causes. The AB model is thus suitable for noise removal and data explanatory purposes, including omission/addition detection. An important application of AB that we demonstrate is data-driven reasoning about palaeontological recordings. Additionally, results on recovering corrupted handwritten digit images and expanding short text documents are also given, and comparisons to other methods are demonstrated and discussed.
Notes
Note that at each attribute t of observation n, the latent aspect k is sampled anew from the distribution p(1|n),…,p(K|n).
NOW: Neogene of the Old World, http://www.helsinki.fi/science/now/.
Another way would be to average over L n , assuming that L n ranges uniformly between the number of ones in the observation and some manually chosen upper limit; in Table 4 this would give inferior results for many choices of the upper limit.
This is indeed not the proportion of 0s turned to 1s, but instead includes new 1s superimposed at existing 1s, which have no effect.
At large K, several aspects may correspond to noise, but for simplicity we only select the one having the smallest value of ∑ t a tk .
This also follows from the first moment identity for the exponential family of distributions, of which the Bernoulli distribution is a member [50].
References
Harman HH (1967) Modern factor analysis, 2nd edn. University of Chicago Press, Chicago
Saund E (1995) A multiple cause mixture model for unsupervised learning. Neural Comput 7:51–71
Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42:177–196
Hyvärinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley Interscience, New York
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Kabán A, Bingham E, Hirsimäki T (2004) Learning to read between the lines: the aspect Bernoulli model. In: Proceedings of the 4th SIAM international conference on data mining, pp 462–466
Buntine W (2002) Variational extensions to EM and multinomial PCA. In: Machine learning: ECML 2002. Lecture notes in artificial intelligence (LNAI), vol 2430. Springer, New York, pp 23–34
McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: Sahami M (ed) Learning for text categorization. Papers from the AAAI Workshop, Technical report WS-98-05, AAAI, pp 41–48
Fortelius M, Werdelin L, Andrews P, Bernor RL, Gentry A, Humphrey L, Mittmann W, Viranta S (1996) Provinciality, diversity, turnover and paleoecology in land mammal faunas of the later Miocene of western Eurasia. In: Bernor R, Fahlbusch V, Mittmann W (eds) The evolution of Western Eurasian Neogene Mammal Faunas. Columbia University Press, Columbia, pp 414–448
Srebro N (2004) Learning with matrix factorizations. Ph.D. thesis, Massachusetts Institute of Technology
Hofmann T, Puzicha J (1998) Unsupervised learning from dyadic data. Technical Report TR-98-042, International Computer Science Institute, Berkeley
Hofmann T (2004) Latent semantic models for collaborative filtering. ACM Trans Inf Syst 22:89–115
Marlin B, Zemel R (2004) The multiple multiplicative factor model for collaborative filtering. In: Proceedings of the 21st international conference on machine learning (ICML 2004), pp 576–583
Buntine W, Jakulin A (2006) Discrete components analysis. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J (eds) Subspace, latent structure and feature selection techniques. Springer, Berlin
Marlin B (2004) Modeling user rating profiles for collaborative filtering. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT, Cambridge
Kabán A, Bingham E (2006) ICA-based binary feature construction. In: Rosca JP, Erdogmus D, Príncipe JC, Haykin S (eds) Independent component analysis and blind signal separation. 6th international conference, ICA 2006, Proceedings. Lecture notes in computer science, vol 3889. Springer, Berlin, pp 140–148
Kabán A, Bingham E. Factorisation and denoising of 0–1 data: a variational approach. Neurocomputing, special issue on advances in blind signal processing (in press)
Everitt BS, Hand DJ (1981) Finite mixture distributions. Chapman & Hall, London
Gyllenberg M, Koski T, Reilink E, Verlaan M (1994) Non-uniqueness in probabilistic numerical identification of bacteria. J Appl Probab 31:542–548
Meilă M (1999) An accelerated Chow and Liu algorithm: fitting tree distributions to high-dimensional sparse data. In: ICML ’99: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann, San Francisco, pp 249–257
Schein A, Saul L, Ungar L (2003) A generalized linear model for principal component analysis of binary data. In: Bishop CM, Frey BJ (eds) Proceedings of the 9th international workshop on artificial intelligence and statistics
Tipping ME (1999) Probabilistic visualisation of high-dimensional binary data. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in neural information processing systems, vol 11, pp 592–598
Collins M, Dasgupta S, Schapire RE (2001) A generalization of principal component analysis to the exponential family. Advances in neural information processing systems, vol 14, pp 617–624
Hofmann T, Puzicha J (1999) Latent class models for collaborative filtering. In: Dean T (ed) Proceedings of the 16th international joint conference on artificial intelligence, IJCAI 99, Morgan Kaufmann, San Francisco, pp 688–693
Dayan P, Zemel RS (1995) Competition and multiple cause models. Neural Comput 7:565–579
Seppänen JK, Bingham E, Mannila H (2003) A simple algorithm for topic identification in 0–1 data. In: Lavrač N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD 2003. Lecture notes in artificial intelligence, vol 2832, Springer, New York, pp 423–434
Jaakkola TS (1997) Variational methods for inference and estimation in graphical models. Ph.D. thesis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
Barlow H, Kaushal T, Mitchison G (1989) Finding minimum entropy codes. Neural Comput 1:412–423
Földiák P (1990) Forming sparse representations by local anti-Hebbian learning. Biol Cybern 64:165–170
Schmidhuber J (1992) Learning factorial codes by predictability minimization. Neural Comput 4:863–879
Zemel RS (1993) A minimum description length framework for unsupervised learning. Ph.D. thesis, University of Toronto
Kemp C, Griffiths TL, Tenenbaum JB (2004) Discovering latent classes in relational data. Technical Report AI Memo 2004-019, Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining, Chap. 12. AAAI, New York, pp 307–328
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report TR 2001-05, Department of Computer Sciences, University of Texas, Austin
Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing: explorations in the microstructure of cognition, Foundations, vol 1. MIT, Cambridge, pp 194–281
Jolliffe IT (1986) Principal component analysis. Springer, Berlin
Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5:111–126
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791
Lee DD, Seung HS (2000) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems, vol 13, pp 556–562
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc B Stat Methodol 61:611–622
Dahyot R, Charbonnier P, Heitz F (2004) A Bayesian approach to object detection using probabilistic appearance-based models. Pattern Anal Appl 7:317–332
Srebro N, Jaakkola T (2003) Weighted low-rank approximations. In: Fawcett T, Mishra N (eds) Machine learning. Proceedings of the twentieth international conference (ICML 2003). AAAI, New York, pp 720–727
Gordon GJ (2003) Generalized² linear² models. In: Becker S, Thrun S, Obermayer K (eds) Advances in neural information processing systems, vol 15. MIT, Cambridge, pp 577–584
Haft M, Hofman R, Tresp V (2003) Generative binary codes. Pattern Anal Appl 6:269–284
Barry JC, Morgan M, Flynn L, Pilbeam D, Behrensmeyer A, Raza S, Khan I, Badgley C, Hicks J, Kelley J (2002) Faunal and environmental change in the late Miocene Siwaliks of Northern Pakistan. Palaeobiology (Supplement) 28:1–71
Puolamäki K, Fortelius M, Mannila H (2006) Seriation in paleontological data using Markov chain Monte Carlo methods. PLoS Comput Biol 2:e6
Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
Smyth P (2000) Model selection for probabilistic clustering using cross-validated likelihood. Stat Comput 10:63–72
Heckerman D, Chickering DM (1996) A comparison of scientific and engineering criteria for Bayesian model selection. Technical Report MSR-TR-96-12, Microsoft Research
Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, New York
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov B, Csaki F (eds) Second international symposium on information theory, pp 267–281
Kabán A (2007) Predictive modelling of heterogeneous sequence collections by topographic ordering of histories. Mach Learn 68:63–95
Peterson C, Söderberg B (1989) A new method for mapping optimization problems onto neural networks. Int J Neural Syst 1:3–22
Wu J-M, Chiu S-J (2001) Independent component analysis using Potts models. IEEE Trans Neural Netw 12:202–211
Jung T-P, Makeig S, Humphries C, Lee T-W, McKeown MJ, Iragui V, Sejnowski TJ (2000) Removing electroencephalographic artifacts by blind source separation. Psychophysiology 37:163–178
Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. PNAS 97:10101–10106
Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26:1154–1166
Hofmann T (1999) The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In: Dean T (ed) Proceedings of the 16th international joint conference on artificial intelligence, IJCAI 99. Morgan Kaufmann, San Francisco, pp 682–687
Barnard K, Duygulu P, Forsyth D (2001) Clustering art. IEEE Conf Comput Vis Pattern Recognit 2(II):434–441
Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei DM, Jordan MI (2003) Matching words and pictures. J Mach Learn Res 3:1107–1135
Blei DM, Griffiths TL, Jordan MI, Tenenbaum JB (2004) Hierarchical topic models and the nested Chinese restaurant process. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT, Cambridge
Wang X, Kabán A (2005) Finding uninformative features in binary data. In: Gallagher M, Hogan JM, Maire F (eds) Intelligent data engineering and automated learning—IDEAL 2005. Lecture notes in computer science, vol 3578. Springer, New York, pp 40–47
Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
Acknowledgments
The authors wish to thank Professor Heikki Mannila for insightful discussions regarding the model and the palaeontological data, and for suggesting Fig. 7. The authors would also like to thank the anonymous reviewers for their useful comments and suggestions.
Additional information
A part of the work of Ella Bingham was performed while visiting the School of Computer Science, University of Birmingham, UK.
Appendices
Appendix A
The following holds for any distributions q n (·):
We can recognise that the first term of F in (38) is the sum of conditional expectations of the joint likelihood of the data and latent variables z n , conditioned on s n . The second term is the sum of entropies of q n . Finally, the last term is the sum of Kullback–Leibler distances between the (so far arbitrary) distributions q n over z n and the true conditional posterior distributions of z n , conditioned on s n .
Since the KL divergence is always non-negative [63] (which can easily be shown using Jensen's inequality), F is always a lower bound to the log likelihood log p(x n |s n ,a), irrespective of the distributions q n (·). When q n (·) ≡ p(·|x n ,s n ,a), ∀ n = 1,…,N, the KL divergence becomes zero and the lower bound equals the log likelihood exactly.
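The display referenced as (38) does not reproduce here; based on the description above, the decomposition it denotes can be reconstructed (a sketch in the paper's notation) as

\[
\sum_n \log p(\mathbf{x}_n \mid \mathbf{s}_n, \mathbf{a})
\;=\; \underbrace{\sum_n \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n)\,
\log p(\mathbf{x}_n, \mathbf{z}_n \mid \mathbf{s}_n, \mathbf{a})
\;+\; \sum_n H(q_n)}_{\textstyle F}
\;+\; \sum_n \mathrm{KL}\big(q_n(\cdot) \,\big\|\, p(\cdot \mid \mathbf{x}_n, \mathbf{s}_n, \mathbf{a})\big),
\]

so that F equals the log likelihood minus a non-negative KL term, which is what makes it a lower bound.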
The iterative EM procedure is then: In the E-step, having some fixed estimates of s n , n = 1,…,N and a, we maximise \(F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1, \ldots,\mathbf{x}_N; \, q_1(\cdot),\ldots,q_N(\cdot))\) with respect to all q n (·). This is achieved by setting these to the true conditional posteriors p(·|x n ,s n , a), for all n = 1,…,N, which—from Bayes’ rule—are the following:
In (39), we used that p(x n |z n ,a) = p(x n |s n ,z n ,a), which follows from the dependency structure of AB, namely that x n depends on s n only through z n , therefore, knowing z n makes x n independent of s n .
It may be interesting to note that the above conditional posterior distribution p(z n |x n ,s n ,a) factorises naturally, without having imposed a factor form. Of course, as we know, this posterior is conditioned on the value of s n , so even though it is an exact conditional posterior quantity, there is no tractable exact posterior over the joint distribution of all hidden variables of the model, which would be p(s n ,z n |x n ,a) = p(s n |x n , a)p(z n |x n ,s n , a). The latter term is what we just computed, while the former term is intractable as discussed earlier in the main text, and so a point estimate for s n will be obtained as part of the M-step.
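The display denoted (39) does not reproduce here; given the factorisation just described and the notation used later in this appendix, it can be reconstructed (as a sketch) as

\[
q_{k,t,n,x_{tn}} \;=\; p(z_{tn}=k \mid x_{tn}, \mathbf{s}_n, \mathbf{a}_t)
\;=\; \frac{s_{kn}\, a_{tk}^{x_{tn}} (1-a_{tk})^{1-x_{tn}}}
{\sum_{k'} s_{k'n}\, a_{tk'}^{x_{tn}} (1-a_{tk'})^{1-x_{tn}}},
\]

one independent multinomial posterior per attribute t.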
In the M-step, we keep the posterior distributions q n (·) fixed at the values computed in the previous E-step, and compute the most probable value of s n for each x n as well as the parameters a. This is achieved by maximising \(F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1, \ldots,\mathbf{x}_N; \; q_1(\cdot),\ldots,q_N(\cdot))\) with respect to all s n and a.
To ensure the constraint ∑ k s kn = 1 is met, we add a Lagrangian term. Denoting
where the last equality follows from (41), and replacing the result obtained from the previous E-step into F, the expression to maximise, up to a constant term, is
where λ n are Lagrange multipliers, and the second term of F (the entropy term) was omitted for being a constant w.r.t. the variables of interest. Eq. (45) was obtained by expanding \(p(\mathbf{x}_n,\mathbf{z}_n| \mathbf{s}_n, \mathbf{a})=\prod_t \prod_k [s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}]^{z_{tnk}}\) and grouping together the terms with z tnk . To obtain (46), we used the result (42) and the notation (43), so that \(\sum_{\mathbf{z}_n}q_n(\mathbf{z}_n) z_{tnk}=\sum_{\mathbf{z}_n}\prod_t p(z_{tn}|x_{tn},\mathbf{s}_n,\mathbf{a}_t) z_{tnk}= p(z_{tn}=k|x_{tn},\mathbf{s}_n,\mathbf{a}_t)=q_{k,t,n,x_{tn}}.\)
The terms that depend on elements of s n are
Now, we solve the system of stationary equations w.r.t. s kn , which are the following.
Multiplying both sides by s kn , we obtain
from which we have that
The value of λ n is obtained by summing both sides, and using ∑ k s kn = 1. This gives \(\lambda_n= \sum_k \sum_t q_{k,t,n,x_{tn}} = T,\) since by its definition, \(\sum_k q_{k,t,n,x_{tn}} = 1.\)
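Combining the last two steps (whose equation displays do not reproduce here), the resulting M-step update for s kn can be reconstructed (as a sketch) as

\[
s_{kn} \;=\; \frac{1}{T} \sum_{t=1}^{T} q_{k,t,n,x_{tn}},
\]

i.e. the per-attribute responsibilities are simply averaged over the T attributes.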
To complete the M-step, we now maximise Q w.r.t. a. The terms that depend on elements of a up to constants, are the following.
The stationary equations are then the following.
The denominator is the variance of the Bernoulli and hence always non-negative; it can be simplified, and by isolating a tk we obtain the solution:
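The missing display can be reconstructed (as a sketch) from the stationary condition \(\sum_n q_{k,t,n,x_{tn}}(x_{tn}-a_{tk})/[a_{tk}(1-a_{tk})]=0\):

\[
a_{tk} \;=\; \frac{\sum_n q_{k,t,n,x_{tn}}\, x_{tn}}{\sum_n q_{k,t,n,x_{tn}}},
\]

a responsibility-weighted average of the observed bits, which necessarily lies in [0,1].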
Note that the constraint a tk ∈ [0,1] need not be explicitly imposed in this model setting, as it is automatically satisfied for binary data.
Appendix B
The derivation of the fixed point equations (13)–(15) as an alternating optimisation of ∑ n log p(x n | s n ,a) (or equivalently ∑ n log p(x n , s n |a)) is as follows.
Denote \({\overline{a_{tk}}}=1-a_{tk}.\) The log likelihood (11) is maximised, subject to the constraints ∑ k s kn = 1 and \(a_{tk}+{\overline{a_{tk}}}=1.\) The corresponding Lagrangian is thus the following:
where c tk and λ n are Lagrangian multipliers, and we have rewritten (1 − ∑ k a tk s kn ) as \(\sum_k {\overline{a_{tk}}}s_{kn}.\) The stationary equations of \({{\mathcal{L}}}\) with respect to both a tk and \({\overline{a_{tk}}}\) are
Multiplying the first of the above equations by a tk and the second by \({\overline{a_{tk}}}\) we obtain
Summing both sides and using \(a_{tk}+{\overline{a_{tk}}} = 1\) provides the Lagrangian multiplier c tk :
as in (15). From (57) we have the solution for a tk in the form of a fixed point equation:
as in (14). Solving for s kn proceeds similarly: the stationary equation is
Multiplying both sides by s kn we obtain
Summing over k and using ∑ k s kn = 1 we have the Lagrange multiplier λ n :
Having computed λ n , from (62) we obtain the fixed point equation for s kn , identical to (13):
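Writing \(p_{tn}=\sum_k a_{tk}s_{kn}\), the three fixed point equations derived above (whose displays do not reproduce here) can be reconstructed, as a sketch consistent with the surrounding derivation, as

\[
s_{kn} \;\leftarrow\; \frac{s_{kn}}{T}\sum_t \left[\frac{x_{tn}\,a_{tk}}{p_{tn}} + \frac{(1-x_{tn})(1-a_{tk})}{1-p_{tn}}\right],
\qquad
a_{tk} \;\leftarrow\; \frac{a_{tk}}{c_{tk}}\sum_n \frac{x_{tn}\,s_{kn}}{p_{tn}},
\]
\[
c_{tk} \;=\; a_{tk}\sum_n \frac{x_{tn}\,s_{kn}}{p_{tn}} \;+\; (1-a_{tk})\sum_n \frac{(1-x_{tn})\,s_{kn}}{1-p_{tn}}.
\]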
As discussed in the text, this derivation is simpler and yields the same multiplicative updates (13)–(15), obtained also via rewriting the EM algorithm (7)–(9). However, it is less apparent from this derivation that we need not iterate each multiplicative fixed point update to convergence separately before alternating these inner loops: we can simply alternate them and still obtain a convergent algorithm. This is instead a consequence of the EM interpretation presented in Appendix A. Indeed, recall that every single multiplicative update is a combination of a full E-step and an M-step update for one (group of) variables, hence it is guaranteed not to decrease the likelihood.
Cite this article
Bingham, E., Kabán, A. & Fortelius, M. The aspect Bernoulli model: multiple causes of presences and absences. Pattern Anal Applic 12, 55–78 (2009). https://doi.org/10.1007/s10044-007-0096-4