The aspect Bernoulli model: multiple causes of presences and absences

  • Theoretical Advances
  • Published in Pattern Analysis and Applications

Abstract

We present a probabilistic multiple cause model for the analysis of binary (0–1) data. A distinctive feature of the aspect Bernoulli (AB) model is its ability to automatically detect and distinguish between “true absences” and “false absences” (both of which are coded as 0 in the data), and similarly, between “true presences” and “false presences” (both of which are coded as 1). This is accomplished by specific additive noise components which explicitly account for such non-content-bearing causes. The AB model is thus suitable for noise removal and data explanatory purposes, including omission/addition detection. An important application of AB that we demonstrate is data-driven reasoning about palaeontological recordings. Additionally, results on recovering corrupted handwritten digit images and on expanding short text documents are given, and comparisons with other methods are presented and discussed.


Notes

  1. Note that at each attribute t of observation n, the latent aspect k is sampled anew from the distribution p(1|n),…,p(K|n).

  2. NOW: Neogene of the Old World, http://www.helsinki.fi/science/now/.

  3. http://www.ics.uci.edu/~mlearn/MLSummary.html.

  4. http://www.cs.cmu.edu/~textlearning/.

  5. http://www.cs.cmu.edu/~mccallum/bow/.

  6. Another way would be to average over \(L_n\), assuming that \(L_n\) ranges uniformly between the number of ones in the observation and some manually chosen upper limit; in Table 4 this would give inferior results for many choices of the upper limit.

  7. Note that this is not the proportion of 0s turned into 1s; it also includes new 1s superimposed on existing 1s, which have no effect.

  8. At large K, several aspects may correspond to noise, but for simplicity we only select the one having the smallest value of \(\sum_t a_{tk}\).

  9. This also follows from the first-moment identity for the exponential family of distributions, of which the Bernoulli distribution is a member [50].

References

  1. Harman HH (1967) Modern factor analysis, 2nd edn. University of Chicago Press, Chicago

  2. Saund E (1995) A multiple cause mixture model for unsupervised learning. Neural Comput 7:51–71

  3. Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42:177–196

  4. Hyvärinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley Interscience, New York

  5. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  6. Kabán A, Bingham E, Hirsimäki T (2004) Learning to read between the lines: the aspect Bernoulli model. In: Proceedings of the 4th SIAM international conference on data mining, pp 462–466

  7. Buntine W (2002) Variational extensions to EM and multinomial PCA. In: Machine learning: ECML 2002. Lecture notes in artificial intelligence (LNAI), vol 2430. Springer, New York, pp 23–34

  8. McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: Sahami M (ed) Learning for text categorization. Papers from the AAAI Workshop, Technical report WS-98-05, AAAI, pp 41–48

  9. Fortelius M, Werdelin L, Andrews P, Bernor RL, Gentry A, Humphrey L, Mittmann W, Viranta S (1996) Provinciality, diversity, turnover and paleoecology in land mammal faunas of the later Miocene of western Eurasia. In: Bernor R, Fahlbusch V, Mittmann W (eds) The evolution of Western Eurasian Neogene Mammal Faunas. Columbia University Press, Columbia, pp 414–448

  10. Srebro N (2004) Learning with matrix factorizations. Ph.D. thesis, Massachusetts Institute of Technology

  11. Hofmann T, Puzicha J (1998) Unsupervised learning from dyadic data. Technical Report TR-98-042, International Computer Science Institute, Berkeley

  12. Hofmann T (2004) Latent semantic models for collaborative filtering. ACM Trans Inf Syst 22:89–115

  13. Marlin B, Zemel R (2004) The multiple multiplicative factor model for collaborative filtering. In: Proceedings of the 21st international conference on machine learning (ICML 2004), pp 576–583

  14. Buntine W, Jakulin A (2006) Discrete components analysis. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J (eds) Subspace, latent structure and feature selection techniques. Springer, Berlin

  15. Marlin B (2004) Modeling user rating profiles for collaborative filtering. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT, Cambridge

  16. Kabán A, Bingham E (2006) ICA-based binary feature construction. In: Rosca JP, Erdogmus D, Príncipe JC, Haykin S (eds) Independent component analysis and blind signal separation. 6th international conference, ICA 2006, Proceedings. Lecture notes in computer science, vol 3889. Springer, Berlin, pp 140–148

  17. Kabán A, Bingham E. Factorisation and denoising of 0–1 data: a variational approach. Neurocomputing, special issue on advances in blind signal processing (in press)

  18. Everitt BS, Hand DJ (1981) Finite mixture distributions. Chapman & Hall, London

  19. Gyllenberg M, Koski T, Reilink E, Verlaan M (1994) Non-uniqueness in probabilistic numerical identification of bacteria. J Appl Probab 31:542–548

  20. Meilă M (1999) An accelerated Chow and Liu algorithm: fitting tree distributions to high-dimensional sparse data. In: ICML ’99: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann, San Francisco, pp 249–257

  21. Schein A, Saul L, Ungar L (2003) A generalized linear model for principal component analysis of binary data. In: Bishop CM, Frey BJ (eds) Proceedings of the 9th international workshop on artificial intelligence and statistics

  22. Tipping ME (1999) Probabilistic visualisation of high-dimensional binary data. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in neural information processing systems, vol 11, pp 592–598

  23. Collins M, Dasgupta S, Schapire RE (2001) A generalization of principal component analysis to the exponential family. Advances in neural information processing systems, vol 14, pp 617–624

  24. Hofmann T, Puzicha J (1999) Latent class models for collaborative filtering. In: Dean T (ed) Proceedings of the 16th international joint conference on artificial intelligence, IJCAI 99, Morgan Kaufmann, San Francisco, pp 688–693

  25. Dayan P, Zemel RS (1995) Competition and multiple cause models. Neural Comput 7:565–579

  26. Seppänen JK, Bingham E, Mannila H (2003) A simple algorithm for topic identification in 0–1 data. In: Lavrač N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD 2003. Lecture notes in artificial intelligence, vol 2832, Springer, New York, pp 423–434

  27. Jaakkola TS (1997) Variational methods for inference and estimation in graphical models. Ph.D. thesis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA

  28. Barlow H, Kaushal T, Mitchison G (1989) Finding minimum entropy codes. Neural Comput 1:412–423

  29. Földiák P (1990) Forming sparse representations by local anti-Hebbian learning. Biol Cybern 64:165–170

  30. Schmidhuber J (1992) Learning factorial codes by predictability minimization. Neural Comput 4:863–879

  31. Zemel RS (1993) A minimum description length framework for unsupervised learning. Ph.D. thesis, University of Toronto

  32. Kemp C, Griffiths TL, Tenenbaum JB (2004) Discovering latent classes in relational data. Technical Report AI Memo 2004-019, Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory

  33. Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining, Chap. 12. AAAI, New York, pp 307–328

  34. Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report TR 2001-05, Department of Computer Sciences, University of Texas, Austin

  35. Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing: explorations in the microstructure of cognition, Foundations, vol 1. MIT, Cambridge, pp 194–281

  36. Jolliffe IT (1986) Principal component analysis. Springer, Berlin

  37. Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5:111–126

  38. Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791

  39. Lee DD, Seung HS (2000) Algorithms for non-negative matrix factorization. Advances in neural information processing systems, vol 13, pp 556–562

  40. Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc B Stat Methodol 61:611–622

  41. Dahyot R, Charbonnier P, Heitz F (2004) A Bayesian approach to object detection using probabilistic appearance-based models. Pattern Anal Appl 7:317–332

  42. Srebro N, Jaakkola T (2003) Weighted low-rank approximations. In: Fawcett T, Mishra N (eds) Machine learning. Proceedings of the twentieth international conference (ICML 2003). AAAI, New York, pp 720–727

  43. Gordon GJ (2003) Generalized² linear² models. In: Becker S, Thrun S, Obermayer K (eds) Advances in neural information processing systems, vol 15. MIT, Cambridge, pp 577–584

  44. Haft M, Hofman R, Tresp V (2003) Generative binary codes. Pattern Anal Appl 6:269–284

  45. Barry JC, Morgan M, Flynn L, Pilbeam D, Behrensmeyer A, Raza S, Khan I, Badgley C, Hicks J, Kelley J (2002) Faunal and environmental change in the late Miocene Siwaliks of Northern Pakistan. Palaeobiology (Supplement) 28:1–71

  46. Puolamäki K, Fortelius M, Mannila H (2006) Seriation in paleontological data using Markov chain Monte Carlo methods. PLoS Comput Biol 2:e6

  47. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge

  48. Smyth P (2000) Model selection for probabilistic clustering using cross-validated likelihood. Stat Comput 10:63–72

  49. Heckerman D, Chickering DM (1996) A comparison of scientific and engineering criteria for Bayesian model selection. Technical Report MSR-TR-96-12, Microsoft Research

  50. Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, New York

  51. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov B, Csaki F (eds) Second international symposium on information theory, pp 267–281

  52. Kabán A (2007) Predictive modelling of heterogeneous sequence collections by topographic ordering of histories. Mach Learn 68:63–95

  53. Peterson C, Söderberg B (1989) A new method for mapping optimization problems onto neural networks. Int J Neural Syst 1:3–22

  54. Wu J-M, Chiu S-J (2001) Independent component analysis using Potts models. IEEE Trans Neural Netw 12:202–211

  55. Jung T-P, Makeig S, Humphries C, Lee T-W, McKeown MJ, Iragui V, Sejnowski TJ (2000) Removing electroencephalographic artifacts by blind source separation. Psychophysiology 37:163–178

  56. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. PNAS 97:10101–10106

  57. Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26:1154–1166

  58. Hofmann T (1999) The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In: Dean T (ed) Proceedings of the 16th international joint conference on artificial intelligence, IJCAI 99. Morgan Kaufmann, San Francisco, pp 682–687

  59. Barnard K, Duygulu P, Forsyth D (2001) Clustering art. IEEE Conf Comput Vis Pattern Recognit 2(II):434–441

  60. Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei DM, Jordan MI (2003) Matching words and pictures. J Mach Learn Res 3:1107–1135

  61. Blei DM, Griffiths TL, Jordan MI, Tenenbaum JB (2004) Hierarchical topic models and the nested Chinese restaurant process. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT, Cambridge

  62. Wang X, Kabán A (2005) Finding uninformative features in binary data. In: Gallagher M, Hogan JM, Maire F (eds) Intelligent data engineering and automated learning—IDEAL 2005. Lecture notes in computer science, vol 3578. Springer, New York, pp 40–47

  63. Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York

Acknowledgments

The authors wish to thank Professor Heikki Mannila for insightful discussions regarding the model and the palaeontological data, and for suggesting Fig. 7. The authors would also like to thank the anonymous reviewers for their useful comments and suggestions.

Author information

Correspondence to Ella Bingham.

Additional information

Part of the work of Ella Bingham was performed while visiting the School of Computer Science, University of Birmingham, UK.

Appendices

Appendix A

The following holds for any distributions \(q_n(\cdot)\):

$$ \begin{aligned} \,&\sum_n \log p(\mathbf{x}_n|\mathbf{s}_n,\mathbf{a}) \\ &\quad = \sum_n \sum_{\mathbf{z}_n}q_n(\mathbf{z}_n)\log p(\mathbf{x}_n|\mathbf{s}_n,\mathbf{a}) \end{aligned} $$
(36)
$$ =\sum_n \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n)\log \frac{p(\mathbf{x}_n,\mathbf{z}_n|\mathbf{s}_n,\mathbf{a})}{p(\mathbf{z}_n|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})}\,\frac{q_n(\mathbf{z}_n)}{q_n(\mathbf{z}_n)} $$
(37)
$$ \begin{aligned} &= \underbrace{\sum_n \left[ \sum_{\mathbf{z}_n}q_n(\mathbf{z}_n) \log p(\mathbf{x}_n,\mathbf{z}_n|\mathbf{s}_n,\mathbf{a}) -\sum_{\mathbf{z}_n}q_n(\mathbf{z}_n)\log q_n(\mathbf{z}_n)\right]}_{F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1,\ldots,\mathbf{x}_N;\; q_1(\cdot),\ldots,q_N(\cdot))} \\ &\quad + \sum_n \underbrace{\sum_{\mathbf{z}_n} q_n(\mathbf{z}_n)\log \frac{q_n(\mathbf{z}_n)}{p(\mathbf{z}_n|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})}}_{KL(q_n(\cdot)\,\|\,p(\cdot|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a}))} \end{aligned} $$
(38)

We can recognise that the first term of F in (38) is the sum of conditional expectations of the joint likelihood of the data and latent variables \(\mathbf{z}_n\), conditioned on \(\mathbf{s}_n\). The second term is the sum of entropies of \(q_n\). Finally, the last term is the sum of Kullback–Leibler divergences between the (so far arbitrary) distributions \(q_n\) over \(\mathbf{z}_n\) and the true conditional posterior distributions of \(\mathbf{z}_n\), conditioned on \(\mathbf{s}_n\).

Since the KL divergence is always non-negative [63] (which can easily be shown using Jensen’s inequality), F is always a lower bound on the log likelihood \(\sum_n \log p(\mathbf{x}_n|\mathbf{s}_n,\mathbf{a})\), irrespective of the distributions \(q_n(\cdot)\). When \(q_n(\cdot)\equiv p(\cdot|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})\) for all \(n = 1,\ldots,N\), the KL divergence becomes zero and the lower bound therefore equals the log likelihood exactly.
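For completeness, the Jensen-based argument can be stated in one line; writing \(p(\cdot)\) as shorthand for the posterior \(p(\cdot|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})\),

$$ KL(q_n(\cdot)\,\|\,p(\cdot)) = -\sum_{\mathbf{z}_n} q_n(\mathbf{z}_n)\log\frac{p(\mathbf{z}_n)}{q_n(\mathbf{z}_n)} \;\geq\; -\log\sum_{\mathbf{z}_n} q_n(\mathbf{z}_n)\,\frac{p(\mathbf{z}_n)}{q_n(\mathbf{z}_n)} = -\log 1 = 0, $$

with equality exactly when \(q_n(\cdot)\equiv p(\cdot)\), by the strict concavity of the logarithm.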

The iterative EM procedure is then as follows. In the E-step, having some fixed estimates of \(\mathbf{s}_n\), \(n = 1,\ldots,N\), and \(\mathbf{a}\), we maximise \(F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1,\ldots,\mathbf{x}_N;\, q_1(\cdot),\ldots,q_N(\cdot))\) with respect to all \(q_n(\cdot)\). This is achieved by setting these to the true conditional posteriors \(p(\cdot|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})\), for all \(n = 1,\ldots,N\), which, by Bayes’ rule, are the following:

$$ p(\mathbf{z}_n|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})= \frac{p(\mathbf{x}_n|\mathbf{z}_n,\mathbf{a})\,p(\mathbf{z}_n|\mathbf{s}_n)}{\sum_{\mathbf{z}_n} p(\mathbf{x}_n|\mathbf{z}_n,\mathbf{a})\,p(\mathbf{z}_n|\mathbf{s}_n)} $$
(39)
$$ = \frac{\prod_t\prod_k [a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}]^{z_{tnk}} \prod_t\prod_k [s_{kn}]^{z_{tnk}}}{\prod_t \sum_k s_{kn} a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}} $$
(40)
$$ =\prod_t \frac{\prod_k [s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}]^{z_{tnk}}}{\sum_k s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}} $$
(41)
$$ = \prod_t p(z_{tn}|x_{tn},\mathbf{s}_n,\mathbf{a}_t) $$
(42)

In (39), we used that \(p(\mathbf{x}_n|\mathbf{z}_n,\mathbf{a}) = p(\mathbf{x}_n|\mathbf{s}_n,\mathbf{z}_n,\mathbf{a})\), which follows from the dependency structure of AB: \(\mathbf{x}_n\) depends on \(\mathbf{s}_n\) only through \(\mathbf{z}_n\), so knowing \(\mathbf{z}_n\) makes \(\mathbf{x}_n\) independent of \(\mathbf{s}_n\).

It is interesting to note that the above conditional posterior distribution \(p(\mathbf{z}_n|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})\) factorises naturally, without a factorised form having been imposed. Of course, this posterior is conditioned on the value of \(\mathbf{s}_n\): even though it is an exact conditional posterior, there is no tractable exact posterior over the joint distribution of all hidden variables of the model, which would be \(p(\mathbf{s}_n,\mathbf{z}_n|\mathbf{x}_n,\mathbf{a}) = p(\mathbf{s}_n|\mathbf{x}_n,\mathbf{a})\,p(\mathbf{z}_n|\mathbf{x}_n,\mathbf{s}_n,\mathbf{a})\). The latter factor is what we have just computed, while the former is intractable, as discussed earlier in the main text; a point estimate of \(\mathbf{s}_n\) is therefore obtained as part of the M-step.

In the M-step, we keep the posterior distributions \(q_n(\cdot)\) fixed at the values computed in the previous E-step, and compute the most probable value of \(\mathbf{s}_n\) for each \(\mathbf{x}_n\) as well as the parameters \(\mathbf{a}\). This is achieved by maximising \(F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1,\ldots,\mathbf{x}_N;\; q_1(\cdot),\ldots,q_N(\cdot))\) with respect to all \(\mathbf{s}_n\) and \(\mathbf{a}\).

To ensure that the constraint \(\sum_k s_{kn} = 1\) is met, we add a Lagrangian term. Denoting

$$ q_{k,t,n,x_{tn}}\equiv p(z_{tn}=k|x_{tn},\mathbf{s}_n,\mathbf{a}_t)= \frac{s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}}{\sum_{\ell} s_{\ell n}a_{t\ell}^{x_{tn}}(1-a_{t\ell})^{1-x_{tn}}} $$
(43)

where the last equality follows from (41), and substituting the result obtained from the previous E-step into F, the expression to maximise, up to a constant term, is

$$ Q = \sum_n \left[ \sum_{\mathbf{z}_n}q_n(\mathbf{z}_n) \log p(\mathbf{x}_n,\mathbf{z}_n|\mathbf{s}_n,\mathbf{a}) - \lambda_n \left(\sum_k s_{kn}-1\right)\right] $$
(44)
$$ = \sum_n \left[\sum_t\sum_k\left[\sum_{\mathbf{z}_n}q_n(\mathbf{z}_n) z_{tnk}\right]\log [ s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}] -\lambda_n\left(\sum_k s_{kn}-1\right)\right] $$
(45)
$$ =\sum_n \left[\sum_t \sum_k q_{k,t,n,x_{tn}} \log [ s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}] -\lambda_n\left(\sum_k s_{kn}-1\right)\right] $$
(46)

where the \(\lambda_n\) are Lagrange multipliers, and the second term of F (the entropy term) was omitted for being constant w.r.t. the variables of interest. Eq. (45) was obtained by expanding \(p(\mathbf{x}_n,\mathbf{z}_n|\mathbf{s}_n,\mathbf{a})=\prod_t \prod_k [s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}]^{z_{tnk}}\) and grouping together the terms with \(z_{tnk}\). To obtain (46), we used the result (42) and the notation (43), so that \(\sum_{\mathbf{z}_n}q_n(\mathbf{z}_n)\, z_{tnk}=\sum_{\mathbf{z}_n}\prod_t p(z_{tn}|x_{tn},\mathbf{s}_n,\mathbf{a}_t)\, z_{tnk}= p(z_{tn}=k|x_{tn},\mathbf{s}_n,\mathbf{a}_t)=q_{k,t,n,x_{tn}}.\)

The terms that depend on elements of \(\mathbf{s}_n\) are

$$ Q_{\mathbf{s}_n}= \sum_k\sum_t q_{k,t,n,x_{tn}}\log s_{kn}-\lambda_n\left(\sum_k s_{kn}-1\right)+{\rm const.} $$
(47)

We now solve the system of stationary equations with respect to \(s_{kn}\), which are the following.

$$ \frac{\partial Q_{\mathbf{s}_n}}{\partial s_{kn}}= \sum_t q_{k,t,n,x_{tn}}/s_{kn} =\lambda_n $$
(48)

Multiplying both sides by \(s_{kn}\), we obtain

$$ \sum_t q_{k,t,n,x_{tn}} =\lambda_ns_{kn} $$
(49)

from which we have that

$$ s_{kn}= \sum_t q_{k,t,n,x_{tn}}/\lambda_n $$
(50)

The value of \(\lambda_n\) is obtained by summing both sides over k and using \(\sum_k s_{kn} = 1\). This gives \(\lambda_n= \sum_k \sum_t q_{k,t,n,x_{tn}} = T\), since by its definition \(\sum_k q_{k,t,n,x_{tn}} = 1\).
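To make the E-step quantity (43) and this update concrete, a minimal numpy sketch for a single observation might read as follows (the function and variable names are ours, purely for illustration; x_n is the length-T binary attribute vector, s_n the current length-K mixing weights, and a the T-by-K parameter matrix):

```python
import numpy as np

def responsibilities(x_n, s_n, a):
    """Eq. (43): q[k, t] = p(z_tn = k | x_tn, s_n, a_t) for one observation n."""
    lik = a ** x_n[:, None] * (1.0 - a) ** (1 - x_n[:, None])   # (T, K) Bernoulli terms
    num = lik * s_n[None, :]                                    # multiply by s_kn
    return (num / num.sum(axis=1, keepdims=True)).T             # normalise over k -> (K, T)

def update_s(q):
    """Eq. (50) with lambda_n = T: s_kn = sum_t q_ktn / T, an average over attributes."""
    return q.mean(axis=1)                                       # (K,), sums to one
```

Here responsibilities returns the K-by-T array of \(q_{k,t,n,x_{tn}}\) for a single n, and update_s averages it over attributes, which is exactly (50) with \(\lambda_n = T\).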

To complete the M-step, we now maximise Q with respect to \(\mathbf{a}\). The terms that depend on elements of \(\mathbf{a}\), up to constants, are the following.

$$ Q_{\mathbf{a}}=\sum_n\sum_k\sum_t q_{k,t,n,x_{tn}} [ x_{tn}\log a_{tk}+(1-x_{tn})\log (1-a_{tk})] $$
(51)

The stationary equations are then the following.

$$ \frac{\partial Q_{\mathbf{a}}}{\partial a_{tk}}= \sum_n q_{k,t,n,x_{tn}} \left(\frac{x_{tn}}{a_{tk}}-\frac{1-x_{tn}}{1-a_{tk}}\right) =\sum_n q_{k,t,n,x_{tn}}\frac{x_{tn}-a_{tk}}{a_{tk}(1-a_{tk})}=0 $$
(52)

The denominator is the variance of the Bernoulli distribution and is always non-negative; it can therefore be simplified away, and isolating \(a_{tk}\) yields the solution:

$$ a_{tk}=\frac{\sum_n x_{tn}q_{k,t,n,x_{tn}}}{\sum_n q_{k,t,n,x_{tn}}} $$
(53)

Note that the constraint \(a_{tk} \in [0,1]\) need not be explicitly imposed in this model setting, as it is automatically satisfied for binary data (see footnote 9).
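Combining the responsibilities (43) with the two M-step updates (50) and (53), one full EM iteration over the data matrix can be sketched in numpy as follows (again our own illustrative naming, not code from the paper; X denotes the T-by-N binary data matrix):

```python
import numpy as np

def em_step(X, s, a, eps=1e-12):
    """One EM iteration for the aspect Bernoulli model (sketch).

    X : (T, N) binary data, s : (K, N) aspect activations (columns sum to one),
    a : (T, K) Bernoulli parameters. Returns updated (s, a).
    """
    T, N = X.shape
    # E-step, Eq. (43): q[t, k, n] = p(z_tn = k | x_tn, s_n, a_t)
    lik = a[:, :, None] ** X[:, None, :] * (1.0 - a[:, :, None]) ** (1 - X[:, None, :])
    num = lik * s[None, :, :]
    q = num / (num.sum(axis=1, keepdims=True) + eps)
    # M-step for s, Eq. (50): s_kn = sum_t q_ktn / T
    s_new = q.sum(axis=0) / T
    # M-step for a, Eq. (53): a_tk = sum_n x_tn q_ktn / sum_n q_ktn
    a_new = (q * X[:, None, :]).sum(axis=2) / (q.sum(axis=2) + eps)
    return s_new, a_new
```

Iterating em_step until the log likelihood stops improving reproduces the EM scheme derived in this appendix.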

Appendix B

The derivation of the fixed-point equations (13)–(15) as an alternating optimisation of \(\sum_n \log p(\mathbf{x}_n|\mathbf{s}_n,\mathbf{a})\) (or, equivalently, of \(\sum_n \log p(\mathbf{x}_n,\mathbf{s}_n|\mathbf{a})\)) is as follows.

Denote \(\overline{a_{tk}}=1-a_{tk}\). The log likelihood (11) is maximised subject to the constraints \(\sum_k s_{kn} = 1\) and \(a_{tk}+\overline{a_{tk}}=1\). The corresponding Lagrangian is thus the following:

$$ \begin{aligned} {{\mathcal{L}}} &= \sum_n \sum_t \left[ x_{tn} \log \sum_k a_{tk}s_{kn} + (1-x_{tn})\log \sum_k {\overline{a_{tk}}}s_{kn}\right. \\ &\left. \quad - c_{tk}(a_{tk}+{\overline{a_{tk}}}-1) - \lambda_n \left(\sum_k s_{kn}-1\right)\right] \end{aligned} $$
(54)

where \(c_{tk}\) and \(\lambda_n\) are Lagrange multipliers, and we have rewritten \(1 - \sum_k a_{tk}s_{kn}\) as \(\sum_k \overline{a_{tk}}s_{kn}\). The stationary equations of \(\mathcal{L}\) with respect to both \(a_{tk}\) and \(\overline{a_{tk}}\) are

$$ \frac{\partial {{\mathcal{L}}}}{\partial a_{tk}} = \sum_n \frac{x_{tn}}{\sum_\ell a_{t\ell}s_{\ell n}} s_{kn} - c_{tk} = 0 $$
(55)
$$ \frac{\partial {{\mathcal{L}}}}{\partial {\overline{a_{tk}}}} = \sum_n \frac{1-x_{tn}}{\sum_\ell \overline{a_{t\ell}}s_{\ell n}} s_{kn} - c_{tk} =0 $$
(56)

Multiplying the first of the above equations by \(a_{tk}\) and the second by \(\overline{a_{tk}}\), we obtain

$$ a_{tk} \sum_n \frac{x_{tn}}{\sum_\ell a_{t\ell}s_{\ell n}}s_{kn} - c_{tk}a_{tk} = 0 $$
(57)
$$ {\overline{a_{tk}}} \sum_n \frac{1-x_{tn}}{\sum_\ell {\overline{a_{t\ell}}}s_{\ell n}}s_{kn} - c_{tk}{\overline{a_{tk}}} = 0 $$
(58)

Adding the two equations and using \(a_{tk}+\overline{a_{tk}} = 1\) provides the Lagrange multiplier \(c_{tk}\):

$$ c_{tk} = a_{tk} \sum_n \frac{x_{tn}}{\sum_\ell a_{t\ell}s_{\ell n}} s_{kn} + {\overline{a_{tk}}} \sum_n \frac{1-x_{tn}}{\sum_\ell \overline{a_{t\ell}}s_{\ell n}} s_{kn} $$
(59)

as in (15). From (57) we obtain the solution for \(a_{tk}\) in the form of a fixed-point equation:

$$ a_{tk}= a_{tk}\sum_n\frac{x_{tn}}{\sum_\ell a_{t \ell}s_{\ell n}}s_{kn}/c_{tk} $$
(60)

as in (14). Solving for \(s_{kn}\) proceeds similarly: the stationary equation is

$$ \frac{\partial {{\mathcal{L}}}}{\partial s_{kn}} = \sum_t \left( \frac{x_{tn}}{\sum_\ell a_{t \ell}s_{\ell n}} a_{tk} + \frac{1-x_{tn}}{\sum_\ell \overline{a_{t \ell}}s_{\ell n}} {\overline{a_{tk}}} \right) - \lambda_n =0 $$
(61)

Multiplying both sides by \(s_{kn}\), we obtain

$$ \sum_t \left(\frac{x_{tn}}{\sum_\ell a_{t \ell}s_{\ell n}} a_{tk}s_{kn} + \frac{1-x_{tn}}{\sum_\ell \overline{a_{t \ell}}s_{\ell n}} {\overline{a_{tk}}}s_{kn} \right) = \lambda_n s_{kn} $$
(62)

Summing over k and using \(\sum_k s_{kn} = 1\), we obtain the Lagrange multiplier \(\lambda_n\):

$$ \lambda_n = \sum_t \frac{x_{tn}}{\sum_\ell a_{t \ell}s_{\ell n}} \sum_k a_{tk}s_{kn} + \sum_t \frac{1-x_{tn}}{\sum_\ell \overline{a_{t \ell}}s_{\ell n}} \sum_k {\overline{a_{tk}}}s_{kn} $$
(63)
$$ = \sum_t x_{tn} + \sum_t (1-x_{tn})=T $$
(64)

Having computed \(\lambda_n\), from (62) we obtain the fixed-point equation for \(s_{kn}\), identical to (13):

$$ s_{kn}= s_{kn} \left\{\sum_t \frac{x_{tn}}{\sum_\ell a_{t \ell}s_{\ell n}}a_{tk} + \frac{1-x_{tn}}{\sum_\ell \overline{a_{t \ell}}s_{\ell n}} {\overline{a_{tk}}}\right\}/T $$
(65)

As discussed in the text, this derivation is simpler and yields the same multiplicative updates (13)–(15) that were also obtained by rewriting the EM algorithm (7)–(9). However, the fact that we need not iterate each multiplicative fixed-point update to convergence separately before alternating these inner loops, but can simply alternate them and still obtain a convergent algorithm, is less apparent from the derivation given in this section. Rather, it is a consequence of the EM interpretation presented in Appendix A: each single multiplicative update combines a full E-step with an M-step update for one (group of) variables, and is therefore guaranteed not to decrease the likelihood.
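For concreteness, one alternating pass over the multiplicative updates (13)–(15) could be sketched in numpy as follows (our own illustrative code and naming, under the same conventions as the sketches in Appendix A):

```python
import numpy as np

def multiplicative_step(X, s, a, eps=1e-12):
    """One alternating pass of the multiplicative updates, Eqs. (13)-(15) (sketch).

    X : (T, N) binary data, s : (K, N) mixing weights (columns sum to one),
    a : (T, K) Bernoulli parameters.
    """
    T, N = X.shape

    # update for s, Eq. (13)
    P = a @ s                                   # sum_k a_tk s_kn, shape (T, N)
    R1 = X / (P + eps)                          # x_tn / sum_l a_tl s_ln
    R0 = (1 - X) / (1 - P + eps)                # (1 - x_tn) / sum_l (1 - a_tl) s_ln
    s = s * (a.T @ R1 + (1 - a).T @ R0) / T     # columns still sum to one

    # update for a, Eqs. (14)-(15), with the ratios recomputed at the new s
    P = a @ s
    R1 = X / (P + eps)
    R0 = (1 - X) / (1 - P + eps)
    num1 = a * (R1 @ s.T)                       # a_tk * sum_n [x_tn / ...] s_kn
    num0 = (1 - a) * (R0 @ s.T)
    a = num1 / (num1 + num0 + eps)              # division by c_tk, Eq. (15)
    return s, a
```

Calling multiplicative_step repeatedly, and monitoring the log likelihood (11) for convergence, reproduces the alternating scheme discussed above.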

Cite this article

Bingham, E., Kabán, A. & Fortelius, M. The aspect Bernoulli model: multiple causes of presences and absences. Pattern Anal Applic 12, 55–78 (2009). https://doi.org/10.1007/s10044-007-0096-4
