Abstract
We present a probabilistic multiple cause model for the analysis of binary (0–1) data. A distinctive feature of the aspect Bernoulli (AB) model is its ability to automatically detect and distinguish between “true absences” and “false absences” (both of which are coded as 0 in the data), and similarly, between “true presences” and “false presences” (both of which are coded as 1). This is accomplished by specific additive noise components which explicitly account for such non-content bearing causes. The AB model is thus suitable for noise removal and data explanatory purposes, including omission/addition detection. An important application of AB that we demonstrate is data-driven reasoning about palaeontological recordings. Additionally, results on recovering corrupted handwritten digit images and expanding short text documents are also given, and comparisons to other methods are demonstrated and discussed.
Notes
Note that at each attribute t of observation n, the latent aspect k is sampled anew from the distribution p(1|n),…,p(K|n).
NOW: Neogene of the Old World, http://www.helsinki.fi/science/now/.
Another way would be to average over L n , assuming that L n ranges uniformly between the number of ones in the observation and some manually chosen upper limit; in Table 4 this would give inferior results for many choices of the upper limit.
This is indeed not the proportion of 0s turned to 1s, but instead includes new 1s superimposed at existing 1s, which have no effect.
At large K, several aspects may correspond to noise, but for simplicity we only select the one having the smallest value of ∑ t a tk .
This also follows from the first moment identity for the exponential family of distributions, of which the Bernoulli distribution is a member [50].
References
Harman HH (1967) Modern factor analysis, 2nd edn. University of Chicago Press, Chicago
Saund E (1995) A multiple cause mixture model for unsupervised learning. Neural Comput 7:51–71
Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42:177–196
Hyvärinen A, Karhunen J, Oja E (2001) Independent component analysis. Wiley Interscience, New York
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Kabán A, Bingham E, Hirsimäki T (2004) Learning to read between the lines: the aspect Bernoulli model. In: Proceedings of the 4th SIAM international conference on data mining, pp 462–466
Buntine W (2002) Variational extensions to EM and multinomial PCA. In: Machine learning: ECML 2002. Lecture notes in artificial intelligence (LNAI), vol 2430. Springer, New York, pp 23–34
McCallum A, Nigam K (1998) A comparison of event models for naive Bayes text classification. In: Sahami M (ed) Learning for text categorization. Papers from the AAAI Workshop, Technical report WS-98-05, AAAI, pp 41–48
Fortelius M, Werdelin L, Andrews P, Bernor RL, Gentry A, Humphrey L, Mittmann W, Viranta S (1996) Provinciality, diversity, turnover and paleoecology in land mammal faunas of the later Miocene of western Eurasia. In: Bernor R, Fahlbusch V, Mittmann W (eds) The evolution of Western Eurasian Neogene Mammal Faunas. Columbia University Press, Columbia, pp 414–448
Srebro N (2004) Learning with matrix factorizations. Ph.D. thesis, Massachusetts Institute of Technology
Hofmann T, Puzicha J (1998) Unsupervised learning from dyadic data. Technical Report TR-98-042, International Computer Science Institute, Berkeley
Hofmann T (2004) Latent semantic models for collaborative filtering. ACM Trans Inf Syst 22:89–115
Marlin B, Zemel R (2004) The multiple multiplicative factor model for collaborative filtering. In: Proceedings of the 21st international conference on machine learning (ICML 2004), pp 576–583
Buntine W, Jakulin A (2006) Discrete components analysis. In: Saunders C, Grobelnik M, Gunn S, Shawe-Taylor J (eds) Subspace, latent structure and feature selection techniques. Springer, Berlin
Marlin B (2004) Modeling user rating profiles for collaborative filtering. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT, Cambridge
Kabán A, Bingham E (2006) ICA-based binary feature construction. In: Rosca JP, Erdogmus D, Príncipe JC, Haykin S (eds) Independent component analysis and blind signal separation. 6th international conference, ICA 2006, Proceedings. Lecture notes in computer science, vol 3889. Springer, Berlin, pp 140–148
Kabán A, Bingham E. Factorisation and denoising of 0–1 data: a variational approach. Neurocomputing, special issue on advances in blind signal processing (in press)
Everitt BS, Hand DJ (1981) Finite mixture distributions. Chapman & Hall, London
Gyllenberg M, Koski T, Reilink E, Verlaan M (1994) Non-uniqueness in probabilistic numerical identification of bacteria. J Appl Probab 31:542–548
Meilă M (1999) An accelerated Chow and Liu algorithm: fitting tree distributions to high-dimensional sparse data. In: ICML ’99: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann, San Francisco, pp 249–257
Schein A, Saul L, Ungar L (2003) A generalized linear model for principal component analysis of binary data. In: Bishop CM, Frey BJ (eds) Proceedings of the 9th international workshop on artificial intelligence and statistics
Tipping ME (1999) Probabilistic visualisation of high-dimensional binary data. In: Kearns MS, Solla SA, Cohn DA (eds) Advances in neural information processing systems, vol 11, pp 592–598
Collins M, Dasgupta S, Schapire RE (2001) A generalization of principal component analysis to the exponential family. Advances in neural information processing systems, vol 14, pp 617–624
Hofmann T, Puzicha J (1999) Latent class models for collaborative filtering. In: Dean T (ed) Proceedings of the 16th international joint conference on artificial intelligence, IJCAI 99, Morgan Kaufmann, San Francisco, pp 688–693
Dayan P, Zemel RS (1995) Competition and multiple cause models. Neural Comput 7:565–579
Seppänen JK, Bingham E, Mannila H (2003) A simple algorithm for topic identification in 0–1 data. In: Lavrač N, Gamberger D, Todorovski L, Blockeel H (eds) Knowledge discovery in databases: PKDD 2003. Lecture notes in artificial intelligence, vol 2832, Springer, New York, pp 423–434
Jaakkola TS (1997) Variational methods for inference and estimation in graphical models. Ph.D. thesis, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
Barlow H, Kaushal T, Mitchison G (1989) Finding minimum entropy codes. Neural Comput 1:412–423
Földiák P (1990) Forming sparse representations by local anti-Hebbian learning. Biol Cybern 64:165–170
Schmidhuber J (1992) Learning factorial codes by predictability minimization. Neural Comput 4:863–879
Zemel RS (1993) A minimum description length framework for unsupervised learning. Ph.D. thesis, University of Toronto
Kemp C, Griffiths TL, Tenenbaum JB (2004) Discovering latent classes in relational data. Technical Report AI Memo 2004-019, Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining, Chap. 12. AAAI, New York, pp 307–328
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. Technical Report TR 2001-05, Department of Computer Sciences, University of Texas, Austin
Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory. In: Rumelhart DE, McClelland JL (eds) Parallel distributed processing: explorations in the microstructure of cognition, Foundations, vol 1. MIT, Cambridge, pp 194–281
Jolliffe IT (1986) Principal component analysis. Springer, Berlin
Paatero P, Tapper U (1994) Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5:111–126
Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791
Lee DD, Seung HS (2000) Algorithms for non-negative matrix factorization. In: Advances in neural information processing systems, vol 13, pp 556–562
Tipping ME, Bishop CM (1999) Probabilistic principal component analysis. J R Stat Soc B Stat Methodol 61:611–622
Dahyot R, Charbonnier P, Heitz F (2004) A Bayesian approach to object detection using probabilistic appearance-based models. Pattern Anal Appl 7:317–332
Srebro N, Jaakkola T (2003) Weighted low-rank approximations. In: Fawcett T, Mishra N (eds) Machine learning. Proceedings of the twentieth international conference (ICML 2003). AAAI, New York, pp 720–727
Gordon GJ (2003) Generalized² linear² models. In: Becker S, Thrun S, Obermayer K (eds) Advances in neural information processing systems, vol 15. MIT, Cambridge, pp 577–584
Haft M, Hofman R, Tresp V (2003) Generative binary codes. Pattern Anal Appl 6:269–284
Barry JC, Morgan M, Flynn L, Pilbeam D, Behrensmeyer A, Raza S, Khan I, Badgley C, Hicks J, Kelley J (2002) Faunal and environmental change in the late Miocene Siwaliks of Northern Pakistan. Palaeobiology (Supplement) 28:1–71
Puolamäki K, Fortelius M, Mannila H (2006) Seriation in paleontological data using Markov chain Monte Carlo methods. PLoS Comput Biol 2:e6
Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
Smyth P (2000) Model selection for probabilistic clustering using cross-validated likelihood. Stat Comput 10:63–72
Heckerman D, Chickering DM (1996) A comparison of scientific and engineering criteria for Bayesian model selection. Technical Report MSR-TR-96-12, Microsoft Research
Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, New York
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov B, Csaki F (eds) Second international symposium on information theory, pp 267–281
Kabán A (2007) Predictive modelling of heterogeneous sequence collections by topographic ordering of histories. Mach Learn 68:63–95
Peterson C, Söderberg B (1989) A new method for mapping optimization problems onto neural networks. Int J Neural Syst 1:3–22
Wu J-M, Chiu S-J (2001) Independent component analysis using Potts models. IEEE Trans Neural Netw 12:202–211
Jung T-P, Makeig S, Humphries C, Lee T-W, McKeown MJ, Iragui V, Sejnowski TJ (2000) Removing electroencephalographic artifacts by blind source separation. Psychophysiology 37:163–178
Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. PNAS 97:10101–10106
Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26:1154–1166
Hofmann T (1999) The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In: Dean T (ed) Proceedings of the 16th international joint conference on artificial intelligence, IJCAI 99. Morgan Kaufmann, San Francisco, pp 682–687
Barnard K, Duygulu P, Forsyth D (2001) Clustering art. IEEE Conf Comput Vis Pattern Recognit 2(II):434–441
Barnard K, Duygulu P, Forsyth D, de Freitas N, Blei DM, Jordan MI (2003) Matching words and pictures. J Mach Learn Res 3:1107–1135
Blei DM, Griffiths TL, Jordan MI, Tenenbaum JB (2004) Hierarchical topic models and the nested Chinese restaurant process. In: Thrun S, Saul L, Schölkopf B (eds) Advances in neural information processing systems, vol 16. MIT, Cambridge
Wang X, Kabán A (2005) Finding uninformative features in binary data. In: Gallagher M, Hogan JM, Maire F (eds) Intelligent data engineering and automated learning—IDEAL 2005. Lecture notes in computer science, vol 3578. Springer, New York, pp 40–47
Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
Acknowledgments
The authors wish to thank Professor Heikki Mannila for insightful discussions regarding the model and the palaeontological data, and for suggesting Fig. 7. The authors would also like to thank the anonymous reviewers for their useful comments and suggestions.
Additional information
A part of the work of Ella Bingham was performed while visiting the School of Computer Science, University of Birmingham, UK.
Appendices
Appendix A
The following holds for any distributions q n (·):
We can recognise that the first term of F in (38) is the sum of conditional expectations of the joint likelihood of the data and latent variables z n , conditioned on s n . The second term is the sum of entropies of q n . Finally, the last term is the sum of Kullback–Leibler distances between the (so far arbitrary) distributions q n over z n and the true conditional posterior distributions of z n , conditioned on s n .
Since the KL divergence is always non-negative [63] (which can easily be shown using Jensen's inequality), F is always a lower bound to the log likelihood log p(x n |s n ,a), irrespective of the distributions q n (·). When q n (·) ≡ p(·|x n ,s n ,a), ∀ n = 1,…,N, the KL divergence becomes zero and the lower bound equals the log likelihood exactly.
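The display referenced as (38) does not reproduce here; based on the description above, the decomposition it denotes can be reconstructed (a sketch in the paper's notation) as

\[
\sum_n \log p(\mathbf{x}_n \mid \mathbf{s}_n, \mathbf{a})
\;=\; \underbrace{\sum_n \sum_{\mathbf{z}_n} q_n(\mathbf{z}_n)\,
\log p(\mathbf{x}_n, \mathbf{z}_n \mid \mathbf{s}_n, \mathbf{a})
\;+\; \sum_n H(q_n)}_{\textstyle F}
\;+\; \sum_n \mathrm{KL}\big(q_n(\cdot) \,\big\|\, p(\cdot \mid \mathbf{x}_n, \mathbf{s}_n, \mathbf{a})\big),
\]

so that F equals the log likelihood minus a non-negative KL term, which is what makes it a lower bound.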
The iterative EM procedure is then: In the E-step, having some fixed estimates of s n , n = 1,…,N and a, we maximise \(F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1, \ldots,\mathbf{x}_N; \, q_1(\cdot),\ldots,q_N(\cdot))\) with respect to all q n (·). This is achieved by setting these to the true conditional posteriors p(·|x n ,s n , a), for all n = 1,…,N, which—from Bayes’ rule—are the following:
In (39), we used that p(x n |z n ,a) = p(x n |s n ,z n ,a), which follows from the dependency structure of AB, namely that x n depends on s n only through z n , therefore, knowing z n makes x n independent of s n .
It may be interesting to note that the above conditional posterior distribution p(z n |x n ,s n ,a) factorises naturally, without having imposed a factor form. Of course, as we know, this posterior is conditioned on the value of s n , so even though it is an exact conditional posterior quantity, there is no tractable exact posterior over the joint distribution of all hidden variables of the model, which would be p(s n ,z n |x n ,a) = p(s n |x n , a)p(z n |x n ,s n , a). The latter term is what we just computed, while the former term is intractable as discussed earlier in the main text, and so a point estimate for s n will be obtained as part of the M-step.
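The display denoted (39) does not reproduce here; given the factorisation just described and the notation used later in this appendix, it can be reconstructed (as a sketch) as

\[
q_{k,t,n,x_{tn}} \;=\; p(z_{tn}=k \mid x_{tn}, \mathbf{s}_n, \mathbf{a}_t)
\;=\; \frac{s_{kn}\, a_{tk}^{x_{tn}} (1-a_{tk})^{1-x_{tn}}}
{\sum_{k'} s_{k'n}\, a_{tk'}^{x_{tn}} (1-a_{tk'})^{1-x_{tn}}},
\]

one independent multinomial posterior per attribute t.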
In the M-step, we keep the posterior distributions q n (·) fixed at the values computed in the previous E-step, and compute the most probable value of s n for each x n as well as the parameters a. This is achieved by maximising \(F_{\mathbf{s}_1,\ldots,\mathbf{s}_N,\mathbf{a}}(\mathbf{x}_1, \ldots,\mathbf{x}_N; \; q_1(\cdot),\ldots,q_N(\cdot))\) with respect to all s n and a.
To ensure the constraint ∑ k s kn = 1 is met, we add a Lagrangian term. Denoting
where the last equality follows from (41), and replacing the result obtained from the previous E-step into F, the expression to maximise, up to a constant term, is
where λ n are Lagrange multipliers, and the second term of F (the entropy term) was omitted for being a constant w.r.t. the variables of interest. Eq. (45) was obtained by expanding \(p(\mathbf{x}_n,\mathbf{z}_n| \mathbf{s}_n, \mathbf{a})=\prod_t \prod_k [s_{kn}a_{tk}^{x_{tn}}(1-a_{tk})^{1-x_{tn}}]^{z_{tnk}}\) and grouping together the terms with z tnk . To obtain (46), we used the result (42) and the notation (43), so that \(\sum_{\mathbf{z}_n}q_n(\mathbf{z}_n) z_{tnk}=\sum_{\mathbf{z}_n}\prod_t p(z_{tn}|x_{tn},\mathbf{s}_n,\mathbf{a}_t) z_{tnk}= p(z_{tn}=k|x_{tn},\mathbf{s}_n,\mathbf{a}_t)=q_{k,t,n,x_{tn}}.\)
The terms that depend on elements of s n are
Now, we solve the system of stationary equations w.r.t. s kn , which are the following.
Multiplying both sides by s kn , we obtain
from which we have that
The value of λ n is obtained by summing both sides, and using ∑ k s kn = 1. This gives \(\lambda_n= \sum_k \sum_t q_{k,t,n,x_{tn}} = T,\) since by its definition, \(\sum_k q_{k,t,n,x_{tn}} = 1.\)
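Combining the last two steps (whose equation displays do not reproduce here), the resulting M-step update for s kn can be reconstructed (as a sketch) as

\[
s_{kn} \;=\; \frac{1}{T} \sum_{t=1}^{T} q_{k,t,n,x_{tn}},
\]

i.e. the per-attribute responsibilities are simply averaged over the T attributes.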
To complete the M-step, we now maximise Q w.r.t. a. The terms that depend on elements of a up to constants, are the following.
The stationary equations are then the following.
The denominator is the variance of the Bernoulli and hence always non-negative; it can be simplified, and by isolating a tk we obtain the solution:
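The missing display can be reconstructed (as a sketch) from the stationary condition \(\sum_n q_{k,t,n,x_{tn}}(x_{tn}-a_{tk})/[a_{tk}(1-a_{tk})]=0\):

\[
a_{tk} \;=\; \frac{\sum_n q_{k,t,n,x_{tn}}\, x_{tn}}{\sum_n q_{k,t,n,x_{tn}}},
\]

a responsibility-weighted average of the observed bits, which necessarily lies in [0,1].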
Note that the constraint a tk ∈ [0,1] need not be explicitly imposed in this model setting, as it is automatically satisfied for binary data.
Appendix B
The derivation of the fixed point equations (13)–(15) as an alternating optimisation of ∑ n log p(x n | s n ,a) (or equivalently ∑ n log p(x n , s n |a)) is as follows.
Denote \({\overline{a_{tk}}}=1-a_{tk}.\) The log likelihood (11) is maximised, subject to the constraints ∑ k s kn = 1 and \(a_{tk}+{\overline{a_{tk}}}=1.\) The corresponding Lagrangian is thus the following:
where c tk and λ n are Lagrangian multipliers, and we have rewritten (1 − ∑ k a tk s kn ) as \(\sum_k {\overline{a_{tk}}}s_{kn}.\) The stationary equations of \({{\mathcal{L}}}\) with respect to both a tk and \({\overline{a_{tk}}}\) are
Multiplying the first of the above equations by a tk and the second by \({\overline{a_{tk}}}\) we obtain
Summing both sides and using \(a_{tk}+{\overline{a_{tk}}} = 1\) provides the Lagrangian multiplier c tk :
as in (15). From (57) we have the solution for a tk in the form of a fixed point equation:
as in (14). Solving for s kn proceeds similarly: the stationary equation is
Multiplying both sides by s kn we obtain
Summing over k and using ∑ k s kn = 1 we have the Lagrange multiplier λ n :
Having computed λ n , from (62) we obtain the fixed point equation for s kn , identical to (13):
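Writing \(p_{tn}=\sum_k a_{tk}s_{kn}\), the three fixed point equations derived above (whose displays do not reproduce here) can be reconstructed, as a sketch consistent with the surrounding derivation, as

\[
s_{kn} \;\leftarrow\; \frac{s_{kn}}{T}\sum_t \left[\frac{x_{tn}\,a_{tk}}{p_{tn}} + \frac{(1-x_{tn})(1-a_{tk})}{1-p_{tn}}\right],
\qquad
a_{tk} \;\leftarrow\; \frac{a_{tk}}{c_{tk}}\sum_n \frac{x_{tn}\,s_{kn}}{p_{tn}},
\]
\[
c_{tk} \;=\; a_{tk}\sum_n \frac{x_{tn}\,s_{kn}}{p_{tn}} \;+\; (1-a_{tk})\sum_n \frac{(1-x_{tn})\,s_{kn}}{1-p_{tn}}.
\]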
As discussed in the text, this derivation is simpler and yields the same multiplicative updates (13)–(15), obtained also via rewriting the EM algorithm (7)–(9). However, it is less apparent from this derivation that we need not iterate each multiplicative fixed point update to convergence separately before alternating these inner loops: we can simply alternate them and still obtain a convergent algorithm. This is instead a consequence of the EM interpretation presented in Appendix A. Indeed, recall that every single multiplicative update is a combination of a full E-step and an M-step update for one (group of) variables, hence it is guaranteed not to decrease the likelihood.
Cite this article
Bingham, E., Kabán, A. & Fortelius, M. The aspect Bernoulli model: multiple causes of presences and absences. Pattern Anal Applic 12, 55–78 (2009). https://doi.org/10.1007/s10044-007-0096-4