ABSTRACT
Network data is ubiquitous, encoding collections of relationships between entities such as people, places, genes, or corporations. While many resources for networks of interesting entities are emerging, most of these can only annotate connections in a limited fashion. Although relationships between entities are rich, it is impractical to manually devise complete characterizations of these relationships for every pair of entities in large, real-world corpora.
In this paper we present a novel probabilistic topic model that analyzes text corpora and infers descriptions of their entities and of the relationships between those entities. We develop variational methods for performing approximate inference on our model and demonstrate that our model can be practically deployed on large corpora such as Wikipedia. We show qualitatively and quantitatively that our model can construct and annotate graphs of relationships and make useful predictions.
Index Terms
- Connections between the lines: augmenting social networks with text