ABSTRACT
In this work, we address the problem of jointly modeling text and citations in the topic modeling framework. We present two models: Pairwise-Link-LDA and Link-PLSA-LDA.
The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Stochastic Block Models [1], and it can model arbitrary link structure. However, the model is computationally expensive, since it models the presence or absence of a citation (link) between every pair of documents. The second model addresses this cost by assuming that the link structure is a bipartite graph. As the name indicates, the Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model.
Our experiments on a subset of Citeseer data show that both models predict unseen data better than the baseline model of Erosheva and Lafferty [8], because they capture the topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets show that the Link-PLSA-LDA model performs best on the citation (link) prediction task while remaining highly scalable. We also present visualizations generated by each of the models.
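The scalability contrast drawn in the abstract can be made concrete with a small counting sketch. It is only an illustration, not the authors' implementation: the function names are hypothetical, links are assumed to be directed, and the bipartite case is assumed to need one variable per observed citation rather than one per document pair.

```python
from typing import Iterable, Tuple


def pairwise_link_variables(num_docs: int) -> int:
    """Count ordered document pairs (d, d') with d != d'.

    Assumption: a pairwise model keeps one presence/absence
    variable for every such pair, so the count grows quadratically.
    """
    return num_docs * (num_docs - 1)


def bipartite_link_variables(citations: Iterable[Tuple[int, int]]) -> int:
    """Count distinct observed (citing, cited) edges.

    Assumption: a bipartite model only needs variables for
    citations that actually occur in the corpus.
    """
    return len(set(citations))


# Toy corpus (made-up data): documents 0 and 1 cite documents 2 and 3.
toy_citations = [(0, 2), (1, 2), (1, 3)]
num_pairwise = pairwise_link_variables(4)
num_bipartite = bipartite_link_variables(toy_citations)
```

On this toy corpus the pairwise count covers every ordered pair of the four documents, while the bipartite count covers only the three observed citations. The gap widens quickly as the corpus grows, because the first quantity is quadratic in the number of documents.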
References
- E. Airoldi, D. Blei, E. Xing, and S. Fienberg. Mixed membership stochastic block models for relational data, with applications to protein-protein interactions. In International Biometric Society-ENAR Annual Meetings, 2006.
- C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, first edition, 2006.
- D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993--1022, 2003.
- D. M. Blei and J. D. Lafferty. Dynamic topic models. In International Conference on Machine Learning, pages 113--120, 2006.
- D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, 2001.
- L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In International Conference on Machine Learning, pages 233--240, 2007.
- E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101:5220--5227, 2004.
- T. Hofmann. Probabilistic Latent Semantic Analysis. In Uncertainty in Artificial Intelligence, 1999.
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.
- W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In International Conference on Machine Learning, pages 577--584, 2006.
- D. Liben-Nowell and J. Kleinberg. The link prediction problem in social networks. In Conference on Information and Knowledge Management, 2003.
- A. McCallum and K. Nigam. A comparison of event models for Naïve Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
- R. Nallapati and W. Cohen. Link-LDA-PLSA: a new unsupervised technique for topics and influence in blogs. In International Conference for Weblogs and Social Media, 2008.
- R. Nallapati, J. Lafferty, W. Cohen, K. Ung, and S. Ditmore. Multiscale topic tomography. In Conference on Knowledge Discovery and Data Mining, 2007.
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Department of Computer Science, Stanford University, 1998.
- B. Shaparenko and T. Joachims. Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases. In Knowledge Discovery and Data Mining (KDD) Conference, 2007.
- T. Strohman, W. B. Croft, and D. Jensen. Recommending citations for academic papers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
- B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems, 2003.
- M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Technical report, Dept. of Statistics, UC Berkeley, 2003.
Index Terms
- Joint latent topic models for text and citations