Joint latent topic models for text and citations

Authors:
Ramesh M. Nallapati, Stanford University, Stanford, CA, USA
Amr Ahmed, Carnegie Mellon University, Pittsburgh, PA, USA
Eric P. Xing, Carnegie Mellon University, Pittsburgh, PA, USA
William W. Cohen, Carnegie Mellon University, Pittsburgh, PA, USA

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2008, Pages 542–550. https://doi.org/10.1145/1401890.1401957

Published: 24 August 2008

ABSTRACT

In this work, we address the problem of jointly modeling text and citations in the topic modeling framework. We present two models, Pairwise-Link-LDA and Link-PLSA-LDA.

The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Stochastic Block Models [1] and can model arbitrary link structure. However, it is computationally expensive, since it models the presence or absence of a citation (link) between every pair of documents. The second model solves this problem by assuming that the link structure is a bipartite graph. As its name indicates, the Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model.
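To make the cost argument concrete, the following is a minimal sketch of a generative process in the spirit of the first model: LDA-style word generation plus one block-model Bernoulli draw per ordered pair of documents. It is an illustration under stated assumptions, not the authors' specification. The function names, the symmetric Dirichlet priors, and the values in `eta` are all hypothetical.

```python
import random


def dirichlet(rng, alpha, k):
    """Draw a symmetric Dirichlet(alpha) sample of dimension k via normalized Gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]


def categorical(rng, probs):
    """Sample an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(n_docs=5, n_topics=3, vocab_size=20, doc_len=8, seed=0):
    rng = random.Random(seed)
    # Assumed parameters (hypothetical values chosen for illustration):
    # beta[k] is a word distribution for topic k; eta[i][j] is the probability
    # of a link when the source draws topic i and the target draws topic j.
    beta = [dirichlet(rng, 0.5, vocab_size) for _ in range(n_topics)]
    eta = [[0.8 if i == j else 0.05 for j in range(n_topics)]
           for i in range(n_topics)]
    # Per-document topic proportions, shared by the text and link parts.
    theta = [dirichlet(rng, 0.5, n_topics) for _ in range(n_docs)]

    # Text part (LDA-style): pick a topic per word, then a word from that topic.
    docs = [[categorical(rng, beta[categorical(rng, theta[d])])
             for _ in range(doc_len)]
            for d in range(n_docs)]

    # Link part (block-model-style): one Bernoulli variable per ordered pair,
    # so the number of link variables grows as n_docs * (n_docs - 1).
    links = {}
    for src in range(n_docs):
        for dst in range(n_docs):
            if src == dst:
                continue
            z_src = categorical(rng, theta[src])
            z_dst = categorical(rng, theta[dst])
            links[(src, dst)] = int(rng.random() < eta[z_src][z_dst])
    return docs, links


docs, links = generate_corpus()
print(len(docs), "documents,", len(links), "pairwise link variables")
```

With 5 documents the sketch already creates 20 link variables; the quadratic growth of that count is the expense that a bipartite restriction on the link structure avoids.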

Our experiments on a subset of the CiteSeer data show that both models predict unseen data better than the baseline model of Erosheva and Lafferty [8], by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets show that the Link-PLSA-LDA model performs best on the citation prediction task while remaining highly scalable. We also present some interesting visualizations generated by each of the models.
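As a rough illustration of how topical similarity can drive citation prediction, the sketch below ranks candidate cited documents by a topic-mixture score. It assumes each citing document has topic proportions `theta_citing` and each topic holds a distribution `omega[k]` over candidate cited documents. Both names, the toy numbers, and the scoring rule itself are assumptions made for this example; the abstract does not describe the paper's prediction procedure.

```python
def rank_citations(theta_citing, omega):
    """Rank candidate cited documents for one citing document.

    theta_citing: topic proportions of the citing document (length K).
    omega: K lists; omega[k][c] is the probability of cited document c under topic k.
    The score of candidate c is sum_k theta_citing[k] * omega[k][c].
    """
    n_cited = len(omega[0])
    scores = [sum(t * topic[c] for t, topic in zip(theta_citing, omega))
              for c in range(n_cited)]
    # Highest-scoring candidates first.
    ranking = sorted(range(n_cited), key=lambda c: scores[c], reverse=True)
    return ranking, scores


# Toy example: a citing document that is mostly about topic 0.
theta = [0.9, 0.1]
omega = [
    [0.7, 0.2, 0.1],  # topic 0 favors cited document 0
    [0.1, 0.1, 0.8],  # topic 1 favors cited document 2
]
ranking, scores = rank_citations(theta, omega)
print(ranking)
```

Because the citing document leans heavily toward topic 0, the candidate that topic 0 favors is ranked first, even though another candidate is very likely under the minority topic.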

References

[1] E. Airoldi, D. Blei, E. Xing, and S. Fienberg. Mixed membership stochastic block models for relational data, with applications to protein-protein interactions. In International Biometric Society-ENAR Annual Meetings, 2006.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, first edition, 2006.
[3] D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] D. M. Blei and J. D. Lafferty. Dynamic topic models. In International Conference on Machine Learning, pages 113–120, 2006.
[6] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, 2001.
[7] L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In International Conference on Machine Learning, pages 233–240, 2007.
[8] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101:5220–5227, 2004.
[9] T. Hofmann. Probabilistic Latent Semantic Analysis. In Uncertainty in Artificial Intelligence, 1999.
[10] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[11] W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In International Conference on Machine Learning, pages 577–584, 2006.
[12] D. Liben-Nowell and J. Kleinberg. The link prediction problem in social networks. In Conference on Information and Knowledge Management, 2003.
[13] A. McCallum and K. Nigam. A comparison of event models for Naïve Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[14] R. Nallapati and W. Cohen. Link-LDA-PLSA: a new unsupervised technique for topics and influence in blogs. In International Conference for Weblogs and Social Media, 2008.
[15] R. Nallapati, J. Lafferty, W. Cohen, K. Ung, and S. Ditmore. Multiscale topic tomography. In Conference on Knowledge Discovery and Data Mining, 2007.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Department of Computer Science, Stanford University, 1998.
[17] B. Shaparenko and T. Joachims. Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases. In Knowledge Discovery and Data Mining (KDD) Conference, 2007.
[18] T. Strohman, W. B. Croft, and D. Jensen. Recommending citations for academic papers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
[19] B. Taskar, Ming-Fai Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems, 2003.
[20] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Technical Report, UC Berkeley, Dept. of Statistics, 2003.

Index Terms

Joint latent topic models for text and citations
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN:9781605581934
DOI:10.1145/1401890
General Chair: Ying Li, Microsoft adCenter Labs
Program Chairs: Bing Liu, University of Illinois at Chicago; Sunita Sarawagi, Indian Institute of Technology, Bombay
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
LDA
PLSA
citations
hyperlinks
influence
topic models
variational inference
Qualifiers
- research-article
Acceptance Rates
KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%. Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%.
Article Metrics
- Total Citations: 279
- Total Downloads: 1,963
- Downloads (Last 12 months): 21
- Downloads (Last 6 weeks): 3