DOI: 10.1145/1401890.1401957
research-article

Joint latent topic models for text and citations

Published: 24 August 2008

ABSTRACT

In this work, we address the problem of jointly modeling text and citations in the topic modeling framework. We present two models: Pairwise-Link-LDA and Link-PLSA-LDA.

The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Stochastic Block Models [1] and allows modeling arbitrary link structure. However, the model is computationally expensive, since it models the presence or absence of a citation (link) between every pair of documents. The second model avoids this cost by assuming that the link structure is a bipartite graph. As the name indicates, the Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model.
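
To make the cost argument concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a stochastic-block-style link process in which each ordered pair of documents draws one topic per endpoint and a link is then sampled from a topic-by-topic probability table. The names `theta`, `eta`, `sample_pairwise_links`, and `num_pairwise_link_variables` are hypothetical and chosen only for illustration.

```python
import random


def num_pairwise_link_variables(num_docs):
    """Count ordered document pairs (d, d2) with d != d2.

    A model that explains the presence or absence of a link for every
    pair needs one binary indicator per pair, so the count is quadratic.
    """
    return num_docs * (num_docs - 1)


def sample_pairwise_links(theta, eta, rng):
    """Sample a 0/1 link for every ordered pair of documents.

    Assumptions (illustrative only, not taken from the paper):
      theta[d]  -- topic distribution of document d
      eta[k][l] -- probability of a link when the source endpoint
                   draws topic k and the target endpoint draws topic l
    """
    num_docs = len(theta)
    topics = list(range(len(eta)))
    links = {}
    for d in range(num_docs):
        for d2 in range(num_docs):
            if d == d2:
                continue
            # Each endpoint draws its own topic for this particular pair.
            z_src = rng.choices(topics, weights=theta[d])[0]
            z_dst = rng.choices(topics, weights=theta[d2])[0]
            links[(d, d2)] = 1 if rng.random() < eta[z_src][z_dst] else 0
    return links


if __name__ == "__main__":
    # Three toy documents over two topics, with links only within a topic.
    theta = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    eta = [[1.0, 0.0], [0.0, 1.0]]
    links = sample_pairwise_links(theta, eta, random.Random(0))
    print(len(links), "link indicators for", len(theta), "documents")
    print(num_pairwise_link_variables(1000), "indicators for 1000 documents")
```

Under these assumptions the number of link indicators grows as N(N-1) for N documents, whereas a model restricted to a bipartite graph of citing and cited documents only has to account for the citations that actually occur.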

Our experiments on a subset of Citeseer data show that both models predict unseen data better than the baseline model of Erosheva, Fienberg, and Lafferty [8], by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets show that the Link-PLSA-LDA model performs best on the citation prediction task while remaining highly scalable. We also present some interesting visualizations generated by each of the models.

References

  1. E. Airoldi, D. Blei, E. Xing, and S. Fienberg. Mixed membership stochastic block models for relational data, with applications to protein-protein interactions. In International Biometric Society-ENAR Annual Meetings, 2006.
  2. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, first edition, 2006.
  3. D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.
  4. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993--1022, 2003.
  5. D. M. Blei and J. D. Lafferty. Dynamic topic models. In International Conference on Machine Learning, pages 113--120, 2006.
  6. D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, 2001.
  7. L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In International Conference on Machine Learning, pages 233--240, 2007.
  8. E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101:5220--5227, 2004.
  9. T. Hofmann. Probabilistic Latent Semantic Analysis. In Uncertainty in Artificial Intelligence, 1999.
  10. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.
  11. W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In International Conference on Machine Learning, pages 577--584, 2006.
  12. D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Conference on Information and Knowledge Management, 2003.
  13. A. McCallum and K. Nigam. A comparison of event models for Naïve Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
  14. R. Nallapati and W. Cohen. Link-LDA-PLSA: a new unsupervised technique for topics and influence in blogs. In International Conference for Weblogs and Social Media, 2008.
  15. R. Nallapati, J. Lafferty, W. Cohen, K. Ung, and S. Ditmore. Multiscale topic tomography. In Conference on Knowledge Discovery and Data Mining, 2007.
  16. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Department of Computer Science, Stanford University, 1998.
  17. B. Shaparenko and T. Joachims. Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases. In Knowledge Discovery and Data Mining (KDD) Conference, 2007.
  18. T. Strohman, W. B. Croft, and D. Jensen. Recommending citations for academic papers. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2007.
  19. B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems, 2003.
  20. M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Technical report, Dept. of Statistics, UC Berkeley, 2003.

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
