DOI: 10.1145/1281192.1281246
Article

Automatic labeling of multinomial topic models

Published: 12 August 2007

ABSTRACT

Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying such topic models to any text mining problem is labeling a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually and subjectively. In this paper, we propose probabilistic approaches to labeling multinomial topic models automatically and objectively. We cast the labeling problem as an optimization problem that involves minimizing the Kullback-Leibler divergence between word distributions and maximizing the mutual information between a label and a topic model. Experiments with a user study were conducted on two text data sets of different genres. The results show that the proposed labeling methods are quite effective at generating labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models, such as PLSA, LDA, and their variations.
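
The abstract names two quantities, Kullback-Leibler divergence between word distributions and mutual information between a label and a topic, but gives no formulas. The sketch below is only one plausible illustration of how such quantities could rank candidate labels: it scores each label by the expected pointwise mutual information between the label and the topic's words, estimated from document-level co-occurrence. The function names (`kl_divergence`, `label_scores`), the toy corpus, the rule that a label "occurs" in a document when all of its words do, and the choice to skip zero co-occurrence counts are all assumptions made for illustration, not details taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for word distributions given as dicts word -> probability.
    Words missing from q are floored at eps (an illustrative smoothing choice)."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def label_scores(topic, candidates, docs):
    """Score each candidate label by sum_w p(w | topic) * PMI(w, label),
    with probabilities estimated from document-level co-occurrence."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    scores = {}
    for label in candidates:
        label_terms = set(label.split())
        # Assumption: a label occurs in a document if all of its words do.
        has_label = [label_terms <= d for d in doc_sets]
        p_l = sum(has_label) / n
        score = 0.0
        for w, p_w_topic in topic.items():
            p_w = sum(w in d for d in doc_sets) / n
            p_wl = sum(h and (w in d) for h, d in zip(has_label, doc_sets)) / n
            if p_wl > 0:  # skip pairs that never co-occur (assumption)
                score += p_w_topic * math.log(p_wl / (p_w * p_l))
        scores[label] = score
    return scores

# Hypothetical toy corpus and topic, for illustration only.
docs = [
    ["topic", "model", "latent", "dirichlet"],
    ["topic", "model", "text", "mining"],
    ["support", "vector", "machine", "kernel"],
    ["kernel", "machine", "learning"],
]
topic = {"topic": 0.4, "model": 0.4, "text": 0.2}
candidates = ["topic model", "kernel machine"]

scores = label_scores(topic, candidates, docs)
best = max(scores, key=scores.get)
print(best, scores)

# KL divergence of the topic from the collection's unigram distribution.
tokens = [w for d in docs for w in d]
background = {w: tokens.count(w) / len(tokens) for w in set(tokens)}
print(kl_divergence(topic, background))
```

On this toy data the label whose words co-occur with the topic's high-probability words receives the higher score. A real system would need proper smoothing and a principled way to generate candidate labels, neither of which is specified in the abstract.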


    • Published in

      KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2007
      1080 pages
      ISBN: 9781595936097
      DOI: 10.1145/1281192

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      KDD '07 paper acceptance rate: 111 of 573 submissions, 19%
      Overall acceptance rate: 1,133 of 8,635 submissions, 13%
