Automatic labeling of multinomial topic models

Authors:
Qiaozhu Mei (University of Illinois at Urbana-Champaign),
Xuehua Shen (University of Illinois at Urbana-Champaign),
ChengXiang Zhai (University of Illinois at Urbana-Champaign)

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 490–499. https://doi.org/10.1145/1281192.1281246

Published: 12 August 2007

ABSTRACT

Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying such topic models to any text mining problem is labeling a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually and subjectively. In this paper, we propose probabilistic approaches to labeling multinomial topic models automatically and objectively. We cast the labeling problem as an optimization problem that involves minimizing the Kullback-Leibler divergence between word distributions and maximizing the mutual information between a label and a topic model. Experiments with a user study were carried out on two text data sets of different genres. The results show that the proposed labeling methods are quite effective at generating labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models, such as PLSA, LDA, and their variations.
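
To make the Kullback-Leibler criterion mentioned in the abstract concrete, the following is a minimal sketch and not the paper's actual scoring function. It assumes that each candidate label comes with its own word distribution and ranks candidates by KL(topic || label), lowest first. The function names, the `eps` floor used in place of proper smoothing, the direction of the divergence, and the toy distributions are all illustrative assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two word distributions given as dicts.

    Words missing from q are floored at eps (an assumed stand-in for
    smoothing), so q is not re-normalized; this is only a sketch.
    """
    total = 0.0
    for word, p_w in p.items():
        if p_w > 0.0:
            q_w = max(q.get(word, 0.0), eps)
            total += p_w * math.log(p_w / q_w)
    return total


def rank_labels(topic, label_dists):
    """Sort candidate labels by increasing divergence from the topic."""
    return sorted(
        label_dists,
        key=lambda label: kl_divergence(topic, label_dists[label]),
    )


# Hypothetical topic and candidate labels, invented for illustration.
topic = {"clustering": 0.4, "algorithm": 0.3, "data": 0.2, "query": 0.1}
candidates = {
    "clustering algorithm": {"clustering": 0.5, "algorithm": 0.4, "data": 0.1},
    "query processing": {"query": 0.6, "processing": 0.3, "data": 0.1},
}

ranking = rank_labels(topic, candidates)
print(ranking[0])
```

On this toy input the candidate whose word distribution overlaps most with the topic's high-probability words is ranked first. A practical scorer would also need the mutual-information component described in the abstract, which this sketch omits.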

Index Terms

Automatic labeling of multinomial topic models
1. Information systems
  1. Information retrieval

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
multinomial distribution
statistical topic models
topic model labeling
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
Article Metrics
- Total Citations: 226
- Total Downloads: 2,890
- Downloads (Last 12 months): 115
- Downloads (Last 6 weeks): 20