ABSTRACT
Multinomial distributions over words are frequently used to model topics in text collections. A common, major challenge in applying such topic models to any text mining problem is to label a multinomial topic model accurately so that a user can interpret the discovered topic. So far, such labels have been generated manually and subjectively. In this paper, we propose probabilistic approaches to labeling multinomial topic models automatically and objectively. We cast the labeling problem as an optimization problem that involves minimizing the Kullback-Leibler divergence between word distributions and maximizing the mutual information between a label and a topic model. Experiments with a user study were carried out on two text data sets of different genres. The results show that the proposed labeling methods are quite effective at generating labels that are meaningful and useful for interpreting the discovered topic models. Our methods are general and can be applied to labeling topics learned through all kinds of topic models, such as PLSA, LDA, and their variations.
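The abstract does not state the scoring formulas, so the following is only an illustrative sketch of the Kullback-Leibler side of the idea, not the authors' method. It assumes each candidate label comes with its own word distribution and ranks labels by increasing divergence from the topic. The names `kl_divergence` and `rank_labels`, the smoothing constant `eps`, and the toy distributions are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for word distributions given as dicts word -> probability.

    Words missing from q get probability eps (an assumed smoothing choice)
    so that the logarithm stays finite.
    """
    return sum(
        pw * math.log(pw / max(q.get(w, 0.0), eps))
        for w, pw in p.items()
        if pw > 0
    )


def rank_labels(topic, label_models):
    """Return candidate labels sorted by increasing KL divergence from the topic."""
    return sorted(
        label_models,
        key=lambda label: kl_divergence(topic, label_models[label]),
    )


# Toy data, invented for illustration only.
topic = {"retrieval": 0.5, "query": 0.3, "document": 0.2}
label_models = {
    "information retrieval": {"retrieval": 0.45, "query": 0.35, "document": 0.20},
    "machine learning": {"learning": 0.6, "model": 0.3, "document": 0.1},
}

ranked = rank_labels(topic, label_models)
print(ranked)
```

In this toy setting the label whose word distribution lies closest to the topic is ranked first. The mutual-information criterion mentioned in the abstract is not modeled here.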
Index Terms
- Automatic labeling of multinomial topic models