Discover Computing

Published: 1 April 2011

Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA

Authors: Yue Lu, Qiaozhu Mei, ChengXiang Zhai

Published in: Discover Computing | Issue 2/2011

Abstract

Probabilistic topic models have recently attracted much attention because of their successful applications in many text mining tasks such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance of these topic models, none has systematically investigated their task performance; as a result, some critical questions that may affect the performance of all applications of topic models remain mostly unanswered, particularly how to choose between competing models, how multiple local maxima affect task performance, and how to set parameters in topic models. In this paper, we address these questions by conducting a systematic investigation of two representative probabilistic topic models, Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks: document clustering, text categorization, and ad-hoc retrieval. The analysis of our experimental results provides a deeper understanding of topic models and many useful insights into how to optimize their performance for these typical tasks. The task-based evaluation framework is generalizable to other topic models in the family of either PLSA or LDA.

http://projects.ldc.upenn.edu/TDT2/.

http://www.daviddlewis.com/resources/testcollections/reuters21578/.

http://svmlight.joachims.org/svm_multiclass.html.

For example, estimating PLSA on the whole LA collection would take about 40 hours for a single run of 50 iterations on a Linux server with four 2.2 GHz AMD Opteron 848 processors and 32 GB of memory.


Title: Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA
Authors: Yue Lu, Qiaozhu Mei, ChengXiang Zhai
Publication date: 1 April 2011
Publisher: Springer Netherlands
Published in: Discover Computing, Issue 2/2011
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-010-9141-9
