Published in: Discover Computing 2/2011

1 April 2011

Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA

Authors: Yue Lu, Qiaozhu Mei, ChengXiang Zhai

Abstract

Probabilistic topic models have recently attracted much attention because of their successful application to many text mining tasks, such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance for these models, none has systematically investigated their task performance. As a result, some critical questions that may affect every application of topic models remain mostly unanswered, particularly how to choose between competing models, how multiple local maxima affect task performance, and how to set the parameters of topic models. In this paper, we address these questions through a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), on three representative text mining tasks: document clustering, text categorization, and ad-hoc retrieval. The analysis of our experimental results provides a deeper understanding of topic models and many useful insights into how to optimize their performance for these typical tasks. The task-based evaluation framework generalizes to other topic models in the PLSA or LDA family.
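To make the local-maxima question concrete, the following is a minimal sketch of PLSA fitted by EM from several random initializations. It is not the authors' implementation: the function names (`plsa_em`, `best_of_restarts`), the toy count matrix, and the rule of keeping the run with the highest log-likelihood are all assumptions made for illustration.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def log_likelihood(counts, p_z_d, p_w_z):
    """Log-likelihood of the counts under p(w|d) = sum_z p(z|d) * p(w|z)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, c in enumerate(row):
            if c > 0:
                p = sum(pz * p_w_z[z][w] for z, pz in enumerate(p_z_d[d]))
                ll += c * math.log(p)
    return ll


def plsa_em(counts, num_topics, iterations=50, seed=0):
    """Run EM for PLSA on a document-term count matrix (hypothetical helper).

    Returns (log-likelihood, p(z|d), p(w|z)).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random start: different seeds may converge to different local maxima.
    p_z_d = [normalize([rng.random() + 1e-3 for _ in range(num_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(num_topics)]
    for _ in range(iterations):
        new_z_d = [[0.0] * num_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(num_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                c = counts[d][w]
                if c == 0:
                    continue
                # E-step: p(z|d,w) is proportional to p(z|d) * p(w|z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(num_topics)]
                total = sum(post)
                if total == 0:
                    continue
                for z in range(num_topics):
                    expected = c * post[z] / total
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalize the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return log_likelihood(counts, p_z_d, p_w_z), p_z_d, p_w_z


def best_of_restarts(counts, num_topics, restarts=5, iterations=50):
    """Fit from several seeds and keep the run with the highest log-likelihood."""
    runs = [plsa_em(counts, num_topics, iterations, seed=s) for s in range(restarts)]
    return max(runs, key=lambda run: run[0])


if __name__ == "__main__":
    # Toy corpus: rows are documents, columns are word counts.
    toy_counts = [[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 3, 4], [0, 1, 4, 4]]
    for s in range(3):
        print(s, plsa_em(toy_counts, num_topics=2, iterations=30, seed=s)[0])
```

Printing the log-likelihood per seed shows whether the runs agree. Whether the highest-likelihood run is also the best one for a downstream task is exactly the kind of question the paper examines, so this selection rule should be treated as a heuristic.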

Footnotes
4
For example, estimating PLSA on the whole LA collection would take about 40 h for a single run of 50 iterations on a Linux server with four 2.2 GHz AMD Opteron 848 processors and 32 GB of memory.
Metadata
Title
Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA
Authors
Yue Lu
Qiaozhu Mei
ChengXiang Zhai
Publication date
1 April 2011
Publisher
Springer Netherlands
Published in
Discover Computing, Issue 2/2011
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-010-9141-9
