An architecture for parallel topic models

Published: 01 September 2010

Abstract

This paper describes a high-performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is more than an order of magnitude faster than previous work and scales to hundreds of millions of documents and thousands of topics.
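
For reference only, the sketch below is our illustration (not the paper's code) of a standard single-machine collapsed Gibbs sampler for LDA; the paper's actual sampler and data layout may differ. Its count tables and per-token topic assignments are the kind of sampler state that a distributed implementation must keep synchronized.

import numpy as np


def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100):
    """Standard collapsed Gibbs sampler for LDA on a single machine.

    docs is a list of documents, each a list of integer word ids.
    The count tables n_tw, n_dt, and n_t (plus the assignments z) make up
    the sampler state.
    """
    rng = np.random.default_rng(0)
    n_tw = np.zeros((num_topics, vocab_size), dtype=np.int64)   # topic-word counts
    n_dt = np.zeros((len(docs), num_topics), dtype=np.int64)    # document-topic counts
    n_t = np.zeros(num_topics, dtype=np.int64)                  # tokens per topic

    # random initial topic assignment for every token
    z = []
    for d, doc in enumerate(docs):
        zd = rng.integers(num_topics, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            n_tw[t, w] += 1
            n_dt[d, t] += 1
            n_t[t] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # take the current assignment out of the counts
                n_tw[t, w] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                # full conditional p(z_di = t | everything else)
                p = (n_tw[:, w] + beta) / (n_t + vocab_size * beta) * (n_dt[d] + alpha)
                t = rng.choice(num_topics, p=p / p.sum())
                z[d][i] = t
                n_tw[t, w] += 1; n_dt[d, t] += 1; n_t[t] += 1
    return n_tw, n_dt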

The algorithm relies on a novel communication structure, namely a distributed (key, value) store for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases; instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is general and can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
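
To illustrate the idea of overlapping sampling with synchronization (again an illustration under assumed interfaces, not the paper's implementation), the sketch below pairs the sampling loop with a background thread that streams local count changes to a (key, value) store and pulls back the global values. The store client here is a hypothetical in-memory stand-in for whatever distributed storage backs the system.

import threading
import queue

import numpy as np


class InMemoryStore:
    """Single-machine stand-in for the distributed (key, value) store."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, delta):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


class CountSynchronizer:
    """Streams topic-word count updates to the store while sampling continues.

    The sampler calls record() after every topic reassignment; a daemon
    thread applies each delta to the global table and refreshes the local
    copy, so there is no separate synchronization phase.
    """
    def __init__(self, store, n_tw_local):
        self.store = store
        self.n_tw = n_tw_local                  # this worker's topic-word counts
        self.pending = queue.Queue()
        threading.Thread(target=self._sync_loop, daemon=True).start()

    def record(self, topic, word, delta):
        self.pending.put((topic, word, delta))  # non-blocking for the sampler

    def _sync_loop(self):
        while True:
            topic, word, delta = self.pending.get()
            key = f"n_tw:{topic}:{word}"
            self.store.add(key, delta)                      # publish local change
            self.n_tw[topic, word] = self.store.get(key)    # absorb others' changes


# Usage: a sampling loop like the one sketched above would call
# sync.record(old_t, w, -1) and sync.record(new_t, w, +1) after each
# reassignment instead of only updating its local arrays.
store = InMemoryStore()
sync = CountSynchronizer(store, np.zeros((8, 1000), dtype=np.int64))
sync.record(3, 42, +1)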

  • Published in

    Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
    September 2010
    1658 pages

    Publisher: VLDB Endowment
