Abstract
This paper describes a high-performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude and is capable of dealing with hundreds of millions of documents and thousands of topics.
The algorithm relies on a novel communication structure, namely a distributed (key, value) store for synchronizing the sampler state between computers. Our architecture obviates the need for separate computation and synchronization phases; instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
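The synchronization scheme described above can be illustrated with a minimal sketch. All class and method names here (`KVStore`, `Sampler`, `increment`, `sync`) are hypothetical, and the in-process dictionary merely stands in for a real distributed (key, value) store; the point is the delta-based reconciliation that lets each sampler keep working on a local copy of the counts without a global barrier:

```python
class KVStore:
    """Stand-in for a distributed (key, value) store (e.g. memcached-style)."""

    def __init__(self):
        self._data = {}

    def increment(self, key, delta):
        # Atomic increment-and-read in a real store; a dict update here.
        self._data[key] = self._data.get(key, 0) + delta
        return self._data[key]


class Sampler:
    """One worker's view of the global topic counts."""

    def __init__(self, store):
        self.store = store
        self.local = {}   # current local view of each count
        self.synced = {}  # value of each count at the last sync

    def observe(self, word, topic):
        # Local update applied immediately; global reconciliation is deferred,
        # so sampling never blocks on the network.
        key = (word, topic)
        self.local[key] = self.local.get(key, 0) + 1

    def sync(self, key):
        # Push only the delta accumulated since the last sync, then adopt
        # the fresh global value, which includes other workers' updates.
        delta = self.local.get(key, 0) - self.synced.get(key, 0)
        fresh = self.store.increment(key, delta)
        self.local[key] = fresh
        self.synced[key] = fresh


# Two samplers sharing one store: after each syncs, both converge on the
# combined count even though neither ever waited for the other.
store = KVStore()
a, b = Sampler(store), Sampler(store)
a.observe("w", 0)
b.observe("w", 0)
b.observe("w", 0)
a.sync(("w", 0))  # store now holds a's contribution
b.sync(("w", 0))  # b pushes its delta and sees the merged total
a.sync(("w", 0))  # a's delta is zero; it just refreshes its view
```

Because each worker sends only its delta, concurrent updates from other machines are never overwritten, and synchronization can proceed key by key while sampling continues.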