PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing

Abstract
Previous methods for distributed Gibbs sampling of LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies substantially reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve the scalability of LDA.
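The core idea behind pipeline processing can be illustrated with a small sketch: while the sampler works on the current word bundle, a background thread prefetches the topic counts needed for the next bundle, so communication latency is hidden behind computation. This is only a minimal toy illustration, not the paper's implementation; `fetch_topic_counts`, `gibbs_pass`, and the local `topic_table` are hypothetical stand-ins for the remote topic-count servers and the collapsed Gibbs sampler.

```python
import queue
import random
import threading

def fetch_topic_counts(bundle, topic_table):
    # Stand-in for a round-trip to a remote topic-count server.
    return {w: topic_table[w][:] for w in bundle}

def gibbs_pass(bundle, counts, rng):
    # Stand-in for collapsed Gibbs sampling over one word bundle:
    # draw a topic for each word in proportion to its (smoothed) counts.
    assignments = {}
    for w in bundle:
        weights = [c + 1 for c in counts[w]]  # crude smoothing prior
        assignments[w] = rng.choices(range(len(weights)), weights=weights)[0]
    return assignments

def pipelined_sampler(bundles, topic_table, seed=0):
    rng = random.Random(seed)
    prefetched = queue.Queue(maxsize=1)

    def prefetcher():
        # Fetch counts for upcoming bundles while the main thread samples.
        for bundle in bundles:
            prefetched.put(fetch_topic_counts(bundle, topic_table))

    t = threading.Thread(target=prefetcher)
    t.start()
    results = {}
    for bundle in bundles:
        counts = prefetched.get()  # overlaps with the previous gibbs_pass
        results.update(gibbs_pass(bundle, counts, rng))
    t.join()
    return results

# Toy run: two word bundles over 3 topics.
table = {"cat": [5, 1, 0], "dog": [4, 2, 0], "ion": [0, 1, 6], "atom": [0, 0, 7]}
out = pipelined_sampler([["cat", "dog"], ["ion", "atom"]], table)
```

The single-slot queue bounds prefetching to one bundle ahead, which mirrors the pipeline depth one would tune against available memory.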