PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing

Abstract
Previous methods for distributed Gibbs sampling of LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies substantially reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve the scalability of LDA.
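The core idea behind pipeline processing can be illustrated with a small sketch: while the sampler works on the current word bundle, a background thread prefetches the topic counts needed for the next bundle, so communication latency is hidden behind computation. This is only a minimal toy illustration, not the paper's implementation; `fetch_topic_counts`, `gibbs_pass`, and the local `topic_table` are hypothetical stand-ins for the remote topic-count servers and the collapsed Gibbs sampler.

```python
import queue
import random
import threading

def fetch_topic_counts(bundle, topic_table):
    # Stand-in for a round-trip to a remote topic-count server.
    return {w: topic_table[w][:] for w in bundle}

def gibbs_pass(bundle, counts, rng):
    # Stand-in for collapsed Gibbs sampling over one word bundle:
    # draw a topic for each word in proportion to its (smoothed) counts.
    assignments = {}
    for w in bundle:
        weights = [c + 1 for c in counts[w]]  # crude smoothing prior
        assignments[w] = rng.choices(range(len(weights)), weights=weights)[0]
    return assignments

def pipelined_sampler(bundles, topic_table, seed=0):
    rng = random.Random(seed)
    prefetched = queue.Queue(maxsize=1)

    def prefetcher():
        # Fetch counts for upcoming bundles while the main thread samples.
        for bundle in bundles:
            prefetched.put(fetch_topic_counts(bundle, topic_table))

    t = threading.Thread(target=prefetcher)
    t.start()
    results = {}
    for bundle in bundles:
        counts = prefetched.get()  # overlaps with the previous gibbs_pass
        results.update(gibbs_pass(bundle, counts, rng))
    t.join()
    return results

# Toy run: two word bundles over 3 topics.
table = {"cat": [5, 1, 0], "dog": [4, 2, 0], "ion": [0, 1, 6], "atom": [0, 0, 7]}
out = pipelined_sampler([["cat", "dog"], ["ion", "atom"]], table)
```

The single-slot queue bounds prefetching to one bundle ahead, which mirrors the pipeline depth one would tune against available memory.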