PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing

Published: 06 May 2011

Abstract

Previous methods of distributed Gibbs sampling for LDA run into either memory or communication bottlenecks. To improve scalability, we propose four strategies: data placement, pipeline processing, word bundling, and priority-based scheduling. Experiments show that our strategies significantly reduce the unparallelizable communication bottleneck and achieve good load balancing, and hence improve scalability of LDA.

Published in

ACM Transactions on Intelligent Systems and Technology, Volume 2, Issue 3
April 2011
259 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/1961189

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 6 May 2011
• Accepted: 1 October 2010
• Revised: 1 June 2010
• Received: 1 April 2010

Qualifiers

• research-article
• Research
• Refereed
