skip to main content
10.1145/1772690.1772760acmotherconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

Published:26 April 2010Publication History

ABSTRACT

The Web abounds with dyadic data that keeps increasing by every single second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in extracting the underlying structure is the matrix factorization, whose fame was further boosted in the Netflix challenge [26]. When we were trying to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. We therefore in this paper report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a tens of millions by hundreds of millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.

References

  1. D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In KDD, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In SIGIR, pages 19--26, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. EDBT Workshops, pages 588--596, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. M. W. Berry, M. Browne, A. N. Langville, P. V. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155--173, September 2007.Google ScholarGoogle ScholarCross RefCross Ref
  5. E. Y. Chang, K. Zhu, and H. Bai. Parallel algorithms for mining large-scale datasets. In CIKM, 2009.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. E. Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing support vector machines on distributed computers. In NIPS, 2007.Google ScholarGoogle Scholar
  7. D. Chen. Efficient geometric algorithms on the EREW PRAM. IEEE Trans. Parallel Distrib. Syst., 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. W.-Y. Chen, J.-C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In WWW, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Y. Chen, D. Pavlov, and J. F. Canny. Large-scale behavioral targeting. In KDD, pages 209--218, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.Google ScholarGoogle Scholar
  11. A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. W. L. C. David A. Kenny, Deborah A. Kashy. Query clustering using user logs. ACM Trans. Inf. Syst., 20(1):59--81, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. L. C. David A. Kenny, Deborah A. Kashy. Dyadic Data Analysis. The Guilford Press, 2006.Google ScholarGoogle Scholar
  14. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391--407, 1990.Google ScholarGoogle ScholarCross RefCross Ref
  16. I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, pages 269--274, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages 126--135, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Q. Dong, X. Wang, and L. Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285--290, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Filippone and M. Colajanni. PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Trans. Math. Softw., 26(4):527--550, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(1):145--154, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  21. T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In NIPS, pages 466--472, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. K. Kanjani. Parallel non negative matrix factorization for document clustering. Technical report, Texas A & M University, May 2007.Google ScholarGoogle Scholar
  24. D. Kim, S. Sra, and I. S. Dhillon. Fast newton-type methods for the least squares nonnegative matrix approximation problem. In SDM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. Kim, S. Sra, and I. S. Dhillon. Fast projection based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1(1), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30--37, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. W. Kowalczyk and N. Vlassis. Newscast em. In In NIPS 17, pages 713--720. MIT Press, 2005.Google ScholarGoogle Scholar
  28. D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788--791, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  29. D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. S. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. CVPR, 1:207, 2001.Google ScholarGoogle Scholar
  31. X. Li, C. Snoek, and M. Worring. Learning tag relevance by neighbor voting for social image retrieval. In Proc. of the 1st ACM international conference on Multimedia information retrieval (MIR '08), pages 180--187, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: letting web users vote for page importance. In SIGIR, pages 451--458, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational em for latent dirichlet allocation: An experimental evaluation of speed and scalability. Proc. of the 7th International Conference on Data Mining Workshops (ICDMW'07), pages 349--354, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, pages 713--719, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. S. A. Robila and L. G. Maciak. A parallel unmixing algorithm for hyperspectral images. In Intelligent Robots and Computer Vision XXIV, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  37. M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Independent Component Analysis and Signal Separation, pages 540--547, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons. Document clustering using nonnegative matrix factorization. Inf. Process. Manage., 42(2), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In PKDD, pages 358--373, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  40. S. Sra and I. S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Technical report, CS Department, University of Texas, June 2006.Google ScholarGoogle Scholar
  41. Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. Plda: Parallel latent dirichlet allocation for large-scale applications. In ICAAIM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. J. Wolfe, A. Haghighi, and D. Klein. Fully distributed em for very large datasets. In ICML, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267--273, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. K. Yu, S. Zhu, J. Lafferty, and Y. Gong. Fast nonparametric matrix factorization for large-scale collaborative filtering. In SIGIR, pages 211--218, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      WWW '10: Proceedings of the 19th international conference on World wide web
      April 2010
      1407 pages
      ISBN:9781605587998
      DOI:10.1145/1772690

      Copyright © 2010 International World Wide Web Conference Committee (IW3C2)

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 April 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,899of8,196submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    ePub

    View this article in ePub.

    View ePub