ABSTRACT
The Web abounds with dyadic data that keeps increasing by every single second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in extracting the underlying structure is the matrix factorization, whose fame was further boosted in the Netflix challenge [26]. When we were trying to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. We therefore in this paper report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a tens of millions by hundreds of millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.
- D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In KDD, 2007. Google ScholarDigital Library
- E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In SIGIR, pages 19--26, 2006. Google ScholarDigital Library
- R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. EDBT Workshops, pages 588--596, 2004. Google ScholarDigital Library
- M. W. Berry, M. Browne, A. N. Langville, P. V. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155--173, September 2007.Google ScholarCross Ref
- E. Y. Chang, K. Zhu, and H. Bai. Parallel algorithms for mining large-scale datasets. In CIKM, 2009.Google ScholarDigital Library
- E. Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing support vector machines on distributed computers. In NIPS, 2007.Google Scholar
- D. Chen. Efficient geometric algorithms on the EREW PRAM. IEEE Trans. Parallel Distrib. Syst., 1995. Google ScholarDigital Library
- W.-Y. Chen, J.-C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In WWW, 2009. Google ScholarDigital Library
- Y. Chen, D. Pavlov, and J. F. Canny. Large-scale behavioral targeting. In KDD, pages 209--218, 2009. Google ScholarDigital Library
- C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.Google Scholar
- A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007. Google ScholarDigital Library
- W. L. C. David A. Kenny, Deborah A. Kashy. Query clustering using user logs. ACM Trans. Inf. Syst., 20(1):59--81, 2002. Google ScholarDigital Library
- W. L. C. David A. Kenny, Deborah A. Kashy. Dyadic Data Analysis. The Guilford Press, 2006.Google Scholar
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
- S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391--407, 1990.Google ScholarCross Ref
- I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, pages 269--274, 2001. Google ScholarDigital Library
- C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages 126--135, 2006. Google ScholarDigital Library
- Q. Dong, X. Wang, and L. Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285--290, 2005. Google ScholarDigital Library
- S. Filippone and M. Colajanni. PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Trans. Math. Softw., 26(4):527--550, 2000. Google ScholarDigital Library
- D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(1):145--154, 2002.Google ScholarCross Ref
- T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In NIPS, pages 466--472, 1999. Google ScholarDigital Library
- P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5, 2004. Google ScholarDigital Library
- K. Kanjani. Parallel non negative matrix factorization for document clustering. Technical report, Texas A & M University, May 2007.Google Scholar
- D. Kim, S. Sra, and I. S. Dhillon. Fast newton-type methods for the least squares nonnegative matrix approximation problem. In SDM, 2007. Google ScholarDigital Library
- D. Kim, S. Sra, and I. S. Dhillon. Fast projection based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1(1), 2008. Google ScholarDigital Library
- Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30--37, 2009. Google ScholarDigital Library
- W. Kowalczyk and N. Vlassis. Newscast em. In In NIPS 17, pages 713--720. MIT Press, 2005.Google Scholar
- D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788--791, 1999.Google ScholarCross Ref
- D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.Google ScholarDigital Library
- S. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. CVPR, 1:207, 2001.Google Scholar
- X. Li, C. Snoek, and M. Worring. Learning tag relevance by neighbor voting for social image retrieval. In Proc. of the 1st ACM international conference on Multimedia information retrieval (MIR '08), pages 180--187, 2008. Google ScholarDigital Library
- Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: letting web users vote for page importance. In SIGIR, pages 451--458, 2008. Google ScholarDigital Library
- R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational em for latent dirichlet allocation: An experimental evaluation of speed and scalability. Proc. of the 7th International Conference on Data Mining Workshops (ICDMW'07), pages 349--354, 2007. Google ScholarDigital Library
- P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997. Google ScholarDigital Library
- J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, pages 713--719, 2005. Google ScholarDigital Library
- S. A. Robila and L. G. Maciak. A parallel unmixing algorithm for hyperspectral images. In Intelligent Robots and Computer Vision XXIV, 2006.Google ScholarCross Ref
- M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Independent Component Analysis and Signal Separation, pages 540--547, 2009. Google ScholarDigital Library
- F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons. Document clustering using nonnegative matrix factorization. Inf. Process. Manage., 42(2), 2006. Google ScholarDigital Library
- A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In PKDD, pages 358--373, 2008.Google ScholarCross Ref
- S. Sra and I. S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Technical report, CS Department, University of Texas, June 2006.Google Scholar
- Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. Plda: Parallel latent dirichlet allocation for large-scale applications. In ICAAIM, 2009. Google ScholarDigital Library
- J. Wolfe, A. Haghighi, and D. Klein. Fully distributed em for very large datasets. In ICML, 2008. Google ScholarDigital Library
- W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267--273, 2003. Google ScholarDigital Library
- K. Yu, S. Zhu, J. Lafferty, and Y. Gong. Fast nonparametric matrix factorization for large-scale collaborative filtering. In SIGIR, pages 211--218, 2009. Google ScholarDigital Library
Index Terms
- Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce
Recommendations
Heuristics for exact nonnegative matrix factorization
The exact nonnegative matrix factorization (exact NMF) problem is the following: given an m-by-n nonnegative matrix X and a factorization rank r, find, if possible, an m-by-r nonnegative matrix W and an r-by-n nonnegative matrix H such that $$X = WH$$X=...
Large-Scale Matrix Factorization Using MapReduce
ICDMW '10: Proceedings of the 2010 IEEE International Conference on Data Mining WorkshopsDue to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers are facing the problem of factorizing large-scale matrices of dimensions in the orders of millions. Recent research [11] has ...
Quadratic nonnegative matrix factorization
In Nonnegative Matrix Factorization (NMF), a nonnegative matrix is approximated by a product of lower-rank factorizing matrices. Most NMF methods assume that each factorizing matrix appears only once in the approximation, thus the approximation is ...
Comments