research-article

Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

Authors:
Chao Liu

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Hung-chih Yang

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Jinliang Fan

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Li-Wei He

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

,
Yi-Min Wang

Microsoft Research, Redmond, WA, USA

Microsoft Research, Redmond, WA, USA
View Profile

WWW '10: Proceedings of the 19th international conference on World wide webApril 2010Pages 681–690https://doi.org/10.1145/1772690.1772760

Published:26 April 2010Publication History

WWW '10: Proceedings of the 19th international conference on World wide web

Pages 681–690

ABSTRACT

The Web abounds with dyadic data that keeps increasing by every single second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in extracting the underlying structure is the matrix factorization, whose fame was further boosted in the Netflix challenge [26]. When we were trying to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. We therefore in this paper report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a tens of millions by hundreds of millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.

References

D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In KDD, 2007. Google ScholarDigital Library
E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In SIGIR, pages 19--26, 2006. Google ScholarDigital Library
R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. EDBT Workshops, pages 588--596, 2004. Google ScholarDigital Library
M. W. Berry, M. Browne, A. N. Langville, P. V. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155--173, September 2007.Google ScholarCross Ref
E. Y. Chang, K. Zhu, and H. Bai. Parallel algorithms for mining large-scale datasets. In CIKM, 2009.Google ScholarDigital Library
E. Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing support vector machines on distributed computers. In NIPS, 2007.Google Scholar
D. Chen. Efficient geometric algorithms on the EREW PRAM. IEEE Trans. Parallel Distrib. Syst., 1995. Google ScholarDigital Library
W.-Y. Chen, J.-C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In WWW, 2009. Google ScholarDigital Library
Y. Chen, D. Pavlov, and J. F. Canny. Large-scale behavioral targeting. In KDD, pages 209--218, 2009. Google ScholarDigital Library
C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.Google Scholar
A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007. Google ScholarDigital Library
W. L. C. David A. Kenny, Deborah A. Kashy. Query clustering using user logs. ACM Trans. Inf. Syst., 20(1):59--81, 2002. Google ScholarDigital Library
W. L. C. David A. Kenny, Deborah A. Kashy. Dyadic Data Analysis. The Guilford Press, 2006.Google Scholar
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391--407, 1990.Google ScholarCross Ref
I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, pages 269--274, 2001. Google ScholarDigital Library
C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages 126--135, 2006. Google ScholarDigital Library
Q. Dong, X. Wang, and L. Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285--290, 2005. Google ScholarDigital Library
S. Filippone and M. Colajanni. PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Trans. Math. Softw., 26(4):527--550, 2000. Google ScholarDigital Library
D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(1):145--154, 2002.Google ScholarCross Ref
T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In NIPS, pages 466--472, 1999. Google ScholarDigital Library
P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5, 2004. Google ScholarDigital Library
K. Kanjani. Parallel non negative matrix factorization for document clustering. Technical report, Texas A & M University, May 2007.Google Scholar
D. Kim, S. Sra, and I. S. Dhillon. Fast newton-type methods for the least squares nonnegative matrix approximation problem. In SDM, 2007. Google ScholarDigital Library
D. Kim, S. Sra, and I. S. Dhillon. Fast projection based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1(1), 2008. Google ScholarDigital Library
Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30--37, 2009. Google ScholarDigital Library
W. Kowalczyk and N. Vlassis. Newscast em. In In NIPS 17, pages 713--720. MIT Press, 2005.Google Scholar
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788--791, 1999.Google ScholarCross Ref
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.Google ScholarDigital Library
S. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. CVPR, 1:207, 2001.Google Scholar
X. Li, C. Snoek, and M. Worring. Learning tag relevance by neighbor voting for social image retrieval. In Proc. of the 1st ACM international conference on Multimedia information retrieval (MIR '08), pages 180--187, 2008. Google ScholarDigital Library
Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: letting web users vote for page importance. In SIGIR, pages 451--458, 2008. Google ScholarDigital Library
R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational em for latent dirichlet allocation: An experimental evaluation of speed and scalability. Proc. of the 7th International Conference on Data Mining Workshops (ICDMW'07), pages 349--354, 2007. Google ScholarDigital Library
P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997. Google ScholarDigital Library
J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, pages 713--719, 2005. Google ScholarDigital Library
S. A. Robila and L. G. Maciak. A parallel unmixing algorithm for hyperspectral images. In Intelligent Robots and Computer Vision XXIV, 2006.Google ScholarCross Ref
M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Independent Component Analysis and Signal Separation, pages 540--547, 2009. Google ScholarDigital Library
F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons. Document clustering using nonnegative matrix factorization. Inf. Process. Manage., 42(2), 2006. Google ScholarDigital Library
A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In PKDD, pages 358--373, 2008.Google ScholarCross Ref
S. Sra and I. S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Technical report, CS Department, University of Texas, June 2006.Google Scholar
Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. Plda: Parallel latent dirichlet allocation for large-scale applications. In ICAAIM, 2009. Google ScholarDigital Library
J. Wolfe, A. Haghighi, and D. Klein. Fully distributed em for very large datasets. In ICML, 2008. Google ScholarDigital Library
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267--273, 2003. Google ScholarDigital Library
K. Yu, S. Zhu, J. Lafferty, and Y. Gong. Fast nonparametric matrix factorization for large-scale collaborative filtering. In SIGIR, pages 211--218, 2009. Google ScholarDigital Library

Index Terms

Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce
1. Mathematics of computing
  1. Mathematical software

Recommendations

Heuristics for exact nonnegative matrix factorization

The exact nonnegative matrix factorization (exact NMF) problem is the following: given an m-by-n nonnegative matrix X and a factorization rank r, find, if possible, an m-by-r nonnegative matrix W and an r-by-n nonnegative matrix H such that $$X = WH$$X=...
Read More
Large-Scale Matrix Factorization Using MapReduce
ICDMW '10: Proceedings of the 2010 IEEE International Conference on Data Mining Workshops

Due to the popularity of nonnegative matrix factorization and the increasing availability of massive data sets, researchers are facing the problem of factorizing large-scale matrices of dimensions in the orders of millions. Recent research [11] has ...
Read More
Quadratic nonnegative matrix factorization

In Nonnegative Matrix Factorization (NMF), a nonnegative matrix is approximated by a product of lower-rank factorizing matrices. Most NMF methods assume that each factorizing matrix appears only once in the approximation, thus the approximation is ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN:9781605587998
DOI:10.1145/1772690
General Chairs:
Michael Rappa
North Carolina State University, USA
,
Paul Jones
University of North Carolina at Chapel Hill, USA
,
Program Chairs:
Juliana Freire
University of Utah, USA
,
Soumen Chakrabarti
Indian Institute of Technology, India
Copyright © 2010 International World Wide Web Conference Committee (IW3C2)
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 April 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
distributed computing
dyadic data
mapreduce
nonnegative matrix factorization
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 146
  Total Citations
  View Citations
- 1,268
  Total Downloads
- Downloads (Last 12 months)38
- Downloads (Last 6 weeks)10
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

ePub

View this article in ePub.

View ePub

Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

WWW '10: Proceedings of the 19th international conference on World wide web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Heuristics for exact nonnegative matrix factorization

Large-Scale Matrix Factorization Using MapReduce

Quadratic nonnegative matrix factorization