ABSTRACT
Many machine learning and data mining (MLDM) problems like recommendation, topic modeling and medical diagnosis can be modeled as computing on bipartite graphs. However, most distributed graph-parallel systems are oblivious to the unique characteristics in such graphs and existing online graph partitioning algorithms usually causes excessive replication of vertices as well as significant pressure on network communication. This article identifies the challenges and opportunities of partitioning bipartite graphs for distributed MLDM processing and proposes BiGraph, a set of bipartite-oriented graph partitioning algorithms. BiGraph leverages observations such as the skewed distribution of vertices, discriminated computation load and imbalanced data sizes between the two subsets of vertices to derive a set of optimal graph partition algorithms that result in minimal vertex replication and network communication. BiGraph has been implemented on PowerGraph and is shown to have a performance boost up to 17.75X (from 1.38X) for four typical MLDM algorithms, due to reducing up to 62% vertex replication, and up to 96% network traffic.
- Netflix prize. http://www.netflixprize.com/.Google Scholar
- Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. In WWW (1998), pp. 107--117. Google ScholarDigital Library
- Chen, R., Shi, J., Chen, Y., Guan, H., Zang, B., and Chen, H. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. http://ipads.se.sjtu.edu.cn/projects/powerlyra/PowerLyra-IPADSTR-2013-001.pdf, 2013.Google Scholar
- Davis, T., and Hu, Y. The university of florida sparse matrix collection. http://www.cise.ufl.edu/research/sparse/matrices/index.html.Google Scholar
- Dhillon, I. S. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (2001), ACM, pp. 269--274. Google ScholarDigital Library
- Gao, B., Liu, T.-Y., Feng, G., Qin, T., Cheng, Q.-S., and Ma, W.-Y. Hierarchical taxonomy preparation for text categorization using consistent bipartite spectral graph copartitioning. Knowledge and Data Engineering, IEEE Transactions on 17, 9 (2005), 1263--1273. Google ScholarDigital Library
- Gao, B., Liu, T.-Y., Zheng, X., Cheng, Q.-S., and Ma, W.-Y. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (2005), ACM, pp. 41--50. Google ScholarDigital Library
- Gonzalez, J., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI (2012). Google ScholarDigital Library
- Jain, N., Liao, G., and Willke, T. L. Graphbuilder: scalable graph etl framework. In First International Workshop on Graph Data Management Experiences and Systems (New York, NY, USA, 2013), GRADES '13, ACM, pp. 4:1--4:6. Google ScholarDigital Library
- Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30--37. Google ScholarDigital Library
- Kumar, A., Beutel, A., Ho, Q., and Xing, E. P. Fugue: Slow-worker-agnostic distributed learning for big models on big data. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (2014), pp. 531--539.Google Scholar
- Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endow. 5, 8 (2012), 716--727. Google ScholarDigital Library
- Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In SIGMOD (2010), pp. 135--146. Google ScholarDigital Library
- Project, S. N. A. Stanford large network dataset collection. http://snap.stanford.edu/data/.Google Scholar
- Zha, H., He, X., Ding, C., Simon, H., and Gu, M. Bipartite graph partitioning and data clustering. In Proceedings of the tenth international conference on Information and knowledge management (2001), ACM, pp. 25--32. Google ScholarDigital Library
Index Terms
- Bipartite-oriented distributed graph partitioning for big learning
Recommendations
Rainbow matchings in an edge-colored planar bipartite graph
Highlights- Given an edge-colored graph G, if any two edges of G receive distinct colors, then we call G a rainbow graph. The anti-Ramsey number AR(G;H) is the maximum ...
AbstractIn this paper, we consider the existence of rainbow matchings in maximal bipartite planar graphs. We determine the maximum number of colors appearing in an edge-coloring of maximal bipartite planar graphs with a Hamilton cycle which ...
Bipartite bihypergraphs: A survey and new results
Let H^0 and H^1 be hypergraphs with the same vertex-set V. The ordered pair H=(H^0,H^1) is called a bihypergraph. A set S@?V is stable in H^i if S contains no hyperedges of H^i, i=0,1. A bihypergraph H=(H^0,H^1) is called bipartite if there exists an ...
Equistarable bipartite graphs
Recently, Milanič and Trotignon introduced the class of equistarable graphs as graphs without isolated vertices admitting positive weights on the edges such that a subset of edges is of total weight 1 if and only if it forms a maximal star. Based on ...
Comments