ABSTRACT
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has lead to the development of new graph-parallel systems (e.g., Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.
We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
- A. Abou-Rjeili and G. Karypis. Multilevel algorithms for partitioning power-law graphs. In IPDPS, 2006. 3.1.1 Google ScholarDigital Library
- R. Albert, H. Jeong, and A. L. Barabási. Error and attack tolerance of complex networks. In Nature, volume 406, pages 378--482, 2000. 3.1.1Google Scholar
- A. Buluç and J. R. Gilbert. The combinatorial blas: design, implementation, and applications. IJHPCA, 25(4):496--509, 2011. 1 Google ScholarDigital Library
- U. V. Çatalyürek, C. Aykanat, and B. Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656--683, 2010. 3.1.1 Google ScholarDigital Library
- R. Cheng et al. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys, 2012. 1 Google ScholarDigital Library
- J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI '12. 1, 2, 3.1.1, 4, 4.1, 4.2 Google ScholarDigital Library
- K. Lang. Finding good nearly balanced cuts in power law graphs. Technical Report YRL-2004-036, Yahoo! Research Labs, Pasadena, CA, Nov. 2004. 3.1.1Google Scholar
- J. Leskovec et al. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29--123, 2008. 3.1.1, 5Google ScholarCross Ref
- G. Malewicz, M. H. Austern, A. J. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135--146, 2010. 1, 4, 4.1 Google ScholarDigital Library
- L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999--66, Stanford InfoLab, 1999. 4Google Scholar
- P. Stutz, A. Bernstein, and W. Cohen. Signal/collect: graph algorithms for the (semantic) web. In ISWC, 2010. 1 Google ScholarDigital Library
- M. Zaharia et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. NSDI, 2012. 1 Google ScholarDigital Library
- GraphX: a resilient distributed graph system on Spark
Recommendations
GraphX: graph processing in a distributed dataflow framework
OSDI'14: Proceedings of the 11th USENIX conference on Operating Systems Design and ImplementationIn pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the ...
Improving the shortest path finding algorithm in apache spark graphX
ICMLSC '18: Proceedings of the 2nd International Conference on Machine Learning and Soft ComputingThe shortest path finding problem is one of the most important and common problems on graphs. It is also a basic problem applied to solve other problems such as the betweenness centrality problem, the closeness centrality problem... Therefore, in all ...
Comments