Abstract
In this paper, we study the problem of choosing among partitioning strategies in distributed graph processing systems. To this end, we evaluate and characterize both the performance and resource usage of different partitioning strategies under various popular distributed graph processing systems, applications, input graphs, and execution environments. Through our experiments, we found that no single partitioning strategy is the best fit for all situations, and that the choice of partitioning strategy has a significant effect on resource usage and application run-time. Our experiments demonstrate that the choice of partitioning strategy depends on (1) the degree distribution of input graph, (2) the type and duration of the application, and (3) the cluster size. Based on our results, we present rules of thumb to help users pick the best partitioning strategy for their particular use cases. We present results from each system, as well as from all partitioning strategies implemented in one common system (PowerLyra).
- Apache Giraph. http://giraph.apache.org/. Last accessed 2016-04-18.Google Scholar
- P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks. In Proc. of Int'l. Conf. on World Wide Web (WWW). ACM, 2011. Google ScholarDigital Library
- P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proc. of Int'l. Conf. World Wide Web (WWW). ACM, 2004. Google ScholarDigital Library
- R. Chen, J. Shi, Y. Chen, and H. Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. In Proc. of European Conf. on Computer Systems, (EuroSys), pages 1:1-1:15. ACM, 2015. Google ScholarDigital Library
- A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: graph processing at Facebook-scale. Proc. of VLDB Endowment, 2015. Google ScholarDigital Library
- DIMACS Challenge 9 - Shortest Paths. http://www.dis.uniroma1.it/challenge9/. Last Accessed 2016-05-30.Google Scholar
- M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. of Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM), pages 251--262. ACM, 1999. Google ScholarDigital Library
- J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Proc. of Symposium on Operating Systems Design and Implementation (OSDI). USENIX, 2012. Google ScholarDigital Library
- J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In Proc. of Symposium on Operating Systems Design and Implementation (OSDI). USENIX, 2014. Google ScholarDigital Library
- H. Halberstam and R. R. Laxton. Perfect difference sets. Proc. of Glasgow Mathematical Association, 6:177--184, July 1964.Google ScholarCross Ref
- M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin. An experimental comparison of pregel-like graph processing systems. In Proc. of VLDB Endowment, volume 7, pages 1047--1058. VLDB Endowment, Aug. 2014. Google ScholarDigital Library
- I. Hoque and I. Gupta. LFGraph: Simple and fast distributed graph analytics. In Proc. of Conf. on Timely Results In Operating Systems (TRiOS). ACM, 2013. Google ScholarDigital Library
- N. Jain, G. Liao, and T. L. Willke. Graphbuilder: scalable graph ETL framework. In First Int'l. Workshop on Graph Data Management Experiences and Systems, (GRADES), 2013. Google ScholarDigital Library
- R. M. Karp. Reducibility among combinatorial problems. In Proc. of a Symposium on the Complexity of Computer Computations, pages 85--103, 1972.Google ScholarCross Ref
- H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proc. of Int'l. Conf. on World Wide Web (WWW). ACM, 2010. Google ScholarDigital Library
- A. Kyrola, G. E. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In Proc. of Symposium on Operating Systems Design and Implementation (OSDI). USENIX, 2012. Google ScholarDigital Library
- Laboratory For Web Algorithms. http://law.di.unimi.it/datasets.php. Last Accessed 2016-04-18.Google Scholar
- M. LeBeane, S. Song, R. Panda, J. H. Ryoo, and L. K. John. Data partitioning strategies for graph workloads on heterogeneous clusters. In Proc. of Int'l. Conf. for High Performance Computing, Networking, Storage and Analysis, (SC '15), pages 56:1--56:12. ACM, 2015. Google ScholarDigital Library
- J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.Google Scholar
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proc. of VLDB Endowment. VLDB Endowment, 2012. Google ScholarDigital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conf. on Uncertainty in Artificial Intelligence (UAI), 2010. Google ScholarDigital Library
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proc. of Int'l. Conf. on Management of Data (SIGMOD). ACM, 2010. Google ScholarDigital Library
- F. Petroni, L. Querzoni, K. Daudjee, S. Kamali, and G. Iacoboni. Hdrf: Stream-based partitioning for power-law graphs. In Proc. of 24th ACM Int'l. on Conf. on Information and Knowledge Management (CIKM), pages 243--252. ACM, 2015. Google ScholarDigital Library
- S. Schelter, S. Ewen, K. Tzoumas, and V. Markl. All roads lead to Rome: Optimistic recovery for distributed iterative data processing. In Proc. of Int'l. Conf. on Information and Knowledge Management (CIKM). ACM, 2013. Google ScholarDigital Library
- C. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In Proc. of 7th ACM Int'l. Conf. on Web Search and Data Mining (WSDM), pages 333--342. ACM, 2014. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990. Google ScholarDigital Library
- S. Verma, L. M. Leslie, Y. Shin, and I. Gupta. An experimental comparison of partitioning strategies in distributed graph processing. Technical Report, IDEALS, http://hdl.handle.net/2142/91657, 2016.Google Scholar
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. of Conf. on Networked Systems Design and Implementation (NSDI). USENIX, 2012. Google ScholarDigital Library
Index Terms
- An experimental comparison of partitioning strategies in distributed graph processing
Recommendations
Application Driven Graph Partitioning
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataGraph partitioning is crucial to parallel computations on large graphs. The choice of partitioning strategies has strong impact on not only the performance of graph algorithms, but also the design of the algorithms. For an algorithm of our interest, ...
Experimental Analysis of Streaming Algorithms for Graph Partitioning
SIGMOD '19: Proceedings of the 2019 International Conference on Management of DataWe report a systematic performance study of streaming graph partitioning algorithms. Graph partitioning plays a crucial role in overall system performance as it has a significant impact on both load balancing and inter-machine communication. The ...
Graph partitioning strategies: one size does not fit all
AbstractAs an important part of distributed graph computing, graph partitioning has been widely studied. However, the majority of the existing approaches to distributed graph partitioning barely take into consideration the relationship between the ...
Comments