nach oben

The VLDB Journal

Erschienen in:

04.08.2021 | Special Issue Paper

G-thinker: a general distributed framework for finding qualified subgraphs in a big graph with load balancing

verfasst von: Da Yan, Guimu Guo, Jalal Khalil, M. Tamer Özsu, Wei-Shinn Ku, John C. S. Lui

Erschienen in: The VLDB Journal | Ausgabe 2/2022

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Finding from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems that attempt to scale them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework called G-thinker for subgraph finding algorithms, which adopts a task-based computation model, and which also provides a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph finding algorithms that can be easily adapted from existing serial algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs well overlap communication with computation to minimize the idle time of CPU cores. To further improve load balancing on graphs where the workloads of individual tasks can be drastically different due to biased graph density distribution, we propose to prioritize the scheduling of those tasks that tend to be long running for processing and decomposition, plus a timeout mechanism for task decomposition to prevent long-running straggler tasks. The idea has been integrated into a novelty algorithm for maximum clique finding (MCF) that adopts a hybrid task decomposition strategy, which significantly improves the running time of MCF on dense and large graphs: The algorithm finds a maximum clique of size 1,109 on a large and dense WikiLinks graph dataset in 70 minutes. Extensive experiments demonstrate that G-thinker achieves orders of magnitude speedup compared even with the fastest existing subgraph-centric system, and it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.

Vorheriger Artikel PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns

Nächster Artikel RDFFrames: knowledge graph access for machine learning tools

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

For a labeled graph, v’s adjacency list item to its neighbor u may also keep the labels of u and edge (v, u).

https://yanlab19870714.github.io/yanda/gthinker

https://github.com/guimuguo/G-thinker

https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/.

Arabesque Code. https://github.com/qcri/Arabesque

COST in the Land of Databases. https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md

G-Miner Code. https://github.com/yaobaiwei/GMiner

RStream Code. https://github.com/rstream-system

Bu, Y., Borkar, V.R., Jia, J., Carey, M.J., Condie, T.: Pregelix: Big(ger) graph analytics on a dataflow engine. PVLDB 8(2), 161–172 (2014)

Chen, H., Liu, M., Zhao, Y., Yan, X., Yan, D., Cheng, J.: G-miner: an efficient task-oriented graph mining system. In EuroSys, pages 32:1–32:12 (2018)

Cheng, J., Liu, Q., Li, Z., Fan, W., Lui, J. C. S., He, C.: VENUS: vertex-centric streamlined graph computation on a single PC. In ICDE, pages 1131–1142 (2015)

Ching, A., Edunov, S., Kabiljo, M., Logothetis, D., Muthukrishnan, S.: One trillion edges: Graph processing at facebook-scale. PVLDB 8(12), 1804–1815 (2015)

Chu, S., Cheng, J.: Triangle listing in massive networks. TKDD, 6(4):17:1–17:32 (2012)

10.

Csun, S., Luo, Q.: Parallelizing recursive backtracking based subgraph matching on a single machine. In ICPADS, pages 42–50. IEEE (2018)

11.

Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137–150 (2004)

12.

Dominguez-Sal, D., Urbón-Bayes, P., Giménez-Vañó, A., Gómez-Villamor, S., Martínez-Bazan, N., Larriba-Pey, J. L.: Survey of graph database performance on the HPC scalable graph analysis benchmark. In WAIM Workshops, volume 6185 of Lecture Notes in Computer Science, pages 37–48. Springer (2010)

13.

Friendster. http://snap.stanford.edu/data/com-friendster.html

14.

Gao, J., Zhou, C., Zhou, J., Yu, J. X.: Continuous pattern detection over billion-edge graph using distributed framework. In I. F. Cruz, E. Ferrari, Y. Tao, E. Bertino, and G. Trajcevski, editors, ICDE, pages 556–567. IEEE Computer Society (2014)

15.

Guo, G., Yan, D., Özsu, M. T., Jiang, Z., Khalil, J.: Scalable mining of maximal quasi-cliques: An algorithm-system codesign approach. PVLDB (2021)

16.

Hu, X., Tao, Y., Chung, C.: I/o-efficient algorithms on triangle listing and counting. ACM Trans. Database Syst., 39(4):27:1–27:30 (2014)

17.

Joshi, A., Zhang, Y., Bogdanov, P., Hwang, J.: An efficient system for subgraph discovery. In IEEE Big Data, pages 703–712 (2018)

18.

Kyrola, A., Blelloch, G. E., Guestrin, C.: Graphchi: Large-scale graph computation on just a PC. In OSDI, pages 31–46 (2012)

19.

Lee, J., Han, W., Kasperovics, R., Lee, J.: An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB 6(2), 133–144 (2012)

20.

Lin, W., Xiao, X., Ghinita, G.: Large-scale frequent subgraph mining in mapreduce. In ICDE, pages 844–855 (2014)

21.

Liu, G., Wong, L.: Effective pruning techniques for mining quasi-cliques. In PKDD, pages 33–49 (2008)

22.

Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert, J. C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135–146 (2010)

23.

McCune, R. R., Weninger, T., Madey, G.: Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv., 48(2):25:1–25:39 (2015)

24.

McSherry, F., Isard, M., Murray, D. G.: Scalability! but at what cost? In HotOS (2015)

25.

Mhedhbi, A., Salihoglu, S.: Optimizing subgraph queries by combining binary and worst-case optimal joins. Proc. VLDB Endow. 12(11), 1692–1704 (2019)CrossRef

26.

Michael, M. M., Scott, M. L.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In PODC, pages 267–275 (1996)

27.

Quamar, A., Deshpande, A., Lin, J.: Nscale: neighborhood-centric large-scale graph analytics in the cloud. The VLDB Journal, pages 1–26 (2014)

28.

Quick, L., Wilkinson, P., Hardcastle, D.: Using pregel-like large scale graph processing frameworks for social network analysis. In ASONAM, pages 457–463 (2012)

29.

Reza, T., Ripeanu, M., Tripoul, N., Sanders, G., Pearce, R.: Prunejuice: pruning trillion-edge graphs to a precise pattern-matching solution. In SC, pages 21:1–21:17. IEEE / ACM (2018)

30.

Roy, A., Mihailovic, I., Zwaenepoel, W.: X-stream: edge-centric graph processing using streaming partitions. In SOSP, pages 472–488 (2013)

31.

Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., Xu, N.: Parallel subgraph listing in a large-scale graph. In C. E. Dyreson, F. Li, and M. T. Özsu, editors, SIGMOD, pages 625–636. ACM (2014)

32.

Sheroubi, M.: Benchmarking performance for neo4j in a social media application, https://people.ok.ubc.ca/rlawrenc/research/Students/MS_20_Thesis.pdf. Computer Science Bachelor of Science Thesis, The University of British Columbia – Okanagan Campus (2020)

33.

Sun, Z., Wang, H., Wang, H., Shao, B., Li, J.: Efficient subgraph matching on billion node graphs. PVLDB 5(9), 788–799 (2012)

34.

Talukder, N., Zaki, M.J.: A distributed approach for graph mining in massive networks. Data Min. Knowl. Discov. 30(5), 1024–1052 (2016)MathSciNetCrossRef

35.

Teixeira, C. H. C., Fonseca, A. J., Serafini, M., Siganos, G., Zaki, M. J., Aboulnaga, A.: Arabesque: a system for distributed graph mining. In SOSP, pages 425–440 (2015)

36.

Tian, Y., Balmin, A., Corsten, S.A., Tatikonda, S., McPherson, J.: From “think like a vertex’’ to “think like a graph’’. PVLDB 7(3), 193–204 (2013)

37.

Tomita, E., Seki, T.: An efficient branch-and-bound algorithm for finding a maximum clique. In C. Calude, M. J. Dinneen, and V. Vajnovszki, editors, DMTCS, volume 2731 of Lecture Notes in Computer Science, pages 278–289. Springer (2003)

38.

Wang, K., Zuo, Z., Thorpe, J., Nguyen, T. Q., Xu, G. H.: Rstream: Marrying relational algebra with streaming for efficient graph mining on A single machine. In OSDI, pages 763–782 (2018)

39.

Yan, D., Bu, Y., Tian, Y., Deshpande, A.: Big graph analytics platforms. Foundations and Trends in Databases 7(1–2), 1–195 (2017)CrossRef

40.

Yan, D., Chen, H., Cheng, J., Özsu, M. T., Zhang, Q., Lui, J. C. S.: G-thinker: Big graph mining made easier and faster. CoRR, arXiv:1709.03110, (2017)

41.

Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: A block-centric framework for distributed computation on real-world graphs. PVLDB 7(14), 1981–1992 (2014)

42.

Yan, D., Cheng, J., Lu, Y., Ng, W.: Effective techniques for message reduction and load balancing in distributed graph computation. In WWW, pages 1307–1317 (2015)

43.

Yan, D., Cheng, J., Özsu, M.T., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng, W.: A general-purpose query-centric framework for querying big graphs. Proc. VLDB Endow. 9(7), 564–575 (2016)CrossRef

44.

Yan, D., Guo, G., Chowdhury, M. M. R., Özsu, M. T., Ku, W., Lui, J. C. S.: G-thinker: A distributed framework for mining subgraphs in a big graph. In ICDE, pages 1369–1380. IEEE (2020)

45.

Yan, D., Qu, W., Guo, G., Wang, X.: Prefixfpm: A parallel framework for general-purpose frequent pattern mining. In ICDE, pages 1938–1941. IEEE (2020)

46.

Zhang, Q., Yan, D., Cheng, J.: Quegel: A general-purpose system for querying big graphs. In F. Özcan, G. Koutrika, and S. Madden, editors, SIGMOD, pages 2189–2192. ACM (2016)

Titel: G-thinker: a general distributed framework for finding qualified subgraphs in a big graph with load balancing
verfasst von: Da Yan
Guimu Guo
Jalal Khalil
M. Tamer Özsu
Wei-Shinn Ku
John C. S. Lui
Publikationsdatum: 04.08.2021
Verlag: Springer Berlin Heidelberg
Erschienen in: The VLDB Journal / Ausgabe 2/2022
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-021-00688-z

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 2/2022

Special issue on big graph data management and processing

Distributed temporal graph analytics with GRADOOP

Anchored coreness: efficient reinforcement of social networks

A design space for RDF data representations

A fractional memory-efficient approach for online continuous-time influence maximization

PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns

Premium Partner