nach oben

The VLDB Journal

Erschienen in:

07.05.2021 | Regular Paper

Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs

verfasst von: Yingxia Shao, Shiyue Huang, Yawen Li, Xupeng Miao, Bin Cui, Lei Chen

Erschienen in: The VLDB Journal | Ausgabe 5/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Second-order random walk is an important technique for graph analysis. Many applications including graph embedding, proximity measure and community detection use it to capture higher-order patterns in the graph, thus improving the model accuracy. However, the memory explosion problem of this technique hinders it from analyzing large graphs. When processing a billion-edge graph like Twitter, existing solutions (e.g., alias method) of the second-order random walk may take up 1796TB memory. Such high memory consumption comes from the memory-unaware strategies for the node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling the skewed probability distributions existed in natural graphs. Second, to achieve the high efficiency of the second-order random walk within arbitrary memory budgets, we propose a novel memory-aware framework on the basis of the cost model. The framework applies a cost-based optimizer to assign desirable node sampling method for each node or edge in the graph within a memory budget meanwhile minimizing the time cost of the random walk. Finally, the framework provides general programming interfaces for users to define new second-order random walk models easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency by reducing 90% of the memory cost.

Vorheriger Artikel Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems

Nächster Artikel Efficient Hop-constrained s-t Simple Path Enumeration

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Nur mit Berechtigung zugänglich

http://www.columbia.edu/~ks20/4703-Sigman/4703-07-Notes-ARM.pdf

Note that the coefficient c is incurred by finding the edge id between previous node u and current node v to access the group information.

https://www.openmp.org/

http://snap.stanford.edu/data/soc-LiveJournal1.html

https://an.kaist.ac.kr/traces/WWW2010.html

http://law.di.unimi.it/webdata/uk-2007-05/

http://socialcomputing.asu.edu/datasets

Note that the minimal memory of rejection method is different from the one in our conference version, because we store the number of common neighbors of edges in memory for fast computing the exact bounding constant, thus improving the efficiency of rejection method over billion-edge graphs.

https://www.mindspore.cn/

Boldi, P., Rosa, M.: Arc-community detection via triangular random walks. In: 2012 Eighth Latin American Web Congress, pp. 48–56 (2012)

Bonner, S., Kureshi, I., Brennan, J., Theodoropoulos, G., McGough, A.S., Obara, B.: Exploring the semantic content of unsupervised graph embeddings: an empirical study. Data Sci. Eng. 4(3), 269–289 (2019)CrossRef

Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998)

Das Sarma, A., Molla, A.R., Pandurangan, G.: Efficient random walk sampling in distributed networks. J. Parallel Distrib. Comput. 77, 84–94 (2015)CrossRef

Dave, V.S., Zhang, B., Chen, P.Y., Hasan, M.A.: Neural-brane: neural Bayesian personalized ranking for attributed network embedding. Data Sci. Eng. 4(2), 119–131 (2019)CrossRef

Dudzinski, K., Walukiewicz, S.: Exact methods for the knapsack problem and its generalizations. Eur. J. Op. Res. 28(1), 3–21 (1987)MathSciNetCrossRef

Feng, S., Cong, G., Khan, A., Li, X., Liu, Y., Chee, Y.M.: Inf2vec: Latent representation model for social influence embedding. In: ICDE, pp. 941–952 (2018)

Grimmett, G., Stirzaker, D.: Probability and Random Processes, vol. 80. Oxford University Press, Oxford (2001)MATH

Grover, A., Leskovec, J.: Node2vec: Scalable feature learning for networks. In: KDD, pp. 855–864 (2016)

10.

Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS, pp. 1025–1035 (2017)

11.

He, H., Singh, A.K.: Graphs-at-a-time: Query language and access methods for graph databases. In: SIGMOD, pp. 405–418 (2008)

12.

Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4(11), 1111–1122 (2011)CrossRef

13.

Hu, X., Tao, Y., Chung, C.W.: Massive graph triangulation. In: SIGMOD, p. 325–336 (2013)

14.

Huang, J., Venkatraman, K., Abadi, D.J.: Query optimization of distributed pattern matching. In: ICDE, pp. 64–75 (2014)

15.

Kyrola, A.: Drunkardmob: Billions of random walks on just a pc. In: RecSys, pp. 257–264 (2013)

16.

Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond: The Science of Search Engine Rankings, Chapter The Mathematics Guide. Princeton University Press, Princeton (2011)MATH

17.

Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407(1–3), 458–473 (2008)MathSciNetCrossRef

18.

Li, R.H., Yu, J.X., Qin, L., Mao, R., Jin, T.: On random walk based graph sampling. In: ICDE, pp. 927–938 (2015)

19.

Li, X., Zhuang, Y., Fu, Y., He, X.: A trust-aware random walk model for return propensity estimation and consumer anomaly scoring in online shopping. Sci. China Inf. Sci. 62(5), 52101 (2019)CrossRef

20.

Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks. In: CIKM, pp. 556–559 (2003)

21.

Lim, S., Ryu, S., Kwon, S., Jung, K., Lee, J.G.: Linkscan*: Overlapping community detection using the link-space transformation. In: ICDE, pp. 292–303 (2014)

22.

Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. Proc. VLDB Endow. 9(12), 1005–1016 (2016)CrossRef

23.

Lombardo, G., Poggi, A.: A scalable and distributed actor-based version of the node2vec algorithm. In: WOA (2019)

24.

Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)

25.

Marsaglia, G.: Generating discrete random variables in a computer. Commun. ACM 6(1), 37–38 (1963)CrossRef

26.

Martin, R., et al.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014)CrossRef

27.

Nazi, A., Zhou, Z., Thirumuruganathan, S., Zhang, N., Das, G.: Walk, not wait: faster sampling over online social networks. Proc. VLDB Endow. 8(6), 678–689 (2015)CrossRef

28.

Peng, H., Li, J., Yan, H., Gong, Q., Wang, S., Liu, L., Wang, L., Ren, X.: Dynamic network embedding via incremental skip-gram with negative sampling. Sci. China Inf. Sci. 63(10), 1–19 (2020)MathSciNet

29.

Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: KDD, pp. 701–710 (2014)

30.

Pisinger, D.: A minimal algorithm for the multiple-choice knapsack problem. Eur. J. Op. Res. 83(2), 394–410 (1995)MathSciNetCrossRef

31.

Raftery, A.E.: A model for high-order markov chains. J. R. Stat. Soc. Ser. B 47(3), 528–539 (1985)MathSciNetMATH

32.

Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Publishing Company, New York (2010)MATH

33.

Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2003)CrossRef

34.

Salnikov, V., Schaub, M.T., Lambiotte, R.: Using higher-order markov models to reveal flow-based communities in networks. Sci. Rep. 5(23194), 1–13 (2016)

35.

Sengupta, N., Bagchi, A., Ramanath, M., Bedathur, S.: Arrow: Approximating reachability using random walks over web-scale graphs. In: ICDE, pp. 470–481 (2019)

36.

Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for simrank over large dynamic graphs. Proc. VLDB Endow. 8(8), 838–849 (2015)CrossRef

37.

Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., Xu, N.: Parallel subgraph listing in a large-scale graph. In: SIGMOD, pp. 625–636 (2014)

38.

Shao, Y., Huang, S., Miao, X., Cui, B., Chen, L.: Memory-aware framework for efficient second-order random walk on large graphs. In: SIGMOD, pp. 1797–1812 (2020)

39.

Sinha, P., Zoltners, A.A.: The multiple-choice knapsack problem. Op. Res. 27(3), 503–515 (1979)MathSciNetCrossRef

40.

Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009)CrossRef

41.

Tsitsulin, A., Mottin, D., Karras, P., Müller, E.: Verse: Versatile graph embeddings from similarity measures. In: WWW, pp. 539–548 (2018)

42.

Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 2007, 1–13 (2007)CrossRef

43.

Walker, A.J.: An efficient method for generating discrete random variables with general distributions. ACM Trans. Math. Softw. 3(3), 253–256 (1977)CrossRef

44.

Wang, R., Li, Y., Xie, H., Xu, Y., Lui, J.C.S.: Graphwalker: An i/o-efficient and resource-friendly graph analytic system for fast and scalable random walks. In: ATC, pp. 559–571 (2020)

45.

Wu, Y., Bian, Y., Zhang, X.: Remember where you came from: on the second-order random walk based proximity measures. Proc. VLDB Endow. 10(1), 13–24 (2016)CrossRef

46.

Xu, J., Wickramarathne, T., Chawla, N.V.: Representing higher-order dependencies in networks. In: Sci. Adv. (2016)

47.

Yang, K., Zhang, M., Chen, K., Ma, X., Bai, Y., Jiang, Y.: Knightking: a fast distributed graph random walk engine. In: SOSP, pp. 524–537 (2019)

48.

Zemel, E.: The linear multiple choice knapsack problem. Op. Res. 28(6), 1412–1423 (1980)MathSciNetCrossRef

49.

Zhao, P., Han, J.: On graph query optimization in large networks. Proc. VLDB Endow. 3(1–2), 340–351 (2010)CrossRef

50.

Zhou, D., Niu, S., Chen, S.: Efficient graph computation for node2vec. CoRR abs/1805.00280 (2018)

Titel: Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs
verfasst von: Yingxia Shao
Shiyue Huang
Yawen Li
Xupeng Miao
Bin Cui
Lei Chen
Publikationsdatum: 07.05.2021
Verlag: Springer Berlin Heidelberg
Erschienen in: The VLDB Journal / Ausgabe 5/2021
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI: https://doi.org/10.1007/s00778-021-00669-2

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 5/2021

Formal semantics and high performance in declarative machine learning using Datalog

Correction to: Internal and external memory set containment join

Tidy Tuples and Flying Start: fast compilation and fast execution of relational queries in Umbra

Efficient Hop-constrained s-t Simple Path Enumeration

I/O efficient k-truss community search in massive graphs

Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems

Premium Partner