Skip to main content
Erschienen in: The VLDB Journal 5/2021

07.05.2021 | Regular Paper

Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs

verfasst von: Yingxia Shao, Shiyue Huang, Yawen Li, Xupeng Miao, Bin Cui, Lei Chen

Erschienen in: The VLDB Journal | Ausgabe 5/2021

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Second-order random walk is an important technique for graph analysis. Many applications including graph embedding, proximity measure and community detection use it to capture higher-order patterns in the graph, thus improving the model accuracy. However, the memory explosion problem of this technique hinders it from analyzing large graphs. When processing a billion-edge graph like Twitter, existing solutions (e.g., alias method) of the second-order random walk may take up 1796TB memory. Such high memory consumption comes from the memory-unaware strategies for the node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling the skewed probability distributions existed in natural graphs. Second, to achieve the high efficiency of the second-order random walk within arbitrary memory budgets, we propose a novel memory-aware framework on the basis of the cost model. The framework applies a cost-based optimizer to assign desirable node sampling method for each node or edge in the graph within a memory budget meanwhile minimizing the time cost of the random walk. Finally, the framework provides general programming interfaces for users to define new second-order random walk models easily. The empirical studies demonstrate that our memory-aware framework is robust with respect to memory and is able to achieve considerable efficiency by reducing 90% of the memory cost.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
2
Note that the coefficient c is incurred by finding the edge id between previous node u and current node v to access the group information.
 
8
Note that the minimal memory of rejection method is different from the one in our conference version, because we store the number of common neighbors of edges in memory for fast computing the exact bounding constant, thus improving the efficiency of rejection method over billion-edge graphs.
 
Literatur
1.
Zurück zum Zitat Boldi, P., Rosa, M.: Arc-community detection via triangular random walks. In: 2012 Eighth Latin American Web Congress, pp. 48–56 (2012) Boldi, P., Rosa, M.: Arc-community detection via triangular random walks. In: 2012 Eighth Latin American Web Congress, pp. 48–56 (2012)
2.
Zurück zum Zitat Bonner, S., Kureshi, I., Brennan, J., Theodoropoulos, G., McGough, A.S., Obara, B.: Exploring the semantic content of unsupervised graph embeddings: an empirical study. Data Sci. Eng. 4(3), 269–289 (2019)CrossRef Bonner, S., Kureshi, I., Brennan, J., Theodoropoulos, G., McGough, A.S., Obara, B.: Exploring the semantic content of unsupervised graph embeddings: an empirical study. Data Sci. Eng. 4(3), 269–289 (2019)CrossRef
3.
Zurück zum Zitat Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998) Chaudhuri, S.: An overview of query optimization in relational systems. In: PODS, pp. 34–43 (1998)
4.
Zurück zum Zitat Das Sarma, A., Molla, A.R., Pandurangan, G.: Efficient random walk sampling in distributed networks. J. Parallel Distrib. Comput. 77, 84–94 (2015)CrossRef Das Sarma, A., Molla, A.R., Pandurangan, G.: Efficient random walk sampling in distributed networks. J. Parallel Distrib. Comput. 77, 84–94 (2015)CrossRef
5.
Zurück zum Zitat Dave, V.S., Zhang, B., Chen, P.Y., Hasan, M.A.: Neural-brane: neural Bayesian personalized ranking for attributed network embedding. Data Sci. Eng. 4(2), 119–131 (2019)CrossRef Dave, V.S., Zhang, B., Chen, P.Y., Hasan, M.A.: Neural-brane: neural Bayesian personalized ranking for attributed network embedding. Data Sci. Eng. 4(2), 119–131 (2019)CrossRef
6.
Zurück zum Zitat Dudzinski, K., Walukiewicz, S.: Exact methods for the knapsack problem and its generalizations. Eur. J. Op. Res. 28(1), 3–21 (1987)MathSciNetCrossRef Dudzinski, K., Walukiewicz, S.: Exact methods for the knapsack problem and its generalizations. Eur. J. Op. Res. 28(1), 3–21 (1987)MathSciNetCrossRef
7.
Zurück zum Zitat Feng, S., Cong, G., Khan, A., Li, X., Liu, Y., Chee, Y.M.: Inf2vec: Latent representation model for social influence embedding. In: ICDE, pp. 941–952 (2018) Feng, S., Cong, G., Khan, A., Li, X., Liu, Y., Chee, Y.M.: Inf2vec: Latent representation model for social influence embedding. In: ICDE, pp. 941–952 (2018)
8.
Zurück zum Zitat Grimmett, G., Stirzaker, D.: Probability and Random Processes, vol. 80. Oxford University Press, Oxford (2001)MATH Grimmett, G., Stirzaker, D.: Probability and Random Processes, vol. 80. Oxford University Press, Oxford (2001)MATH
9.
Zurück zum Zitat Grover, A., Leskovec, J.: Node2vec: Scalable feature learning for networks. In: KDD, pp. 855–864 (2016) Grover, A., Leskovec, J.: Node2vec: Scalable feature learning for networks. In: KDD, pp. 855–864 (2016)
10.
Zurück zum Zitat Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS, pp. 1025–1035 (2017) Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: NIPS, pp. 1025–1035 (2017)
11.
Zurück zum Zitat He, H., Singh, A.K.: Graphs-at-a-time: Query language and access methods for graph databases. In: SIGMOD, pp. 405–418 (2008) He, H., Singh, A.K.: Graphs-at-a-time: Query language and access methods for graph databases. In: SIGMOD, pp. 405–418 (2008)
12.
Zurück zum Zitat Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4(11), 1111–1122 (2011)CrossRef Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow. 4(11), 1111–1122 (2011)CrossRef
13.
Zurück zum Zitat Hu, X., Tao, Y., Chung, C.W.: Massive graph triangulation. In: SIGMOD, p. 325–336 (2013) Hu, X., Tao, Y., Chung, C.W.: Massive graph triangulation. In: SIGMOD, p. 325–336 (2013)
14.
Zurück zum Zitat Huang, J., Venkatraman, K., Abadi, D.J.: Query optimization of distributed pattern matching. In: ICDE, pp. 64–75 (2014) Huang, J., Venkatraman, K., Abadi, D.J.: Query optimization of distributed pattern matching. In: ICDE, pp. 64–75 (2014)
15.
Zurück zum Zitat Kyrola, A.: Drunkardmob: Billions of random walks on just a pc. In: RecSys, pp. 257–264 (2013) Kyrola, A.: Drunkardmob: Billions of random walks on just a pc. In: RecSys, pp. 257–264 (2013)
16.
Zurück zum Zitat Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond: The Science of Search Engine Rankings, Chapter The Mathematics Guide. Princeton University Press, Princeton (2011)MATH Langville, A.N., Meyer, C.D.: Google’s PageRank and Beyond: The Science of Search Engine Rankings, Chapter The Mathematics Guide. Princeton University Press, Princeton (2011)MATH
17.
Zurück zum Zitat Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407(1–3), 458–473 (2008)MathSciNetCrossRef Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci. 407(1–3), 458–473 (2008)MathSciNetCrossRef
18.
Zurück zum Zitat Li, R.H., Yu, J.X., Qin, L., Mao, R., Jin, T.: On random walk based graph sampling. In: ICDE, pp. 927–938 (2015) Li, R.H., Yu, J.X., Qin, L., Mao, R., Jin, T.: On random walk based graph sampling. In: ICDE, pp. 927–938 (2015)
19.
Zurück zum Zitat Li, X., Zhuang, Y., Fu, Y., He, X.: A trust-aware random walk model for return propensity estimation and consumer anomaly scoring in online shopping. Sci. China Inf. Sci. 62(5), 52101 (2019)CrossRef Li, X., Zhuang, Y., Fu, Y., He, X.: A trust-aware random walk model for return propensity estimation and consumer anomaly scoring in online shopping. Sci. China Inf. Sci. 62(5), 52101 (2019)CrossRef
20.
Zurück zum Zitat Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks. In: CIKM, pp. 556–559 (2003) Liben-Nowell, D., Kleinberg, J.: The link prediction problem for social networks. In: CIKM, pp. 556–559 (2003)
21.
Zurück zum Zitat Lim, S., Ryu, S., Kwon, S., Jung, K., Lee, J.G.: Linkscan*: Overlapping community detection using the link-space transformation. In: ICDE, pp. 292–303 (2014) Lim, S., Ryu, S., Kwon, S., Jung, K., Lee, J.G.: Linkscan*: Overlapping community detection using the link-space transformation. In: ICDE, pp. 292–303 (2014)
22.
Zurück zum Zitat Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. Proc. VLDB Endow. 9(12), 1005–1016 (2016)CrossRef Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. Proc. VLDB Endow. 9(12), 1005–1016 (2016)CrossRef
23.
Zurück zum Zitat Lombardo, G., Poggi, A.: A scalable and distributed actor-based version of the node2vec algorithm. In: WOA (2019) Lombardo, G., Poggi, A.: A scalable and distributed actor-based version of the node2vec algorithm. In: WOA (2019)
24.
Zurück zum Zitat Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010) Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)
25.
Zurück zum Zitat Marsaglia, G.: Generating discrete random variables in a computer. Commun. ACM 6(1), 37–38 (1963)CrossRef Marsaglia, G.: Generating discrete random variables in a computer. Commun. ACM 6(1), 37–38 (1963)CrossRef
26.
Zurück zum Zitat Martin, R., et al.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014)CrossRef Martin, R., et al.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014)CrossRef
27.
Zurück zum Zitat Nazi, A., Zhou, Z., Thirumuruganathan, S., Zhang, N., Das, G.: Walk, not wait: faster sampling over online social networks. Proc. VLDB Endow. 8(6), 678–689 (2015)CrossRef Nazi, A., Zhou, Z., Thirumuruganathan, S., Zhang, N., Das, G.: Walk, not wait: faster sampling over online social networks. Proc. VLDB Endow. 8(6), 678–689 (2015)CrossRef
28.
Zurück zum Zitat Peng, H., Li, J., Yan, H., Gong, Q., Wang, S., Liu, L., Wang, L., Ren, X.: Dynamic network embedding via incremental skip-gram with negative sampling. Sci. China Inf. Sci. 63(10), 1–19 (2020)MathSciNet Peng, H., Li, J., Yan, H., Gong, Q., Wang, S., Liu, L., Wang, L., Ren, X.: Dynamic network embedding via incremental skip-gram with negative sampling. Sci. China Inf. Sci. 63(10), 1–19 (2020)MathSciNet
29.
Zurück zum Zitat Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: KDD, pp. 701–710 (2014) Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: KDD, pp. 701–710 (2014)
30.
Zurück zum Zitat Pisinger, D.: A minimal algorithm for the multiple-choice knapsack problem. Eur. J. Op. Res. 83(2), 394–410 (1995)MathSciNetCrossRef Pisinger, D.: A minimal algorithm for the multiple-choice knapsack problem. Eur. J. Op. Res. 83(2), 394–410 (1995)MathSciNetCrossRef
31.
Zurück zum Zitat Raftery, A.E.: A model for high-order markov chains. J. R. Stat. Soc. Ser. B 47(3), 528–539 (1985)MathSciNetMATH Raftery, A.E.: A model for high-order markov chains. J. R. Stat. Soc. Ser. B 47(3), 528–539 (1985)MathSciNetMATH
32.
Zurück zum Zitat Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Publishing Company, New York (2010)MATH Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer Publishing Company, New York (2010)MATH
33.
Zurück zum Zitat Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2003)CrossRef Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA (2003)CrossRef
34.
Zurück zum Zitat Salnikov, V., Schaub, M.T., Lambiotte, R.: Using higher-order markov models to reveal flow-based communities in networks. Sci. Rep. 5(23194), 1–13 (2016) Salnikov, V., Schaub, M.T., Lambiotte, R.: Using higher-order markov models to reveal flow-based communities in networks. Sci. Rep. 5(23194), 1–13 (2016)
35.
Zurück zum Zitat Sengupta, N., Bagchi, A., Ramanath, M., Bedathur, S.: Arrow: Approximating reachability using random walks over web-scale graphs. In: ICDE, pp. 470–481 (2019) Sengupta, N., Bagchi, A., Ramanath, M., Bedathur, S.: Arrow: Approximating reachability using random walks over web-scale graphs. In: ICDE, pp. 470–481 (2019)
36.
Zurück zum Zitat Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for simrank over large dynamic graphs. Proc. VLDB Endow. 8(8), 838–849 (2015)CrossRef Shao, Y., Cui, B., Chen, L., Liu, M., Xie, X.: An efficient similarity search framework for simrank over large dynamic graphs. Proc. VLDB Endow. 8(8), 838–849 (2015)CrossRef
37.
Zurück zum Zitat Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., Xu, N.: Parallel subgraph listing in a large-scale graph. In: SIGMOD, pp. 625–636 (2014) Shao, Y., Cui, B., Chen, L., Ma, L., Yao, J., Xu, N.: Parallel subgraph listing in a large-scale graph. In: SIGMOD, pp. 625–636 (2014)
38.
Zurück zum Zitat Shao, Y., Huang, S., Miao, X., Cui, B., Chen, L.: Memory-aware framework for efficient second-order random walk on large graphs. In: SIGMOD, pp. 1797–1812 (2020) Shao, Y., Huang, S., Miao, X., Cui, B., Chen, L.: Memory-aware framework for efficient second-order random walk on large graphs. In: SIGMOD, pp. 1797–1812 (2020)
39.
40.
Zurück zum Zitat Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009)CrossRef Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2(2), 1626–1629 (2009)CrossRef
41.
Zurück zum Zitat Tsitsulin, A., Mottin, D., Karras, P., Müller, E.: Verse: Versatile graph embeddings from similarity measures. In: WWW, pp. 539–548 (2018) Tsitsulin, A., Mottin, D., Karras, P., Müller, E.: Verse: Versatile graph embeddings from similarity measures. In: WWW, pp. 539–548 (2018)
42.
Zurück zum Zitat Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 2007, 1–13 (2007)CrossRef Tsoumakas, G., Katakis, I.: Multi-label classification: an overview. Int. J. Data Warehous. Min. 2007, 1–13 (2007)CrossRef
43.
Zurück zum Zitat Walker, A.J.: An efficient method for generating discrete random variables with general distributions. ACM Trans. Math. Softw. 3(3), 253–256 (1977)CrossRef Walker, A.J.: An efficient method for generating discrete random variables with general distributions. ACM Trans. Math. Softw. 3(3), 253–256 (1977)CrossRef
44.
Zurück zum Zitat Wang, R., Li, Y., Xie, H., Xu, Y., Lui, J.C.S.: Graphwalker: An i/o-efficient and resource-friendly graph analytic system for fast and scalable random walks. In: ATC, pp. 559–571 (2020) Wang, R., Li, Y., Xie, H., Xu, Y., Lui, J.C.S.: Graphwalker: An i/o-efficient and resource-friendly graph analytic system for fast and scalable random walks. In: ATC, pp. 559–571 (2020)
45.
Zurück zum Zitat Wu, Y., Bian, Y., Zhang, X.: Remember where you came from: on the second-order random walk based proximity measures. Proc. VLDB Endow. 10(1), 13–24 (2016)CrossRef Wu, Y., Bian, Y., Zhang, X.: Remember where you came from: on the second-order random walk based proximity measures. Proc. VLDB Endow. 10(1), 13–24 (2016)CrossRef
46.
Zurück zum Zitat Xu, J., Wickramarathne, T., Chawla, N.V.: Representing higher-order dependencies in networks. In: Sci. Adv. (2016) Xu, J., Wickramarathne, T., Chawla, N.V.: Representing higher-order dependencies in networks. In: Sci. Adv. (2016)
47.
Zurück zum Zitat Yang, K., Zhang, M., Chen, K., Ma, X., Bai, Y., Jiang, Y.: Knightking: a fast distributed graph random walk engine. In: SOSP, pp. 524–537 (2019) Yang, K., Zhang, M., Chen, K., Ma, X., Bai, Y., Jiang, Y.: Knightking: a fast distributed graph random walk engine. In: SOSP, pp. 524–537 (2019)
49.
Zurück zum Zitat Zhao, P., Han, J.: On graph query optimization in large networks. Proc. VLDB Endow. 3(1–2), 340–351 (2010)CrossRef Zhao, P., Han, J.: On graph query optimization in large networks. Proc. VLDB Endow. 3(1–2), 340–351 (2010)CrossRef
50.
Zurück zum Zitat Zhou, D., Niu, S., Chen, S.: Efficient graph computation for node2vec. CoRR abs/1805.00280 (2018) Zhou, D., Niu, S., Chen, S.: Efficient graph computation for node2vec. CoRR abs/1805.00280 (2018)
Metadaten
Titel
Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphs
verfasst von
Yingxia Shao
Shiyue Huang
Yawen Li
Xupeng Miao
Bin Cui
Lei Chen
Publikationsdatum
07.05.2021
Verlag
Springer Berlin Heidelberg
Erschienen in
The VLDB Journal / Ausgabe 5/2021
Print ISSN: 1066-8888
Elektronische ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-021-00669-2

Weitere Artikel der Ausgabe 5/2021

The VLDB Journal 5/2021 Zur Ausgabe

Premium Partner