Skip to main content
Erschienen in: Knowledge and Information Systems 1/2019

15.03.2018 | Regular Paper

Fast crawling methods of exploring content distributed over large graphs

verfasst von: Pinghui Wang, Junzhou Zhao, John C. S. Lui, Don Towsley, Xiaohong Guan

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Despite recent effort to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to develop a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires to sample most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators: special copy estimator (SCE) and weighted copy estimator (WCE) to estimate content characteristics using available information in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-word and synthetic datasets, and the results show that WCE and SCE are cost effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
1.
Zurück zum Zitat Achlioptas D et al (2005) On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In: STOC, pp 694–703 Achlioptas D et al (2005) On the bias of traceroute sampling or, power-law degree distributions in regular graphs. In: STOC, pp 694–703
2.
Zurück zum Zitat Ahmed N et al (2014) Graph sample and hold: a framework for big-graph analytics. In: SIGKDD, pp 1446–1455 Ahmed N et al (2014) Graph sample and hold: a framework for big-graph analytics. In: SIGKDD, pp 1446–1455
3.
Zurück zum Zitat Avrachenkov K et al (2010) Improving random walk estimation accuracy with uniform restarts. In: WAW, pp 98–109 Avrachenkov K et al (2010) Improving random walk estimation accuracy with uniform restarts. In: WAW, pp 98–109
4.
Zurück zum Zitat Bar-Yossef Z et al (2002) Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp 623–632 Bar-Yossef Z et al (2002) Reductions in streaming algorithms, with an application to counting triangles in graphs. In: SODA, pp 623–632
5.
Zurück zum Zitat Becchetti L et al (2010) Efficient algorithms for large-scale local triangle counting. TKDD 4(3):13:1–13:28CrossRef Becchetti L et al (2010) Efficient algorithms for large-scale local triangle counting. TKDD 4(3):13:1–13:28CrossRef
6.
Zurück zum Zitat Bhuiyan MA et al (2012) Guise: uniform sampling of graphlets for large graph analysis. In: ICDM, pp. 91–100 Bhuiyan MA et al (2012) Guise: uniform sampling of graphlets for large graph analysis. In: ICDM, pp. 91–100
8.
Zurück zum Zitat Buriol LS et al (2006) Counting triangles in data streams. In: PODS, pp 253–262 Buriol LS et al (2006) Counting triangles in data streams. In: PODS, pp 253–262
9.
Zurück zum Zitat Chen X et al (2017) A general framework for estimating graphlet statistics via random walk. In: PVLDB, pp 253–264 Chen X et al (2017) A general framework for estimating graphlet statistics via random walk. In: PVLDB, pp 253–264
10.
Zurück zum Zitat Chib S, Greenberg E (1995) Understanding the metropolis-hastings algorithm. Am. Stat. 49(4):327–335 Chib S, Greenberg E (1995) Understanding the metropolis-hastings algorithm. Am. Stat. 49(4):327–335
11.
Zurück zum Zitat Dasgupta A et al (2012) Social sampling. In: SIGKDD, pp 235–243 Dasgupta A et al (2012) Social sampling. In: SIGKDD, pp 235–243
12.
Zurück zum Zitat Duffield N et al (2003) Estimating flow distributions from sampled flow statistics. In: SIGCOMM, pp 325–336 Duffield N et al (2003) Estimating flow distributions from sampled flow statistics. In: SIGCOMM, pp 325–336
13.
Zurück zum Zitat Gjoka M et al (2011) Multigraph sampling of online social networks. JSAC 29(9):1893–1905 Gjoka M et al (2011) Multigraph sampling of online social networks. JSAC 29(9):1893–1905
14.
Zurück zum Zitat Gjoka M et al (2010) Walking in facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, pp 2498–2506 Gjoka M et al (2010) Walking in facebook: a case study of unbiased sampling of OSNs. In: INFOCOM, pp 2498–2506
15.
Zurück zum Zitat Goldenberg J et al (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark. Lett. 12(3):211–223CrossRef Goldenberg J et al (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Mark. Lett. 12(3):211–223CrossRef
16.
17.
Zurück zum Zitat Heckathorn DD (2002) Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 49(1):11–34CrossRef Heckathorn DD (2002) Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Soc Probl 49(1):11–34CrossRef
18.
19.
Zurück zum Zitat Jha M et al (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: SIGKDD, pp 589–597 Jha M et al (2013) A space efficient streaming algorithm for triangle counting using the birthday paradox. In: SIGKDD, pp 589–597
20.
Zurück zum Zitat Jowhari H, Ghodsi M (2005) New streaming algorithms for counting triangles in graphs. In: COCOON, pp 710–716 Jowhari H, Ghodsi M (2005) New streaming algorithms for counting triangles in graphs. In: COCOON, pp 710–716
21.
Zurück zum Zitat Kashtan N et al (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758CrossRef Kashtan N et al (2004) Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746–1758CrossRef
22.
Zurück zum Zitat Katzir L et al (2011) Estimating sizes of social networks via biased sampling. In: WWW, pp 597–606 Katzir L et al (2011) Estimating sizes of social networks via biased sampling. In: WWW, pp 597–606
23.
Zurück zum Zitat Kurant M et al (2011) Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: SIGMETRICS, pp 281–292 Kurant M et al (2011) Walking on a graph with a magnifying glass: stratified sampling via weighted random walks. In: SIGMETRICS, pp 281–292
24.
Zurück zum Zitat Kurant M et al (2012) Coarse-grained topology estimation via graph sampling. In: WOSN, pp. 25–30 Kurant M et al (2012) Coarse-grained topology estimation via graph sampling. In: WOSN, pp. 25–30
25.
Zurück zum Zitat Kurant M et al (2010) On the bias of bfs (breadth first search) and of other graph sampling techniques. In: ITC, pp 1–9 Kurant M et al (2010) On the bias of bfs (breadth first search) and of other graph sampling techniques. In: ITC, pp 1–9
26.
Zurück zum Zitat Kurant M et al (2011) Towards unbiased bfs sampling. JSAC 29(9):1799–1809 Kurant M et al (2011) Towards unbiased bfs sampling. JSAC 29(9):1799–1809
27.
Zurück zum Zitat Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: WSDM, pp 677–686 Kutzkov K, Pagh R (2013) On the streaming complexity of computing local clustering coefficients. In: WSDM, pp 677–686
28.
Zurück zum Zitat Li Z et al (2012) Socialtube: P2P-assisted video sharing in online social networks. In: INFOCOM mini conference, pp 2886–2890 Li Z et al (2012) Socialtube: P2P-assisted video sharing in online social networks. In: INFOCOM mini conference, pp 2886–2890
29.
Zurück zum Zitat Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: SIGKDD, pp 685–694 Lim Y, Kang U (2015) MASCOT: memory-efficient and accurate sampling for counting local triangles in graph streams. In: SIGKDD, pp 685–694
30.
Zurück zum Zitat Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2:1–46 Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2:1–46
31.
Zurück zum Zitat Malandrino F et al (2012) Proactive seeding for information cascades in cellular networks. In: INFOCOM, pp 2886–2890 Malandrino F et al (2012) Proactive seeding for information cascades in cellular networks. In: INFOCOM, pp 2886–2890
32.
Zurück zum Zitat Metropolis N et al (1953) Equations of state calculations by fast computing machines. JSAC 21(6):1087–1092 Metropolis N et al (1953) Equations of state calculations by fast computing machines. JSAC 21(6):1087–1092
33.
Zurück zum Zitat Mislove A et al (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42 Mislove A et al (2007) Measurement and analysis of online social networks. In: IMC, pp 29–42
34.
Zurück zum Zitat Mohaisen A et al (2010) Measuring the mixing time of social graphs. In: IMC, pp 390–403 Mohaisen A et al (2010) Measuring the mixing time of social graphs. In: IMC, pp 390–403
35.
Zurück zum Zitat Murai F et al (2012) On set size distribution estimation and the characterization of large networks via sampling. JSAC 31(6):1017–1025 Murai F et al (2012) On set size distribution estimation and the characterization of large networks via sampling. JSAC 31(6):1017–1025
36.
Zurück zum Zitat Omidi S et al (2009) Moda: an efficient algorithm for network Motif discovery in biological networks. GGS 84(5):385–395 Omidi S et al (2009) Moda: an efficient algorithm for network Motif discovery in biological networks. GGS 84(5):385–395
37.
Zurück zum Zitat Pavany A et al (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881 Pavany A et al (2013) Counting and sampling triangles from a graph stream. In: PVLDB, pp 1870–1881
38.
Zurück zum Zitat Rasti AH et al (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM mini-conference, pp 2701–2705 Rasti AH et al (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM mini-conference, pp 2701–2705
39.
Zurück zum Zitat Ribeiro B et al (2010) On MySpace account spans and double Pareto-like distribution of friends. In: NetSciCom, pp 1–6 Ribeiro B et al (2010) On MySpace account spans and double Pareto-like distribution of friends. In: NetSciCom, pp 1–6
40.
Zurück zum Zitat Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: IMC, pp 390–403 Ribeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: IMC, pp 390–403
41.
Zurück zum Zitat Ribeiro B et al (2012) Sampling directed graphs with random walks. In: INFOCOM, pp 1692–1700 Ribeiro B et al (2012) Sampling directed graphs with random walks. In: INFOCOM, pp 1692–1700
42.
Zurück zum Zitat Salganik MJ, Heckathorn DD (2004) Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 34:193–239CrossRef Salganik MJ, Heckathorn DD (2004) Sampling and estimation in hidden populations using respondent-driven sampling. Sociol Methodol 34:193–239CrossRef
43.
Zurück zum Zitat Seshadhri C et al (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307MathSciNetCrossRef Seshadhri C et al (2014) Wedge sampling for computing clustering coefficients and triangle counts on large graphs. Stat Anal Data Min 7(4):294–307MathSciNetCrossRef
44.
Zurück zum Zitat Stefani LD et al (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: SIGKDD, pp 825–834 Stefani LD et al (2016) Trièst: counting local and global triangles in fully-dynamic streams with fixed memory size. In: SIGKDD, pp 825–834
45.
Zurück zum Zitat Stutzbach D et al (2009) On unbiased sampling for unstructured peer-to-peer networks. TON 17(2):377–390 Stutzbach D et al (2009) On unbiased sampling for unstructured peer-to-peer networks. TON 17(2):377–390
46.
Zurück zum Zitat Suh B et al (2010) Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In: SocialCom, pp 177–184 Suh B et al (2010) Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In: SocialCom, pp 177–184
47.
Zurück zum Zitat Tsourakakis CE et al (2009) Doulion: counting triangles in massive graphs with a coin. In: KDD, pp 837–846 Tsourakakis CE et al (2009) Doulion: counting triangles in massive graphs with a coin. In: KDD, pp 837–846
48.
Zurück zum Zitat Wang P et al (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27CrossRef Wang P et al (2014) Efficiently estimating motif statistics of large networks. TKDD 9(2):8:1–8:27CrossRef
49.
Zurück zum Zitat Wang P et al (2016) Minfer: a method of inferring motif statistics from sampled edges. In: ICDE, pp 1050–1061 Wang P et al (2016) Minfer: a method of inferring motif statistics from sampled edges. In: ICDE, pp 1050–1061
50.
Zurück zum Zitat Wernicke S (2006) Efficient detection of network motifs. TCBB 3(4):347–359 Wernicke S (2006) Efficient detection of network motifs. TCBB 3(4):347–359
51.
Zurück zum Zitat Wu B et al (2016) Counting triangles in large graphs by random sampling. TKDE 28(8):2013–2026 Wu B et al (2016) Counting triangles in large graphs by random sampling. TKDE 28(8):2013–2026
52.
Zurück zum Zitat Yang M et al (2004) Deployment of a large-scale peer-to-peer social network. In: WORLDS, pp 1–6 Yang M et al (2004) Deployment of a large-scale peer-to-peer social network. In: WORLDS, pp 1–6
53.
Zurück zum Zitat Zafar MB et al (2015) Sampling content from online social networks: comparing random versus expert sampling of the twitter stream. TWEB 9(3):12:1–12:33CrossRef Zafar MB et al (2015) Sampling content from online social networks: comparing random versus expert sampling of the twitter stream. TWEB 9(3):12:1–12:33CrossRef
54.
Zurück zum Zitat Zhong M, Shen K (2006) Random walk based node sampling in self-organizing networks. SIGOPS Oper Syst Rev 40(3):49–55MathSciNetCrossRef Zhong M, Shen K (2006) Random walk based node sampling in self-organizing networks. SIGOPS Oper Syst Rev 40(3):49–55MathSciNetCrossRef
55.
Zurück zum Zitat Zhou Z et al (2013) Faster random walks by rewiring online social networks on-the-fly. TODS 40(4):26:1–26:36MathSciNet Zhou Z et al (2013) Faster random walks by rewiring online social networks on-the-fly. TODS 40(4):26:1–26:36MathSciNet
Metadaten
Titel
Fast crawling methods of exploring content distributed over large graphs
verfasst von
Pinghui Wang
Junzhou Zhao
John C. S. Lui
Don Towsley
Xiaohong Guan
Publikationsdatum
15.03.2018
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2019
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-018-1178-x

Weitere Artikel der Ausgabe 1/2019

Knowledge and Information Systems 1/2019 Zur Ausgabe