Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 4/2020

18.03.2020

Guided sampling for large graphs

verfasst von: Muhammad Irfan Yousuf, Suhyun Kim

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 4/2020

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Large real-world graphs claim lots of resources in terms of memory and computational power to study them and this makes their full analysis extremely challenging. In order to understand the structure and properties of these graphs, we intend to extract a small representative subgraph from a big graph while preserving its topology and characteristics. In this work, we aim at producing good samples with sample size as low as 0.1% while maintaining the structure and some of the key properties of a network. We exploit the fact that average values of degree and clustering coefficient of a graph can be estimated accurately and efficiently. We use the estimated values to guide the sampling process and extract tiny samples that preserve the properties of the graph and closely approximate their distributions in the original graph. The distinguishing feature of our work is that we apply traversal based sampling that utilizes only the local information of nodes as opposed to the global information of the network and this makes our approach a practical choice for crawling online networks. We evaluate the effectiveness of our sampling technique using real-world datasets and show that it surpasses the existing methods.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Ahmed N, Neville J, Kompella RR (2011) Network sampling via edge-based node selection with graph induction. Technical Report 11-016, Purdue Digital Library Ahmed N, Neville J, Kompella RR (2011) Network sampling via edge-based node selection with graph induction. Technical Report 11-016, Purdue Digital Library
Zurück zum Zitat Ahn Y, Han S, Kwak H, Moon S, Jeong H (2007) Analysis of topological characteristics of huge online social networking services. In: Proceedings of WWW, pp 835–844 Ahn Y, Han S, Kwak H, Moon S, Jeong H (2007) Analysis of topological characteristics of huge online social networking services. In: Proceedings of WWW, pp 835–844
Zurück zum Zitat Al Hasan M, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730–741CrossRef Al Hasan M, Zaki MJ (2009) Output space sampling for graph patterns. Proc VLDB Endow 2(1):730–741CrossRef
Zurück zum Zitat Becchetti L, Castillo C, Donato D, Fazzone A (2006) A comparison of sampling techniques for web graph characterization. In: LinkKDD Becchetti L, Castillo C, Donato D, Fazzone A (2006) A comparison of sampling techniques for web graph characterization. In: LinkKDD
Zurück zum Zitat Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 10:P10008 CrossRef Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 10:P10008 CrossRef
Zurück zum Zitat Chen B, Liu L, Jia H, Zhang Y (2017a) Reducing repetition rate: unbiased delay sampling in online social networks. Recent Pat Comput Sci 10(4):308–314CrossRef Chen B, Liu L, Jia H, Zhang Y (2017a) Reducing repetition rate: unbiased delay sampling in online social networks. Recent Pat Comput Sci 10(4):308–314CrossRef
Zurück zum Zitat Chen Y, Ding C, Hu J, Chen R, Hui P, Fu X (2017b) Building and analyzing a global co-authorship network using google scholar data. In: Proceedings of 26th international World Wide Web conference (WWW 2017) Companion Chen Y, Ding C, Hu J, Chen R, Hui P, Fu X (2017b) Building and analyzing a global co-authorship network using google scholar data. In: Proceedings of 26th international World Wide Web conference (WWW 2017) Companion
Zurück zum Zitat Chepuri SP, Leus G (2017) Graph sampling for covariance estimation. IEEE Trans Signal Inf Process Over Netw 3:451–466MathSciNetCrossRef Chepuri SP, Leus G (2017) Graph sampling for covariance estimation. IEEE Trans Signal Inf Process Over Netw 3:451–466MathSciNetCrossRef
Zurück zum Zitat Chiericetti F, Dasgupta A, Kumar R, Lattanzi S, Sarlós T (2016) On sampling nodes in a network. In: Proceedings of the 25th international conference on World Wide Web, WWW ’16, pp 471–481 Chiericetti F, Dasgupta A, Kumar R, Lattanzi S, Sarlós T (2016) On sampling nodes in a network. In: Proceedings of the 25th international conference on World Wide Web, WWW ’16, pp 471–481
Zurück zum Zitat Doerr C, Blenn N (2013) Metric convergence in social network sampling. In: ACM Hotplanet Doerr C, Blenn N (2013) Metric convergence in social network sampling. In: ACM Hotplanet
Zurück zum Zitat Feige U (1995) A tight upper bound on the cover time for random walks on graphs. Random Struct Algorithms 6:51–54 MathSciNetCrossRef Feige U (1995) A tight upper bound on the cover time for random walks on graphs. Random Struct Algorithms 6:51–54 MathSciNetCrossRef
Zurück zum Zitat Gjoka M, Kurant M, Butts C, Markopoulou A (2010) Walking in facebook: a case study of unbiased sampling of OSNS. In: INFOCOM Gjoka M, Kurant M, Butts C, Markopoulou A (2010) Walking in facebook: a case study of unbiased sampling of OSNS. In: INFOCOM
Zurück zum Zitat Gkantsidis C, Mihail M, Saberi A (2006) Random walks in peer-to-peer networks: algorithms and evaluation. Perform Eval 63(3):241–263CrossRef Gkantsidis C, Mihail M, Saberi A (2006) Random walks in peer-to-peer networks: algorithms and evaluation. Perform Eval 63(3):241–263CrossRef
Zurück zum Zitat Hardiman SJ, Katzir L (2013) Estimating clustering coefficient and size of social networks via random walk. In: ACM’s WWW Hardiman SJ, Katzir L (2013) Estimating clustering coefficient and size of social networks via random walk. In: ACM’s WWW
Zurück zum Zitat Hubler C, Kriegel P, Borgwardt KM, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling. In: ICDM Hubler C, Kriegel P, Borgwardt KM, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling. In: ICDM
Zurück zum Zitat Hu P, Lau WC (2014) A survey and taxonomy of graph sampling. In: HONGKONG UNI Hu P, Lau WC (2014) A survey and taxonomy of graph sampling. In: HONGKONG UNI
Zurück zum Zitat Kim N, Laing C, Elmetwaly S, Jung S, Curuksu J, Schlick T (2014) Graph-based sampling for approximating global helical topologies of RNA. Proc Natl Acad Sci 111(11):4079–4084CrossRef Kim N, Laing C, Elmetwaly S, Jung S, Curuksu J, Schlick T (2014) Graph-based sampling for approximating global helical topologies of RNA. Proc Natl Acad Sci 111(11):4079–4084CrossRef
Zurück zum Zitat Lee CH, Xu X, Eun DY (2012) Beyond random walk and metropolis–hastings samplers: why you should not backtrack for unbiased graph sampling. In: Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on measurement and modeling of computer systems, SIGMETRICS ’12, pp 319–330 Lee CH, Xu X, Eun DY (2012) Beyond random walk and metropolis–hastings samplers: why you should not backtrack for unbiased graph sampling. In: Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on measurement and modeling of computer systems, SIGMETRICS ’12, pp 319–330
Zurück zum Zitat Lee S, Kim P, Jeong H (2006) Statistical properties of sampled networks. Phys Rev E 73:016102CrossRef Lee S, Kim P, Jeong H (2006) Statistical properties of sampled networks. Phys Rev E 73:016102CrossRef
Zurück zum Zitat Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: SIGKDD, pp 631–636 Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: SIGKDD, pp 631–636
Zurück zum Zitat Liu L, Wang L, Wu W, Jia H, Zhang Y (2019) A novel hybrid-jump-based sampling method for complex social networks. IEEE Trans Comput Soc Syst 6(2):241–249CrossRef Liu L, Wang L, Wu W, Jia H, Zhang Y (2019) A novel hybrid-jump-based sampling method for complex social networks. IEEE Trans Comput Soc Syst 6(2):241–249CrossRef
Zurück zum Zitat Li R, Yu JX, Qin L, Mao R, Jin T (2015) On random walk based graph sampling. In: 2015 IEEE 31st international conference on data engineering, pp 927–938 Li R, Yu JX, Qin L, Mao R, Jin T (2015) On random walk based graph sampling. In: 2015 IEEE 31st international conference on data engineering, pp 927–938
Zurück zum Zitat Maiya AS, Berger-Wolf TY (2010) Sampling community structure. In: Proceedings of the 19th international conference on World Wide Web, WWW ’10, pp 701–710 Maiya AS, Berger-Wolf TY (2010) Sampling community structure. In: Proceedings of the 19th international conference on World Wide Web, WWW ’10, pp 701–710
Zurück zum Zitat Maiya AC, Berger-Wolf TY (2011) Benefits of bias: towards better characterization of network sampling. In: ACM KDD Maiya AC, Berger-Wolf TY (2011) Benefits of bias: towards better characterization of network sampling. In: ACM KDD
Zurück zum Zitat Najork M, Wiener JL (2001) Breadth-first crawling yields high-quality pages. In: Proceedings of the 10th international conference on World Wide Web, WWW ’01, pp 114–118 Najork M, Wiener JL (2001) Breadth-first crawling yields high-quality pages. In: Proceedings of the 10th international conference on World Wide Web, WWW ’01, pp 114–118
Zurück zum Zitat Rasti AH, Torkjazi M, Rejaie R, Duffield NG, Willinger W, Stutzbach D (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM 2009. 28th IEEE international conference on computer communications, 19–25 April 2009, Rio de Janeiro, Brazil, pp 2701–2705 Rasti AH, Torkjazi M, Rejaie R, Duffield NG, Willinger W, Stutzbach D (2009) Respondent-driven sampling for characterizing unstructured overlays. In: INFOCOM 2009. 28th IEEE international conference on computer communications, 19–25 April 2009, Rio de Janeiro, Brazil, pp 2701–2705
Zurück zum Zitat Ribeeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: ACM internet measurement conference Ribeeiro B, Towsley D (2010) Estimating and sampling graphs with multidimensional random walks. In: ACM internet measurement conference
Zurück zum Zitat Sethu H, Chu X (2012) A new algorithm for extracting a small representative subgraph from a very large graph. arXiv:1207.4825 Sethu H, Chu X (2012) A new algorithm for extracting a small representative subgraph from a very large graph. arXiv:​1207.​4825
Zurück zum Zitat Stutzbach D, Rejaie R, Duffield N, Sen S, Willinger W (2009) On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Trans Netw 17(2):377–390CrossRef Stutzbach D, Rejaie R, Duffield N, Sen S, Willinger W (2009) On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Trans Netw 17(2):377–390CrossRef
Zurück zum Zitat Voudigari E, Salamanos N, Papageorgiou T, Yannakoudakis EJ (2016) Rank degree: an efficient algorithm for graph sampling. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 120–129 Voudigari E, Salamanos N, Papageorgiou T, Yannakoudakis EJ (2016) Rank degree: an efficient algorithm for graph sampling. In: 2016 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 120–129
Zurück zum Zitat Wang T, Chen Y, Zhang Z, Sun P, Deng B, Li X (2010) Unbiased sampling in directed social graph. In: Proceedings of the ACM SIGCOMM 2010 conference, SIGCOMM ’10, pp 401–402 Wang T, Chen Y, Zhang Z, Sun P, Deng B, Li X (2010) Unbiased sampling in directed social graph. In: Proceedings of the ACM SIGCOMM 2010 conference, SIGCOMM ’10, pp 401–402
Zurück zum Zitat Wang T, Chen Y, Zhang Z, Xu T, Jin L, Hui P, Deng B, Li X (2011) Understanding graph sampling algorithms for social network analysis. In: Proceedings of the 2011 31st international conference on distributed computing systems workshops, ICDCSW ’11, pp 123–128 Wang T, Chen Y, Zhang Z, Xu T, Jin L, Hui P, Deng B, Li X (2011) Understanding graph sampling algorithms for social network analysis. In: Proceedings of the 2011 31st international conference on distributed computing systems workshops, ICDCSW ’11, pp 123–128
Zurück zum Zitat Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393:440–442CrossRef Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393:440–442CrossRef
Zurück zum Zitat Xu X, Lee C (2014) A general framework of hybrid graph sampling for complex network analysis. In: Proceedings of INFOCOM Xu X, Lee C (2014) A general framework of hybrid graph sampling for complex network analysis. In: Proceedings of INFOCOM
Zurück zum Zitat Xu XK, Zhu JJ (2016) Flexible sampling large-scale social networks by self-adjustable random walk. Phys A: Stat Mech Appl 463:356–365CrossRef Xu XK, Zhu JJ (2016) Flexible sampling large-scale social networks by self-adjustable random walk. Phys A: Stat Mech Appl 463:356–365CrossRef
Zurück zum Zitat Xu X, Lee CH et al (2017) Challenging the limits: sampling online social networks with cost constraints. In: IEEE INFOCOM 2017-IEEE conference on computer communications, pp 1–9 Xu X, Lee CH et al (2017) Challenging the limits: sampling online social networks with cost constraints. In: IEEE INFOCOM 2017-IEEE conference on computer communications, pp 1–9
Zurück zum Zitat Ye S, Lang J, Wu F (2010) Crawling online social graphs. In: Proceedings of the 2010 12th international Asia-Pacific web conference, pp 236–242 Ye S, Lang J, Wu F (2010) Crawling online social graphs. In: Proceedings of the 2010 12th international Asia-Pacific web conference, pp 236–242
Zurück zum Zitat Zafarani R, Liu H (2009) Social computing data repository at ASU. School of Computing, Informatics and Decision Systems Engineering Zafarani R, Liu H (2009) Social computing data repository at ASU. School of Computing, Informatics and Decision Systems Engineering
Metadaten
Titel
Guided sampling for large graphs
verfasst von
Muhammad Irfan Yousuf
Suhyun Kim
Publikationsdatum
18.03.2020
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 4/2020
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-020-00683-y

Weitere Artikel der Ausgabe 4/2020

Data Mining and Knowledge Discovery 4/2020 Zur Ausgabe

Premium Partner