skip to main content
10.1145/1993744.1993773acmconferencesArticle/Chapter ViewAbstractPublication PagesmetricsConference Proceedingsconference-collections
research-article

Walking on a graph with a magnifying glass: stratified sampling via weighted random walks

Published:07 June 2011Publication History

ABSTRACT

Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal, while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation, and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13-15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.

Skip Supplemental Material Section

Supplemental Material

metrics_8_1.mp4

mp4

140.4 MB

References

  1. Weighted Random Walks of the Facebook social graph: http://odysseas.calit2.uci.edu/osn, 2011.Google ScholarGoogle Scholar
  2. D. Achlioptas, A. Clauset, D. Kempe, and C. On the bias of traceroute sampling: or, power-law degree distributions in regular graphs. Journal of the ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In WWW, pages 835--844, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Aldous and J. A. Fill. Reversible Markov Chains and Random Walks on Graphs. In preparation.Google ScholarGoogle Scholar
  5. K. Avrachenkov, B. Ribeiro, and D. Towsley. Improving Random Walk Estimation Accuracy with Uniform Restarts. In I7th Workshop on Algorithms and Models for the Web Graph, 2010.Google ScholarGoogle Scholar
  6. L. Backstrom and J. Kleinberg. Network Bucket Testing. In WWW, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In ACM International Conference on Web Search and Data Minig (WSDM), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. L. Becchetti, C. Castillo, D. Donato, and A. Fazzone. A comparison of sampling techniques for web graph characterization. In LinkKDD, 2006.Google ScholarGoogle Scholar
  9. H. R. Bernard, T. Hallett, A. Iovita, E. C. Johnsen, R. Lyerla, C. McCarty, M. Mahy, M. J. Salganik, T. Saliuk, O. Scutelniciuc, G. a. Shelley, P. Sirinirund, S. Weir, and D. F. Stroup. Counting hard-to-count populations: the network scale-up method for public health. Sexually Transmitted Infections, 86(Suppl 2):ii11--ii15, Nov. 2010.Google ScholarGoogle Scholar
  10. S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing Markov chain on a graph. SIAM review, 46(4):667--689, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chakrabarti. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16):1623--1640, May 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. W. G. Cochran. Sampling Techniques, volume 20 of McGraw-Hil Series in Probability and Statistics. Wiley, 1977.Google ScholarGoogle Scholar
  13. M. Diligenti, F. Coetzee, S. Lawrence, C. Giles, and M. Gori. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 527--534, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. N. Duffield, C. Lund, and M. Thorup. Learn more, sample less: control of volume and variance in network measurement. IEEE Transactions on Information Theory, 51(5):1756--1775, May 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Gentle. Random number generation and Monte Carlo methods. Springer Verlag, 2003.Google ScholarGoogle Scholar
  16. W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.Google ScholarGoogle Scholar
  17. M. Gjoka, C. Butts, M. Kurant, and A. Markopoulou. Multigraph Sampling of Online Social Networks. arXiv, (arXiv:1008.2565v1):1--10, 2010.Google ScholarGoogle Scholar
  18. M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. C. Gkantsidis, M. Mihail, and A. Saberi. Random walks in peer-to-peer networks. In INFOCOM, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  20. M. Hansen and W. Hurwitz. On the Theory of Sampling from Finite Populations. Annals of Mathematical Statistics, 14(3), 1943.Google ScholarGoogle Scholar
  21. D. D. Heckathorn. Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations. Social Problems, 44:174--199, 1997.Google ScholarGoogle ScholarCross RefCross Ref
  22. M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. In WWW, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. E. D. Kolaczyk. Statistical Analysis of Network Data, volume 69 of Springer Series in Statistics. Springer New York, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about Twitter. In WOSN, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Kurant, M. Gjoka, C. Butts, and A. Markopoulou. Walking on a Graph with a Magnifying Glass. Arxiv preprint arXiv:1101.5463, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Kurant, A. Markopoulou, and P. Thiran. On the bias of BFS (Breadth First Search). In ITC, also in arXiv:1004.1729, 2010.Google ScholarGoogle Scholar
  27. S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of Sampled Networks. Phys. Rev. E, 73:16102, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  28. J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD, pages 631--636, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. Lohr. Sampling: design and analysis. Brooks/Cole, second edition, 2009.Google ScholarGoogle Scholar
  30. L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty, 2(1):1--46, 1993.Google ScholarGoogle Scholar
  31. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21:1087--1092, 1953.Google ScholarGoogle ScholarCross RefCross Ref
  32. A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Growth of the Flickr social network. In WOSN, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC, pages 29--42, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Mohaisen, A. Yun, and Y. Kim. Measuring the mixing time of social graphs. IMC, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. J. Neyman. On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97(4):558, 1934.Google ScholarGoogle ScholarCross RefCross Ref
  36. A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach. Respondent-driven sampling for characterizing unstructured overlays. In Infocom Mini-conference, pages 2701--2705, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  37. A. H. Rasti, M. Torkjazi, R. Rejaie, and D. Stutzbach. Evaluating Sampling Techniques for Large Dynamic Graphs. In Technical Report, volume 1, 2008.Google ScholarGoogle Scholar
  38. B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In IMC, volume 011, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. B. Ribeiro, P. Wang, and D. Towsley. On Estimating Degree Distributions of Directed Graphs through Sampling. UMass Technical Report, 2010.Google ScholarGoogle Scholar
  40. M. Salganik and D. D. Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193--240, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  41. D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger. On unbiased sampling for unstructured peer-to-peer networks. In IMC, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. E. Volz and D. D. Heckathorn. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics, 24(1):79--97, 2008.Google ScholarGoogle Scholar
  43. W. Willinger, R. Rejaie, M. Torkjazi, M. Valafar, and M. Maggioni. OSN Research: Time to Face the Real Challenges. In HotMetrics, 2009.Google ScholarGoogle Scholar
  44. C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. User interactions in social networks and their implications. In EuroSys, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Walking on a graph with a magnifying glass: stratified sampling via weighted random walks

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
        June 2011
        376 pages
        ISBN:9781450308144
        DOI:10.1145/1993744

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 June 2011

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate459of2,691submissions,17%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader