ABSTRACT
Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal, while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation, and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13-15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.
Supplemental Material
- Weighted Random Walks of the Facebook social graph: http://odysseas.calit2.uci.edu/osn, 2011.Google Scholar
- D. Achlioptas, A. Clauset, D. Kempe, and C. On the bias of traceroute sampling: or, power-law degree distributions in regular graphs. Journal of the ACM, 2009. Google ScholarDigital Library
- Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In WWW, pages 835--844, 2007. Google ScholarDigital Library
- D. Aldous and J. A. Fill. Reversible Markov Chains and Random Walks on Graphs. In preparation.Google Scholar
- K. Avrachenkov, B. Ribeiro, and D. Towsley. Improving Random Walk Estimation Accuracy with Uniform Restarts. In I7th Workshop on Algorithms and Models for the Web Graph, 2010.Google Scholar
- L. Backstrom and J. Kleinberg. Network Bucket Testing. In WWW, 2011. Google ScholarDigital Library
- L. Backstrom and J. Leskovec. Supervised Random Walks: Predicting and Recommending Links in Social Networks. In ACM International Conference on Web Search and Data Minig (WSDM), 2011. Google ScholarDigital Library
- L. Becchetti, C. Castillo, D. Donato, and A. Fazzone. A comparison of sampling techniques for web graph characterization. In LinkKDD, 2006.Google Scholar
- H. R. Bernard, T. Hallett, A. Iovita, E. C. Johnsen, R. Lyerla, C. McCarty, M. Mahy, M. J. Salganik, T. Saliuk, O. Scutelniciuc, G. a. Shelley, P. Sirinirund, S. Weir, and D. F. Stroup. Counting hard-to-count populations: the network scale-up method for public health. Sexually Transmitted Infections, 86(Suppl 2):ii11--ii15, Nov. 2010.Google Scholar
- S. Boyd, P. Diaconis, and L. Xiao. Fastest mixing Markov chain on a graph. SIAM review, 46(4):667--689, 2004. Google ScholarDigital Library
- S. Chakrabarti. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16):1623--1640, May 1999. Google ScholarDigital Library
- W. G. Cochran. Sampling Techniques, volume 20 of McGraw-Hil Series in Probability and Statistics. Wiley, 1977.Google Scholar
- M. Diligenti, F. Coetzee, S. Lawrence, C. Giles, and M. Gori. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 527--534, 2000. Google ScholarDigital Library
- N. Duffield, C. Lund, and M. Thorup. Learn more, sample less: control of volume and variance in network measurement. IEEE Transactions on Information Theory, 51(5):1756--1775, May 2005. Google ScholarDigital Library
- J. Gentle. Random number generation and Monte Carlo methods. Springer Verlag, 2003.Google Scholar
- W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.Google Scholar
- M. Gjoka, C. Butts, M. Kurant, and A. Markopoulou. Multigraph Sampling of Online Social Networks. arXiv, (arXiv:1008.2565v1):1--10, 2010.Google Scholar
- M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of OSNs. In INFOCOM, 2010. Google ScholarDigital Library
- C. Gkantsidis, M. Mihail, and A. Saberi. Random walks in peer-to-peer networks. In INFOCOM, 2004.Google ScholarCross Ref
- M. Hansen and W. Hurwitz. On the Theory of Sampling from Finite Populations. Annals of Mathematical Statistics, 14(3), 1943.Google Scholar
- D. D. Heckathorn. Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations. Social Problems, 44:174--199, 1997.Google ScholarCross Ref
- M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform URL sampling. In WWW, 2000. Google ScholarDigital Library
- E. D. Kolaczyk. Statistical Analysis of Network Data, volume 69 of Springer Series in Statistics. Springer New York, 2009. Google ScholarDigital Library
- B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about Twitter. In WOSN, 2008. Google ScholarDigital Library
- M. Kurant, M. Gjoka, C. Butts, and A. Markopoulou. Walking on a Graph with a Magnifying Glass. Arxiv preprint arXiv:1101.5463, 2011. Google ScholarDigital Library
- M. Kurant, A. Markopoulou, and P. Thiran. On the bias of BFS (Breadth First Search). In ITC, also in arXiv:1004.1729, 2010.Google Scholar
- S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of Sampled Networks. Phys. Rev. E, 73:16102, 2006.Google ScholarCross Ref
- J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD, pages 631--636, 2006. Google ScholarDigital Library
- S. Lohr. Sampling: design and analysis. Brooks/Cole, second edition, 2009.Google Scholar
- L. Lovász. Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty, 2(1):1--46, 1993.Google Scholar
- N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21:1087--1092, 1953.Google ScholarCross Ref
- A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Growth of the Flickr social network. In WOSN, 2008. Google ScholarDigital Library
- A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC, pages 29--42, 2007. Google ScholarDigital Library
- A. Mohaisen, A. Yun, and Y. Kim. Measuring the mixing time of social graphs. IMC, 2010. Google ScholarDigital Library
- J. Neyman. On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97(4):558, 1934.Google ScholarCross Ref
- A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach. Respondent-driven sampling for characterizing unstructured overlays. In Infocom Mini-conference, pages 2701--2705, 2009.Google ScholarCross Ref
- A. H. Rasti, M. Torkjazi, R. Rejaie, and D. Stutzbach. Evaluating Sampling Techniques for Large Dynamic Graphs. In Technical Report, volume 1, 2008.Google Scholar
- B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In IMC, volume 011, 2010. Google ScholarDigital Library
- B. Ribeiro, P. Wang, and D. Towsley. On Estimating Degree Distributions of Directed Graphs through Sampling. UMass Technical Report, 2010.Google Scholar
- M. Salganik and D. D. Heckathorn. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology, 34(1):193--240, 2004.Google ScholarCross Ref
- D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger. On unbiased sampling for unstructured peer-to-peer networks. In IMC, 2006. Google ScholarDigital Library
- E. Volz and D. D. Heckathorn. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics, 24(1):79--97, 2008.Google Scholar
- W. Willinger, R. Rejaie, M. Torkjazi, M. Valafar, and M. Maggioni. OSN Research: Time to Face the Real Challenges. In HotMetrics, 2009.Google Scholar
- C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. User interactions in social networks and their implications. In EuroSys, 2009. Google ScholarDigital Library
Index Terms
- Walking on a graph with a magnifying glass: stratified sampling via weighted random walks
Recommendations
Walking on a graph with a magnifying glass: stratified sampling via weighted random walks
Performance evaluation reviewOur objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling ...
Walking in facebook: a case study of unbiased sampling of OSNs
INFOCOM'10: Proceedings of the 29th conference on Information communicationsWith more than 250 million active users [1], Facebook (FB) is currently one of the most important online social networks. Our goal in this paper is to obtain a representative (unbiased) sample of Facebook users by crawling its social graph. In this ...
Unbiased Sampling of Bipartite Graph
CYBERC '11: Proceedings of the 2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge DiscoveryIncreasing size of online social networks (OSNs) has given rise to sampling method studies that provide a relatively small but representative sample of large-scale OSNs so that the measurement and analysis burden can be affordable. So far, a number of ...
Comments