skip to main content
research-article

Random sampling from a search engine's index

Published:05 November 2008Publication History
Skip Abstract Section

Abstract

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis--Hastings algorithm, and the Maximum Degree method.

The limited access to search engines force our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest.

Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.

References

  1. Aldous, D. 1987. On the Markov chain simulation method for uniform combinatorial distributed and simulated annealing. Prob. Eng. Inf. Sci. 1, 33--46.Google ScholarGoogle ScholarCross RefCross Ref
  2. Anagnostopoulos, A., Broder, A., and Carmel, D. 2005. Sampling search-engine results. In Proceedings of the 14th International World Wide Web Conference (WWW). 245--256. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM, New York, 535--544. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bar-Yossef, Z., and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Battelle, J. 2005. John Battelle's searchblog. http://battellemedia.com/archives/001889.php.Google ScholarGoogle Scholar
  6. Bharat, K., and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International World Wide Web Conference (WWW7). 379--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Bondar, S. 2006. Search engine indexing limits: Where do the bots stop? (Available at http://www.sitepoint.com/article/indexing-limits-where-bots-stop.)Google ScholarGoogle Scholar
  8. Boyd, S., Diaconis, P., and Xiao, L. 2004. Fastest mixing Markov chain on a graph. SIAM Rev. 46, 4, 667--689. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Bradlow, E. T., and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (Arlington, VA). ACM. New York, 594--603. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Can, F., Nuray, R., and Sevdik, A. B. 2004. Automatic performance evaluation of Web search engines. Inf. Proc. Manage. 40, 495--514. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Cheney, M., and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. Available at http://vburton.ncsa.uiuc.edu/indexsize.html.Google ScholarGoogle Scholar
  13. Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology (ASIST). Vol. 41. 393--402.Google ScholarGoogle ScholarCross RefCross Ref
  14. Diaconis, P., and Saloff-Coste, L. 1998. What do we know about the Metropolis algorithm? J. Comput. Syst. Sci. 57, 20--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Dobra, A., and Fienberg, S. E. 2004. How large is the World Wide Web? Web Dyn. 23--44.Google ScholarGoogle Scholar
  16. Gillman, D. 1998. A Chernoff bound for random walks on expander graphs. SIAM J. Comput. 27, 4, 1203--1220. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Golub, G. H., and van Loan, C. F. 1996. Matrix Computations, 3rd ed. The Johns Hopkins University Press, Baltimore, MD. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Gordon, M., and Pathak, P. 1999. Finding information on the World Wide Web: the retrieval effectiveness of search engines. Inf. Proc. Manage. 35, 2, 141--180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Gulli, A., and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on World Wide Web (WWW)—Special Interest Tracks and Posters. 902--903. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarGoogle ScholarCross RefCross Ref
  21. Hawking, D., Craswel, N., Bailey, P., and Griffiths, K. 2001. Measuring search engine quality. Inf. Retriev. 4, 1, 33--59. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International World Wide Web Conference. 213--225. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference. 295--308. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.Google ScholarGoogle Scholar
  25. Kahale, N. 1997. Large deviation for Markov chains. Combinatorics, Probability and Computing 6, 465--474. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Lawrence, S., and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.Google ScholarGoogle Scholar
  27. Lawrence, S., and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.Google ScholarGoogle ScholarCross RefCross Ref
  28. Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.Google ScholarGoogle ScholarCross RefCross Ref
  29. Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.Google ScholarGoogle Scholar
  30. Marshal, A. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed. Vol. 21. Wiley, New York, 123--140.Google ScholarGoogle Scholar
  31. Mayer, T. 2005. Our blog is growing up - and so has our index. Available at http://www.ysearchblog.com/archives/000172.html.Google ScholarGoogle Scholar
  32. McCown, F., and Nelson, M. L. 2007. Agreeing to disagree: Search engines and their public interfaces. In JCDL '07: Proceedings of the 2007 Conference on Digital Libraries. 309--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.Google ScholarGoogle ScholarCross RefCross Ref
  34. Price, G. 2005. More on the total database size battle and Googlewhacking with Yahoo. Available at http://blog.searchenginewatch.com/blog/050811-231448.Google ScholarGoogle Scholar
  35. Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation.Google ScholarGoogle Scholar
  36. Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag, New York.Google ScholarGoogle Scholar
  37. Sinclair, A. 1993. Algorithms for Random Generation and Counting: A Markov Chain Approach. Birkhauser Verlag, Basel, Switzerland. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Stuart, A. and Ord, J.-K. 1994. Kendall's Advanced Theory of Statistics. Vol. 1, Distribution Theory. Haddon Arnold, London, UK.Google ScholarGoogle Scholar
  39. Sullivan, D. 2005. Search engine sizes. http://searchenginewatch.com/showPage.html?page=2156481.Google ScholarGoogle Scholar
  40. Véronis, J. 2005. Yahoo: Missing pages? (2). http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-2.html.Google ScholarGoogle Scholar
  41. von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.Google ScholarGoogle Scholar

Index Terms

  1. Random sampling from a search engine's index

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Journal of the ACM
      Journal of the ACM  Volume 55, Issue 5
      October 2008
      164 pages
      ISSN:0004-5411
      EISSN:1557-735X
      DOI:10.1145/1411509
      Issue’s Table of Contents

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 5 November 2008
      • Accepted: 1 August 2008
      • Revised: 1 March 2008
      • Received: 1 August 2006
      Published in jacm Volume 55, Issue 5

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader