Abstract
We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.
The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis--Hastings algorithm, and the Maximum Degree method.
The limited access to search engines force our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest.
Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
- Aldous, D. 1987. On the Markov chain simulation method for uniform combinatorial distributed and simulated annealing. Prob. Eng. Inf. Sci. 1, 33--46.Google ScholarCross Ref
- Anagnostopoulos, A., Broder, A., and Carmel, D. 2005. Sampling search-engine results. In Proceedings of the 14th International World Wide Web Conference (WWW). 245--256. Google ScholarDigital Library
- Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM, New York, 535--544. Google ScholarDigital Library
- Bar-Yossef, Z., and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). 401--410. Google ScholarDigital Library
- Battelle, J. 2005. John Battelle's searchblog. http://battellemedia.com/archives/001889.php.Google Scholar
- Bharat, K., and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International World Wide Web Conference (WWW7). 379--388. Google ScholarDigital Library
- Bondar, S. 2006. Search engine indexing limits: Where do the bots stop? (Available at http://www.sitepoint.com/article/indexing-limits-where-bots-stop.)Google Scholar
- Boyd, S., Diaconis, P., and Xiao, L. 2004. Fastest mixing Markov chain on a graph. SIAM Rev. 46, 4, 667--689. Google ScholarDigital Library
- Bradlow, E. T., and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62. Google ScholarDigital Library
- Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (Arlington, VA). ACM. New York, 594--603. Google ScholarDigital Library
- Can, F., Nuray, R., and Sevdik, A. B. 2004. Automatic performance evaluation of Web search engines. Inf. Proc. Manage. 40, 495--514. Google ScholarDigital Library
- Cheney, M., and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. Available at http://vburton.ncsa.uiuc.edu/indexsize.html.Google Scholar
- Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology (ASIST). Vol. 41. 393--402.Google ScholarCross Ref
- Diaconis, P., and Saloff-Coste, L. 1998. What do we know about the Metropolis algorithm? J. Comput. Syst. Sci. 57, 20--36. Google ScholarDigital Library
- Dobra, A., and Fienberg, S. E. 2004. How large is the World Wide Web? Web Dyn. 23--44.Google Scholar
- Gillman, D. 1998. A Chernoff bound for random walks on expander graphs. SIAM J. Comput. 27, 4, 1203--1220. Google ScholarDigital Library
- Golub, G. H., and van Loan, C. F. 1996. Matrix Computations, 3rd ed. The Johns Hopkins University Press, Baltimore, MD. Google ScholarDigital Library
- Gordon, M., and Pathak, P. 1999. Finding information on the World Wide Web: the retrieval effectiveness of search engines. Inf. Proc. Manage. 35, 2, 141--180. Google ScholarDigital Library
- Gulli, A., and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on World Wide Web (WWW)—Special Interest Tracks and Posters. 902--903. Google ScholarDigital Library
- Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarCross Ref
- Hawking, D., Craswel, N., Bailey, P., and Griffiths, K. 2001. Measuring search engine quality. Inf. Retriev. 4, 1, 33--59. Google ScholarDigital Library
- Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International World Wide Web Conference. 213--225. Google ScholarDigital Library
- Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference. 295--308. Google ScholarDigital Library
- Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.Google Scholar
- Kahale, N. 1997. Large deviation for Markov chains. Combinatorics, Probability and Computing 6, 465--474. Google ScholarDigital Library
- Lawrence, S., and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.Google Scholar
- Lawrence, S., and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.Google ScholarCross Ref
- Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.Google ScholarCross Ref
- Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.Google Scholar
- Marshal, A. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed. Vol. 21. Wiley, New York, 123--140.Google Scholar
- Mayer, T. 2005. Our blog is growing up - and so has our index. Available at http://www.ysearchblog.com/archives/000172.html.Google Scholar
- McCown, F., and Nelson, M. L. 2007. Agreeing to disagree: Search engines and their public interfaces. In JCDL '07: Proceedings of the 2007 Conference on Digital Libraries. 309--318. Google ScholarDigital Library
- Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.Google ScholarCross Ref
- Price, G. 2005. More on the total database size battle and Googlewhacking with Yahoo. Available at http://blog.searchenginewatch.com/blog/050811-231448.Google Scholar
- Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation.Google Scholar
- Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag, New York.Google Scholar
- Sinclair, A. 1993. Algorithms for Random Generation and Counting: A Markov Chain Approach. Birkhauser Verlag, Basel, Switzerland. Google ScholarDigital Library
- Stuart, A. and Ord, J.-K. 1994. Kendall's Advanced Theory of Statistics. Vol. 1, Distribution Theory. Haddon Arnold, London, UK.Google Scholar
- Sullivan, D. 2005. Search engine sizes. http://searchenginewatch.com/showPage.html?page=2156481.Google Scholar
- Véronis, J. 2005. Yahoo: Missing pages? (2). http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-2.html.Google Scholar
- von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.Google Scholar
Index Terms
- Random sampling from a search engine's index
Recommendations
Random sampling from a search engine's index
WWW '06: Proceedings of the 15th international conference on World Wide WebWe revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks ...
Sampling search-engine results
WWW '05: Proceedings of the 14th international conference on World Wide WebWe consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
- Determining the set of ...
What users see - Structures in search engine results pages
This paper investigates the composition of search engine results pages. We define what elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to which degree they are used for ...
Comments