research-article

Random sampling from a search engine's index

Authors:
Ziv Bar-Yossef

Technion and Google Haifa, Haifa, Israel

Technion and Google Haifa, Haifa, Israel
View Profile

,
Maxim Gurevich

Technion, Haifa, Israel

Technion, Haifa, Israel
View Profile

Authors Info & Claims

Journal of the ACM Volume 55 Issue 5Article No.: 24pp 1–74https://doi.org/10.1145/1411509.1411514

Published:05 November 2008Publication History

Journal of the ACM

Abstract

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis--Hastings algorithm, and the Maximum Degree method.

The limited access to search engines force our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest.

Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.

References

Aldous, D. 1987. On the Markov chain simulation method for uniform combinatorial distributed and simulated annealing. Prob. Eng. Inf. Sci. 1, 33--46.Google ScholarCross Ref
Anagnostopoulos, A., Broder, A., and Carmel, D. 2005. Sampling search-engine results. In Proceedings of the 14th International World Wide Web Conference (WWW). 245--256. Google ScholarDigital Library
Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM, New York, 535--544. Google ScholarDigital Library
Bar-Yossef, Z., and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). 401--410. Google ScholarDigital Library
Battelle, J. 2005. John Battelle's searchblog. http://battellemedia.com/archives/001889.php.Google Scholar
Bharat, K., and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International World Wide Web Conference (WWW7). 379--388. Google ScholarDigital Library
Bondar, S. 2006. Search engine indexing limits: Where do the bots stop&quest; (Available at http://www.sitepoint.com/article/indexing-limits-where-bots-stop.)Google Scholar
Boyd, S., Diaconis, P., and Xiao, L. 2004. Fastest mixing Markov chain on a graph. SIAM Rev. 46, 4, 667--689. Google ScholarDigital Library
Bradlow, E. T., and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62. Google ScholarDigital Library
Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (Arlington, VA). ACM. New York, 594--603. Google ScholarDigital Library
Can, F., Nuray, R., and Sevdik, A. B. 2004. Automatic performance evaluation of Web search engines. Inf. Proc. Manage. 40, 495--514. Google ScholarDigital Library
Cheney, M., and Perry, M. 2005. A comparison of the size of the Yahoo&excl; and Google indices. Available at http://vburton.ncsa.uiuc.edu/indexsize.html.Google Scholar
Davison, B. D. 2004. The potential of the metasearch engine. In Proceedings of the Annual Meeting of the American Society for Information Science and Technology (ASIST). Vol. 41. 393--402.Google ScholarCross Ref
Diaconis, P., and Saloff-Coste, L. 1998. What do we know about the Metropolis algorithm&quest; J. Comput. Syst. Sci. 57, 20--36. Google ScholarDigital Library
Dobra, A., and Fienberg, S. E. 2004. How large is the World Wide Web&quest; Web Dyn. 23--44.Google Scholar
Gillman, D. 1998. A Chernoff bound for random walks on expander graphs. SIAM J. Comput. 27, 4, 1203--1220. Google ScholarDigital Library
Golub, G. H., and van Loan, C. F. 1996. Matrix Computations, 3rd ed. The Johns Hopkins University Press, Baltimore, MD. Google ScholarDigital Library
Gordon, M., and Pathak, P. 1999. Finding information on the World Wide Web: the retrieval effectiveness of search engines. Inf. Proc. Manage. 35, 2, 141--180. Google ScholarDigital Library
Gulli, A., and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on World Wide Web (WWW)—Special Interest Tracks and Posters. 902--903. Google ScholarDigital Library
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarCross Ref
Hawking, D., Craswel, N., Bailey, P., and Griffiths, K. 2001. Measuring search engine quality. Inf. Retriev. 4, 1, 33--59. Google ScholarDigital Library
Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International World Wide Web Conference. 213--225. Google ScholarDigital Library
Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference. 295--308. Google ScholarDigital Library
Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.Google Scholar
Kahale, N. 1997. Large deviation for Markov chains. Combinatorics, Probability and Computing 6, 465--474. Google ScholarDigital Library
Lawrence, S., and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.Google Scholar
Lawrence, S., and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109.Google ScholarCross Ref
Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.Google ScholarCross Ref
Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.Google Scholar
Marshal, A. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods, M. Meyer, Ed. Vol. 21. Wiley, New York, 123--140.Google Scholar
Mayer, T. 2005. Our blog is growing up - and so has our index. Available at http://www.ysearchblog.com/archives/000172.html.Google Scholar
McCown, F., and Nelson, M. L. 2007. Agreeing to disagree: Search engines and their public interfaces. In JCDL '07: Proceedings of the 2007 Conference on Digital Libraries. 309--318. Google ScholarDigital Library
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.Google ScholarCross Ref
Price, G. 2005. More on the total database size battle and Googlewhacking with Yahoo. Available at http://blog.searchenginewatch.com/blog/050811-231448.Google Scholar
Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of AAAI Fall Symposium on Using Uncertainty within Computation.Google Scholar
Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag, New York.Google Scholar
Sinclair, A. 1993. Algorithms for Random Generation and Counting: A Markov Chain Approach. Birkhauser Verlag, Basel, Switzerland. Google ScholarDigital Library
Stuart, A. and Ord, J.-K. 1994. Kendall's Advanced Theory of Statistics. Vol. 1, Distribution Theory. Haddon Arnold, London, UK.Google Scholar
Sullivan, D. 2005. Search engine sizes. http://searchenginewatch.com/showPage.html?page=2156481.Google Scholar
Véronis, J. 2005. Yahoo: Missing pages&quest; (2). http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-2.html.Google Scholar
von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.Google Scholar

Index Terms

Random sampling from a search engine's index
1. Information systems
  1. Information retrieval

Recommendations

Random sampling from a search engine's index
WWW '06: Proceedings of the 15th international conference on World Wide Web

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks ...
Read More
Sampling search-engine results
WWW '05: Proceedings of the 14th international conference on World Wide Web
We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
- Determining the set of ...
Read More
What users see - Structures in search engine results pages

This paper investigates the composition of search engine results pages. We define what elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to which degree they are used for ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Journal of the ACM Volume 55, Issue 5
October 2008
164 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/1411509
Issue’s Table of Contents

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 5 November 2008
- Accepted: 1 August 2008
- Revised: 1 March 2008
- Received: 1 August 2006
Published in jacm Volume 55, Issue 5

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Benchmarks
sampling
search engines
size estimation
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 77
  Total Citations
  View Citations
- 1,546
  Total Downloads
- Downloads (Last 12 months)46
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Random sampling from a search engine's index

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Random sampling from a search engine's index

Sampling search-engine results

What users see - Structures in search engine results pages

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Random sampling from a search engine's index

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Random sampling from a search engine's index

Sampling search-engine results

What users see - Structures in search engine results pages

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media