research-article

Efficient Search Engine Measurements

Authors:
Ziv Bar-Yossef

Google Israel and Technion, Haifa, Israel

Google Israel and Technion, Haifa, Israel
View Profile

,
Maxim Gurevich

Yahoo! Research, Santa Clara, CA

Yahoo! Research, Santa Clara, CA
View Profile

Authors Info & Claims

ACM Transactions on the Web Volume 5 Issue 4Article No.: 18pp 1–48https://doi.org/10.1145/2019643.2019645

Published:01 October 2011Publication History

ACM Transactions on the Web

Abstract

We address the problem of externally measuring aggregate functions over documents indexed by search engines, like corpus size, index freshness, and density of duplicates in the corpus. State of the art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased due to inaccurate approximation of the so called “document degrees”. In addition, the estimators in Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling.

We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.

By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea from Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements, without compromising accuracy.

References

Anagnostopoulos, A., Broder, A. Z., and Carmel, D. 2006. Sampling search-engine results. World Wide Web 9, 4, 397--429. Google ScholarDigital Library
Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th Annual Conference on Very Large Databases. 535--544. Google ScholarDigital Library
Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401--410. Google ScholarDigital Library
Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proc. VLDB 1, 1, 54--65. Google ScholarDigital Library
Bar-Yossef, Z. and Gurevich, M. 2008b. Random sampling from a search engine’s index. J. ACM 55, 5, 1--74. Google ScholarDigital Library
Bar-Yossef, Z. and Gurevich, M. 2009. Estimating the impressionrank of web pages. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 41--50. Google ScholarDigital Library
Baykan, E., Henzinger, M. R., Keller, S. F., Castelberg, S. D., and Kinzler, M. 2009. A comparison of techniques for sampling web pages. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS). 13--30.Google Scholar
Bharat, K. and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International Conference on World Wide Web. ACM, New York, 379--388. Google ScholarDigital Library
Bradlow, E. T. and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62. Google ScholarDigital Library
Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594--603. Google ScholarDigital Library
Casella, G. and Robert, C. P. 1996. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1, 81--94.Google ScholarCross Ref
Cheney, M. and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. http://www.mostpopularsites.net/MPS_News/Search_Engine_Facts/A_Comparison_of_the_Size_of_the_Yahoo!_and_Google_Indices.Google Scholar
Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’07). ACM, New York, 629--640. Google ScholarDigital Library
dmoz. The open directory project. http://dmoz.org.Google Scholar
Dobra, A. and Fienberg, S. E. 2004. How large is the worldwide web? In Web Dynamics, 23--44.Google Scholar
Goldreich, O. 1997. A sample of samplers - A computational perspective on sampling (survey). Electron. Colloq. Computat. Complex. (ECCC) 4, 20.Google Scholar
Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on the World Wide Web (WWW). 902--903. Google ScholarDigital Library
Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarCross Ref
Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International Conference on the World Wide Web (WWW). 213--225. Google ScholarDigital Library
Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International Conference on the World Wide Web (WWW). 295--308. Google ScholarDigital Library
Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.Google Scholar
Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.Google Scholar
Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109. Google ScholarCross Ref
Lee, S.-M. and Chao, A. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50, 1, 88--97.Google ScholarCross Ref
Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.Google ScholarCross Ref
Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer. Google ScholarDigital Library
Marshall, A. W. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods. 123--140.Google Scholar
McCurley, K. 2007. Income inequality in the attention economy. http://mccurley.org/papers/effective.Google Scholar
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.Google ScholarCross Ref
Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Symposium on Using Uncertainty within Computation.Google Scholar
Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag.Google Scholar
Stuart, A. and Ord, J. K. 1994. Kendall’s Advanced Theory of Statistics. Vol.1: Distribution Theory. Hodder Arnold, London.Google Scholar
von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.Google Scholar

Index Terms

Efficient Search Engine Measurements
1. Information systems
  1. Information retrieval

Recommendations

Random sampling from a search engine's index

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in ...
Read More
Efficient search engine measurements
WWW '07: Proceedings of the 16th international conference on World Wide Web

We address the problem of measuring global quality met-rics of search engines, like corpus size, index freshness, anddensity of duplicates in the corpus. The recently proposedestimators for such metrics [2, 6] suffer from significant biasand/or poor ...
Read More
What users see - Structures in search engine results pages

This paper investigates the composition of search engine results pages. We define what elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to which degree they are used for ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on the Web Volume 5, Issue 4
October 2011
154 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2019643
Issue’s Table of Contents

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 October 2011
- Accepted: 1 December 2010
- Revised: 1 July 2010
- Received: 1 August 2009
Published in tweb Volume 5, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Corpus size estimation
evaluation
sampling
search engines
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 15
  Total Citations
  View Citations
- 891
  Total Downloads
- Downloads (Last 12 months)11
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient Search Engine Measurements

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

Random sampling from a search engine's index

Efficient search engine measurements

What users see - Structures in search engine results pages

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient Search Engine Measurements

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

Random sampling from a search engine's index

Efficient search engine measurements

What users see - Structures in search engine results pages

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media