Abstract
We address the problem of externally measuring aggregate functions over documents indexed by search engines, like corpus size, index freshness, and density of duplicates in the corpus. State of the art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased due to inaccurate approximation of the so called “document degrees”. In addition, the estimators in Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling.
We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.
By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea from Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements, without compromising accuracy.
- Anagnostopoulos, A., Broder, A. Z., and Carmel, D. 2006. Sampling search-engine results. World Wide Web 9, 4, 397--429. Google ScholarDigital Library
- Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th Annual Conference on Very Large Databases. 535--544. Google ScholarDigital Library
- Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401--410. Google ScholarDigital Library
- Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proc. VLDB 1, 1, 54--65. Google ScholarDigital Library
- Bar-Yossef, Z. and Gurevich, M. 2008b. Random sampling from a search engine’s index. J. ACM 55, 5, 1--74. Google ScholarDigital Library
- Bar-Yossef, Z. and Gurevich, M. 2009. Estimating the impressionrank of web pages. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 41--50. Google ScholarDigital Library
- Baykan, E., Henzinger, M. R., Keller, S. F., Castelberg, S. D., and Kinzler, M. 2009. A comparison of techniques for sampling web pages. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS). 13--30.Google Scholar
- Bharat, K. and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International Conference on World Wide Web. ACM, New York, 379--388. Google ScholarDigital Library
- Bradlow, E. T. and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62. Google ScholarDigital Library
- Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594--603. Google ScholarDigital Library
- Casella, G. and Robert, C. P. 1996. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1, 81--94.Google ScholarCross Ref
- Cheney, M. and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. http://www.mostpopularsites.net/MPS_News/Search_Engine_Facts/A_Comparison_of_the_Size_of_the_Yahoo!_and_Google_Indices.Google Scholar
- Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’07). ACM, New York, 629--640. Google ScholarDigital Library
- dmoz. The open directory project. http://dmoz.org.Google Scholar
- Dobra, A. and Fienberg, S. E. 2004. How large is the worldwide web? In Web Dynamics, 23--44.Google Scholar
- Goldreich, O. 1997. A sample of samplers - A computational perspective on sampling (survey). Electron. Colloq. Computat. Complex. (ECCC) 4, 20.Google Scholar
- Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on the World Wide Web (WWW). 902--903. Google ScholarDigital Library
- Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarCross Ref
- Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International Conference on the World Wide Web (WWW). 213--225. Google ScholarDigital Library
- Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International Conference on the World Wide Web (WWW). 295--308. Google ScholarDigital Library
- Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.Google Scholar
- Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.Google Scholar
- Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109. Google ScholarCross Ref
- Lee, S.-M. and Chao, A. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50, 1, 88--97.Google ScholarCross Ref
- Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.Google ScholarCross Ref
- Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer. Google ScholarDigital Library
- Marshall, A. W. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods. 123--140.Google Scholar
- McCurley, K. 2007. Income inequality in the attention economy. http://mccurley.org/papers/effective.Google Scholar
- Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.Google ScholarCross Ref
- Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Symposium on Using Uncertainty within Computation.Google Scholar
- Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag.Google Scholar
- Stuart, A. and Ord, J. K. 1994. Kendall’s Advanced Theory of Statistics. Vol.1: Distribution Theory. Hodder Arnold, London.Google Scholar
- von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.Google Scholar
Index Terms
- Efficient Search Engine Measurements
Recommendations
Random sampling from a search engine's index
We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in ...
Efficient search engine measurements
WWW '07: Proceedings of the 16th international conference on World Wide WebWe address the problem of measuring global quality met-rics of search engines, like corpus size, index freshness, anddensity of duplicates in the corpus. The recently proposedestimators for such metrics [2, 6] suffer from significant biasand/or poor ...
What users see - Structures in search engine results pages
This paper investigates the composition of search engine results pages. We define what elements the most popular web search engines use on their results pages (e.g., organic results, advertisements, shortcuts) and to which degree they are used for ...
Comments