skip to main content
research-article

Efficient Search Engine Measurements

Published:01 October 2011Publication History
Skip Abstract Section

Abstract

We address the problem of externally measuring aggregate functions over documents indexed by search engines, like corpus size, index freshness, and density of duplicates in the corpus. State of the art estimators for such quantities [Bar-Yossef and Gurevich 2008b; Broder et al. 2006] are biased due to inaccurate approximation of the so called “document degrees”. In addition, the estimators in Bar-Yossef and Gurevich [2008b] are quite costly, due to their reliance on rejection sampling.

We present new estimators that are able to overcome the bias introduced by approximate degrees. Our estimators are based on a careful implementation of an approximate importance sampling procedure. Comprehensive theoretical and empirical analysis of the estimators demonstrates that they have essentially no bias even in situations where document degrees are poorly approximated.

By avoiding the costly rejection sampling approach, our new importance sampling estimators are significantly more efficient than the estimators proposed in Bar-Yossef and Gurevich [2008b]. Furthermore, building on an idea from Broder et al. [2006], we discuss Rao-Blackwellization as a generic method for reducing variance in search engine estimators. We show that Rao-Blackwellizing our estimators results in performance improvements, without compromising accuracy.

References

  1. Anagnostopoulos, A., Broder, A. Z., and Carmel, D. 2006. Sampling search-engine results. World Wide Web 9, 4, 397--429. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th Annual Conference on Very Large Databases. 535--544. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proc. VLDB 1, 1, 54--65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bar-Yossef, Z. and Gurevich, M. 2008b. Random sampling from a search engine’s index. J. ACM 55, 5, 1--74. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Bar-Yossef, Z. and Gurevich, M. 2009. Estimating the impressionrank of web pages. In Proceedings of the 18th International Conference on World Wide Web (WWW’09). ACM, New York, 41--50. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Baykan, E., Henzinger, M. R., Keller, S. F., Castelberg, S. D., and Kinzler, M. 2009. A comparison of techniques for sampling web pages. In Proceedings of the Symposium on Theoretical Aspects of Computer Science (STACS). 13--30.Google ScholarGoogle Scholar
  8. Bharat, K. and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International Conference on World Wide Web. ACM, New York, 379--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Bradlow, E. T. and Schmittlein, D. C. 2000. The little engines that could: Modeling the performance of World Wide Web search engines. Market. Sci. 19, 43--62. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594--603. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Casella, G. and Robert, C. P. 1996. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1, 81--94.Google ScholarGoogle ScholarCross RefCross Ref
  12. Cheney, M. and Perry, M. 2005. A comparison of the size of the Yahoo! and Google indices. http://www.mostpopularsites.net/MPS_News/Search_Engine_Facts/A_Comparison_of_the_Size_of_the_Yahoo!_and_Google_Indices.Google ScholarGoogle Scholar
  13. Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’07). ACM, New York, 629--640. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. dmoz. The open directory project. http://dmoz.org.Google ScholarGoogle Scholar
  15. Dobra, A. and Fienberg, S. E. 2004. How large is the worldwide web? In Web Dynamics, 23--44.Google ScholarGoogle Scholar
  16. Goldreich, O. 1997. A sample of samplers - A computational perspective on sampling (survey). Electron. Colloq. Computat. Complex. (ECCC) 4, 20.Google ScholarGoogle Scholar
  17. Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Proceedings of the 14th International Conference on the World Wide Web (WWW). 902--903. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97--109.Google ScholarGoogle ScholarCross RefCross Ref
  19. Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International Conference on the World Wide Web (WWW). 213--225. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International Conference on the World Wide Web (WWW). 295--308. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Hesterberg, T. C. 1988. Advances in importance sampling. Ph.D. dissertation, Stanford University.Google ScholarGoogle Scholar
  22. Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 5360, 280, 98.Google ScholarGoogle Scholar
  23. Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107--109. Google ScholarGoogle ScholarCross RefCross Ref
  24. Lee, S.-M. and Chao, A. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50, 1, 88--97.Google ScholarGoogle ScholarCross RefCross Ref
  25. Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Stat. Comput. 6, 113--119.Google ScholarGoogle ScholarCross RefCross Ref
  26. Liu, J. S. 2001. Monte Carlo Strategies in Scientific Computing. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Marshall, A. W. 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In Proceedings of the Symposium on Monte Carlo Methods. 123--140.Google ScholarGoogle Scholar
  28. McCurley, K. 2007. Income inequality in the attention economy. http://mccurley.org/papers/effective.Google ScholarGoogle Scholar
  29. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087--1091.Google ScholarGoogle ScholarCross RefCross Ref
  30. Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Symposium on Using Uncertainty within Computation.Google ScholarGoogle Scholar
  31. Siegmund, D. 1985. Sequential Analysis - Tests and Confidence Intervals. Springer-Verlag.Google ScholarGoogle Scholar
  32. Stuart, A. and Ord, J. K. 1994. Kendall’s Advanced Theory of Statistics. Vol.1: Distribution Theory. Hodder Arnold, London.Google ScholarGoogle Scholar
  33. von Neumann, J. 1963. Various techniques used in connection with random digits. In John von Neumann, Collected Works. Vol. V. Oxford.Google ScholarGoogle Scholar

Index Terms

  1. Efficient Search Engine Measurements

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on the Web
      ACM Transactions on the Web  Volume 5, Issue 4
      October 2011
      154 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/2019643
      Issue’s Table of Contents

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 October 2011
      • Accepted: 1 December 2010
      • Revised: 1 July 2010
      • Received: 1 August 2009
      Published in tweb Volume 5, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader