skip to main content
10.1145/1242572.1242630acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

The discoverability of the web

Published:08 May 2007Publication History

ABSTRACT

Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.

References

  1. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3): 235--256, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  3. B. E. Brewington and G. Cybenko. How dynamic is the web? WWW9 / Computer Networks, 33(1-6):257--276, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. WWW9 / Computer Networks, 33(1-6):309--320, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proc. 26th VLDB, pages 200--209, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proc. SIGMOD, pages 117--128, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. WWW8 / Computer Networks, 30(1-7):161--172, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. F. Douglis, A. Feldmann, and B. Krishnamurthy. Rate of change and other metrics: A live study of the world wide web. In Proc. 1st USENIX Symposium on Internet Technologies and Systems, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. ECoffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1):15--29, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  10. J. Edwards, K. McCurley, and J. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proc. 10th WWW, pages 106--113, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In Proc. 13th WWW, pages 309--318, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. SIGCOMM, pages 251--262, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Fetterly, M. Manasse, and M. Najork. the evolution of clusters of near-duplicate web pages. In Proc. 1st LA-WEB, pages 37--45, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. Software Practice and Experience, 34(2): 213--237, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Gittins. Bandit Processes and Dynamic Allocation Indices. John Wiley, 1989.Google ScholarGoogle Scholar
  16. M. Kearns. The Computational Complexity of Machine Learning. MIT Press, Cambridge, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. WWW8 / Computer Networks, 31: 1481--1493, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. Mitzenmacher. A brief history of lognormal and power law distributions. Internet Mathematics, 1(2):226--251, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  19. A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th WWW, pages 1--12, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th WWW, pages 1--12, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Pandey and C. Olston. User-centric web crawling. In Proc. 14th WWW, pages 401--411, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proc. CHI, pages 383--390, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. Slavik. Approximation Algorithms for Set Cover and Related Problems. PhD thesis, SUNY at Buffalo, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. V. V. Vazirani. Approximation Algorithms. Springer, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proc. 11th WWW, pages 136--147, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. The discoverability of the web

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        WWW '07: Proceedings of the 16th international conference on World Wide Web
        May 2007
        1382 pages
        ISBN:9781595936547
        DOI:10.1145/1242572

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 May 2007

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        Overall Acceptance Rate1,899of8,196submissions,23%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader