ABSTRACT
Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.
- P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3): 235--256, 2002. Google ScholarDigital Library
- A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.Google ScholarCross Ref
- B. E. Brewington and G. Cybenko. How dynamic is the web? WWW9 / Computer Networks, 33(1-6):257--276, 2000. Google ScholarDigital Library
- A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. WWW9 / Computer Networks, 33(1-6):309--320, 2000. Google ScholarDigital Library
- J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proc. 26th VLDB, pages 200--209, 2000. Google ScholarDigital Library
- J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proc. SIGMOD, pages 117--128, 2000. Google ScholarDigital Library
- J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. WWW8 / Computer Networks, 30(1-7):161--172, 1998. Google ScholarDigital Library
- F. Douglis, A. Feldmann, and B. Krishnamurthy. Rate of change and other metrics: A live study of the world wide web. In Proc. 1st USENIX Symposium on Internet Technologies and Systems, 1997. Google ScholarDigital Library
- J. ECoffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1):15--29, 1998.Google ScholarCross Ref
- J. Edwards, K. McCurley, and J. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proc. 10th WWW, pages 106--113, 2001. Google ScholarDigital Library
- N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In Proc. 13th WWW, pages 309--318, 2004. Google ScholarDigital Library
- M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. SIGCOMM, pages 251--262, 1999. Google ScholarDigital Library
- D. Fetterly, M. Manasse, and M. Najork. the evolution of clusters of near-duplicate web pages. In Proc. 1st LA-WEB, pages 37--45, 2003. Google ScholarDigital Library
- D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. Software Practice and Experience, 34(2): 213--237, 2004. Google ScholarDigital Library
- J. Gittins. Bandit Processes and Dynamic Allocation Indices. John Wiley, 1989.Google Scholar
- M. Kearns. The Computational Complexity of Machine Learning. MIT Press, Cambridge, 1990. Google ScholarDigital Library
- R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. WWW8 / Computer Networks, 31: 1481--1493, 1999. Google ScholarDigital Library
- M. Mitzenmacher. A brief history of lognormal and power law distributions. Internet Mathematics, 1(2):226--251, 2004.Google ScholarCross Ref
- A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th WWW, pages 1--12, 2004. Google ScholarDigital Library
- A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th WWW, pages 1--12, 2004. Google ScholarDigital Library
- S. Pandey and C. Olston. User-centric web crawling. In Proc. 14th WWW, pages 401--411, 2005. Google ScholarDigital Library
- J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proc. CHI, pages 383--390, 1997. Google ScholarDigital Library
- P. Slavik. Approximation Algorithms for Set Cover and Related Problems. PhD thesis, SUNY at Buffalo, 1998. Google ScholarDigital Library
- V. V. Vazirani. Approximation Algorithms. Springer, 2001. Google ScholarDigital Library
- J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proc. 11th WWW, pages 136--147, 2002. Google ScholarDigital Library
Index Terms
- The discoverability of the web
Recommendations
Intelligent crawling of web applications for web archiving
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide WebThe steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently ...
Research on the New Products Discovery Based on Web Mining
MINES '11: Proceedings of the 2011 Third International Conference on Multimedia Information Networking and SecurityThe new products discovery is a very interesting data which can bring new business opportunity for shopkeepers selling online. By using the Web mining can get more and more data in everywhere such as e-supermarkets and e-commerce. This paper shows a ...
Web data mining: exploring hyperlinks, contents, and usage data
This paper presents a review of the book "Web Data Mining - Exploring Hyperlinks, Contents, and Usage Data" by Bing Liu. The review concludes that the breadth and depth of this book makes it a required staple for every Web mining researcher, student, or ...
Comments