Article

The discoverability of the web

Authors:
Anirban Dasgupta

Yahoo! Research, Sunnyvale, CA

Yahoo! Research, Sunnyvale, CA
View Profile

,
Arpita Ghosh

Yahoo! Research, Sunnyvale, CA

Yahoo! Research, Sunnyvale, CA
View Profile

,
Ravi Kumar

Yahoo! Research, Sunnyvale, CA

Yahoo! Research, Sunnyvale, CA
View Profile

,
Christopher Olston

Yahoo! Research, Sunnyvale, CA

Yahoo! Research, Sunnyvale, CA
View Profile

,
Sandeep Pandey

Yahoo! Research, Sunnyvale, CA

Yahoo! Research, Sunnyvale, CA
View Profile

,
Andrew Tomkins

Yahoo! Research, Sunnyvale, CA

Yahoo! Research, Sunnyvale, CA
View Profile

WWW '07: Proceedings of the 16th international conference on World Wide WebMay 2007Pages 421–430https://doi.org/10.1145/1242572.1242630

Published:08 May 2007Publication History

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 421–430

ABSTRACT

Previous studies have highlighted the high arrival rate of new contenton the web. We study the extent to which this new content can beefficiently discovered by a crawler. Our study has two parts. First,we study the inherent difficulty of the discovery problem using amaximum cover formulation, under an assumption of perfect estimates oflikely sources of links to new content. Second, we relax thisassumption and study a more realistic setting in which algorithms mustuse historical statistics to estimate which pages are most likely toyield links to new content. We recommend a simple algorithm thatperforms comparably to all approaches we consider.We measure the emphoverhead of discovering new content, defined asthe average number of fetches required to discover one new page. Weshow first that with perfect foreknowledge of where to explore forlinks to new content, it is possible to discover 90% of all newcontent with under 3% overhead, and 100% of new content with 9%overhead. But actual algorithms, which do not have access to perfectforeknowledge, face a more difficult task: one quarter of new contentis simply not amenable to efficient discovery. Of the remaining threequarters, 80% of new content during a given week may be discoveredwith 160% overhead if content is recrawled fully on a monthly basis.

References

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3): 235--256, 2002. Google ScholarDigital Library
A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.Google ScholarCross Ref
B. E. Brewington and G. Cybenko. How dynamic is the web? WWW9 / Computer Networks, 33(1-6):257--276, 2000. Google ScholarDigital Library
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. WWW9 / Computer Networks, 33(1-6):309--320, 2000. Google ScholarDigital Library
J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proc. 26th VLDB, pages 200--209, 2000. Google ScholarDigital Library
J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proc. SIGMOD, pages 117--128, 2000. Google ScholarDigital Library
J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. WWW8 / Computer Networks, 30(1-7):161--172, 1998. Google ScholarDigital Library
F. Douglis, A. Feldmann, and B. Krishnamurthy. Rate of change and other metrics: A live study of the world wide web. In Proc. 1st USENIX Symposium on Internet Technologies and Systems, 1997. Google ScholarDigital Library
J. ECoffman, Z. Liu, and R. R. Weber. Optimal robot scheduling for web search engines. Journal of Scheduling, 1(1):15--29, 1998.Google ScholarCross Ref
J. Edwards, K. McCurley, and J. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proc. 10th WWW, pages 106--113, 2001. Google ScholarDigital Library
N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In Proc. 13th WWW, pages 309--318, 2004. Google ScholarDigital Library
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. SIGCOMM, pages 251--262, 1999. Google ScholarDigital Library
D. Fetterly, M. Manasse, and M. Najork. the evolution of clusters of near-duplicate web pages. In Proc. 1st LA-WEB, pages 37--45, 2003. Google ScholarDigital Library
D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study of the evolution of web pages. Software Practice and Experience, 34(2): 213--237, 2004. Google ScholarDigital Library
J. Gittins. Bandit Processes and Dynamic Allocation Indices. John Wiley, 1989.Google Scholar
M. Kearns. The Computational Complexity of Machine Learning. MIT Press, Cambridge, 1990. Google ScholarDigital Library
R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. WWW8 / Computer Networks, 31: 1481--1493, 1999. Google ScholarDigital Library
M. Mitzenmacher. A brief history of lognormal and power law distributions. Internet Mathematics, 1(2):226--251, 2004.Google ScholarCross Ref
A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th WWW, pages 1--12, 2004. Google ScholarDigital Library
A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th WWW, pages 1--12, 2004. Google ScholarDigital Library
S. Pandey and C. Olston. User-centric web crawling. In Proc. 14th WWW, pages 401--411, 2005. Google ScholarDigital Library
J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier. In Proc. CHI, pages 383--390, 1997. Google ScholarDigital Library
P. Slavik. Approximation Algorithms for Set Cover and Related Problems. PhD thesis, SUNY at Buffalo, 1998. Google ScholarDigital Library
V. V. Vazirani. Approximation Algorithms. Springer, 2001. Google ScholarDigital Library
J. L. Wolf, M. S. Squillante, P. S. Yu, J. Sethuraman, and L. Ozsen. Optimal crawling strategies for web search engines. In Proc. 11th WWW, pages 136--147, 2002. Google ScholarDigital Library

Index Terms

The discoverability of the web
1. Information systems
  1. Information retrieval
  2. Information storage systems

Recommendations

Intelligent crawling of web applications for web archiving
WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web

The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently ...
Read More
Research on the New Products Discovery Based on Web Mining
MINES '11: Proceedings of the 2011 Third International Conference on Multimedia Information Networking and Security

The new products discovery is a very interesting data which can bring new business opportunity for shopkeepers selling online. By using the Web mining can get more and more data in everywhere such as e-supermarkets and e-commerce. This paper shows a ...
Read More
Web data mining: exploring hyperlinks, contents, and usage data

This paper presents a review of the book "Web Data Mining - Exploring Hyperlinks, Contents, and Usage Data" by Bing Liu. The review concludes that the breadth and depth of this book makes it a required staple for every Web mining researcher, student, or ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
General Chairs:
Carey Williamson
University of Calgary, Canada
,
Mary Ellen Zurko
IBM, USA
,
Program Chairs:
Peter Patel-Schneider
Bell Labs Research, USA
,
Prashant Shenoy
University of Massachusetts at Amherst, USA
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 May 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
crawling
discovery
greedy
max cover
set cover
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,899of8,196submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 56
  Total Citations
  View Citations
- 735
  Total Downloads
- Downloads (Last 12 months)20
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The discoverability of the web

WWW '07: Proceedings of the 16th international conference on World Wide Web

ABSTRACT

References

Cited By

Index Terms

Recommendations

Intelligent crawling of web applications for web archiving

Research on the New Products Discovery Based on Web Mining

Web data mining: exploring hyperlinks, contents, and usage data