article

Effective page refresh policies for Web crawlers

Authors:
Junghoo Cho

University of California, Los Angeles, California

University of California, Los Angeles, California
View Profile

,
Hector Garcia-Molina

Stanford University, Stanford, California

Stanford University, Stanford, California
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 28 Issue 4pp 390–426https://doi.org/10.1145/958942.958945

Published:01 December 2003Publication History

ACM Transactions on Database Systems

Abstract

In this article, we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to maintain the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date.This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we got orders of magnitude improvement from existing policies.

References

Alonso, R., Barbara, D., and Garcia-Molina, H. 1990. Data caching issues in an information retrieval system. ACM Trans. Datab. Syst. 15, 3 (Sept.), 359--384. Google ScholarDigital Library
Barbara, D. and Garcia-Molina, H. 1995. The demarcation protocol: A technique for maintaining linear arithmetic constraints in distributed database systems. In Proceedings of the International Conference on Extending Database Technology (Vienna, Austria). 373--388. Google ScholarDigital Library
Bernstein, P., Blaustein, B., and Clarke, E. 1980. Fast maintenance of semantic integrity assertions using redundant aggregate data. In Proceedings of the 6th International Conference on Very Large Databases (Montreal, Ont., Canada). 126--136.Google Scholar
Bernstein, P. and Goodman, N. 1984. The failure and recovery problem for replicated distributed databases. ACM Trans. Datab. Syst. 9, 4 (Dec.), 596--615. Google ScholarDigital Library
Brewington, B. E. and Cybenko, G. 2000a. How dynamic is the web. In Proceedings of the 9th International World-Wide Web Conference. Google ScholarDigital Library
Brewington, B. E. and Cybenko, G. 2000b. Keeping up with the changing web. IEEE Comput. 33, 5 (May), 52--58. Google ScholarDigital Library
Chakrabarti, S., van den Berg, M., and Dom, B. 1999. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the 8th International World-Wide Web Conference (Toronto, Ont., Canada). Google Scholar
Cho, J. 2001. Crawling the web: Discovery and maintenance of a large-scale web data. Ph.D. thesis, Stanford University. Google ScholarDigital Library
Cho, J. and Garcia-Molina, H. 2000. The evolution of the web and implications for an incremental crawler. In Proceedings of the 26th International Conference on Very Large Databases (Cairo, Egypt). Google ScholarDigital Library
Cho, J. and Garcia-Molina, H. 2002. Parallel crawlers. In Proceedings of the 11th International World-Wide Web Conference (Honolulu, Hawaii). Google ScholarDigital Library
Cho, J. and Garcia-Molina, H. 2003. Estimating frequency of change. ACM Trans. Internet Tech. 3, 3 (Aug.). Google ScholarDigital Library
Cho, J., Garcia-Molina, H., and Page, L. 1998. Efficient crawling through URL ordering. In Proceedings of the 7th International World-Wide Web Conference (Brisbane, Australia). Google ScholarDigital Library
Coffman, Jr., E. G., Liu, Z., and Weber, R. R. 1998. Optimal robot scheduling for web search engines. J. Sched. 1, 1 (June), 15--29.Google ScholarCross Ref
Colby, L. S., Kawaguchi, A., Lieuwen, D. F., and l Singh Mumick, I. 1997. Supporting multiple view maintenance policies. In Proceedings of the International Conference on Management of Data (Tuscon, Az.). 405--416. Google ScholarDigital Library
de Carvalho, O. S. F. and Roucairol, G. 1982. On the distribution of an assertion. In Proceedings of the ACM Symposium on Principles of Distributed Computing (Ottawa, Ont., Canada). ACM, New York, 121--131. Google ScholarDigital Library
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Databases (Cairo, Egypt). 527--534. Google ScholarDigital Library
Douglis, F., Feldmann, A., and Krishnamurthy, B. 1999. Rate of change and other metrics: a live study of the world wide web. In Proceedings of the 2nd USENIX Symposium on Internetworking Technologies and Systems (Boulder, Colo.). Google ScholarDigital Library
Edwards, J., McCurley, K., and Tomlin, J. 2001. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International World-Wide Web Conference. Google ScholarDigital Library
Golding, R. A. and Long, D. D. 1993. Modeling replica divergence in a weak-consistency protocol for global-sc ale distributed data bases. Tech. rep. UCSC-CRL-93-09, Computer and Information Sciences Board, University of California, Santa Cruz, Santa Cruz, Calif. Google ScholarDigital Library
Google. Google Inc. http://www.google.com.Google Scholar
Heydon, A. and Najork, M. 1999. Mercator: A scalable, extensible web crawler. In Proceedings of the 8th International World-Wide Web Conference (Toronto, Ont., Canada). 219--229. Google ScholarDigital Library
Krishnakumar, N. and Bernstein, A. 1991. Bounded ignorance in replicated systems. In Proceedings of the 10th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Denver, Colo.). ACM, New York, 63--74. Google ScholarDigital Library
Krishnakumar, N. and Bernstein, A. 1994. Bounded ignorance: A technique for increasing concurrency in a replicated system. ACM Trans. Datab. Syst. 19, 4 (Dec.). 586--625. Google ScholarDigital Library
Ladin, R., Liskov, B., Shrira, L., and Ghemawat, S. 1992. Providing high availability using lazy replication. ACM Trans. Comput. Syst. 10, 4 (Nov.). 360--391. Google ScholarDigital Library
Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 280, 5360 (Apr.), 98--100.Google ScholarCross Ref
Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the web. Nature 400, 6740 (July), 107--109. Google ScholarDigital Library
Menczer, F., Pant, G., and Ruiz, M. E. 2001. Evaluating topic-driven web crawlers. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, La.). ACM, New York. Google ScholarDigital Library
Olston, C. and Widom, J. 2000. Offering a precision-performance tradeoff for aggregation queries over replicated data. In Proceedings of the 26th International Conference on Very Large Databases (Cairo, Egypt). Google ScholarDigital Library
Page, L. and Brin, S. 1998. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World-Wide Web Conference (Brisbane, Australia). Google ScholarDigital Library
Pinkerton, B. 1994. Finding what people want: Experiences with the web crawler. In Proceedings of the 2nd World-Wide Web Conference (Chicago, Ill.).Google Scholar
Pitkow, J. and Pirolli, P. 1997. Life, death, and lawfulness on the electronic frontier. In Proceedings of the Conference on Human Factors in Computing Systems (CHI'97). (Atlanta, Ga.). 383--390. Google ScholarDigital Library
Pu, C. and Leff, A. 1991. Replica control in distributed systems: An asynchronous approach. In Proceedings of the International Conference on Management of Data (Denver, Colo.), 377--386. Google ScholarDigital Library
Shkapenyuk, V. and Suel, T. 2002. Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (San Jose, Calif.). Google ScholarDigital Library
Taylor, H. M. and Karlin, S. 1998. An Introduction to Stochastic Modeling, 3rd ed. Academic Press, Orlando, Fla.Google Scholar
Thomas, Jr., G. B. 1969. Calculus and analytic geometry, 4th ed. Addison-Wesley, Reading, Mass.Google Scholar
Wills, C. E. and Mikhailov, M. 1999. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th International World-Wide Web Conference (Toronto, Ont., Canada). Google Scholar
Wolman, A., Voelker, G. M., Sharma, N., Cardwell, N., Karlin, A., and Levy, H. M. 1999. On the scale and performance of cooperative web proxy caching. In Proceedings of the 17th ACM Symposium on Operating Systems Principles. ACM, New York. 16--31. Google ScholarDigital Library
Yu, H. and Vahdat, A. 2000. Efficient numerical error bounding for replicated network services. In Proceedings of the 26th International Conference on Very Large Databases (Cairo, Egypt). 123--133. Google ScholarDigital Library

Index Terms

Effective page refresh policies for Web crawlers
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
  2. Information retrieval
2. Mathematics of computing
  1. Probability and statistics

Recommendations

Mining the web with hierarchical crawlers – a resource sharing based crawling approach

An important component of any web search engine is its crawler, which is also known as robot or spider. An efficient set of crawlers make any search engine more powerful, apart from its other measures of performance, such as its ranking algorithm, ...
Read More
Designing clustering-based web crawling policies for search engine crawlers
CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management

The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulties keeping the ...
Read More
Finding pages on the unarchived web
JCDL '14: Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries

Web archives preserve the fast changing Web, yet are highly incomplete due to crawling restrictions, crawling depth and frequency, or restrictive selection policies---most of the Web is unarchived and therefore lost to posterity. In this paper, we ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 28, Issue 4
December 2003
286 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/958942
Issue’s Table of Contents

Copyright © 2003 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 2003
Published in tods Volume 28, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Web crawlers
page refresh
web search engines
world-wide web
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 166
  Total Citations
  View Citations
- 2,808
  Total Downloads
- Downloads (Last 12 months)40
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Effective page refresh policies for Web crawlers

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Mining the web with hierarchical crawlers – a resource sharing based crawling approach

Designing clustering-based web crawling policies for search engine crawlers

Finding pages on the unarchived web

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Effective page refresh policies for Web crawlers

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Mining the web with hierarchical crawlers – a resource sharing based crawling approach

Designing clustering-based web crawling policies for search engine crawlers

Finding pages on the unarchived web

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media