Article

Factors affecting website reconstruction from the web infrastructure

Authors:
Frank McCown, Old Dominion University, Norfolk, VA
Norou Diawara, Old Dominion University, Norfolk, VA
Michael L. Nelson, Old Dominion University, Norfolk, VA

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 2007, Pages 39–48
https://doi.org/10.1145/1255175.1255182

Published: 18 June 2007

ABSTRACT

When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment in which we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler, Warrick, which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including birth rate, decay, and age of resources. We evaluate the reconstructions against the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, number of hops, and resource age were the three most significant factors in determining whether a resource would be recovered from the WI.

Index Terms

Factors affecting website reconstruction from the web infrastructure
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems

Published in
JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
June 2007
534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175
General Chair:
Edie Rasmussen, University of British Columbia, Canada
Program Chairs:
Ray R. Larson, University of California, Berkeley
Elaine Toms, Dalhousie University, Canada
Shigeo Sugimoto, University of Tsukuba, Japan
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
digital preservation
search engine caches
web archiving
Qualifiers
- Article
Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions, 28%