Introduction
The Web is in constant flux: new pages and Web sites appear daily, and old pages and sites disappear almost as quickly. One study estimates that about two percent of the Web disappears from its current location every week [2]. Although Web users have become accustomed to seeing the infamous "404 Not Found" page, they are more taken aback when the missing material is something they own, are responsible for, or have come to rely on.
Web archivists like those at the Internet Archive have responded to the Web's transience by archiving as much of it as possible, hoping to preserve snapshots of the Web for future generations [3]. Search engines have also responded by offering pages that were cached as a by-product of the indexing process. These straightforward archiving and caching efforts have been used by the public in unintended ways: individuals and organizations have used them to restore their own lost Web sites [5].
To automate the recovery of lost Web sites, we created a Web-repository crawler named Warrick that restores lost resources from the holdings of four Web repositories: Internet Archive, Google, Live Search (now Bing), and Yahoo [6]; we refer to these Web repositories collectively as the Web Infrastructure (WI). We call this after-loss recovery Lazy Preservation (see the sidebar for more information). Warrick can only recover what is accessible to the WI, namely the crawlable Web. Numerous resources cannot be found in the WI: password-protected content, pages without incoming links or protected by the robots exclusion protocol, and content hidden behind Flash or JavaScript interfaces. Most importantly, WI crawlers do not have access to the server-side components of a Web site (for example, scripts, configuration files, and databases).
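The basic recovery idea is easy to demonstrate against a single repository. The Python sketch below is a minimal illustration, not Warrick's actual implementation (Warrick queried each repository's own interfaces and reassembled entire sites, not single pages); it uses the Internet Archive's present-day public Wayback Machine "availability" API, which postdates the 2005 version of Warrick, to locate the archived copy of a lost URL closest to a desired date.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback Machine availability endpoint (Internet Archive only;
# Warrick itself drew on several repositories, not just this one).
WAYBACK_API = "https://archive.org/wayback/available"

def closest_snapshot(lost_url, timestamp=None):
    """Return the Wayback Machine URL of the archived copy of lost_url
    closest to an optional YYYYMMDD timestamp, or None if none exists."""
    params = {"url": lost_url}
    if timestamp:
        params["timestamp"] = timestamp
    query = WAYBACK_API + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(query) as response:
        data = json.load(response)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

# Example: look for a copy of a vanished page as it existed around mid-2005.
if __name__ == "__main__":
    print(closest_snapshot("http://example.com/", "20050601"))
```

A lookup like this returns nothing for resources the repositories never crawled, which is exactly the limitation noted above.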
Nevertheless, upon Warrick's public release in 2005, we received many inquiries about its usage and collected a handful of anecdotes about the Web sites individuals and organizations had lost and wanted to recover. Were these Web sites representative? What types of Web resources were people losing? Given the inherent limitations of the WI, were Warrick users recovering enough material to reconstruct the site? Were these losses changing their behavior, or was the availability of cached material reinforcing a "lazy" approach to preservation?
We constructed an online survey to explore these questions and conducted a set of in-depth interviews with survey respondents to clarify the results. Potential participants were solicited by us or the Internet Archive, or they found a link to the survey on the Warrick Web site. A total of 52 participants completed the survey regarding 55 lost Web sites, and seven of the participants allowed us to follow up with telephone or instant-messaging interviews. Participants were divided into two groups:
1. Personal loss: Those who had lost (and tried to recover) a Web site that they had personally created, maintained, or owned (34 participants who lost 37 Web sites).
2. Third party: Those who had recovered someone else's lost Web site (18 participants who recovered 18 Web sites).
References
1. Cox, L.P., Murray, C.D., and Noble, B.D. Pastiche: Making backup cheap and easy. SIGOPS Operating Systems Review 36, SI (2002), 285--298.
2. Fetterly, D., Manasse, M., Najork, M., and Wiener, J. A large-scale study of the evolution of Web pages. In Proceedings of WWW '03 (2003), 669--678.
3. Kahle, B. Preserving the Internet. Scientific American (Mar. 1997), 82--83.
4. Marshall, C., Bly, S., and Brun-Cottan, F. The long term fate of our personal digital belongings: Toward a service model for personal archives. In Proceedings of IS&T Archiving 2006 (2006), 25--30.
5. Marshall, C., McCown, F., and Nelson, M.L. Evaluating personal archiving strategies for Internet-based information. In Proceedings of IS&T Archiving 2007 (2007), 151--156.
6. McCown, F., Smith, J.A., Nelson, M.L., and Bollen, J. Lazy preservation: Reconstructing websites by crawling the crawlers. In Proceedings of ACM WIDM '06 (2006), 67--74.
7. McCown, F., Benjelloun, A., and Nelson, M.L. Brass: A queueing manager for Warrick. In Proceedings of IWAW '07 (June 2007).
8. McCown, F., Diawara, N., and Nelson, M.L. Factors affecting website reconstruction from the web infrastructure. In Proceedings of JCDL '07 (June 2007), 39--48.
9. Nelson, M.L., McCown, F., Smith, J.A., and Klein, M. Using the web infrastructure to preserve web pages. International Journal on Digital Libraries 6, 4 (2007), 327--349.
10. Rabin, M.O. Efficient dispersal of information for security, load balancing, and fault tolerance. Journal of the ACM 36, 2 (1989), 335--348.