ABSTRACT
The Memento Project's archive access additions to HTTP have enabled the development of new web archive access user interfaces. After experiencing this web time travel, the inevitable question that comes to mind is "How much of the Web is archived?" This question is studied by approximating the Web via sampling URIs from DMOZ, Delicious, Bitly, and search engine indexes, and measuring the number of archived copies available in various public web archives. The results indicate that 35%-90% of URIs have at least one archived copy, 17%-49% have two to five copies, 1%-8% have six to ten copies, and 8%-63% have at least ten copies. The number of URI copies varies as a function of time, but only 14.6%-31.3% of URIs are archived more than once per month.
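As an illustration of how per-URI copy counts could be turned into the kind of percentages reported above, the following is a minimal sketch and not the authors' method. The names `bucket` and `coverage_summary`, the sample counts, and the decision to treat the top bin as "more than 10" (so that it does not overlap the six-to-ten bin) are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical bin labels modeled on the ranges in the abstract.
# Treating the top bin as "more than 10" is an assumption.
BINS = ["0", "1", "2-5", "6-10", "more than 10"]


def bucket(copies: int) -> str:
    """Map a number of archived copies to one of the BINS labels."""
    if copies < 0:
        raise ValueError("copy count cannot be negative")
    if copies == 0:
        return "0"
    if copies == 1:
        return "1"
    if copies <= 5:
        return "2-5"
    if copies <= 10:
        return "6-10"
    return "more than 10"


def coverage_summary(copy_counts: list[int]) -> dict[str, float]:
    """Return the percentage of sampled URIs in each bin,
    plus the share with at least one archived copy."""
    total = len(copy_counts)
    if total == 0:
        raise ValueError("sample is empty")
    tally = Counter(bucket(c) for c in copy_counts)
    summary = {label: 100.0 * tally[label] / total for label in BINS}
    archived = sum(1 for c in copy_counts if c >= 1)
    summary["at least 1"] = 100.0 * archived / total
    return summary


if __name__ == "__main__":
    # Made-up copy counts for ten sampled URIs, for illustration only.
    sample = [0, 1, 3, 7, 12, 0, 2, 40, 5, 0]
    for label, pct in coverage_summary(sample).items():
        print(f"{label}: {pct:.1f}%")
```

In practice, the copy count for each sampled URI would come from querying the archives, and the summary would be computed separately for each sample source (DMOZ, Delicious, Bitly, search engine indexes), which is why the abstract reports ranges rather than single values.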
Index Terms
- How much of the web is archived?