DOI: 10.1145/1998076.1998100
research-article

How much of the web is archived?

Published: 13 June 2011

ABSTRACT

The Memento Project's archive access additions to HTTP have enabled the development of new user interfaces for web archive access. After experiencing this web time travel, the inevitable question that comes to mind is "How much of the Web is archived?" This question is studied by approximating the Web via sampling URIs from DMOZ, Delicious, Bitly, and search engine indexes, and measuring the number of archived copies available in various public web archives. The results indicate that 35%-90% of URIs have at least one archived copy, 17%-49% have two to five copies, 1%-8% have six to ten copies, and 8%-63% have at least ten copies. The number of URI copies varies as a function of time, but only 14.6%-31.3% of URIs are archived more than once per month.
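The bucketing described in the abstract can be illustrated with a minimal sketch. It assumes the number of archived copies per sampled URI has already been collected into a dictionary; the function name, the bucket labels, and the sample data are hypothetical and are not taken from the paper. The last bucket is treated here as "more than ten" so that it does not overlap the six-to-ten bucket, which is also an assumption.

```python
from collections import Counter

# Hypothetical bucket labels, loosely following the ranges in the abstract.
BUCKETS = ("at_least_one", "two_to_five", "six_to_ten", "more_than_ten")


def copy_count_distribution(copies_per_uri):
    """Return the share of sampled URIs in each copy-count bucket.

    copies_per_uri maps a URI to the number of archived copies found for it.
    "at_least_one" overlaps the other buckets by design, since a URI with
    three copies also has at least one copy.
    """
    total = len(copies_per_uri)
    if total == 0:
        raise ValueError("need at least one sampled URI")

    counts = Counter()
    for n in copies_per_uri.values():
        if n >= 1:
            counts["at_least_one"] += 1
        if 2 <= n <= 5:
            counts["two_to_five"] += 1
        elif 6 <= n <= 10:
            counts["six_to_ten"] += 1
        elif n > 10:  # assumption: strictly more than ten
            counts["more_than_ten"] += 1

    return {name: counts[name] / total for name in BUCKETS}


# Made-up sample, for illustration only.
sample = {
    "http://example.com/a": 0,
    "http://example.com/b": 1,
    "http://example.com/c": 3,
    "http://example.com/d": 12,
}
shares = copy_count_distribution(sample)
for name in BUCKETS:
    print(f"{name}: {shares[name]:.0%}")
```

Running the same function once per sample set (for example, one call per URI source) would give one percentage per bucket and source, which is the kind of range the abstract reports.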

Published in

JCDL '11: Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries
June 2011
500 pages
ISBN: 9781450307444
DOI: 10.1145/1998076

Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 415 of 1,482 submissions, 28%
