DOI: 10.1145/2700171.2791044
research-article

Only One Out of Five Archived Web Pages Existed as Presented

Published: 24 August 2015

ABSTRACT

When a user retrieves a page from a web archive, the page is marked with the acquisition datetime of the root resource, which effectively asserts "this is how the page looked at that datetime." However, embedded resources, such as images, are often archived at different datetimes than the main page. The presentation appears temporally coherent, but is composed from resources acquired over a wide range of datetimes. We examine the completeness and temporal coherence of composite archived resources (composite mementos) under two selection heuristics. The completeness and temporal coherence achieved using a single archive were compared to the results achieved using multiple archives. We found that at most 38.7% of composite mementos are temporally coherent and that at most 17.9% (roughly 1 in 5) are both temporally coherent and 100% complete. Using multiple archives increases mean completeness by 3.1-4.1% but also reduces temporal coherence.
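
To make the two measures in the abstract concrete, here is a minimal Python sketch. It is not the paper's formal framework, which the abstract does not spell out. It assumes a simplified model in which completeness is the share of embedded resources for which any memento was found, and it reports the largest gap between the root capture datetime and any embedded capture datetime as a rough indicator of how far the composite strays from the datetime it is labelled with. The class name, function names, and sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class EmbeddedResource:
    url: str
    # Capture datetime of the memento selected for this resource,
    # or None when the archive returned no memento for it.
    memento_datetime: Optional[datetime]


def completeness(embedded: List[EmbeddedResource]) -> float:
    """Fraction of embedded resources for which some memento was found."""
    if not embedded:
        return 1.0  # nothing embedded, so nothing can be missing
    found = sum(1 for r in embedded if r.memento_datetime is not None)
    return found / len(embedded)


def capture_spread(root_datetime: datetime,
                   embedded: List[EmbeddedResource]) -> timedelta:
    """Largest absolute gap between the root capture and any embedded capture."""
    gaps = [abs(r.memento_datetime - root_datetime)
            for r in embedded if r.memento_datetime is not None]
    return max(gaps, default=timedelta(0))


# Hypothetical composite memento: the root was captured on 17 July 2014,
# but its embedded resources were captured at other times or not at all.
root = datetime(2014, 7, 17, 12, 0, 0)
embedded = [
    EmbeddedResource("logo.png", datetime(2014, 7, 17, 12, 0, 5)),
    EmbeddedResource("style.css", datetime(2013, 1, 9, 8, 0, 0)),
    EmbeddedResource("photo.jpg", None),  # missing from the archive
    EmbeddedResource("script.js", datetime(2014, 7, 20, 12, 0, 0)),
]

page_completeness = completeness(embedded)
page_spread = capture_spread(root, embedded)
print(f"completeness: {page_completeness:.0%}")
print(f"largest capture gap: {page_spread}")
```

In this made-up example the page would be displayed under the single datetime 17 July 2014, even though one stylesheet was captured more than a year earlier and one image is missing entirely.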

      • Published in

        HT '15: Proceedings of the 26th ACM Conference on Hypertext & Social Media
        August 2015
        360 pages
        ISBN:9781450333955
        DOI:10.1145/2700171

        Copyright © 2015 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        HT '15 paper acceptance rate: 24 of 60 submissions (40%)
        Overall acceptance rate: 378 of 1,158 submissions (33%)
