ABSTRACT
When a user retrieves a page from a web archive, the page is marked with the acquisition datetime of the root resource, which effectively asserts "this is how the page looked at that datetime." However, embedded resources, such as images, are often archived at different datetimes than the main page. The presentation appears temporally coherent, but is composed from resources acquired over a wide range of datetimes. We examine the completeness and temporal coherence of composite archived resources (composite mementos) under two selection heuristics. We also compare the completeness and temporal coherence achieved using a single archive with the results achieved using multiple archives. We found that at most 38.7% of composite mementos are temporally coherent and that at most 17.9% (roughly 1 in 5) are both temporally coherent and 100% complete. Using multiple archives increases mean completeness by 3.1-4.1% but also reduces temporal coherence.
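To make the two quantities concrete, the following is a minimal sketch and not the measurement procedure used in the study. It assumes completeness can be pictured as the fraction of embedded resources for which some memento was found, and it reports the largest gap between the root's acquisition datetime and any embedded memento's datetime as a rough indicator of temporal spread. The function name `summarize_composite`, its parameters, and the example URIs are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


def summarize_composite(
    root_datetime: datetime,
    embedded: Dict[str, Optional[datetime]],
) -> Tuple[float, timedelta]:
    """Return (completeness, largest gap from the root datetime).

    `embedded` maps each embedded resource URI to the acquisition datetime
    of the memento chosen for it, or None when no memento was found.
    These definitions are simplifying assumptions for illustration only.
    """
    if not embedded:
        # No embedded resources: trivially complete, no spread.
        return 1.0, timedelta(0)

    found = [dt for dt in embedded.values() if dt is not None]
    completeness = len(found) / len(embedded)
    # Largest absolute distance between the root and any embedded memento.
    spread = max(
        (abs(dt - root_datetime) for dt in found),
        default=timedelta(0),
    )
    return completeness, spread


# Hypothetical composite memento: one root page, three embedded resources.
root = datetime(2014, 6, 1, 12, 0, 0)
completeness, spread = summarize_composite(
    root,
    {
        "http://example.com/logo.png": datetime(2014, 5, 20, 8, 0, 0),
        "http://example.com/style.css": datetime(2014, 6, 1, 12, 0, 5),
        "http://example.com/banner.jpg": None,  # missing from the archive
    },
)
print(f"completeness={completeness:.3f}, spread={spread}")
```

In this example the page carries the root's datetime, yet one embedded image was captured well before it and another is missing entirely, which is the kind of mismatch the abstract describes.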
Index Terms
- Only One Out of Five Archived Web Pages Existed as Presented