ABSTRACT
When a user views an archived page using the archive's user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed, drifting away from the datetime originally selected. When browsing sparsely archived pages, this nearly silent drift can amount to many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, and compared the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive's Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that drift under the Sliding Target policy increases with the number of walk steps, the number of domains visited, and choice (the number of links available). The Sticky Target policy, by contrast, controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. Drift under the Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that, based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
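The difference between the two policies can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the procedure used in the study: it assumes the archive always returns the capture closest in time to the target datetime, that drift is the absolute difference between the capture shown and the datetime originally selected, and that under the Sliding Target policy the target becomes the datetime of the capture just viewed. The function names `closest_capture` and `walk` and the data layout are hypothetical.

```python
from datetime import datetime


def closest_capture(captures, target):
    """Return the capture datetime nearest to the target datetime (assumed archive behavior)."""
    return min(captures, key=lambda c: abs(c - target))


def walk(pages, start_target, policy):
    """Visit the pages in order and return the drift, in days, at each step.

    pages:  list of capture-datetime lists, one list per page along the walk.
    policy: "sticky" keeps the originally selected target datetime;
            "sliding" replaces the target with the datetime of the capture just viewed.
    """
    if policy not in ("sticky", "sliding"):
        raise ValueError("policy must be 'sticky' or 'sliding'")
    target = start_target
    drifts = []
    for captures in pages:
        memento = closest_capture(captures, target)
        # Drift is always measured against the datetime the user first selected.
        drifts.append(abs(memento - start_target).days)
        if policy == "sliding":
            target = memento
    return drifts


if __name__ == "__main__":
    # Hypothetical, sparsely archived pages along a three-step walk.
    pages = [
        [datetime(2006, 3, 1)],
        [datetime(2005, 7, 1), datetime(2006, 6, 1)],
        [datetime(2005, 6, 15), datetime(2007, 1, 1)],
    ]
    start = datetime(2005, 6, 1)
    print("sticky :", walk(pages, start, "sticky"))
    print("sliding:", walk(pages, start, "sliding"))
```

In this made-up walk, the first page has only one capture, so both policies start with the same drift. The Sticky Target policy then returns to captures near the original datetime, while the Sliding Target policy keeps moving further away with each step.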
Index Terms
- Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive