DOI: 10.1145/1255175.1255182
Article

Factors affecting website reconstruction from the web infrastructure

Published: 18 June 2007

ABSTRACT

When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment where we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler, Warrick, which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time, including birth rate, decay, and age of resources. We evaluate the reconstructions against the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website's resources. We found that Google's PageRank, number of hops, and resource age were the three most significant factors in determining whether a resource would be recovered from the WI.


Published in

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
June 2007
534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 415 of 1,482 submissions, 28%
