Published in: International Journal on Digital Libraries 2/2016

1 June 2016

A quantitative approach to evaluate Website Archivability using the CLEAR+ method

Authors: Vangelis Banos, Yannis Manolopoulos

Abstract

Website Archivability (WA) captures the core aspects of a website that determine whether it can be archived completely and accurately. In this work, we aim to measure WA: we introduce and elaborate on all aspects of CLEAR+, an extended version of the Credible Live Evaluation Method for Archive Readiness (CLEAR). We take a systematic approach that evaluates WA from several perspectives, which we call Website Archivability Facets. We then analyse archiveready.com, a web application we created as the reference implementation of CLEAR+, and discuss the implementation of the evaluation workflow. Finally, we conduct thorough evaluations of all aspects of WA, using real-world web data, to support the validity, reliability and benefits of our method.

Metadata
Title
A quantitative approach to evaluate Website Archivability using the CLEAR+ method
Authors
Vangelis Banos
Yannis Manolopoulos
Publication date
1 June 2016
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2/2016
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-015-0144-4