Published in: International Journal on Digital Libraries, Issue 3-4/2014

1 August 2014

Profiling web archive coverage for top-level domain and content language

Authors: Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, Herbert Van de Sompel

Abstract

The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to the archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and use these profiles as resource descriptors. The profiles are used to match URI-lookup requests to the most probable web archives. We define \(Recall_{TM}(n)\) as the percentage of a TimeMap that was returned using \(n\) web archives. We find that sending queries to only the top three web archives (i.e., an 80% reduction in the number of queries) for any request reaches on average \(Recall_{TM}=0.96\). If we exclude the Internet Archive from the list, we can reach \(Recall_{TM}=0.647\) on average using only the remaining top three web archives.
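
The definition of \(Recall_{TM}(n)\) can be illustrated with a small sketch. It is not the authors' implementation: it assumes a full TimeMap can be summarized as a mapping from archive name to the number of mementos that archive holds for the requested URI, that each memento is counted by exactly one archive, and that a profile yields a ranked list of distinct archives. The function names, archive names, and counts below are hypothetical.

```python
from typing import Dict, List


def recall_tm(memento_counts: Dict[str, int],
              ranked_archives: List[str],
              n: int) -> float:
    """Fraction of the full TimeMap returned when only the top-n
    ranked archives are queried (an illustrative reading of Recall_TM(n))."""
    total = sum(memento_counts.values())
    if total == 0:
        # No archive holds the page, so there is nothing to recall.
        return 0.0
    selected = ranked_archives[:n]
    found = sum(memento_counts.get(archive, 0) for archive in selected)
    return found / total


def query_reduction(total_archives: int, n: int) -> float:
    """Share of queries saved by polling n archives instead of all of them."""
    return 1 - n / total_archives


# Hypothetical per-archive memento counts for one URI.
counts = {"archive_a": 90, "archive_b": 6, "archive_c": 0, "archive_d": 4}
# Hypothetical ranking produced by a profile for this URI.
ranking = ["archive_a", "archive_b", "archive_c", "archive_d"]

print(recall_tm(counts, ranking, 3))
print(query_reduction(15, 3))
```

With fifteen archives, polling only three saves twelve of the fifteen queries, which is where the 80% reduction quoted in the abstract comes from.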

Metadata
Title
Profiling web archive coverage for top-level domain and content language
Authors
Ahmed AlSum
Michele C. Weigle
Michael L. Nelson
Herbert Van de Sompel
Publication date
1 August 2014
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries, Issue 3-4/2014
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-014-0118-y
