Published in: International Journal on Digital Libraries 3/2016

01-09-2016

Web archive profiling through CDX summarization

Authors: Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, David S. H. Rosenthal


Abstract

With the proliferation of public web archives, it is becoming more important to profile their contents, both to understand their immense holdings and to support routing of requests in the Memento aggregator. To save time, the Memento aggregator should poll only the archives that are likely to hold a copy of the requested URI. Using the index files produced after crawling, we can generate profiles that summarize each archive's holdings and inform the routing of the Memento aggregator’s URI requests. Previous work in profiling ranged from using full URIs (no false positives, but large profiles) to using only top-level domains (TLDs) (smaller profiles, but many false positives). This work explores strategies between these two extremes. In our experiments, we correctly identified about 78 % of the URIs as present or not present in the archive at less than 1 % relative cost compared to the complete knowledge profile, and 94 % of the URIs at less than 10 % relative cost, without any false negatives. Relative to the TLD-only profile, the registered domain profile doubled the routing precision, while the complete hostname plus one path segment gave a tenfold increase in routing precision.
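
The trade-off between profile size and routing precision can be illustrated with a minimal sketch. The key format (reversed, comma-separated host labels followed by path segments), the function names, the policy parameters, and the sample URIs are all assumptions made for illustration; they are not taken from the article. Keeping two host labels only approximates a registered domain, since a correct registered domain requires the Public Suffix List.

```python
from urllib.parse import urlsplit


def profile_key(uri, max_host_segments=None, max_path_segments=0):
    """Reduce a URI to a coarse lookup key (hypothetical key format).

    max_host_segments: number of host labels to keep, counted from the TLD;
                       None keeps the complete hostname.
    max_path_segments: number of leading path segments to keep.
    """
    # Add a scheme when it is missing so that urlsplit finds the host.
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    host = (parts.hostname or "").lower()
    labels = host.split(".")[::-1]  # reverse so the TLD comes first
    if max_host_segments is not None:
        labels = labels[:max_host_segments]
    segments = [s for s in parts.path.split("/") if s][:max_path_segments]
    return ",".join(labels) + ")/" + "/".join(segments)


def build_profile(archived_uris, **policy):
    """Summarize an archive's holdings as a set of keys under one policy."""
    return {profile_key(uri, **policy) for uri in archived_uris}


def should_poll(profile, uri, **policy):
    """Return True if the archive may hold the URI under the same policy."""
    return profile_key(uri, **policy) in profile


if __name__ == "__main__":
    holdings = ["http://www.example.com/news/1", "http://example.org/about"]
    tld_only = build_profile(holdings, max_host_segments=1)
    host_and_path = build_profile(holdings, max_path_segments=1)
    query = "http://www.example.com/sports/7"
    print(should_poll(tld_only, query, max_host_segments=1))
    print(should_poll(host_and_path, query, max_path_segments=1))
```

A coarser policy yields fewer distinct keys but matches more URIs that the archive does not actually hold (false positives). As long as the profile is built from the archive's complete index and queries use the same policy, a key that is missing from the profile means the URI is not in the archive, so the lookup produces no false negatives.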


Footnotes
4. CDX files are created as an index of the WARC [15] files generated by the Heritrix web crawler; see [13] for a description of the CDX file format.
7. In our dataset, Archive-It has 0.71 % non-HTTP entries in its CDX files, while UKWA has no non-HTTP entries.
Literature
3. Alam, S., Nelson, M.L., Van de Sompel, H., Balakireva, L., Shankar, H., Rosenthal, D.S.H.: Web archive profiling through CDX summarization. In: Proceedings of 19th international conference on theory and practice of digital libraries. TPDL 2015, vol. 9316, pp. 3–14. Poznań, Poland (2015)
4. AlNoamany, Y., AlSum, A., Weigle, M.C., Nelson, M.L.: Who and what links to the Internet Archive. Int. J. Digit. Librar. 14(3–4), 101–115 (2014)
5. AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.: Profiling web archive coverage for top-level domain and content language. Proc. Int. Conf. Theory Pract. Digit. Librar. TPDL 2013, 60–71 (2013)
6. AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.: Profiling web archive coverage for top-level domain and content language. Int. J. Digit. Librar. 14(3–4), 149–166 (2014)
8. Bornand, N.J., Balakireva, L., Van de Sompel, H.: Routing memento requests using binary classifiers. In: Proceedings of the 16th ACM/IEEE-CS on joint conference on digital libraries, JCDL ’16, pp. 63–72 (2016). doi:10.1145/2910896.2910899
9. Crockford, D.: The application/json media type for javascript object notation (JSON). RFC 4627 (2006)
10. Deutsch, P.: GZIP file format specification version 4.3. RFC 1952 (1996)
11. Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58(5), 702–709 (2007)
16. Liu, L.: Query routing in large-scale digital library systems. In: 15th International Conference on Data Engineering, 1999. Proceedings, pp. 154–163 (1999). doi:10.1109/ICDE.1999.754918
17. Meng, W., Yu, C., Liu, K.L.: Building efficient and effective metasearch engines. ACM Comput. Surv. (CSUR) 34(1), 48–89 (2002)
19. Sanderson, R.: Global web archive integration with memento. In: Proceedings of the 12th ACM/IEEE-CS joint conference on digital libraries, pp. 379–380. ACM, New York (2012)
22. Sporny, M., Kellogg, G., Lanthaler, M.: A JSON-based serialization for linked data. W3C Recommendation (2014)
24. Sugiura, A., Etzioni, O.: Query routing for web search engines: architecture and experiments. Comput. Netw. 33(1), 417–429 (2000)
25. Tran, T., Zhang, L.: Keyword query routing. IEEE Trans. Knowl. Data Eng. 26(2), 363–375 (2014)
27. Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP framework for time-based access to resource states—Memento. RFC 7089 (2013)
Metadata
Title
Web archive profiling through CDX summarization
Authors
Sawood Alam
Michael L. Nelson
Herbert Van de Sompel
Lyudmila L. Balakireva
Harihar Shankar
David S. H. Rosenthal
Publication date
01-09-2016
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 3/2016
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-016-0184-4
