Published in: International Journal on Digital Libraries 1/2020

27.10.2018

Towards extracting event-centric collections from Web archives

Authors: Gerhard Gossen, Thomas Risse, Elena Demidova


Abstract

Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, there are currently no access methods that go beyond the retrieval of individual, disconnected documents and help Web archive users efficiently access event-centric information in large-scale archives. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate efficient and intuitive access to information about past events. We address this problem by: (1) enabling users to define event-centric document collections intuitively through a Collection Specification; (2) developing a specialised extraction method that adapts focused crawling techniques to the Web archive setting; and (3) defining a function that judges the relevance of archived documents with respect to the Collection Specification, taking into account both their topical and their temporal relevance. Our extended experiments on the German Web archive (covering a period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
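The abstract does not state the form of the relevance function, so the following is only an illustrative sketch of how a topical score and a temporal score might be combined for one archived document. Everything in it is an assumption made for illustration: the names `CollectionSpec`, `topical_relevance`, `temporal_relevance` and `relevance`, the use of cosine similarity over term counts, the exponential decay outside the event interval, the `decay_days` parameter, and the choice to multiply the two scores. It should not be read as the function defined in the article.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class CollectionSpec:
    """Hypothetical stand-in for a Collection Specification."""
    keywords: list   # terms describing the event (assumed representation)
    start: date      # first day of the event interval
    end: date        # last day of the event interval


def topical_relevance(doc_terms, spec):
    """Cosine similarity between term-count vectors (an assumed measure)."""
    doc, ref = Counter(doc_terms), Counter(spec.keywords)
    dot = sum(doc[t] * ref[t] for t in doc.keys() & ref.keys())
    norm = (math.sqrt(sum(v * v for v in doc.values()))
            * math.sqrt(sum(v * v for v in ref.values())))
    return dot / norm if norm else 0.0


def temporal_relevance(doc_date, spec, decay_days=30.0):
    """1.0 inside the event interval, exponential decay outside (assumed)."""
    if spec.start <= doc_date <= spec.end:
        return 1.0
    if doc_date < spec.start:
        gap = (spec.start - doc_date).days
    else:
        gap = (doc_date - spec.end).days
    return math.exp(-gap / decay_days)


def relevance(doc_terms, doc_date, spec):
    """Combine both scores by multiplication (one possible choice)."""
    return topical_relevance(doc_terms, spec) * temporal_relevance(doc_date, spec)
```

With a multiplicative combination, a document scores highly only if it matches the event both topically and temporally; a weighted sum would instead let one aspect compensate for the other.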


Metadata
Title
Towards extracting event-centric collections from Web archives
Authors
Gerhard Gossen
Thomas Risse
Elena Demidova
Publication date
27.10.2018
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 1/2020
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-018-0258-6
