Published in: International Journal on Digital Libraries 2/2020

22-06-2018

Toward comprehensive event collections

Authors: Federico Nanni, Simone Paolo Ponzetto, Laura Dietz

Abstract

Web archives, such as the Internet Archive, preserve an unprecedented abundance of materials regarding major events and transformations in our society. In this paper, we present an approach for building event-centric sub-collections from such large archives, which include not only the core documents related to the event itself but, even more importantly, documents describing related aspects (e.g., premises and consequences). This is achieved by identifying relevant concepts and entities from a knowledge base, and then detecting their mentions in documents, which are interpreted as indicators of relevance. We extensively evaluate our system on two diachronic corpora, the New York Times Corpus and the US Congressional Record; additionally, we test its performance on the TREC KBA Stream Corpus and on the TREC-CAR dataset, two publicly available large-scale web collections.
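The core idea of the abstract, treating mentions of event-related entities as indicators of relevance, can be illustrated with a minimal sketch. It is not the authors' system: the entity list is supplied by hand rather than retrieved from a knowledge base, mentions are found by plain case-insensitive string matching rather than entity linking, and the function names `count_mentions` and `rank_documents` as well as the toy documents are hypothetical. Documents are scored by the raw frequency of entity mentions.

```python
import re
from collections import Counter


def count_mentions(text, entities):
    """Count case-insensitive whole-phrase mentions of each entity in text."""
    counts = Counter()
    for entity in entities:
        # Word boundaries avoid matching an entity name inside a longer word.
        pattern = r"\b" + re.escape(entity) + r"\b"
        n = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if n:
            counts[entity] = n
    return counts


def rank_documents(docs, entities):
    """Rank documents by total raw frequency of entity mentions, highest first."""
    scored = []
    for doc_id, text in docs.items():
        score = sum(count_mentions(text, entities).values())
        scored.append((doc_id, score))
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Assumption: a hand-picked entity list standing in for knowledge-base output.
EVENT_ENTITIES = ["Orange Revolution", "PORA", "Viktor Yushchenko"]

# Assumption: toy documents invented for illustration only.
DOCS = {
    "d1": "PORA activists joined the Orange Revolution protests in Kyiv.",
    "d2": "Viktor Yushchenko addressed supporters after the vote.",
    "d3": "The weather in Rome was mild this week.",
}

if __name__ == "__main__":
    for doc_id, score in rank_documents(DOCS, EVENT_ENTITIES):
        print(doc_id, score)
```

In this sketch a document that never names the event itself (such as `d2`) can still receive a non-zero score through a related entity, which is the property that lets a collection cover related aspects as well as the core event.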

Footnotes
1
This article builds upon and expands our previous work [41].
 
2
All gold standards are available at: http://federiconanni.com/event-collections/.
 
5
See, for example, Nick Ruest's collection on the Bataclan attack: http://ruebot.net/post/look-14939154-paris-bataclan-parisattacks-porteouverte-tweets.
 
8
The size of the collections varies with the topic, ranging from 18 documents to more than 6000.
 
14
THOMAS was a digital collection run by the Library of Congress. It offered, among other materials, the official record of proceedings and debate since the 101st Congress (1989–1990). In 2016, THOMAS was completely replaced by Congress.gov, which provides full-text access to daily Congressional Record issues dating from 1995 (beginning with the 104th Congress).
 
16
As also remarked in [19].
 
17
Cf. e.g., the first multi-party election in Algeria, 1991.
 
18
See, for example, the Italian general election in 1996.
 
20
A list of all events examined in our work is available here: https://federiconanni.com/event-collections/.
 
23
We also tested TF-IDF weighted frequency, but we did not obtain any significant improvement over raw frequency.
 
24
For example, the youth organization PORA is related to the aspects Protests and Internet usage of the event Orange Revolution and less to its Causes.
 
25
Which corresponds to EvAsp-GloVe.
 
27
A method marked with * is significantly better than all others to its left.
 
28
We detected and removed duplicate news articles from the initial pool of potentially relevant documents before conducting the final evaluation.
 
Literature
1.
Abujabal, A., Berberich, K.: Important events in the past, present, and future. In: WWW (2015)
2.
Allan, J.: Introduction to topic detection and tracking. In: Allan, J. (ed.) Topic Detection and Tracking. The Information Retrieval Series, vol. 12. Springer, Boston, MA (2002)
3.
Allan, J., Lavrenko, V., Jin, H.: First story detection in TDT is hard. In: CIKM (2000)
4.
Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: SIGIR (1998)
5.
Aslam, J.A., Ekstrand-Abueg, M., Pavlu, V., Diaz, F., Sakai, T.: TREC 2013 temporal summarization. In: TREC (2013)
6.
Au Yeung, C.-M., Jatowt, A.: Studying how the past is remembered: towards computational history through large scale text mining. In: CIKM (2011)
7.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: A Nucleus for a Web of Open Data. Springer, Berlin (2007)
8.
Bailey, S., Thompson, D.: Building the UK’s first public web archive. D-Lib 12, 1 (2006)
9.
Bethard, S.: ClearTK-TimeML: a minimalist approach to TempEval 2013. In: SEM (2013)
10.
Blanco, R., Ottaviano, G., Meij, E.: Fast and space-efficient entity linking for queries. In: WSDM (2015)
11.
Cano, I., Singh, S., Guestrin, C.: Distributed non-parametric representations for vital filtering: UW at TREC KBA. In: TREC (2014)
12.
Ceroni, A., Gadiraju, U., Matschke, J., Wingert, S., Fisichella, M.: Where the event lies: predicting event occurrence in textual documents. In: SIGIR (2016)
13.
Dalton, J., Dietz, L., Allan, J.: Entity query feature expansion using knowledge base links. In: SIGIR (2014)
14.
Dietz, L., Gamari, B.: TREC CAR: A Data Set for Complex Answer Retrieval. Version 1.5 (2017)
15.
Eisenstein, J., O’Connor, B., Smith, N.A., Xing, E.P.: A latent variable model for geographic lexical variation. In: EMNLP (2010)
16.
Ferragina, P., Scaiella, U.: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In: CIKM (2010)
17.
Glavaš, G., Šnajder, J.: Construction and evaluation of event graphs. Nat. Lang. Eng. 21, 04 (2015)
18.
Gomes, D., Miranda, J., Costa, M.: A survey on web archiving initiatives. In: TPDL (2011)
19.
Gossen, G., Demidova, E., Risse, T.: Extracting event-centric document collections from large-scale web archives. In: TPDL (2017)
20.
Graus, D., Peetz, M.-H., Odijk, D., de Rooij, O., de Rijke, M.: yourhistory-semantic linking for a personalized timeline of historic events. In: Workshop: LinkedUp Challenge at OKCon (2013)
21.
Gupta, D.: Event search and analytics: detecting events in semantically annotated corpora for search and analytics. In: WSDM (2016)
22.
Hasibi, F., Balog, K., Bratsberg, S.E.: Exploiting entity linking in queries for entity retrieval. In: ICTIR (2016)
23.
Hockx-Yu, H.: Access and scholarly use of web archives. Alex. J. Nat. Int. Libr. Inf. 25, 1–2 (2014)
24.
Hyde, S.D., Marinov, N.: Which elections can be lost? Polit. Anal. 20, 191–210 (2012)
25.
Jatowt, A., Au Yeung, C.-M.: Extracting collective expectations about the future from large text collections. In: CIKM (2011)
26.
Kedzie, C., McKeown, K., Diaz, F.: Summarizing disasters over time. In: Workshop on Social Good at SIGKDD (2014)
27.
Kotov, A., Zhai, C.: Tapping into knowledge base for concept feedback: leveraging ConceptNet to improve search results for difficult queries. In: WSDM (2012)
28.
Kuzey, E., Vreeken, J., Weikum, G.: A fresh look on knowledge bases: distilling named events from news. In: CIKM (2014)
29.
Lepore, J.: The cobweb: can the internet be archived? The New Yorker (2015)
30.
Lewis, D.: The TREC-4 filtering track. In: TREC (1995)
31.
Li, H.: Learning to rank for information retrieval and natural language processing. Synth. Lect. Hum. Lang. Technol. 7, 3 (2014)
32.
Liu, X., Fang, H.: Latent entity space: a novel retrieval approach for entity-bearing queries. Inf. Retr. J. 18, 6 (2015)
33.
Lyman, P., Kahle, B.: Archiving digital cultural artifacts. D-Lib 4, 7 (1998)
34.
Menini, S., Sprugnoli, R., Moretti, G., Bignotti, E., Tonelli, S., Lepri, B.: Ramble on: tracing movements of popular historical figures. In: EACL (2017)
35.
Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38, 11 (1995)
36.
Milligan, I., Ruest, N., Lin, J.: Content selection and curation for web archiving: the gatekeepers vs. the masses. In: JCDL (2016)
37.
Mishra, A., Berberich, K.: Expose: exploring past news for seminal events. In: WWW (2015)
38.
Mishra, A., Berberich, K.: Event digest: a holistic view on past events. In: SIGIR (2016)
39.
Nanni, F., Mitra, B., Magnusson, M., Dietz, L.: Benchmark for complex answer retrieval. In: ICTIR (2017a)
40.
Nanni, F., Ponzetto, S.P., Dietz, L.: Entity relatedness for retrospective analyses of global events. In: NLP+CSS at WebSci (2016)
41.
Nanni, F., Ponzetto, S.P., Dietz, L.: Building entity-centric event collections. In: JCDL (2017b)
42.
Nanni, F., Ponzetto, S.P., Dietz, L.: Entity-aspect linking: providing fine-grained semantics of entities in context. In: JCDL (2018)
43.
Nanni, F., Zhao, Y., Ponzetto, S.P., Dietz, L.: Enhancing domain-specific entity linking in DH. In: DH (2017c)
44.
Ntoulas, A., Cho, J., Olston, C.: What’s new on the web? In: WWW (2004)
45.
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)
46.
Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: WWW (2010)
47.
Raviv, H., Kurland, O., Carmel, D.: Document retrieval using entity-based language models. In: SIGIR (2016)
48.
Ristoski, P., Paulheim, H.: RDF2vec: RDF graph embeddings for data mining. In: ISWC (2016)
49.
Rollason-Cass, S., Reed, S.: Living movements, living archives: selecting and archiving web content during times of social unrest. N. Rev. Inf. Netw. 20, 1–2 (2015)
50.
Rovera, M., Nanni, F., Ponzetto, S.P., Goy, A.: Domain-specific named entity disambiguation in historical memoirs. In: CLiC-it (2017)
51.
Schich, M., Song, C., Ahn, Y.-Y., Mirsky, A., Martino, M., Barabási, A.-L., Helbing, D.: A network framework of cultural history. Science 345, 6196 (2014)
52.
Schuhmacher, M., Dietz, L., Ponzetto, S.P.: Ranking entities for web queries through text and knowledge. In: CIKM (2015)
53.
Singh, J., Nejdl, W., Anand, A.: Expedition: a time-aware exploratory search system designed for scholars. In: SIGIR (2016)
54.
Sprugnoli, R., Tonelli, S.: One, no one and one hundred thousand events: defining and processing events in an inter-disciplinary perspective. Nat. Lang. Eng. 23, 485 (2016)
55.
Tuck, J.: Web archiving in the UK: cooperation, legislation and regulation. Liber Q. 18, 3–4 (2008)
56.
Witten, I., Milne, D.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Workshop on Wikipedia and Artificial Intelligence at AAAI (2008)
57.
Wolfreys, J.: Readings: Acts of Close Reading in Literary Theory. Edinburgh University Press, Edinburgh (2000)
58.
Xiong, C., Callan, J.: EsdRank: connecting query and documents through external semi-structured data. In: CIKM (2015)
Metadata
Title
Toward comprehensive event collections
Authors
Federico Nanni
Simone Paolo Ponzetto
Laura Dietz
Publication date
22-06-2018
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2/2020
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-018-0246-x
