nach oben

Erschienen in:

2017 | Supplement | Buchkapitel

Extracting Event-Centric Document Collections from Large-Scale Web Archives

verfasst von : Gerhard Gossen, Elena Demidova, Thomas Risse

Erschienen in: Research and Advanced Technology for Digital Libraries

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Web archives are typically very broad in scope and extremely large in scale. This makes data analysis appear daunting, especially for non-computer scientists. These collections constitute an increasingly important source for researchers in the social sciences, the historical sciences and journalists interested in studying past events. However, there are currently no access methods that help users to efficiently access information, in particular about specific events, beyond the retrieval of individual disconnected documents. Therefore we propose a novel method to extract event-centric document collections from large scale Web archives. This method relies on a specialized focused extraction algorithm. Our experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables the extraction of event-centric collections for different event types.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel Automatic Hierarchical Categorization of Research Expertise Using Minimum Information

Nächstes Kapitel Information Governance Maturity Model Final Development Iteration

https://archive.org

https://archive-it.org/

http://netpreserve.org/openwayback

https://blog.archive.org/2016/10/24/beta-wayback-machine-now-with-site-search/

https://github.com/gerhardgossen/archive-recrawling

http://lucene.apache.org/core/

Code available at: https://github.com/gerhardgossen/dictionary-creator/

https://github.com/gerhardgossen/archive-recrawling

Aggarwal, C., Al-Garawi, F., Yu, P.S.: Intelligent crawling on the world wide web with arbitrary predicates. In: World Wide Web Conference, pp. 96–105 (2001)

Berberich, K., Bedathur, S.: Temporal diversification of search results. In: Workshop on Time-aware Information Access (TAIA 2013) (2013)

Bergmark, D., Lagoze, C., Sbityakov, A.: Focused crawls, tunneling, and digital libraries. In: Agosti, M., Thanos, C. (eds.) ECDL 2002. LNCS, vol. 2458, pp. 91–106. Springer, Heidelberg (2002). doi:10.1007/3-540-45747-X_7 CrossRef

Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Comput. Netw. 31(11–16) (1999)

Costa, M., Couto, F., Silva, M.: Learning temporal-dependent ranking models. In: SIGIR 2014 (2014)

Costa, M., Gomes, D., Silva, M.J.: The evolution of web archiving. IJDL (2016)

Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M.: Focused crawling using context graphs. In: VLDB (2000)

Dong, A., Chang, Y., Zheng, Z., Mishne, G., Bai, J., Zhang, R., Buchner, K., Liao, C., Diaz, F.: Towards recency ranking in web search. In: WSDM 2010 (2010)

Dong, H., Hussain, F.K.: SOF: a semi-supervised ontology-learning-based focused crawler. Concurrency Computat. Prac. Experience 25(12) (2013)

10.

Ehrig, M., Maedche, A.: Ontology-focused crawling of web documents. In: ACM SAC (2003)

11.

Farag, M.M.G., Lee, S., Fox, E.A.: Focused crawler for events. IJDL (2017)

12.

Gossen, G., Demidova, E., Risse, T.: iCrawl: Improving the freshness of web collections by integrating social web and focused web crawling. In: JCDL 2015 (2015)

13.

Gossen, G., Demidova, E., Risse, T.: The iCrawl Wizard – supporting interactive focused crawl specification. In: ECIR 2015 (2015)

14.

Gossen, G., Demidova, E., Risse, T.: Analyzing web archives through topic and event focused sub-collections. In: WebSci 2016. pp. 291–295, May 2016

15.

Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2(4), 219–229 (1999)CrossRef

16.

Jackson, A., Lin, J., Milligan, I., Ruest, N.: Desiderata for exploratory search interfaces to web archives in support of scholarly activities. In: JCDL2016 (2016)

17.

Jiang, J., Song, X., Yu, N., Lin, C.Y.: Focus: Learning to crawl web forums. IEEE TKDE 25(6) (2013)

18.

Kanhabua, N., Nørvåg, K.: A comparison of time-aware ranking methods. In: SIGIR 2011 (2011)

19.

Laranjeira, B., Moreira, V., Villavicencio, A., Ramisch, C., Finatto, M.J.: Comparing the quality of focused crawlers and of the translation resources obtained from them. In: LREC 2014 (2014)

20.

Mohr, G., Kimpton, M., Stack, M., Ranitovic, I.: Introduction to Heritrix, an archival quality web crawler. In: 4th International Web Archiving Workshop (2004)

21.

Nguyen, T.N., Kanhabua, N., Niederée, C., Zhu, X.: A time-aware random walk model for finding important documents in web archives. In: SIGIR 2015 (2015)

22.

Pant, G., Srinivasan, P.: Learning to crawl: Comparing classification schemes. ACM Trans. Inf. Syst. 23(4) (2005)

23.

Pant, G., Srinivasan, P., Menczer, F.: Crawling the web. In: Web Dynamics (2004)

24.

Pereira, P., Macedo, J., Craveiro, O., Madeira, H.: Time-aware focused web crawling. In: Rijke, M., Kenter, T., Vries, A.P., Zhai, C.X., Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 534–539. Springer, Cham (2014). doi:10.1007/978-3-319-06028-6_53 CrossRef

25.

Qin, J., Zhou, Y., Chau, M.: Building domain-specific web collections for scientific digital libraries. In: JCDL 2004 (2004)

26.

Risse, T., Demidova, E., Gossen, G.: What do you want to collect from the web? In: Proceedings of the Building Web Observatories Workshop (BWOW) 2014 (2014)

Titel: Extracting Event-Centric Document Collections from Large-Scale Web Archives
verfasst von: Gerhard Gossen
Elena Demidova
Thomas Risse
Verlag: Springer International Publishing
Buch: Research and Advanced Technology for Digital Libraries
Print ISBN: 978-3-319-67007-2

Electronic ISBN: 978-3-319-67008-9

Copyright-Jahr: 2017
DOI: https://doi.org/10.1007/978-3-319-67008-9_10

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"