ABSTRACT
Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts (newspapers, fiction, historical records) and to many types of entities (persons, locations, chemical compounds, protein families, animals, etc.). In general, the performance of a NER system is genre and domain dependent, and the entity categories used also vary [16]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report evaluation results of NER with data from Digi, a digitized Finnish historical newspaper collection. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers.
The Digi collection contains 1,960,921 pages of newspaper material from the years 1771-1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70-75% [7]. Our baseline NER tagger is FiNER, a rule-based tagger of Finnish provided by the FIN-CLARIN consortium. Three other available tools are also evaluated: a Finnish Semantic Tagger (FST), Connexor's NER tool and Polyglot's NER.
REFERENCES
- Beatrice Alex and John Burns. 2014. Estimating and Rating the Quality of Optically Character Recognised Text. In DATeCH '14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 97--102. http://dl.acm.org/citation.cfm?id=2595214
- Marcia Bates. 2007. What is Browsing -- really? A Model Drawing from Behavioural Science Research. Information Research 12. http://www.informationr.net/ir/12-4/paper330.html
- M-L. Bremer-Laamanen. 2014. In the Spotlight for Crowdsourcing. Scandinavian Librarian Quarterly, 1, 18--21.
- Gregory Crane and Alison Jones. 2006. The Challenge of Virginia Banks: An Evaluation of Named Entity Analysis in a 19th-Century Newspaper Collection. In Proceedings of JCDL'06, June 11-15, 2006, Chapel Hill, North Carolina, USA. http://repository01.lib.tufts.edu:8080/fedora/get/tufts:PB.001.001.00007/Archival.pdf
- Maud Ehrmann, Giovanni Colavizza, Yannick Rochat and Frédéric Kaplan. 2016. Diachronic Evaluation of NER Systems on Old Newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 97--107. https://www.linguistics.rub.de/konvens16/pub/13_konvensproc.pdf
- Kimmo Kettunen, Timo Honkela, Krister Lindén, Pekka Kauppinen, Tuula Pääkkönen and Jukka Kervinen. 2014. Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods. In Proceedings of IFLA 2014, Lyon. http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf
- Kimmo Kettunen and Tuula Pääkkönen. 2016. Measuring Lexical Quality of a Historical Finnish Newspaper Collection -- Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. In Proceedings of LREC 2016. http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf
- Kimmo Kettunen, Tuula Pääkkönen and Mika Koistinen. 2016. Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Skadiņa, I. and Rozis, R. Eds. Human Language Technologies -- The Baltic Perspective, IOS Press, 122--129. http://ebooks.iospress.nl/volumearticle/45525
- Dimitrios Kokkinakis, Jyrki Niemi, Sam Hardwick, Krister Lindén and Lars Borin. 2014. HFST-SweNER -- a New NER Resource for Swedish. In Proceedings of LREC 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/391_Paper.pdf
- Laura Löfberg, Scott Piao, Paul Rayson, Jukka-Pekka Juntunen, Asko Nykänen and Krista Varantola. 2005. A semantic tagger for the Finnish language. http://eprints.lancs.ac.uk/12685/1/cl2005_fst.pdf
- Sunghwan Mac Kim and Steve Cassidy. 2015. Finding Names in Trove: Named Entity Recognition for Australian. In Proceedings of Australasian Language Technology Association Workshop, 57--65. https://aclweb.org/anthology/U/U15/U15-1007.pdf
- Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
- Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato and Juan Miguel Gómez-Berbís. 2013. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces 35, 482--489.
- Paul McNamee, James C. Mayfield and Christine D. Piatko. 2011. Processing Named Entities in Text. Johns Hopkins APL Technical Digest, 30, 31--40.
- David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone and Ralph Weischedel. 2000. Named entity extraction from noisy input: Speech and OCR. In Proceedings of the 6th Applied Natural Language Processing Conference, 316--324, Seattle, WA. http://www.anthology.aclweb.org/A/A00/A00-1044.pdf
- David Nadeau and Satoshi Sekine. 2007. A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes 30, 3--26.
- Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers. In Proceedings of LREC 2016, Tenth International Conference on Language Resources and Evaluation. http://www.lrec-conf.org/proceedings/lrec2016/pdf/110_Paper.pdf
- Clemens Neudecker, Lotte Wilms, Wille Jaan Faber and Theo van Veen. 2014. Large-scale Refinement of Digital Historic Newspapers with Named Entity Recognition. In IFLA 2014. http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-neudecker_faber_wilms-en.pdf
- Thomas Packer, Joshua Lutes, Aaron Stewart, David Embley, Eric Ringger, Kevin Seppi and Lee Jensen. 2010. Extracting Person Names from Diverse and Noisy OCR Text. In Proceedings of the fourth workshop on Analytics for noisy unstructured text data. Toronto, ON, Canada: ACM. http://dl.acm.org/citation.cfm?id=1871845
- Thierry Poibeau and Leila Kosseim. 2001. Proper Name Extraction from Non-Journalistic Texts. Language and Computers, 37, 144--157.
- Tuula Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Kettunen and Eetu Mäkelä. 2016. Exporting Finnish Digitized Historical Newspaper Contents for Offline Use. D-Lib Magazine, July/August. http://www.dlib.org/dlib/july16/paakkonen/07paakkonen.html
- Kepa Joseba Rodriguez, Mike Bryant, Tobias Blank and Magdalena Luszczynska. 2012. Comparison of Named Entity Recognition Tools for raw OCR text. In Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna September 21, 410--414. http://www.oegai.at/konvens2012/proceedings/60_rodriquez12w/
- Miikka Silfverberg, Pekka Kauppinen and Krister Linden. 2016. Data-Driven Spelling Correction Using Weighted Finite-State Methods. In Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, 51--59. https://aclweb.org/anthology/W/W16/W16-2406.pdf
- Alexander Tkachenko, Timo Petmanson and Sven Laur. 2013. Named Entity Recognition in Estonian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, 78--83. http://aclweb.org/anthology/W13-24
- Elaine G. Toms. 2000. Understanding and Facilitating the Browsing of Electronic Text. International Journal of Human-Computer Studies, 52, 423--452.
- Names, Right or Wrong: Named Entities in an OCRed Historical Finnish Newspaper Collection