Names, Right or Wrong: Named Entities in an OCRed Historical Finnish Newspaper Collection

Authors:
K. Kettunen, The National Library of Finland, Centre for Preservation and Digitization
T. Ruokolainen, The National Library of Finland, Centre for Preservation and Digitization

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, June 2017, Pages 181–186. https://doi.org/10.1145/3078081.3078084

Published: 01 June 2017

ABSTRACT

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts (newspapers, fiction, historical records) and to many types of entities (persons, locations, chemical compounds, protein families, animals, etc.). In general, the performance of a NER system depends on genre and domain, and the entity categories used also vary [16]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report evaluation results of NER on data from Digi, a digitized collection of historical Finnish newspapers. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers.

The Digi collection contains 1,960,921 pages of newspaper material from the years 1771-1910 in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70-75% [7]. Our baseline NER tagger is FiNER, a rule-based tagger of Finnish provided by the FIN-CLARIN consortium. Three other available tools are also evaluated: a Finnish Semantic Tagger (FST), Connexor's NER tool and Polyglot's NER.
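
The abstract does not say how the taggers are scored. A common way to compare NER output against a hand-annotated reference is precision, recall and F-score over exact matches of span and category. The sketch below assumes that convention; the function name, the tuple layout and the toy data are hypothetical illustrations and are not taken from the paper.

```python
from typing import Iterable, Tuple

# Hypothetical entity representation: (start_token, end_token, category).
Entity = Tuple[int, int, str]


def score_entities(
    gold: Iterable[Entity], predicted: Iterable[Entity]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span-and-category matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = [(0, 1, "PER"), (4, 4, "LOC"), (7, 8, "ORG"), (10, 10, "LOC")]
predicted = [(0, 1, "PER"), (4, 4, "ORG"), (7, 8, "ORG")]
precision, recall, f1 = score_entities(gold, predicted)
```

Exact matching is the strictest choice: the second predicted entity has the right span but the wrong category, so it counts as both a false positive and a missed gold entity. On noisy OCR text a more lenient, partial-match scheme may be preferable, but that is a design decision the abstract leaves open.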

References

[1] Beatrice Alex and John Burns. 2014. Estimating and Rating the Quality of Optically Character Recognised Text. In DATeCH '14 Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 97--102. http://dl.acm.org/citation.cfm?id=2595214.
[2] Marcia Bates. 2007. What is Browsing -- really? A Model Drawing from Behavioural Science Research. Information Research 12. http://www.informationr.net/ir/12-4/paper330.html.
[3] M-L. Bremer-Laamanen. 2014. In the Spotlight for Crowdsourcing. Scandinavian Librarian Quarterly, 1, 18--21.
[4] Gregory Crane and Alison Jones. 2006. The Challenge of Virginia Banks: An Evaluation of Named Entity Analysis in a 19th-Century Newspaper Collection. In Proceedings of JCDL'06, June 11-15, 2006, Chapel Hill, North Carolina, USA. http://repository01.lib.tufts.edu:8080/fedora/get/tufts:PB.001.001.00007/Archival.pdf.
[5] Maud Ehrmann, Giovanni Colavizza, Yannick Rochat and Frédéric Kaplan. 2016. Diachronic Evaluation of NER Systems on Old Newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 97--107. https://www.linguistics.rub.de/konvens16/pub/13_konvensproc.pdf.
[6] Kimmo Kettunen, Timo Honkela, Krister Lindén, Pekka Kauppinen, Tuula Pääkkönen and Jukka Kervinen. 2014. Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods. In Proceedings of IFLA 2014, Lyon. http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf.
[7] Kimmo Kettunen and Tuula Pääkkönen. 2016. Measuring Lexical Quality of a Historical Finnish Newspaper Collection -- Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. In Proceedings of LREC 2016. http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf.
[8] Kimmo Kettunen, Tuula Pääkkönen and Mika Koistinen. 2016. Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Skadiņa, I. and Rozis, R. Eds. Human Language Technologies -- The Baltic Perspective, IOS Press, 122--129. http://ebooks.iospress.nl/volumearticle/45525.
[9] Dimitrios Kokkinakis, Jyrki Niemi, Sam Hardwick, Krister Lindén, and Lars Borin. 2014. HFST-SweNER -- a New NER Resource for Swedish. In Proceedings of LREC 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/391_Paper.pdf.
[10] Laura Löfberg, Scott Piao, Paul Rayson, Jukka-Pekka Juntunen, Asko Nykänen and Krista Varantola. 2005. A semantic tagger for the Finnish language. http://eprints.lancs.ac.uk/12685/1/cl2005_fst.pdf.
[11] Sunghwan Mac Kim and Steve Cassidy. 2015. Finding Names in Trove: Named Entity Recognition for Australian Historical Newspapers. In Proceedings of Australasian Language Technology Association Workshop, 57--65. https://aclweb.org/anthology/U/U15/U15-1007.pdf.
[12] Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
[13] Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato and Juan Miguel Gómez-Berbís. 2013. Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces 35, 482--489.
[14] Paul McNamee, James C. Mayfield and Christine D. Piatko. 2011. Processing Named Entities in Text. Johns Hopkins APL Technical Digest, 30, 31--40.
[15] David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone and Ralph Weischedel. 2000. Named entity extraction from noisy input: Speech and OCR. In Proceedings of the 6th Applied Natural Language Processing Conference, 316--324, Seattle, WA. http://www.anthology.aclweb.org/A/A00/A00-1044.pdf.
[16] David Nadeau and Satoshi Sekine. 2007. A Survey of Named Entity Recognition and Classification. Linguisticae Investigationes 30, 3--26.
[17] Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers. In Proceedings of LREC 2016, Tenth International Conference on Language Resources and Evaluation. http://www.lrec-conf.org/proceedings/lrec2016/pdf/110_Paper.pdf.
[18] Clemens Neudecker, Lotte Wilms, Wille Jaan Faber and Theo van Veen. 2014. Large-scale Refinement of Digital Historic Newspapers with Named Entity Recognition. In IFLA 2014. http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-neudecker_faber_wilms-en.pdf.
[19] Thomas Packer, Joshua Lutes, Aaron Stewart, David Embley, Eric Ringger, Kevin Seppi and Lee Jensen. 2010. Extracting Person Names from Diverse and Noisy OCR Text. In Proceedings of the fourth workshop on Analytics for noisy unstructured text data. Toronto, ON, Canada: ACM. http://dl.acm.org/citation.cfm?id=1871845.
[20] Thierry Poibeau and Leila Kosseim. 2001. Proper Name Extraction from Non-Journalistic Texts. Language and Computers, 37, 144--157.
[21] Tuula Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Kettunen and Eetu Mäkelä. 2016. Exporting Finnish Digitized Historical Newspaper Contents for Offline Use. D-Lib Magazine, July/August. http://www.dlib.org/dlib/july16/paakkonen/07paakkonen.html.
[22] Kepa Joseba Rodriguez, Mike Bryant, Tobias Blank and Magdalena Luszczynska. 2012. Comparison of Named Entity Recognition Tools for raw OCR text. In Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna September 21, 410--414. http://www.oegai.at/konvens2012/proceedings/60_rodriquez12w/
[23] Miikka Silfverberg, Pekka Kauppinen and Krister Lindén. 2016. Data-Driven Spelling Correction Using Weighted Finite-State Methods. In Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, 51--59. https://aclweb.org/anthology/W/W16/W16-2406.pdf.
[24] Alexander Tkachenko, Timo Petmanson and Sven Laur. 2013. Named Entity Recognition in Estonian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, 78--83. http://aclweb.org/anthology/W13-24.
[25] Elaine G. Toms. 2000. Understanding and Facilitating the Browsing of Electronic Text. International Journal of Human-Computer Studies, 52, 423--452.

Index Terms
1. Computing methodologies
  1. Artificial intelligence

Published in

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
June 2017
179 pages
ISBN:9781450352659
DOI:10.1145/3078081

Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Finnish
historical newspaper collections
named entity recognition
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
DATeCH2017 Paper Acceptance Rate: 29 of 37 submissions, 78%. Overall Acceptance Rate: 60 of 86 submissions, 70%.