
Names, Right or Wrong: Named Entities in an OCRed Historical Finnish Newspaper Collection


ABSTRACT

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and many types of entities: newspapers, fiction, historical records; persons, locations, chemical compounds, protein families, animals, etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary [16]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report evaluation results of NER on data from Digi, a digitized historical Finnish newspaper collection. The experiments, results and discussion of this research serve the development of the Web collection of historical Finnish newspapers.

The Digi collection contains 1,960,921 pages of newspaper material from the years 1771--1910, in both Finnish and Swedish. We use only the Finnish-language material in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70--75% [7]. Our baseline NER tagger is FiNER, a rule-based tagger for Finnish provided by the FIN-CLARIN consortium. Three other available tools are also evaluated: the Finnish Semantic Tagger (FST), Connexor's NER tool and Polyglot's NER.
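As a rough illustration of what such tagging involves, the following is a minimal sketch of how the last of these tools, Polyglot's NER, can be run on a Finnish sentence and scored at the entity level. The example sentence, the tiny gold annotation and the scoring code are illustrative assumptions of ours, not data or evaluation code from the paper.

    # Minimal sketch (our illustration, not the paper's evaluation code):
    # tag one Finnish sentence with Polyglot's NER and score it at the
    # entity level against a tiny hand-made gold annotation.
    # Requires the Finnish models, e.g.: polyglot download embeddings2.fi ner2.fi
    from polyglot.text import Text

    # Hypothetical example sentence and gold entities (tag, surface form).
    sentence = "Sanomalehti Suometar perustettiin Helsingissä vuonna 1847."
    gold = {("I-ORG", "Suometar"), ("I-LOC", "Helsingissä")}

    text = Text(sentence, hint_language_code="fi")
    predicted = {(chunk.tag, " ".join(chunk)) for chunk in text.entities}

    # Entity-level precision, recall and F1 over exact (tag, surface) matches.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

Whether "Suometar" actually comes back tagged as an organization depends on the downloaded model; the point of the sketch is only the overall flow of tagging and entity-level scoring used in this kind of evaluation.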

References

  1. Beatrice Alex and John Burns. 2014. Estimating and Rating the Quality of Optically Character Recognised Text. In DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, 97--102. http://dl.acm.org/citation.cfm?id=2595214.
  2. Marcia Bates. 2007. What is Browsing -- really? A Model Drawing from Behavioural Science Research. Information Research 12. http://www.informationr.net/ir/12-4/paper330.html.
  3. M-L. Bremer-Laamanen. 2014. In the Spotlight for Crowdsourcing. Scandinavian Library Quarterly, 1, 18--21.
  4. Gregory Crane and Alison Jones. 2006. The Challenge of Virginia Banks: An Evaluation of Named Entity Analysis in a 19th-Century Newspaper Collection. In Proceedings of JCDL'06, June 11-15, 2006, Chapel Hill, North Carolina, USA. http://repository01.lib.tufts.edu:8080/fedora/get/tufts:PB.001.001.00007/Archival.pdf.
  5. Maud Ehrmann, Giovanni Colavizza, Yannick Rochat and Frédéric Kaplan. 2016. Diachronic Evaluation of NER Systems on Old Newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), 97--107. https://www.linguistics.rub.de/konvens16/pub/13_konvensproc.pdf.
  6. Kimmo Kettunen, Timo Honkela, Krister Lindén, Pekka Kauppinen, Tuula Pääkkönen and Jukka Kervinen. 2014. Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods. In Proceedings of IFLA 2014, Lyon. http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-honkela-en.pdf.
  7. Kimmo Kettunen and Tuula Pääkkönen. 2016. Measuring Lexical Quality of a Historical Finnish Newspaper Collection -- Analysis of Garbled OCR Data with Basic Language Technology Tools and Means. In Proceedings of LREC 2016. http://www.lrec-conf.org/proceedings/lrec2016/pdf/17_Paper.pdf.
  8. Kimmo Kettunen, Tuula Pääkkönen and Mika Koistinen. 2016. Between Diachrony and Synchrony: Evaluation of Lexical Quality of a Digitized Historical Finnish Newspaper and Journal Collection with Morphological Analyzers. In Skadiņa, I. and Rozis, R. (Eds.), Human Language Technologies -- The Baltic Perspective, IOS Press, 122--129. http://ebooks.iospress.nl/volumearticle/45525.
  9. Dimitrios Kokkinakis, Jyrki Niemi, Sam Hardwick, Krister Lindén and Lars Borin. 2014. HFST-SweNER -- a New NER Resource for Swedish. In Proceedings of LREC 2014. http://www.lrec-conf.org/proceedings/lrec2014/pdf/391_Paper.pdf.
  10. Laura Löfberg, Scott Piao, Paul Rayson, Jukka-Pekka Juntunen, Asko Nykänen and Krista Varantola. 2005. A Semantic Tagger for the Finnish Language. http://eprints.lancs.ac.uk/12685/1/cl2005_fst.pdf.
  11. Sunghwan Mac Kim and Steve Cassidy. 2015. Finding Names in Trove: Named Entity Recognition for Australian Texts. In Proceedings of the Australasian Language Technology Association Workshop, 57--65. https://aclweb.org/anthology/U/U15/U15-1007.pdf.
  12. Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
  13. Mónica Marrero, Julián Urbano, Sonia Sánchez-Cuadrado, Jorge Morato and Juan Miguel Gómez-Berbís. 2013. Named Entity Recognition: Fallacies, Challenges and Opportunities. Computer Standards & Interfaces 35, 482--489.
  14. Paul McNamee, James C. Mayfield and Christine D. Piatko. 2011. Processing Named Entities in Text. Johns Hopkins APL Technical Digest, 30, 31--40.
  15. David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone and Ralph Weischedel. 2000. Named Entity Extraction from Noisy Input: Speech and OCR. In Proceedings of the 6th Applied Natural Language Processing Conference, 316--324, Seattle, WA. http://www.anthology.aclweb.org/A/A00/A00-1044.pdf.
  16. David Nadeau and Satoshi Sekine. 2007. A Survey of Named Entity Recognition and Classification. Lingvisticae Investigationes 30, 3--26.
  17. Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers. In Proceedings of LREC 2016, Tenth International Conference on Language Resources and Evaluation. http://www.lrec-conf.org/proceedings/lrec2016/pdf/110_Paper.pdf.
  18. Clemens Neudecker, Lotte Wilms, Willem Jan Faber and Theo van Veen. 2014. Large-scale Refinement of Digital Historic Newspapers with Named Entity Recognition. In IFLA 2014. http://www.ifla.org/files/assets/newspapers/Geneva_2014/s6-neudecker_faber_wilms-en.pdf.
  19. Thomas Packer, Joshua Lutes, Aaron Stewart, David Embley, Eric Ringger, Kevin Seppi and Lee Jensen. 2010. Extracting Person Names from Diverse and Noisy OCR Text. In Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data. Toronto, ON, Canada: ACM. http://dl.acm.org/citation.cfm?id=1871845.
  20. Thierry Poibeau and Leila Kosseim. 2001. Proper Name Extraction from Non-Journalistic Texts. Language and Computers, 37, 144--157.
  21. Tuula Pääkkönen, Jukka Kervinen, Asko Nivala, Kimmo Kettunen and Eetu Mäkelä. 2016. Exporting Finnish Digitized Historical Newspaper Contents for Offline Use. D-Lib Magazine, July/August. http://www.dlib.org/dlib/july16/paakkonen/07paakkonen.html.
  22. Kepa Joseba Rodriguez, Mike Bryant, Tobias Blanke and Magdalena Luszczynska. 2012. Comparison of Named Entity Recognition Tools for Raw OCR Text. In Proceedings of KONVENS 2012 (LThist 2012 workshop), Vienna, September 21, 410--414. http://www.oegai.at/konvens2012/proceedings/60_rodriquez12w/.
  23. Miikka Silfverberg, Pekka Kauppinen and Krister Lindén. 2016. Data-Driven Spelling Correction Using Weighted Finite-State Methods. In Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, 51--59. https://aclweb.org/anthology/W/W16/W16-2406.pdf.
  24. Alexander Tkachenko, Timo Petmanson and Sven Laur. 2013. Named Entity Recognition in Estonian. In Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, 78--83. http://aclweb.org/anthology/W13-24.
  25. Elaine G. Toms. 2000. Understanding and Facilitating the Browsing of Electronic Text. International Journal of Human-Computer Studies, 52, 423--452.
Published in

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
June 2017, 179 pages
ISBN: 9781450352659
DOI: 10.1145/3078081
Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
Published: 1 June 2017
Article DOI: 10.1145/3078081.3078084



      Acceptance Rates

DATeCH2017 Paper Acceptance Rate: 29 of 37 submissions, 78%
Overall Acceptance Rate: 60 of 86 submissions, 70%
