ABSTRACT
This paper describes the creation of custom OCR profiles for improved recognition of text from historical documents. The presented workflow is based on tools developed by Poznań Supercomputing and Networking Center within the SYNAT project. OCR customization consists of three steps. The first two are the preparation of training material in a dedicated web application called Cutouts, and the processing of that material with a command-line tool called page-generator. The results of the latter step are passed to the Tesseract OCR engine. In the final step, Tesseract creates a recognition profile that can be used for OCR in the Virtual Transcription Laboratory (VTL). The described tools make it possible to crowdsource the most tedious parts of this process: training material preparation and post-OCR correction.
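As a rough illustration of the last step, the sketch below shows how a finished recognition profile might be passed to the Tesseract command-line tool outside VTL. It is not taken from the paper: the profile name `pol_hist`, the file names, and the helper `build_tesseract_command` are hypothetical, and it assumes the profile is available as a `.traineddata` file in the given tessdata directory. Only the standard `-l` and `--tessdata-dir` options of Tesseract are used.

```python
def build_tesseract_command(image, output_base, profile, tessdata_dir=None):
    """Build the argument list for one Tesseract run with a custom profile.

    `profile` is the name of the trained profile (hypothetical here), i.e.
    the file name of `<profile>.traineddata` without its extension.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if tessdata_dir is not None:
        # Directory that holds the custom .traineddata file (assumed layout).
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    # Select the custom profile instead of a stock language model.
    cmd += ["-l", profile]
    return cmd


if __name__ == "__main__":
    # Hypothetical file and profile names, for illustration only.
    cmd = build_tesseract_command("page_001.png", "page_001", "pol_hist",
                                  tessdata_dir="profiles")
    print(" ".join(cmd))
    # To actually run OCR (requires Tesseract to be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Returning the command as a list rather than a shell string keeps the example testable without Tesseract installed and avoids quoting problems with file names.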