research-article

Creation of custom recognition profiles for historical documents

Authors:
Adam Dudczak, Aleksandra Nowak, Tomasz Parkoła

Poznań Supercomputing and Networking Center, Poznań, Poland

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, May 2014, Pages 143–146
https://doi.org/10.1145/2595188.2595209

Published: 19 May 2014

ABSTRACT

This paper describes the creation of custom OCR profiles for improved recognition of text from historical documents. The presented workflow is based on tools developed by the Poznań Supercomputing and Networking Center within the SYNAT project. OCR customization consists of three steps. The first two are the preparation of training material in a dedicated web application called Cutouts and the processing of that material with a command-line tool called page-generator. The output of the second step is passed to the Tesseract OCR engine. In the final step, Tesseract creates a recognition profile that can be used for OCR in the Virtual Transcription Laboratory (VTL). The described tools make it possible to crowdsource the most tedious parts of this process: training material preparation and post-OCR correction.
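The Tesseract side of this workflow can be sketched roughly as follows. This is a minimal illustration, not the authors' tooling: it assumes the legacy Tesseract 3.x training mode (`box.train`) that was current in 2014, and the file name `pol.hist.exp0.tif` and profile name `pol_hist` are hypothetical. The abstract does not document the command-line interface of page-generator, so that step is not shown. The functions only build the command lines and do not run them.

```python
from pathlib import Path
from typing import List


def box_train_command(image: Path) -> List[str]:
    """Build the Tesseract 3.x command that turns one training image
    (with a matching .box file next to it) into a .tr feature file.

    Assumption: page-generator has already produced the image/box pair.
    """
    output_base = image.with_suffix("")  # drop only the final ".tif"
    return ["tesseract", str(image), str(output_base), "box.train"]


def ocr_command(image: Path, output_base: Path, profile: str) -> List[str]:
    """Build the command that runs OCR with a custom recognition profile.

    `profile` is the name of the trained data file (hypothetical here),
    selected with Tesseract's -l option.
    """
    return ["tesseract", str(image), str(output_base), "-l", profile]


if __name__ == "__main__":
    # Hypothetical file and profile names, for illustration only.
    print(box_train_command(Path("pol.hist.exp0.tif")))
    print(ocr_command(Path("scan_001.tif"), Path("scan_001"), "pol_hist"))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where Tesseract is installed. Between these two steps, the legacy training procedure also involves further tools (for example `unicharset_extractor`, `mftraining`, `cntraining`, and `combine_tessdata`), which are omitted here for brevity.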


Published in
DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN: 9781450325882
DOI: 10.1145/2595188
Program Chairs: Apostolos Antonacopoulos (University of Salford), Klaus U. Schulz (Ludwig-Maximilians-Universität München)
Copyright © 2014 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
DATeCH '14 paper acceptance rate: 31 of 49 submissions, 63%. Overall acceptance rate: 60 of 86 submissions, 70%.