research-article

Construction of a text digitization system for Nom historical documents

Authors:
Truyen Van Phan, Tokyo Univ. of Agri. & Tech., Koganei, Tokyo, Japan
Masaki Nakagawa, Tokyo Univ. of Agri. & Tech., Koganei, Tokyo, Japan

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, May 2014, Pages 65–70
https://doi.org/10.1145/2595188.2595196

Published: 19 May 2014

ABSTRACT

This paper presents a text digitization system for Nom historical documents that employs image binarization, character segmentation, and character recognition. It incorporates two versions of offline character recognition: one for automatic classification and the other for verification and correction by an operator. Both versions use the same recognition method but are trained on two different sets of training patterns, with 7,601 and 32,733 categories. The recognition method uses the Generalized Learning Vector Quantization (GLVQ) algorithm for coarse classification and the Modified Quadratic Discriminant Function (MQDF2) for fine classification. Because ground-truthed sample patterns are not available, sample character patterns are generated artificially from 27 fonts of Chinese, Japanese, and Nom characters. To accelerate large-scale recognition, the kd-tree algorithm is used in the coarse classification process. The system also provides an interface through which an operator can verify and correct the results of image binarization, character segmentation, and character recognition.
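The coarse-to-fine recognition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the prototypes are taken as given (GLVQ training is not shown), the kd-tree search is a plain exact k-nearest-neighbor search, and the fine stage uses a diagonal-covariance quadratic discriminant as a simplified stand-in for MQDF2. All names (`build_kdtree`, `coarse_candidates`, `fine_score`, `classify`) and the toy 2-D data are hypothetical.

```python
import heapq
import math


def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (vector, label) prototypes."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(node, query, k, heap=None):
    """Collect the k nearest prototypes as a heap of (-distance, label)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    vec, label = node["point"]
    d = sq_dist(vec, query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, label))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, label))
    diff = query[node["axis"]] - vec[node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    k_nearest(node[near], query, k, heap)
    # Visit the far side only if it could still hold a closer prototype.
    if len(heap) < k or diff * diff < -heap[0][0]:
        k_nearest(node[far], query, k, heap)
    return heap


def coarse_candidates(tree, query, k):
    """Coarse stage: candidate categories, nearest prototype first."""
    labels = []
    for _, label in sorted(k_nearest(tree, query, k), reverse=True):
        if label not in labels:
            labels.append(label)
    return labels


def fine_score(query, mean, var):
    """Fine stage stand-in: diagonal quadratic discriminant (lower is better)."""
    return sum((x - m) ** 2 / v + math.log(v) for x, m, v in zip(query, mean, var))


def classify(tree, models, query, k=2):
    """Rank only the coarse candidates with the fine discriminant."""
    candidates = coarse_candidates(tree, query, k)
    return min(candidates, key=lambda c: fine_score(query, *models[c]))


# Toy 2-D data (hypothetical): one prototype and one (mean, variance) per category.
prototypes = [((0.0, 0.0), "A"), ((5.0, 5.0), "B"), ((10.0, 0.0), "C")]
models = {
    "A": ((0.0, 0.0), (0.1, 0.1)),
    "B": ((5.0, 5.0), (25.0, 25.0)),
    "C": ((10.0, 0.0), (1.0, 1.0)),
}
tree = build_kdtree(prototypes)
print(coarse_candidates(tree, (2.0, 2.0), 2))
print(classify(tree, models, (2.0, 2.0)))
```

In this toy setup the query (2.0, 2.0) lies closer to the prototype of category A, but category B has a much larger variance, so the fine stage prefers B. The point of the two-stage design is that the more expensive discriminant is evaluated only for the few categories the coarse search returns, which matters when the category count runs into the tens of thousands.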

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition
2. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Interest point and salient region detections

Published in
DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN:9781450325882
DOI:10.1145/2595188
Program Chairs: Apostolos Antonacopoulos (University of Salford), Klaus U. Schulz (Ludwig-Maximilians-Universität München)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chu Nôm
Nom character
Vietnamese
area Voronoi diagram
binarization
character segmentation
document image analysis
historical documents
offline character recognition
recursive xy cut
text digitization
Acceptance Rates
DATeCH '14 Paper Acceptance Rate: 31 of 49 submissions, 63%
Overall Acceptance Rate: 60 of 86 submissions, 70%