ABSTRACT
This paper presents a text digitization system for Nom historical documents that employs image binarization, character segmentation, and character recognition. It incorporates two versions of offline character recognition: one for automatic classification and the other for verification and correction by an operator. Both versions use the same recognition method but are trained on two different sets of training patterns, with 7,601 and 32,733 categories. The recognition method uses the Generalized Learning Vector Quantization (GLVQ) algorithm for coarse classification and the Modified Quadratic Discriminant Function (MQDF2) for fine classification. Because ground-truthed sample patterns are not available, sample character patterns are generated artificially from 27 fonts of Chinese, Japanese, and Nom characters. To accelerate large-scale recognition, the coarse classification process uses the kd-tree algorithm. The system also provides an interface through which an operator can verify and correct the results of image binarization, character segmentation, and character recognition.
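As an illustration of how a kd-tree can speed up coarse classification, the following minimal sketch builds a kd-tree over class prototype vectors and returns the labels of the k nearest prototypes as candidates for a fine classifier. It is not the system's implementation: the function names, the 2-D toy vectors, and the labels `c0` to `c3` are hypothetical, the prototypes are given by hand rather than learned by GLVQ, and the MQDF2 fine stage is not shown.

```python
import heapq


def build_kdtree(points, depth=0):
    """Build a kd-tree from a list of (vector, label) pairs.

    Each node is a tuple (point, axis, left_subtree, right_subtree).
    """
    if not points:
        return None
    axis = depth % len(points[0][0])           # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2                     # median point splits the space
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(tree, query, k):
    """Return the labels of the k prototypes nearest to query, closest first."""
    heap = []  # max-heap on distance, stored as (-squared_distance, label)

    def visit(node):
        if node is None:
            return
        (vec, label), axis, left, right = node
        d = sq_dist(vec, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, label))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, label))
        diff = query[axis] - vec[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Search the far side only if the splitting plane is closer than
        # the current k-th best candidate (or fewer than k were found).
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(tree)
    return [label for _, label in sorted(heap, reverse=True)]


# Hypothetical prototypes: one 2-D vector per class label.
prototypes = [((0.0, 0.0), "c0"), ((1.0, 0.0), "c1"),
              ((0.0, 2.0), "c2"), ((5.0, 5.0), "c3")]
tree = build_kdtree(prototypes)
candidates = k_nearest(tree, (0.9, 0.1), k=2)  # shortlist for a fine classifier
print(candidates)
```

In a coarse-to-fine scheme of this kind, only the shortlisted candidate classes are passed to the more expensive fine classifier, so the cost of the fine stage depends on k rather than on the total number of categories.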
Index Terms
- Construction of a text digitization system for Nom historical documents