DOI: 10.1145/2595188.2595196
research-article

Construction of a text digitization system for Nom historical documents

Published: 19 May 2014

ABSTRACT

This paper presents a text digitization system for Nom historical documents that employs image binarization, character segmentation and character recognition. It incorporates two versions of offline character recognition: one for automatic classification and the other for verification and correction by an operator. Both use the same recognition method but are trained on two different sets of training patterns, with 7,601 and 32,733 categories. For the recognition method, we use the Generalized Learning Vector Quantization (GLVQ) algorithm for coarse classification and the Modified Quadratic Discriminant Function (MQDF2) method for fine classification. Because ground-truthed sample patterns are not available, sample character patterns are generated artificially from 27 fonts of Chinese, Japanese and Nom characters. Moreover, to accelerate large-scale recognition, we use the kd-tree algorithm in the coarse classification process. The system also provides an interface through which an operator can verify and correct the results of image binarization, character segmentation and character recognition.
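The coarse-to-fine recognition step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy Gaussian data, and the parameter values are all assumptions made for illustration. In particular, the coarse stage here uses per-class sample means as prototypes in place of GLVQ-trained prototypes, and the MQDF2 constant delta is set to the mean of the minor eigenvalues, which is one common choice rather than something the abstract specifies. A small kd-tree selects a few candidate classes, and an MQDF2 distance then ranks only those candidates.

```python
import heapq
import numpy as np

def build_kdtree(points, labels, depth=0):
    """Recursively build a kd-tree over (prototype, class label) pairs."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    order = np.argsort(points[:, axis])
    points, labels = points[order], labels[order]
    mid = len(points) // 2
    return {
        "point": points[mid], "label": int(labels[mid]), "axis": axis,
        "left": build_kdtree(points[:mid], labels[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], labels[mid + 1:], depth + 1),
    }

def knn_search(node, x, k, heap=None):
    """Return the k nearest prototypes as a max-heap of (-squared_dist, label)."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d2 = float(np.sum((x - node["point"]) ** 2))
    if len(heap) < k:
        heapq.heappush(heap, (-d2, node["label"]))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, node["label"]))
    diff = float(x[node["axis"]] - node["point"][node["axis"]])
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn_search(near, x, k, heap)
    # Visit the far side only if the splitting plane is closer than the worst candidate.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn_search(far, x, k, heap)
    return heap

def fit_mqdf2(samples, k):
    """Estimate MQDF2 parameters for one class: mean, top-k eigenpairs, delta."""
    mean = samples.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(samples, rowvar=False))
    idx = np.argsort(eigvals)[::-1]              # sort eigenpairs in descending order
    eigvals, eigvecs = eigvals[idx], eigvecs[:, idx]
    delta = max(float(eigvals[k:].mean()), 1e-6)  # assumption: mean of minor eigenvalues
    return mean, np.maximum(eigvals[:k], 1e-6), eigvecs[:, :k], delta

def mqdf2_distance(x, params):
    """MQDF2 distance: principal-subspace Mahalanobis term plus residual / delta."""
    mean, lam, phi, delta = params
    diff = x - mean
    proj = phi.T @ diff
    residual = max(float(diff @ diff - proj @ proj), 0.0)
    d, k = x.shape[0], lam.shape[0]
    return float(np.sum(proj ** 2 / lam) + residual / delta
                 + np.sum(np.log(lam)) + (d - k) * np.log(delta))

def classify(x, tree, class_params, n_candidates=3):
    """Coarse stage picks candidate classes; fine stage ranks them with MQDF2."""
    candidates = {label for _, label in knn_search(tree, x, n_candidates)}
    return min(candidates, key=lambda c: mqdf2_distance(x, class_params[c]))

# Toy data (hypothetical): four well-separated Gaussian classes in four dimensions.
rng = np.random.default_rng(0)
dim, n_classes, k_major = 4, 4, 2
true_means = 10.0 * np.eye(n_classes, dim)
train = {c: true_means[c] + rng.normal(size=(200, dim)) for c in range(n_classes)}
class_params = {c: fit_mqdf2(s, k_major) for c, s in train.items()}
prototypes = np.stack([class_params[c][0] for c in range(n_classes)])
tree = build_kdtree(prototypes, np.arange(n_classes))
query = true_means[2]
predicted = classify(query, tree, class_params)
print(predicted)
```

The point of the two stages is cost: with thousands of categories, evaluating the quadratic discriminant for every class is expensive, so the kd-tree narrows the search to a handful of candidates before the more accurate but slower fine classifier runs.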


Published in

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN: 9781450325882
DOI: 10.1145/2595188

Copyright © 2014 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

DATeCH '14 paper acceptance rate: 31 of 49 submissions, 63%. Overall acceptance rate: 60 of 86 submissions, 70%.