Elsevier

Pattern Recognition

Volume 47, Issue 3, March 2014, Pages 1187-1201
Recognition of Bangla compound characters using structural decomposition

https://doi.org/10.1016/j.patcog.2013.08.026

Highlights

  • The proper recognition of compound characters is a difficult problem due to their complex shapes.

  • In this paper, we propose a novel character recognition method for Bangla compound characters.

  • Our strategy is to decompose the compound character into simpler shape components.

  • Our technique is applicable to printed and handwritten characters.

  • Experiments are conducted on printed and handwritten Bangla compound characters.

Abstract

In this paper we propose a novel character recognition method for Bangla compound characters. Accurate recognition of compound characters is a difficult problem due to their complex shapes. Our strategy is to decompose a compound character into skeletal segments. The compound character is then recognized by extracting the convex shape primitives and using a template matching scheme. The novelty of our approach lies in the formulation of appropriate rules of character decomposition for segmenting the character skeleton into stroke segments and then grouping them for extraction of meaningful shape components. Our technique is applicable to both printed and handwritten characters. The proposed method performs well on complex-shaped compound characters that existing methods tend to confuse.

Introduction

Optical character recognition (OCR) is the automatic recognition, by computer, of optically scanned and digitized character images. Several OCR systems are commercially available [1], [2]. OCR is a necessary step for tasks such as converting books, documents, and office records into electronic form [3], which allows widely available text processing tools to be used for retrieval and dissemination of information [4]. Electronic text takes up less storage space than the image and can be edited, searched [5], [6], [9], and formatted for better display and printing. It can also be machine translated [7] and converted to speech [8].

OCR systems are available for Roman or English script [10] and for a few Asian scripts, such as Chinese [11], [12], Japanese [13], Korean [14], [15], and Arabic [16], [17]. In the last two decades, several OCR works have been reported on different Indian scripts, such as Bangla [18], Devanagari [19], Tamil [20], Malayalam [21], Oriya [22], Telugu [23], Kannada [24], Gurmukhi [25], Gujarati [26]. These works mainly deal with recognizing basic characters. However, the main challenge in designing an OCR for Indian scripts is to handle compound (also known as ‘conjunct’) characters which are formed by combining two or more basic characters. The complex shapes of these characters make the problem more difficult.

In this paper, we address the problem of compound character recognition in Bangla, which is the second-most popular language in India and among the top ten languages worldwide [27]. Bangla script is used to write the Assamese and Bengali (also called ‘Bangla’) languages. Bangla has a large number of compound characters (about 250), but some of them are now obsolete. Hence, in our work, we consider the most familiar character set (about 165 characters [28]) used in Bangla literature. Many of them are very complex in shape compared to the Devanagari compound characters [29], [30]. Prior work on Bangla OCR includes [18], [31], [32], [33], [34] for printed basic characters and [35], [36], [37], [38], [39], [40] for handwritten basic characters. However, few works in the literature [18], [36], [41], [42], [34] address the recognition of Bangla compound characters.

The research on Bangla compound character recognition can be categorized into two sets of methods, developed for printed and handwritten characters. Chaudhuri and Pal [18] have proposed a template matching approach for printed Bangla compound character recognition. In this method, text digitization, noise removal, skew detection, and correction are done as part of preprocessing. The text documents are segmented into lines, words, and characters using horizontal–vertical projection profile analysis and head line removal techniques. They have used eight stroke-based features for representing a character and a filled-circle feature for representing a dot.
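
The projection profile idea mentioned above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background; the function names and the `min_pixels` parameter are hypothetical and are not taken from [18], which also performs head line removal and character-level segmentation that are not shown here.

```python
from typing import List, Tuple


def horizontal_profile(image: List[List[int]]) -> List[int]:
    """Count foreground (1) pixels in every row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image: List[List[int]], min_pixels: int = 1) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows containing enough ink.

    min_pixels is an illustrative threshold (an assumption, not from the paper).
    """
    lines = []
    start = None
    for y, count in enumerate(horizontal_profile(image)):
        if count >= min_pixels and start is None:
            start = y                      # a text line begins
        elif count < min_pixels and start is not None:
            lines.append((start, y - 1))   # a text line ends on the previous row
            start = None
    if start is not None:                  # text line touching the bottom edge
        lines.append((start, len(image) - 1))
    return lines
```

Applying the same scan to the columns of a line image (a vertical profile) would give candidate word and character boundaries in the same way.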

Garain and Chaudhuri [41] have proposed a template matching technique for recognizing Bangla printed compound characters. Run number vectors are computed using horizontal and vertical scans organized with respect to the centroid of the pattern. The vector is normalized and abbreviated so as to make it invariant to scaling, insensitive to character style variation and more effective for complex shaped characters. Matching is performed within a group of compound characters.
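
As a rough illustration of what a run number vector records, the sketch below counts the black runs met by each horizontal and each vertical scan line of a binary image. It is only an assumption-based simplification: the centroid-relative ordering, normalization, and abbreviation steps described in [41] are not reproduced, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def count_runs(scan: Sequence[int]) -> int:
    """Number of maximal runs of 1s along one scan line."""
    runs = 0
    previous = 0
    for pixel in scan:
        if pixel == 1 and previous == 0:   # a new black run starts here
            runs += 1
        previous = pixel
    return runs


def run_number_vectors(image: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (horizontal, vertical) run counts for a binary image.

    horizontal[i] is the run count of row i; vertical[j] of column j.
    """
    horizontal = [count_runs(row) for row in image]
    vertical = [count_runs(column) for column in zip(*image)]
    return horizontal, vertical
```

Because run counts depend on how many strokes a scan line crosses rather than on stroke thickness, vectors of this kind tend to change little across font styles, which is consistent with the motivation given above.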

Sural and Das [34] have used the concept of fuzzy sets for recognizing printed Bangla compound characters. Hough transform is used to extract lines and circles. Attributes such as length, position and orientation are used to define a number of fuzzy sets. A three stage multi-layer perceptron (MLP) classifier, trained with a number of linguistic set memberships derived from the intersections on the basic fuzzy sets, can recognize the characters by their similarities to the different fuzzy pattern classes.

Pal et al. [42] have proposed an off-line Bangla handwritten compound character recognition method using a modified quadratic discriminant function (MQDF) classifier. The features used are mainly based on the directional information obtained from the arc tangent of the gradient.
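
To make the notion of directional information from the arc tangent of the gradient concrete, here is a minimal sketch that builds a magnitude-weighted histogram of gradient directions. The central-difference gradient, the number of bins, and the function name are assumptions chosen for illustration; the actual filters and block layout used in [42] are not described in this text.

```python
import math
from typing import List


def direction_histogram(image: List[List[float]], bins: int = 8) -> List[float]:
    """Histogram of gradient directions over the interior pixels of an image.

    Each pixel votes into the bin of atan2(gy, gx), weighted by gradient magnitude.
    """
    hist = [0.0] * bins
    height, width = len(image), len(image[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = image[y][x + 1] - image[y][x - 1]   # horizontal central difference
            gy = image[y + 1][x] - image[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue                              # flat region, no direction
            angle = math.atan2(gy, gx) % (2 * math.pi)
            index = int(angle / (2 * math.pi) * bins) % bins
            hist[index] += magnitude
    return hist
```

In practice such histograms are usually computed per image block and concatenated, so that the feature vector keeps some information about where each stroke direction occurs.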

Das et al. [36] have recognized Bangla handwritten basic and compound characters using two different classifiers: multi-layer perceptron (MLP) and support vector machine (SVM). Features used are based on shadow, longest run, and quad-tree. The MLP classifier is used for recognizing different groups of characters. A confusion matrix is prepared for the recognition results of the MLP classifier. Classes having a high degree of mutual misclassification are further handled using an SVM classifier, which gives a better accuracy.
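
The step of picking out classes with a high degree of mutual misclassification can be sketched as follows. This is a hypothetical illustration, not the procedure of [36]: the mutual confusion rate and the `threshold` parameter are assumptions introduced here.

```python
from typing import List, Tuple


def confused_pairs(confusion: List[List[int]], threshold: float) -> List[Tuple[int, int, float]]:
    """Find class pairs that a first-stage classifier mixes up often.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    The rate for a pair (i, j) is the number of i<->j mistakes divided by the
    total number of samples of both classes (an illustrative definition).
    """
    pairs = []
    n = len(confusion)
    for i in range(n):
        for j in range(i + 1, n):
            mistakes = confusion[i][j] + confusion[j][i]
            support = sum(confusion[i]) + sum(confusion[j])
            if support and mistakes / support >= threshold:
                pairs.append((i, j, mistakes / support))
    # Most confused pairs first; these would be passed to a second-stage classifier.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Each returned pair identifies two classes for which a dedicated second-stage classifier, such as the SVM mentioned above, could be trained.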

Proper recognition of compound characters for Bangla script is a difficult problem because of the complex structural characteristics of these characters. We highlight some typical characteristics of Bangla compound characters which render the problem quite difficult and challenging.

  1. Certain compound characters are very similar in shape and are referred to as confusing characters. Fig. 1 shows a representative set of pairs of confusing characters.

  2. A few compound characters, such as the three examples shown in the original as glyph images (each a combination of three basic characters), have very complex shapes. When a compound character is formed from three basic characters, its shape becomes very complex.

  3. The shapes of the handwritten versions of certain compound characters are quite different from their printed styles.

To address the aforementioned challenges we propose a novel character recognition method for Bangla compound characters, using topological features, extracted by analyzing the structural convexities of the script aksharas. We handle the complex shape of a compound character by decomposing it into convex shape primitives. The topological characteristics of the character are represented in the form of the layout of the shape primitives. We recognize the compound character by matching the extracted topological feature with the standard feature templates that we formulate for the compound characters. A unique aspect of this work is the formulation of character decomposition rules for getting simpler shapes within the character skeleton.

The rest of this paper is organized as follows. Section 2 describes the module for detecting compound characters in a dataset containing both basic and compound characters. The decomposition of compound characters into shape components is explained in Section 3. The skeletal segments are decomposed into strokes and represented as shape primitives using the method given in Section 4. In Section 5 we discuss the formulation of topological features and the similarity measure for feature templates for recognizing compound characters. Experimental results and related discussion are reported in Section 6. The concluding notes are given in Section 7.

Section snippets

Detection of compound characters

Compound characters can be detected and recognized by certain typical structural characteristics which distinguish them from the basic characters. The printing style and font information do not contribute to the character topology. Hence we prefer to work with the most simplistic representation of the character topology – its skeleton. For our purpose of detecting and recognizing compound characters it suffices to have a topological representation which is able to distinguish between even the

Skeletal decomposition of compound characters

In this section we explain our methodology to decompose the polygonized skeleton of a compound character. We look for simpler skeletal segments which tend to form cohesive or meaningful units. We present the decomposition rules for breaking a compound character into simpler structures. Character recognition using decomposition into simpler primitives has been used in the past. Hu et al. [50] have used singular points such as terminal, intersection, bend and directional points to decompose a

Identifying convex shapes in a skeletal segment

The skeletal segments obtained so far may have junction points and branches. This section describes the further steps applied to the skeletal segments to identify the convex shapes. In Subsection 4.1 we describe how to trace paths (strokes) in a skeletal segment. A stroke is a sequence of vertices which does not exhibit branching. Identifying convex segments from a stroke is discussed in Subsection 4.2. Larger convex segments are further broken down to obtain smaller convex segments.
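
One common way to cut a polyline stroke into convex pieces is to split it wherever the turning direction flips. The sketch below shows that heuristic only; it is an assumption made for illustration and not the decomposition rule of Subsection 4.2, which is not reproduced in this text. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def turn_direction(a: Point, b: Point, c: Point) -> int:
    """Sign of the cross product of edges a->b and b->c (+1, -1, or 0 if collinear)."""
    cross = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])
    return (cross > 0) - (cross < 0)


def split_into_convex_segments(stroke: List[Point]) -> List[List[Point]]:
    """Split a stroke where the turn sign changes.

    Adjacent segments share the edge on which the inflection occurs.
    """
    if len(stroke) < 3:
        return [list(stroke)]
    segments = []
    current = [stroke[0], stroke[1]]
    sign = 0                                   # turn sign of the current segment
    for i in range(1, len(stroke) - 1):
        turn = turn_direction(stroke[i - 1], stroke[i], stroke[i + 1])
        if turn != 0 and sign != 0 and turn != sign:
            segments.append(current)           # inflection: close the segment
            current = [stroke[i - 1], stroke[i], stroke[i + 1]]
            sign = turn
        else:
            if turn != 0:
                sign = turn
            current.append(stroke[i + 1])
    segments.append(current)
    return segments
```

A consistent turn sign is necessary but not sufficient for convexity (a spiral also turns one way throughout), so a fuller treatment would also bound the accumulated turning angle before accepting a segment.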

Recognition using topological features

Each convex segment of a character is labeled with (or mapped to) its best matching shape primitive. Our repertoire of shape primitives comprises nine shapes as shown in Fig. 10, which allow us to have a good representation of the convex shapes. The procedure to identify the matching shape primitive for each convex segment is discussed next (Fig. 11).

Consider a convex segment with $k$ vertex points, namely $p_1, p_2, \ldots, p_k$. The end points $p_1(x_1, y_1)$ and $p_k(x_k, y_k)$ together form the opening of the convex

Experimental results and discussion

We have implemented the Bangla compound character detection and recognition system in the C programming language on Fedora 10 running on an Intel Core2 Duo 2.20 GHz machine with 1 GB RAM. We collected printed characters from several heterogeneous printed documents. The handwritten documents were collected from individuals of different ages and professions. Samples were collected on normal writing paper using standard ball-point pens, gel-pens and ink-pens. We avoided pens which produce thick strokes like the

Conclusion

In this paper we have proposed novel topological features for recognizing Bangla compound characters. We have formulated decomposition rules to break a compound character into simpler shape components. The decomposition improves the efficacy of the features used and yields a better recognition performance. Our recognition scheme does not require any training with real examples. This is an advantage because many Bangla compound characters are used rarely and finding a sufficient number of

Conflict of interest

None declared.

References (53)

  • S. Sural et al., An MLP using Hough transform based fuzzy feature extraction for Bengali script recognition, Pattern Recognition Letters (1999)
  • S. Basu et al., A hierarchical approach to recognition of handwritten Bangla characters, Pattern Recognition (2009)
  • S. Bag et al., An improved contour-based thinning method for character images, Pattern Recognition Letters (2011)
  • P. Sarkar, Document image analysis for digital libraries, in: International Workshop on Research Issues in Digital...
  • T. Kameshiro, T. Hirano, Y. Okada, F. Yoda, A document image retrieval method tolerating recognition and segmentation...
  • A. Kumar, C.V. Jawahar, R. Manmatha, Efficient search in document image collections, in: Asian Conference on Computer...
  • D. Genzel, A.C. Popat, N. Spasojevic, M. Jahr, A. Senior, E. Ie, F.Y. Tang, Translation-inspired OCR, in:...
  • A. Bahrampour, W. Barkhoda, B.Z. Azami, Implementation of three text to speech systems for Kurdish language, in:...
  • S. Laroum et al., HYBRED: an OCR document representation for classification tasks, International Journal of Computer Science Issues (2011)
  • P.K. Wong et al., Off-line handwritten Chinese character recognition as a compound Bayes decision problem, IEEE Transactions on Pattern Analysis and Machine Intelligence (1998)
  • F. Kimura, OCR technologies for machine printed and hand printed Japanese text, in: Digital Document Processing: Major...
  • A. Amin, Off line Arabic character recognition: a survey, in: International Conference on Document Analysis and...
  • M.S. Khorsheed, Off-line Arabic character recognition - a review, Pattern Analysis and Applications (2002)
  • R. Jayadevan et al., Offline recognition of Devanagari script: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews (2011)
  • R.J. Kannan, A comparative study of optical character recognition for Tamil script, European Journal of Scientific Research (2009)
  • M.A. Rahiman, M.S. Rajasree, Printed Malayalam character recognition using back-propagation neural networks, in:...

Soumen Bag received the B.E. and M.Tech. degrees in Computer Science and Engineering from National Institute of Technology (NIT) Durgapur, India, in 2003 and 2008, respectively. From January 2004 to June 2006, he worked as a lecturer in the Department of Computer Science and Engineering at BCET Durgapur, India. He received his Ph.D. from Indian Institute of Technology (IIT) Kharagpur in 2013. Since August 2012, he has been working as an Assistant Professor at International Institute of Information Technology (IIIT), Bhubaneswar, India. He is the recipient of the Institute Gold Medal for First Class for his Master's degree. His research interests are in the areas of OCR for Indian Scripts, Document Image Analysis, Image Processing, and Pattern Recognition.

Gaurav Harit received his Ph.D. from Indian Institute of Technology Delhi in 2007. He worked as an Assistant Professor at IIT Kharagpur from 2008 to 2010. He has been an Assistant Professor at IIT Jodhpur since July 2010. His areas of interest include Document Image Analysis, Image Analysis, and Computer Vision.

Partha Bhowmick received his B.Tech. from IIT Kharagpur and his master's degree and Ph.D. from ISI Kolkata. He is presently an Associate Professor in the CSE Department, IIT Kharagpur. His primary research interests are in digital geometry, computer graphics, low-level image processing, approximate pattern matching, shape analysis, document image analysis, GIS, and biometrics.
