Script identification in printed bilingual documents

Sadhana

Abstract

Identifying the script of the text in a multi-script document is an important step in designing an OCR system that analyses and recognises the page. Much work has already been reported in this area for Roman, Arabic, Chinese, Korean and Japanese scripts. In the Indian context, although some results have been reported, the task is still in its infancy. This paper presents a method for identifying the script, at the word level, in a bilingual document containing Roman and Tamil scripts. Two different approaches are proposed and tested. In the first, each word is divided into three distinct spatial zones, and the spatial spread of the word in the upper and lower zones, together with the character density, is used to identify the script. The second analyses the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Both algorithms were tested on words in various font styles and sizes, and the results are encouraging.
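
The two feature families named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the zone-boundary rule, the filter frequencies, the number of orientations or the classifier, so the half-of-peak threshold, the frequencies 0.1 and 0.2 cycles per pixel, the four orientations, the kernel size and the function names below are all assumptions chosen for illustration. Character density is omitted because its definition is not stated here.

```python
import numpy as np

def zone_spread(word):
    """Fraction of ink above and below the middle zone of a binary word image.

    Assumption: the middle zone is the band of rows whose ink count is at
    least half of the peak row count in the horizontal projection profile.
    """
    profile = word.sum(axis=1)
    core = np.flatnonzero(profile >= 0.5 * profile.max())
    top, bottom = core[0], core[-1]
    total = word.sum()
    upper = word[:top].sum() / total
    lower = word[bottom + 1:].sum() / total
    return upper, lower

def gabor_kernel(frequency, theta, sigma=3.0, size=15):
    """Even-symmetric Gabor kernel: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the carrier oscillates along direction theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * frequency * x_r)
    return kernel - kernel.mean()  # zero mean: flat regions give no response

def filter_same(image, kernel):
    """FFT-based linear convolution, cropped back to the input size."""
    h, w = image.shape
    kh, kw = kernel.shape
    shape = (h + kh - 1, w + kw - 1)
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(kernel, s=shape)
    full = np.fft.irfft2(spectrum, s=shape)
    top, left = kh // 2, kw // 2
    return full[top:top + h, left:left + w]

def directional_energy(word, frequencies=(0.1, 0.2), n_orientations=4):
    """Normalised response energy for each (frequency, orientation) pair.

    The output is ordered frequency-major: all orientations for the first
    frequency, then all orientations for the second, and so on.
    """
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    energies = []
    for f in frequencies:
        for t in thetas:
            response = filter_same(word, gabor_kernel(f, t))
            energies.append(np.sum(response ** 2))
    energies = np.array(energies)
    return energies / energies.sum()
```

Given a binary word image `word` (ink pixels set to 1), `zone_spread(word)` returns the upper and lower spread and `directional_energy(word)` returns an eight-element feature vector; either could then be passed to a classifier of the reader's choice.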

Author information

Corresponding author

Correspondence to A. G. Ramakrishnan.

About this article

Cite this article

Dhanya, D., Ramakrishnan, A.G. & Pati, P.B. Script identification in printed bilingual documents. Sadhana 27, 73–82 (2002). https://doi.org/10.1007/BF02703313

  • DOI: https://doi.org/10.1007/BF02703313
