Script identification in printed bilingual documents

Dhanya, D.; Ramakrishnan, A. G.; Pati, Peeta Basa

doi:10.1007/BF02703313

Published: February 2002

Sadhana, Volume 27, pages 73–82 (2002)

D. Dhanya¹, A. G. Ramakrishnan¹ & Peeta Basa Pati¹

Abstract

Identifying the script of the text in multi-script documents is an important step in designing an OCR system for page analysis and recognition. Much work has already been reported in this area for Roman, Arabic, Chinese, Korean and Japanese scripts. In the Indian context, although some results have been reported, the task is still in its infancy. This paper presents a successful attempt to identify the script, at the word level, in a bilingual document containing Roman and Tamil scripts. Two approaches are proposed and thoroughly tested. In the first, each word is divided into three distinct spatial zones; the spatial spread of the word in the upper and lower zones, together with the character density, is used to identify the script. The second analyses the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. The proposed algorithms were tested on words in various font styles and sizes, and the results are quite encouraging.
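
The two feature families named in the abstract can be illustrated with a rough sketch. It is not the authors' implementation: the abstract does not give the zoning rule, the filter frequencies or the number of orientations, so the half-peak rule for the middle zone, the values `freq=0.25`, `sigma=2.0` and `n_orient=4`, and all function names below are assumptions chosen for illustration. Word images are plain lists of 0/1 rows to keep the example dependency-free; a practical system would likely use an array library.

```python
import math

def zone_features(word):
    """Upper/lower ink share and ink density of a binary word image.

    word: list of rows, each a list of 0/1 flags (1 = ink).
    Assumption: rows holding at least half the peak ink count form
    the middle zone; rows above/below it are the upper/lower zones.
    """
    profile = [sum(row) for row in word]      # horizontal projection profile
    total = sum(profile)
    if total == 0:                            # blank image: no ink to measure
        return 0.0, 0.0, 0.0
    peak = max(profile)
    core = [i for i, p in enumerate(profile) if p >= 0.5 * peak]
    top, bottom = core[0], core[-1]
    upper = sum(profile[:top]) / total        # ink share above the middle zone
    lower = sum(profile[bottom + 1:]) / total # ink share below the middle zone
    density = total / (len(word) * len(word[0]))
    return upper, lower, density

def gabor_kernel(freq, theta, sigma=2.0, half=4):
    """Even-symmetric Gabor kernel with its mean removed (zero DC).

    freq is in cycles per pixel, theta in radians; theta = 0 puts the
    modulation along x, so the filter responds to vertical strokes.
    """
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * freq * xr))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (2 * half + 1) ** 2
    return [[v - mean for v in row] for row in kernel]

def directional_energy(word, freq=0.25, n_orient=4):
    """Normalised filter-response energy per orientation (zero padding)."""
    h, w = len(word), len(word[0])
    energies = []
    for k in range(n_orient):
        kern = gabor_kernel(freq, k * math.pi / n_orient)
        half = len(kern) // 2
        energy = 0.0
        for i in range(h):
            for j in range(w):
                resp = 0.0
                for di in range(-half, half + 1):
                    for dj in range(-half, half + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            resp += kern[di + half][dj + half] * word[ii][jj]
                energy += resp * resp
        energies.append(energy)
    total = sum(energies) or 1.0              # avoid dividing by zero on blank input
    return [e / total for e in energies]
```

Either feature vector could then be passed to any two-class classifier to label a word as Roman or Tamil; the choice of classifier is left open here.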

Author information

Authors and Affiliations

Biomedical Laboratory, Department of Electrical Engineering, Indian Institute of Science, 560 012, Bangalore, India
D. Dhanya, A. G. Ramakrishnan & Peeta Basa Pati

Corresponding author

Correspondence to A. G. Ramakrishnan.

About this article

Cite this article

Dhanya, D., Ramakrishnan, A.G. & Pati, P.B. Script identification in printed bilingual documents. Sadhana 27, 73–82 (2002). https://doi.org/10.1007/BF02703313

Issue Date: February 2002
DOI: https://doi.org/10.1007/BF02703313
