Abstract
Identifying the script of the text in multi-script documents is an important step in designing an OCR system for the analysis and recognition of a page. Much work has already been reported in this area for Roman, Arabic, Chinese, Korean and Japanese scripts. In the Indian context, although some results have been reported, the task is still in its infancy. The work presented in this paper identifies the script, at the word level, in bilingual documents containing Roman and Tamil scripts. Two different approaches are proposed and thoroughly tested. In the first method, each word is divided into three distinct spatial zones; the spatial spread of the word in the upper and lower zones, together with the character density, is used to identify the script. The second technique analyses the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Words in various font styles and sizes were used to test the proposed algorithms, and the results are quite encouraging.
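As a rough illustration of the second technique, the sketch below computes a normalised directional energy vector for a small word image using a bank of real Gabor kernels. It is only a minimal sketch under stated assumptions: the function names, the single frequency of 0.25 cycles per pixel, the four orientations, the Gaussian width and the kernel size are all hypothetical choices, since the abstract does not specify the filter bank or the classifier used in the paper.

```python
import math

def gabor_kernel(freq, theta, sigma=2.0, half=4):
    """Real Gabor kernel: Gaussian envelope times a cosine carrier along theta.
    sigma and half (kernel radius in pixels) are illustrative assumptions."""
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Coordinate measured along the carrier direction.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * freq * xr))
        kernel.append(row)
    # Subtract the mean so that uniform regions produce no response.
    mean = sum(map(sum, kernel)) / (len(kernel) ** 2)
    return [[v - mean for v in row] for row in kernel]

def directional_energy(image, kernel):
    """Sum of squared filter responses over every position where the kernel fits."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    energy = 0.0
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            r = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(k) for b in range(k))
            energy += r * r
    return energy

def energy_features(image, freqs=(0.25,), n_orient=4):
    """Energy per (frequency, orientation) pair, normalised to sum to 1."""
    feats = []
    for f in freqs:
        for n in range(n_orient):
            theta = n * math.pi / n_orient
            feats.append(directional_energy(image, gabor_kernel(f, theta)))
    total = sum(feats) or 1.0  # avoid division by zero on a blank image
    return [e / total for e in feats]

# Toy stand-in for a word image: vertical strokes repeating every 4 pixels.
strokes = [[1.0 if j % 4 < 2 else 0.0 for j in range(20)] for _ in range(12)]
features = energy_features(strokes)
print(features)
```

For the toy stroke pattern, most of the energy should fall in the first entry, whose carrier varies across the strokes. In a script identifier, a vector like this would be computed per word and passed to a trained classifier; which classifier and which features the authors actually used is not stated in the abstract.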
Cite this article
Dhanya, D., Ramakrishnan, A.G. & Pati, P.B. Script identification in printed bilingual documents. Sadhana 27, 73–82 (2002). https://doi.org/10.1007/BF02703313
DOI: https://doi.org/10.1007/BF02703313