Published in: International Journal on Document Analysis and Recognition (IJDAR) 4/2015

01-12-2015 | Original Paper

Nastalique segmentation-based approach for Urdu OCR

Authors: Sarmad Hussain, Salman Ali, Qurat ul Ain Akram

Abstract

Much of the work on Arabic language optical character recognition (OCR) has focused on the Naskh writing style. The Nastalique style, used for most languages written in Arabic script across South Asia, is much more challenging to process because of its compactness, cursiveness, higher context sensitivity and diagonality. These properties make Nastalique writing more complex, with multiple letters horizontally overlapping each other. For these reasons, existing methods developed for Naskh do not work for Nastalique, and most work on Nastalique has therefore used non-segmentation methods. This paper presents a new segmentation-based approach to the analysis of the Nastalique style. It explains the complexity of Nastalique and why Naskh-based techniques cannot work for it, and it proposes a segmentation-based method for developing a Nastalique OCR, deriving principles and techniques for pre-processing and recognition. The OCR is developed for the Urdu language. The system is optimized using 79,093 instances of 5,249 main bodies derived from a corpus of 18 million words, giving a recognition accuracy of 97.11%. The system is then tested on document images of books, with a main body recognition accuracy of 87.44%. The work is extensible to other languages written in Nastalique.

Footnotes
2. Based on the graphemes used for the Nafees Nastalique font for Urdu.
3. A semi-circular stroke whose end-point traversal path is in the reverse order of the normal writing style.

Metadata
Title
Nastalique segmentation-based approach for Urdu OCR
Authors
Sarmad Hussain
Salman Ali
Qurat ul Ain Akram
Publication date
01-12-2015
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Document Analysis and Recognition (IJDAR) / Issue 4/2015
Print ISSN: 1433-2833
Electronic ISSN: 1433-2825
DOI
https://doi.org/10.1007/s10032-015-0250-2
