Published in: Cluster Computing 1/2018

25 May 2017

Urdu ligature recognition using multi-level agglomerative hierarchical clustering

Authors: Naila Habib Khan, Awais Adnan, Sadia Basar

Abstract

Optical character recognition (OCR) systems hold great significance in human-machine interaction. OCR has been the subject of intensive research, especially for Latin, Chinese and Japanese scripts. Comparatively little work has been done on Urdu OCR, owing to the complexities and segmentation errors associated with its cursive script. This paper proposes an Urdu OCR system that aims at ligature-level recognition of Urdu text. This ligature-based recognition approach overcomes the character-level segmentation problems associated with cursive scripts. A newly developed OCR algorithm is introduced that uses semi-supervised multi-level clustering to categorize the ligatures. Classification is performed using four machine learning techniques: decision trees, linear discriminant analysis, naive Bayes and k-nearest neighbor (k-NN). The system was implemented, and the results show accuracies of 62%, 61%, 73% and 90% for decision trees, linear discriminant analysis, naive Bayes and k-NN, respectively.
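The abstract does not spell out the clustering levels or the ligature features, so the following is only a minimal sketch of the general idea, under stated assumptions: ligatures are represented by hypothetical numeric feature vectors, each level applies average-linkage agglomerative clustering to a chosen subset of features, the resulting cluster paths serve as class labels, and a plain k-NN vote classifies a new sample. The function names (`agglomerative`, `multi_level_labels`, `knn_predict`), the choice of linkage, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two
    clusters with the smallest mean pairwise distance until only
    n_clusters remain. Returns a list of clusters (lists of indices).
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(euclidean(points[a], points[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def multi_level_labels(samples, levels):
    """Assign each sample a label tuple, one cluster index per level.

    levels is a list of (feature_indices, n_clusters) pairs (an assumed
    configuration). Each level re-clusters every group produced by the
    previous level, using only the listed features.
    """
    labels = [() for _ in samples]
    groups = [list(range(len(samples)))]
    for feature_indices, n_clusters in levels:
        next_groups = []
        for group in groups:
            pts = [tuple(samples[i][f] for f in feature_indices) for i in group]
            k = min(n_clusters, len(group))
            for c, members in enumerate(agglomerative(pts, k)):
                ids = [group[m] for m in members]
                for i in ids:
                    labels[i] = labels[i] + (c,)
                next_groups.append(ids)
        groups = next_groups
    return labels


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature vectors standing in for ligature descriptors.
toy_samples = [(1.0, 1.0), (1.1, 0.9), (1.0, 5.0), (0.9, 5.1),
               (8.0, 1.0), (8.1, 1.2), (8.0, 5.0), (7.9, 5.2)]
# Level 1 clusters on feature 0, level 2 on feature 1 (assumed settings).
toy_labels = multi_level_labels(toy_samples, [([0], 2), ([1], 2)])
print(knn_predict(toy_samples, toy_labels, (8.05, 5.1), k=3))
```

Clustering coarse features first and finer ones within each group keeps every clustering problem small, which is one plausible motivation for a multi-level design; the other three classifiers named in the abstract could be trained on the same cluster-derived labels.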

Metadata
Title
Urdu ligature recognition using multi-level agglomerative hierarchical clustering
Authors
Naila Habib Khan
Awais Adnan
Sadia Basar
Publication date
25 May 2017
Publisher
Springer US
Published in
Cluster Computing / Issue 1/2018
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-0916-2
