
A novel approach for ranking spelling error corrections for Urdu

Language Resources and Evaluation

Abstract

This paper presents a scheme for ranking spelling error corrections for Urdu. Conventional spell-checking techniques do not provide an explicit ranking mechanism: ranking is either implicit in the correction algorithm or the corrections are not ranked at all. The research presented here shows that, for Urdu, phonetic similarity between the corrections and the erroneous word can serve as a useful parameter for ranking the corrections. Combined with a new technique, Shapex, which uses the visual similarity of characters for ranking, this gives an improvement of 23% in the accuracy of the one-best match compared with ranking on the basis of word frequencies alone.
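
As a rough illustration of the ranking idea described in the abstract, the following Python sketch orders candidate corrections by how many similarity codes (phonetic and shape) they share with the misspelled word, and uses word frequency only to break ties. The function `rank_corrections`, the two toy encoders over Latin letters, and the frequency counts are all hypothetical; the abstract does not specify how Shapex codes are built or how the signals are combined, so this is one possible arrangement and not the authors' method.

```python
from typing import Callable, Dict, List, Tuple


def rank_corrections(
    error: str,
    candidates: List[str],
    phonetic_code: Callable[[str], str],
    shape_code: Callable[[str], str],
    freq: Dict[str, int],
) -> List[str]:
    """Order candidates: more code matches first, then higher frequency."""
    err_ph = phonetic_code(error)
    err_sh = shape_code(error)

    def key(word: str) -> Tuple[int, int]:
        matches = int(phonetic_code(word) == err_ph) + int(shape_code(word) == err_sh)
        # Negate both values so that larger ones sort first.
        return (-matches, -freq.get(word, 0))

    return sorted(candidates, key=key)


# Toy encoders, purely for demonstration (assumptions, not from the paper).
def toy_phonetic(word: str) -> str:
    # Collapse letters that often sound alike into one class.
    return word.lower().translate(str.maketrans({"k": "c", "q": "c", "z": "s"}))


def toy_shape(word: str) -> str:
    # Collapse letters that look alike into one class.
    return word.lower().translate(str.maketrans({"n": "m", "l": "i"}))


# Made-up frequency counts for three candidate corrections of "kat".
freq = {"cat": 50, "cut": 80, "kit": 10}
ranked = rank_corrections("kat", ["cut", "kit", "cat"], toy_phonetic, toy_shape, freq)
print(ranked)
```

With these toy inputs, frequency alone would put "cut" first, whereas the phonetic match moves "cat" to the top, which is the kind of change in the one-best match that the abstract reports on.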


Notes

  1. The lexicon used for spell checking was a wordlist of 112,481 words prepared at the Center for Research in Urdu Language Processing, FAST-NU.

  2. The Urdu alphabet has 41 letters. Unicode introduces four additional characters, which are combinations of the letter hamza with other letters. This makes a total of 45 isolated characters in Urdu.

  3. Feroz Sons books: (i) Asma-e-Husna, (ii) Dhatain aur un ke istamalaat, (iii) Dil Batkay Ghay, (iv) Kufer. Iqbal Academy books: (i) 100 years Iqbal, (ii) Hayat-i-Iqbal, (iii) Iqbal droon-i-Khana, (iv) Khutbat-i-Iqbal, (v) Telmihat-o-Isharat-i-Iqbal, (vi) Tejdeed Fikhariyat-i-Islam, (vii) Bang-i-Draa, (viii) Baal-i-Jibreel, (ix) Zerb-i-Kaleem.

  4. Urdu is generally written in Nastaleeq style and the codes are assigned on the basis of shapes of letters in Nastaleeq font; for those Arabic-script-based languages that are written in other styles, the code assignment might be somewhat different.
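
Note 4 says that codes are assigned according to letter shapes. A minimal Soundex-style sketch of that idea in Python is given below. The grouping in `SHAPE_GROUPS` is an assumption made for illustration: it collects letters that share a base form in their isolated shapes and differ mainly in dots or small marks. It is not the code table used in the paper, and it ignores the positional shape changes that occur in Nastaleeq. The names `SHAPE_GROUPS`, `SHAPE_CLASS`, and `shape_code` are likewise hypothetical.

```python
# Hypothetical shape classes (NOT the paper's table). Escapes are used so the
# code points are unambiguous; the comments give the Unicode letter names.
SHAPE_GROUPS = [
    "\u0628\u067E\u062A\u0679\u062B",  # beh, peh, teh, tteh, theh
    "\u062C\u0686\u062D\u062E",        # jeem, tcheh, hah, khah
    "\u062F\u0688\u0630",              # dal, ddal, thal
    "\u0631\u0691\u0632\u0698",        # reh, rreh, zain, jeh
    "\u0633\u0634",                    # seen, sheen
    "\u0635\u0636",                    # sad, dad
    "\u0637\u0638",                    # tah, zah
    "\u0639\u063A",                    # ain, ghain
    "\u06A9\u06AF",                    # keheh, gaf
]

# Map every grouped letter to the index of its group, written as a digit.
SHAPE_CLASS = {ch: str(i) for i, group in enumerate(SHAPE_GROUPS) for ch in group}


def shape_code(word: str) -> str:
    """Replace each grouped letter by its class digit; keep other letters."""
    return "".join(SHAPE_CLASS.get(ch, ch) for ch in word)


baat = "\u0628\u0627\u062A"  # beh + alef + teh
paat = "\u067E\u0627\u062A"  # peh + alef + teh
raat = "\u0631\u0627\u062A"  # reh + alef + teh

print(shape_code(baat) == shape_code(paat))  # beh and peh share a class
print(shape_code(baat) == shape_code(raat))  # beh and reh do not
```

Two words that receive the same code differ only in letters that look alike under this grouping, so a candidate correction with the same code as the misspelled word could be ranked higher.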

References

  • Aliprand, J., et al. (2003). The Unicode standard (Version 4.0). Addison-Wesley Publishing Company.

  • Brill, E., & Moore, R. C. (2000). An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (pp. 286–293).

  • Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171–177.

  • Erikson, K. (1997). Approximate Swedish name matching—survey and test of different algorithms. NADA report TRITA-NA-E9721. http://www.csc.kth.se/tcs/projects/swedish.html

  • Hodge, V. J., & Austin, J. (2003). A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1073–1081.

  • Holmes, D., & McCabe, M. (2002). Improving precision and recall for Soundex retrieval. In Proceedings of the 2002 IEEE International Conference on Information Technology—Coding and Computing (ITCC), Las Vegas, April 2002.

  • Hussain, S. (2004). Letter to sound rules for Urdu text to speech system. In Proceedings of Workshop on “Computational Approaches to Arabic Script-based Languages,” COLING, Geneva, Switzerland.

  • Hussain, S., & Karamat, N. (2003). Urdu collation sequence. In Proceedings of the IEEE International Multi-Topic Conference, Islamabad.

  • Kann, V., et al. (1998). Implementation aspects and applications of a spelling correction algorithm. NADA report TRITA-NA-9813, May 1998. http://www.nada.kth.se/~viggo/papers.html

  • Kernighan, M., et al. (1990). A spelling correction program based on noisy channel model. In Proceedings of COLING-90, The 13th International Conference on Computational Linguistics, Vol. 2.

  • Khan, R. H. (1998). “Urdu Imla”, Qaumi Council bra-e-Taraki-e-Urdu Zabaan.

  • Kukich, K. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4), 377–439.

  • Odell and Russell Soundex. U.S. Patent 1261167 and U.S. Patent 1435663, 1918 and 1922.

  • Peterson, L. J. (1986). A note on undetected typing errors. Communications of the ACM, 29(7), 633–637.

  • Stanier, A. (1990). How accurate is Soundex matching. Computers in Genealogy, 3(7), 286–288.

  • Toutanova, K., & Moore, R. C. (2002). Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 144–151). July 2002.

  • Zobel, J., & Dart, P. W. (1995). Finding approximate matches in large lexicons. Software—Practice and Experience, 25(3), 331–345.

Author information

Corresponding author

Correspondence to Tahira Naseem.

About this article

Cite this article

Naseem, T., Hussain, S. A novel approach for ranking spelling error corrections for Urdu. Lang Resources & Evaluation 41, 117–128 (2007). https://doi.org/10.1007/s10579-007-9028-6

  • DOI: https://doi.org/10.1007/s10579-007-9028-6