Abstract
Stemming is a word form normalization mechanism that maps variant word forms to their common root. In an Information Retrieval system, it is used to improve performance, specifically recall and, ideally, precision. Although its usefulness has been shown to be mixed for languages such as English, stemming produces a significant performance improvement for morphologically complex languages. Linguistic rule-based stemmers, which apply a set of rules to recover the root word from its variants, are available for most European languages. For Indian languages, however, which are highly inflectional in nature, devising a linguistic rule-based stemmer requires additional resources that are not available. We present an approach that is purely corpus based and finds the equivalence classes of variant words in an unsupervised manner. A set of experiments on four languages using the FIRE, CLEF, and TREC test collections shows that our approach gives results comparable to those of linguistic rule-based stemmers for some languages and a significant performance improvement for resource-constrained languages such as Bengali and Marathi.
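To make the notion of equivalence classes of variant words concrete, the following minimal sketch groups vocabulary words that share a fixed-length prefix and maps each word to the shortest member of its class. It is an illustration only, not the algorithm proposed in the paper, whose details are not given in the abstract; the function names, the `min_prefix` threshold, and the sample vocabulary are all assumptions made for this example.

```python
from collections import defaultdict


def prefix_classes(vocabulary, min_prefix=5):
    """Group words that share their first `min_prefix` characters.

    Hypothetical helper: a crude stand-in for corpus-based class discovery.
    Words shorter than `min_prefix` are keyed by the whole word.
    """
    classes = defaultdict(set)
    for word in vocabulary:
        classes[word[:min_prefix]].add(word)
    return {key: sorted(members) for key, members in classes.items()}


def build_stem_map(classes):
    """Map every word to the shortest member of its class (ties broken alphabetically)."""
    stem_map = {}
    for members in classes.values():
        representative = min(members, key=lambda w: (len(w), w))
        for word in members:
            stem_map[word] = representative
    return stem_map


# Illustrative vocabulary (assumed, not taken from any test collection).
vocabulary = [
    "connect", "connected", "connecting", "connection",
    "retrieve", "retrieval", "retrieving", "stem",
]
classes = prefix_classes(vocabulary)
stem_map = build_stem_map(classes)
```

At indexing and query time, each token would be replaced by `stem_map.get(token, token)`, so that all members of a class match one another. A fixed prefix length is language-agnostic but blunt, which is one reason corpus statistics are typically used to decide which words belong together.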
Index Terms
- A Fast Corpus-Based Stemmer