Abstract
This article proposes a novel graph-based, language-independent stemming algorithm suitable for information retrieval. The main features of the algorithm are retrieval effectiveness, generality, and computational efficiency. We test our approach on seven languages of varying morphological complexity, using collections from the TREC, CLEF, and FIRE evaluation platforms. The results show significant performance improvements over plain word-based retrieval, three other language-independent morphological normalizers, and rule-based stemmers.
Index Terms
- GRAS: An effective and efficient stemming algorithm for information retrieval