A Fast Corpus-Based Stemmer

Published: 1 June 2011

Abstract

Stemming is a form of word normalization that maps variant word forms to their common root. In an Information Retrieval system, it is used to improve performance, specifically recall and, ideally, precision. Although its usefulness has been shown to be mixed for languages such as English, stemming produces a significant performance improvement for morphologically complex languages. Linguistic rule-based stemmers, which apply a set of rules to recover the root word from its variants, are available for most European languages. For Indian languages, however, which are highly inflectional, devising a linguistic rule-based stemmer requires additional resources that are not available. We present an approach that is purely corpus based and finds the equivalence classes of variant words in an unsupervised manner. A set of experiments on four languages using FIRE, CLEF, and TREC test collections shows that our approach gives results comparable to those of linguistic rule-based stemmers for some languages and a significant performance improvement for resource-constrained languages such as Bengali and Marathi.
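
The abstract does not describe the clustering procedure itself. As a rough illustration of what "equivalence classes of variant words" built from a corpus vocabulary alone might look like, the minimal sketch below groups words that share a fixed-length prefix and maps each word to the shortest member of its class. This prefix heuristic is a stand-in chosen for brevity, not the authors' algorithm; the function names and the `min_prefix` parameter are hypothetical.

```python
from collections import defaultdict


def prefix_classes(vocabulary, min_prefix=4):
    """Group words by their first `min_prefix` characters.

    Illustrative heuristic only: words shorter than `min_prefix`
    are keyed by the whole word.
    """
    classes = defaultdict(set)
    for word in vocabulary:
        classes[word[:min_prefix]].add(word)
    return dict(classes)


def build_stem_map(vocabulary, min_prefix=4):
    """Map every word to a representative of its class.

    The representative is the shortest member, with ties broken
    alphabetically so the result is deterministic.
    """
    stem_of = {}
    for members in prefix_classes(vocabulary, min_prefix).values():
        representative = min(members, key=lambda w: (len(w), w))
        for word in members:
            stem_of[word] = representative
    return stem_of


# Toy vocabulary; a real run would use the distinct terms of a corpus.
vocab = ["connect", "connected", "connecting", "connection",
         "stem", "stems", "stemming", "go"]
stem_map = build_stem_map(vocab)
```

At indexing and query time, each term would be replaced by `stem_map[term]`, so that all members of a class match one another. No language-specific rules are involved, which is the property the abstract emphasizes for resource-constrained languages.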

        • Published in

          ACM Transactions on Asian Language Information Processing, Volume 10, Issue 2
          June 2011
          111 pages
          ISSN: 1530-0226
          EISSN: 1558-3430
          DOI: 10.1145/1967293

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 June 2011
          • Revised: 1 November 2010
          • Accepted: 1 November 2010
          • Received: 1 May 2010
          Qualifiers

          • research-article
          • Research
          • Refereed
