YASS: Yet another suffix stripper

Published: 01 October 2007

Abstract

Stemmers attempt to reduce a word to its stem or root form and are widely used in information retrieval tasks to increase recall. Most popular stemmers encode a large number of language-specific rules built up over a long period. Such comprehensive rule-based stemmers are available for only a few languages. In the absence of extensive linguistic resources for certain languages, statistical language processing tools have been successfully used to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using these measures to identify the equivalence classes. The proposed approach is compared with Porter's and Lovins' stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
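The abstract does not spell out the distance measures or the clustering procedure, so the following is only a minimal sketch of the general idea, under stated assumptions. The function names `prefix_weighted_distance` and `cluster_lexicon`, the specific weighting (a mismatch at character position i costs 1/2^i), the use of complete linkage, and the threshold value are all hypothetical choices made for illustration, not the article's definitions.

```python
from itertools import zip_longest


def prefix_weighted_distance(a: str, b: str) -> float:
    """Hypothetical distance: a mismatch at position i costs 1 / 2**i.

    Early mismatches are expensive and late ones are cheap, so words
    that share a long prefix and differ only in their suffixes end up
    close together.
    """
    return sum(
        1.0 / 2 ** i
        for i, (x, y) in enumerate(zip_longest(a, b))  # pads the shorter word with None
        if x != y
    )


def cluster_lexicon(words, threshold):
    """Naive complete-linkage agglomerative clustering of a lexicon.

    Repeatedly merges the two closest clusters until the closest pair
    is farther apart than `threshold` (an assumed tuning parameter).
    """
    clusters = [[w] for w in sorted(set(words))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(prefix_weighted_distance(u, v) for u in c1 for v in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        # j > i, so removing cluster j does not shift index i.
        clusters[i].extend(clusters.pop(j))
    return clusters


lexicon = ["connect", "connected", "connecting", "walk", "walked"]
equivalence_classes = cluster_lexicon(lexicon, threshold=0.2)
```

Each resulting cluster plays the role of an equivalence class: at indexing time, every word in a cluster could be mapped to a single representative (for example, its shortest member). The quadratic search for the closest pair is fine for a toy lexicon but would need a more efficient implementation for a real collection.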

References

  1. Adamson, G. and Boreham, J. 1974. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Inf. Stor. Retrieval 10, 253--260.
  2. Bacchin, M., Ferro, N., and Melucci, M. 2005. A probabilistic model for stemmer generation. Inf. Process. Manage. 41, 1, 121--137.
  3. Buckley, C., Singhal, A., and Mitra, M. 1996. Using query zoning and correlation within SMART: TREC 5. In the 5th Text Retrieval Conference.
  4. Buckley, C., Singhal, A., and Mitra, M. 1995. New retrieval approaches using SMART: TREC 4. In the 4th Text Retrieval Conference.
  5. Gey, F., Kando, N., and Peters, C. 2002. Cross language information retrieval: A research roadmap. SIGIR Forum 37, 1, 76--84.
  6. Goldsmith, J. 2001. Unsupervised learning of the morphology of a natural language. Comput. Linguist. 27, 2, 153--198.
  7. Goldsmith, J. A., Higgins, D., and Soglasnova, S. 2000. Automatic language-specific stemming in information retrieval. In Proceedings of the Workshop on Cross-Language Evaluation Forum (CLEF), 273--284.
  8. Hsu, J. 1986. Multiple Comparisons: Theory and Methods. Chapman and Hall.
  9. Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323.
  10. Krovetz, R. 2000. Viewing morphology as an inference process. Artif. Intell. 118, 277--294.
  11. Larkey, L. S., Ballesteros, L., and Connell, M. E. 2002. Improving stemming for Arabic information retrieval: Light stemming and Co-occurrence analysis. In (SIGIR) '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, 275--282.
  12. Levenstein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Commun. ACM 27, 4, 358--368.
  13. Lovins, J. 1968. Development of a stemming algorithm. Mech. Trans. Comput. Linguis. 11, 22--31.
  14. Majumder, P., Mitra, M., and Chaudhuri, B. 2004. Construction and statistical analysis of an Indic language corpus for applied language research. Computing Science Tech. Rep. TR/ISI/CVPR/01/2004, CVPR Unit, Indian Statistical Institute, Kolkata.
  15. Oard, D. W., Levow, G.-A., and Cabezas, C. I. 2001. CLEF experiments at Maryland: Statistical stemming and backoff translation. In Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation (CLEF), Springer, London, 176--187.
  16. Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
  17. Ramanathan, A. and Rao, D. 2003. A lightweight stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computational Linguistics for South Asian Languages (Budapest, Apr.) Workshop.
  18. Roeck, A. and Al-Fares, W. 2000. A morphologically sensitive clustering algorithm for identifying Arabic roots. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
  19. Rogati, M., McCarley, S., and Yang, Y. 2003. Unsupervised learning of Arabic stemming using a parallel corpus. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, E. Hinrichs and D. Roth, eds, 391--398.
  20. Salton, G., Ed. 1971. The SMART Retrieval System---Experiments in Automatic Document Retrieval. Prentice Hall, Englewood Cliffs, NJ.
  21. Xu, J. and Croft, W. B. 1998. Corpus-Based stemming using cooccurrence of word variants. ACM Trans. Inf. Syst. 16, 1, 61--81.
