article

YASS: Yet another suffix stripper

Authors:
Prasenjit Majumder, Indian Statistical Institute, Kolkata, India
Mandar Mitra, Indian Statistical Institute, Kolkata, India
Swapan K. Parui, Indian Statistical Institute, Kolkata, India
Gobinda Kole, Indian Statistical Institute, Kolkata, India
Pabitra Mitra, Indian Institute of Technology, Kharagpur, India
Kalyankumar Datta, Jadavpur University, Calcutta, India

ACM Transactions on Information Systems, Volume 25, Issue 4, pp 18–es
https://doi.org/10.1145/1281485.1281489

Published: 01 October 2007

Abstract

Stemmers attempt to reduce a word to its stem or root form and are widely used in information retrieval (IR) tasks to increase recall. Most popular stemmers encode a large number of language-specific rules built up over a long period. Such stemmers with comprehensive rules are available for only a few languages. Where extensive linguistic resources are lacking for a language, statistical language processing tools have been used successfully to improve the performance of IR systems. In this article, we describe a clustering-based approach to discover equivalence classes of root words and their morphological variants. A set of string distance measures is defined, and the lexicon for a given text collection is clustered using these measures to identify the equivalence classes. The proposed approach is compared with Porter's and Lovins' stemmers on the AP and WSJ subcollections of the Tipster dataset using 200 queries. Its performance is comparable to that of Porter's and Lovins' stemmers, both in terms of average precision and the total number of relevant documents retrieved. The proposed stemming algorithm also provides consistent improvements in retrieval performance for French and Bengali, which are currently resource-poor.
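
The general idea in the abstract (define a string distance, cluster the lexicon, treat each cluster as an equivalence class) can be illustrated with a minimal Python sketch. The abstract does not give the article's actual distance measures or clustering procedure, so everything here is an assumption made for illustration: `prefix_distance` is a hypothetical stand-in measure based on the shared prefix, `cluster_lexicon` uses complete-linkage agglomerative clustering with a hand-picked threshold, and `stem_map` arbitrarily picks the shortest word in each cluster as its representative.

```python
from itertools import combinations


def prefix_distance(a: str, b: str) -> float:
    """Hypothetical distance (not the article's measure): the fraction of
    the longer word that is NOT covered by the common prefix."""
    if not a and not b:
        return 0.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 1.0 - common / max(len(a), len(b))


def cluster_lexicon(words, threshold):
    """Complete-linkage agglomerative clustering of a word list.

    Repeatedly merges the two closest clusters, where the distance between
    clusters is the largest pairwise word distance, and stops once the
    closest pair is farther apart than `threshold`."""
    clusters = [[w] for w in sorted(set(words))]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = max(prefix_distance(a, b)
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        # i < j is guaranteed by combinations(), so deleting j is safe.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def stem_map(clusters):
    """Map every word to the shortest member of its cluster
    (ties broken alphabetically); an assumed choice of representative."""
    return {w: min(c, key=lambda s: (len(s), s)) for c in clusters for w in c}


if __name__ == "__main__":
    lexicon = ["connect", "connected", "connecting", "connection",
               "retrieve", "retrieval", "retrieved"]
    for word, stem in sorted(stem_map(cluster_lexicon(lexicon, 0.5)).items()):
        print(word, "->", stem)
```

The threshold controls how aggressive the resulting stemmer is: a lower value yields many small clusters (light stemming), a higher value merges more variants at the risk of conflating unrelated words. This brute-force loop is only practical for small word lists; a real lexicon would need a more efficient clustering implementation.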

References

Adamson, G. and Boreham, J. 1974. The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Inf. Stor. Retrieval 10, 253--260.
Bacchin, M., Ferro, N., and Melucci, M. 2005. A probabilistic model for stemmer generation. Inf. Process. Manage. 41, 1, 121--137.
Buckley, C., Singhal, A., and Mitra, M. 1996. Using query zoning and correlation within SMART: TREC 5. In the 5th Text Retrieval Conference.
Buckley, C., Singhal, A., and Mitra, M. 1995. New retrieval approaches using SMART: TREC 4. In the 4th Text Retrieval Conference.
Gey, F., Kando, N., and Peters, C. 2002. Cross language information retrieval: A research roadmap. SIGIR Forum 37, 1, 76--84.
Goldsmith, J. 2001. Unsupervised learning of the morphology of a natural language. Comput. Linguist. 27, 2, 153--198.
Goldsmith, J. A., Higgins, D., and Soglasnova, S. 2000. Automatic language-specific stemming in information retrieval. In Proceedings of the Workshop on Cross-Language Evaluation Forum (CLEF), 273--284.
Hsu, J. 1986. Multiple Comparisons: Theory and Methods. Chapman and Hall.
Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323.
Krovetz, R. 2000. Viewing morphology as an inference process. Artif. Intell. 118, 277--294.
Larkey, L. S., Ballesteros, L., and Connell, M. E. 2002. Improving stemming for Arabic information retrieval: Light stemming and Co-occurrence analysis. In SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, 275--282.
Levenstein, V. I. 1966. Binary codes capable of correcting deletions, insertions and reversals. Commun. ACM 27, 4, 358--368.
Lovins, J. 1968. Development of a stemming algorithm. Mech. Trans. Comput. Linguis. 11, 22--31.
Majumder, P., Mitra, M., and Chaudhuri, B. 2004. Construction and statistical analysis of an Indic language corpus for applied language research. Computing Science Tech. Rep. TR/ISI/CVPR/01/2004, CVPR Unit, Indian Statistical Institute, Kolkata.
Oard, D. W., Levow, G.-A., and Cabezas, C. I. 2001. CLEF experiments at Maryland: Statistical stemming and backoff translation. In Revised Papers from the Workshop of Cross-Language Evaluation Forum on Cross-Language Information Retrieval and Evaluation (CLEF), Springer, London, 176--187.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
Ramanathan, A. and Rao, D. 2003. A lightweight stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Workshop on Computational Linguistics for South Asian Languages (Budapest, Apr.).
Roeck, A. and Al-Fares, W. 2000. A morphologically sensitive clustering algorithm for identifying Arabic roots. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.
Rogati, M., McCarley, S., and Yang, Y. 2003. Unsupervised learning of Arabic stemming using a parallel corpus. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, E. Hinrichs and D. Roth, eds, 391--398.
Salton, G., Ed. 1971. The SMART Retrieval System---Experiments in Automatic Document Retrieval. Prentice Hall, Englewood Cliffs, NJ.
Xu, J. and Croft, W. B. 1998. Corpus-Based stemming using cooccurrence of word variants. ACM Trans. Inf. Syst. 16, 1, 61--81.

Index Terms

YASS: Yet another suffix stripper
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
    2. Search engine architectures and scalability
      1. Search engine indexing

Published in

ACM Transactions on Information Systems Volume 25, Issue 4
October 2007
159 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1281485

Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
Bengali
French
Indian languages
clustering
corpus
stemming
string similarity