Corpus-based stemming using cooccurrence of word variants

Authors:
Jinxi Xu, Univ. of Massachusetts, Amherst
W. Bruce Croft, Univ. of Massachusetts, Amherst

ACM Transactions on Information Systems, Volume 16, Issue 1, pp. 61–81. https://doi.org/10.1145/267954.267957

Published: 01 January 1998

Abstract

Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect the language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that employ only morphological rules.
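
The abstract does not spell out the statistic or the procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that candidate classes come from some existing stemmer, that each token list stands in for one text window, and that two variants belong together when the ratio of their cooccurrence count to their combined window counts reaches a cutoff. The function name `refine_classes`, the score, and the `threshold` value are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def refine_classes(documents, classes, threshold=0.1):
    """Split candidate stem classes into groups of variants that cooccur.

    documents: list of token lists; each list stands in for one text window.
    classes:   list of sets of word variants (e.g. from an existing stemmer).
    threshold: hypothetical cutoff on the association score (an assumption).
    """
    vocab = set().union(*classes)
    word_count = Counter()  # number of windows containing each variant
    pair_count = Counter()  # number of windows containing both variants
    for tokens in documents:
        present = sorted(set(tokens) & vocab)
        word_count.update(present)
        pair_count.update(combinations(present, 2))

    refined = []
    for cls in classes:
        # Union-find: variants joined by a strong link end up in one group.
        parent = {w: w for w in cls}

        def find(w):
            while parent[w] != w:
                parent[w] = parent[parent[w]]
                w = parent[w]
            return w

        for a, b in combinations(sorted(cls), 2):
            total = word_count[a] + word_count[b]
            if total == 0:
                continue  # neither variant occurs in the corpus
            # Assumed score: shared windows over combined window counts.
            score = pair_count[(a, b)] / total
            if score >= threshold:
                parent[find(a)] = find(b)

        groups = defaultdict(set)
        for w in cls:
            groups[find(w)].add(w)
        refined.extend(groups.values())
    return refined


# Toy corpus (invented for illustration): "stock" and "stocks" share windows,
# while "stocking" never appears with either of them.
docs = [
    ["stock", "stocks", "market"],
    ["stocks", "stock", "price"],
    ["stocking", "christmas"],
    ["stocking", "gift"],
]
result = refine_classes(docs, [{"stock", "stocks", "stocking"}])
```

On this toy corpus the single candidate class is split in two, which illustrates how corpus statistics could undo a conflation that morphological rules alone would make. A real system would need a principled association measure and windowing scheme; both are left open here.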

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval

Published in

ACM Transactions on Information Systems Volume 16, Issue 1
Jan. 1998
100 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/267954

Copyright © 1998 ACM
Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
class refinement
cooccurrence
corpus analysis
information retrieval
n-gram
stemming