Abstract
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, although the retrieval improvements it yields are small. Current stemming techniques do not, however, reflect language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that employ only morphological rules.
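The general idea of refining a stemmer with cooccurrence statistics can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method evaluated in the paper: the fixed non-overlapping window, the score n_ab / (n_a + n_b), the threshold value, the connected-component grouping, and the toy prefix stemmer are all choices made here for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(documents, stem, window=100):
    """Score pairs of word variants that share a stem by how often they co-occur.

    Assumption: co-occurrence is counted in non-overlapping windows of
    `window` tokens, and the score is n_ab / (n_a + n_b).
    """
    word_count = Counter()  # n_a: number of windows containing word a
    pair_count = Counter()  # n_ab: number of windows containing both a and b
    for tokens in documents:
        for start in range(0, len(tokens), window):
            seen = set(tokens[start:start + window])
            word_count.update(seen)
            # Group the words of this window by the class the stemmer assigns.
            by_stem = {}
            for word in seen:
                by_stem.setdefault(stem(word), []).append(word)
            for variants in by_stem.values():
                for a, b in combinations(sorted(variants), 2):
                    pair_count[(a, b)] += 1
    return {
        (a, b): n_ab / (word_count[a] + word_count[b])
        for (a, b), n_ab in pair_count.items()
    }


def refine_classes(scores, vocabulary, threshold=0.1):
    """Keep two variants in one class only if a chain of pairs scoring at
    least `threshold` links them (connected components via union-find)."""
    parent = {word: word for word in vocabulary}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    classes = {}
    for word in vocabulary:
        classes.setdefault(find(word), set()).add(word)
    return list(classes.values())


# Hypothetical toy data: a crude prefix "stemmer" conflates police/policy/policies.
docs = [
    ["police", "arrest", "police", "officer"],
    ["policy", "policies", "government", "policy"],
    ["policies", "policy", "reform"],
    ["police", "station"],
]
toy_stem = lambda word: word[:5]

scores = cooccurrence_scores(docs, toy_stem)
classes = refine_classes(scores, {"police", "policy", "policies"})
```

In this toy corpus "policy" and "policies" appear together while "police" never co-occurs with either, so the over-broad class produced by the prefix stemmer is split. A pair that never co-occurs has no entry in `scores` and is treated as unrelated.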
Index Terms
- Corpus-based stemming using cooccurrence of word variants