Skip to main content

2002 | OriginalPaper | Buchkapitel

Uncertainty-Based Noise Reduction and Term Selection in Text Categorization

verfasst von : C. Peters, C. H. A. Koster

Erschienen in: Advances in Information Retrieval

Verlag: Springer Berlin Heidelberg

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

This paper introduces a new criterium for term selection, which is based on the notion of Uncertainty. Term selection according to this criterium is performed by the elimination of noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-based term selection (UC) is compared to a number of other criteria like Information Gain (IG), simplified χ2 (SX), Term Frequency (TF) and Document Frequency (DF) in a Text Categorization setting. Experiments on data sets with different properties (Reuters- 21578, patent abstracts and patent applications) and with two different algorithms (Winnow and Rocchio) show that UC-based term selection is not the most aggressive term selection criterium, but that its effect is quite stable across data sets and algorithms. This makes it a good candidate for a general “install-and-forget” term selection mechanism. We also describe and evaluate a hybrid Term Selection technique, first applying UC to eliminate noisy terms and then using another criterium to select the best terms.

Metadaten
Titel
Uncertainty-Based Noise Reduction and Term Selection in Text Categorization
verfasst von
C. Peters
C. H. A. Koster
Copyright-Jahr
2002
Verlag
Springer Berlin Heidelberg
DOI
https://doi.org/10.1007/3-540-45886-7_17