2003 | Original Paper | Book Chapter
Using Compression-Based Language Models for Text Categorization
Authors: William J. Teahan, David J. Harper
Published in: Language Modeling for Information Retrieval
Publisher: Springer Netherlands
Included in: Professional Book Archive
Text compression models are firmly grounded in information theory, and we exploit this theoretical underpinning in applying text compression to text categorization. Category models are constructed using the Prediction by Partial Matching (PPM) text compression scheme, using character-based rather than word-based contexts. Two approaches to compression-based categorization are presented: one ranks documents by cross entropy (average bits per coded symbol) with respect to a category model; the other ranks by the difference in document cross entropy between a category model and the complement-of-category model. Formally, we show that the latter approach is equivalent to two-class Bayes classification, and we propose a method for performing feature selection within our compression-based categorization framework.

An extensive set of experiments on a range of classification tasks is reported. In increasing order of difficulty, these tasks are language and dialect identification, authorship ascription, genre classification, and topic classification. The results show that PPM-based text categorization is extremely effective for language and dialect identification and for authorship ascription, and is very competitive with other reported approaches for genre and topic categorization.
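The first approach above, ranking by cross entropy under a per-category character model, can be sketched in a few lines. The sketch below is a deliberately simplified stand-in: it uses a character bigram model with add-one smoothing rather than the full variable-order PPM scheme the chapter describes, and the model class, training texts, and function names are illustrative assumptions, not the authors' implementation. The core idea is the same: a document is assigned to the category whose model "compresses" it best, i.e. yields the lowest average bits per coded symbol.

```python
import math
from collections import defaultdict


class CharBigramModel:
    """Simplified character-level model (a stand-in for PPM) with
    add-one smoothing over a fixed alphabet."""

    def __init__(self, alphabet):
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, text):
        # Count character bigrams (context of one preceding character).
        for prev, ch in zip(text, text[1:]):
            self.counts[prev][ch] += 1
            self.totals[prev] += 1

    def prob(self, prev, ch):
        # Add-one (Laplace) smoothing so unseen bigrams get nonzero mass.
        return (self.counts[prev][ch] + 1) / (self.totals[prev] + len(self.alphabet))

    def cross_entropy(self, text):
        """Average bits per coded symbol of `text` under this model."""
        bits = 0.0
        for prev, ch in zip(text, text[1:]):
            bits -= math.log2(self.prob(prev, ch))
        return bits / max(len(text) - 1, 1)


def classify(doc, models):
    """Assign `doc` to the category whose model yields the lowest
    cross entropy, i.e. compresses it best."""
    return min(models, key=lambda c: models[c].cross_entropy(doc))


# Toy language-identification example (hypothetical training data).
alphabet = set("abcdefghijklmnopqrstuvwxyz ")
en = CharBigramModel(alphabet)
en.train("the quick brown fox jumps over the lazy dog " * 20)
de = CharBigramModel(alphabet)
de.train("der schnelle braune fuchs springt ueber den faulen hund " * 20)

models = {"english": en, "german": de}
print(classify("the dog jumps over the fox", models))
```

The second approach in the chapter would instead rank by the difference between a document's cross entropy under the category model and under a model trained on the category's complement, which the authors show is equivalent to two-class Bayes classification.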