2003 | Original Paper | Book Chapter
Using Compression-Based Language Models for Text Categorization
Authors: William J. Teahan, David J. Harper
Published in: Language Modeling for Information Retrieval
Publisher: Springer Netherlands
Included in: Professional Book Archive
Text compression models are firmly grounded in information theory, and we exploit this theoretical underpinning in applying text compression to text categorization. Category models are constructed using the Prediction by Partial Matching (PPM) text compression scheme, using character-based rather than word-based contexts. Two approaches to compression-based categorization are presented: one ranks documents by cross entropy (average bits per coded symbol) with respect to a category model; the other ranks by the difference in document cross entropy between a category model and the complement-of-category model. Formally, we show that the latter approach is equivalent to two-class Bayes classification, and we propose a method for performing feature selection within our compression-based categorization framework.

An extensive set of experiments on a range of classification tasks is reported. In increasing order of difficulty, these tasks are language and dialect identification, authorship ascription, genre classification, and topic classification. The results show that PPM-based text categorization is extremely effective for language and dialect identification and for authorship ascription, and is very competitive with other reported approaches for genre and topic categorization.
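The first approach above, ranking by cross entropy under a per-category character model, can be sketched in a few lines. The sketch below is a deliberately simplified stand-in: it uses a character bigram model with add-one smoothing rather than the full variable-order PPM scheme the chapter describes, and the model class, training texts, and function names are illustrative assumptions, not the authors' implementation. The core idea is the same: a document is assigned to the category whose model "compresses" it best, i.e. yields the lowest average bits per coded symbol.

```python
import math
from collections import defaultdict


class CharBigramModel:
    """Simplified character-level model (a stand-in for PPM) with
    add-one smoothing over a fixed alphabet."""

    def __init__(self, alphabet):
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, text):
        # Count character bigrams (context of one preceding character).
        for prev, ch in zip(text, text[1:]):
            self.counts[prev][ch] += 1
            self.totals[prev] += 1

    def prob(self, prev, ch):
        # Add-one (Laplace) smoothing so unseen bigrams get nonzero mass.
        return (self.counts[prev][ch] + 1) / (self.totals[prev] + len(self.alphabet))

    def cross_entropy(self, text):
        """Average bits per coded symbol of `text` under this model."""
        bits = 0.0
        for prev, ch in zip(text, text[1:]):
            bits -= math.log2(self.prob(prev, ch))
        return bits / max(len(text) - 1, 1)


def classify(doc, models):
    """Assign `doc` to the category whose model yields the lowest
    cross entropy, i.e. compresses it best."""
    return min(models, key=lambda c: models[c].cross_entropy(doc))


# Toy language-identification example (hypothetical training data).
alphabet = set("abcdefghijklmnopqrstuvwxyz ")
en = CharBigramModel(alphabet)
en.train("the quick brown fox jumps over the lazy dog " * 20)
de = CharBigramModel(alphabet)
de.train("der schnelle braune fuchs springt ueber den faulen hund " * 20)

models = {"english": en, "german": de}
print(classify("the dog jumps over the fox", models))
```

The second approach in the chapter would instead rank by the difference between a document's cross entropy under the category model and under a model trained on the category's complement, which the authors show is equivalent to two-class Bayes classification.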