Information gain and divergence-based feature selection for machine learning-based text categorization

https://doi.org/10.1016/j.ipm.2004.08.006

Abstract

Most previous work on feature selection has emphasized only the reduction of the high dimensionality of the feature space. But in cases where many features are highly redundant with each other, we must resort to other means, for example, more complex dependence models such as Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization that does not rely on more complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization. Empirical results are given on a number of datasets, showing that our feature selection method is more effective than Koller and Sahami’s method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], which is one of the greedy feature selection methods, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes improves conventional machine learning algorithms to the point that they outperform support vector machines, which are known to give the best classification accuracy.

Introduction

Text categorization is the problem of automatically assigning predefined categories to free text documents. A growing number of statistical classification methods and machine learning techniques have been applied to text categorization in recent years (Yang & Liu, 1999).

A major characteristic, or difficulty, of text categorization problems is the high dimensionality of the feature space (Joachims, 1998). The native feature space consists of the unique terms (words or phrases) that occur in documents, which can amount to tens or hundreds of thousands of terms for even a moderate-sized text collection. This is prohibitively high for many machine learning algorithms. Reducing the set of features considered by the algorithm serves two purposes: it considerably decreases the running time of the learning algorithm, and it can increase the accuracy of the resulting model. Along this line, a number of studies have addressed the issue of feature subset selection (Lewis and Ringuette, 1994, Schutze et al., 1995, Yang and Pedersen, 1997). Yang and Pedersen found information gain (IG) and the chi-square test (CHI) most effective for aggressive term removal without loss of categorization accuracy in their experiments (Yang & Pedersen, 1997). They also discovered that the IG and CHI scores of a term are strongly correlated.
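
For reference, the information gain of a term t with respect to categories c_1, …, c_m is conventionally defined as in Yang and Pedersen (1997); the following is the standard formulation, restated here only for convenience:

IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t}),

where \bar{t} denotes the absence of term t.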

Another major characteristic of text categorization problems is the high level of feature redundancy (Joachims, 2001). While there are generally many different features relevant to a classification task, several such cues often occur in one document, and these cues are partly redundant. Naive Bayes, a popular learning algorithm, is commonly justified using assumptions of conditional independence or linked dependence (Cooper, 1991). However, these assumptions are generally accepted to be false for text. To remove these violations, more complex dependence models such as Bayesian network classifiers have been developed (Sahami, 1998), but they trade efficiency for this added model complexity.
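
To make the assumption explicit: for a document represented by terms t_1, …, t_n, conditional independence states that

P(t_1, \ldots, t_n \mid c) = \prod_{k=1}^{n} P(t_k \mid c),

which is exactly what highly redundant features violate. (This is the standard Naive Bayes formulation, not notation introduced by this paper.)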

Most previous work on feature selection has emphasized only the reduction of the high dimensionality of the feature space (Lewis and Ringuette, 1994, Schutze et al., 1995, Yang and Pedersen, 1997). The most popular feature selection method is IG, which works well with text and has often been used. IG looks at each feature in isolation and measures how important it is for the prediction of the correct class label. In cases where features are not redundant with each other, IG is very appropriate. But in cases where many features are highly redundant with each other, we must resort to other means, for example, more complex dependence models.

In this paper, to address both the high dimensionality of the feature space and the high level of feature redundancy, we propose a new feature selection method which selects each feature according to a combined criterion of information gain and novelty of information. The latter measures the degree of dissimilarity between the feature being considered and the previously selected features. Maximal marginal relevance (MMR) provides precisely such functionality (Carbonell & Goldstein, 1998). We therefore propose an MMR-based feature selection method which strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization.
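
As an illustration only, the sketch below shows one way such a combined criterion could be implemented: information gain serves as the relevance term and a generic feature-to-feature similarity serves as the redundancy penalty. The function and parameter names (select_features_mmr, lambda_weight, similarity) are ours, not the paper's; the exact MMR_FS formulation, including its divergence-based similarity measure, is given in Section 2 of the full text.

```python
from typing import Callable, Dict, List, Set


def select_features_mmr(
    features: List[str],
    info_gain: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lambda_weight: float = 0.7,
) -> List[str]:
    """Greedy MMR-style feature selection (illustrative sketch, not the paper's code).

    At each step, pick the unselected feature maximizing
        lambda_weight * IG(f) - (1 - lambda_weight) * max_{g in selected} similarity(f, g),
    i.e. high information gain but low redundancy with already-selected features.
    """
    selected: List[str] = []
    remaining: Set[str] = set(features)

    while remaining and len(selected) < k:
        best_feature, best_score = None, float("-inf")
        for f in remaining:
            # Redundancy penalty: similarity to the closest already-selected feature.
            redundancy = max((similarity(f, g) for g in selected), default=0.0)
            score = lambda_weight * info_gain[f] - (1.0 - lambda_weight) * redundancy
            if score > best_score:
                best_feature, best_score = f, score
        selected.append(best_feature)
        remaining.discard(best_feature)

    return selected
```

The first feature chosen is simply the one with the highest information gain; each subsequent choice trades information gain against similarity to the features already selected.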

In the machine learning field, some greedy methods that add or subtract a single feature at a time have been developed for feature selection (Koller and Sahami, 1996, Pietra et al., 1997). Pietra et al. proposed a method for incrementally constructing random fields (Pietra et al., 1997). Their method builds increasingly complex fields to approximate the empirical distribution of a set of training examples by allowing potential functions, or features, that are supported by increasingly large sub-graphs. Each feature is assigned a weight, and the weights are trained to minimize the Kullback–Leibler divergence between the field and the empirical distribution of the training data. Features are incrementally added to the field using a top-down greedy algorithm, with the intent of capturing the salient properties of the empirical samples while allowing generalization to new configurations. However, the method is not simple, which is problematic both computationally and statistically in large-scale problems.
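
For completeness, the Kullback–Leibler divergence referred to throughout this discussion is the standard measure of how a distribution p diverges from a reference distribution q:

D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.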

Koller and Sahami proposed another greedy feature selection method which provides a mechanism for eliminating features whose predictive information with respect to the class is subsumed by the other features (Koller & Sahami, 1996). This method is also based on the Kullback–Leibler divergence to minimize the amount of predictive information lost during feature elimination.
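
In outline (following our reading of Koller and Sahami, 1996), each candidate feature F_i is paired with a small approximate Markov blanket M_i drawn from the remaining features, and the feature minimizing the expected divergence

\delta(F_i \mid M_i) = \sum_{f_{M_i},\, f_i} P(M_i = f_{M_i}, F_i = f_i)\, D\big( P(C \mid M_i = f_{M_i}, F_i = f_i) \,\|\, P(C \mid M_i = f_{M_i}) \big)

is eliminated at each step, on the grounds that M_i already carries nearly all of the class information in F_i.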

In order to compare the performance of our method with that of greedy feature selection methods, we implemented Koller and Sahami’s method and empirically tested it in Section 3.

We also compared the performance of conventional machine learning algorithms using our feature selection method with that of the support vector machine (SVM) in Section 3. SVM is a learning method introduced by Vapnik (1998). Previous work shows that SVM consistently achieves good performance on text categorization tasks, outperforming existing methods substantially and significantly (Joachims, 1998, Joachims, 2001). Because SVM generalizes well in high-dimensional feature spaces and under high levels of feature redundancy, it is known to gain little from sophisticated feature selection (Joachims, 2001).

The remainder of this paper is organized as follows. In Section 2, we describe the information gain and divergence-based feature selection. Section 3 presents in-depth experiments, discussions and the results. Section 4 concludes the research.

Section snippets

Information gain and divergence-based feature selection

In this section, we describe the maximal marginal relevance (MMR) and the MMR-based feature selection.
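
For reference, the original MMR criterion of Carbonell and Goldstein (1998), defined for selecting documents D_i from a ranked list R against a query Q given an already-selected set S, is

MMR = \arg\max_{D_i \in R \setminus S} \Big[ \lambda\, \mathrm{Sim}_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big];

the feature selection method of this paper adapts the same relevance-versus-novelty trade-off to terms rather than documents.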

Experiments

In order to compare the performance of the MMR_FS method with conventional IG and a greedy feature selection method (Koller & Sahami’s method, labeled ‘Greedy’), we evaluated the three feature selection methods with four different learning algorithms: Naive Bayes, TFIDF/Rocchio, Probabilistic Indexing (PrTFIDF; Joachims, 1997), and Maximum Entropy, using Rainbow (McCallum, 1996). Rainbow is an executable program that performs document classification. It provides Naive Bayes, TFIDF/Rocchio, Probabilistic Indexing, and Maximum Entropy.
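
The paper's experiments use the Rainbow toolkit; purely as a hedged, modern illustration of the same comparison protocol (not the authors' setup), one could pair feature selectors with classifiers using scikit-learn as sketched below. The dataset, selectors, and classifiers here are stand-ins chosen by us.

```python
from itertools import product

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-ins: 20 Newsgroups instead of Reuters/WebKB, and scikit-learn selectors
# and classifiers instead of Rainbow's implementations and the paper's MMR_FS.
categories = ["comp.graphics", "rec.autos", "sci.med", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

selectors = {
    "IG (mutual information)": lambda: SelectKBest(mutual_info_classif, k=2000),
    "CHI": lambda: SelectKBest(chi2, k=2000),
}
classifiers = {
    "Naive Bayes": lambda: MultinomialNB(),
    "MaxEnt (logistic regression)": lambda: LogisticRegression(max_iter=1000),
}

# Evaluate every (feature selector, classifier) pair on held-out test data.
for (s_name, make_sel), (c_name, make_clf) in product(selectors.items(), classifiers.items()):
    pipeline = make_pipeline(CountVectorizer(), make_sel(), make_clf())
    pipeline.fit(train.data, train.target)
    accuracy = accuracy_score(test.target, pipeline.predict(test.data))
    print(f"{s_name} + {c_name}: accuracy = {accuracy:.3f}")
```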

Conclusion

In this paper, we proposed an information gain and divergence-based feature selection method which strives to reduce redundancy between features while maintaining information gain in selecting appropriate features for text categorization.

We carried out extensive experiments to verify the proposed method. The experiments were performed using three different machine learning algorithms on both the Reuters and WebKB datasets. Based on the experimental results, we can verify that our MMR-based feature selection method is more effective than both conventional IG and the greedy feature selection method.

References (15)

  • Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing...
  • Cooper, W. S. (1991). Some inconsistencies and misnomers in probabilistic information retrieval. In Proceedings of the...
  • Joachims, T. (1997). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In...
  • Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In...
  • Joachims, T. (2001). A statistical learning model of text classification for support vector machines. In Proceedings of...
  • Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international...
  • Lewis, D. D., & Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Proceedings...


This work was supported by the Brain Korea 21 Project (Ministry of Education) in 2003. A short, preliminary version of this paper was published at HLT/NAACL 2004.
