A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization

https://doi.org/10.1016/j.ipm.2011.12.005

Abstract

Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi-square statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.

Highlights

► The term is comprehensively measured both inter-category and intra-category.
► We compared the proposed method with six well-known feature selection algorithms.
► The proposed algorithm can significantly improve the performance of classifiers.

Introduction

As the number of digital documents available on the Internet has grown significantly in recent years, it has become impossible to process such enormous quantities of information manually. More and more methods based on statistical theory and machine learning have been applied to automatic information processing (Shang, Huang, & Zhu, 2007). A very efficient method for managing this vast amount of data is text categorization, which assigns one or more predefined categories to a new document based on its contents (Fragoudis et al., 2005, Sebastiani, 2002; Yang & Pedersen, 1997). Numerous sophisticated algorithms have been applied to text categorization, for example the Naïve Bayes classifier (NB) (Chen, Huang, Tian, & Qu, 2009), Support Vector Machines (SVMs) (Joachims, 1998), K-Nearest Neighbors (KNN) (Cover & Hart, 1967), decision trees, Rocchio (Sebastiani, 2002), etc.

The major characteristic of text categorization is that the number of features in the feature space (vector space, bag of words) can easily reach tens of thousands even for a moderate-sized data set (Fragoudis et al., 2005, Yang and Pedersen, 1997). This high dimensionality raises two problems. One is that some sophisticated algorithms cannot be used optimally in text categorization. The other is that overfitting is almost inevitable when most algorithms are trained on the training set (Fragoudis et al., 2005, Sebastiani, 2002). Therefore, dimensionality reduction has become a major research area.

The goal of dimensionality reduction is to shrink the vector space and avoid overfitting without sacrificing categorization performance, and it is tackled by feature extraction and feature selection. Feature extraction generates a new term set, not of the same type as the original feature space, by combining or transforming the original features; feature selection, the most commonly used method in the field of text classification, selects a subset of the original feature space according to an evaluation criterion (Sebastiani, 2002). There are three distinct ways of viewing feature selection (Blum & Langley, 1997): the first is the embedded approach, in which the feature selection process is embedded in the basic induction algorithm; the second is the wrapper approach, which selects the term subset using an evaluation function that wraps around the classifier algorithm, the selected features then being used with that same classifier (John et al., 1994, Mladenic and Grobelnik, 2003); the last is the filter approach, which selects the feature subset from the original feature space using an evaluation function that is independent of the classifier algorithm (Mladenic & Grobelnik, 2003). As the filter approach is simple and efficient, it has been widely applied in text categorization. The method proposed in this study is also a filter approach. There are numerous efficient and effective feature selection algorithms, such as Document Frequency (DF) (Yang & Pedersen, 1997), the DIA association factor (DIA) (Fragoudis et al., 2005, Fuhr et al., 1991, Sebastiani, 2002), Odds Ratio (OR) (Mengle & Goharian, 2009), Mutual Information (MI) (Peng et al., 2005, Yang and Pedersen, 1997), Information Gain (IG) (Ogura et al., 2009, Yang and Pedersen, 1997), Chi-square statistic (CHI) (Ogura et al., 2009, Yang and Pedersen, 1997), Ambiguity Measure feature selection (AM) (Mengle & Goharian, 2009), Orthogonal Centroid Feature Selection (OCFS) (Yan et al., 2005), Improved Gini index (GINI) (Mengle & Goharian, 2009), Expected Cross Entropy (Koller & Sahami, 1997), Best Terms (BT) (Fragoudis et al., 2005), a measure using the Poisson distribution (Ogura et al., 2009), the Preprocess algorithm for filtering irrelevant information Based on the Minimum Class Difference (PBMCD) (Chen & Lü, 2006), Class Dependent Feature Weighting (CDFW) (Youn & Jeong, 2009), binomial hypothesis testing (Bi-Test) (Yang, Liu, Liu, Zhu, & Zhang, 2011), and so on.
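As a concrete illustration of the filter approach, the sketch below scores every term independently of any classifier and keeps the k highest-scoring terms, using Document Frequency, one of the criteria listed above, as the evaluation function; the toy documents and the value of k are hypothetical.

```python
# A minimal sketch of filter-style feature selection: score each term with a
# classifier-independent criterion and keep the top-k. Document Frequency (DF)
# is used here purely as an illustrative scoring function.
from collections import Counter

def select_by_document_frequency(tokenized_docs, k):
    """Return the k terms occurring in the largest number of documents."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))               # count each term at most once per document
    return [term for term, _ in df.most_common(k)]

docs = [["price", "oil", "market"],
        ["oil", "export", "price"],
        ["match", "goal", "score"]]
print(select_by_document_frequency(docs, 2))   # e.g. ['oil', 'price']
```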

Among the above-mentioned feature selection algorithms, Document Frequency (DF) only measures the significance of a term intra-category, whereas Ambiguity Measure (AM) and the DIA association factor (DIA) only calculate the score of a term inter-category. In this paper, we propose a new feature selection algorithm, called Comprehensively Measure Feature Selection (CMFS), which comprehensively measures the significance of a term both intra-category and inter-category. To evaluate the CMFS method, we used two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs), on three benchmark corpora (20-Newsgroups, Reuters-21578 and WebKB), and compared it with six feature selection algorithms (Information Gain, Chi-square statistic, Improved Gini index, Document Frequency, Orthogonal Centroid Feature Selection and the DIA association factor). The experimental results show that the proposed method CMFS significantly outperforms DIA, IG, CHI, DF and OCFS and is comparable with GINI when Naïve Bayes is used; CMFS is superior to DIA, IG, DF and OCFS and is comparable with GINI and CHI when Support Vector Machines are used.
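The exact CMFS formula is not reproduced in this excerpt; the sketch below only illustrates, under that caveat, one plausible way to combine an intra-category weight (how frequently a term occurs within a category) with an inter-category weight (how concentrated the term is in one category relative to the others), here as a simple product over a hypothetical term-frequency matrix.

```python
# Illustrative sketch only: combines an intra-category term weight, P(t|c),
# with an inter-category concentration, P(c|t), by taking their product.
# This is an assumed stand-in, not the paper's CMFS definition; the counts
# below are hypothetical.
import numpy as np

def combined_score(tf):
    """tf: terms x categories matrix of term frequencies."""
    p_t_given_c = tf / tf.sum(axis=0, keepdims=True)   # intra-category weight
    p_c_given_t = tf / tf.sum(axis=1, keepdims=True)   # inter-category weight
    per_category = p_t_given_c * p_c_given_t
    return per_category.max(axis=1)                    # best per-term score over categories

tf = np.array([[30.0,  1.0],    # a term concentrated in category 0
               [10.0, 12.0]])   # a term spread across both categories
print(combined_score(tf))       # the concentrated term scores higher
```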

The rest of this paper is organized as follows: Section 2 presents the state of the art in feature selection algorithms. Section 3 describes the basic idea and implementation of the CMFS method. The experimental details are given in Section 4 and the experimental results are presented in Section 5. Section 6 shows the statistical analysis and discussion. Our conclusions and directions for future work are provided in the last section.

Section snippets

Related work

Numerous feature selection methods have been widely used in text categorization in recent years. Information Gain (IG) and the Chi-square statistic (CHI) are two of the most effective feature selection methods, and the performance of Document Frequency (DF) is comparable to that of IG and CHI (Yang & Pedersen, 1997). The Improved Gini index is a feature selection method based on the Gini index; it has better performance and simpler computation than IG and CHI (Shang et al., 2007). The Orthogonal Centroid
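Several of the criteria named above are available off the shelf; as a hedged illustration, the snippet below runs a chi-square (CHI) filter with scikit-learn on a hypothetical toy corpus, keeping an arbitrary k = 3 terms. It is not the experimental setup used in the paper.

```python
# Chi-square feature selection over bag-of-words counts, shown only as an
# illustration of the CHI criterion; corpus, labels and k are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["oil price rises", "oil export market", "team wins match", "coach praises team"]
labels = [0, 0, 1, 1]                                  # two toy categories

vec = CountVectorizer().fit(corpus)
counts = vec.transform(corpus)
selector = SelectKBest(chi2, k=3).fit(counts, labels)
kept = [t for t, keep in zip(vec.get_feature_names_out(), selector.get_support()) if keep]
print(kept)
```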

Motivation

Before the process of text classification, the feature vector space model (bag of words) (Sebastiani, 2002), which consists of all unique terms extracted from the training set, must be created. Then, each raw document in the corpus is transformed into a vector by mapping the terms occurring in the raw document onto the feature vector space. This vector space model can be regarded as a word-to-document feature-appearance matrix where rows are the features and columns are document
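A minimal sketch of that word-to-document matrix is given below, with rows indexed by the unique training terms and columns by documents; the toy documents are hypothetical.

```python
# Build a word-to-document feature-appearance matrix: rows are unique terms
# from the training set, columns are documents, entries are occurrence counts.
import numpy as np

docs = [["oil", "price", "oil"], ["price", "market"], ["goal", "match"]]
vocab = sorted({t for d in docs for t in d})            # the bag-of-words feature space
matrix = np.zeros((len(vocab), len(docs)), dtype=int)   # terms x documents
for j, doc in enumerate(docs):
    for term in doc:
        matrix[vocab.index(term), j] += 1

print(vocab)
print(matrix)
```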

Classifiers

In this section, we briefly describe the Naïve Bayes (NB) and Support Vector Machines (SVMs) used in our study.
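As a hedged illustration of how these two classifiers are commonly applied to bag-of-words counts, the sketch below trains scikit-learn's multinomial Naïve Bayes and a linear-kernel SVM on a hypothetical toy corpus; it stands in for, but does not reproduce, the experimental setup of the paper.

```python
# Train the two classifier families named above (NB and a linear SVM) on
# bag-of-words counts; the corpus and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

train_docs = ["oil price rises", "oil export market", "team wins match", "coach praises team"]
train_labels = [0, 0, 1, 1]

vec = CountVectorizer().fit(train_docs)
X_train = vec.transform(train_docs)

nb = MultinomialNB().fit(X_train, train_labels)
svm = LinearSVC().fit(X_train, train_labels)

test = vec.transform(["oil market news"])
print(nb.predict(test), svm.predict(test))   # both should predict category 0
```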

Experimental results on the 20-newsgroups dataset

Tables 3 and 4 show the micro-F1 results when Naïve Bayes and Support Vector Machines, respectively, are used on the 20-Newsgroups data set. It can be seen from Table 3 that the micro-F1 performance of NB with CMFS on 20-Newsgroups exceeds that obtained with the other algorithms when the number of selected features is 200, 1400, 1600, 1800 or 2000. Moreover, CMFS is inferior only to GINI when the number of selected features is 400, 600, 800, 1000 or 1200. Table 4 indicates
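Micro-averaged F1 pools true positives, false positives and false negatives over all categories before computing precision and recall; the snippet below shows the computation on hypothetical label vectors, not on the paper's results.

```python
# Micro-averaged F1 on toy single-label predictions (hypothetical data).
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f1_score(y_true, y_pred, average="micro"))   # 4 of 6 correct -> 0.666...
```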

Statistical analysis

In order to compare the performance of the proposed method with the previous approaches, the Friedman test and the Iman and Davenport (1980) test are used in the statistical analysis. Both are non-parametric tests (Demšar, 2006) whose null hypothesis is that all the algorithms are equivalent, so the ranks of all algorithms should be equal. If the null hypothesis of the Friedman and Iman & Davenport tests is rejected, a post-hoc test (the Bonferroni-Dunn test or the Holm test) (
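A minimal sketch of this testing procedure, assuming hypothetical micro-F1 scores rather than the paper's measurements, is given below: the Friedman statistic is computed over per-dataset scores of the compared algorithms and then corrected with the Iman and Davenport statistic F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F) (Demšar, 2006).

```python
# Friedman test over per-dataset scores of k algorithms, followed by the
# Iman & Davenport correction F_F = (N-1)*chi2_F / (N*(k-1) - chi2_F)
# (Demšar, 2006). The micro-F1 values below are hypothetical.
import numpy as np
from scipy import stats

# rows: N data sets (blocks), columns: k feature selection algorithms
scores = np.array([[0.82, 0.79, 0.80, 0.78],
                   [0.74, 0.76, 0.73, 0.70],
                   [0.88, 0.85, 0.86, 0.84]])
n, k = scores.shape

chi2_f, p_friedman = stats.friedmanchisquare(*scores.T)
f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)
p_iman_davenport = stats.f.sf(f_f, k - 1, (k - 1) * (n - 1))
print(chi2_f, p_friedman, f_f, p_iman_davenport)
```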

Conclusion

In this paper, we have proposed a novel feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. The efficiency of the proposed measure CMFS was examined through text categorization experiments with the NB and SVM classifiers. The results, compared with six well-known feature selection algorithms (Information Gain (IG), Improved Gini index (GINI), Chi-square statistic (CHI), Document Frequency (DF),

Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grant No. 60971089 and the National Electronic Development Foundation of China under Grant No. 2009537.

References (29)

  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
  • Drucker, H., et al. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
  • Fragoudis, D., et al. (2005). Best terms: An efficient feature-selection algorithm for text categorization. Knowledge and Information Systems.
  • Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., Tzeras, K., Darmstadt, T. H., et al. (1991). AIR/X – A rule-based...