A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization

https://doi.org/10.1016/j.ipm.2011.12.005

Abstract

Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi-square statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and the DIA association factor (DIA) when the Naïve Bayes classifier is used, and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.

Highlights

► The term is comprehensively measured both inter-category and intra-category.
► We compared the proposed method with six well-known feature selection algorithms.
► The proposed algorithm can significantly improve the performance of classifiers.

Introduction

As the number of digital documents available on the Internet has grown significantly in recent years, it has become impossible to process such enormous quantities of information manually. More and more methods based on statistical theory and machine learning have been applied to automatic information processing (Shang, Huang, & Zhu, 2007). A very efficient method for managing this vast amount of data is text categorization, which assigns one or more predefined categories to a new document based on its contents (Fragoudis et al., 2005, Sebastiani, 2002; Yang & Pedersen, 1997). Numerous sophisticated algorithms have been applied to text categorization, for example the Naïve Bayes classifier (NB) (Chen, Huang, Tian, & Qu, 2009), Support Vector Machines (SVMs) (Joachims, 1998), K-Nearest Neighbors (KNN) (Cover & Hart, 1967), decision trees, Rocchio (Sebastiani, 2002), etc.

The major characteristic of text categorization is that the number of features in the feature space (vector space, bag of words) can easily reach tens of thousands even for a moderate-sized data set (Fragoudis et al., 2005, Yang and Pedersen, 1997). This high dimensionality raises two problems. One is that some sophisticated algorithms cannot be used optimally in text categorization. The other is that overfitting is almost inevitable when most algorithms are trained on the training set (Fragoudis et al., 2005, Sebastiani, 2002). Therefore, dimensionality reduction has become a major research area.

The goal of dimensionality reduction is to shrink the vector space and avoid overfitting without sacrificing categorization performance, and it is tackled by feature extraction and feature selection. Feature extraction generates a new term set, not of the same type as the original feature space, by combining or transforming the original features; feature selection, the most commonly used method in the field of text classification, selects a subset of the original feature space according to an evaluation criterion (Sebastiani, 2002). There are three distinct ways of viewing feature selection (Blum & Langley, 1997): the first is the embedded approach, in which the feature selection process is embedded in the basic induction algorithm; the second is the wrapper approach, which selects the term subset using an evaluation function that wraps around the classifier algorithm, the selected features then being used with that same classifier (John et al., 1994, Mladenic and Grobelnik, 2003); the last is the filter approach, which selects the feature subset from the original feature space using an evaluation function that is independent of the classifier algorithm (Mladenic & Grobelnik, 2003). As the filter approach is simple and efficient, it has been widely applied in text categorization. The method proposed in this study is also a filter approach. There are numerous efficient and effective feature selection algorithms, such as Document Frequency (DF) (Yang & Pedersen, 1997), the DIA association factor (DIA) (Fragoudis et al., 2005, Fuhr et al., 1991, Sebastiani, 2002), Odds Ratio (OR) (Mengle & Goharian, 2009), Mutual Information (MI) (Peng et al., 2005, Yang and Pedersen, 1997), Information Gain (IG) (Ogura et al., 2009, Yang and Pedersen, 1997), Chi-square statistic (CHI) (Ogura et al., 2009, Yang and Pedersen, 1997), Ambiguity Measure feature selection (AM) (Mengle & Goharian, 2009), Orthogonal Centroid Feature Selection (OCFS) (Yan et al., 2005), Improved Gini index (GINI) (Mengle & Goharian, 2009), Expected Cross Entropy (Koller & Sahami, 1997), Best Terms (BT) (Fragoudis et al., 2005), a measure using the Poisson distribution (Ogura et al., 2009), the Preprocess algorithm for filtering irrelevant information Based on the Minimum Class Difference (PBMCD) (Chen & Lü, 2006), Class Dependent Feature Weighting (CDFW) (Youn & Jeong, 2009), binomial hypothesis testing (Bi-Test) (Yang, Liu, Liu, Zhu, & Zhang, 2011), and so on.
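As a concrete illustration of the filter approach, the sketch below scores every term independently of any classifier and keeps the k highest-scoring terms, using Document Frequency, one of the criteria listed above, as the evaluation function; the toy documents and the value of k are hypothetical.

```python
# A minimal sketch of filter-style feature selection: score each term with a
# classifier-independent criterion and keep the top-k. Document Frequency (DF)
# is used here purely as an illustrative scoring function.
from collections import Counter

def select_by_document_frequency(tokenized_docs, k):
    """Return the k terms occurring in the largest number of documents."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))               # count each term at most once per document
    return [term for term, _ in df.most_common(k)]

docs = [["price", "oil", "market"],
        ["oil", "export", "price"],
        ["match", "goal", "score"]]
print(select_by_document_frequency(docs, 2))   # e.g. ['oil', 'price']
```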

Among the above-mentioned feature selection algorithms, Document Frequency (DF) only measures the significance of a term intra-category, whereas Ambiguity Measure (AM) and the DIA association factor (DIA) only calculate the score of a term inter-category. In this paper, we propose a new feature selection algorithm, called Comprehensively Measure Feature Selection (CMFS), which comprehensively measures the significance of a term both intra-category and inter-category. To evaluate the CMFS method, we used two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs), on three benchmark corpora (20-Newsgroups, Reuters-21578 and WebKB), and compared it with six feature selection algorithms (Information Gain, Chi-square statistic, Improved Gini index, Document Frequency, Orthogonal Centroid Feature Selection and the DIA association factor). The experimental results show that the proposed method CMFS significantly outperforms DIA, IG, CHI, DF and OCFS and is comparable with GINI when Naïve Bayes is used; CMFS is superior to DIA, IG, DF and OCFS and is comparable with GINI and CHI when Support Vector Machines are used.
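The exact CMFS formula is not reproduced in this excerpt; the sketch below only illustrates, under that caveat, one plausible way to combine an intra-category weight (how frequently a term occurs within a category) with an inter-category weight (how concentrated the term is in one category relative to the others), here as a simple product over a hypothetical term-frequency matrix.

```python
# Illustrative sketch only: combines an intra-category term weight, P(t|c),
# with an inter-category concentration, P(c|t), by taking their product.
# This is an assumed stand-in, not the paper's CMFS definition; the counts
# below are hypothetical.
import numpy as np

def combined_score(tf):
    """tf: terms x categories matrix of term frequencies."""
    p_t_given_c = tf / tf.sum(axis=0, keepdims=True)   # intra-category weight
    p_c_given_t = tf / tf.sum(axis=1, keepdims=True)   # inter-category weight
    per_category = p_t_given_c * p_c_given_t
    return per_category.max(axis=1)                    # best per-term score over categories

tf = np.array([[30.0,  1.0],    # a term concentrated in category 0
               [10.0, 12.0]])   # a term spread across both categories
print(combined_score(tf))       # the concentrated term scores higher
```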

The rest of this paper is organized as follows: Section 2 presents the state of the art in feature selection algorithms. Section 3 describes the basic idea and implementation of the CMFS method. The experimental details are given in Section 4 and the experimental results are presented in Section 5. Section 6 shows the statistical analysis and discussion. Our conclusions and directions for future work are provided in the last section.

Section snippets

Related work

Numerous feature selection methods have been widely used in text categorization in recent years. Information Gain (IG) and the Chi-square statistic (CHI) are two of the most effective feature selection methods, and the performance of Document Frequency (DF) is comparable to that of IG and CHI (Yang & Pedersen, 1997). The Improved Gini index is a feature selection method based on the Gini index; it has better performance and simpler computation than IG and CHI (Shang et al., 2007). The Orthogonal Centroid
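Several of the criteria named above are available off the shelf; as a hedged illustration, the snippet below runs a chi-square (CHI) filter with scikit-learn on a hypothetical toy corpus, keeping an arbitrary k = 3 terms. It is not the experimental setup used in the paper.

```python
# Chi-square feature selection over bag-of-words counts, shown only as an
# illustration of the CHI criterion; corpus, labels and k are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["oil price rises", "oil export market", "team wins match", "coach praises team"]
labels = [0, 0, 1, 1]                                  # two toy categories

vec = CountVectorizer().fit(corpus)
counts = vec.transform(corpus)
selector = SelectKBest(chi2, k=3).fit(counts, labels)
kept = [t for t, keep in zip(vec.get_feature_names_out(), selector.get_support()) if keep]
print(kept)
```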

Motivation

Before the process of text classification, the feature vector space model (bag of words) (Sebastiani, 2002), which consists of all unique terms extracted from the training set, must be created. Then, each raw document in the corpus is transformed into a vector by mapping the terms occurring in the raw document onto the feature vector space. This vector space model can be regarded as a word-to-document feature-appearance matrix where rows are the features and columns are document
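A minimal sketch of that word-to-document matrix is given below, with rows indexed by the unique training terms and columns by documents; the toy documents are hypothetical.

```python
# Build a word-to-document feature-appearance matrix: rows are unique terms
# from the training set, columns are documents, entries are occurrence counts.
import numpy as np

docs = [["oil", "price", "oil"], ["price", "market"], ["goal", "match"]]
vocab = sorted({t for d in docs for t in d})            # the bag-of-words feature space
matrix = np.zeros((len(vocab), len(docs)), dtype=int)   # terms x documents
for j, doc in enumerate(docs):
    for term in doc:
        matrix[vocab.index(term), j] += 1

print(vocab)
print(matrix)
```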

Classifiers

In this section, we briefly describe the Naïve Bayes (NB) and Support Vector Machines (SVMs) used in our study.
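As a hedged illustration of how these two classifiers are commonly applied to bag-of-words counts, the sketch below trains scikit-learn's multinomial Naïve Bayes and a linear-kernel SVM on a hypothetical toy corpus; it stands in for, but does not reproduce, the experimental setup of the paper.

```python
# Train the two classifier families named above (NB and a linear SVM) on
# bag-of-words counts; the corpus and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

train_docs = ["oil price rises", "oil export market", "team wins match", "coach praises team"]
train_labels = [0, 0, 1, 1]

vec = CountVectorizer().fit(train_docs)
X_train = vec.transform(train_docs)

nb = MultinomialNB().fit(X_train, train_labels)
svm = LinearSVC().fit(X_train, train_labels)

test = vec.transform(["oil market news"])
print(nb.predict(test), svm.predict(test))   # both should predict category 0
```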

Experimental results on the 20-newsgroups dataset

Tables 3 and 4 show the micro-F1 results when Naïve Bayes and Support Vector Machines, respectively, are used on the 20-Newsgroups data set. It can be seen from Table 3 that the micro-F1 performance of NB with CMFS on 20-Newsgroups exceeds that obtained with the other algorithms when the number of selected features is 200, 1400, 1600, 1800 or 2000. Moreover, CMFS is inferior only to GINI when the number of selected features is 400, 600, 800, 1000 or 1200. Table 4 indicates
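Micro-averaged F1 pools true positives, false positives and false negatives over all categories before computing precision and recall; the snippet below shows the computation on hypothetical label vectors, not on the paper's results.

```python
# Micro-averaged F1 on toy single-label predictions (hypothetical data).
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f1_score(y_true, y_pred, average="micro"))   # 4 of 6 correct -> 0.666...
```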

Statistical analysis

In order to compare the performance of the proposed method with the previous approaches, the Friedman test and the Iman and Davenport (1980) test are used in the statistical analysis. Both are non-parametric tests (Demšar, 2006) whose null hypothesis is that all the algorithms are equivalent, so the ranks of all algorithms should be equal. If the null hypothesis of the Friedman and Iman & Davenport tests is rejected, a post-hoc test (the Bonferroni-Dunn test or the Holm test) (
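A minimal sketch of this testing procedure, assuming hypothetical micro-F1 scores rather than the paper's measurements, is given below: the Friedman statistic is computed over per-dataset scores of the compared algorithms and then corrected with the Iman and Davenport statistic F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F) (Demšar, 2006).

```python
# Friedman test over per-dataset scores of k algorithms, followed by the
# Iman & Davenport correction F_F = (N-1)*chi2_F / (N*(k-1) - chi2_F)
# (Demšar, 2006). The micro-F1 values below are hypothetical.
import numpy as np
from scipy import stats

# rows: N data sets (blocks), columns: k feature selection algorithms
scores = np.array([[0.82, 0.79, 0.80, 0.78],
                   [0.74, 0.76, 0.73, 0.70],
                   [0.88, 0.85, 0.86, 0.84]])
n, k = scores.shape

chi2_f, p_friedman = stats.friedmanchisquare(*scores.T)
f_f = (n - 1) * chi2_f / (n * (k - 1) - chi2_f)
p_iman_davenport = stats.f.sf(f_f, k - 1, (k - 1) * (n - 1))
print(chi2_f, p_friedman, f_f, p_iman_davenport)
```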

Conclusion

In this paper, we have proposed a novel feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. The efficiency of the proposed measure CMFS was examined through text categorization experiments with the NB and SVM classifiers. The results, compared with six well-known feature selection algorithms (Information Gain (IG), Improved Gini index (GINI), Chi-square statistic (CHI), Document Frequency (DF),

Acknowledgments

This research is supported by the National Natural Science Foundation of China under Grant No. 60971089 and the National Electronic Development Foundation of China under Grant No. 2009537.

References (29)

  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research.
  • Drucker, H., et al. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks.
  • Fragoudis, D., et al. (2005). Best terms: An efficient feature-selection algorithm for text categorization. Knowledge and Information Systems.
  • Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., Tzeras, K., Darmstadt, T. H., et al. (1991). AIR/X – A rule-based...