Feature selection using data envelopment analysis
Introduction
In many data mining applications, identifying the most characterizing features (or attributes, variables; hereafter these terms are used interchangeably) of the observed data is critical to optimizing the classification result. New computer and internet applications, e.g. the prevalent use of social media, generate massive amounts of data at an exponentially growing rate. The numerous irrelevant and redundant features in such feature spaces deteriorate the performance of machine learning algorithms and thus present challenges to feature selection.
Feature selection is desirable and essential for a number of reasons, such as reducing the complexity of training a classifier and the cost of collecting features, improving the quality of the data, and even improving classification accuracy [1], [2]. Roughly, there are three types of feature selection methods [3], [4]: embedded methods, filters, and wrappers. In embedded methods such as C4.5 [5] or SVM-RFE [6], the process of selecting features is integrated into the learning algorithm [5]. Wrappers rely on the performance estimated by a specific learning method to evaluate and select features. The drawbacks of embedded methods and wrappers are their high computational cost in learning and their poor generalization to other learning methods, as they are tightly coupled with a specific one. In contrast, filters assess features based on classifier-agnostic criteria (e.g. Fisher score [7], t-test [3], [8], mutual information [9], [10], [11], symmetrical uncertainty [1], Hilbert–Schmidt operator [12], etc.) and select features by focusing only on the intrinsic properties of the data. Developing efficient and effective filter methods has attracted great attention during the past years [1], [9], [13], [14].
Feature ranking and feature subset selection are two typical categories of feature selection with respect to the output style. The former outputs features ranked by their predictive power [15], [16], [17], while the latter evaluates feature subsets and searches for the best one [18], [1], [19], [20], [21]. Since finding an optimal subset is usually intractable and many problems related to feature selection have been shown to be NP-hard [22], [23], a trade-off between result optimality and computational efficiency has been considered in the literature, and heuristic methods with various feature evaluation criteria have thus been proposed [17], [24], [19], [1]. These criteria mainly focus on measuring feature relevance, redundancy, conditional independence, inter-dependence, etc., and combinations of such criteria (e.g. relevance analysis with mutual information + redundancy analysis with conditional mutual information) bring about a diversity of feature selection methods. Nevertheless, most of the combinations evaluate features with either prior arguments or constant coefficients, and the relative importance (weight) of each feature property, such as relevance or redundancy, usually cannot be identified. For example, MIFS [17] applies two information-theoretic metrics to measure feature dependence ($D$) and redundancy ($R$), respectively, and uses $D - \beta R$ to evaluate the quality of selected features. The parameter $\beta$ mediates the weight of the measured redundancy, and thus any change to it may influence the quality of the finally selected features. Since more than one feature property or criterion is considered and utilized, feature selection can thus be categorized as a multi-index evaluation process.
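As a concrete illustration of such a constant-coefficient criterion, the following is a minimal sketch of MIFS-style greedy selection. It assumes a mutual-information estimator for discrete variables (one possible implementation is sketched in Section 3); all names here are illustrative, not the authors' code:

```python
# Sketch of MIFS-style greedy selection using the D - beta*R criterion above.
# `features` is a list of discrete feature columns, `labels` the class vector,
# and `mi(a, b)` any mutual-information estimator (e.g. the one in Section 3).
def mifs_select(features, labels, k, mi, beta=0.5):
    """Greedily pick k feature indices maximizing I(C;f) - beta * sum I(f;s)."""
    candidates = set(range(len(features)))
    selected = []
    while candidates and len(selected) < k:
        def score(j):
            relevance = mi(features[j], labels)                    # D term
            redundancy = sum(mi(features[j], features[s]) for s in selected)
            return relevance - beta * redundancy                   # beta * R term
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Note how the fixed coefficient `beta` weights redundancy identically for every candidate, which is exactly the limitation the DEA framework below aims to remove.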
Data Envelopment Analysis (DEA) is an effective nonparametric method for efficiency evaluation and has been widely applied in many industries. It employs linear programming to evaluate and rank Decision Making Units (DMUs) when the production process has a structure of multiple inputs and outputs. Inspired by this, we regard feature selection as an evaluation process with multiple inputs and outputs, and introduce in this paper a novel DEA-based feature selection framework. An effective feature selection method based on this framework is then proposed and evaluated. The remainder of the paper is organized as follows: Section 2 briefly reviews related work. Section 3 introduces related information-theoretic metrics and Section 4 introduces a novel feature selection framework based on DEA. Section 5 then proposes a feature selection method based on this framework. In Section 6, experimental results comparing the proposed method with representative feature selection methods on twelve well-known datasets are given, and some discussion is presented. Section 7 finally presents concluding remarks and points out future work.
Section snippets
Related work
Various aspects of feature selection have been studied for years. One of the key aspects is measuring the quality of selected features. John, Kohavi, and Pfleger classified features into three categories, i.e. strongly relevant, weakly relevant, and irrelevant ones [25], and feature selection research at that time mainly focused on searching for relevant features [26]. However, since the existence and effect of feature redundancy were pointed out [27], [28], [9], [10], how to effectively handle redundant features has received increasing attention.
Information-theoretic metrics
In this section, we introduce some essential information-theoretic metrics that will be used in our method. In information theory, Mutual Information (MI) is a basic metric of information and has been widely used for quantifying the mutual dependence of random variables. The mutual information between two random variables X and Y, denoted as $I(X;Y)$, is defined as $$I(X;Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)},$$ where $x$ and $y$ are the possible value assignments of X and Y, respectively, and the logarithm is taken to base 2 so that MI is measured in bits.
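As a quick illustration, here is a minimal plug-in (empirical-frequency) estimator of the definition above for discrete-valued variables; it is a sketch for intuition, not the paper's implementation:

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete sample vectors."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)                       # empirical p(x)
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))    # empirical p(x, y)
            py = np.mean(y == yv)                   # empirical p(y)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

# X fully determines Y here, so I(X;Y) = H(Y) = 1 bit
print(mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```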
Basic and super-efficiency DEA models
As introduced previously, DEA is a powerful and widely used linear programming technique for assessing the relative efficiency of a set of DMUs. The basic DEA model, originated by Charnes, Cooper and Rhodes, is called the CCR model [41]. Consider n DMUs that use m inputs to produce s outputs. Let $x_{ij}$ and $y_{rj}$ denote the ith input and the rth output of $\mathrm{DMU}_j$; the relative efficiency of the evaluated $\mathrm{DMU}_0$ under the CCR model can be obtained by solving the following linear program.
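The snippet cuts off before the program itself; for reference, the standard input-oriented CCR model in multiplier form (after the Charnes–Cooper linearization) reads as follows. This is the textbook formulation and may differ slightly in notation from the paper's:

$$
\begin{aligned}
\max_{u,v}\quad & \theta_0=\sum_{r=1}^{s}u_r\,y_{r0}\\
\text{s.t.}\quad & \sum_{i=1}^{m}v_i\,x_{i0}=1,\\
& \sum_{r=1}^{s}u_r\,y_{rj}-\sum_{i=1}^{m}v_i\,x_{ij}\le 0,\qquad j=1,\ldots,n,\\
& u_r\ge 0,\; v_i\ge 0,\qquad r=1,\ldots,s,\; i=1,\ldots,m.
\end{aligned}
$$

The super-efficiency variant (Andersen–Petersen) simply excludes $\mathrm{DMU}_0$ itself from the constraint set, which allows efficient DMUs to obtain scores above 1 and thus be fully ranked.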
Ranking all features according to their efficiency scores
In this section, we propose a super-efficiency-DEA-based feature selection method. Unlike most existing feature selection methods, our method focuses on the entire redundancy distribution of each candidate feature, i.e. the redundancy between the candidate feature and every other feature in the feature space. Since salient features should have strong discriminative power that is not impaired given other features, we measure the conditional independence between the candidate feature and the class given the other features.
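To make the overall pipeline concrete, the following is a minimal sketch of DEA-based feature ranking under assumed inputs and outputs (the snippet above truncates the exact measures used): each feature is treated as a DMU, redundancy-type measures serve as its inputs and relevance-type measures as its outputs, and features are ranked by their Andersen–Petersen super-efficiency scores. The matrices `X` (inputs) and `Y` (outputs) and all names are illustrative, not the authors' code:

```python
import numpy as np
from scipy.optimize import linprog

def super_efficiency(X, Y, j0):
    """Andersen-Petersen super-efficiency of DMU j0 (multiplier form).

    X: (n, m) array of inputs; Y: (n, s) array of outputs; rows are DMUs.
    Decision variables are the multipliers z = [u_1..u_s, v_1..v_m].
    """
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([-Y[j0], np.zeros(m)])             # maximize u . y0
    A_eq = np.concatenate([np.zeros(s), X[j0]])[None, :]  # v . x0 = 1
    others = [j for j in range(n) if j != j0]             # exclude j0 itself
    A_ub = np.hstack([Y[others], -X[others]])             # u.yj - v.xj <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(others)),
                  A_eq=A_eq, b_eq=[1.0], bounds=(0, None), method="highs")
    if res.status != 0:                                   # unbounded LP: DMU
        return np.inf                                     # dominates all others
    return -res.fun

def rank_features(X, Y):
    """Rank features (treated as DMUs) by descending super-efficiency."""
    scores = [super_efficiency(X, Y, j) for j in range(X.shape[0])]
    return sorted(range(X.shape[0]), key=lambda j: -scores[j]), scores
```

Because the LP chooses the multipliers (weights) per feature, the relative importance of each property is determined by the data rather than by a fixed coefficient such as MIFS's $\beta$.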
Experimental results and discussion
In this section, we empirically evaluate the performance of the proposed method by comparing it with the most representative and well-performing feature ranking methods (CMIM [19], mRMR [9], and ReliefF [45]) and feature subset selection methods (CFS [18] and FCBF [1]). ReliefF is a well-known distance-based feature ranking method that searches for the nearest neighbors of samples within each class and then weights features by how well they differentiate samples of different classes.
Conclusions and future work
In this paper, we regard feature selection as an evaluation system with multiple inputs and outputs and propose a DEA-based feature selection framework: properties of features such as relevance and redundancy are taken as the inputs and outputs of the evaluation system, and a DEA method is applied to evaluate the relative efficiency of the features and to rank them according to their efficiency scores. To illustrate the effectiveness of this framework, we propose a feature selection method based on super-efficiency DEA.
Acknowledgements
The authors are grateful to the editor and three anonymous referees for their valuable and constructive comments. The first author would like to thank Dr. Xinyue Tong for her insightful comments on this paper. This work is partially supported by the National Natural Science Foundation of China (71320107001), the Fundamental Research Funds for the Central Universities, HUST (CXY12Q044 and CXY13Q035), and the Graduates’ Innovation Fund of Huazhong University of Science & Technology (HF-11-20-2013
References (55)
- Mutual information based input feature selection for classification problems, Decis. Support Syst. (2012)
- Effective feature selection scheme using mutual information, Neurocomputing (2005)
- A parameterless feature ranking algorithm based on MI, Neurocomputing (2008)
- A practical approach to feature selection
- On the approximation of minimizing non zero variables or unsatisfied relations in linear systems, Theor. Comput. Sci. (1998)
- Stochastic local search for the feature set problem, with applications to microarray data, Appl. Math. Comput. (2006)
- Irrelevant features and the subset selection problem
- Wrappers for feature subset selection, Artif. Intell. (1997)
- Divergence-based feature selection for separate classes, Neurocomputing (2013)
- Feature subset selection with cumulate conditional mutual information minimization, Expert Syst. Appl. (2012)
- Measuring the efficiency of decision making units, Eur. J. Oper. Res.
- Data envelopment analysis (DEA) – thirty years on, Eur. J. Oper. Res.
- An enhanced DEA Russell graph efficiency measure, Eur. J. Oper. Res.
- Efficient feature selection via analysis of relevance and redundancy, J. Mach. Learn. Res.
- An introduction to variable and feature selection, J. Mach. Learn. Res.
- Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng.
- C4.5: Programs for Machine Learning
- Gene selection for cancer classification using support vector machines, Mach. Learn.
- Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics
- A new dependency and correlation analysis for features, IEEE Trans. Knowl. Data Eng.
- Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell.
- Feature selection via dependence maximization, J. Mach. Learn. Res.
- Spectral feature selection for supervised and unsupervised learning
- Supervised feature selection via dependence estimation
- Estimating attributes: analysis and extensions of RELIEF
- Using mutual information for selecting features in supervised neural net learning, IEEE Trans. Neural Netw.
- Correlation-based feature selection for discrete and numeric class machine learning
Cited by (45)
- Evaluating and selecting features via information theoretic lower bounds of feature inner correlations for high-dimensional data
  2021, European Journal of Operational Research. Citation excerpt: "The conditional mutual information maximization (CMIM) criterion (Fleuret, 2004) in essence considers the complementarity implicitly using the conditional mutual information (CMI). Other typical criteria, such as those proposed by Zhang, Yang, Xiong, and Zhang (2014) and Wang, Wei, Yang, and Wang (2017), are also built based on CMI. Complementarity can also be implicitly identified using joint mutual information (JMI) (Bennasar et al., 2015; Guo & Nixon, 2009; Yang & Moody, 1999), which is actually the MI between the feature group (the existing works focus primarily on pairs of features) and the class."
- Feature assessment and ranking for classification with nonlinear sparse representation and approximate dependence analysis
  2019, Decision Support Systems. Citation excerpt: "The main objective of the present feature evaluation criteria considering feature dependencies is to identify a set of class-relevant and complementary features, wherein the redundancy among them is minimal. In general, feature dependencies comprise feature redundancy and feature complementarity [20, 28], wherein redundancy has attracted significantly more attention than complementarity due to its detriment to classification performance. In order to identify class-relevance and redundancy, a number of novel feature-evaluation criteria have been developed [8, 17, 19, 29]."
- Hybrid fast unsupervised feature selection for high-dimensional data
  2019, Expert Systems with Applications
- Feature subset selection algorithm based on symmetric uncertainty and interaction factor
  2024, Multimedia Tools and Applications