Knowledge-Based Systems

Volume 64, July 2014, Pages 70-80

Feature selection using data envelopment analysis

https://doi.org/10.1016/j.knosys.2014.03.022

Highlights

  • Feature selection is regarded as a multi-index evaluation process.

  • A novel feature selection framework based on data envelopment analysis is proposed.

  • The framework evaluates features from a perspective of “efficient frontier” without parameter setting.

  • A simple feature selection method is proposed based on super-efficiency DEA.

Abstract

Feature selection has been attracting increasing attention in recent years for its advantages in improving predictive efficiency and reducing the cost of feature acquisition. In this paper, we regard feature selection as an efficiency evaluation process with multiple evaluation indices, and propose a novel feature selection framework based on Data Envelopment Analysis (DEA). The most significant advantages of this framework are that it can make a trade-off among several feature properties or evaluation criteria and that it evaluates features from the perspective of an “efficient frontier” without parameter setting. We then propose a simple feature selection method based on the framework to effectively search for “efficient” features with high class-relevance and low conditional independence. Super-efficiency DEA is employed in our method to fully rank features according to their efficiency scores. Experimental results on twelve well-known datasets indicate that the proposed method is effective and outperforms several representative feature selection methods in most cases. The results also show the feasibility of the proposed DEA-based feature selection framework.

Introduction

In many data mining applications, identifying the most characterizing features (or attributes, variables; hereafter these terms are used interchangeably) of the observed data is critical to optimizing the classification result. New computer and internet applications, e.g. the prevalent use of social media, generate large amounts of data worldwide at an exponential rate. The massive numbers of irrelevant and redundant features in such feature spaces deteriorate the performance of machine learning algorithms and thus present challenges to feature selection.

Feature selection is desirable and essential for a number of reasons, such as reducing the complexity of training a classifier and the cost of collecting features, improving the quality of the data, and even improving classification accuracy [1], [2]. Roughly, there are three types of feature selection methods [3], [4]: embedded methods, filters, and wrappers. In embedded methods such as C4.5 [5] or SVM-RFE [6], the process of selecting features is integrated into the learning algorithm [5]. Wrappers rely on the performance estimated by a specific learning method to evaluate and select features. The drawbacks of embedded methods and wrappers are their high computational cost in learning and their poor generalization to other learning methods, as they are tightly coupled with the specified ones. In contrast, filters assess features based on classifier-agnostic criteria (e.g. Fisher score [7], $\chi^2$-test [3], [8], mutual information [9], [10], [11], symmetrical uncertainty [1], the Hilbert–Schmidt operator [12], etc.) and select features by focusing only on the intrinsic properties of the data. Developing efficient and effective filter methods has attracted great attention in recent years [1], [9], [13], [14].

Feature ranking and feature subset selection are two typical categories of feature selection with regard to output style. The former outputs features ranked by their predictive power [15], [16], [17], while the latter evaluates feature subsets and searches for the best one [18], [1], [19], [20], [21]. Since finding an optimal subset is usually intractable and many problems related to feature selection have been shown to be NP-hard [22], [23], a trade-off between result optimality and computational efficiency has been taken into consideration in the literature. Heuristic methods with various feature evaluation criteria have thus been proposed [17], [24], [19], [1]. These criteria mainly focus on the measurement of feature relevance, redundancy, conditional independence, inter-dependence, etc., and combinations of such criteria (e.g. relevance analysis with mutual information + redundancy analysis with conditional mutual information) bring about the diversity of feature selection methods. Nevertheless, most of the combinations evaluate features with either prior arguments or constant coefficients, and the relative importance (weight) of each feature property such as relevance or redundancy usually cannot be identified. For example, MIFS [17] applies two information-theoretic metrics to measure feature dependence ($D$) and redundancy ($R$), respectively, and uses $\max(D-\beta R)$ to evaluate the quality of selected features. The parameter $\beta$ mediates the weight of the measured redundancy, so any change to it may influence the quality of the finally selected features. Since more than one feature property or criterion is considered and utilized, feature selection can thus be regarded as a multi-index evaluation process.
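As a concrete illustration of how such a parameterized combination works, the following minimal Python sketch implements an MIFS-style greedy criterion $\max(D-\beta R)$ as described above. It is a sketch under our own assumptions, not the original method's code: the estimator `mutual_info_score` and the helper name `mifs_select` are our choices.

```python
# Illustrative sketch (not the authors' code) of the MIFS-style greedy
# criterion max(D - beta*R): D is the MI between a candidate feature and
# the class, R the summed MI with already-selected features.
import numpy as np
from sklearn.metrics import mutual_info_score

def mifs_select(X, y, k, beta=0.5):
    """Greedily pick k column indices of a discrete feature matrix X."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < k:
        def score(f):
            D = mutual_info_score(y, X[:, f])                  # relevance
            R = sum(mutual_info_score(X[:, s], X[:, f]) for s in selected)
            return D - beta * R                                # max(D - beta*R)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 6))
y = X[:, 0]                              # class driven by feature 0
print(mifs_select(X, y, k=3))            # feature 0 should be picked first
```

Note how the quality of the selection hinges on the fixed coefficient `beta`, which is exactly the kind of parameter setting the DEA-based framework aims to avoid.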

Data Envelopment Analysis (DEA) is an effective nonparametric method for efficiency evaluation and has been widely applied in many industries. It employs linear programming to evaluate and rank Decision Making Units (DMUs) when the production process presents a structure of multiple inputs and outputs. Inspired by this, we regard feature selection as an evaluation process with multiple inputs and outputs, and introduce in this paper a novel DEA-based feature selection framework. An effective feature selection method based on this framework is then proposed and evaluated. The remainder of the paper is organized as follows: Section 2 briefly reviews related work. Section 3 introduces some related information-theoretic metrics, and Section 4 introduces a novel feature selection framework based on DEA. Section 5 then proposes a feature selection method based on this framework. In Section 6, experimental results are given to evaluate the effectiveness of the proposed method in comparison with representative feature selection methods on twelve well-known datasets, and some discussion is presented. Section 7 finally summarizes the concluding remarks and outlines future work.

Section snippets

Related work

Various aspects of feature selection have been studied for years. One of the key aspects is to measure the quality of selected features. John, Kohavi, and Pfleger classified features into three categories, i.e. strongly relevant, weakly relevant, and irrelevant ones [25], and feature selection research at that time mainly focused on searching for relevant features [26]. However, since the existence and effect of feature redundancy were pointed out [27], [28], [9], [10], how to effectively …

Information-theoretic metrics

In this section, we introduce some essential information-theoretic metrics that will be used in our method. In information theory, Mutual Information (MI) is a basic metric of information and has been widely used for quantifying the mutual dependence of random variables. Mutual information between two random variables $X$ and $Y$, denoted as $I(X;Y)$, is defined as
$$I(X;Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)},$$
where $x\in X$ and $y\in Y$ are the possible value assignments of $X$ and $Y$, respectively, and the logarithm to …
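For a numerical check of the definition above, the following short Python snippet estimates $I(X;Y)$ directly from the empirical frequencies of two discrete variables (assuming, as is common in feature selection, a base-2 logarithm so the result is in bits).

```python
# Estimate I(X;Y) from paired discrete observations using empirical
# frequencies, following the definition in the text.
import numpy as np

def mutual_information(x, y):
    """I(X;Y) in bits, estimated from paired discrete observations."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))   # empirical joint p(x, y)
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:                            # 0 * log(0) terms vanish
                mi += pxy * np.log2(pxy / (px * py))
    return mi

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1])
print(mutual_information(x, y))   # bounded above by min(H(X), H(Y)) = 1 bit
```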

Basic and super-efficiency DEA models

As introduced previously, DEA is a powerful and widely used linear programming technique for assessing the relative efficiency of a set of DMUs. The basic DEA model originated by Charnes, Cooper and Rhodes is called the CCR model [41]. Consider $n$ DMUs that use $m$ inputs to produce $s$ outputs. Let $x_{ij}$ ($i=1,2,\dots,m$) denote the $i$th input and $y_{rj}$ ($r=1,2,\dots,s$) the $r$th output of $\mathrm{DMU}_j$ ($j=1,2,\dots,n$); the relative efficiency of $\mathrm{DMU}_p$ under the CCR model can be obtained by solving the following linear …
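The structure of this linear program can be sketched in code. The following is an illustrative implementation of the standard multiplier form of the CCR model (maximize $\sum_r u_r y_{rp}$ subject to $\sum_i v_i x_{ip}=1$ and $\sum_r u_r y_{rj}-\sum_i v_i x_{ij}\le 0$ for all $j$, with $u,v\ge 0$) using scipy.optimize.linprog; it is not the paper's code, and the toy data are hypothetical.

```python
# Multiplier form of the CCR model solved per DMU with scipy.
# X is (n_dmus, m) inputs, Y is (n_dmus, s) outputs, all positive.
import numpy as np
from scipy.optimize import linprog

def ccr_efficiency(X, Y, p):
    """Relative CCR efficiency of DMU p: max u.Y[p] subject to
    v.X[p] = 1, u.Y[j] - v.X[j] <= 0 for all j, and u, v >= 0."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([-Y[p], np.zeros(m)])          # minimize -u.Y[p]
    A_ub = np.hstack([Y, -X])                         # u.Y[j] - v.X[j] <= 0
    A_eq = np.concatenate([np.zeros(s), X[p]])[None]  # v.X[p] = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m))
    return -res.fun                                   # score in (0, 1]

X = np.array([[2.0, 3.0], [4.0, 1.0], [3.0, 3.0]])   # toy inputs
Y = np.array([[1.0], [1.0], [1.0]])                  # toy outputs
print([round(ccr_efficiency(X, Y, p), 3) for p in range(3)])
# -> [1.0, 1.0, 0.833]: DMUs 0 and 1 lie on the efficient frontier
```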

Ranking all features according to their efficiency scores

In this section, we propose a super-efficiency-DEA-based feature selection method. Unlike most of the existing feature selection methods, our method focuses on the entire redundancy distribution for each candidate feature, i.e. redundancy between the candidate feature and every other feature in the feature space. Since salient features should have strong discriminative power and the power would not be impaired given other features, we measure the conditional independence between the …
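Although the full method is not reproduced in this snippet, the ranking step can be illustrated with the Andersen–Petersen super-efficiency model, which removes the evaluated DMU from the reference set so that efficient units receive scores above 1 and can therefore be fully ranked. In the sketch below, the single input (a redundancy-like quantity) and single output (a relevance-like quantity) are hypothetical stand-ins for the paper's actual measures.

```python
# Super-efficiency ranking sketch: each feature is treated as a DMU.
import numpy as np
from scipy.optimize import linprog

def super_efficiency(X, Y, p):
    """Andersen-Petersen super-efficiency: the CCR multiplier LP with
    DMU p removed from the reference set, so efficient DMUs score > 1."""
    n, m = X.shape
    s = Y.shape[1]
    keep = [j for j in range(n) if j != p]            # drop DMU p itself
    c = np.concatenate([-Y[p], np.zeros(m)])          # minimize -u.Y[p]
    A_ub = np.hstack([Y[keep], -X[keep]])             # u.Y[j] <= v.X[j], j != p
    A_eq = np.concatenate([np.zeros(s), X[p]])[None]  # v.X[p] = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n - 1), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (s + m))
    return -res.fun

# Four features as DMUs: one hypothetical input (average redundancy with
# the other features) and one hypothetical output (MI with the class).
X_feat = np.array([[0.20], [0.35], [0.10], [0.30]])
Y_feat = np.array([[0.60], [0.55], [0.40], [0.20]])
scores = [super_efficiency(X_feat, Y_feat, p) for p in range(4)]
print(np.argsort(scores)[::-1])   # full ranking, best feature first: [2 0 1 3]
```

Because the excluded-DMU score of an efficient feature exceeds 1 while inefficient features keep their ordinary efficiency scores, all features receive distinct, comparable values, which is what enables a full ranking.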

Experimental results and discussion

In this section, we empirically evaluate the performance of the proposed method by comparing it with the most representative and well-performing feature ranking methods (CMIM [19], mRMR [9], and ReliefF [45]) and feature subset selection methods (CFS [18] and FCBF [1]). ReliefF is a well-known distance-based feature ranking method that searches for the nearest neighbors of samples in each class and then weights features in terms of how well they differentiate samples with different class labels. For …
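For context, the following simplified sketch captures the Relief idea underlying ReliefF [45], assuming binary classes and a single nearest hit and miss per sample (ReliefF itself averages over k neighbors and handles multiple classes); it is an illustration, not the benchmark implementation used in the experiments.

```python
# Simplified Relief: features are rewarded for separating each sample from
# its nearest neighbor of the other class (miss) and penalized for differing
# from its nearest neighbor of the same class (hit).
import numpy as np

def relief_weights(X, y):
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)              # L1 distances
        dist[i] = np.inf                                 # skip the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n
    return w                                             # higher = better

X = np.array([[1.0, 0.1], [0.9, 0.8], [0.1, 0.2], [0.2, 0.9]])
y = np.array([1, 1, 0, 0])
print(relief_weights(X, y))   # feature 0 separates the classes, feature 1 does not
```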

Conclusions and future work

In this paper, we regard feature selection as an evaluation system with multiple inputs and outputs and propose a DEA-based feature selection framework: properties of features such as relevance and redundancy are taken as the inputs and outputs of the evaluation system, and a DEA method is then applied to evaluate the relative efficiency of the features and to rank them according to their efficiency scores. To illustrate the effectiveness of this framework, we propose a …

Acknowledgements

The authors are grateful to the editor and three anonymous referees for their valuable and constructive comments. The first author would like to thank Dr. Xinyue Tong for her insightful comments on this paper. This work is partially supported by the National Natural Science Foundation of China (71320107001), the Fundamental Research Funds for the Central Universities, HUST (CXY12Q044 and CXY13Q035), and the Graduates’ Innovation Fund of Huazhong University of Science & Technology (HF-11-20-2013 …

References (55)

  • A. Charnes et al.

    Measuring the efficiency of decision making units

    Eur. J. Oper. Res.

    (1978)
  • W.D. Cook et al.

    Data envelopment analysis (DEA) – thirty years on

    Eur. J. Oper. Res.

    (2009)
  • J.T. Pastor et al.

    An enhanced DEA Russell graph efficiency measure

    Eur. J. Oper. Res.

    (1999)
  • L. Yu et al.

    Efficient feature selection via analysis of relevance and redundancy

    J. Mach. Learn. Res.

    (2004)
  • I. Guyon et al.

An introduction to variable and feature selection

    J. Mach. Learn. Res.

    (2003)
  • H. Liu et al.

Toward integrating feature selection algorithms for classification and clustering

    IEEE Trans. Knowl. Data Eng.

    (2005)
  • R. Quinlan

    C4.5: Programs for Machine Learning

    (1993)
  • I. Guyon et al.

    Gene selection for cancer classification using support vector machines

    Mach. Learn.

    (2002)
  • T.S. Furey et al.

    Support vector machine classification and validation of cancer tissue samples using microarray expression data

    Bioinformatics

    (2000)
  • G. Qu et al.

    A new dependency and correlation analysis for features

    IEEE Trans. Knowl. Data Eng.

    (2005)
  • H. Peng et al.

    Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • L. Song et al.

    Feature selection via dependence maximization

    J. Mach. Learn. Res.

    (2012)
  • Z. Zhao et al.

    Spectral feature selection for supervised and unsupervised learning

  • L. Song et al.

    Supervised feature selection via dependence estimation

  • I. Kononenko

Estimating attributes: analysis and extensions of RELIEF

  • R. Battiti

Using mutual information for selecting features in supervised neural net learning

    IEEE Trans. Neural Netw.

    (1994)
  • M.A. Hall

    Correlation-based feature selection for discrete and numeric class machine learning
