
CBFS: High Performance Feature Selection Algorithm Based on Feature Clearness

  • Minseok Seo,

    Affiliation Department of Nanobiomedical Science and WCU Research Center of Nanobiomedical Science, Dankook University, Cheonan, South Korea

  • Sejong Oh

    sejongoh@dankook.ac.kr

    Affiliation Department of Nanobiomedical Science and WCU Research Center of Nanobiomedical Science, Dankook University, Cheonan, South Korea

Abstract

Background

The goal of feature selection is to select useful features and simultaneously exclude garbage features from a given dataset for classification purposes. This is expected to reduce processing time and improve classification accuracy.

Methodology

In this study, we devised a new feature selection algorithm (CBFS) based on the clearness of features. Feature clearness expresses the separability among classes within a feature. Highly clear features contribute towards obtaining high classification accuracy. CScore is a measure that scores the clearness of each feature; it is based on how samples cluster around the class centroids within the feature. We also suggest combining CBFS with other algorithms to improve classification accuracy.

Conclusions/Significance

From the experiments we confirm that CBFS outperforms state-of-the-art feature selection algorithms, including FeaLect. CBFS can be applied to microarray gene selection, text categorization, and image classification.

Introduction

The fundamental goal of feature selection is to select useful features and eliminate useless ones in a high-dimensional dataset to improve the performance of learning models by alleviating the effects of dimensionality, enhancing generalization capability, speeding up the learning process and improving model interpretability. Typical application areas of feature selection are gene selection from microarray data and text categorization.

In the machine learning literature there are three general approaches to feature selection: filters, wrappers, and embedded methods [1], [2]. Filter methods select the optimal feature subset based solely on the dataset, by evaluating each feature with specific statistics, completely independently of the classification algorithm. In contrast, wrapper methods make use of the algorithm that will be used to build the final classifier to select a feature subset. Compared with filters, they tend to be more computationally expensive, but provide superior performance [3] since they are wrapped around the learning algorithm and are well suited to the interests of the classifier. In the embedded technique, the search for an optimal subset of features is built into the classifier construction and can be seen as a search in the combined space of feature subsets and hypotheses. Like wrapper approaches, embedded approaches are therefore specific to a given learning algorithm.

Filter methods are the most popular approach because they are simple and fast at extracting target features. FSDD [4], Relief [5], and MRMR [6] are up-to-date feature selection algorithms that belong to the filter methods. FSDD is a distance discriminant method. This algorithm calculates the grade of each feature using a distance matrix. The criterion used for selecting good features is db − βdw, where db is the distance between different classes, dw is the distance within classes, and β is a user-defined value that is usually set to 2 and controls the impact of dw: the higher the value of β, the more the focus is placed on the distance within classes. The Relief algorithm repeatedly selects an instance at random and identifies its nearest neighbors, one from its own class and others from different classes. The quality estimator of every attribute is then updated to reflect how well the feature distinguishes the instance from its closest neighbors. In each iteration, an instance x is selected randomly, and its nearest instance is found in the same class (NH) as well as in a different class (NM); the weight value is then updated by Eq. (6) given below. Because of this iterative characteristic, its runtime is slow compared with other feature selection methods, and its results differ between runs because instances are selected randomly. MRMR [6] has been proposed to select features that, in terms of mutual information, are maximally relevant to the target class while having minimal pair-wise redundancy. The minimum redundancy condition is

$$\min W_I, \qquad W_I = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j),$$

where S denotes the feature subset and I(i, j) is the mutual information of two variables i and j. The maximum relevance condition maximizes the total relevance

$$\max V_I, \qquad V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i),$$

where I(h, i) represents the mutual information between the target class h and gene expression i.
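To make the two conditions above concrete, the following is a minimal Java sketch of greedy mRMR-style selection on discretized data. The incremental difference form (relevance minus average redundancy with the already-selected features) and the requirement that features be discretized are simplifications of this sketch, not details taken from [6].

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MrmrSketch {

    // Mutual information (in nats) between two discrete vectors of equal length.
    static double mutualInformation(int[] a, int[] b) {
        int n = a.length;
        Map<Integer, Integer> countA = new HashMap<>();
        Map<Integer, Integer> countB = new HashMap<>();
        Map<String, Integer> countAB = new HashMap<>();
        for (int k = 0; k < n; k++) {
            countA.merge(a[k], 1, Integer::sum);
            countB.merge(b[k], 1, Integer::sum);
            countAB.merge(a[k] + "," + b[k], 1, Integer::sum);
        }
        double mi = 0.0;
        for (Map.Entry<String, Integer> e : countAB.entrySet()) {
            String[] key = e.getKey().split(",");
            double pab = e.getValue() / (double) n;
            double pa = countA.get(Integer.parseInt(key[0])) / (double) n;
            double pb = countB.get(Integer.parseInt(key[1])) / (double) n;
            mi += pab * Math.log(pab / (pa * pb));
        }
        return mi;
    }

    // Greedy selection: at each step pick the unused feature maximizing
    // I(feature, class) minus the average I(feature, already-selected feature).
    static List<Integer> selectFeatures(int[][] features, int[] classLabels, int howMany) {
        int m = features.length;
        double[] relevance = new double[m];
        for (int j = 0; j < m; j++) {
            relevance[j] = mutualInformation(features[j], classLabels);
        }
        List<Integer> selected = new ArrayList<>();
        boolean[] used = new boolean[m];
        for (int step = 0; step < howMany && step < m; step++) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < m; j++) {
                if (used[j]) continue;
                double redundancy = 0.0;
                for (int s : selected) {
                    redundancy += mutualInformation(features[j], features[s]);
                }
                if (!selected.isEmpty()) redundancy /= selected.size();
                double score = relevance[j] - redundancy;
                if (score > bestScore) {
                    bestScore = score;
                    best = j;
                }
            }
            used[best] = true;
            selected.add(best);
        }
        return selected;
    }
}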

FeaLect [7] is a very high quality wrapper method. It is an alternative feature selection algorithm based on the Lasso [8] for building a robust predictor. The Lasso is an L1-regularization technique for linear regression that has attracted much attention in machine learning and statistics. Although efficient algorithms exist for recovering the whole regularization path of the Lasso, finding a subset of highly relevant features that leads to a robust predictor is an important aspect to investigate. The hypothesis behind FeaLect is that a scoring scheme that measures the quality of each feature can provide a more robust selection of features. The FeaLect approach is to generate several samples from the training data, determine the best relevance ordering of the features for each sample, and finally combine these relevance orderings to select highly relevant features.

In this paper, we propose a clearness-based feature selection (CBFS) algorithm, which can be classified as a filter method. In our context, clearness means the separability between classes within a feature. If (clearness of feature f2) > (clearness of feature f1), then f2 is more advantageous for classification than f1. In Fig. 1, feature f2 is clearer than f1: Ο and ⧫ are data samples of two classes in f1 and f2, and the mixed area of f1 is larger than that of f2. Therefore, the classification accuracy using f1 may be lower than that using f2. In the CBFS method, we measure the clearness of each feature in a dataset and select the top-ranked features. CBFS calculates the distance between a target sample and the centroid of each class, and then compares the class of the nearest centroid with the class of the target sample. The matching ratio over all samples in a feature becomes the clearness value of that feature. We describe the detailed process for obtaining clearness values in the Materials and Methods section.

Figure 1. Clearness of features f1 and f2. The mixed area of f1 is larger than that of f2, which means feature f2 is clearer than f1.

https://doi.org/10.1371/journal.pone.0040419.g001

The proposed method can also be combined with other feature ranking algorithms. We combine the proposed method with R-value [9] and validate the improvement in classification accuracy. R-value is a feature ranking algorithm that also measures the clearness of each feature, but in a different way from CBFS: it considers the nearest-neighbor samples of a target sample to decide whether the sample is located in a congested (overlapping) area or not. In some cases, R-value-based feature selection produces better accuracy than CBFS, so we can expect that combining the two improves classification accuracy.

Materials and Methods

Clearness-based feature scoring scheme

As mentioned, the proposed method can be classified as a filter method. Every filter method has a scoring scheme for each feature in a dataset; CBFS adopts CScore. CScore(fi) is a scoring function for feature fi that measures the clearness of the feature. Intuitively, CScore for feature fi is the degree to which samples are correctly clustered to the centroid of their own class in fi. In the context of CBFS, each sample is assigned to the nearest class centroid. If a sample of class A is assigned to the centroid of class B, it is a mis-clustered sample. In Fig. 2(b), two samples are mis-clustered, whereas all samples are correctly clustered in Fig. 2(a). Well-clustered features clearly lead to high classification accuracy.

Figure 2. CScores of two features.

Fig. 2(b) has two mis-clustered samples whereas Fig. 2(a) has clearly clustered samples.

https://doi.org/10.1371/journal.pone.0040419.g002

Suppose a dataset DS has n samples, m features, and p classes. DS can be denoted as a set of samples xi:

DS = {x1, x2, …, xi, …, xn}

Each sample is a vector with m elements (features):

xi = (xi1, xi2, …, xij, …, xim)

A set CS contains the class labels corresponding to the samples in DS:

CS = {c1, c2, …, ci, …, cn}

Class labels are sequential numerical values in the range [0, p−1]. Now we introduce the procedure to obtain CScore(fi).

Step 1.

Calculate the centroid of each class. The centroid is the center point of a class, computed by averaging. Med(fi, j) denotes the centroid of class j in feature fi, which is calculated by:

$$\mathrm{Med}(f_i, j) = \frac{1}{|\{k \mid c_k = j\}|} \sum_{k:\, c_k = j} x_{ki} \qquad (1)$$

Step 2.

Calculate the predicted class label for each xij in sample xi. After calculating the distance between xij and Med(fj, t) for every class t, we take the nearest centroid Med(fj, s); s is then the predicted class label for xij. The distance between xij and Med(fj, t) is calculated by:

$$d\big(x_{ij}, \mathrm{Med}(f_j, t)\big) = \left| x_{ij} - \mathrm{Med}(f_j, t) \right| \qquad (2)$$

As a result of Step 2, we have an n × m matrix M1 whose element M1(i, j) is the predicted class label for xij.

Step 3.

Calculate the n × m matrix M2, which records whether the predicted class label matches the correct class label in CS. M2(i, j) is calculated by:

$$M_2(i, j) = \begin{cases} 1, & \text{if } M_1(i, j) = c_i \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$

Step 4.

Calculate CScore(fi). Finally, CScore(fi) is the fraction of correctly clustered samples in feature fi:

$$\mathrm{CScore}(f_i) = \frac{1}{n} \sum_{k=1}^{n} M_2(k, i) \qquad (4)$$

Fig. 3 presents Steps 1 to 4. The range of CScore(fi) is [0, 1]. If CScore(fi) is close to 1, the classes in feature fi are well clustered and elements in fi can be clearly classified. Therefore, we can use CScore(fi) as a criterion for selecting features for classification work. CBFS chooses highly scored features using the CScore() function. An implementable algorithm for computing CScore() is available in the Supplementary Material link (http://biosw.dankook.ac.kr/cbfs). CBFS can also be combined with other feature scoring schemes. To distinguish the combined algorithms introduced in the next section, we denote the pure CBFS algorithm as CBFSorg.
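For readers who prefer code to the step-by-step description, the following is a minimal Java sketch of Steps 1–4. It is not the authors' released program linked above; the array layout and method names are choices of this sketch, and it assumes every class label in [0, numClasses) occurs in the data.

import java.util.Arrays;

public class CScoreSketch {

    // data[i][j] is the value of feature j for sample i; labels[i] is in [0, numClasses-1].
    // Returns scores[j] = fraction of samples whose nearest class centroid
    // (within feature j alone) carries the sample's own class label.
    static double[] cscores(double[][] data, int[] labels, int numClasses) {
        int n = data.length;
        int m = data[0].length;
        double[] scores = new double[m];

        for (int j = 0; j < m; j++) {
            // Step 1: centroid of each class in feature j (average of the class values).
            double[] sum = new double[numClasses];
            int[] count = new int[numClasses];
            for (int i = 0; i < n; i++) {
                sum[labels[i]] += data[i][j];
                count[labels[i]]++;
            }
            double[] centroid = new double[numClasses];
            for (int t = 0; t < numClasses; t++) {
                centroid[t] = sum[t] / count[t];
            }

            // Steps 2-4: predict by nearest centroid, count matches, take the ratio.
            int matches = 0;
            for (int i = 0; i < n; i++) {
                int predicted = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int t = 0; t < numClasses; t++) {
                    double dist = Math.abs(data[i][j] - centroid[t]);
                    if (dist < bestDist) {
                        bestDist = dist;
                        predicted = t;
                    }
                }
                if (predicted == labels[i]) {
                    matches++;
                }
            }
            scores[j] = matches / (double) n;
        }
        return scores;
    }

    // Indices of the top-k features by CScore (the CBFSorg selection rule).
    static int[] topFeatures(double[] scores, int k) {
        Integer[] order = new Integer[scores.length];
        for (int j = 0; j < scores.length; j++) order[j] = j;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        int[] top = new int[Math.min(k, order.length)];
        for (int j = 0; j < top.length; j++) top[j] = order[j];
        return top;
    }
}

For instance, the feature of Fig. 2(a) would score 1.0 because every sample is assigned to its own class centroid, while the feature of Fig. 2(b), with its two mis-clustered samples, would score (n − 2)/n.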

Improvement of CBFS with R-value

Although CBFS itself shows high performance for feature selection, we can improve its quality by combining it with other scoring schemes. In this section, we describe an example that combines CBFS and R-value; the same approach can be applied to combine other scoring schemes. The scoring function of CBFS is based on the distance between each data point and the class centroids. In some cases, this produces wrong scores, as shown in Fig. 4. In Fig. 4(a), class A and class B are clearly separated, but the two points of class B in the dotted circle are assigned to class A, which decreases the CScore() value. If two classes overlap widely, as shown in Fig. 4(b), many points in the overlapping area will be mis-assigned. In the cases shown in Fig. 4, the R-value is a better scoring function because it is based on the nearest neighbors of each sample rather than the distance to the class centroids.

Figure 4. Two cases in which CScore() produces wrong scores.

If the data range of a class is much smaller than that of a neighboring class, CScore may produce a wrong score (Case 1). If two classes have a wide overlapping area, CScore may produce a wrong score (Case 2).

https://doi.org/10.1371/journal.pone.0040419.g004

Traditional approaches that combine different feature selection methods usually just use intersection. The next box presents the simple steps required to combine CBFS and R-value. We denote this approach as CBFSintersection.

  • Step 1. Choose the n features with the highest scores by CScore().
  • Step 2. Choose the n features with the lowest scores by R-value.
  • Step 3. Extract m features from the intersection of Step 1 and Step 2.

A difficulty with CBFSintersection is how to determine n if we fix m. For example, even if we want to get 20 features using CBFSintersection, we do not know the correct value of n because we cannot estimate the size of the intersection between Step 1 and Step 2. Therefore, we modify CBFSintersection so that it extracts exactly m features. We denote it as CBFSexact. The basic steps for CBFSexact are as follows (a code sketch is given after the list):

  • Step 1. Initialize n as m and ExtractList as empty.
  • Step 2. Choose the n features with the highest scores by CScore().
  • Step 3. Choose the n features with the lowest scores by R-value.
  • Step 4. Extract the features in the intersection of Step 2 and Step 3; if they are not already in ExtractList, add them to ExtractList.
  • Step 5. If the number of elements in ExtractList < m, then

n ← n + 1

go to Step 2
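The following is a minimal Java sketch of the CBFSexact loop above. The R-value scores are assumed to be computed separately by an R-value implementation [9] (its formula is not reproduced here), and the method and variable names are choices of this sketch, not of the released program.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CbfsExactSketch {

    // cscores: higher is better; rvalues: lower is better (supplied by an external
    // R-value implementation, treated here as a black box).
    // Returns exactly m feature indices following the Step 1-5 loop above.
    static List<Integer> selectExact(double[] cscores, double[] rvalues, int m) {
        int total = cscores.length;
        Integer[] byCScore = indicesSortedBy(cscores, true);   // descending CScore
        Integer[] byRValue = indicesSortedBy(rvalues, false);  // ascending R-value

        Set<Integer> extractList = new LinkedHashSet<>();      // keeps insertion order
        int n = m;
        while (extractList.size() < m && n <= total) {
            Set<Integer> topC = new LinkedHashSet<>(Arrays.asList(byCScore).subList(0, n));
            for (int j = 0; j < n; j++) {
                int f = byRValue[j];
                if (topC.contains(f)) {
                    extractList.add(f);          // intersection of the two top-n lists
                }
            }
            n++;                                  // n <- n + 1, then repeat from Step 2
        }
        return new ArrayList<>(extractList).subList(0, Math.min(m, extractList.size()));
    }

    private static Integer[] indicesSortedBy(double[] score, boolean descending) {
        Integer[] order = new Integer[score.length];
        for (int j = 0; j < score.length; j++) order[j] = j;
        Arrays.sort(order, (a, b) -> descending
                ? Double.compare(score[b], score[a])
                : Double.compare(score[a], score[b]));
        return order;
    }
}

Because the intersection can grow by more than one feature in a single iteration, the sketch trims the final list back to exactly m entries.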

A common difficulty in feature selection is choosing among multiple features that have the same ranking score. Taking the intersection with another feature selection method offers a solution to this problem. Pseudocode for CBFSintersection, CBFSexact, and R-value is available in the Supplementary Material link (http://biosw.dankook.ac.kr/cbfs). In the next section, we present benchmarking results for CBFSorg, CBFSintersection, and CBFSexact.

Datasets, feature selection algorithms, and classifiers

To compare feature selection algorithms, we chose various kinds of datasets containing varying numbers of features and samples. Duke, Leukemia, DLBCL, and Carcinoma are well-known microarray datasets. The other datasets come from the UCI repository [10] and several websites. Table 1 summarizes the benchmark datasets. The FeaLect, FSDD, and Relief feature selection algorithms are compared with the proposed CBFSorg, CBFSintersection, and CBFSexact. FeaLect is widely considered a state-of-the-art algorithm and is described in the Introduction. For simplicity, we denote FeaLect as Lect from here on.

The basis of the FSDD algorithm is to identify features that give good separability among classes while keeping the samples within each class as close together as possible. The criterion used for selecting good features is db − βdw; in the full criterion function, m is the number of selected features, c is the number of classes, and p_i is the prior probability of the ith class.
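As a rough illustration only, the sketch below scores a single feature with a db − β·dw style criterion, taking db as the prior-weighted squared deviation of the class means from the overall mean and dw as the prior-weighted within-class variance. This is one plausible reading of a distance-discriminant criterion, not necessarily the exact FSDD formula from [4].

public class DistanceDiscriminantSketch {

    // Scores one feature with a d_b - beta * d_w style criterion.
    // values[i] is the feature value of sample i; labels[i] is in [0, numClasses-1].
    static double score(double[] values, int[] labels, int numClasses, double beta) {
        int n = values.length;
        double[] sum = new double[numClasses];
        double[] sumSq = new double[numClasses];
        int[] count = new int[numClasses];
        double overallSum = 0.0;

        for (int i = 0; i < n; i++) {
            sum[labels[i]] += values[i];
            sumSq[labels[i]] += values[i] * values[i];
            count[labels[i]]++;
            overallSum += values[i];
        }
        double overallMean = overallSum / n;

        double db = 0.0;  // prior-weighted spread of class means around the overall mean
        double dw = 0.0;  // prior-weighted within-class variance
        for (int t = 0; t < numClasses; t++) {
            double prior = count[t] / (double) n;
            double mean = sum[t] / count[t];
            double variance = sumSq[t] / count[t] - mean * mean;
            db += prior * (mean - overallMean) * (mean - overallMean);
            dw += prior * variance;
        }
        return db - beta * dw;
    }
}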

Relief is regarded as one of the most successful feature selection algorithms. The basic idea of Relief is to iteratively estimate feature weights according to their ability to discriminate between neighboring instances. In each iteration, an instance x is selected randomly, and its nearest instance is found in the same class (NH, the nearest hit) as well as in a different class (NM, the nearest miss). Finally, the weight of feature f is updated by the equation:

$$w_f \leftarrow w_f - \frac{\mathrm{diff}(f, x, \mathrm{NH})}{T} + \frac{\mathrm{diff}(f, x, \mathrm{NM})}{T} \qquad (6)$$

where diff(f, a, b) is the normalized difference of instances a and b on feature f, and T is the number of iterations.

If w1 < w2, then feature 2 is better than feature 1. The ReliefF algorithm [11], which is an updated version of Relief, is more robust and can deal with incomplete and noisy data.
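The update in Eq. (6) can be sketched as follows for numeric features. The min-max normalization inside diff() and the exhaustive nearest-neighbor search are standard choices in Relief implementations and are shown only as an illustration; the sketch assumes at least two classes with at least two samples each.

import java.util.Random;

public class ReliefSketch {

    // data[i][f]: value of feature f for sample i; labels[i]: class of sample i.
    // Returns one weight per feature after the given number of random iterations (Eq. (6)).
    static double[] reliefWeights(double[][] data, int[] labels, int iterations, long seed) {
        int n = data.length;
        int m = data[0].length;
        double[] min = new double[m];
        double[] max = new double[m];
        for (int f = 0; f < m; f++) {
            min[f] = Double.POSITIVE_INFINITY;
            max[f] = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++) {
                min[f] = Math.min(min[f], data[i][f]);
                max[f] = Math.max(max[f], data[i][f]);
            }
        }

        double[] w = new double[m];
        Random rnd = new Random(seed);
        for (int it = 0; it < iterations; it++) {
            int x = rnd.nextInt(n);
            int nearestHit = nearest(data, labels, x, true);
            int nearestMiss = nearest(data, labels, x, false);
            if (nearestHit < 0 || nearestMiss < 0) continue;  // degenerate class layout
            for (int f = 0; f < m; f++) {
                w[f] += (-diff(data, min, max, f, x, nearestHit)
                         + diff(data, min, max, f, x, nearestMiss)) / iterations;
            }
        }
        return w;
    }

    // Normalized per-feature difference, in [0, 1].
    private static double diff(double[][] data, double[] min, double[] max,
                               int f, int a, int b) {
        double range = max[f] - min[f];
        return range == 0 ? 0 : Math.abs(data[a][f] - data[b][f]) / range;
    }

    // Index of the nearest sample to x with the same class (hit) or a different class (miss).
    private static int nearest(double[][] data, int[] labels, int x, boolean sameClass) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < data.length; i++) {
            if (i == x || (labels[i] == labels[x]) != sameClass) continue;
            double dist = 0.0;
            for (int f = 0; f < data[0].length; f++) {
                double d = data[i][f] - data[x][f];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }
}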

To compare classification accuracy between the current feature selection algorithms and the proposed CBFS, we used the k-nearest neighbor (KNN) and support vector machine (SVM) classifiers. In the KNN analysis, we used K = 5 because this value produced the best accuracy in most cases. For the SVM test, we used the LIBSVM tool [12] with a linear kernel; all user-defined parameters, such as degree, gamma, and coef0, were left at their default values. We used the Lect implementation provided in its R package. The user-defined values for FSDD were beta = 3 and K = 3; for ReliefF, we used K = 7. The proposed CBFSintersection uses a threshold value of n = 100. To avoid overfitting, we used the well-known k-fold cross-validation method [13] with k = 5.

Results

Relevance, sparsity, and optimality are measures used to evaluate feature selection algorithms. Relevance and sparsity are generally used in the microarray domain and require domain knowledge. Optimality evaluates classification accuracy using the same number of features from different feature selection algorithms. Fig. 5 and 6 present the optimality evaluation. We tested KNN and SVM to compare classification accuracy based on 5-fold cross-validation. Fig. 5 and 6 show that the proposed CBFSorg is far better than current filter methods such as FSDD and Relief. In addition, it also outperforms Lect, a high-quality feature selection method that can be classified as a wrapper method. In general, wrapper methods produce better classification accuracy but require long execution times. Although the proposed method is a filter method, it matches or exceeds the performance of the wrapper method. CBFSorg shows good classification accuracy with both KNN and SVM, which indicates that it generalizes well across well-known classification algorithms.

Figure 5. Comparison of classification accuracy using KNN where K = 5.

In every case, CBFSorg shows the best classification accuracy.

https://doi.org/10.1371/journal.pone.0040419.g005

Figure 6. Comparison of classification accuracy using SVM with a linear kernel.

Lect and CBFSorg show similar performance, but CBFSorg is slightly better than Lect.

https://doi.org/10.1371/journal.pone.0040419.g006

Fig. 7 shows a PCA analysis of the feature selection results for the Arcene dataset. Red and black points represent samples of two different classes. The congestion areas of red and black points are narrower in the CBFS graphs than in the others. In general, the narrower the congestion area, the better the classification accuracy we can expect. This is why the CBFS algorithm produces higher accuracy than the other algorithms.

Figure 7. PCA analysis of the feature selection results for the Arcene dataset.

Congestion areas of red and black points are narrower in the CBFS graphs compared with the others.

https://doi.org/10.1371/journal.pone.0040419.g007

Table 2 summarizes the best classification accuracies obtained by the prepared feature selection algorithms on the benchmark datasets. We tested various parameter values and numbers of features for each feature selection algorithm and classifier, and chose the best accuracies. The proposed CBFSorg and CBFSexact achieve the top accuracies for every dataset except Prostate. In particular, CBFSexact produces accuracies 22.7% and 23.8% higher than Lect on the Duke and Madelon datasets, respectively. The number of features needed to produce the best accuracy with CBFSorg and CBFSexact is also generally smaller than with the other algorithms. For example, Lect, CBFSorg, and CBFSexact all produce the best accuracy on the Leukemia dataset, but Lect uses 25 features whereas CBFSexact uses only 5.

Table 2. Best classification accuracies and number of features to produce accuracies.

https://doi.org/10.1371/journal.pone.0040419.t002

Table 2 also shows that CBFSexact produces better accuracy than CBFSorg in some cases, whereas CBFSintersection has lower accuracy than both CBFSorg and CBFSexact. This suggests that combining multiple feature selection algorithms is worth considering as a way to improve classification accuracy.

Some microarray datasets have a small number of samples; for example, Carcinoma has only 72 samples. In that case, classification accuracy alone is not a reliable measure for evaluating feature selection algorithms, and instead we need to analyze the risk of mis-classification or prediction using a loss function [14] or a Receiver Operating Characteristic (ROC) curve [15]. An ROC curve is a two-dimensional plot of sensitivity against 1 − specificity; ROC curves are widely used in biology and medical science for evaluating prediction methods or markers. We used ROC curves to compare the stability of CBFS with that of Lect; because Lect is currently a top-ranked feature selection algorithm, we use only it for comparison. Fig. 8 and 9 show the ROC analysis for CBFS and Lect on the Duke and Prostate datasets, respectively. We extracted five features using CBFS and Lect and listed the relationship between the samples in those features and their class labels. We also computed AUC values according to [15], using the average values obtained from the ROC curves. Fig. 8(c) and Fig. 9(c) show that the AUC value of CBFS is greater than that of Lect, which means that CBFS is more stable and superior to Lect.

Figure 8. ROC curve and AUC values for CBFS and Lect feature selection algorithms on the Duke dataset.

https://doi.org/10.1371/journal.pone.0040419.g008

Figure 9. ROC curve and AUC values for CBFS and Lect feature selection algorithms on Prostate dataset.

https://doi.org/10.1371/journal.pone.0040419.g009

Discussion

Time complexity of calculating CScore

CBFS is a fast and efficient algorithm. From the steps used to calculate CScore() in the Materials and Methods section, we can analyze its time complexity. Let n, c, and m denote the number of samples, classes, and features of the given dataset, respectively. The time complexity of each step of the CBFS method is as follows:

  1. Calculate the centroid of each class in each feature: O(mn)
  2. Produce the n × m matrices M1 and M2: O(mnc)
  3. Calculate CScore() for each feature: O(mn)

Therefore, the total time complexity is O((2 + c)·mn). Table 3 shows the computation times of feature selection for the selected algorithms; CBFS is the fastest feature selection algorithm.
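As an illustration with hypothetical sizes (not figures taken from Table 3), a dataset with n = 1,000 samples, m = 10,000 features, and c = 2 classes requires on the order of

$$(2 + c)\,mn = 4 \times 1{,}000 \times 10{,}000 = 4 \times 10^{7}$$

elementary operations, i.e., roughly c + 2 linear passes over the n × m data matrix.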

Table 3. Computation time of feature selection for the Madelon dataset.

https://doi.org/10.1371/journal.pone.0040419.t003

Overfitting problem of proposed algorithm

Overfitting is a general problem for machine learning algorithms such as classifiers. To avoid overfitting, k-fold cross-validation and LOOCV schemes are used in classification tests, and the resulting validation errors can be used to evaluate feature selection algorithms. Table 4 shows the classification accuracies and validation errors of Lect and CBFS on the benchmark datasets. We calculated validation errors using from five to twenty features derived by Lect and CBFS. Lect uses the L1-regularization technique to avoid overfitting, so we can indirectly evaluate the validation error of CBFS by comparing Lect and CBFS. CBFS gives lower validation errors than Lect for every dataset except Sonar; the average validation errors of Lect and CBFS are ±8.33% and ±6.85%, respectively. If a feature selection algorithm has a lower validation error, it is less sensitive to the distribution of samples and may suffer less from overfitting. Most distance-based filters assume that a feature with short intra-class and long inter-class distances will produce high classification accuracy, but this assumption carries a risk of overfitting. The proposed CScore() evaluates each feature based on how strongly samples condense around their class centroids, and this reduces the validation error.

Table 4. Classification accuracies and validation errors for Lect and CBFS.

https://doi.org/10.1371/journal.pone.0040419.t004

Application of CBFS

CBFS can be applied to any area of data analysis that requires a feature selection scheme, such as microarray gene selection, text categorization, and image classification. Microarray data are used to screen thousands of genes and determine whether genes are related to a specific disease such as cancer. A gene corresponds to a feature, and CBFS can suggest candidate genes according to the feature evaluation values; medical experts can then analyze the biological functions of the candidate genes and find target genes related to the disease. Feature selection is also an essential part of text classification. Document collections contain 10,000 to 100,000 or more unique words, many of which are not useful for classification. Restricting the set of words used for classification makes classification more efficient and can improve the generalization error [16]. Image retrieval is another application area of CBFS. In image retrieval, each image may have a large number of features, and at the feature extraction step we do not know which features characterize each image effectively. After applying CBFS, we can evaluate the quality of each feature and select the best features for an image retrieval system.

The CBFS Java program is available at http://biosw.dankook.ac.kr/cbfs.

Author Contributions

Conceived and designed the experiments: SO MS. Performed the experiments: MS. Analyzed the data: SO MS. Contributed reagents/materials/analysis tools: MS. Wrote the paper: SO MS.

References

  1. Guyon I, Elisseeff A (2003) An Introduction to Variable and Feature Selection. J Mach Learn Res 3: 1157–1182.
  2. Saeys Y, Inza I, Larranaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23(19): 2507–2517.
  3. Berrar DP, Dubitzky W, Granzow M (2009) A Practical Approach to Microarray Data Analysis. Springer Publishing Company, Incorporated.
  4. Liang J, Yang S, Winstanley A (2008) Invariant optimal feature selection: A distance discriminant and feature ranking based solution. Pattern Recogn 41: 1429–1439.
  5. Robnik-Sikonja M, Kononenko I (2003) Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach Learn 53: 23–69.
  6. Ding C, Peng H (2003) Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Proc IEEE Computer Society Conference on Bioinformatics: 523.
  7. Zare H (2010) FeaLect: Feature seLection by computing statistical scores. http://cran.rakanu.com. Accessed 2011 Dec 01.
  8. Wainwright MJ (2009) Sharp Thresholds for High-Dimensional and Noisy Sparsity Recovery Using L1-Constrained Quadratic Programming (Lasso). IEEE Trans Inform Theory 55(5): 2183–2202.
  9. Oh S (2011) A New Feature Evaluation Method Based on Category Overlap. Comput Biol Med 41(2): 115–122.
  10. Blake C, Merz CJ (1998) UCI repository of machine learning databases. UCI Machine Learning Repository. Accessed 2011 Nov 11.
  11. Kononenko I, Simec E (1995) Induction of decision trees using ReliefF. In: Della Riccia G, Kruse R, Viertl R, editors. Mathematical and Statistical Methods in Artificial Intelligence, CISM Courses and Lectures No. 363. Springer Verlag.
  12. Chang C, Lin C (2005) LIBSVM – A Library for Support Vector Machines. Accessed 2011 Nov 11.
  13. Bengio Y, Grandvalet Y (2004) No Unbiased Estimator of the Variance of K-Fold Cross-Validation. J Mach Learn Res 5: 1089–1105.
  14. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inform Theory 13(1): 21–27.
  15. Fawcett T (2004) ROC Graphs: Notes and Practical Considerations for Researchers. HP Laboratories, Palo Alto, CA.
  16. Winarko E. Data Mining. Available: http://ewinarko.staff.ugm.ac.id/blog/?p=17. Accessed 2012 Jan 13.
  17. Guyon I, Li J, Mader T, Pletscher PA, Schneider G, et al. (2007) Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark. Pattern Recogn Lett 28: 1438–1444.
  18. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2): 203–209.
  19. West M, Blanchette C, Dressman H, Huang E, Ishida S, et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc National Academy of Sciences 98: 11462–11467.
  20. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439): 531.
  21. Gorman RP, Sejnowski TJ (1988) Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets. Neural Networks 1: 75–89.
  22. Hoshida Y, Brunet JP, Tamayo P, Golub TR, Mesirov JP (2007) Subclass Mapping: Identifying Common Subtypes in Independent Disease Data Sets. PLoS ONE 2(11): e1195.
  23. Notterman DA, Alon U, Sierk AJ, Levine AJ (2001) Transcriptional Gene Expression Profiles of Colorectal Adenoma, Adenocarcinoma, and Normal Tissue Examined by Oligonucleotide Arrays. Cancer Res 61: 3124.