l0-norm based structural sparse least square regression for feature selection
Introduction
Feature selection, which has been widely identified as a key component in the areas of machine learning (ML) and pattern recognition (PR), primarily addresses the problem of finding the most relevant and informative subset of features according to a certain evaluation criterion [1]. It plays an important role in many aspects, including compressing and understanding huge data, boosting the model generalization capability, improving the prediction accuracy, as well as accelerating the learning process.
From the perspective of evaluation strategy, feature selection methods mainly fall into two categories: wrappers [2] and filters [3]. Wrapper approaches require a built-in classifier to evaluate the candidate feature subset, while filter approaches evaluate the correlation between features based on a criterion indicative of the ability to separate the classes. In this paper, we focus on the latter.
Least square regression (LSR) and several variants of LSR [4], [5] have been widely applied to a number of ML and PR problems [6], [7]. As an important branch of LSR, structural sparse LSR (SSLSR), which adds a structural sparse regularization term or constraint to the classical LSR, has received increasing attention recently. Formally, SSLSR can be represented as the following constrained optimization problem:

$$\min_{\mathbf{W}} \; \|\mathbf{X}\mathbf{W} - \mathbf{Y}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{W}\|_{2,0} \le p, \qquad (1)$$

where $\mathbf{X} \in \mathbb{R}^{n \times m}$ is the data matrix with n samples and m features, and $\mathbf{Y} \in \{0,1\}^{n \times c}$ is the class label indicator matrix with n samples and c classes ($Y_{ij} = 1$ if the i-th sample belongs to the j-th class; otherwise $Y_{ij} = 0$). Then $\mathbf{W} \in \mathbb{R}^{m \times c}$ indicates the correlation between features and classes. The i-th feature is selected if $\|\mathbf{w}^i\|_2 \neq 0$, and $\|\mathbf{w}^i\|_2 = 0$ means that the i-th feature is removed. The constraint $\|\mathbf{W}\|_{2,0} \le p$ guarantees that the number of nonzero rows of $\mathbf{W}$, i.e., the number of selected features, is no more than p. In other words, the structural sparse constraint based on the $\ell_{2,0}$-norm controls the sparsity in an explicit way, serving the purpose of feature selection.
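As a minimal numerical sketch of the quantities in this formulation (not the paper's code; the function names and NumPy implementation are ours), the indicator matrix, the square loss, and the row-sparsity count can be written as:

```python
import numpy as np

def one_hot_labels(y, c):
    """Build the n x c indicator matrix Y (Y_ij = 1 iff sample i is in class j)."""
    Y = np.zeros((len(y), c))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def sslsr_objective(X, W, Y):
    """Squared Frobenius-norm loss ||XW - Y||_F^2 of the regression."""
    return float(np.linalg.norm(X @ W - Y, ord="fro") ** 2)

def num_selected(W, tol=1e-12):
    """Number of nonzero rows of W, i.e. the quantity the constraint bounds by p."""
    return int(np.sum(np.linalg.norm(W, axis=1) > tol))
```

A row of W that is entirely zero means the corresponding feature never contributes to any class score, which is why counting nonzero rows equals counting selected features.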
Recent developments in handling the non-convex l0-norm [8] regularization lie in two major directions. Methods in the first direction replace it with a smoothed convex approximation, such as l1-norm regularization [6], [9], [10]. Since the conditions [8] that ensure the l1-norm is an equivalent relaxation of the l0-norm do not always hold, the relaxed term may fail to deliver the desired feature selection property, leading to prediction accuracy loss by shrinking both relevant and irrelevant features toward zero. Methods in the second direction attempt to cope with the l0-norm directly by introducing auxiliary augmented functions [11], [12]. Nevertheless, the additional terms introduce extra variables and parameters, which may hurt the generalization capability of these algorithms. Besides, they often force some nonzero rows of the auxiliary matrices to be reset, which may lead to error accumulation.
To remedy the drawbacks mentioned above, we develop an effective greedy algorithm that directly handles the challenging l0-norm constraint in Eq. (1). Our algorithm starts with an empty feature set and alternates between forward and backward greedy steps during optimization. In the forward steps, we pick the informative features that most reduce the current square loss among the remaining features. Meanwhile, redundant features that contribute little to minimizing the objective function are removed in the backward steps. To prevent the backward steps from erasing the gains made in the forward steps, we combine the two steps adaptively. As a result, the number of selected features is naturally constrained to be no more than p. Furthermore, we provide a solid theoretical analysis of the effectiveness of the proposed method. Experimental results on synthetic and diverse real-world data sets also demonstrate its superiority over state-of-the-art methods.
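The forward-backward alternation described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' exact procedure: the pruning rule (drop a feature only if the loss increase stays below a fraction `eps` of the last forward gain) is one common way to keep backward steps from undoing forward progress.

```python
import numpy as np

def residual(X, Y, S):
    """Least-squares loss ||X_S W - Y||_F^2 when regressing Y on feature subset S."""
    if not S:
        return float(np.linalg.norm(Y, "fro") ** 2)
    Xs = X[:, sorted(S)]
    W, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
    return float(np.linalg.norm(Xs @ W - Y, "fro") ** 2)

def greedy_forward_backward(X, Y, p, eps=0.5):
    """Select at most p features by alternating forward and backward greedy steps."""
    S, loss = set(), residual(X, Y, set())
    while len(S) < p:
        # Forward step: add the feature that reduces the loss the most.
        gains = {j: loss - residual(X, Y, S | {j})
                 for j in range(X.shape[1]) if j not in S}
        j_best = max(gains, key=gains.get)
        if gains[j_best] <= 1e-10:
            break  # no remaining feature helps
        S.add(j_best)
        loss -= gains[j_best]
        # Backward step: drop features whose removal barely hurts the loss.
        while len(S) > 1:
            incs = {j: residual(X, Y, S - {j}) - loss for j in S}
            j_min = min(incs, key=incs.get)
            if incs[j_min] >= eps * gains[j_best]:
                break  # every kept feature still earns its place
            S.remove(j_min)
            loss += incs[j_min]
    return sorted(S)
```

On noise-free data whose targets depend on only a few columns of X, this sketch recovers exactly those columns; the adaptive threshold is what prevents the backward pass from cycling with the forward pass.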
Our main contributions are highlighted as follows:
- We develop a novel greedy algorithm to solve the SSLSR problem with the l0-norm constraint by combining forward and backward steps adaptively.
- We offer theoretical guarantees, as well as experimental results from diverse domains, to support our method.
Throughout this paper, scalars, matrices, vectors, sets and functions are denoted by lowercase, boldface capital, boldface lowercase, fraktur capital and script capital letters, respectively. $w_i$, $M_{ij}$, $\mathbf{m}^i$ and $\mathbf{m}_j$ indicate the i-th element of $\mathbf{w}$, the element of $\mathbf{M}$ in the i-th row and j-th column, the i-th row of $\mathbf{M}$ and the j-th column of $\mathbf{M}$, respectively. Moreover, for $p \ge 0$, we define the $\ell_{2,p}$-norm of any $\mathbf{M} \in \mathbb{R}^{m \times c}$ as

$$\|\mathbf{M}\|_{2,p}^p = \sum_{i=1}^{m} \|\mathbf{m}^i\|_2^p,$$

where $a^0 = 1$ if the scalar a is nonzero and $a^0 = 0$ otherwise, so that $\|\mathbf{M}\|_{2,0}$ counts the nonzero rows of $\mathbf{M}$.
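The row-wise norm just defined can be computed directly; a minimal sketch in NumPy (function names are ours), covering both the p > 0 case and the p = 0 counting case:

```python
import numpy as np

def l2p_norm_p(M, p):
    """The quantity ||M||_{2,p}^p = sum_i ||m^i||_2^p over the rows m^i of M, p > 0."""
    row_norms = np.linalg.norm(M, axis=1)
    return float(np.sum(row_norms ** p))

def l20_norm(M, tol=1e-12):
    """The p = 0 case: a^0 is taken as 1 iff a != 0, so this counts nonzero rows."""
    return int(np.sum(np.linalg.norm(M, axis=1) > tol))
```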
Related works
Great efforts have been made to address the SSLSR problem over the past decades; existing approaches mainly fall into two strategies.
Due to the difficulty of handling the l0-norm directly, a number of studies have tackled the SSLSR problem approximately by substituting the convex l2,1-norm. Discriminative LSR (DLSR) [6] achieved structural sparsity with the help of l2,1-norm regularization. Besides, l2,1-norm regularization was also employed in Joint Embedding Learning and Sparse Regression (JELSR) [13],
Main results
In this section, we first present a simple yet effective greedy algorithm to solve Eq. (1). Then, we analyze the correctness and the time complexity of our algorithm.
Experimental setup
We refer to our method as SSLSR and compare it with other feature selection methods in terms of classification accuracy. The experimental setup is given in this section.
Experimental results on synthetic data
We first apply our method to the synthetic data set, with the number of selected features ranging from 1 to 20 in steps of 1. As shown in Fig. 1(a)–(d), it is easy to distinguish the five classes when data points are represented by the selected features, which is consistent with the properties of the features in the synthetic data set. From Fig. 1(e), we observe that several redundant features are removed during the selection process of SSLSR. The accuracy curve in Fig. 1(f) explicitly demonstrates that
Conclusion
In this paper, we develop a novel greedy algorithm to solve the structural sparse least square regression problem with the challenging l0-norm constraint and apply it to informative feature selection. By combining the forward and backward steps adaptively, we pick relevant features and remove redundant ones during each iteration. To demonstrate the effectiveness of the proposed method, we provide a solid theoretical analysis and conduct classification experiments.
Conflict of interest
None declared.
Acknowledgements
The authors thank the anonymous reviewers for their valuable comments and thank the editors for their fruitful work. This work is supported by the National Natural Science Foundation of China (No. 61303179) and the Hundred Talents Program (Chinese Academy of Sciences, No. Y3S4011D31).
Jiuqi Han received his B.E. degree from Beijing University of Chemical Technology in 2010 and his M.E. degree from University of Chinese Academy of Sciences in 2013. He is currently pursuing the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, China. His research interests include machine learning, pattern recognition, etc.
References (36)
- et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
- et al., Joint Laplacian feature weights learning, Pattern Recognit. (2014)
- et al., On similarity preserving feature selection, IEEE Trans. Knowl. Data Eng. (2013)
- X. Cai, F. Nie, H. Huang, C. Ding, Multi-class l2,1-norm support vector machine, in: 2011 IEEE 11th International...
- X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: NIPS, pp....
- T. Strutz, Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond,...
- et al., The collinearity problem in linear regression, the partial least squares (PLS) approach to generalized inverses, SIAM J. Stat. Comput. (1984)
- et al., Discriminative least squares regression for multiclass classification and feature selection, IEEE Trans. Neural Netw. Learn. Syst. (2012)
- et al., Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. (2013)
- et al., Decoding by linear programming, IEEE Trans. Inf. Theory (2005)
- A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci.
- Model selection and estimation in regression with grouped variables, J. R. Stat. Soc., Ser. B
- A fast approach for overcomplete sparse decomposition based on smoothed l0 norm, IEEE Trans. Signal Process.
Zhengya Sun received her B.E. degree in 2005 and M.E. degree in 2007 from Tianjin University, and her Ph.D degree from University of Chinese Academy of Sciences in 2011. She is currently an associate professor at Institute of Automation, Chinese Academy of Sciences. Her research interests include machine learning, pattern recognition, etc.
Hongwei Hao received his B.E. degree in 1987, M.E. degree in 1992 and his Ph.D. degree from University of Chinese Academy of Sciences, Beijing, China in 1996. He is currently a professor at Institute of Automation, Chinese Academy of Sciences. His research interest covers image processing, natural language processing and pattern recognition, etc.
1 Tel.: +86 10 82544483.