l0-norm based structural sparse least square regression for feature selection
Introduction
Feature selection, which has been widely identified as a key component in the areas of machine learning (ML) and pattern recognition (PR), primarily addresses the problem of finding the most relevant and informative subset of features according to a certain evaluation criterion [1]. It plays an important role in many aspects, including compressing and understanding huge data, boosting the model generalization capability, improving the prediction accuracy, as well as accelerating the learning process.
From the perspective of evaluation strategy, feature selection methods mainly fall into two categories: wrappers [2] and filters [3]. Wrapper approaches require a built-in classifier to evaluate the candidate feature subset, while filter approaches evaluate the correlation between features based on a criterion indicative of the ability to separate the classes. In this paper, we focus on the latter.
Least square regression (LSR) and several variants of LSR [4], [5] have been widely applied to a number of ML and PR problems [6], [7]. As an important branch of LSR, structural sparse LSR (SSLSR), which adds a structural sparse regularization term or constraint to the classical LSR, has received increasing attention recently. Formally, SSLSR can be represented as the following constrained optimization problem:

$$\min_{\mathbf{W}} \; \|\mathbf{X}\mathbf{W} - \mathbf{Y}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{W}\|_{2,0} \le p, \qquad (1)$$

where $\mathbf{X} \in \mathbb{R}^{n \times m}$ is the data matrix with n samples and m features, and $\mathbf{Y} \in \{0,1\}^{n \times c}$ is the class label indicator matrix with n samples and c classes ($Y_{ij} = 1$ if the i-th sample belongs to the j-th class; otherwise $Y_{ij} = 0$). Then $\mathbf{W} \in \mathbb{R}^{m \times c}$ indicates the correlation between features and classes. The i-th feature is selected if $\|\mathbf{w}^i\|_2 \neq 0$, and $\|\mathbf{w}^i\|_2 = 0$ means that the i-th feature is removed. The constraint $\|\mathbf{W}\|_{2,0} \le p$ guarantees that the number of nonzero rows of $\mathbf{W}$, i.e., the number of selected features, is no more than p. In other words, the structural sparse constraint based on the $\ell_{2,0}$-norm controls the sparsity in an explicit way, serving the purpose of feature selection.
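As a minimal numerical sketch of the quantities in this formulation (not the paper's code; the function names and NumPy implementation are ours), the indicator matrix, the square loss, and the row-sparsity count can be written as:

```python
import numpy as np

def one_hot_labels(y, c):
    """Build the n x c indicator matrix Y (Y_ij = 1 iff sample i is in class j)."""
    Y = np.zeros((len(y), c))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def sslsr_objective(X, W, Y):
    """Squared Frobenius-norm loss ||XW - Y||_F^2 of the regression."""
    return float(np.linalg.norm(X @ W - Y, ord="fro") ** 2)

def num_selected(W, tol=1e-12):
    """Number of nonzero rows of W, i.e. the quantity the constraint bounds by p."""
    return int(np.sum(np.linalg.norm(W, axis=1) > tol))
```

A row of W that is entirely zero means the corresponding feature never contributes to any class score, which is why counting nonzero rows equals counting selected features.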
Recent developments in handling the non-convex l0-norm [8] regularization lie in two major directions. Methods in the first direction replace it with a smoothed convex approximation, such as l1-norm regularization [6], [9], [10]. Since the conditions [8] that ensure the l1-norm is an equivalent relaxation of the l0-norm do not always hold, the relaxed term may fail to deliver the desired feature selection property, leading to prediction accuracy loss by shrinking both relevant and irrelevant features toward zero. Methods in the second direction attempt to cope with the l0-norm directly by introducing auxiliary augmented functions [11], [12]. Nevertheless, the additional terms introduce extra variables and parameters, which may hurt the generalization capability of these algorithms. Besides, they often force some nonzero rows of the auxiliary matrices to be reset, which may lead to error accumulation.
To remedy the drawbacks mentioned above, we develop an effective greedy algorithm that directly handles the challenging l0-norm constraint in Eq. (1). Our algorithm starts with an empty feature set and alternates between forward and backward greedy steps during optimization. In the forward steps, we pick the informative features that most reduce the current square loss among the remaining features. Meanwhile, redundant features that contribute little to minimizing the objective function are removed in the backward steps. To prevent the backward steps from erasing the gains made in the forward steps, we combine the two steps adaptively. As a result, the number of selected features is naturally constrained to be no more than p. Furthermore, we provide a solid theoretical analysis of the effectiveness of the proposed method. Experimental results on synthetic and diverse real-world data sets also demonstrate its superiority over state-of-the-art methods.
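The forward-backward alternation described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' exact procedure: the pruning rule (drop a feature only if the loss increase stays below a fraction `eps` of the last forward gain) is one common way to keep backward steps from undoing forward progress.

```python
import numpy as np

def residual(X, Y, S):
    """Least-squares loss ||X_S W - Y||_F^2 when regressing Y on feature subset S."""
    if not S:
        return float(np.linalg.norm(Y, "fro") ** 2)
    Xs = X[:, sorted(S)]
    W, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
    return float(np.linalg.norm(Xs @ W - Y, "fro") ** 2)

def greedy_forward_backward(X, Y, p, eps=0.5):
    """Select at most p features by alternating forward and backward greedy steps."""
    S, loss = set(), residual(X, Y, set())
    while len(S) < p:
        # Forward step: add the feature that reduces the loss the most.
        gains = {j: loss - residual(X, Y, S | {j})
                 for j in range(X.shape[1]) if j not in S}
        j_best = max(gains, key=gains.get)
        if gains[j_best] <= 1e-10:
            break  # no remaining feature helps
        S.add(j_best)
        loss -= gains[j_best]
        # Backward step: drop features whose removal barely hurts the loss.
        while len(S) > 1:
            incs = {j: residual(X, Y, S - {j}) - loss for j in S}
            j_min = min(incs, key=incs.get)
            if incs[j_min] >= eps * gains[j_best]:
                break  # every kept feature still earns its place
            S.remove(j_min)
            loss += incs[j_min]
    return sorted(S)
```

On noise-free data whose targets depend on only a few columns of X, this sketch recovers exactly those columns; the adaptive threshold is what prevents the backward pass from cycling with the forward pass.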
Our main contributions are highlighted as follows:
- We develop a novel greedy algorithm to solve the SSLSR problem with the l0-norm constraint by combining forward and backward steps adaptively.
- We offer theoretical guarantees, as well as experimental results from diverse domains, to support our method.
Throughout this paper, scalars, matrices, vectors, sets and functions are denoted by lowercase, boldface capital, boldface lowercase, fraktur capital and script capital letters, respectively. $w_i$, $M_{ij}$, $\mathbf{m}^i$ and $\mathbf{m}_j$ indicate the i-th element of $\mathbf{w}$, the element of $\mathbf{M}$ in the i-th row and j-th column, the i-th row of $\mathbf{M}$ and the j-th column of $\mathbf{M}$, respectively. Moreover, for $p \ge 0$, we define the $\ell_{2,p}$-norm of any $\mathbf{M} \in \mathbb{R}^{m \times c}$ as

$$\|\mathbf{M}\|_{2,p}^p = \sum_{i=1}^{m} \|\mathbf{m}^i\|_2^p,$$

where $a^0 = 1$ if the scalar a is nonzero and $a^0 = 0$ otherwise, so that $\|\mathbf{M}\|_{2,0}$ counts the nonzero rows of $\mathbf{M}$.
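The row-wise norm just defined can be computed directly; a minimal sketch in NumPy (function names are ours), covering both the p > 0 case and the p = 0 counting case:

```python
import numpy as np

def l2p_norm_p(M, p):
    """The quantity ||M||_{2,p}^p = sum_i ||m^i||_2^p over the rows m^i of M, p > 0."""
    row_norms = np.linalg.norm(M, axis=1)
    return float(np.sum(row_norms ** p))

def l20_norm(M, tol=1e-12):
    """The p = 0 case: a^0 is taken as 1 iff a != 0, so this counts nonzero rows."""
    return int(np.sum(np.linalg.norm(M, axis=1) > tol))
```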
Related works
Great efforts have been made to address the SSLSR problem over the past decades; existing approaches mainly fall into two strategies.
Due to the difficulty of handling the l0-norm directly, a number of studies have tackled the SSLSR problem approximately by substituting the convex l2,1-norm. Discriminative LSR (DLSR) [6] achieved structural sparsity with the help of l2,1-norm regularization. Besides, l2,1-norm regularization was also employed in Joint Embedding Learning and Sparse Regression (JELSR) [13],
Main results
In this section, we first present a simple yet effective greedy algorithm to solve Eq. (1). Then, we analyze the correctness and the time complexity of our algorithm.
Experimental setup
We refer to our method as SSLSR and compare it with other feature selection methods in terms of classification accuracy. The experimental setup is given in this section.
Experimental results on synthetic data
We first apply our method to the synthetic data set, with the number of selected features ranging from 1 to 20 in steps of 1. As shown in Fig. 1(a)–(d), it is easy to distinguish the five classes when data points are represented by the selected features, which is consistent with the properties of the features in the synthetic data set. From Fig. 1(e), we observe that several redundant features are removed during the selection process of SSLSR. The accuracy curve in Fig. 1(f) explicitly demonstrates that
Conclusion
In this paper, we develop a novel greedy algorithm to solve the structural sparse least square regression problem with the challenging l0-norm constraint and apply it to informative feature selection. By combining the forward and backward steps adaptively, we pick relevant features and remove redundant ones during each iteration. To demonstrate the effectiveness of the proposed method, we provide a solid theoretical analysis and conduct classification experiments.
Conflict of interest
None declared.
Acknowledgements
The authors thank the anonymous reviewers for their valuable comments and thank the editors for their fruitful work. This work is supported by the National Natural Science Foundation of China (No. 61303179) and the Hundred Talents Program (Chinese Academy of Sciences, No. Y3S4011D31).
Jiuqi Han received his B.E. degree from Beijing University of Chemical Technology in 2010 and his M.E. degree from University of Chinese Academy of Sciences in 2013. He is currently pursuing the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, China. His research interests include machine learning, pattern recognition, etc.
References (36)
- et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
- et al., Joint Laplacian feature weights learning, Pattern Recognit. (2014)
- et al., On similarity preserving feature selection, IEEE Trans. Knowl. Data Eng. (2013)
- X. Cai, F. Nie, H. Huang, C. Ding, Multi-class l2,1-norm support vector machine, in: 2011 IEEE 11th International...
- X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: NIPS, pp....
- T. Strutz, Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond,...
- et al., The collinearity problem in linear regression, the partial least squares (PLS) approach to generalized inverses, SIAM J. Stat. Comput. (1984)
- et al., Discriminative least squares regression for multiclass classification and feature selection, IEEE Trans. Neural Netw. Learn. Syst. (2012)
- et al., Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. (2013)
- et al., Decoding by linear programming, IEEE Trans. Inf. Theory (2005)
- A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci.
- Model selection and estimation in regression with grouped variables, J. R. Stat. Soc., Ser. B
- A fast approach for overcomplete sparse decomposition based on smoothed l0 norm, IEEE Trans. Signal Process.
Zhengya Sun received her B.E. degree in 2005 and M.E. degree in 2007 from Tianjin University, and her Ph.D degree from University of Chinese Academy of Sciences in 2011. She is currently an associate professor at Institute of Automation, Chinese Academy of Sciences. Her research interests include machine learning, pattern recognition, etc.
Hongwei Hao received his B.E. degree in 1987, M.E. degree in 1992 and his Ph.D. degree from University of Chinese Academy of Sciences, Beijing, China in 1996. He is currently a professor at Institute of Automation, Chinese Academy of Sciences. His research interest covers image processing, natural language processing and pattern recognition, etc.
1 Tel.: +86 10 82544483.