
Pattern Recognition

Volume 48, Issue 12, December 2015, Pages 3927-3940

l0-norm based structural sparse least square regression for feature selection

https://doi.org/10.1016/j.patcog.2015.06.003

Highlights

  • We impose an l0-norm inequality constraint to build the structural sparse LSR problem.

  • We develop an adaptive algorithm that enforces the structural sparsity exactly.

  • Theoretical results on the efficiency and effectiveness of our method are provided.

  • Experimental results demonstrate the superiority of our method over state-of-the-art alternatives.

Abstract

This paper presents a novel approach to feature selection based on the problem of structural sparse least square regression (SSLSR). Rather than employing l1-norm regularization to control the sparsity, we work directly with sparse solutions via the l0-norm. In particular, we develop an effective greedy algorithm, in which forward and backward steps are combined adaptively, to solve the SSLSR problem with the intractable lr,0-norm. On the one hand, features with the strongest correlation to the classes are selected in the forward steps. On the other hand, redundant features that contribute little to the improvement of the objective function are removed in the backward steps. Furthermore, we provide solid theoretical analysis to prove the effectiveness of the proposed method. Experimental results on synthetic and real-world data sets from different domains also demonstrate the superiority of the proposed method over state-of-the-art methods.

Introduction

Feature selection, which has been widely identified as a key component in the areas of machine learning (ML) and pattern recognition (PR), primarily addresses the problem of finding the most relevant and informative subset of features according to a certain evaluation criterion [1]. It plays an important role in many respects, including compressing and understanding huge data sets, boosting model generalization capability, improving prediction accuracy, and accelerating the learning process.

From the perspective of evaluation strategy, feature selection methods mainly fall into two categories: wrappers [2] and filters [3]. Wrapper approaches require a built-in classifier to evaluate the candidate feature subset, while filter approaches evaluate the correlation between features based on a criterion indicative of the ability to separate the classes. In this paper, we focus on the latter.

Least square regression (LSR) and several variants of LSR [4], [5] have been widely applied in solving a number of ML and PR problems [6], [7]. As an important branch of LSR, structural sparse LSR (SSLSR), which adds a structural sparse regularization term or constraint to the classical LSR, has received increasing attention recently. Formally, SSLSR can be represented as the following constrained optimization problem:

$$\min_{W} \mathcal{L}(W) := \min_{W} \frac{1}{n}\|XW - Y\|_2^2, \quad \text{s.t.} \quad \|W\|_{r,0} \le p \qquad (1)$$

where $X \in \mathbb{R}^{n \times m}$ is the data matrix with n samples and m features, and $Y \in \{0,1\}^{n \times c}$ is the class label indicator matrix with n samples and c classes ($Y_{ij} = 1$ if the i-th sample belongs to the j-th class, otherwise $Y_{ij} = 0$). Then $W \in \mathbb{R}^{m \times c}$ indicates the correlation between features and classes. The i-th feature is selected if $W(i,:) \ne 0$, and $W(i,:) = 0$ means that the i-th feature is removed. The constraint $\|W\|_{r,0} \le p$ guarantees that the number of nonzero rows of W, i.e., the number of selected features, is no more than p. In other words, the structural sparse constraint based on the lr,0-norm controls the sparsity explicitly, serving the purpose of feature selection.
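To make the formulation concrete, the following NumPy sketch (ours, not the authors' code; the toy dimensions and variable names are illustrative) evaluates the least square loss and checks the row-sparsity constraint for r = 2:

```python
import numpy as np

def sslsr_objective(X, W, Y):
    """Least square loss L(W) = (1/n) * ||XW - Y||^2 (squared Frobenius norm)."""
    return np.sum((X @ W - Y) ** 2) / X.shape[0]

def row_sparsity(W, r=2):
    """l_{r,0}-norm of W: the number of rows with nonzero l_r-norm, i.e. selected features."""
    return int(np.sum(np.linalg.norm(W, ord=r, axis=1) > 0))

# Toy example: n = 6 samples, m = 4 features, c = 2 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
Y = np.eye(2)[rng.integers(0, 2, size=6)]    # one-hot class label indicator matrix
W = np.zeros((4, 2))
W[[0, 2], :] = rng.standard_normal((2, 2))   # only features 0 and 2 are selected

p = 2
print(sslsr_objective(X, W, Y))              # value of the objective at W
print(row_sparsity(W, r=2) <= p)             # True: the constraint ||W||_{2,0} <= p holds
```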

Recent developments in handling the non-convex l0-norm [8] regularization lie in the following two directions. Methods in the first direction adopt a smoothed convex approximation such as l1-norm regularization [6], [9], [10]. Since the conditions [8] that ensure the l1-norm is an equivalent relaxation of the l0-norm do not always hold, the relaxed constraint $\|W\|_{r,1} \le p$ may fail to deliver the desired feature selection property, leading to a loss in prediction accuracy because both relevant and irrelevant features are shrunk toward zero. The second direction includes methods that attempt to cope with the l0-norm directly by introducing auxiliary augmented functions [11], [12]. Nevertheless, the additional terms mean extra variables and parameters, which may negatively affect the generalization capability of these algorithms. Besides, they often force some nonzero rows of the auxiliary matrices to be reset, which may lead to error accumulation.
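As a one-dimensional illustration of this shrinkage effect (our example, not taken from the paper): soft thresholding, the proximal operator of the l1 penalty, pulls every coefficient toward zero by the regularization level, large and small alike, whereas hard selection under an l0 budget keeps the largest coefficients untouched.

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrinks every coefficient toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def keep_top_k(w, k):
    """Hard selection under an l0 budget: keep the k largest-magnitude coefficients unchanged."""
    out = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:k]
    out[idx] = w[idx]
    return out

w = np.array([5.0, 4.0, 0.3, 0.2])   # two relevant and two irrelevant coefficients
print(soft_threshold(w, 0.5))        # [4.5, 3.5, 0.0, 0.0] -- relevant ones are also shrunk
print(keep_top_k(w, 2))              # [5.0, 4.0, 0.0, 0.0] -- relevant ones are untouched
```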

To remedy the drawbacks mentioned above, we develop an effective greedy algorithm to directly handle the challenging lr,0-norm in Eq. (1). Our proposed algorithm starts with an empty feature set and alternates between forward and backward greedy steps during optimization. In the forward steps, we pick informative features which reduce the current square loss the most among the remaining features. Meanwhile, redundant features with little contribution to minimizing the objective function are removed in the backward steps. To prevent the backward steps from erasing the gains made in the forward steps, we combine these two steps adaptively; as a result, the number of selected features is naturally kept within the budget p. Furthermore, we provide solid theoretical analysis to show the effectiveness of our proposed method. Experimental results on synthetic and diverse real-world data sets also demonstrate its superiority over state-of-the-art methods.
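The following NumPy sketch illustrates such an adaptive forward-backward greedy scheme. It is our simplified rendering: the least square refit on the selected subset, the backward acceptance rule with factor 0.5, and all function names are assumptions made for illustration, and the sketch does not reproduce the authors' exact algorithm or its theoretical guarantees.

```python
import numpy as np

def refit(X, Y, S):
    """Refit W restricted to the feature subset S by ordinary least squares."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    if S:
        idx = sorted(S)
        W[idx, :], *_ = np.linalg.lstsq(X[:, idx], Y, rcond=None)
    return W

def loss(X, Y, W):
    """L(W) = (1/n) * ||XW - Y||^2 (squared Frobenius norm)."""
    return np.sum((X @ W - Y) ** 2) / X.shape[0]

def greedy_sslsr(X, Y, p, back_factor=0.5):
    """Adaptive forward-backward greedy selection of at most p features (illustrative sketch)."""
    S, gains = set(), []                      # current feature set and forward gains
    W = np.zeros((X.shape[1], Y.shape[1]))
    while len(S) < p:
        # Forward step: add the remaining feature that decreases the loss the most.
        base = loss(X, Y, W)
        best_j, best_loss = None, base
        for j in set(range(X.shape[1])) - S:
            lj = loss(X, Y, refit(X, Y, S | {j}))
            if lj < best_loss:
                best_j, best_loss = j, lj
        if best_j is None:
            break                             # no remaining feature improves the loss
        S.add(best_j)
        gains.append(base - best_loss)
        W = refit(X, Y, S)
        # Backward step: drop features whose removal costs little relative to the last gain.
        while len(S) > 1:
            j_min = min(S, key=lambda j: loss(X, Y, refit(X, Y, S - {j})))
            increase = loss(X, Y, refit(X, Y, S - {j_min})) - loss(X, Y, W)
            if increase >= back_factor * gains[-1]:
                break                         # removal would erase the forward progress
            S.remove(j_min)
            W = refit(X, Y, S)
    return W, S
```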

Our main contributions are highlighted as follows:

  • We develop a novel greedy algorithm to solve the SSLSR problem with the lr,0-norm by combining forward and backward steps adaptively.

  • We offer theoretical guarantees, as well as experimental results from diverse domains, to support our method.

Throughout this paper, scalars, matrices, vectors, sets and functions are denoted by lowercase, boldface capital, boldface lowercase, fraktur capital and script capital letters, respectively. $a_i$, $A_{ij}$, $A(i,:)$ and $A(:,j)$ denote the i-th element of $a$, the element of $A$ in the i-th row and j-th column, the i-th row of $A$, and the j-th column of $A$, respectively. Moreover, for $r \ge 0$, we define the norms of any $a \in \mathbb{R}^{s \times 1}$ and $A \in \mathbb{R}^{s \times t}$ as

$$\|a\|_r = \begin{cases} \left(\sum_{i=1}^{s} |a_i|^r\right)^{1/r} & \text{if } r > 0 \\ \sum_{i=1}^{s} \|a_i\|_0 & \text{if } r = 0 \end{cases}, \qquad \|A\|_r = \left(\sum_{i=1}^{s}\sum_{j=1}^{t} |A_{ij}|^r\right)^{1/r} \quad \text{if } r > 0,$$

$$\|A\|_{r,0} = \sum_{i=1}^{s} \big\| \|A(i,:)\|_r \big\|_0, \qquad \|A\|_{r,1} = \sum_{i=1}^{s} \left(\sum_{j=1}^{t} |A_{ij}|^r\right)^{1/r} \quad \text{if } r > 0,$$

where $\|a\|_0 = 1$ if the scalar $a$ is nonzero, and 0 otherwise.
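For reference, these norm definitions translate directly into code; the following NumPy helpers (our illustrative implementation, not part of the paper) compute $\|a\|_r$, $\|A\|_r$, $\|A\|_{r,0}$ and $\|A\|_{r,1}$ as defined above.

```python
import numpy as np

def lr_norm(a, r):
    """l_r-norm of a vector: (sum |a_i|^r)^(1/r) for r > 0, or the count of nonzeros for r = 0."""
    a = np.asarray(a, dtype=float)
    if r == 0:
        return float(np.count_nonzero(a))
    return float(np.sum(np.abs(a) ** r) ** (1.0 / r))

def lr_norm_matrix(A, r):
    """Entry-wise l_r-norm of a matrix for r > 0."""
    return float(np.sum(np.abs(A) ** r) ** (1.0 / r))

def lr0_norm(A, r):
    """l_{r,0}-norm: number of rows of A whose l_r-norm is nonzero."""
    return float(sum(lr_norm(A[i, :], r) != 0 for i in range(A.shape[0])))

def lr1_norm(A, r):
    """l_{r,1}-norm: sum of the l_r-norms of the rows of A (r > 0)."""
    return float(sum(lr_norm(A[i, :], r) for i in range(A.shape[0])))

A = np.array([[1.0, -2.0], [0.0, 0.0], [3.0, 0.0]])
print(lr0_norm(A, 2))   # 2.0: two nonzero rows, i.e. two selected features
print(lr1_norm(A, 2))   # sqrt(5) + 0 + 3
```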


Related works

Great efforts have been made to address the SSLSR problem in the past decades, and they mainly fall into two strategies.

Due to the difficulty of handling the lr,0-norm directly, a number of studies have tried to tackle the SSLSR problem approximately through the lr,1-norm. Discriminative LSR (DLSR) [6] achieved structural sparsity with the help of l2,1-norm regularization. Besides, l2,1-norm regularization was also employed in Joint Embedding Learning and Sparse Regression (JELSR) [13],

Main results

In this section, we first present a simple yet effective greedy algorithm to solve Eq. (1). Then, we analyze the correctness and the time complexity of our algorithm.

Experimental setup

We refer to our method as SSLSR and compare it with other feature selection methods in terms of classification accuracy. The experimental setup is given in this section.

Experimental results on synthetic data

We first apply our method to the synthetic data set, with the number of selected features ranging from 1 to 20 in steps of 1. As shown in Fig. 1(a)–(d), it is easy to distinguish the five classes when the data points are represented by 5–10 selected features, which is consistent with the properties of the features in the synthetic data set. From Fig. 1(e), we can observe that several redundant features are removed in the selection process of SSLSR. The accuracy curve in Fig. 1(f) explicitly demonstrates that

Conclusion

In this paper, we develop a novel greedy algorithm to solve the structural sparse least square regression problem with the challenging lr,0-norm and further apply it to informative feature selection. By combining the forward and backward steps adaptively, we can pick relevant features and remove redundant ones during each iteration. To demonstrate the effectiveness of the proposed method, we provide solid theoretical analysis and conduct classification experiments.

Conflict of interest

None declared.

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments and thank the editors for their fruitful work. This work is supported by the National Natural Science Foundation of China (No. 61303179) and the Hundred Talents Program (Chinese Academy of Sciences, No. Y3S4011D31).


References (36)

  • D. Singh et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
  • H. Yan et al., Joint Laplacian feature weights learning, Pattern Recognit. (2014)
  • Z. Zhao et al., On similarity preserving feature selection, IEEE Trans. Knowl. Data Eng. (2013)
  • X. Cai, F. Nie, H. Huang, C. Ding, Multi-class l2,1-norm support vector machine, in: 2011 IEEE 11th International...
  • X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: NIPS, pp....
  • T. Strutz, Data Fitting and Uncertainty: A Practical Introduction to Weighted Least Squares and Beyond,...
  • S. Wold et al., The collinearity problem in linear regression, the partial least squares (PLS) approach to generalized inverses, SIAM J. Sci. Stat. Comput. (1984)
  • S. Xiang et al., Discriminative least squares regression for multiclass classification and feature selection, IEEE Trans. Neural Netw. Learn. Syst. (2012)
  • Z. Li et al., Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Trans. Knowl. Data Eng. (2013)
  • E.J. Candes et al., Decoding by linear programming, IEEE Trans. Inf. Theory (2005)
  • A. Beck et al., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. (2009)
  • M. Yuan et al., Model selection and estimation in regression with grouped variables, J. R. Stat. Soc., Ser. B (2006)
  • X. Cai, F. Nie, H. Huang, Exact top-k feature selection via l2,0-norm constraint, in: Proceedings of the 23rd...
  • D. Luo, C.H.Q. Ding, H. Huang, Towards structural sparsity: an explicit l2/l0 approach, in: The 10th IEEE...
  • C. Hou, F. Nie, D. Yi, Y. Wu, Feature selection via joint embedding learning and sparse regression, in: Proceedings of...
  • Z. Li, Y. Yang, J. Liu, X. Zhou, H. Lu, Unsupervised feature selection using nonnegative spectral analysis, in: AAAI,...
  • M. Qian, C. Zhai, Robust unsupervised feature selection, in: Proceedings of the Twenty-Third International Joint...
  • H. Mohimani et al., A fast approach for overcomplete sparse decomposition based on smoothed l0 norm, IEEE Trans. Signal Process. (2009)

Jiuqi Han received his B.E. degree from Beijing University of Chemical Technology in 2010 and his M.E. degree from University of Chinese Academy of Sciences in 2013. He is currently pursuing the Ph.D. degree at the Institute of Automation, Chinese Academy of Sciences, China. His research interests include machine learning, pattern recognition, etc.

Zhengya Sun received her B.E. degree in 2005 and M.E. degree in 2007 from Tianjin University, and her Ph.D. degree from University of Chinese Academy of Sciences in 2011. She is currently an associate professor at the Institute of Automation, Chinese Academy of Sciences. Her research interests include machine learning, pattern recognition, etc.

Hongwei Hao received his B.E. degree in 1987, M.E. degree in 1992 and his Ph.D. degree from University of Chinese Academy of Sciences, Beijing, China in 1996. He is currently a professor at the Institute of Automation, Chinese Academy of Sciences. His research interests cover image processing, natural language processing, pattern recognition, etc.
