
Pattern Recognition Letters

Volume 31, Issue 13, 1 October 2010, Pages 1959-1964

Improved Group Sparse Classifier

https://doi.org/10.1016/j.patrec.2010.06.014

Abstract

This work proposes a new classifier based on the assumption that the training samples of a particular class approximately form a linear basis for any new test sample belonging to that class. This is not a new assumption; two previous works, namely the Sparse Classifier (Yang et al., 2009) and the Group Sparse Classifier (Majumdar and Ward, 2009a), have been built upon it. However, both previous works suffer from certain shortcomings: the optimization problems they propose do not capture all the implications of the said assumption. This work accounts for all the intricacies of the assumption and consequently requires solving a new non-convex optimization problem. We develop an elegant approach to solving this problem based on the Iterative Reweighted Least Squares method.

Our classifier is very flexible; the previous classifiers are in fact two special cases of the proposed one. The proposed classifier has been rigorously compared against the previous classifiers stemming from the same assumption on benchmark classification databases. The results indicate that in most cases the proposed classifier is significantly better than the previous ones, and in the worst case it is as good as the previous ones.

Introduction

Recently there has been interest in a classification model where the training samples of a particular class are assumed to form a linear basis for approximately representing any new test sample belonging to that class. This assumption has led to the development of new classifiers such as the Sparse Classifier (SC) (Yang et al., 2009), the Group Sparse Classifier (GSC) (Majumdar and Ward, 2009a) and its fast version, the Fast Group Sparse Classifier (FGSC) (Majumdar and Ward, 2009b). Formally, this assumption can be expressed as

$$v_{k,\mathrm{test}} = \alpha_{k,1} v_{k,1} + \alpha_{k,2} v_{k,2} + \cdots + \alpha_{k,n_k} v_{k,n_k} + \varepsilon_k, \qquad (1)$$

where $v_{k,\mathrm{test}}$ is a test sample of the $k$th class, $v_{k,i}$ is the $i$th training sample of the $k$th class, $\alpha_{k,i}$ is the corresponding weight, and $\varepsilon_k$ is the approximation error.
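
To make (1) concrete, the following minimal sketch (Python with NumPy; the dimensions, noise level and synthetic data are illustrative assumptions, not taken from the paper) generates a test sample obeying the assumption for one class and recovers the weights by ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_k = 50, 8                                   # sample dimension, training samples in class k
    V_k = rng.standard_normal((d, n_k))              # columns are the training samples v_{k,i}
    alpha_true = rng.standard_normal(n_k)
    v_test = V_k @ alpha_true + 0.01 * rng.standard_normal(d)   # test sample obeying (1)

    # Least-squares estimate of the class-k weights; a small residual is what
    # the assumption predicts when the test sample really belongs to class k.
    alpha_hat, *_ = np.linalg.lstsq(V_k, v_test, rcond=None)
    print(np.linalg.norm(v_test - V_k @ alpha_hat))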

Geometrically, this assumption implies that each class comprises a few subspaces, and each subspace is spanned by some training vectors; a new test sample approximately lies on the union of these subspaces. To take a concrete example, consider the problem of face recognition. Faces appear in different poses: frontal, left profile, right profile, etc. Each pose can be assumed to lie on a subspace. If there are enough training samples, a few samples in each pose will be sufficient to capture the variability of that pose. Now if there is a new test sample with a pose somewhere between frontal and left profile, it may be expressed as a linear combination of the training samples in the frontal and left profile poses. The relative contribution of the training samples in each pose is decided by their weights (α).

All the aforementioned classifiers (SC, GSC and FGSC) determine the unknown weights ($\alpha_{k,i}$'s) by solving different optimization problems and use these weights to determine the class of the test sample. The first step for all of them is to express the assumption in terms of all the training samples instead of just one class as in (1). In compact matrix-vector notation it can be expressed as

$$v_{k,\mathrm{test}} = V\alpha + \varepsilon, \qquad (2)$$

where $V = [v_{1,1} \mid \cdots \mid v_{1,n_1} \mid \cdots \mid v_{k,1} \mid \cdots \mid v_{k,n_k} \mid \cdots \mid v_{C,1} \mid \cdots \mid v_{C,n_C}]$ and $\alpha = [\alpha_{1,1}, \ldots, \alpha_{1,n_1}, \ldots, \alpha_{k,1}, \ldots, \alpha_{k,n_k}, \ldots, \alpha_{C,1}, \ldots, \alpha_{C,n_C}]^T$.
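
As a small illustration of this bookkeeping (a sketch with an assumed column-wise layout, not code from the paper), $V$ can be assembled from per-class training matrices while recording which columns belong to which class; these labels are needed later to split $\alpha$ class-wise and compute per-class residuals:

    import numpy as np

    def stack_training_samples(class_matrices):
        """Concatenate per-class matrices (columns are samples) into V as in (2),
        and return a label vector telling which class each column of V came from."""
        V = np.hstack(class_matrices)
        labels = np.concatenate([np.full(M.shape[1], k)
                                 for k, M in enumerate(class_matrices)])
        return V, labels

With $V$ and the label vector in hand, any of the optimization problems discussed below yields a weight vector $\alpha$ that can be split per class to measure how well each class reconstructs the test sample.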

Given $v_{k,\mathrm{test}}$ and $V$, the problem is to determine the class of the test sample. To achieve this from the said assumption, it is required to solve for the unknown weight vector $\alpha$. This is done by solving an optimization problem of the form

$$\hat{\alpha} = \min_{\alpha} \varphi(\alpha) \quad \text{s.t.} \quad \|v_{k,\mathrm{test}} - V\alpha\|_2 \le \sigma. \qquad (3)$$

The objective function $\varphi(\alpha)$ encodes prior information about the nature of $\alpha$, and the constraint arises from the assumption that the misfit is Normally distributed.

The different classification methods vary in their choice of the objective function. In the SC (Yang et al., 2009) the weight vector $\alpha$ is assumed to be sparse, and the $l_1$-norm of the weight vector ($\|\alpha\|_1$) is therefore employed as the objective function. The choice of the $l_1$-norm as a sparsifying function can be traced back to studies in statistics (Tibshirani, 1996) and signal processing (Chen et al., 2001). However, the $l_1$-norm is not an appropriate choice in this case because, as pointed out by Zou and Hastie (2005), it selects only a single sample from a group of correlated training samples. For the face recognition example, this implies that if the test sample has a pose between frontal and left profile, the SC (employing the $l_1$-norm) will try to express it either as a frontal or as a left profile, but not as a combination of both. Clearly this is not desired.
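
As a rough sketch of the SC pipeline described above (not the authors' implementation: the constrained $l_1$ problem is replaced here by its Lagrangian form and solved with plain iterative soft-thresholding, and the parameter values are illustrative assumptions):

    import numpy as np

    def soft_threshold(x, t):
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def sparse_classifier(V, labels, v_test, lam=0.01, n_iter=500):
        """l1-regularised weights via ISTA, then assign the class whose training
        samples give the smallest reconstruction residual."""
        step = 1.0 / np.linalg.norm(V, 2) ** 2       # 1 / Lipschitz constant of the gradient
        alpha = np.zeros(V.shape[1])
        for _ in range(n_iter):
            grad = V.T @ (V @ alpha - v_test)
            alpha = soft_threshold(alpha - step * grad, step * lam)
        classes = np.unique(labels)
        residuals = [np.linalg.norm(v_test - V[:, labels == k] @ alpha[labels == k])
                     for k in classes]
        return classes[int(np.argmin(residuals))]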

To address this issue, the GSC was proposed by Majumdar and Ward (2009a). It employs a mixed norm as the objective function, given by $\|\alpha\|_{2,1} = \sum_k \|\alpha_k\|_2$. This norm ensures that all the training samples of a selected class are used, while the number of selected classes remains sparse (Huang and Zhang, 2008; Eldar et al., 2009). The GSC (and the FGSC, which is a fast implementation of the GSC) showed good results, but it is not perfect either, because the inner $l_2$-norm of the mixed norm selects all the training samples from a particular class. Referring again to face recognition: if a test image lies somewhere between the left profile and frontal views, this classifier will try to represent it as a combination of all the views (left, frontal and right). This is erroneous as well.
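
The all-or-nothing behaviour of the inner $l_2$-norm can be seen directly in the group soft-thresholding (proximal) step that GSC-type solvers rely on; the sketch below (an illustration, not the authors' FGSC algorithm) shrinks each class block as a whole, so a class is either kept with all its samples or dropped entirely:

    import numpy as np

    def mixed_norm_21(alpha, labels):
        """||alpha||_{2,1}: sum over classes of the l2-norm of that class's weights."""
        return sum(np.linalg.norm(alpha[labels == k]) for k in np.unique(labels))

    def group_soft_threshold(alpha, labels, t):
        """Proximal operator of t * ||.||_{2,1}; each class block shrinks jointly."""
        out = alpha.copy()
        for k in np.unique(labels):
            block = alpha[labels == k]
            nrm = np.linalg.norm(block)
            out[labels == k] = block * max(1.0 - t / nrm, 0.0) if nrm > 0 else 0.0
        return out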

The SC and the GSC are two extremes: the SC selects only one sample from the entire class (in the extreme case), while the GSC selects all of them. Ideally we would like the classifier to represent the test sample as a combination of only the relevant subspaces. For the face recognition problem, this means that if a test sample lies somewhere between the left and frontal views, it should be represented only by training samples corresponding to those views and not by training samples from the right profile view. Existing classifiers like the SC and the GSC cannot address this problem. This paper is dedicated to finding methods that resolve this shortcoming so that the classifier can operate between the two extremes.

The other issue tackled in this paper is the relaxation of the $l_2$-norm misfit in (3). There is no reason, physical or mathematical, to assume that the misfit is Normally distributed. In this work we therefore experiment with other norms for the misfit.

This paper makes two major contributions: (i) changing the objective function to address the aforementioned issues; and (ii) changing the norm of the misfit. Both lead to new optimization problems which are solved efficiently in this paper. The rest of the paper is organized as follows. The next section discusses the background of this work. The proposed classification model is presented in Section 3. Section 4 details the optimization problem behind the proposed classifier. The experimental results are described in Section 5, and the conclusions are discussed in Section 6.

Section snippets

Background

The assumption that a new test sample can be approximated as a linear combination of the training samples of its class was proposed by Yang et al. (2009) for the Sparse Classifier (SC). Following the same assumption, the Group Sparse Classifier (GSC) was later developed by Majumdar and Ward (2009a). The fast version of the GSC (FGSC) is not a different classifier, but employs fast algorithms to solve the same optimization problem. Therefore we will discuss the SC and the GSC in detail to understand

Proposed classifier

We have discussed the shortcomings of the SC and the GSC in terms of the objective function in the optimization problem (3), which encodes prior knowledge about the solution. The other issue common to both the SC and the GSC is the $l_2$-norm misfit. Nothing dictates that the error term in (2) is Gaussian; therefore the $l_2$-norm misfit may not be optimal, and it is worth exploring other $l_q$-norms for the misfit.

We propose to solve (2) via the following optimization problem,

$$\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p} \quad \text{such that} \quad \|v_{\mathrm{test}} - V\alpha\|_q \le \sigma.$$

Optimization algorithm

The optimization problem (8) is the same as the following problem, because the minimizer of $\|\cdot\|_t$ is the same as the minimizer of $\|\cdot\|_t^t$:

$$\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p}^p \quad \text{such that} \quad \|v_{\mathrm{test}} - V\alpha\|_q^q < \sigma.$$

In standard optimization terminology, the term $\|\alpha\|_{m,p}^p$ is the objective function and $\|v_{\mathrm{test}} - V\alpha\|_q^q < \sigma$ is the constraint. This constrained form has the following equivalent unconstrained formulation for a proper choice of the Lagrangian multiplier $\lambda$:

$$\hat{\alpha} = \min_{\alpha} \|\alpha\|_{m,p}^p + \lambda \|v_{\mathrm{test}} - V\alpha\|_q^q.$$

The constant $\lambda$ is related to $\sigma$. But the relationship between $\lambda$
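
Since the snippet above only outlines the strategy, here is a minimal Iterative Reweighted Least Squares sketch for the unconstrained formulation, under stated simplifications: an element-wise $l_p$ penalty stands in for the mixed $l_{m,p}$ norm (whose weights would instead be built from per-class group norms), and the parameter values are illustrative assumptions:

    import numpy as np

    def irls(V, v, p=0.8, q=2.0, lam=0.1, n_iter=50, eps=1e-6):
        """Sketch of IRLS for  min_a ||a||_p^p + lam * ||v - V a||_q^q.
        Each iteration replaces both terms by weighted quadratics and solves
        the resulting normal equations."""
        alpha = np.linalg.lstsq(V, v, rcond=None)[0]          # l2 initialisation
        for _ in range(n_iter):
            r = v - V @ alpha
            w_a = (alpha ** 2 + eps) ** (p / 2.0 - 1.0)       # penalty reweighting
            w_r = (r ** 2 + eps) ** (q / 2.0 - 1.0)           # misfit reweighting
            A = np.diag(w_a) + lam * V.T @ (w_r[:, None] * V)
            b = lam * V.T @ (w_r * v)
            alpha = np.linalg.solve(A, b)
        return alpha

For $q = 2$ the misfit weights reduce to constants and each iteration is a standard ridge-like solve; letting $p$ tend to zero pushes the solution toward the sparsest representation, which is the regime the non-convex objective targets.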

Experimental evaluation

The experimental evaluation was carried out in two parts. In the first part, experiments were carried out on benchmark datasets from the University of California, Irvine Machine Learning (UCI ML) repository. The second part of the experiments was carried out on two widely used face and character recognition datasets, namely YaleB and USPS, respectively.

For the UCI ML datasets, we chose only those datasets which have no missing values or unlabeled samples. The proposed

Conclusion

This paper proposes a new classification algorithm based on a simple assumption: that the training samples of a particular class approximately form a linear basis for any new test sample belonging to that class. The assumption appears simple, but incorporating its implications into the actual optimization problem involved in the classification is far from trivial. Two previous works have already been built on this assumption, namely the Sparse Classifier (SC) (

References (21)

  • E. Amaldi et al. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theor. Comput. Sci. (1998)
  • R. Chartrand et al. Restricted isometry properties and nonconvex compressive sensing. Inverse Prob. (2008)
  • Chartrand, R., Yin, W., 2008. Iteratively reweighted algorithms for compressive sensing. In: IEEE Internat. Conf. on...
  • S. Chen et al. Atomic decomposition by basis pursuit. SIAM Rev. (2001)
  • Daubechies, I., DeVore, R., Fornasier, M., Sinan Güntürk, C., preprint. Iteratively re-weighted least squares...
  • D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. Pure Appl. Math. (2006)
  • Eldar, Y.C., Kuppinger, P., Bölcskei, H., 2009. Compressed Sensing of Block-Sparse Signals: Uncertainty Relations and...
  • Haber, E., 1997. Numerical Strategies for the Solution of Inverse Problems. Ph.D. Thesis, Univ. of British...
  • Huang, J., Zhang, T., 2008. The Benefit of Group Sparsity. Available from:...
  • l1-magic....
There are more references available in the full text version of this article.

Cited by (14)

  • Sparse neighbor representation for classification

    2012, Pattern Recognition Letters
    Citation excerpt:

    SRC cast the recognition problem as one of classifying among multiple linear reconstruction models and argued that new theory from sparse signal representation offers the key to address this problem. The work in (Majumdar and Ward, 2010) is a generalization of SRC, with the original assumption extended to include cases that the test samples can be represented by a nonlinear combination of the training samples. Yang et al. (2011) proposed a robust sparse coding method, by modeling the sparse coding as a sparsity constrained robust regression problem, and it is robust to outliers.

  • Iterative Re-weighted Least Squares Algorithms for Non-negative Sparse and Group-sparse Recovery

    2022, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
  • Class-wise deep dictionary learning

    2017, Proceedings of the International Joint Conference on Neural Networks