Pattern Recognition

Volume 42, Issue 7, July 2009, Pages 1237-1247

Gaussian kernel optimization for pattern classification

https://doi.org/10.1016/j.patcog.2008.11.024

Abstract

This paper presents a novel algorithm to optimize the Gaussian kernel for pattern classification tasks, where it is desirable to have well-separated samples in the kernel feature space. We propose to optimize the Gaussian kernel parameters by maximizing a classical class separability criterion, and the problem is solved through a quasi-Newton algorithm by making use of a recently proposed decomposition of the objective criterion. The proposed method is evaluated on five data sets with two kernel-based learning algorithms. The experimental results indicate that it achieves the best overall classification performance, compared with three competing solutions. In particular, the proposed method provides a valuable kernel optimization solution in the severe small sample size scenario.

Introduction

The kernel machine technique has been widely used to tackle complicated classification problems [1], [2], [3], [4], [5] by applying a nonlinear mapping from the original input space to a kernel feature space. Although in general the dimensionality of the kernel feature space can be arbitrarily large or even infinite, which makes direct analysis in this space very difficult, the nonlinear mapping can be specified implicitly by replacing the dot products in the kernel feature space with a kernel function defined in the original input space. Therefore, the key task in deriving a kernel-based solution is to express the underlying linear algorithm entirely in terms of dot products, which can then be replaced by kernel evaluations. Existing kernel-based algorithms include kernel principal component analysis (KPCA) [6], generalized discriminant analysis (GDA) [7] and kernel direct discriminant analysis (KDDA) [8]. In kernel-based learning algorithms, choosing an appropriate kernel, which is a model selection problem [4], is crucial to ensure good performance, since the geometrical structure of the mapped samples is determined by the selected kernel and its parameters. This paper therefore focuses on kernel optimization for pattern classification in a supervised setting, where labeled data are available for training.
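To make the kernel trick concrete, the minimal sketch below (in Python/NumPy; not from the paper, and the parameterization exp(-||x-y||^2/(2*sigma^2)) is an assumption) evaluates a Gaussian kernel directly in the input space. The returned scalar plays the role of the dot product <phi(x), phi(y)> in the kernel feature space, which is never constructed explicitly.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    This single scalar stands in for the dot product <phi(x), phi(y)> in the
    (possibly infinite-dimensional) kernel feature space, so a linear algorithm
    written purely in terms of dot products never needs phi(.) explicitly.
    (Illustrative sketch only; sigma = 1.0 is an arbitrary choice.)
    """
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

x, y = np.array([1.0, 2.0]), np.array([1.5, 0.5])
print(gaussian_kernel(x, y, sigma=1.0))
```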

For classification purposes, it is often expected that the linear separability of the mapped samples is enhanced in the kernel feature space, so that applying traditional linear algorithms in this space yields better performance than in the original input space. However, this is not always the case [9]. If an inappropriate kernel is selected, the classification performance of kernel-based methods can be even worse than that of their linear counterparts. Therefore, selecting a proper kernel with good class separability plays a significant role in kernel-based classification algorithms.

In the literature, there are three approaches to kernel optimization. First, cross validation is a commonly used method, but it can only select kernel parameters from a set of discrete values defined empirically, which may or may not contain the optimal or even suboptimal parameter values. Furthermore, cross validation is a costly procedure and can only be performed when sufficient training samples are available. Thus, it may fail when only a very limited number of training samples is available, which is often the case in realistic applications such as face recognition. The second approach is to directly optimize the kernel matrix. In [10], the kernel matrix is optimized by maximizing the margin in the kernel feature space via a semi-definite programming technique. Weinberger et al. [11] proposed to learn the kernel matrix by maximizing the variances in the kernel feature space, and their method performed well for manifold learning but not for classification. In [12], a measure called "alignment" is proposed to adapt the kernel matrix to the sample labels, and a series of algorithms are derived for two-class classification problems. Their extensions to multi-class classification problems are usually obtained by decomposing the problem into a series of two-class problems through the so-called "one-vs-all" or "one-vs-one" schemes [13], [14]. These methods usually result in different classifiers/feature extractors for different class pairs, which complicates the implementation and increases the computational demand and memory requirements in both training and testing, especially for a large number of classes. In general, direct optimization of the kernel matrix operates in the so-called transductive setting and requires the availability of both training and test data in the learning process. This is unrealistic in most, if not all, real-time applications, in addition to its high computational demand in testing. The third approach is to optimize the kernel function rather than the kernel matrix. In [15], [16], [17], [18], a radius-margin quotient is used as a criterion to tune kernel parameters for the support vector machine (SVM) classifier, and it is applicable to two-class classification problems only. Xiong et al. [9] proposed to optimize a kernel function in the so-called empirical feature space by maximizing a class separability measure defined as the ratio between the trace of the between-class scatter matrix and the trace of the within-class scatter matrix, which corresponds to the class separability criterion J4 in [19]. Promising results have been reported on a set of two-class classification problems, but the performance on more general multi-class classification problems has not been studied. Furthermore, as pointed out in [19], the J4 criterion depends on the coordinate system. In addition, it needs a set of samples to form the so-called empirical core set, which limits its use in the severe small sample size (SSS) scenario, where only a very small number of (e.g., two) samples per class are available for training.
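As an illustration of the second (kernel-matrix) approach, the sketch below computes a kernel-target alignment in the spirit of [12] for a two-class problem: the normalized Frobenius inner product between the kernel matrix K and the ideal target matrix yy^T built from labels y in {-1, +1}. This is the standard uncentered definition and is not code from the paper; the Gaussian parameterization and the small grid of candidate widths are assumptions made for the example.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """N x N Gaussian kernel (Gram) matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kernel_target_alignment(K, y):
    """A(K, yy^T) = <K, yy^T>_F / (||K||_F * ||yy^T||_F), with y in {-1, +1}."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

# Toy usage: pick, from a discrete grid, the width sigma whose Gaussian kernel
# matrix best aligns with the labels of two separated Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
best = max((kernel_target_alignment(gaussian_gram(X, s), y), s)
           for s in (0.1, 0.5, 1.0, 2.0, 5.0))
print("best alignment %.3f at sigma %.1f" % best)
```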

Among the three approaches, the third is the most principled, since it directly optimizes the kernel function, which defines the mapping from the input space to the kernel feature space. Therefore, in this paper we take this approach and propose a kernel optimization algorithm that maximizes the J1 class separability criterion in [19], defined as the trace of the ratio between the between-class and within-class scatter matrices, i.e., $\mathrm{tr}(\mathbf{S}_w^{-1}\mathbf{S}_b)$. This criterion is equivalent to the criterion used in classical Fisher discriminant analysis [19], [20], and it is invariant under any non-singular linear transformation. We focus on optimizing the Gaussian kernel since it is widely used with success in various applications. The optimization is solved using a Newton-based algorithm, built on the decomposition introduced recently in [21]. The proposed solution works for multi-class problems and is applicable in the severe SSS scenario as well. Furthermore, although an isotropic Gaussian kernel (with a single parameter) is used to present the algorithm, the proposed method can be readily extended to multi-parameter optimization problems as well as to other kernels with differentiable kernel functions.
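The sketch below illustrates the general idea of tuning the Gaussian kernel width by maximizing a class-separability measure with a quasi-Newton search. Important caveats: it maximizes the simple scatter-trace quotient tr(S_b)/tr(S_w) computed directly from kernel values (closer to the J4-style measure discussed above) rather than the paper's J1 criterion, because the decomposition of [21] and the analytic gradient are not reproduced in this excerpt; SciPy's L-BFGS-B over log(sigma) with finite-difference gradients likewise stands in for the authors' Newton-based algorithm. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(X, sigma):
    """N x N Gaussian kernel matrix, K[p, q] = exp(-||x_p - x_q||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def separability(X, y, sigma):
    """Scatter-trace quotient tr(S_b) / tr(S_w) in the kernel feature space,
    evaluated from kernel values only (no explicit feature map)."""
    K = gaussian_gram(X, sigma)
    n = len(y)
    mean_all = K.mean()                     # ||global feature-space mean||^2
    tr_sb, tr_sw = 0.0, 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        n_c = len(idx)
        Kc = K[np.ix_(idx, idx)]
        norm_mc2 = Kc.mean()                # ||class mean||^2
        cross = K[idx, :].mean()            # <class mean, global mean>
        tr_sb += (n_c / n) * (norm_mc2 - 2.0 * cross + mean_all)
        tr_sw += (n_c / n) * (np.mean(np.diag(Kc)) - norm_mc2)
    return tr_sb / (tr_sw + 1e-12)          # guard against a vanishing denominator

def optimize_sigma(X, y, sigma0=1.0):
    """Quasi-Newton (L-BFGS-B) search over log(sigma); gradients are taken
    numerically, unlike the analytic Newton-based scheme in the paper."""
    obj = lambda s: -separability(X, y, np.exp(s[0]))
    res = minimize(obj, x0=[np.log(sigma0)], method="L-BFGS-B")
    return float(np.exp(res.x[0]))
```

In this sketch, `optimize_sigma(X, y)` returns a width that can then be used to build the kernel matrix for whichever kernel-based learning algorithm follows, reflecting how kernel optimization is decoupled from the downstream classifier in this family of methods.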

The rest of the paper is organized as follows. Section 2 presents the optimization algorithm in detail: the optimization criterion is formulated and a Newton-based search algorithm is adopted to optimize it. In Section 3, experimental results on five different data sets, including both two-class and multi-class problems, show that the proposed kernel optimization method improves the classification performance of two kernel-based classification algorithms, especially in the severe SSS scenario; comparisons with three competing methods are also included. Finally, conclusions are drawn in Section 4.

Section snippets

Gaussian kernel optimization for pattern classification

Kernel-based learning algorithms are essentially implementations of the corresponding linear algorithms in the kernel feature space. Let $\mathcal{Z}=\{\mathcal{Z}_i\}_{i=1}^{C}$ be a training set containing $C$ classes, where each class $\mathcal{Z}_i=\{\mathbf{z}_{ij}\}_{j=1}^{N_i}$ consists of $N_i$ samples, $\mathbf{z}_{ij}\in\mathbb{R}^J$, and $\mathbb{R}^J$ denotes the $J$-dimensional real space. Let $\phi(\cdot)$ be a nonlinear mapping from the input space $\mathbb{R}^J$ to a kernel feature space $\mathcal{F}$ [5], i.e., $\phi:\mathbf{z}\in\mathbb{R}^J\rightarrow\phi(\mathbf{z})\in\mathcal{F}$, where $\mathcal{F}$ denotes the $F$-dimensional kernel feature space [22]. Then $\Phi=[\phi(\mathbf{z}_{11}),\ldots,\phi(\mathbf{z}_{CN_C})]$ denotes the matrix formed by all of the mapped training samples in $\mathcal{F}$.
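In code, the class-structured training set and the Gram matrix $K=\Phi^{T}\Phi$, accessed only through the kernel function and never through $\Phi$ itself, might look as follows. This is a notational sketch only; the data, the isotropic Gaussian form of the kernel and the value sigma = 1 are assumptions for illustration.

```python
import numpy as np

# Hypothetical training set Z = {Z_1, ..., Z_C}: a list of C arrays,
# the i-th of shape (N_i, J).  Here C = 2 classes and J = 3 features.
rng = np.random.default_rng(0)
Z = [rng.normal(0.0, 1.0, (5, 3)), rng.normal(1.0, 1.0, (7, 3))]

X = np.vstack(Z)                                      # all N samples, stacked row-wise
labels = np.concatenate([np.full(len(Zi), i) for i, Zi in enumerate(Z)])

def gaussian_gram(X, sigma=1.0):
    """K[p, q] = k(z_p, z_q) = <phi(z_p), phi(z_q)>: the matrix Phi^T Phi,
    computed in the input space without ever forming Phi."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

K = gaussian_gram(X)   # N x N kernel matrix used by KPCA, GDA, KDDA, etc.
```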

Experimental evaluation

In order to evaluate the effectiveness of the proposed method, five sets of experiments are conducted, including both simple two-class problems and more complicated multi-class problems. The first experiment evaluates the sensitivity of the proposed method in terms of the influence of the system parameters on the classification performance. The second experiment compares the proposed method with three competing solutions. In the third experiment, two optimization criteria J and J4 are compared.

Conclusions

In this paper, a novel kernel optimization method has been proposed for pattern classification tasks. We propose to use the classical class separability criterion defined as the trace of the scatter ratio for kernel optimization. Based on the recent decomposition proposed in [21], a differentiable version of the criterion is developed for the isotropic Gaussian kernel. Experimental results on five different data sets, including both two-class and multi-class problems, demonstrate that the proposed method improves the classification performance of kernel-based learning algorithms, especially in the severe small sample size scenario.

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful and constructive comments. The authors would also like to thank the FERET Technical Agent, the U.S. National Institute of Standards and Technology for providing the FERET database. This work is partially supported by a Bell University Lab research grant and the CITO Student Internship Program.

References (34)

  • G. Baudat et al., Generalized discriminant analysis using a kernel approach, Neural Computation (2000)
  • J. Lu et al., Face recognition using kernel direct discriminant analysis algorithms, IEEE Transactions on Neural Networks (2003)
  • H. Xiong et al., Optimizing the kernel in the empirical feature space, IEEE Transactions on Neural Networks (2005)
  • G.R.G. Lanckriet et al., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research (2004)
  • K.Q. Weinberger et al., Learning a kernel matrix for nonlinear dimensionality reduction
  • N. Cristianini et al., On kernel target alignment
  • L. Bo et al., Feature scaling for kernel Fisher discriminant analysis using leave-one-out cross validation, Neural Computation (2006)

About the Author—JIE WANG received the B.Eng. and M.Sci. degrees in electronic engineering from Zhejiang University, PR China, in 1998 and 2001, respectively, and the Ph.D. degree from the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada, in 2007. Her research interests include face detection and recognition, multi-modal biometrics and multi-classifier systems.

About the Author—HAIPING LU received the B.Eng. and M.Eng. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2001 and 2004, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2008. His current research interests include statistical pattern recognition and tensor object processing, with applications in human detection, tracking and recognition.

About the Author—K.N. PLATANIOTIS received the B.Eng. degree in computer engineering and informatics from the University of Patras, Greece, in 1988 and the M.S. and Ph.D. degrees in electrical engineering from the Florida Institute of Technology (Florida Tech), Melbourne, in 1992 and 1994, respectively. He is currently a Professor with the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada. His research interests are in the areas of multimedia systems, biometrics, image and signal processing, communications systems and pattern recognition. Dr. Plataniotis is a registered professional engineer in the province of Ontario.

About the Author—JUWEI LU received the B.Eng. degree in electrical engineering from Nanjing University of Aeronautics and Astronautics, China, in 1994, the M.Eng. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 1999, and the Ph.D. degree from the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada, in 2004. His research interests include multimedia signal processing, face detection and recognition, kernel methods, support vector machines, neural networks, and boosting technologies.
