Pattern Recognition

Volume 42, Issue 7, July 2009, Pages 1237-1247

Gaussian kernel optimization for pattern classification

https://doi.org/10.1016/j.patcog.2008.11.024

Abstract

This paper presents a novel algorithm to optimize the Gaussian kernel for pattern classification tasks, where it is desirable to have well-separated samples in the kernel feature space. We propose to optimize the Gaussian kernel parameters by maximizing a classical class separability criterion, and the problem is solved through a quasi-Newton algorithm by making use of a recently proposed decomposition of the objective criterion. The proposed method is evaluated on five data sets with two kernel-based learning algorithms. The experimental results indicate that it achieves the best overall classification performance, compared with three competing solutions. In particular, the proposed method provides a valuable kernel optimization solution in the severe small sample size scenario.

Introduction

The kernel machine technique has been widely used to tackle complicated classification problems [1], [2], [3], [4], [5] by applying a nonlinear mapping from the original input space to a kernel feature space. Although in general the dimensionality of the kernel feature space can be arbitrarily large or even infinite, which makes direct analysis in this space very difficult, the nonlinear mapping can be specified implicitly by replacing the dot products in the kernel feature space with a kernel function defined in the original input space. Therefore, the key task in deriving a kernel-based solution is to express the underlying linear algorithm entirely in terms of dot products, which can then be replaced by kernel evaluations. Existing kernel-based algorithms include kernel principal component analysis (KPCA) [6], generalized discriminant analysis (GDA) [7] and kernel direct discriminant analysis (KDDA) [8]. In kernel-based learning algorithms, choosing an appropriate kernel, which is a model selection problem [4], is crucial to ensure good performance, since the geometrical structure of the mapped samples is determined by the selected kernel and its parameters. This paper therefore focuses on kernel optimization for pattern classification in a supervised setting, where labeled data are available for training.
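To make the kernel trick concrete, the minimal sketch below (in Python/NumPy; not from the paper, and the parameterization exp(-||x-y||^2/(2*sigma^2)) is an assumption) evaluates a Gaussian kernel directly in the input space. The returned scalar plays the role of the dot product <phi(x), phi(y)> in the kernel feature space, which is never constructed explicitly.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    This single scalar stands in for the dot product <phi(x), phi(y)> in the
    (possibly infinite-dimensional) kernel feature space, so a linear algorithm
    written purely in terms of dot products never needs phi(.) explicitly.
    (Illustrative sketch only; sigma = 1.0 is an arbitrary choice.)
    """
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

x, y = np.array([1.0, 2.0]), np.array([1.5, 0.5])
print(gaussian_kernel(x, y, sigma=1.0))
```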

For classification purposes, it is often expected that the linear separability of the mapped samples is enhanced in the kernel feature space, so that applying traditional linear algorithms in this space yields better performance than in the original input space. However, this is not always the case [9]. If an inappropriate kernel is selected, the classification performance of kernel-based methods can be even worse than that of their linear counterparts. Therefore, selecting a proper kernel with good class separability plays a significant role in kernel-based classification algorithms.

In the literature, there are three approaches to kernel optimization. First, cross validation is a commonly used method, but it can only select kernel parameters from a set of discrete values defined empirically, which may or may not contain the optimal or even suboptimal parameter values. Furthermore, cross validation is a costly procedure and can only be performed when sufficient training samples are available. Thus, it may fail when only a very limited number of training samples is available, which is often the case in realistic applications such as face recognition. The second approach is to directly optimize the kernel matrix. In [10], the kernel matrix is optimized by maximizing the margin in the kernel feature space via a semi-definite programming technique. Weinberger et al. [11] proposed to learn the kernel matrix by maximizing the variances in the kernel feature space, and their method performed well for manifold learning but not for classification. In [12], a measure called "alignment" is proposed to adapt the kernel matrix to the sample labels, and a series of algorithms are derived for two-class classification problems. Their extensions to multi-class classification problems are usually obtained by decomposing the problem into a series of two-class problems through the so-called "one-vs-all" or "one-vs-one" schemes [13], [14]. These methods usually result in different classifiers/feature extractors for different class pairs, which complicates the implementation and increases the computational demand and memory requirements in both training and testing, especially for a large number of classes. In general, direct optimization of the kernel matrix operates in the so-called transductive setting and requires the availability of both training and test data in the learning process. This is unrealistic in most, if not all, real-time applications, in addition to its high computational demand in testing. The third approach is to optimize the kernel function rather than the kernel matrix. In [15], [16], [17], [18], a radius-margin quotient is used as a criterion to tune kernel parameters for the support vector machine (SVM) classifier, and it is applicable to two-class classification problems only. Xiong et al. [9] proposed to optimize a kernel function in the so-called empirical feature space by maximizing a class separability measure defined as the ratio between the trace of the between-class scatter matrix and the trace of the within-class scatter matrix, which corresponds to the class separability criterion J4 in [19]. Promising results have been reported on a set of two-class classification problems, but the performance on more general multi-class classification problems has not been studied. Furthermore, as pointed out in [19], the J4 criterion depends on the coordinate system. In addition, it needs a set of samples to form the so-called empirical core set, which limits its use in the severe small sample size (SSS) scenario, where only a very small number of (e.g., two) samples per class are available for training.
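As an illustration of the second (kernel-matrix) approach, the sketch below computes a kernel-target alignment in the spirit of [12] for a two-class problem: the normalized Frobenius inner product between the kernel matrix K and the ideal target matrix yy^T built from labels y in {-1, +1}. This is the standard uncentered definition and is not code from the paper; the Gaussian parameterization and the small grid of candidate widths are assumptions made for the example.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """N x N Gaussian kernel (Gram) matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def kernel_target_alignment(K, y):
    """A(K, yy^T) = <K, yy^T>_F / (||K||_F * ||yy^T||_F), with y in {-1, +1}."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

# Toy usage: pick, from a discrete grid, the width sigma whose Gaussian kernel
# matrix best aligns with the labels of two separated Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
best = max((kernel_target_alignment(gaussian_gram(X, s), y), s)
           for s in (0.1, 0.5, 1.0, 2.0, 5.0))
print("best alignment %.3f at sigma %.1f" % best)
```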

Among the three approaches, the third is the most principled, since it directly optimizes the kernel function, which defines the mapping from the input space to the kernel feature space. Therefore, in this paper we take this approach and propose a kernel optimization algorithm that maximizes the J1 class separability criterion in [19], defined as the trace of the ratio between the between-class and within-class scatter matrices, i.e., $\mathrm{tr}(\mathbf{S}_w^{-1}\mathbf{S}_b)$. This criterion is equivalent to the criterion used in classical Fisher discriminant analysis [19], [20], and it is invariant under any non-singular linear transformation. We focus on optimizing the Gaussian kernel since it is widely used with success in various applications. The optimization is solved using a Newton-based algorithm, built on the decomposition introduced recently in [21]. The proposed solution works for multi-class problems and is applicable in the severe SSS scenario as well. Furthermore, although an isotropic Gaussian kernel (with a single parameter) is used to present the algorithm, the proposed method can be readily extended to multi-parameter optimization problems as well as to other kernels with differentiable kernel functions.
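The sketch below illustrates the general idea of tuning the Gaussian kernel width by maximizing a class-separability measure with a quasi-Newton search. Important caveats: it maximizes the simple scatter-trace quotient tr(S_b)/tr(S_w) computed directly from kernel values (closer to the J4-style measure discussed above) rather than the paper's J1 criterion, because the decomposition of [21] and the analytic gradient are not reproduced in this excerpt; SciPy's L-BFGS-B over log(sigma) with finite-difference gradients likewise stands in for the authors' Newton-based algorithm. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(X, sigma):
    """N x N Gaussian kernel matrix, K[p, q] = exp(-||x_p - x_q||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def separability(X, y, sigma):
    """Scatter-trace quotient tr(S_b) / tr(S_w) in the kernel feature space,
    evaluated from kernel values only (no explicit feature map)."""
    K = gaussian_gram(X, sigma)
    n = len(y)
    mean_all = K.mean()                     # ||global feature-space mean||^2
    tr_sb, tr_sw = 0.0, 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        n_c = len(idx)
        Kc = K[np.ix_(idx, idx)]
        norm_mc2 = Kc.mean()                # ||class mean||^2
        cross = K[idx, :].mean()            # <class mean, global mean>
        tr_sb += (n_c / n) * (norm_mc2 - 2.0 * cross + mean_all)
        tr_sw += (n_c / n) * (np.mean(np.diag(Kc)) - norm_mc2)
    return tr_sb / (tr_sw + 1e-12)          # guard against a vanishing denominator

def optimize_sigma(X, y, sigma0=1.0):
    """Quasi-Newton (L-BFGS-B) search over log(sigma); gradients are taken
    numerically, unlike the analytic Newton-based scheme in the paper."""
    obj = lambda s: -separability(X, y, np.exp(s[0]))
    res = minimize(obj, x0=[np.log(sigma0)], method="L-BFGS-B")
    return float(np.exp(res.x[0]))
```

In this sketch, `optimize_sigma(X, y)` returns a width that can then be used to build the kernel matrix for whichever kernel-based learning algorithm follows, reflecting how kernel optimization is decoupled from the downstream classifier in this family of methods.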

The rest of the paper is organized as follows. Section 2 presents the optimization algorithm in detail: the optimization criterion is formulated and a Newton-based search algorithm is adopted to optimize it. In Section 3, experimental results on five different data sets, including both two-class and multi-class problems, show that the proposed kernel optimization method improves the classification performance of two kernel-based classification algorithms, especially in the severe SSS scenario; comparisons with three competing methods are also included. Finally, conclusions are drawn in Section 4.

Section snippets

Gaussian kernel optimization for pattern classification

Kernel-based learning algorithms are essentially implementations of the corresponding linear algorithms in the kernel feature space. Let $\mathcal{Z}=\{\mathcal{Z}_i\}_{i=1}^{C}$ be a training set containing $C$ classes, where each class $\mathcal{Z}_i=\{\mathbf{z}_{ij}\}_{j=1}^{N_i}$ consists of $N_i$ samples, $\mathbf{z}_{ij}\in\mathbb{R}^J$, and $\mathbb{R}^J$ denotes the $J$-dimensional real space. Let $\phi(\cdot)$ be a nonlinear mapping from the input space $\mathbb{R}^J$ to a kernel feature space $\mathcal{F}$ [5], i.e., $\phi:\mathbf{z}\in\mathbb{R}^J\rightarrow\phi(\mathbf{z})\in\mathcal{F}$, where $\mathcal{F}$ denotes the $F$-dimensional kernel feature space [22]. Then $\Phi=[\phi(\mathbf{z}_{11}),\ldots,\phi(\mathbf{z}_{CN_C})]$ denotes the matrix formed by all of the mapped training samples in $\mathcal{F}$.
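In code, the class-structured training set and the Gram matrix $K=\Phi^{T}\Phi$, accessed only through the kernel function and never through $\Phi$ itself, might look as follows. This is a notational sketch only; the data, the isotropic Gaussian form of the kernel and the value sigma = 1 are assumptions for illustration.

```python
import numpy as np

# Hypothetical training set Z = {Z_1, ..., Z_C}: a list of C arrays,
# the i-th of shape (N_i, J).  Here C = 2 classes and J = 3 features.
rng = np.random.default_rng(0)
Z = [rng.normal(0.0, 1.0, (5, 3)), rng.normal(1.0, 1.0, (7, 3))]

X = np.vstack(Z)                                      # all N samples, stacked row-wise
labels = np.concatenate([np.full(len(Zi), i) for i, Zi in enumerate(Z)])

def gaussian_gram(X, sigma=1.0):
    """K[p, q] = k(z_p, z_q) = <phi(z_p), phi(z_q)>: the matrix Phi^T Phi,
    computed in the input space without ever forming Phi."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

K = gaussian_gram(X)   # N x N kernel matrix used by KPCA, GDA, KDDA, etc.
```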

Experimental evaluation

In order to evaluate the effectiveness of the proposed method, five sets of experiments are conducted, including both simple two-class problems and more complicated multi-class problems. The first experiment evaluates the sensitivity of the proposed method in terms of the influence of the system parameters on the classification performance. The second experiment compares the proposed method with three competing solutions. In the third experiment, two optimization criteria J and J4 are compared.

Conclusions

In this paper, a novel kernel optimization method has been proposed for pattern classification tasks. We propose to use the classical class separability criterion defined as the trace of the scatter ratio for kernel optimization. Based on the recent decomposition proposed in [21], a differentiable version of the criterion is developed for the isotropic Gaussian kernel. Experimental results on five different data sets, including both two-class and multi-class problems, demonstrate that the proposed method improves the classification performance of kernel-based learning algorithms, especially in the severe small sample size scenario.

Acknowledgements

The authors would like to thank the anonymous reviewers for their insightful and constructive comments. The authors would also like to thank the FERET Technical Agent, the U.S. National Institute of Standards and Technology for providing the FERET database. This work is partially supported by a Bell University Lab research grant and the CITO Student Internship Program.

References (34)

  • G. Baudat et al., Generalized discriminant analysis using a kernel approach, Neural Computation (2000)
  • J. Lu et al., Face recognition using kernel direct discriminant analysis algorithms, IEEE Transactions on Neural Networks (2003)
  • H. Xiong et al., Optimizing the kernel in the empirical feature space, IEEE Transactions on Neural Networks (2005)
  • G.R.G. Lanckriet et al., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research (2004)
  • K.Q. Weinberger et al., Learning a kernel matrix for nonlinear dimensionality reduction
  • N. Cristianini et al., On kernel target alignment
  • L. Bo et al., Feature scaling for kernel Fisher discriminant analysis using leave-one-out cross validation, Neural Computation (2006)

About the Author—JIE WANG received the B.Eng. and M.Sci. degrees in electronic engineering from Zhejiang University, PR China, in 1998 and 2001, respectively, and the Ph.D. degree from the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada, in 2007. Her research interests include face detection and recognition, multi-modal biometrics and multi-classifier systems.

About the Author—HAIPING LU received the B.Eng. and M.Eng. degrees in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2001 and 2004, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2008. His current research interests include statistical pattern recognition and tensor object processing, with applications in human detection, tracking and recognition.

About the Author—K.N. PLATANIOTIS received the B.Eng. degree in computer engineering and informatics from the University of Patras, Greece, in 1988 and the M.S. and Ph.D. degrees in electrical engineering from the Florida Institute of Technology (Florida Tech), Melbourne, in 1992 and 1994, respectively. He is currently a Professor with the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada. His research interests are in the areas of multimedia systems, biometrics, image and signal processing, communications systems and pattern recognition. Dr. Plataniotis is a registered professional engineer in the province of Ontario.

About the Author—JUWEI LU received the B.Eng. degree in electrical engineering from Nanjing University of Aeronautics and Astronautics, China, in 1994, the M.Eng. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, in 1999, and the Ph.D. degree from the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Canada, in 2004. His research interests include multimedia signal processing, face detection and recognition, kernel methods, support vector machines, neural networks, and boosting technologies.
