
Pattern Recognition Letters

Volume 88, 1 March 2017, Pages 6-11

Empirical comparison of cross-validation and internal metrics for tuning SVM hyperparameters

https://doi.org/10.1016/j.patrec.2017.01.007

Highlights

  • Compared cross validation with 5 internal metrics to select SVM hyperparameters.

  • Cross validation results in hyperparameters with better accuracy on new data.

  • Distance between two classes (DBTC) is the second best algorithm.

  • DBTC has the lowest execution time and is a very competitive alternative to 5-fold CV.

Abstract

Hyperparameter tuning is a mandatory step for building a support vector machine classifier. In this work, we study methods based on metrics computed on the training set itself, rather than on the performance of the classifier on a separate test set (the usual cross-validation approach). We compare cross-validation (5-fold) with Xi-alpha, radius-margin bound, generalized approximate cross validation, maximum discrepancy and distance between two classes on 110 public binary data sets. Cross validation is the method that resulted in the best selection of the hyper-parameters, but it is also the method with one of the highest execution times. Distance between two classes (DBTC) is the fastest and the second best ranked method. We argue that DBTC is a reasonable alternative to cross validation when training/hyperparameter-selection times are an issue and that the loss in accuracy when using DBTC is reasonably small.

Introduction

Support Vector Machines (SVMs) are commonly used in classification problems with two or more classes. In its general formulation, the SVM works by mapping input data $x$ into a high-dimensional feature space $\phi(x)$, and building a hyperplane $f(x) = w \cdot \phi(x) + b$ to separate the examples of the two classes. For an L1 soft-margin SVM, this hyperplane is defined by solving the primal problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to}\quad y_i\,(w \cdot \phi(x_i) + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ 1 \le i \le n \tag{1}$$

where $x_i$ is a data example and $y_i \in \{-1, 1\}$ its label/class. Computationally, this problem is solved in its dual form:

$$\min_{\alpha}\ \frac{1}{2}\alpha^T K \alpha - e^T \alpha \quad \text{subject to}\quad y^T \alpha = 0,\ \ 0 \le \alpha_i \le C,\ \ 1 \le i \le n \tag{2}$$

where $e$ is a vector of ones and $K_{ij} = k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, with $k$ a kernel function that implicitly performs the $\phi$ mapping. The Gaussian Radial Basis Function (RBF) kernel $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ is a common choice for the kernel function.
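
To make the roles of C and γ concrete, the following sketch (ours, not the authors' code) trains an RBF-kernel SVM for one fixed pair of hyper-parameters, using scikit-learn's SVC (a wrapper around the LIBSVM solver) and synthetic data in place of the benchmark data sets:

    # Illustrative only: train an L1 soft-margin SVM with an RBF kernel for a
    # fixed pair (C, gamma); scikit-learn's SVC solves the dual problem of Eq. (2).
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Synthetic binary data standing in for one of the benchmark data sets.
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    C, gamma = 1.0, 0.1                         # the hyper-parameters to be tuned
    clf = SVC(C=C, kernel="rbf", gamma=gamma)   # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))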

The hyper-parameters C and γ must be defined before solving the minimization in Eq. (2) and must be carefully chosen to obtain a good accuracy. Choosing C and γ, known as hyper-parameter tuning or model selection, is usually done by performing a grid search over pairs of C and γ and testing each pair using some cross-validation procedure. Examples of cross-validation procedures are k-fold, repeated k-fold, bootstrap, leave-one-out, and hold-out, among others.
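
A minimal sketch of this standard recipe, assuming scikit-learn's GridSearchCV, 5-fold cross-validation, and an illustrative grid of candidates (not the grid used in this paper), follows; X and y are the synthetic data from the previous sketch:

    # Grid search over (C, gamma) pairs, each pair scored by 5-fold cross-validation.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "C": [2.0 ** k for k in range(-5, 16, 2)],       # illustrative candidate values
        "gamma": [2.0 ** k for k in range(-15, 4, 2)],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)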

Formally, if $D$ is the data set, a cross-validation procedure will define a set of pairs of sets $TR_i$ and $TE_i$, called train and test sets, such that $TR_i \cap TE_i = \emptyset$ and $TR_i \cup TE_i \subseteq D$.

Let us use the notation $acc(B \mid A, C, \gamma)$ to indicate the accuracy on a data set $B$ of an SVM trained on data set $A$ with hyperparameters $C$ and $\gamma$.

Then the cross-validation accuracy of the pair $(C, \gamma)$ (for a data set $D$ under some cross-validation procedure) is: $acc_{cv}(C, \gamma) = \operatorname{mean}_i\, acc(TE_i \mid TR_i, C, \gamma)$

The best set of hyper-parameters, from a set of candidates $S$, is:

$$\underset{C,\,\gamma \in S}{\arg\max}\ acc_{cv}(C, \gamma) \tag{3}$$

The expression $C, \gamma \in S$ indicates that we are selecting the best $C$ and $\gamma$ from a pre-defined set of candidates (see Section 2.2).
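
These definitions translate directly into code. The sketch below (illustrative; it reuses the synthetic X, y from the earlier sketches and a toy candidate set S) computes acc_cv as the mean accuracy over k-fold (TR_i, TE_i) pairs and then takes the argmax of Eq. (3):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC

    def acc_cv(X, y, C, gamma, k=5):
        """mean_i acc(TE_i | TR_i, C, gamma) over a k-fold split of the data."""
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
        accs = []
        for tr, te in folds.split(X, y):
            clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[tr], y[tr])
            accs.append(clf.score(X[te], y[te]))
        return np.mean(accs)

    # Toy candidate set S; the paper's grid is larger.
    S = [(C, g) for C in (0.1, 1.0, 10.0, 100.0) for g in (0.01, 0.1, 1.0)]
    best_C, best_gamma = max(S, key=lambda p: acc_cv(X, y, p[0], p[1]))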

This usual cross-validation procedure may be too costly, and there have been many proposals to use measures of the training set itself as the quantity to be maximized or minimized in order to select one or both hyper-parameters. We will call them internal metrics; [9] call them performance measures, [3] call them in-sample methods, and [1] call them methods based on Statistical Learning Theory.

If $\psi(D, C, \gamma)$ is one of these internal metrics, when applied to the set $D$ with hyper-parameters $C$ and $\gamma$, then Eq. (3) would be:

$$\underset{C,\,\gamma \in S}{\arg\max}\ \psi(D, C, \gamma)$$

Again, we are selecting the best hyperparameters from a pre-defined set of pairs $S$.
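
The selection loop is the same as for cross-validation; only the scoring function changes, and no train/test splits are needed. In the sketch below ψ is a placeholder with the signature above; a concrete internal metric (DBTC) is sketched after the Conclusion:

    # Generic internal-metric selection: argmax over the candidate set S of
    # psi(D, C, gamma), computed on the full training set D = (X, y).
    def select_by_internal_metric(X, y, S, psi):
        """Return the (C, gamma) pair in S that maximizes psi(X, y, C, gamma)."""
        return max(S, key=lambda pair: psi(X, y, pair[0], pair[1]))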

One of the ideas behind using internal metrics to select hyper-parameters is that the cost of the selection will probably be lower. For each pair of hyper-parameters, the cross-validation method has to learn an SVM for each of the $TR_i$ sets, while an internal-metric method requires only one learning step, on the whole of $D$. Furthermore, some of the internal metrics are convex functions of the hyper-parameters, and thus a gradient descent method can be used to select the hyper-parameters instead of a grid search. This may also reduce the execution time of the whole learning process.

The aim of this paper is to replicate and update the works of [9] and [1] (discussed below), which tested some internal-metric procedures against cross-validation procedures on a few data sets (5 and 13, respectively). We will perform a similar comparison of 6 internal metrics, one of which has not been used in the previous research, on 110 data sets. We will compare not only the quality of the hyper-parameter selection but also the execution time of each procedure.

Duan et al. [9] compare the following internal metrics:

  • Xi-Alpha bound [12]

  • Generalized approximate cross-validation [19]

  • Approximate span bound [18]

  • VC bound [17]

  • Radius-margin bound [5]

  • Modified radius-margin bound [7]

with a standard cross-validation procedure to select hyper-parameters on 5 different data sets. Each data set was split into a training subset, with 400 to 1300 data points, and a test subset. The quality of the choice of hyper-parameters was measured as the accuracy of the resulting classifier on the remaining test set.

Duan et al. [9] concluded that the cross-validation procedures result in better classifiers. They also concluded that Xi-Alpha results in reasonable choices of the hyper-parameters, in the sense that the resulting classifier had an accuracy close to that of the cross-validation classifier, although the chosen hyper-parameters were not close to the ones selected by cross-validation. They also concluded that the approximate span and VC bounds do not result in high-accuracy classifiers, and that the (unmodified) radius-margin bound also does not result in good hyper-parameter selections. The two remaining methods, the modified radius-margin bound and generalized approximate cross-validation, resulted in choices worse than those of Xi-Alpha.

Anguita et al. [1] analyzed 5 cross-validation procedures (100 repetitions of bootstrap, 10 repetitions of bootstrap, 10-fold CV, leave-one-out, and a 70%/30% training/test split) and two internal metrics (the compression bound [11] and maximal discrepancy [4]). They tested the procedures on 13 data sets and found that all cross-validation procedures and the maximal discrepancy metric were able to select appropriate hyper-parameters.

This research replicates those studies, but we use 110 binary data sets from the UCI repository; we do not test the VC bound, approximate span, and unmodified radius-margin bounds, given the negative results of [9]; and we include maximal discrepancy [4] and distance between two classes [16] as new internal metrics to be compared. We also compare the execution time of the procedures, since one of the intended goals of internal-metric methods is a faster hyper-parameter selection procedure.

We must point out that this research does not cover all internal metrics that have been proposed to select hyperparameters. As discussed above, we removed from consideration some of the older proposals that were shown by other experiments to perform worse than the others, such as the VC bound, approximate span, and unmodified radius-margin bounds. We also do not cover a recent internal metric proposed in [14], based on the stability of the models; [14] show the efficacy of that metric in selecting hyperparameters in the cases where d ≫ n, that is, high-dimensional data sets with few data points, which are not the cases tested in this paper. Another proposal not tested herein is the heuristic to select C and γ proposed in [13]: γ is selected based on the Euclidean distance (in the data space) between random samples of both classes, and C is selected from the distribution of the values $1 - K(x_i, x_j)$ for a small sample of data.

In the remainder of this paper, we give a brief review of the methods in Section 2; the results are presented in Section 3 and the conclusions in Section 4.


Methods

Next, we briefly review the internal metrics and the experimental setup.

Results

Table 1 displays the mean rank regarding accuracy for each of the 6 selection procedures and the mean rank regarding the execution time. The table is ordered by the mean accuracy rank.

The Friedman test of the 6 procedures results in a p-value < 2.2e-16 (Friedman chi-squared = 296.05, df = 6). Tables 2 and 3 display the p-values of the Nemenyi pairwise comparisons between the selection methods, for the accuracies and for the execution times.
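
For reference, a Friedman test of this kind can be reproduced with SciPy on a (data sets × procedures) accuracy matrix; the sketch below uses a hypothetical random matrix in place of the paper's results, and the Nemenyi post-hoc comparisons can be obtained, e.g., with the scikit-posthocs package:

    # Friedman test over accuracies of several selection procedures on many data sets.
    import numpy as np
    from scipy.stats import friedmanchisquare

    rng = np.random.default_rng(0)
    acc = rng.uniform(0.7, 0.95, size=(110, 6))   # hypothetical: 110 data sets x 6 procedures
    stat, p = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
    print(f"Friedman chi-squared = {stat:.2f}, p-value = {p:.3g}")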

In agreement with the previous literature [1], [4], [9], CV

Conclusion

In accordance with the previous literature [1], [4], [9], cross validation is the selection procedure for tuning SVM hyper-parameters that results in the lowest expected error on future data.

When the selection time is a concern, we would recommend the use of the distance between two classes (DBTC) method. As described, the method first selects the γ (using a 1D grid) that maximizes Eq. (5) (which represents the distance between the centers of the two classes in the feature space) and then, with that γ, selects the C
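
A hedged sketch of this γ-selection step follows. It uses one standard kernel expression for the squared distance between the two class centers in the feature space; the exact form of Eq. (5) and the subsequent selection of C are given in the full text:

    # Illustrative DBTC-style selection of gamma: pick, from a 1D grid, the gamma
    # that maximizes the squared distance between the two class centroids in the
    # RBF feature space (synthetic data stands in for a benchmark data set).
    from sklearn.datasets import make_classification
    from sklearn.metrics.pairwise import rbf_kernel

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    def class_center_distance(X, y, gamma):
        """||m_plus - m_minus||^2 in feature space, written with kernel means."""
        Xp, Xn = X[y == 1], X[y != 1]
        kpp = rbf_kernel(Xp, Xp, gamma=gamma).mean()
        knn = rbf_kernel(Xn, Xn, gamma=gamma).mean()
        kpn = rbf_kernel(Xp, Xn, gamma=gamma).mean()
        return kpp + knn - 2.0 * kpn

    gammas = [2.0 ** k for k in range(-15, 4)]    # 1D grid over gamma only
    best_gamma = max(gammas, key=lambda g: class_center_distance(X, y, g))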

References (20)

  • D. Anguita et al., Hyperparameter design criteria for support vector classifiers, Neurocomputing (2003)

  • K. Duan et al., Evaluation of simple performance measures for tuning SVM hyperparameters, Neurocomputing (2003)

  • D. Anguita et al., Theoretical and practical model selection methods for support vector classifiers, in: Support Vector Machines: Theory and Applications (2005)

  • D. Anguita et al., Model selection for support vector machines: advantages and disadvantages of the machine learning theory, IEEE International Joint Conference on Neural Networks (2010)

  • D. Anguita et al., In-sample and out-of-sample model selection and error estimation for support vector machines, IEEE Trans. Neural Netw. Learn. Syst. (2012)

  • C. Campbell et al., Dynamically adapting kernels in support vector machines, Adv. Neural Inf. Process. Syst. (1999)

  • C.C. Chang et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) (2011)

  • K. Chung et al., Radius margin bounds for support vector machines with the RBF kernel, Neural Comput. (2003)

  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006)

  • M. Fernández-Delgado et al., Do we need hundreds of classifiers to solve real world classification problems?, J. Mach. Learn. Res. (2014)