Empirical comparison of cross-validation and internal metrics for tuning SVM hyperparameters
Introduction
Support Vector Machines (SVMs) are commonly used in classification problems with two or more classes. In its general formulation, an SVM maps the input data $x$ into a high-dimensional feature space $\phi(x)$ and builds a hyperplane $w^T \phi(x) + b = 0$ to separate the examples of the two classes. For an L1 soft-margin SVM, this hyperplane is defined by solving the primal problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \tag{1}$$

where $x_i$ is a data example and $y_i \in \{-1, +1\}$ its label/class. Computationally, this problem is solved in its dual form:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \;\; 0 \le \alpha_i \le C, \tag{2}$$

where $e$ is a vector of ones, $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is a kernel function that performs the $\phi$ mapping implicitly. The Gaussian Radial Basis Function (RBF) kernel, $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, is a common choice for the kernel function.
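As a concrete illustration of the RBF kernel, the following sketch (a hypothetical NumPy helper, not code from this paper) computes the Gram matrix $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    # squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from rounding

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X, gamma=1.0)
# K[0, 0] == 1.0 (zero distance); K[0, 1] == exp(-1)
```

Note that γ controls how fast the similarity decays with distance, which is why it must be tuned jointly with C.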
The hyper-parameters C and γ must be defined before solving the minimization in Eq. (2) and must be carefully chosen to obtain a good accuracy. Choosing C and γ, known as hyper-parameter tuning or model selection, is usually done by performing a grid search over pairs of C and γ and testing each pair using some cross-validation procedure. Examples of cross validation procedures are: k-fold, repeated k-fold, bootstrap, leave one out, hold-out, among others.
Formally, if $D$ is the data set, a cross-validation procedure defines a set of $k$ pairs of sets $TR_i$ and $TE_i$, called train and test sets, such that:

$$TR_i \cup TE_i = D, \qquad TR_i \cap TE_i = \emptyset, \qquad i = 1, \ldots, k.$$

Let us use the notation $\mathrm{acc}(A \mid B, C, \gamma)$ to indicate the accuracy on a data set $A$ of an SVM trained on the data set $B$ with hyper-parameters $C$ and $\gamma$.

Then the cross-validation accuracy of the pair $C, \gamma$ (for a data set $D$ under some cross-validation procedure) is:

$$\mathrm{cv}(C, \gamma) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{acc}(TE_i \mid TR_i, C, \gamma).$$

The best set of hyper-parameters from a set of candidates $S$ is:

$$C^*, \gamma^* = \operatorname*{argmax}_{C,\, \gamma \in S} \; \mathrm{cv}(C, \gamma). \tag{3}$$

The expression $C, \gamma \in S$ indicates that we are selecting the best $C$ and $\gamma$ from a pre-defined set of candidates (see Section 2.2).
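The selection of Eq. (3) can be sketched as below. The fold-splitting helper and the accuracy surface are illustrative stand-ins; a real run would train an SVM on each $TR_i$ and evaluate it on $TE_i$:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield disjoint (TRi, TEi) index pairs whose union is {0, ..., n-1}."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        yield tr, te

def select_hyperparams(S, cv_accuracy):
    """Eq. (3): the (C, gamma) pair in S with the highest CV accuracy."""
    return max(S, key=cv_accuracy)

# a hypothetical accuracy surface standing in for the mean of
# acc(TEi | TRi, C, gamma) over the folds; it peaks at C=10, gamma=1e-3
def fake_cv_accuracy(pair):
    C, gamma = pair
    return -((np.log10(C) - 1.0) ** 2 + (np.log10(gamma) + 3.0) ** 2)

S = [(C, g) for C in 10.0 ** np.arange(-2, 5) for g in 10.0 ** np.arange(-6, 1)]
best = select_hyperparams(S, fake_cv_accuracy)
# best == (10.0, 0.001)
```

The log-spaced grid over C and γ mirrors the usual grid-search practice described above.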
This usual cross-validation procedure may be too costly, and there have been many proposals to instead select one or both hyper-parameters by maximizing or minimizing some measure computed on the training set itself. We will call them internal metrics; [9] call them performance measures, [3] call them in-sample methods, and [1] call them methods based on Statistical Learning Theory.
If $m(D \mid C, \gamma)$ denotes one of these internal metrics applied to the data set $D$ with hyper-parameters $C$ and $\gamma$, then Eq. (3) becomes:

$$C^*, \gamma^* = \operatorname*{argmax}_{C,\, \gamma \in S} \; m(D \mid C, \gamma). \tag{4}$$

Again we are selecting the best hyper-parameters from a pre-defined set of pairs $S$ (for metrics that bound the error, the argmax is replaced by an argmin).
One of the ideas behind using internal metrics to select hyper-parameters is that the cost of the selection will probably be lower. For each pair of hyper-parameters, the cross-validation method has to learn an SVM for each of the $TR_i$ sets, while the internal-metric method requires only one learning step, on the whole of $D$. Furthermore, some of the internal metrics are convex functions of the hyper-parameters, so a gradient-descent method can be used to select the hyper-parameters instead of a grid search. This may further reduce the execution time of the whole learning process.
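To make the cost argument concrete, a back-of-the-envelope count of SVM training runs (the fold count and grid size are illustrative assumptions, not this paper's setup):

```python
# training runs needed to scan a grid of candidate pairs on one data set
k = 10                   # folds in k-fold CV (assumed)
grid = 15 * 15           # candidate (C, gamma) pairs (assumed 15 x 15 grid)
cv_trainings = k * grid          # CV: one SVM per fold per candidate pair
internal_trainings = grid        # internal metric: one SVM per pair, on all of D
print(cv_trainings, internal_trainings)   # prints: 2250 225
```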
The aim of this paper is to replicate and update the works of [9] and [1] (discussed below), which tested some internal-metric procedures against cross-validation procedures on a few data sets (5 and 13, respectively). We will perform a similar comparison of 6 internal metrics, one of which has not been used in the previous research, on 110 data sets. We will compare not only the quality of the hyper-parameter selection but also the execution time of each procedure.
Duan et al. [9] compare the following internal metrics:
- Xi-Alpha bound [12]
- Generalized approximate cross-validation [19]
- Approximate span bound [18]
- VC bound [17]
- Radius-margin bound [5]
- Modified radius-margin bound [7]
Duan et al. [9] concluded that the cross-validation procedures result in better classifiers. They also concluded that Xi-Alpha results in reasonable choices of the hyper-parameters, in the sense that the resulting classifier had an accuracy close to that of the cross-validation classifier, although the chosen hyper-parameters were not close to the ones chosen by cross-validation. They further concluded that the approximate span and VC bounds do not result in high-accuracy classifiers, and that the (unmodified) radius-margin bound also does not result in good hyper-parameter selections. The two remaining methods, modified radius-margin and generalized approximate cross-validation, result in choices worse than those of Xi-Alpha.
Anguita et al. [1] analyzed 5 cross-validation procedures (100 repetitions of bootstrap, 10 repetitions of bootstrap, 10-fold CV, leave-one-out, and a 70%/30% train/test split) and two internal metrics (the compression bound [11] and maximal discrepancy [4]). They tested the procedures on 13 data sets and found that all cross-validation procedures and the maximal discrepancy metric were able to select appropriate hyper-parameters.
This research replicates those works, but we use 110 binary data sets from the UCI repository; we do not test the VC bound, approximate span, and unmodified radius-margin bounds, given the negative results of [9]; and we include maximal discrepancy [4] and distance between two classes [16] as new internal metrics to be compared. We also compare the execution time of the procedures, since one of the intended goals of internal-metric methods is a faster selection procedure for hyper-parameters.
We must point out that this research does not cover all proposed internal metrics for selecting hyper-parameters. As discussed above, we removed from consideration some of the older proposals that were shown by other experiments to perform worse than the others, such as the VC bound, approximate span, and unmodified radius-margin bounds. We also do not cover a recent internal metric proposed in [14], based on the stability of the models; [14] show the efficacy of that metric in selecting hyper-parameters in the cases where d ≫ n, that is, high-dimensional data sets with few data points, which are not the cases tested in this paper. Another proposal not tested herein is the heuristic to select C and γ proposed in [13]: γ is selected based on the Euclidean distance (in the data space) between random samples of both classes, and C is selected from the distribution of values for a small sample of the data.
In the remainder of this paper, we present a brief review of the methods in Section 2; the results are presented in Section 3 and the conclusions in Section 4.
Methods
Next, we briefly review the internal metrics and the experimental setup.
Results
Table 1 displays the mean rank regarding accuracy for each of the 6 selection procedures and the mean rank regarding the execution time. The table is ordered by the mean accuracy rank.
The Friedman test of the 6 procedures results in a vanishingly small p-value (Friedman chi-squared = 296.05, df = 6). Tables 2 and 3 display the p-values of the Nemenyi pairwise comparisons between the selection methods, for the accuracies and for the execution times, respectively.
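The Friedman statistic reported above can be computed directly from the per-data-set ranks. A minimal sketch, assuming no ties within a data set (a real analysis, as in Demšar's procedure, would average the ranks of tied entries):

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-squared for an (N data sets x k methods) score matrix,
    higher score = better.  Assumes no ties within a row."""
    N, k = scores.shape
    # rank each row: rank 1 for the best (highest) score
    ranks = np.empty_like(scores)
    ranks[np.arange(N)[:, None], (-scores).argsort(axis=1)] = np.arange(1, k + 1)
    mean_ranks = ranks.mean(axis=0)
    return 12.0 * N / (k * (k + 1)) * (np.sum(mean_ranks ** 2) - k * (k + 1) ** 2 / 4.0)

# 4 data sets, 3 methods, identical ordering on every data set
scores = np.array([[3.0, 2.0, 1.0]] * 4)
stat = friedman_statistic(scores)   # stat == 8.0
```

A large statistic, compared against a chi-squared distribution with k − 1 degrees of freedom, justifies the Nemenyi post-hoc comparisons.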
In agreement with the previous literature [1], [4], [9], CV remains the selection procedure with the best expected accuracy on future data.
Conclusion
In accordance with the previous literature [1], [4], [9], cross-validation is the selection procedure for tuning SVM hyper-parameters that will have the lowest expected error on future data.
When selection time is a concern, we would recommend the use of the distance between two classes (DBTC) method. As described, the method first selects the γ (using a 1-D grid) that maximizes Eq. (5) (which represents the distance between the centers of the two classes in the feature space) and then, with that γ, selects C.
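A sketch of the γ step, assuming Eq. (5) is the kernel-trick expression for the squared distance between the two class means in feature space; the data, grid, and function names are illustrative:

```python
import numpy as np

def rbf(X, Y, gamma):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def center_distance(Xp, Xn, gamma):
    """Squared distance in RBF feature space between the class means,
    via the kernel trick: ||m+ - m-||^2 = mean(K++) + mean(K--) - 2 mean(K+-)."""
    return (rbf(Xp, Xp, gamma).mean() + rbf(Xn, Xn, gamma).mean()
            - 2.0 * rbf(Xp, Xn, gamma).mean())

# toy positive/negative classes; pick gamma on a 1-D log grid by
# maximizing the distance between the class centers
Xp = np.array([[0.0], [0.2]])
Xn = np.array([[2.0], [2.2]])
grid = 10.0 ** np.linspace(-3, 3, 25)
best_gamma = max(grid, key=lambda g: center_distance(Xp, Xn, g))
```

Because the scan is over γ alone, this step is a 1-D search rather than the full 2-D grid that cross-validation requires.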
References (20)
- et al., Hyperparameter design criteria for support vector classifiers, Neurocomputing (2003)
- et al., Evaluation of simple performance measures for tuning SVM hyperparameters, Neurocomputing (2003)
- et al., Theoretical and practical model selection methods for support vector classifiers, in: Support Vector Machines: Theory and Applications (2005)
- et al., Model selection for support vector machines: advantages and disadvantages of the machine learning theory, in: IEEE International Joint Conference on Neural Networks (2010)
- et al., In-sample and out-of-sample model selection and error estimation for support vector machines, IEEE Trans. Neural Netw. Learn. Syst. (2012)
- et al., Dynamically adapting kernels in support vector machines, Adv. Neural Inf. Process. Syst. (1999)
- et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) (2011)
- et al., Radius margin bounds for support vector machines with the RBF kernel, Neural Comput. (2003)
- Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006)
- et al., Do we need hundreds of classifiers to solve real world classification problems?, J. Mach. Learn. Res. (2014)