Empirical comparison of cross-validation and internal metrics for tuning SVM hyperparameters
Introduction
Support Vector Machines (SVMs) are commonly used in classification problems with two or more classes. In its general formulation, an SVM maps the input data $x$ into a high-dimensional feature space $\phi(x)$ and builds a hyperplane $w^T \phi(x) + b = 0$ to separate the examples of the two classes. For an L1 soft-margin SVM, this hyperplane is defined by solving the primal problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \tag{1}$$

where $x_i$ is a data example and $y_i \in \{-1, +1\}$ its label/class. Computationally, this problem is solved in its dual form:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad y^T \alpha = 0, \;\; 0 \le \alpha_i \le C, \tag{2}$$

where $e$ is a vector of ones, $Q_{ij} = y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is a kernel function that performs the $\phi$ mapping implicitly. The Gaussian Radial Basis Function (RBF) kernel, $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, is a common choice for the kernel function.
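As a concrete illustration of the RBF kernel, the following sketch (a hypothetical NumPy helper, not code from this paper) computes the Gram matrix $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    # squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negatives from rounding

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X, gamma=1.0)
# K[0, 0] == 1.0 (zero distance); K[0, 1] == exp(-1)
```

Note that γ controls how fast the similarity decays with distance, which is why it must be tuned jointly with C.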
The hyper-parameters C and γ must be defined before solving the minimization in Eq. (2) and must be carefully chosen to obtain a good accuracy. Choosing C and γ, known as hyper-parameter tuning or model selection, is usually done by performing a grid search over pairs of C and γ and testing each pair using some cross-validation procedure. Examples of cross validation procedures are: k-fold, repeated k-fold, bootstrap, leave one out, hold-out, among others.
Formally, if $D$ is the data set, a cross-validation procedure defines a set of $k$ pairs of sets $TR_i$ and $TE_i$, called train and test sets, such that:

$$TR_i \cup TE_i = D, \qquad TR_i \cap TE_i = \emptyset, \qquad i = 1, \ldots, k.$$

Let us use the notation $\mathrm{acc}(A \mid B, C, \gamma)$ to indicate the accuracy on a data set $A$ of an SVM trained on the data set $B$ with hyper-parameters $C$ and $\gamma$.

Then the cross-validation accuracy of the pair $C, \gamma$ (for a data set $D$ under some cross-validation procedure) is:

$$\mathrm{cv}(C, \gamma) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{acc}(TE_i \mid TR_i, C, \gamma).$$

The best set of hyper-parameters from a set of candidates $S$ is:

$$C^*, \gamma^* = \operatorname*{argmax}_{C,\, \gamma \in S} \; \mathrm{cv}(C, \gamma). \tag{3}$$

The expression $C, \gamma \in S$ indicates that we are selecting the best $C$ and $\gamma$ from a pre-defined set of candidates (see Section 2.2).
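The selection of Eq. (3) can be sketched as below. The fold-splitting helper and the accuracy surface are illustrative stand-ins; a real run would train an SVM on each $TR_i$ and evaluate it on $TE_i$:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield disjoint (TRi, TEi) index pairs whose union is {0, ..., n-1}."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        yield tr, te

def select_hyperparams(S, cv_accuracy):
    """Eq. (3): the (C, gamma) pair in S with the highest CV accuracy."""
    return max(S, key=cv_accuracy)

# a hypothetical accuracy surface standing in for the mean of
# acc(TEi | TRi, C, gamma) over the folds; it peaks at C=10, gamma=1e-3
def fake_cv_accuracy(pair):
    C, gamma = pair
    return -((np.log10(C) - 1.0) ** 2 + (np.log10(gamma) + 3.0) ** 2)

S = [(C, g) for C in 10.0 ** np.arange(-2, 5) for g in 10.0 ** np.arange(-6, 1)]
best = select_hyperparams(S, fake_cv_accuracy)
# best == (10.0, 0.001)
```

The log-spaced grid over C and γ mirrors the usual grid-search practice described above.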
This usual cross-validation procedure may be too costly, and there have been many proposals to instead select one or both hyper-parameters by maximizing or minimizing some measure computed on the training set itself. We will call them internal metrics; [9] call them performance measures, [3] call them in-sample methods, and [1] call them methods based on Statistical Learning Theory.
If $m(D \mid C, \gamma)$ denotes one of these internal metrics applied to the data set $D$ with hyper-parameters $C$ and $\gamma$, then Eq. (3) becomes:

$$C^*, \gamma^* = \operatorname*{argmax}_{C,\, \gamma \in S} \; m(D \mid C, \gamma). \tag{4}$$

Again we are selecting the best hyper-parameters from a pre-defined set of pairs $S$ (for metrics that bound the error, the argmax is replaced by an argmin).
One of the ideas behind using internal metrics to select hyper-parameters is that the cost of the selection will probably be lower. For each pair of hyper-parameters, the cross-validation method has to learn an SVM for each of the $TR_i$ sets, while the internal-metric method requires only one learning step, on the whole of $D$. Furthermore, some of the internal metrics are convex functions of the hyper-parameters, so a gradient-descent method can be used to select the hyper-parameters instead of a grid search. This may further reduce the execution time of the whole learning process.
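To make the cost argument concrete, a back-of-the-envelope count of SVM training runs (the fold count and grid size are illustrative assumptions, not this paper's setup):

```python
# training runs needed to scan a grid of candidate pairs on one data set
k = 10                   # folds in k-fold CV (assumed)
grid = 15 * 15           # candidate (C, gamma) pairs (assumed 15 x 15 grid)
cv_trainings = k * grid          # CV: one SVM per fold per candidate pair
internal_trainings = grid        # internal metric: one SVM per pair, on all of D
print(cv_trainings, internal_trainings)   # prints: 2250 225
```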
The aim of this paper is to replicate and update the works of [9] and [1] (discussed below), which tested some internal-metric procedures against cross-validation procedures on a few data sets (5 and 13, respectively). We will perform a similar comparison of 6 internal metrics, one of which has not been used in the previous research, on 110 data sets. We will compare not only the quality of the hyper-parameter selection but also the execution time of each procedure.
Duan et al. [9] compare the following internal metrics:
- Xi-Alpha bound [12]
- Generalized approximate cross-validation [19]
- Approximate span bound [18]
- VC bound [17]
- Radius-margin bound [5]
- Modified radius-margin bound [7]
Duan et al. [9] concluded that the cross-validation procedures result in better classifiers. They also concluded that Xi-Alpha results in reasonable choices of the hyper-parameters, in the sense that the resulting classifier had an accuracy close to that of the cross-validation classifier, although the chosen hyper-parameters were not close to the ones chosen by cross-validation. They further concluded that the approximate span and VC bounds do not result in high-accuracy classifiers, and that the (unmodified) radius-margin bound also does not result in good hyper-parameter selections. The two remaining methods, modified radius-margin and generalized approximate cross-validation, result in choices worse than those of Xi-Alpha.
Anguita et al. [1] analyzed 5 cross-validation procedures (100 repetitions of bootstrap, 10 repetitions of bootstrap, 10-fold CV, leave-one-out, and a 70%/30% train/test split) and two internal metrics (the compression bound [11] and maximal discrepancy [4]). They tested the procedures on 13 data sets and found that all cross-validation procedures and the maximal discrepancy metric were able to select appropriate hyper-parameters.
This research replicates those works, but we use 110 binary data sets from the UCI repository; we do not test the VC bound, approximate span, and unmodified radius-margin bounds, given the negative results of [9]; and we include maximal discrepancy [4] and distance between two classes [16] as new internal metrics to be compared. We also compare the execution time of the procedures, since one of the intended goals of internal-metric methods is a faster selection procedure for hyper-parameters.
We must point out that this research does not cover all proposed internal metrics for selecting hyper-parameters. As discussed above, we removed from consideration some of the older proposals that were shown by other experiments to perform worse than the others, such as the VC bound, approximate span, and unmodified radius-margin bounds. We also do not cover a recent internal metric proposed in [14], based on the stability of the models; [14] show the efficacy of that metric in selecting hyper-parameters in the cases where d ≫ n, that is, high-dimensional data sets with few data points, which are not the cases tested in this paper. Another proposal not tested herein is the heuristic to select C and γ proposed in [13]: γ is selected based on the Euclidean distance (in the data space) between random samples of both classes, and C is selected from the distribution of values for a small sample of the data.
In the remainder of this paper, we present a brief review of the methods in Section 2; the results are presented in Section 3 and the conclusions in Section 4.
Methods
Next, we briefly review the internal metrics and the experimental setup.
Results
Table 1 displays the mean rank regarding accuracy for each of the 6 selection procedures and the mean rank regarding the execution time. The table is ordered by the mean accuracy rank.
The Friedman test of the 6 procedures results in a vanishingly small p-value (Friedman chi-squared = 296.05, df = 6). Tables 2 and 3 display the p-values of the Nemenyi pairwise comparisons between the selection methods, for the accuracies and for the execution times, respectively.
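The Friedman statistic reported above can be computed directly from the per-data-set ranks. A minimal sketch, assuming no ties within a data set (a real analysis, as in Demšar's procedure, would average the ranks of tied entries):

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-squared for an (N data sets x k methods) score matrix,
    higher score = better.  Assumes no ties within a row."""
    N, k = scores.shape
    # rank each row: rank 1 for the best (highest) score
    ranks = np.empty_like(scores)
    ranks[np.arange(N)[:, None], (-scores).argsort(axis=1)] = np.arange(1, k + 1)
    mean_ranks = ranks.mean(axis=0)
    return 12.0 * N / (k * (k + 1)) * (np.sum(mean_ranks ** 2) - k * (k + 1) ** 2 / 4.0)

# 4 data sets, 3 methods, identical ordering on every data set
scores = np.array([[3.0, 2.0, 1.0]] * 4)
stat = friedman_statistic(scores)   # stat == 8.0
```

A large statistic, compared against a chi-squared distribution with k − 1 degrees of freedom, justifies the Nemenyi post-hoc comparisons.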
In agreement with the previous literature [1], [4], [9], CV remains the selection procedure with the best expected accuracy on future data.
Conclusion
In accordance with the previous literature [1], [4], [9], cross-validation is the selection procedure for tuning SVM hyper-parameters that will have the lowest expected error on future data.
When selection time is a concern, we would recommend the use of the distance between two classes (DBTC) method. As described, the method first selects the γ (using a 1-D grid) that maximizes Eq. (5) (which represents the distance between the centers of the two classes in the feature space) and then, with that γ, selects C.
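A sketch of the γ step, assuming Eq. (5) is the kernel-trick expression for the squared distance between the two class means in feature space; the data, grid, and function names are illustrative:

```python
import numpy as np

def rbf(X, Y, gamma):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def center_distance(Xp, Xn, gamma):
    """Squared distance in RBF feature space between the class means,
    via the kernel trick: ||m+ - m-||^2 = mean(K++) + mean(K--) - 2 mean(K+-)."""
    return (rbf(Xp, Xp, gamma).mean() + rbf(Xn, Xn, gamma).mean()
            - 2.0 * rbf(Xp, Xn, gamma).mean())

# toy positive/negative classes; pick gamma on a 1-D log grid by
# maximizing the distance between the class centers
Xp = np.array([[0.0], [0.2]])
Xn = np.array([[2.0], [2.2]])
grid = 10.0 ** np.linspace(-3, 3, 25)
best_gamma = max(grid, key=lambda g: center_distance(Xp, Xn, g))
```

Because the scan is over γ alone, this step is a 1-D search rather than the full 2-D grid that cross-validation requires.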
References (20)
- et al., Hyperparameter design criteria for support vector classifiers, Neurocomputing (2003)
- et al., Evaluation of simple performance measures for tuning SVM hyperparameters, Neurocomputing (2003)
- et al., Theoretical and practical model selection methods for support vector classifiers, in: Support Vector Machines: Theory and Applications (2005)
- et al., Model selection for support vector machines: advantages and disadvantages of the machine learning theory, in: IEEE International Joint Conference on Neural Networks (2010)
- et al., In-sample and out-of-sample model selection and error estimation for support vector machines, IEEE Trans. Neural Netw. Learn. Syst. (2012)
- et al., Dynamically adapting kernels in support vector machines, Adv. Neural Inf. Process. Syst. (1999)
- et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (TIST) (2011)
- et al., Radius margin bounds for support vector machines with the RBF kernel, Neural Comput. (2003)
- Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006)
- et al., Do we need hundreds of classifiers to solve real world classification problems?, J. Mach. Learn. Res. (2014)