Finding optimal model parameters by deterministic and annealed focused grid search
Introduction
Successful data mining applications usually rely on the appropriate choice of several parameters for the modelling paradigm that is to be used. A very well-known case is that of multilayer perceptrons (MLPs; [1]), where parameters to be chosen are the number of hidden layers and of units in each layer, the learning rate and momentum factor for gradient descent learning or the penalty factor for weight regularization. Other examples are support vector machines (SVMs; [2], [3]), where a margin slack penalty factor has to be chosen for nonlinearly separable models, to which one has to add the kernel width if Gaussian kernels are to be used or the value of the ε-insensitivity parameter in SV regression. However, there may be other reasons that force the selection of modelling parameters. For instance, some sort of input preprocessing is quite often needed, such as the selection of the percentage of variance to be retained if principal components are used for input dimensionality reduction or, in time series applications, the number of delays or the modelling window size to be used.
Some ad-hoc approaches can be applied in concrete settings. For instance, leave-one-out error estimation can be used to derive parameter-dependent error bounds for SVMs [4] which, in turn, make it possible to find optimal SVM parameters by gradient descent on the resulting error model. On the other hand, a relatively standard technique (see [5], Chapter 14) in the design of experiments field is to test the model performance in a few points of the parameter space and to fit a first or second order empirical error surface over the results obtained, which is then taken as an approximation to the real underlying error landscape. If it exists, this surface's minimum is used to provide optimal model parameters.
In any case, the previous situations are the exception rather than the rule. More often (and quite particularly for hybrid systems), several rather different and competing modelling techniques are to be used so that an adequately chosen combination offers less variance than that of its individual components. In such a situation one cannot count on a direct knowledge of the effectiveness of each individual model for the problem at hand and it is thus clear that no general parameter setting procedure exists other than some kind of search on the parameter space. There are then two extreme approaches: a more or less exhaustive, deterministic parameter search [6], [7] and, on the other hand, a stochastic metamodel search, typically using genetic algorithms [8], [9] or genetic programming [10].
Deterministic parameter search is usually done over a grid obtained by the discretization of each individual parameter's allowable range. This discretization appears naturally for integer-valued parameters (such as, for instance, the number of hidden layers or units of an MLP) and, for a continuous parameter in a range [a, b], a resolution limit δ is fixed and one explores the parameter values a + iδ, 0 ≤ i ≤ N, where δ = (b − a)/N. We may assume that N = 2^K and call K the depth of the search. An obvious way of exploring the resulting grid associated with an M parameter model would be to perform a linear search over all the grid's parameter values. However, the complexity of essentially any model selection procedure is determined by the number of different models to be built (in our classification examples the fitness function will be the model errors on a validation subset), and here the cost of (2^K + 1)^M model constructions rules out its application outside very low values of M and K.
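The exhaustive linear grid search just described can be sketched as follows; the function names and the toy quadratic fitness (standing in for a validation error) are illustrative, not the paper's:

```python
from itertools import product

def linear_grid_search(ranges, depth, fitness):
    """Exhaustively evaluate `fitness` on a grid with 2**depth + 1
    values per parameter; returns the best point and its fitness.
    The cost is (2**depth + 1)**M model constructions for M parameters."""
    axes = []
    for lo, hi in ranges:
        n = 2 ** depth                       # N = 2^K subdivisions per range
        axes.append([lo + i * (hi - lo) / n for i in range(n + 1)])
    best_point, best_fit = None, float("inf")
    for point in product(*axes):             # (2^K + 1)^M grid points
        f = fitness(point)                   # one model build per point
        if f < best_fit:
            best_point, best_fit = point, f
    return best_point, best_fit

# toy fitness standing in for a validation error surface
best, err = linear_grid_search([(-1.0, 1.0), (-1.0, 1.0)], depth=3,
                               fitness=lambda p: (p[0] - 0.5) ** 2 + p[1] ** 2)
# → best = (0.5, 0.0), err = 0.0
```

Even in this tiny two-parameter example the search already builds 9 × 9 = 81 models; for moderate M the (2^K + 1)^M growth quickly becomes prohibitive.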
The simplest way out of this is to introduce the possibility of a random evolution in the search procedure. A first example somewhat related to grid search is to fill the parameter space using uniform experimental design [11], where a number L of patterns is chosen so that the squared-norm discrepancy between their empirical cumulative distribution and that of the uniform distribution is small. Another widely used option is to stochastically explore the parameter space in an evolutionary setting. The well-known covariance matrix adaptation evolution strategy (CMA-ES), proposed by Hansen and Ostermeier [12], [13], is one of the most effective black-box random optimizers. Briefly, CMA-ES produces from a population P_t of μ individuals a number λ of new individuals with genotypes x_i = m_t + z_i, where m_t is the mass centre of the previous population and the perturbations z_i are independent realizations of an M-dimensional Gaussian distribution. The μ offspring with the best fitness are then selected to form the new population to be used at step t + 1. Moreover, at each step the covariance matrix of the Gaussian is adaptively updated in such a way that the probability of applying again previous steps that led to larger gains is maximized. The complexity of CMA-ES based model selection is clearly λ times the number of generations explored.
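A minimal sketch of the sampling-and-selection core of such a (μ, λ) evolution strategy is given below. Full CMA-ES additionally adapts the whole covariance matrix and the step size; here, as a simplification, the covariance is kept spherical with a crude decay schedule, and all names are ours:

```python
import random

def simple_es(fitness, m, sigma=0.5, lam=20, mu=5, generations=60, seed=0):
    """Skeleton of a (mu, lambda) evolution strategy: lambda offspring are
    Gaussian perturbations of the population mass centre, the mu fittest
    survive and define the next centre.  Real CMA-ES also adapts the full
    covariance matrix; this sketch uses a fixed spherical one."""
    rng = random.Random(seed)
    M = len(m)
    for _ in range(generations):
        offspring = [[mi + sigma * rng.gauss(0.0, 1.0) for mi in m]
                     for _ in range(lam)]
        offspring.sort(key=fitness)          # lambda fitness evaluations
        survivors = offspring[:mu]           # truncation selection
        m = [sum(x[j] for x in survivors) / mu for j in range(M)]  # new mass centre
        sigma *= 0.95                        # crude step-size decay (not CMA's rule)
    return m

# minimize a toy quadratic fitness; the centre drifts towards (1, 1)
centre = simple_es(lambda p: sum((x - 1.0) ** 2 for x in p), m=[0.0, 0.0])
```

The cost is λ fitness evaluations per generation, exactly as stated above for CMA-ES based model selection.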
While being in general quite effective, these approaches are not problem-free. For instance, finding directly the appropriate space filling points in uniform design (UD) is a difficult endeavor. Tables exist that provide good preselected values, but the number of their points may be too small if a moderate to large number of parameters has to be set. On the other hand, CMA-ES is a rather complicated and difficult to parametrize and implement procedure. Moreover, it must start from a single point in parameter space and its exploration may therefore be less thorough.
As a simpler alternative, we shall propose in this paper two grid-based procedures which, starting from the outer grid points, successively narrow the search to smaller half-size grids centred at the point giving the best fitness value in the previous iteration. Because of this we will term them focused grid searches (FGS). The first one is a deterministic FGS (DFGS) for which we will begin by discussing a simple general version whose number of model constructions grows exponentially in the number M of model parameters but only linearly in the grid depth K; we will also show how to refine it to achieve an extra depth of precision while requiring the same number of model trainings in the best case and only a moderately larger number in the worst case. Notice, however, that the exponential cost in M remains and, to alleviate it, we shall also introduce an evolution strategy on the previous focused grid search, in which we will only consider a fixed number of points instead of all the outer grid points used in DFGS. These points will be selected through a simulated annealing-like procedure and the one giving the best fitness will be the centre of a new half-size grid; we shall call the resulting procedure annealed FGS (AFGS). While the complexity of DFGS is essentially fixed once we choose the grid depth K, in AFGS we can control it by fixing the number of outer grid points to be randomly examined at each level. Thus, for a depth-K search, AFGS requires at most a number of model constructions proportional to K, and its cost is then comparable to that of CMA-ES whenever the latter performs a comparable number of fitness evaluations.
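The annealed variant can be sketched as follows. This is only an illustration of the idea under our own assumptions: the parameter names, the geometric cooling schedule and the random choice of grid offsets are ours, not necessarily the paper's exact procedure:

```python
import math
import random

def afgs(fitness, centre, radii, depth=4, n_points=10, temp=1.0, cool=0.5, seed=0):
    """Sketch of an annealed focused grid search: at each of `depth` levels
    only `n_points` randomly drawn grid points are evaluated, and a worse
    point may still become the current grid centre with a simulated
    annealing-like acceptance probability; the radii are halved per level."""
    rng = random.Random(seed)
    M = len(centre)
    best, best_fit = list(centre), fitness(centre)
    cur, cur_fit = list(best), best_fit
    for _ in range(depth):
        for _ in range(n_points):            # fixed budget instead of all outer grid points
            cand = [cur[j] + rng.choice((-1, 0, 1)) * radii[j] for j in range(M)]
            f = fitness(cand)
            if f < cur_fit or rng.random() < math.exp((cur_fit - f) / temp):
                cur, cur_fit = cand, f       # annealed acceptance of a (possibly worse) centre
            if f < best_fit:
                best, best_fit = list(cand), f
        radii = [r / 2 for r in radii]       # focus: half-size grid
        temp *= cool                         # cooling lowers the acceptance of worse points
    return best, best_fit
```

With `n_points` fixed, the total budget is at most `depth * n_points` model constructions, i.e. linear in the search depth K.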
This paper is organized as follows. In Section 2 we will briefly review standard linear grid search, uniform design parameter value selection and, in more detail, the CMA-ES algorithm that we will use for comparisons with our deterministic and annealed FGS procedures. We will present our DFGS and AFGS proposals in Section 3 and we will compare them against CMA-ES in Section 4 over several classification problems which will be solved using MLPs and SVMs. MLPs will have a standard 1-hidden layer structure and will be trained in a batch setting by conjugate gradient minimization. We shall train SVMs for classification using the MDM algorithm [14] with quadratic margin slack penalties and Gaussian kernels. We will consider two comparison settings. First we will deal with what we may call “small parameter set” problems. By this we mean selecting the optimal number of hidden units and the weight decay parameter for MLPs, and the Gaussian kernel width and penalty factor C in SVMs; that is, we will select just two optimal parameters for both MLPs and SVMs. On the other hand, we shall also consider “large parameter set” problems for SVMs, where we will seek, besides the penalty factor C, the best parameters of an anisotropic Gaussian kernel. The number of parameters is now D + 1, with D the pattern dimension in the original feature space. This makes DFGS too costly and in this second situation we shall just compare AFGS and CMA-ES searches. (We observe that in [15] the anisotropic SVM parameter estimation problem is considered by applying first a preliminary grid search to obtain an initial choice and then tuning it using CMA-ES.)
As our results in Section 4 will illustrate, there is no clear winner among the approaches considered here in either parameter-number scenario, and the choice of one particular technique may rest on other considerations. For instance, while practical only for models with few parameters, DFGS avoids the parametrization required by stochastic searches, and AFGS, with few and very simple parameters, can be used with more complex models. On the other hand, while CMA-ES has undergone a very thorough theoretical and experimental analysis and good implementations in several programming languages are available [16], it is still a fairly complex procedure that does not lend itself to an easy standalone implementation and whose own parametrization is difficult. In contrast, DFGS and AFGS are procedurally much simpler and, thus, easier to implement. This paper will end with a brief discussion and conclusions section.
Section snippets
Previous work
In this section we will briefly describe standard linear grid search, establishing in part the notation to be used in the next section, and, then, review the application of uniform design to optimal parameter selection and, in more detail, the CMA-ES procedure that we shall use in our experimental comparisons.
Linear grid search is the simplest (but also costliest) way to find optimal model parameters. A grid structure lends itself naturally to search for discrete parameters while
Deterministic focused grid search
Our starting option will be a simple parameter grid search, where we will start at an initial outer grid made up of vectors of the form (c_1 + ε_1 r_1, …, c_M + ε_M r_M) with ε_j ∈ {−1, 0, 1}, where we assume parameter ranges [c_j − r_j, c_j + r_j]; the c_j are the coordinates of the centre of the parameter hypercube to be explored and the r_j its half-widths. In principle, the point giving the best fitness will be taken as the centre of a new, smaller half-size grid of points of the same form…
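The deterministic focusing step just outlined can be sketched as follows; this is a minimal illustration under our own naming and a fixed stopping depth, not the paper's exact refined procedure:

```python
from itertools import product

def dfgs(fitness, centre, radii, depth=4):
    """Sketch of a deterministic focused grid search: at each level,
    evaluate the 3**M points c_j + eps_j * r_j, eps_j in {-1, 0, 1},
    recentre on the best point found and halve the radii."""
    M = len(centre)
    best, best_fit = list(centre), fitness(centre)
    for _ in range(depth):
        # 3^M grid centred at the best point of the previous level
        for eps in product((-1, 0, 1), repeat=M):
            cand = [centre[j] + eps[j] * radii[j] for j in range(M)]
            f = fitness(cand)
            if f < best_fit:
                best, best_fit = cand, f
        centre = list(best)                  # refocus on the best point
        radii = [r / 2 for r in radii]       # half-size grid for the next level
    return best, best_fit
```

Each level costs 3^M model constructions, so the total budget grows linearly in the depth but exponentially in the number of parameters M.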
Numerical experiments
We will illustrate the preceding techniques over Rätsch's classification datasets available in [21] and listed in Table 1 together with their number of patterns and attributes. These datasets are given as 100 train–test pairs; we shall use these splits in our experiments. Here we shall work with both Gaussian kernel SVMs and single hidden layer multilayer perceptrons. For SVMs we shall allow for margin slacks with quadratic penalties; that is, we use a criterion function of the form
Discussion and conclusions
Present day machine learning offers a very wide range of modelling methods and while for a given problem, the ability to select the best one would be highly desirable, this same high number of options makes that choice quite difficult. Besides, even when one particular model is chosen, it is not an easy task to come up with good values for that model's specific parameters. On the other hand, if the model is itself general enough to cover a wide range of problems, an appropriate choice of
Acknowledgements
This work was partially supported by Spain's grants TIN 2004-07676 and TIN 2007-66862. The first author is supported by the FPU-MEC grant, reference AP2006-02285.
References (22)
- et al., Model selection for support vector machines via uniform design, Computational Statistics and Data Analysis (2007)
- et al., Evolutionary tuning of multiple SVM parameters, Neurocomputing (2005)
- et al., Uniform experimental designs and their applications in industry, Handbook of Statistics (2003)
- et al., Pattern Classification (2000)
- The Nature of Statistical Learning Theory (1995)
- et al., Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Machine Learning (2002)
- Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms, IEEE Transactions on Neural Networks (2002)
- Design and Analysis of Experiments (1976)
- C.-W. Hsu, Ch.-Ch. Chang, Ch.-J. Lin, A practical guide to support vector classification...
- C. Staelin, Parameter selection for support vector machines, Technical Report HPL-2002-354, HP Laboratories, Israel,...
- A review of genetic algorithms applied to training radial basis function networks, Neural Computing & Applications
A. Barbero received the Computer Scientist degree from the Universidad Autónoma de Madrid (UAM) in 2006, and is currently a student in a Master/Ph.D. degree in Computer Science at the same university. At present, he is working at the UAM under a National Predoctoral Grant, in collaboration with the Instituto de Ingeniería del Conocimiento. His research interests are in pattern recognition, kernel methods and wind power forecasting.
J. López received his Computer Engineering degree from the Universidad Autónoma de Madrid in 2006, where he got an Honorific Mention as the best student. Currently he is attending the Postgraduate Programme organized by the Computer Engineering Department of the same university. His research interests concentrate on support vector machines, but also cover additional machine learning and pattern recognition paradigms.
J.R. Dorronsoro received the Licenciado en Matemáticas degree from the Universidad Complutense de Madrid in 1977 and the Ph.D. degree in Mathematics from Washington University in St. Louis in 1981. Currently he is Full Professor in the Computer Engineering Department of the Universidad Autónoma de Madrid, of which he was the head from 1993 to 1996. His research interest is in neural networks, image processing and pattern recognition.