Neurocomputing

Volume 72, Issues 13–15, August 2009, Pages 2824–2832

Finding optimal model parameters by deterministic and annealed focused grid search

https://doi.org/10.1016/j.neucom.2008.09.024

Abstract

Finding optimal model parameters is usually a crucial task in engineering applications of classification and modelling. The exponential cost of linear search on a parameter grid of a given precision rules it out in all but the simplest problems, and random algorithms such as uniform design or the covariance matrix adaptation evolution strategy (CMA-ES) are usually applied instead. In this work we shall present two focused grid search (FGS) alternatives in which one repeatedly zooms into more concentrated sets of discrete grid points in the parameter search space. The first one, deterministic FGS (DFGS), is much faster than standard search although still too costly in problems with a large number of parameters. The second one, annealed FGS (AFGS), is a random version of DFGS where a fixed fraction of grid points is randomly selected and examined. As we shall see numerically over several classification problems for multilayer perceptrons and support vector machines, DFGS and AFGS are competitive with CMA-ES, one of the most successful evolutionary black-box optimizers. The choice of a concrete technique may thus rest on other considerations, and the simplicity and basically parameter-free nature of both DFGS and AFGS may make them worthwhile alternatives to CMA-ES, with its thorough theoretical and experimental background.

Introduction

Successful data mining applications usually rely on the appropriate choice of several parameters for the modelling paradigm that is to be used. A very well-known case is that of multilayer perceptrons (MLPs; [1]), where parameters to be chosen are the number of hidden layers and of units in each layer, the learning rate and momentum factor for gradient descent learning or the penalty factor for weight regularization. Other examples are support vector machines (SVMs; [2], [3]), where a margin slack penalty factor has to be chosen for nonlinearly separable models, to which one has to add the kernel width if Gaussian kernels are to be used or the value of the ε-insensitivity parameter in SV regression. However, there may be other reasons that force the selection of modelling parameters. For instance, some sort of input preprocessing is quite often needed, such as the selection of the percentage of variance to be retained if principal components are used for input dimensionality reduction or, in time series applications, the number of delays or the modelling window size to be used.

Some ad-hoc approaches can be applied in concrete settings. For instance, leave-one-out error estimation can be used to derive parameter-dependent error bounds for SVMs [4] which, in turn, make it possible to find optimal SVM parameters by gradient descent on the resulting error model. On the other hand, a relatively standard technique (see [5], Chapter 14) in the design of experiments field is to test the model performance in a few points of the parameter space and to fit a first or second order empirical error surface over the results obtained, which is then taken as an approximation to the real underlying error landscape. If it exists, this surface's minimum is used to provide optimal model parameters.

In any case, the previous situations are the exception rather than the rule. More often (and quite particularly for hybrid systems), several rather different and competing modelling techniques are to be used so that an adequately chosen combination offers less variance than that of its individual components. In such a situation one cannot count on a direct knowledge of the effectiveness of each individual model for the problem at hand and it is thus clear that no general parameter setting procedure exists other than some kind of search on the parameter space. There are then two extreme approaches: a more or less exhaustive, deterministic parameter search [6], [7] and, on the other hand, a stochastic metamodel search, typically using genetic algorithms [8], [9] or genetic programming [10].

Deterministic parameter search is usually done over a grid obtained by discretizing each individual parameter's allowable range. This discretization appears naturally for integer-valued parameters (such as, for instance, the number of hidden layers or units of an MLP) and, for a continuous parameter α in a range [a,b], a resolution limit δ is fixed and one explores the parameter values {a, a+δ, a+2δ, …, a+Hδ}, where H = (b−a)/δ. We may assume that H = 2^K and call K the depth of the search. An obvious way of exploring the resulting grid G associated to an M-parameter model would be to perform a linear search over all the grid's parameter values. However, the complexity of essentially any model selection procedure is determined by the number of different models to be built (in our classification examples the fitness function will be the model error on a validation subset) and here its cost would be (1+2^K)^M ≈ 2^{KM}, which rules out its application outside very low values of M and K.
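To make this cost concrete, the following is a minimal Python sketch of such a linear grid search; the fitness function (for instance, the validation error of a trained model) and the parameter ranges are assumed to be supplied by the caller, and all names are illustrative.

    import itertools
    import numpy as np

    def linear_grid_search(fitness, ranges, K):
        """Exhaustive search on a grid of depth K (sketch).

        ranges : list of (a_i, b_i) pairs, one per parameter.
        Each range is discretized into 1 + 2**K points with step
        delta_i = (b_i - a_i) / 2**K, so the full grid holds
        (1 + 2**K)**M candidates for an M-parameter model.
        """
        axes = [np.linspace(a, b, 1 + 2**K) for a, b in ranges]
        best_params, best_fit = None, np.inf
        for params in itertools.product(*axes):   # (1 + 2**K)**M fitness evaluations
            f = fitness(params)
            if f < best_fit:
                best_params, best_fit = params, f
        return best_params, best_fit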

The simplest way out of this is to introduce the possibility of a random evolution in the search procedure. A first example somewhat related to grid search is to fill the parameter space using uniform experimental design [11], where a number L of patterns is chosen so that the square norm discrepancy between their empirical cumulative distribution and that of the uniform distribution is small. Another widely used option is to stochastically explore the parameter space in an evolutionary setting. The well-known (μ,λ) covariance matrix adaptation evolution strategy (CMA-ES), proposed by Hansen and Ostermeier [12], [13], is one of the most effective black-box random optimizers. Briefly, (μ,λ) CMA-ES produces from a population X_l^t, 1 ≤ l ≤ μ, of μ individuals a number λ of new individuals with genotypes X_l^{t+1} = m^t + ξ_l^t, where m^t is the mass centre of the previous population and the perturbations ξ_l^t are independent realizations of an M-dimensional Gaussian distribution N(0, C^t). The μ offspring with the best fitness are then selected to form the new population to be used at step t+1. Moreover, at each step the covariance matrix C^t is adaptively updated in such a way that the probability of applying again previous steps that lead to larger gains is maximized. The complexity of CMA-ES based model selection is clearly λ times the number of generations explored.
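As a rough illustration of the sampling-and-selection loop just described (and only of that: step-size control, evolution paths and the actual CMA-ES covariance update are deliberately omitted), a simplified (μ,λ) evolution strategy might look as follows; all names and default values are illustrative.

    import numpy as np

    def simple_mu_lambda_es(fitness, m0, n_gen=50, mu=5, lam=20, sigma=0.3, seed=0):
        """Very simplified (mu, lambda) ES in the spirit of CMA-ES (sketch).

        Offspring are drawn as the current mass centre plus Gaussian
        perturbations N(0, sigma**2 * C); the mu best offspring define the
        next mass centre and a crude re-estimate of C.  This is NOT the
        full CMA-ES update; it only illustrates the loop.
        """
        rng = np.random.default_rng(seed)
        M = len(m0)
        m = np.asarray(m0, dtype=float)
        C = np.eye(M)
        for _ in range(n_gen):
            offspring = rng.multivariate_normal(m, sigma**2 * C, size=lam)
            fits = np.array([fitness(x) for x in offspring])
            selected = offspring[np.argsort(fits)[:mu]]   # mu best (lowest fitness)
            m = selected.mean(axis=0)                      # new mass centre
            # crude covariance re-estimation from the selected individuals
            C = np.cov(selected, rowvar=False) + 1e-8 * np.eye(M)
        return m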

While in general quite effective, these approaches are not problem-free. For instance, directly finding appropriate space-filling points in uniform design (UD) is a difficult endeavor. Tables exist that provide good preselected values, but the number of points they contain may be too small when a moderate to large number of parameters has to be set. On the other hand, CMA-ES is a rather complicated procedure that is difficult to parametrize and implement. Moreover, it must start from a single point in parameter space and its exploration may therefore be less thorough.

As a simpler alternative, we shall propose in this paper two grid-based procedures which, starting from the outer grid points, successively narrow the search to smaller half-size grids centred at the point giving the best fitness value in the previous iteration. Because of this we will term them focused grid searches (FGS). The first one is a deterministic FGS (DFGS), for which we will begin by discussing a simple general version that requires K·3^M model constructions for a grid depth of K and an M-parameter model; we will also show how to refine it to achieve an extra K+1 depth precision while requiring the same number K·3^M of model trainings in the best case and K(3^M + 2^M) in the worst case. Notice, however, that the exponential cost in M remains; to alleviate it we shall also introduce an evolution strategy on the previous focused grid search, in which we will only consider a fixed number Λ of points instead of the 3^M points used in DFGS. These points will be selected through a simulated annealing-like procedure and the one giving the best fitness will be the centre of a new half-size grid; we shall call the resulting procedure annealed FGS (AFGS). While the complexity of DFGS is essentially fixed once we choose the grid depth K, in AFGS we can control it by fixing the number Λ of outer grid points to be randomly examined. Thus, for a depth-K search, AFGS will require at most K·Λ model constructions and its cost will then be comparable to that of CMA-ES whenever the latter performs more than K·Λ/λ generations.
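A minimal sketch of the AFGS idea under these conventions is given below; the paper's simulated annealing-like selection of candidates is replaced here by plain uniform sampling of grid offsets, so this should be read as an illustration of the zooming loop and its K·Λ cost rather than as the actual algorithm.

    import numpy as np

    def afgs(fitness, ranges, K, Lambda, seed=0):
        """Sketch of annealed focused grid search (AFGS).

        At each of the K zooming steps only Lambda of the 3**M candidate
        offsets gamma in {-1/2, 0, 1/2}**M are drawn and evaluated, and a
        half-size grid is recentred on the best point seen so far, so at
        most K*Lambda models are built.
        """
        rng = np.random.default_rng(seed)
        lo = np.array([a for a, b in ranges], dtype=float)
        hi = np.array([b for a, b in ranges], dtype=float)
        delta, M = hi - lo, len(ranges)
        centre = lo + delta / 2.0
        best_params, best_fit = centre.copy(), fitness(centre)
        for _ in range(K):
            for _ in range(Lambda):
                gamma = rng.choice((-0.5, 0.0, 0.5), size=M)   # random grid offset
                cand = np.clip(centre + gamma * delta, lo, hi)
                f = fitness(cand)
                if f < best_fit:
                    best_params, best_fit = cand.copy(), f
            centre, delta = best_params.copy(), delta / 2.0     # zoom: half-size grid
        return best_params, best_fit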

This paper is organized as follows. In Section 2 we will briefly review standard linear grid search, uniform design parameter value selection and, in more detail, the (μ,λ) CMA-ES algorithm that we will use for comparisons with our deterministic and annealed FGS procedures. We will present our DFGS and AFGS proposals in Section 3 and compare them against CMA-ES in Section 4 over several classification problems solved using MLPs and SVMs. MLPs will have a standard 1-hidden layer structure and will be trained in a batch setting by conjugate gradient minimization. We shall train SVMs for classification using the MDM algorithm [14] with quadratic margin slack penalties and Gaussian kernels. We will consider two comparison settings. First we will deal with what we may call “small parameter set” problems. By this we mean selecting the optimal number of hidden units and the weight decay parameter for MLPs, and the Gaussian kernel width 2σ² and penalty factor C for SVMs; that is, we will select just two optimal parameters for both MLPs and SVMs. On the other hand, we shall also consider “large parameter set” problems for SVMs, where we will seek, besides the penalty factor C, the best parameters of an anisotropic Gaussian kernel. The number of parameters is now 1+D, with D the pattern dimension in the original feature space. This makes DFGS too costly and in this second situation we shall just compare AFGS and CMA-ES searches. (We observe that in [15] the anisotropic SVM parameter estimation problem is considered by first applying a preliminary grid search to obtain an initial 2σ² choice and then tuning it using CMA-ES.)

As our results in Section 4 will illustrate, there is no clear winner among the approaches considered here in either parameter-number scenario, and the choice of one particular technique may rest on other considerations. For instance, while practical only for models with few parameters, DFGS avoids the parametrization required by stochastic searches, and AFGS, with few and very simple parameters, can be used over more complex models. On the other hand, while CMA-ES has undergone a very thorough theoretical and experimental analysis and good implementations in several programming languages are available in [16], it is still a fairly complex procedure that does not lend itself to an easy stand-alone implementation and whose own parametrization is difficult. In contrast, DFGS and AFGS are procedurally much simpler and, thus, easier to implement. This paper will end with a brief discussion and conclusions section.

Section snippets

Previous work

In this section we will briefly describe standard linear grid search, establishing in part the notation to be used in the next section, and then review the application of uniform design to optimal parameter selection and, in more detail, the (μ,λ) CMA-ES procedure that we shall use in our experimental comparisons.

Linear grid search is the simplest (but also costliest) way to find optimal model parameters. A grid structure lends itself naturally to search for discrete parameters while

Deterministic focused grid search

Our starting option will be a simple parameter grid search, where we will start at a 3^M-point initial outer grid O_0 made up of vectors of the form α_i^0 = c_i^0 + γΔ_i, where we assume parameter ranges [a_i, b_i], Δ_i = b_i − a_i, the c_i^0 = a_i + Δ_i/2 are the coordinates of the centre c^0 of the parameter hypercube to be explored and γ ∈ {−1/2, 0, 1/2}. In principle, the point τ^0 = ((α_1^0)*, …, (α_M^0)*) giving the best fitness will be taken as the centre c^1 of a new smaller grid O_1 of 3^M points of the form α_i^1 = c_i^1 + γΔ_i/2 with γ as
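For concreteness, a sketch of this basic zooming loop (without the refined extra-depth variant discussed later in this section) could be written as follows; the parameter ranges, the depth K and the fitness function are assumed to be given, and the enumeration of all 3^M offsets makes this practical only for small M.

    import itertools
    import numpy as np

    def dfgs(fitness, ranges, K):
        """Sketch of deterministic focused grid search (DFGS), basic version.

        Each step evaluates the 3**M points c + gamma * Delta with
        gamma in {-1/2, 0, 1/2}**M, recentres on the best one and halves
        Delta, for roughly K * 3**M model constructions at depth K.
        """
        lo = np.array([a for a, b in ranges], dtype=float)
        hi = np.array([b for a, b in ranges], dtype=float)
        delta, M = hi - lo, len(ranges)
        centre = lo + delta / 2.0                       # centre c0 of the hypercube
        offsets = np.array(list(itertools.product((-0.5, 0.0, 0.5), repeat=M)))
        best_params, best_fit = centre.copy(), fitness(centre)
        for _ in range(K):
            for gamma in offsets:                       # the 3**M outer grid points
                cand = np.clip(centre + gamma * delta, lo, hi)
                f = fitness(cand)
                if f < best_fit:
                    best_params, best_fit = cand.copy(), f
            centre, delta = best_params.copy(), delta / 2.0   # focus on half-size grid
        return best_params, best_fit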

Numerical experiments

We will illustrate the preceding techniques over Rätsch's classification datasets available in [21] and listed in Table 1 together with their number of patterns and attributes. These datasets are given as 100 train–test pairs; we shall use these splits in our experiments. Here we shall work with both Gaussian kernel SVMs and single hidden layer multilayer perceptrons. For SVMs we shall allow for margin slacks with quadratic penalties; that is, we use a criterion function of the form (1/2)‖W‖² + (C/2)Σ_i ξ_i²
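As an illustration of how one grid point might be scored in this setting, the sketch below evaluates a (C, 2σ²) candidate on a validation split. Note that it uses scikit-learn's SVC with the standard linear hinge penalty as a stand-in, whereas the paper trains its SVMs with the MDM algorithm and quadratic slack penalties; the mapping γ = 1/(2σ²) converts the kernel-width parametrization used in the text to scikit-learn's Gaussian-kernel parameter.

    import numpy as np
    from sklearn.svm import SVC

    def svm_fitness(params, X_train, y_train, X_val, y_val):
        """Validation-error fitness for one grid point (C, 2*sigma**2); a stand-in only."""
        C, two_sigma_sq = params
        clf = SVC(C=C, kernel="rbf", gamma=1.0 / two_sigma_sq)
        clf.fit(X_train, y_train)
        return np.mean(clf.predict(X_val) != y_val)   # validation error rate

Such a function can then be passed to any of the search sketches above, e.g. fitness = lambda p: svm_fitness(p, X_tr, y_tr, X_va, y_va).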

Discussion and conclusions

Present day machine learning offers a very wide range of modelling methods and while for a given problem, the ability to select the best one would be highly desirable, this same high number of options makes that choice quite difficult. Besides, even when one particular model is chosen, it is not an easy task to come up with good values for that model's specific parameters. On the other hand, if the model is itself general enough to cover a wide range of problems, an appropriate choice of

Acknowledgements

With partial support from Spain's TIN 2004–07676 and TIN 2007–66862 grants. The first author is kindly supported by FPU–MEC grant reference AP2006–02285.

References (22)

  • C.M. Huang et al., Model selection for support vector machines via uniform design, Computational Statistics and Data Analysis (2007)
  • F. Friedrichs et al., Evolutionary tuning of multiple SVM parameters, Neurocomputing (2005)
  • K.T. Fang et al., Uniform experimental designs and their applications in industry, Handbook of Statistics (2003)
  • R. Duda et al., Pattern Classification (2000)
  • V. Vapnik, The Nature of Statistical Learning Theory (1995)
  • B. Schölkopf et al., Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Machine Learning (2002)
  • S.S. Keerthi, Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms, IEEE Transactions on Neural Networks (2002)
  • D.C. Montgomery, Design and Analysis of Experiments (1976)
  • C.-W. Hsu, Ch.-Ch. Chang, Ch.-J. Lin, A practical guide to support vector classification...
  • C. Staelin, Parameter selection for support vector machines, Technical Report HPL-2002-354, HP Laboratories, Israel,...
  • C. Harpham et al., A review of genetic algorithms applied to training radial basis function networks, Neural Computing & Applications (2004)

    A. Barbero received the Computer Scientist degree from the Universidad Autónoma de Madrid (UAM) in 2006, and is currently a student in a Master/Ph.D. degree in Computer Science at the same university. At present, he is working at the UAM under a National Predoctoral Grant, in collaboration with the Instituto de Ingeniería del Conocimiento. His research interests are in pattern recognition, kernel methods and wind power forecasting.

    J. López received his Computer Engineering degree from the Universidad Autónoma de Madrid in 2006, where he got an Honorific Mention as the best student. Currently he is attending the Postgraduate Programme organized by the Computer Engineering Department of the same university. His research interests concentrate on support vector machines, but also cover additional machine learning and pattern recognition paradigms.

    J.R. Dorronsoro received the Licenciado en Matemáticas degree from the Universidad Complutense de Madrid in 1977 and the Ph.D. degree in Mathematics from Washington University in St. Louis in 1981. Currently he is Full Professor in the Computer Engineering Department of the Universidad Autónoma de Madrid, of which he was the head from 1993 to 1996. His research interest is in neural networks, image processing and pattern recognition.
