Finding optimal model parameters by deterministic and annealed focused grid search
Introduction
Successful data mining applications usually rely on the appropriate choice of several parameters for the modelling paradigm that is to be used. A very well-known case is that of multilayer perceptrons (MLPs; [1]), where parameters to be chosen are the number of hidden layers and of units in each layer, the learning rate and momentum factor for gradient descent learning or the penalty factor for weight regularization. Other examples are support vector machines (SVMs; [2], [3]), where a margin slack penalty factor has to be chosen for nonlinearly separable models, to which one has to add the kernel width if Gaussian kernels are to be used or the value of the ε-insensitivity parameter in SV regression. However, there may be other reasons that force the selection of modelling parameters. For instance, some sort of input preprocessing is quite often needed, such as the selection of the percentage of variance to be retained if principal components are used for input dimensionality reduction or, in time series applications, the number of delays or the modelling window size to be used.
Some ad-hoc approaches can be applied in concrete settings. For instance, leave-one-out error estimation can be used to derive parameter-dependent error bounds for SVMs [4] which, in turn, make it possible to find optimal SVM parameters by gradient descent on the resulting error model. On the other hand, a relatively standard technique (see [5], Chapter 14) in the design of experiments field is to test the model performance in a few points of the parameter space and to fit a first or second order empirical error surface over the results obtained, which is then taken as an approximation to the real underlying error landscape. If it exists, this surface's minimum is used to provide optimal model parameters.
In any case, the previous situations are the exception rather than the rule. More often (and quite particularly for hybrid systems), several rather different and competing modelling techniques are to be used so that an adequately chosen combination offers less variance than that of its individual components. In such a situation one cannot count on a direct knowledge of the effectiveness of each individual model for the problem at hand and it is thus clear that no general parameter setting procedure exists other than some kind of search on the parameter space. There are then two extreme approaches: a more or less exhaustive, deterministic parameter search [6], [7] and, on the other hand, a stochastic metamodel search, typically using genetic algorithms [8], [9] or genetic programming [10].
Deterministic parameter search is usually done over a grid obtained by the discretization of each individual parameter's allowable range. This discretization appears naturally for integer-valued parameters (such as, for instance, the number of hidden layers or units of an MLP) and, for a continuous parameter in a range [a, b], a resolution limit δ is fixed and one explores the parameter values a + iδ, 0 ≤ i ≤ N, where δ = (b − a)/N. We may assume that N = 2^K and call K the depth of the search. An obvious way of exploring the resulting grid associated with an M parameter model would be to perform a linear search over all the grid's parameter values. However, the complexity of essentially any model selection procedure is determined by the number of different models to be built (in our classification examples the fitness function will be the model errors on a validation subset), and here the cost of (2^K + 1)^M model constructions rules out its application outside very low values of M and K.
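The exhaustive linear grid search just described can be sketched as follows; the function names and the toy quadratic fitness (standing in for a validation error) are illustrative, not the paper's:

```python
from itertools import product

def linear_grid_search(ranges, depth, fitness):
    """Exhaustively evaluate `fitness` on a grid with 2**depth + 1
    values per parameter; returns the best point and its fitness.
    The cost is (2**depth + 1)**M model constructions for M parameters."""
    axes = []
    for lo, hi in ranges:
        n = 2 ** depth                       # N = 2^K subdivisions per range
        axes.append([lo + i * (hi - lo) / n for i in range(n + 1)])
    best_point, best_fit = None, float("inf")
    for point in product(*axes):             # (2^K + 1)^M grid points
        f = fitness(point)                   # one model build per point
        if f < best_fit:
            best_point, best_fit = point, f
    return best_point, best_fit

# toy fitness standing in for a validation error surface
best, err = linear_grid_search([(-1.0, 1.0), (-1.0, 1.0)], depth=3,
                               fitness=lambda p: (p[0] - 0.5) ** 2 + p[1] ** 2)
# → best = (0.5, 0.0), err = 0.0
```

Even in this tiny two-parameter example the search already builds 9 × 9 = 81 models; for moderate M the (2^K + 1)^M growth quickly becomes prohibitive.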
The simplest way out of this is to introduce the possibility of a random evolution in the search procedure. A first example somewhat related to grid search is to fill the parameter space using uniform experimental design [11], where a number L of patterns is chosen so that the squared-norm discrepancy between their empirical cumulative distribution and that of the uniform distribution is small. Another widely used option is to stochastically explore the parameter space in an evolutionary setting. The well-known covariance matrix adaptation evolution strategy (CMA-ES), proposed by Hansen and Ostermeier [12], [13], is one of the most effective black-box random optimizers. Briefly, CMA-ES produces from a population P_t of μ individuals a number λ of new individuals with genotypes x_i = m_t + z_i, where m_t is the mass centre of the previous population and the perturbations z_i are independent realizations of an M-dimensional Gaussian distribution. The μ offspring with the best fitness are then selected to form the new population to be used at step t + 1. Moreover, at each step the covariance matrix of the Gaussian is adaptively updated in such a way that the probability of applying again previous steps that led to larger gains is maximized. The complexity of CMA-ES based model selection is clearly λ times the number of generations explored.
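A minimal sketch of the sampling-and-selection core of such a (μ, λ) evolution strategy is given below. Full CMA-ES additionally adapts the whole covariance matrix and the step size; here, as a simplification, the covariance is kept spherical with a crude decay schedule, and all names are ours:

```python
import random

def simple_es(fitness, m, sigma=0.5, lam=20, mu=5, generations=60, seed=0):
    """Skeleton of a (mu, lambda) evolution strategy: lambda offspring are
    Gaussian perturbations of the population mass centre, the mu fittest
    survive and define the next centre.  Real CMA-ES also adapts the full
    covariance matrix; this sketch uses a fixed spherical one."""
    rng = random.Random(seed)
    M = len(m)
    for _ in range(generations):
        offspring = [[mi + sigma * rng.gauss(0.0, 1.0) for mi in m]
                     for _ in range(lam)]
        offspring.sort(key=fitness)          # lambda fitness evaluations
        survivors = offspring[:mu]           # truncation selection
        m = [sum(x[j] for x in survivors) / mu for j in range(M)]  # new mass centre
        sigma *= 0.95                        # crude step-size decay (not CMA's rule)
    return m

# minimize a toy quadratic fitness; the centre drifts towards (1, 1)
centre = simple_es(lambda p: sum((x - 1.0) ** 2 for x in p), m=[0.0, 0.0])
```

The cost is λ fitness evaluations per generation, exactly as stated above for CMA-ES based model selection.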
While being in general quite effective, these approaches are not problem-free. For instance, finding directly the appropriate space filling points in uniform design (UD) is a difficult endeavor. Tables exist that provide good preselected values, but the number of their points may be too small if a moderate to large number of parameters has to be set. On the other hand, CMA-ES is a rather complicated and difficult to parametrize and implement procedure. Moreover, it must start from a single point in parameter space and its exploration may therefore be less thorough.
As a simpler alternative, we shall propose in this paper two grid-based procedures which, starting from the outer grid points, successively narrow the search to smaller half-size grids centred at the point giving the best fitness value in the previous iteration. Because of this we will term them focused grid searches (FGS). The first one is a deterministic FGS (DFGS) for which we will begin by discussing a simple general version whose number of model constructions grows exponentially in the number M of model parameters but only linearly in the grid depth K; we will also show how to refine it to achieve an extra depth of precision while requiring the same number of model trainings in the best case and only a moderately larger number in the worst case. Notice, however, that the exponential cost in M remains and, to alleviate it, we shall also introduce an evolution strategy on the previous focused grid search, in which we will only consider a fixed number of points instead of all the outer grid points used in DFGS. These points will be selected through a simulated annealing-like procedure and the one giving the best fitness will be the centre of a new half-size grid; we shall call the resulting procedure annealed FGS (AFGS). While the complexity of DFGS is essentially fixed once we choose the grid depth K, in AFGS we can control it by fixing the number of outer grid points to be randomly examined at each level. Thus, for a depth-K search, AFGS requires at most a number of model constructions proportional to K, and its cost is then comparable to that of CMA-ES whenever the latter performs a comparable number of fitness evaluations.
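The annealed variant can be sketched as follows. This is only an illustration of the idea under our own assumptions: the parameter names, the geometric cooling schedule and the random choice of grid offsets are ours, not necessarily the paper's exact procedure:

```python
import math
import random

def afgs(fitness, centre, radii, depth=4, n_points=10, temp=1.0, cool=0.5, seed=0):
    """Sketch of an annealed focused grid search: at each of `depth` levels
    only `n_points` randomly drawn grid points are evaluated, and a worse
    point may still become the current grid centre with a simulated
    annealing-like acceptance probability; the radii are halved per level."""
    rng = random.Random(seed)
    M = len(centre)
    best, best_fit = list(centre), fitness(centre)
    cur, cur_fit = list(best), best_fit
    for _ in range(depth):
        for _ in range(n_points):            # fixed budget instead of all outer grid points
            cand = [cur[j] + rng.choice((-1, 0, 1)) * radii[j] for j in range(M)]
            f = fitness(cand)
            if f < cur_fit or rng.random() < math.exp((cur_fit - f) / temp):
                cur, cur_fit = cand, f       # annealed acceptance of a (possibly worse) centre
            if f < best_fit:
                best, best_fit = list(cand), f
        radii = [r / 2 for r in radii]       # focus: half-size grid
        temp *= cool                         # cooling lowers the acceptance of worse points
    return best, best_fit
```

With `n_points` fixed, the total budget is at most `depth * n_points` model constructions, i.e. linear in the search depth K.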
This paper is organized as follows. In Section 2 we will briefly review standard linear grid search, uniform design parameter value selection and, in more detail, the CMA-ES algorithm that we will use for comparisons with our deterministic and annealed FGS procedures. We will present our DFGS and AFGS proposals in Section 3 and we will compare them against CMA-ES in Section 4 over several classification problems which will be solved using MLPs and SVMs. MLPs will have a standard 1-hidden layer structure and will be trained in a batch setting by conjugate gradient minimization. We shall train SVMs for classification using the MDM algorithm [14] with quadratic margin slack penalties and Gaussian kernels. We will consider two comparison settings. First we will deal with what we may call “small parameter set” problems. By this we mean selecting the optimal number of hidden units and the weight decay parameter for MLPs, and the Gaussian kernel width and penalty factor C in SVMs; that is, we will select just two optimal parameters for both MLPs and SVMs. On the other hand, we shall also consider “large parameter set” problems for SVMs, where we will seek, besides the penalty factor C, the best parameters of an anisotropic Gaussian kernel. The number of parameters is now D + 1, with D the pattern dimension in the original feature space. This makes DFGS too costly and in this second situation we shall just compare AFGS and CMA-ES searches. (We observe that in [15] the anisotropic SVM parameter estimation problem is considered by applying first a preliminary grid search to obtain an initial choice and then tuning it using CMA-ES.)
As our results in Section 4 will illustrate, there is no clear winner among the approaches considered here in either parameter-number scenario, and the choice of one particular technique may rest on other considerations. For instance, while practical only for models with few parameters, DFGS avoids the parametrization required by stochastic searches, and AFGS, with few and very simple parameters, can be used with more complex models. On the other hand, while CMA-ES has undergone a very thorough theoretical and experimental analysis and good implementations in several programming languages are available [16], it is still a fairly complex procedure that does not lend itself to an easy standalone implementation and whose own parametrization is difficult. In contrast, DFGS and AFGS are procedurally much simpler and, thus, easier to implement. This paper will end with a brief discussion and conclusions section.
Section snippets
Previous work
In this section we will briefly describe standard linear grid search, establishing in part the notation to be used in the next section, and, then, review the application of uniform design to optimal parameter selection and, in more detail, the CMA-ES procedure that we shall use in our experimental comparisons.
Linear grid search is the simplest (but also costliest) way to find optimal model parameters. A grid structure lends itself naturally to search for discrete parameters while
Deterministic focused grid search
Our starting option will be a simple parameter grid search, where we will start at an initial outer grid made up of vectors of the form (c_1 + ε_1 r_1, …, c_M + ε_M r_M) with ε_j ∈ {−1, 0, 1}, where we assume parameter ranges [c_j − r_j, c_j + r_j]; the c_j are the coordinates of the centre of the parameter hypercube to be explored and the r_j its half-widths. In principle, the point giving the best fitness will be taken as the centre of a new, smaller half-size grid of points of the same form…
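The deterministic focusing step just outlined can be sketched as follows; this is a minimal illustration under our own naming and a fixed stopping depth, not the paper's exact refined procedure:

```python
from itertools import product

def dfgs(fitness, centre, radii, depth=4):
    """Sketch of a deterministic focused grid search: at each level,
    evaluate the 3**M points c_j + eps_j * r_j, eps_j in {-1, 0, 1},
    recentre on the best point found and halve the radii."""
    M = len(centre)
    best, best_fit = list(centre), fitness(centre)
    for _ in range(depth):
        # 3^M grid centred at the best point of the previous level
        for eps in product((-1, 0, 1), repeat=M):
            cand = [centre[j] + eps[j] * radii[j] for j in range(M)]
            f = fitness(cand)
            if f < best_fit:
                best, best_fit = cand, f
        centre = list(best)                  # refocus on the best point
        radii = [r / 2 for r in radii]       # half-size grid for the next level
    return best, best_fit
```

Each level costs 3^M model constructions, so the total budget grows linearly in the depth but exponentially in the number of parameters M.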
Numerical experiments
We will illustrate the preceding techniques over Rätsch's classification datasets available in [21] and listed in Table 1 together with their number of patterns and attributes. These datasets are given as 100 train–test pairs; we shall use these splits in our experiments. Here we shall work with both Gaussian kernel SVMs and single hidden layer multilayer perceptrons. For SVMs we shall allow for margin slacks with quadratic penalties; that is, we use a criterion function of the form
Discussion and conclusions
Present day machine learning offers a very wide range of modelling methods and while for a given problem, the ability to select the best one would be highly desirable, this same high number of options makes that choice quite difficult. Besides, even when one particular model is chosen, it is not an easy task to come up with good values for that model's specific parameters. On the other hand, if the model is itself general enough to cover a wide range of problems, an appropriate choice of
Acknowledgements
This work was partially supported by Spain's grants TIN 2004-07676 and TIN 2007-66862. The first author is supported by the FPU-MEC grant, reference AP2006-02285.
References (22)
- et al., Model selection for support vector machines via uniform design, Computational Statistics and Data Analysis (2007)
- et al., Evolutionary tuning of multiple SVM parameters, Neurocomputing (2005)
- et al., Uniform experimental designs and their applications in industry, Handbook of Statistics (2003)
- et al., Pattern Classification (2000)
- The Nature of Statistical Learning Theory (1995)
- et al., Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Machine Learning (2002)
- Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms, IEEE Transactions on Neural Networks (2002)
- Design and Analysis of Experiments (1976)
- C.-W. Hsu, Ch.-Ch. Chang, Ch.-J. Lin, A practical guide to support vector classification...
- C. Staelin, Parameter selection for support vector machines, Technical Report HPL-2002-354, HP Laboratories, Israel,...
- A review of genetic algorithms applied to training radial basis function networks, Neural Computing & Applications
A. Barbero received the Computer Scientist degree from the Universidad Autónoma de Madrid (UAM) in 2006, and is currently a student in a Master/Ph.D. degree in Computer Science at the same university. At present, he is working at the UAM under a National Predoctoral Grant, in collaboration with the Instituto de Ingeniería del Conocimiento. His research interests are in pattern recognition, kernel methods and wind power forecasting.
J. López received his Computer Engineering degree from the Universidad Autónoma de Madrid in 2006, where he got an Honorific Mention as the best student. Currently he is attending the Postgraduate Programme organized by the Computer Engineering Department of the same university. His research interests concentrate on support vector machines, but also cover additional machine learning and pattern recognition paradigms.
J.R. Dorronsoro received the Licenciado en Matemáticas degree from the Universidad Complutense de Madrid in 1977 and the Ph.D. degree in Mathematics from Washington University in St. Louis in 1981. Currently he is Full Professor in the Computer Engineering Department of the Universidad Autónoma de Madrid, of which he was the head from 1993 to 1996. His research interest is in neural networks, image processing and pattern recognition.