2.1 Surrogate modelling and SBO in an MF context
Recent state-of-the-art MF approaches use the results from the LF and HF tools, \(y_{LF}\) and \(y_{HF}\) respectively, which are assumed to follow the relationship below Forrester and Keane (2009),
$$\begin{aligned} y_{HF} ({\mathbf {x}}) = y_{LF}({\mathbf {x}}) + e({\mathbf {x}}) \end{aligned}$$
(1)
A typical application of this decomposition within an SBO framework Jones (2001) is the trust-region approach Demange et al. (2016a, b), Jarrett and Ghisu (2015), which builds a locally accurate error surrogate (typically an RBF model) to correct the LF value according to Eq. 1. However, such an optimization methodology is not appropriate for our needs for two main reasons. First, it does not offer the exploration characteristics required during a conceptual design stage study. Second, it demands a high number of LF analyses, as the LF tool is called directly by the optimizer during the suboptimization stage; even for low-cost LF tools, this becomes prohibitively expensive for an MO MDO study. To minimize the suboptimization costs we use metamodels, since their occasional training is cheap for a reasonable number of training data points, and the optimizer then calls a surrogate model predictor, which is of course cheaper than an LF tool. Therefore, in the global SBO methodology presented here, MF information is implemented directly in the generated surrogate model. A popular metamodelling technique for MF SBO is Co-Kriging Forrester et al. (2007), which, unlike Kriging, uses both LF and HF point correlations to compose a unified MF covariance matrix. However, such an approach involves tuning additional hyperparameters, increasing the likelihood estimation cost quadratically with the order of the matrix. We therefore consider the Co-Kriging covariance matrix operations too expensive for the real industrial MDO applications at which we aim.
Instead, to reduce computational expense while aiming for global HF optimality, we propose a novel MF modified Kriging-based model (MF modKriging). The computationally efficient RBF model cannot provide global exploration within an SBO framework Forrester et al. (2008). In the following paragraphs, we show how ordinary Kriging is modified so that it can accommodate MF information and be used in place of the more expensive Co-Kriging model.
Throughout this work, as is typical in most MF research efforts, only HF results are considered accurate within the numerical framework. LF simulations are inaccurate and do not, by themselves, provide additional information Kennedy and O’Hagan (2000); despite being used efficiently to guide the optimization process, they are associated with a non-constant error \(e({\mathbf {x}})\) defined in Eq. 1. Therefore, to guide the search towards HF optima, our modified Kriging model should interpolate the HF points and regress through the LF ones, according to the associated error of the latter. This error is estimated either by a simple RBF model Forrester and Keane (2009) or by a Kriging model using Eq. 1, given a sampling of both LF and HF points. For this, the space-filling Latin Hypercube Sampling (LHS) method with a Morris-Mitchell maximin approach Morris and Mitchell (1995); Johnson et al. (1990) is used. The sampling requires \(m\) points analyzed only by the LF tool at \({\mathbf {x}}_{LF}\), and \(n\) new points analyzed with both the LF and HF tools at
\({\mathbf {x}}_{HF}\). As such, we define the complete objective value vector \({\mathbf {y}}\) and the error vector \({\mathbf {e}}\), the latter consisting of \(n\) entries derived from Eq. 1:
$$\begin{aligned} {\mathbf {y}} = \begin{pmatrix} y_{LF}^{(1)}({\mathbf {x}}_{LF}^{(1)})\\ y_{LF}^{(2)}({\mathbf {x}}_{LF}^{(2)})\\ \vdots \\ y_{LF}^{(m)}({\mathbf {x}}_{LF}^{(m)})\\ y_{LF}^{(m+1)}({\mathbf {x}}_{HF}^{(1)})\\ y_{LF}^{(m+2)}({\mathbf {x}}_{HF}^{(2)})\\ \vdots \\ y_{LF}^{(m+n)}({\mathbf {x}}_{HF}^{(n)})\\ y_{HF}^{(m+n+1)}({\mathbf {x}}_{HF}^{(1)})\\ y_{HF}^{(m+n+2)}({\mathbf {x}}_{HF}^{(2)})\\ \vdots \\ y_{HF}^{(m+2n)}({\mathbf {x}}_{HF}^{(n)}) \end{pmatrix}, \qquad {\mathbf {e}} = \begin{pmatrix} e^{(1)}({\mathbf {x}}_{HF}^{(1)})\\ e^{(2)}({\mathbf {x}}_{HF}^{(2)})\\ \vdots \\ e^{(n)}({\mathbf {x}}_{HF}^{(n)}) \end{pmatrix} \end{aligned}$$
(2)
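As a concrete illustration, the sampling and the vectors of Eq. 2 can be assembled as in the sketch below. The 1-D toy functions and the plain uniform sampling are stand-ins for the real LF/HF tools and the Morris-Mitchell maximin LHS, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical 1-D low- and high-fidelity analyses (stand-ins for the tools).
def y_lf(x):
    return 0.5 * (6 * x - 2) ** 2 * np.sin(12 * x - 4) + 10 * (x - 0.5)

def y_hf(x):
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

m, n = 8, 4
rng = np.random.default_rng(0)
x_lf = rng.uniform(0.0, 1.0, m)   # m points, analyzed by the LF tool only
x_hf = rng.uniform(0.0, 1.0, n)   # n points, analyzed by both tools

# Objective vector y of Eq. 2: LF values at all m+n points, then HF values.
y = np.concatenate([y_lf(x_lf), y_lf(x_hf), y_hf(x_hf)])

# Error vector e of Eq. 2, one entry per HF point (Eq. 1).
e = y_hf(x_hf) - y_lf(x_hf)
```

The layout mirrors Eq. 2 exactly: the last `n` entries of `y` minus the `n` entries before them reproduce `e`.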
If an RBF model is used for the Objective Function (OF) error, then in matrix form we have,
$$\begin{aligned} {\mathbf {R}} \varvec{\alpha } = {\mathbf {e}} \end{aligned}$$
(3)
When Gaussian kernel functions are used, the Gram matrix
\({\mathbf {R}}\) consists of elements of the form,
$$\begin{aligned} r_{ij}= \exp \left( - \theta \Vert {\mathbf {x}}_{\mathbf{j}}-{\mathbf {x}}_{\mathbf{i}}\Vert ^p \right) \end{aligned}$$
(4)
where the smoothness parameter p is set to \(p=2\). Alternatively, in the authors' experience, Matérn functions provide an accurate and robust choice [40]. The parameter vector \(\varvec{\alpha }\) is found by a Cholesky decomposition and back-substitution. The error associated with the use of the LF tool can now be calculated at any design point. Below, we show how this error estimation is used in MF modKriging. Initially, consider standard ordinary Kriging, which uses the correlation matrix,
$$\begin{aligned} {\varvec{\Psi }}= \begin{pmatrix} Corr[y({\mathbf {x}}_1), y({\mathbf {x}}_1)] &{} Corr[y({\mathbf {x}}_1), y({\mathbf {x}}_2)] &{} \ldots &{} Corr[y({\mathbf {x}}_1), y({\mathbf {x}}_{n})]\\ Corr[y({\mathbf {x}}_2), y({\mathbf {x}}_1)] &{} \ldots &{} \ldots &{} Corr[y({\mathbf {x}}_2), y({\mathbf {x}}_{n})]\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ Corr[y({\mathbf {x}}_{n}), y({\mathbf {x}}_1)] &{} Corr[y({\mathbf {x}}_{n}), y({\mathbf {x}}_2)] &{} \ldots &{} Corr[y({\mathbf {x}}_{n}), y({\mathbf {x}}_{n})] \end{pmatrix} \end{aligned}$$
(5)
The elements of this matrix estimate the correlation of the data points by modelling the function as a Gaussian process:
$$\begin{aligned} Corr[y({\mathbf {x}}_{\mathbf{i}}), y({\mathbf {x}}_{\mathbf{j}})] = \exp {\left( - \sum \limits _{l=1}^{d} \theta _l \Vert x_{j,l}-x_{i,l}\Vert ^{p_l}\right) } \end{aligned}$$
(6)
Here, \(d\) is the number of design variables, and \(\theta _l\) and \(p_l\) are the shape and smoothness parameters, respectively, which need to be defined.
We generate the Kriging predictor by optimizing the set of
\(\mu\),
\(\sigma ^2\),
\(\theta _l\),
\(p_l\) parameters. The mean
\(\mu\) and variance
\(\sigma ^2\) are easily optimized in a deterministic manner Forrester et al. (2008). Finding the optimum shape and smoothness parameters (\(\hat{\theta _l}\) and \(\hat{p_l}\)), however, requires a stochastic optimization process aiming at maximizing the log-likelihood function \(\lambda\) given by,
$$\begin{aligned} \lambda = -\frac{n}{2} \log ({\hat{\sigma }}^2) - \frac{1}{2} \log |{\varvec{\Psi }}| \end{aligned}$$
(7)
Therefore, the problem of optimizing the hyperparameters is defined by,
$$\begin{aligned} \hat{\varvec{\theta _l}}, \hat{\varvec{p_l}} = \arg \max _{\varvec{\theta }, {\mathbf {p}}} \lambda \end{aligned}$$
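A minimal sketch of this hyperparameter search is given below, assuming a fixed \(p_l = 2\) and a crude random search in place of the stochastic optimizer actually used; the data are synthetic, and a plain matrix inverse stands in for the Cholesky factorization one would use in practice.

```python
import numpy as np

def concentrated_lnlik(theta, X, y, p=2.0, nugget=1e-8):
    # Log-likelihood lambda of Eq. 7; mu and sigma^2 are first
    # eliminated deterministically (Eqs. 16 and 18).
    n = len(y)
    d = np.abs(X[:, None, :] - X[None, :, :])
    Psi = np.exp(-np.sum(theta * d ** p, axis=-1)) + nugget * np.eye(n)
    Pi = np.linalg.inv(Psi)
    one = np.ones(n)
    mu = one @ Pi @ y / (one @ Pi @ one)
    sig2 = (y - mu) @ Pi @ (y - mu) / n
    _, logdet = np.linalg.slogdet(Psi)
    return -0.5 * n * np.log(sig2) - 0.5 * logdet

rng = np.random.default_rng(1)
X = rng.uniform(size=(10, 2))            # synthetic 2-D training inputs
y = np.sin(6.0 * X[:, 0]) + X[:, 1]      # synthetic responses
# Random search over theta (a crude stand-in for the stochastic optimizer):
candidates = 10.0 ** rng.uniform(-1.0, 1.5, size=(50, 2))
theta_hat = max(candidates, key=lambda t: concentrated_lnlik(t, X, y))
```

The small `nugget` on the diagonal keeps \({\varvec{\Psi }}\) well conditioned when sample points lie close together.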
Finally, the Kriging predictor takes the form,
$$\begin{aligned} {\hat{y}}({\mathbf {x}}) = {\hat{\mu }} + \varvec{\psi }^T {\varvec{\Psi }}^{{-\mathbf{1}}}({\mathbf {y}}-{\mathbf {1}}{\hat{\mu }}) \end{aligned}$$
(8)
where
\(\varvec{\psi }\) is the correlation vector associated with the point to be predicted.
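The ordinary Kriging construction of Eqs. 5, 8, 16 and 18 can be sketched as follows, assuming a fixed shape parameter (i.e., the hyperparameter tuning step is omitted) and \(p = 2\); a plain inverse stands in for the factorizations used in practice.

```python
import numpy as np

def psi_vector(x, X, theta, p=2.0):
    # Correlation vector between a prediction point x and the training set (Eq. 6).
    return np.exp(-np.sum(theta * np.abs(X - x) ** p, axis=-1))

def kriging_fit(X, y, theta, nugget=1e-10):
    # Correlation matrix Psi of Eq. 5, with a small nugget for conditioning.
    n = len(y)
    Psi = np.array([psi_vector(xi, X, theta) for xi in X]) + nugget * np.eye(n)
    Pi = np.linalg.inv(Psi)
    one = np.ones(n)
    mu = one @ Pi @ y / (one @ Pi @ one)              # Eq. 16
    sig2 = (y - mu) @ Pi @ (y - mu) / n               # Eq. 18
    return Pi, mu, sig2

def kriging_predict(x, X, y, Pi, mu, theta):
    psi = psi_vector(x, X, theta)
    return mu + psi @ Pi @ (y - mu)                   # Eq. 8
```

Because the model interpolates, the predictor reproduces the training values at the sample points (up to the nugget).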
The Kriging predictor is not used directly; instead, the Kriging model provides the next sampling point under the Expected Improvement (EI) infill plan Jones (2001). This approach can lead to global optimality without a prohibitively costly design space exploration, but requires information about the metamodel's Mean Squared Error (MSE). This is given by,
$$\begin{aligned} \hat{s^2}({\mathbf {x}}) = \sigma ^2 \left[ 1 - \varvec{\psi }^T {\varvec{\Psi }}^{{-\mathbf{1}}} \varvec{\psi } + \frac{\left( 1 - {\mathbf {1}}^T {\varvec{\Psi }}^{{-\mathbf{1}}} \varvec{\psi }\right) ^2 }{{\mathbf {1}}^T {\varvec{\Psi }}^{{-\mathbf{1}}}{\mathbf {1}}}\right] \end{aligned}$$
(9)
The MSE is zero at the training data points and increases between them due to value uncertainty. The above formulation is used to construct an estimator of the expected improvement of the objective function at any design space point, given the current minimum value \(y_{\min }\). At any point where \({\hat{s}}({\mathbf {x}}) \ne 0\), this is expressed as,
$$\begin{aligned} EI = ( y_{\min } - {\hat{y}}({\mathbf {x}}) ) \Phi \left( \frac{y_{\min } -{\hat{y}}({\mathbf {x}})}{\hat{s}({\mathbf {x}})} \right) + {\hat{s}}({\mathbf {x}}) \phi \left( \frac{y_{\min } - {\hat{y}}({\mathbf {x}})}{{\hat{s}}({\mathbf {x}})}\right) \end{aligned}$$
(10)
where
\(\Phi\) is the cumulative distribution function,
$$\begin{aligned} \Phi (x)= \frac{1}{2} + \frac{1}{2} erf\left( x / \sqrt{2}\right) \end{aligned}$$
(11)
and \(\phi\) is the standard normal probability density function,
$$\begin{aligned} \phi (x)= \frac{1}{\sqrt{2\pi }} \exp \left( -x^2/2\right) \end{aligned}$$
In Eq.
11,
erf is the error function expressed as,
$$\begin{aligned} erf(x) = \frac{1}{\sqrt{\pi }} \int _{-x}^{x} \exp \left( -t^2\right) \, dt \end{aligned}$$
(12)
The infill point \({\mathbf {x}}^{*} \in D \subset R^d\) is then the solution of the suboptimization problem,
$$\begin{aligned} {\mathbf {x}}^{*} = \arg \max _{{\mathbf {x}} \in D} EI \end{aligned}$$
(13)
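Given the predictor and MSE above, Eqs. 10-13 can be sketched as below; the grid search stands in for whatever global suboptimizer is actually used for Eq. 13, and the predictor/MSE callables are hypothetical placeholders for the surrogate.

```python
import math

def expected_improvement(y_min, y_hat, s):
    # Eq. 10, with Phi from Eq. 11 and the standard normal density phi.
    if s <= 0.0:
        return 0.0              # EI is defined only where s(x) != 0
    u = (y_min - y_hat) / s
    Phi = 0.5 + 0.5 * math.erf(u / math.sqrt(2.0))            # Eq. 11
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return (y_min - y_hat) * Phi + s * phi

def infill_point(grid, y_min, y_hat_fun, s_fun):
    # Eq. 13 over a candidate grid; y_hat_fun and s_fun are hypothetical
    # callables wrapping the surrogate predictor and its MSE.
    return max(grid, key=lambda x: expected_improvement(y_min, y_hat_fun(x), s_fun(x)))
```

At a point predicting exactly the current minimum with unit error, EI reduces to \(\phi(0) = 1/\sqrt{2\pi} \approx 0.399\).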
So far, we have presented ordinary Kriging, used extensively in the literature. However, it cannot accommodate data resulting from MF analyses. Our MF Kriging modification uses a simple way to superimpose LF and HF information within the model. Specifically, we want to include this MF information in the Kriging predictor (Eq. 15) and MSE (Eq. 17) expressions, therefore transforming the EI into an MF EI.
For MF modKriging, we define the correlation matrix to use only the LF training data of our MF vector
\({\mathbf {y}}\), so that:
$$\begin{aligned} {\varvec{\Psi }}= \begin{pmatrix} Corr[y_{LF}({\mathbf {x}}_1), y_{LF}({\mathbf {x}}_1)] &{} \ldots &{} Corr[y_{LF}({\mathbf {x}}_1), y_{LF}({\mathbf {x}}_{m+n})]\\ Corr[y_{LF}({\mathbf {x}}_2), y_{LF}({\mathbf {x}}_1)] &{} \ldots &{} Corr[y_{LF}({\mathbf {x}}_2), y_{LF}({\mathbf {x}}_{m+n})]\\ \vdots &{} \ddots &{} \vdots \\ Corr[y_{LF}({\mathbf {x}}_{m+n}), y_{LF}({\mathbf {x}}_1)] &{} \ldots &{} Corr[y_{LF}({\mathbf {x}}_{m+n}), y_{LF}({\mathbf {x}}_{m+n})] \end{pmatrix} \end{aligned}$$
(14)
The Kriging predictor
\({\hat{y}}\) now takes the following form:
$$\begin{aligned} {\hat{y}}({\mathbf {x}}) = {\hat{\mu }} + \varvec{\psi }^T {\varvec{\Psi }}^{{-\mathbf{1}}}({\mathbf {y}}-{\mathbf {1}}{\hat{\mu }}) + e({\mathbf {x}}) \end{aligned}$$
(15)
where
\({\hat{\mu }}\) is the Kriging optimized mean value calculated as,
$$\begin{aligned} {\hat{\mu }} = \frac{{\mathbf {1}}^T {\varvec{\Psi }}^{-1} {\mathbf {y}}}{{\mathbf {1}}^T {\varvec{\Psi }}^{-1} {\mathbf {1}}} \end{aligned}$$
(16)
The HF information is recovered by \(e({{\mathbf {x}}})\), which is a surrogate prediction of the LF tool error defined by Eq. 1. The Kriging predictor therefore interpolates the HF points and regresses the LF ones according to their predicted error. However, since the EI estimation also depends on the MSE, complete HF information recovery demands altering the MSE equation as:
$$\begin{aligned} \hat{s^2}({\mathbf {x}}) = \sigma ^2 \left[ 1 - \varvec{\psi }^T {\varvec{\Psi }}^{{-\mathbf{1}}} \varvec{\psi } + \frac{\left( 1 - {\mathbf {1}}^T {\varvec{\Psi }}^{{-\mathbf{1}}} \varvec{\psi }\right) ^2 }{{\mathbf {1}}^T {\varvec{\Psi }}^{{-\mathbf{1}}}{\mathbf {1}}}\right] + \hat{s^2_e} \end{aligned}$$
(17)
where the model variance \(\sigma ^2\) is given by,
$$\begin{aligned} \sigma ^2 = \frac{({\mathbf {y}}-{\mathbf {1}}{\hat{\mu }})^T {\varvec{\Psi }}^{{-\mathbf{1}}}({\mathbf {y}}-{\mathbf {1}}{\hat{\mu }})}{n} \end{aligned}$$
(18)
with
n being the number of sampling data points. In Eq.
17, the
\(\hat{s^2_e}\) term is the MSE of the error metamodel, essentially expressing the uncertainty that arises from the LF tool correction model. The MSE quantity is typically provided by Kriging models through Gaussian-based correlation matrix operations, as above. Therefore, \(\hat{s^2_e}\) is formulated in a straightforward way if a Kriging model is used for the LF error prediction. However, as stated repeatedly in this paper, the optimization framework should have minimal computational costs; as such, we often use an RBF model for the LF error prediction. In this case, by using a Gaussian kernel we can relate the Gram matrix to the correlation matrix used in Kriging models. By comparing Eqs. 4 and 6, we observe that the Gram matrix is identical to the correlation matrix in the special case where: (1) the training data points are correlated with each other in the same way along all design coordinates (isotropic model), and (2) this correlation coincides with the correlation defined by the Gaussian RBF kernel through the constant shape parameters \(\theta\) and p. Namely, when \(\theta = \theta _1= \theta _2 = \cdots = \theta _d\) and \(p=2\), the following holds,
$$\begin{aligned} {\varvec{\Psi }} = {\mathbf {R}} \end{aligned}$$
(19)
This essentially implies that the RBF model used to express the error of the LF tool is equivalent to a Kriging model with the aforementioned correlation characteristics. Therefore, we can use it to predict a distribution of the MSE, under exactly the same “assumptions” as the RBF interpolation model (that is, an isotropic shape parameter). As such,
\(\hat{s^2_e}\) can now be calculated as,
$$\begin{aligned} \hat{s^2_e} = \sigma ^2 \left[ 1 - \varvec{\psi }^T {\varvec{\Psi }}^{{-\mathbf{1}}} \varvec{\psi } + \frac{\left( 1 - {\mathbf {1}}^T {\varvec{\Psi }}^{{-\mathbf{1}}} \varvec{\psi }\right) ^2 }{{\mathbf {1}}^T {\varvec{\Psi }}^{{-\mathbf{1}}}{\mathbf {1}}}\right] = \sigma _{e}^2 \left[ 1 - \varvec{r}^T {\mathbf {R}}^{-{\mathbf{1}}} \varvec{r} + \frac{\left( 1 - {\mathbf {1}}^T {\mathbf {R}}^{-{\mathbf{1}}} \varvec{r}\right) ^2 }{{\mathbf {1}}^T {\mathbf {R}}^{-{\mathbf{1}}}{\mathbf {1}}}\right] \end{aligned}$$
(20)
where the variance of the error model
\(\sigma _{e}^2\) can be calculated using Eqs. 17 and 18, by setting
\({\varvec{\Psi }} = {\mathbf {R}}\) and
\(\varvec{\psi }=\varvec{r}\).
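Under the isotropic Gaussian setting of Eq. 19, the error-model MSE of Eq. 20 can be sketched as follows; the Cholesky solve of Eq. 3 is included, and all data are synthetic stand-ins for HF sample errors.

```python
import numpy as np

def gram_row(x, X, theta):
    # Gaussian kernel of Eq. 4 with p = 2 and a constant (isotropic) theta.
    return np.exp(-theta * np.sum((X - x) ** 2, axis=-1))

def rbf_error_fit(X_hf, e, theta, nugget=1e-10):
    n = len(e)
    R = np.array([gram_row(xi, X_hf, theta) for xi in X_hf]) + nugget * np.eye(n)
    L = np.linalg.cholesky(R)
    # Eq. 3 solved by Cholesky decomposition and back-substitution.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, e))
    Ri = np.linalg.inv(R)
    one = np.ones(n)
    mu_e = one @ Ri @ e / (one @ Ri @ one)
    sig2_e = (e - mu_e) @ Ri @ (e - mu_e) / n        # Eq. 18 with Psi = R
    return alpha, Ri, sig2_e

def rbf_error_mse(x, X_hf, Ri, sig2_e, theta):
    # Eq. 20: Kriging-type MSE evaluated with the Gram matrix R.
    r = gram_row(x, X_hf, theta)
    one = np.ones(len(X_hf))
    return sig2_e * (1.0 - r @ Ri @ r
                     + (1.0 - one @ Ri @ r) ** 2 / (one @ Ri @ one))
```

The prediction of the LF error at a new point is then simply `gram_row(x, X_hf, theta) @ alpha`, which interpolates the sampled errors.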
In the generic case where a Kriging model has \(\theta _1 \ne \theta _2 \ne \cdots \ne \theta _d\), its predicted MSE distribution will deviate from the one predicted by the RBF model. Nevertheless, such a prediction disagreement is of exactly the same nature as the disagreement between an RBF and a trained Kriging value predictor.
The importance of Eqs. 17 and 20 lies in the fact that they restore a zero MSE value at the HF data points and a non-zero finite value at the LF ones. Apart from performing regression on the LF data according to their predicted errors, and interpolation through the HF data, the method can now distinguish between the uncertainty characteristics of the LF and HF points in terms of MSE, and eventually of the EI (which is our value of interest). The loss of the LF/HF data correlation (as exploited in Co-Kriging) is compensated by the reduction of the overall model training costs.
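Putting the pieces together, a point evaluation of the MF modKriging quantities reduces to two additions; the sketch below assumes the ordinary Kriging predictor and MSE built on the LF data (Eqs. 14-16) and the error-surrogate quantities are already available as plain numbers.

```python
def mf_modkriging_point(y_krig, s2_krig, e_hat, s2_e):
    # Eq. 15: Kriging prediction of the LF data, corrected by the error model.
    y_mf = y_krig + e_hat
    # Eq. 17: Kriging MSE augmented with the error-model MSE.
    s2_mf = s2_krig + s2_e
    return y_mf, s2_mf
```

The pair `(y_mf, s2_mf)` is what enters the EI expression of Eq. 10 in place of the single-fidelity predictor and MSE, yielding the MF EI.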