An efficient methodology for modeling complex computer codes with Gaussian processes

https://doi.org/10.1016/j.csda.2008.03.026

Abstract

Complex computer codes are often too time-consuming to be used directly for uncertainty propagation studies, global sensitivity analysis or optimization problems. A well-known and widely used way around this difficulty is to replace the complex computer code by a reduced model, called a metamodel or response surface, that represents the computer code at an acceptable computational cost. One particular class of metamodels is studied: the Gaussian process model, characterized by its mean and covariance functions. A specific estimation procedure is developed to fit a Gaussian process model in complex situations (non-linear relations, highly dispersed or discontinuous output, high-dimensional input, inadequate sampling designs, etc.). The efficiency of this algorithm is compared to that of other existing algorithms on an analytical test case. The proposed methodology is also illustrated on a complex hydrogeological computer code simulating radionuclide transport in groundwater.

Introduction

With the advent of computing technology and numerical methods, the investigation of computer code experiments remains an important challenge. Complex computer models calculate several output values (scalars or functions) that can depend on a large number of input parameters and physical variables. These computer models are used for simulation as well as for prediction or sensitivity studies. Importance measures of each uncertain input variable on the response variability help the analyst better understand the model and reduce the response uncertainties most effectively (Saltelli et al., 2000, Kleijnen, 1997, Helton et al., 2006).

However, complex computer codes are often too time-consuming to be used directly for uncertainty propagation studies or global sensitivity analysis based on Monte Carlo methods. To avoid prohibitive computation times, it can be useful to replace the complex computer code by a mathematical approximation, called a response surface, a surrogate model or a metamodel. The response surface method (Box and Draper, 1987) consists in constructing a function that simulates the behavior of real phenomena in the variation range of the influential parameters, starting from a certain number of experiments. In the same spirit, methods have been developed to build surrogates for long-running computer codes (Sacks et al., 1989, Osio and Amon, 1996, Kleijnen and Sargent, 2000, Fang et al., 2006). Several metamodels are classically used: polynomials, splines, generalized linear models, or learning statistical models such as neural networks, support vector machines, etc. (Hastie et al., 2002, Fang et al., 2006).

For sensitivity analysis and uncertainty propagation, it would be useful to obtain an analytic predictor formula for a metamodel. Indeed, an analytical formula often allows the direct calculation of sensitivity indices or output uncertainties. Moreover, engineers and physicists prefer interpretable models that give some understanding of the simulated physical phenomena and parameter interactions. Some metamodels, such as polynomials (Jourdan and Zabalza-Mezghani, 2004, Kleijnen, 2005, Iooss et al., 2006), are easily interpretable but not always very efficient. Others, for instance neural networks (Alam et al., 2004, Fang et al., 2006), are more efficient but do not provide an analytic predictor formula.

The kriging method (Matheron, 1970, Cressie, 1993) has been developed for spatial interpolation problems; it takes into account the spatial statistical structure of the estimated variable. Sacks et al. (1989) have extended the kriging principles to computer experiments by considering the correlation between two responses of a computer code depending on the distance between input variables. The kriging model (also called the Gaussian process model), characterized by its mean and covariance functions, presents several advantages, especially the interpolation and interpretability properties. Moreover, numerous authors (for example, Currin et al. (1991), Santner et al. (2003) and Vazquez et al. (2005)) show that this model can provide a statistical framework to compute an efficient predictor of code response.

From a practical standpoint, constructing a Gaussian process model implies estimating several hyperparameters included in the covariance function. This optimization problem is particularly difficult for a model with many inputs and inadequate sampling designs (Fang et al., 2006, O’Hagan, 2006). In this paper, a special estimation procedure is developed to fit a Gaussian process model in complex cases (non-linear relations, highly dispersed output, high-dimensional input, inadequate sampling designs). Our procedure couples parameter estimation with an essential input parameter selection step. Note that we do not deal with the design of experiments in computer code simulations (i.e. choosing values of input parameters). Indeed, we work on data obtained in a previous study (the hydrogeological model of Volkova et al. (2008)) and try to adapt a Gaussian process model as well as possible to a non-optimal sampling design. In summary, this study has two main objectives: developing a methodology to implement and adapt a Gaussian process model to complex data, and studying its prediction capabilities.

The next section briefly explains Gaussian process modeling, from the theoretical expression to the predictor formulation and model parameterization. In Section 3, a parameter estimation procedure is introduced from the numerical standpoint and a global methodology for implementing Gaussian process modeling is presented. Section 4 is devoted to applications. First, the algorithm's efficiency is compared to that of other algorithms on an analytical test case. Secondly, the algorithm is applied to a data set (20 inputs and 20 outputs) coming from a hydrogeological transport model based on water flow and diffusion dispersion equations. The last section provides some possible extensions and concluding remarks.

Section snippets

Theoretical model

Let us consider n realizations of a computer code. Each realization y(x) of the computer code output corresponds to a d-dimensional input vector x = (x_1, …, x_d). The n points corresponding to the code runs are called an experimental design and are denoted Xs = (x^(1), …, x^(n)). The outputs are denoted Ys = (y^(1), …, y^(n)) with y^(i) = y(x^(i)), i = 1, …, n. Gaussian process (Gp) modeling treats the deterministic response y(x) as a realization of a random function Y(x), including a regression part and a
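As a rough illustration of these notations, the sketch below evaluates a zero-mean simple-kriging predictor with an anisotropic Gaussian covariance on a small design; the paper's model also carries a regression mean part, which is omitted here, and the design, output function and correlation parameters theta are made up for the example:

```python
import numpy as np

def gauss_cov(XA, XB, theta, sigma2=1.0):
    # Anisotropic Gaussian (squared-exponential) covariance:
    # k(x, x') = sigma2 * exp(-sum_j theta_j * (x_j - x'_j)^2)
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2 * theta).sum(axis=2)
    return sigma2 * np.exp(-d2)

def gp_predict(Xs, Ys, Xnew, theta, nugget=1e-10):
    # Zero-mean simple-kriging predictor: yhat(x) = k(x, Xs) K^{-1} Ys,
    # with a tiny nugget on the diagonal for numerical stability.
    K = gauss_cov(Xs, Xs, theta) + nugget * np.eye(len(Xs))
    return gauss_cov(Xnew, Xs, theta) @ np.linalg.solve(K, Ys)

rng = np.random.default_rng(0)
Xs = rng.uniform(size=(20, 2))               # experimental design, n = 20, d = 2
Ys = np.sin(3 * Xs[:, 0]) + Xs[:, 1] ** 2    # outputs y^(i) = y(x^(i))
theta = np.array([30.0, 30.0])               # made-up correlation hyperparameters
yhat = gp_predict(Xs, Ys, Xs, theta)         # predict back at the design points
```

At the design points the predictor reproduces Ys up to numerical precision, which is the exact-interpolation property of the Gp predictor.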

Modeling methodology

Let us first detail the procedure used to validate our model. Since the Gp predictor is an exact interpolator (except when a nugget effect is included), residuals of the learning data cannot be used directly. So, to estimate the mean squared error in a non-optimistic way, we use either a K-fold cross validation procedure (Hastie et al., 2002) or a test sample (consisting of new data, unused in the building process of the Gp model). In both cases, the predictivity coefficient Q2 is computed. Q2
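A minimal sketch of this validation step, assuming the usual definition Q2 = 1 - (sum of squared prediction errors) / (total sum of squares around the output mean); the K-fold splitter and the fit_predict callable are illustrative, not the paper's implementation:

```python
import numpy as np

def q2(y_true, y_pred):
    # Predictivity coefficient: 1 for perfect predictions,
    # 0 for a predictor no better than the output mean.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def kfold_q2(X, y, fit_predict, K=5, seed=0):
    # K-fold cross validation: each fold is predicted by a model fitted on
    # the remaining folds, so the residuals are not optimistically biased
    # by the exact-interpolation property of the Gp predictor.
    idx = np.random.default_rng(seed).permutation(len(y))
    y_pred = np.empty(len(y))
    for fold in np.array_split(idx, K):
        train = np.setdiff1d(idx, fold)
        y_pred[fold] = fit_predict(X[train], y[train], X[fold])
    return q2(y, y_pred)
```

Here fit_predict(X_train, y_train, X_test) stands for any metamodel fitting routine returning predictions on X_test, e.g. the Gp predictor above or a simple linear regression.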

Analytical test case

First, an analytical function called the g-function of Sobol is used to illustrate and justify our methodology. The g-function of Sobol is defined for d inputs uniformly distributed on [0,1]^d: g_Sobol(X_1, …, X_d) = ∏_{k=1}^{d} g_k(X_k), where g_k(X_k) = (|4X_k - 2| + a_k) / (1 + a_k) and a_k ≥ 0. Due to its complexity (strongly nonlinear and non-monotonic relationship) and the availability of analytical sensitivity indices, the g-function of Sobol is a well-known test example in studies of global sensitivity analysis algorithms (
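The g-function and its first-order Sobol indices can be written compactly; the index formula below is the standard analytical result for this function (V_k = 1/(3(1+a_k)^2), V = ∏(1+V_k) - 1), and the coefficient values in the usage line are arbitrary:

```python
import numpy as np

def g_sobol(X, a):
    # g(X) = prod_k (|4 X_k - 2| + a_k) / (1 + a_k),  X in [0,1]^d, a_k >= 0.
    # Small a_k makes input k influential; large a_k makes it inert.
    X, a = np.asarray(X, float), np.asarray(a, float)
    return np.prod((np.abs(4 * X - 2) + a) / (1 + a), axis=-1)

def sobol_first_order(a):
    # Analytical first-order sensitivity indices S_k = V_k / V with
    # V_k = 1 / (3 (1 + a_k)^2) and total variance V = prod_k (1 + V_k) - 1.
    a = np.asarray(a, float)
    Vk = 1.0 / (3.0 * (1.0 + a) ** 2)
    return Vk / (np.prod(1.0 + Vk) - 1.0)

S = sobol_first_order(np.array([0.0, 1.0, 9.0]))  # arbitrary coefficients a_k
```

Because the function is a product of interacting terms, the first-order indices sum to less than one, the gap being carried by interaction effects.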

Conclusion

The Gaussian process model presents some real advantages compared to other metamodels: the exact interpolation property, a simple analytical formulation of the predictor, availability of the mean squared error of the predictions, and proven efficiency. The keen interest in this method is attested by the recent monographs of Santner et al. (2003), Fang et al. (2006) and Rasmussen and Williams (2006).

However, for its application to complex industrial problems,

Acknowledgments

This work was supported by the MRIMP project of the “Risk Control Domain” that is managed by CEA/Nuclear Energy Division/Nuclear Development and Innovation Division. We are grateful to the two referees for their comments which significantly improved the paper.

References (31)

  • J.-P. Chilès et al., Geostatistics: Modeling Spatial Uncertainty (1999)
  • N. Cressie, Statistics for Spatial Data (1993)
  • C. Currin et al., Bayesian prediction of deterministic functions with applications to the design and analysis of computer experiments, Journal of the American Statistical Association (1991)
  • K.-T. Fang et al., Design and Modeling for Computer Experiments (2006)
  • T. Hastie et al., The Elements of Statistical Learning (2002)