GaussianProcessTrainer

"Gaussian Processes for Machine Learning" Rasmussen and Williams (2005) provides a well written discussion of Gaussian Processes, and its reading is highly encouraged. Chapters 1-5 cover the topics presented here with far greater detail, depth, and rigor. Furthermore, for a detailed overview on Gaussian Processes that model multiple outputs together (multi-output Gaussian Process or MOGP) we refer the reader to Liu et al. (2018).

The documentation here is meant to give some practical insight for users to begin creating surrogate models with Gaussian Processes.

Given a set of inputs $X=\lbrace{\vec{x}_1, \cdots, \vec{x}_m \rbrace}$ for which we have made observations of the correspond outputs $Y=\lbrace{\vec{y}_1, \cdots, \vec{y}_m \rbrace}$ using the system ( $Y = f(X)$ ). Given another set of inputs $X_\star=\lbrace{\vec{x}_{\star 1}, \cdots, \vec{x}_{\star n} \rbrace}$ we wish to predict the associated outputs $Y_\star=f(X_\star)$ without evaluation of $f (X_\star)$ , which is presumed costly.

Parameter Covariance

In overly simplistic terms, Gaussian Process Modeling is driven by the idea that trials which are "close" in their input parameter space will be "close" in their output space. Closeness in the parameter space is driven by the covariance function $k(\vec{x},\vec{x'})$ (also called a kernel function, not to be confused with a MOOSE Framework kernel). This covariance function is used to generate a covariance matrix between the complete set of parameters $X \cup X_\star = \lbrace {\vec{x}_1, \cdots, \vec{x}_m, \vec{x}_{\star 1}, \cdots, \vec{x}_{\star n} \rbrace}$ , which can then be interpreted block-wise as various covariance matrices between $X$ and $X_\star$ .

\begin{aligned} \mathbf{K}(X \cup X_\star,X \cup X_\star) & = \left[ \begin{array}{ccc|ccc} k(\vec{x}_1,\vec{x}_1) & \cdots & k(\vec{x}_1,\vec{x}_m) & k(\vec{x}_{1},\vec{x}_{\star 1}) & \cdots & k(\vec{x}_{1},\vec{x}_{\star n}) \\ \vdots & & \vdots & \vdots & & \vdots \\ (\vec{x}_m,\vec{x}_1) & \cdots & k(\vec{x}_m,\vec{x}_m) & k(\vec{x}_{m},\vec{x}_{\star 1}) & \cdots & k(\vec{x}_{m},\vec{x}_{\star n}) \\ \hline k(\vec{x}_{\star 1},\vec{x}_{1}) & \cdots & k(\vec{x}_{\star 1},\vec{x}_{m}) & k(\vec{x}_{\star 1},\vec{x}_{\star 1}) & \cdots & k(\vec{x}_{\star 1},\vec{x}_{\star n}) \\ \vdots & & \vdots & \vdots & & \vdots \\ k(\vec{x}_{\star n},\vec{x}_{1}) & \cdots & k(\vec{x}_{\star n},\vec{x}_{m}) & k(\vec{x}_{\star n},\vec{x}_{\star 1}) & \cdots & k(\vec{x}_{\star n},\vec{x}_{\star n}) \end{array} \right] \\ & =\left[ \begin{array}{c|c} \mathbf{K}(X,X) & \mathbf{K}(X,X_\star) \\ \hline \mathbf{K}(X_\star,X) & \mathbf{K}(X_\star,X_\star) \end{array} \right] \\ & =\left[ \begin{array}{c|c} \mathbf{K} & \mathbf{K}_\star \\ \hline \mathbf{K}_\star^T & \mathbf{K}_{\star \star} \end{array} \right] \end{aligned}

The Gaussian Process Model consists of an infinite collection of functions, all of which agree with the training/observation data. Importantly the collection has closed forms for 2nd order statistics (mean and variance). When used as a surrogate, the nominal value is chosen to be the mean value. The method can be broken down into two step: definition of the prior distribution then conditioning on observed data.

Gaussian processes

A Gaussian Process is a (potentially infinite) collection of random variables, such that the joint distribution of every finite selection of random variables from the collection is a Gaussian distribution.

\mathcal{GP}(\mu(\vec{x}),k(\vec{x},\vec{x'}))

In an analogous way that a multivariate Gaussian is completely defined by its mean vector and its covariance matrix, a Gaussian Process is completely defined by its mean function and covariance function.

The (potentially) infinite number of random variables within the Gaussian Process correspond to the (potentially) infinite points in the parameter space our surrogate can be evaluated at.

Prior distribution:

We assume the observations (both training and testing) are pulled from an $m+n$ multivariate Gaussian distribution. The covariance matrix $\Sigma$ is the result of the choice of covariance function.

Y \cup Y_\star \sim \mathcal{N}(\mu,\Sigma)

Note that $\mu$ and $\Sigma$ are a vector and matrix respectively, and are a result of the mean and covariance functions applied to the sample points.

Zero Mean Assumption: Discussions of Gaussian Process are typically presented under assumption that $\mu=0$ . This occurs without loss of generality since any sample can be made $\mu=0$ by subtracting the sample mean (or a variety of other preprocessing options). Note that in a training\testing paradigm, the testing data $Y_\star$ is unknown, so determination of what to use as $\mu$ is based on the information from the training data $Y$ (or some other prior assumption).

Conditioning:

With the prior formed as above, conditioning on the available training data $Y$ is performed. This alters the mean and variance to new values $\mu_\star$ and $\Sigma_\star$ , restricting the set of possible functions which agree with the training data.

\begin{aligned} \mu_\star &= \mu + \mathbf{K}_\star \mathbf{K}^{-1}(Y-\mu) \\ \Sigma_\star &= \mathbf{K}_{\star \star} - \mathbf{K}_\star^T \mathbf{K}^{-1} \mathbf{K}_\star \end{aligned}

Y_\star \sim \mathcal{N}(\mu_\star ,\Sigma_\star)

When used as a surrogate, the nominal value is typically taken as the mean value, with $diag(\Sigma_\star)$ providing variances which can be used to generate confidence intervals.

Notes on Multi-Output Gaussian Processes (MOGPs)

MOGPs model and predict the outputs which are vectors, each of size $M$ . For any input vector $\vec{x}_i$ , the vector of outputs $\vec{y}_i$ and the matrix of $N$ vectors $Y$ is defined as:

$\begin{aligned} &\vec{y}_i = [f(\vec{x}_i)^1,~f(\vec{x}_i)^2,\dots,~f(\vec{x}_i)^M]^\intercal \\ &Y = [\vec{y}_1,~\vec{y}_2,\dots,~\vec{y}_N]^\intercal \\ \end{aligned}$

where $\vec{y}_i$ is of size $M\times 1$ and $Y$ is of size $N\times M$ . The matrix $Y$ is vectorized and represented as $\hat{\mathbf{y}}$ with size $NM \times 1$ . $\hat{\mathbf{y}}$ is modeled as a Gaussian distribution defined as described in Eq. (1).

In a multi-output Gaussian Process, $\mathbf{K}$ captures covariances across the input variables and the vector of outputs and hence has size $NM \times NM$ . $\mathbf{K}$ can be modeled in several ways as discussed in (Liu et al., 2018; Alvarez et al., 2012). We will follow the linear model of coregionalization (LMC) which introduces latent functions with restrictons on the associated covariances.

Common Hyperparameters

While the only apparent decision in the above formulation is the choice of covariance function, most covariance functions will contain hyperparameters of some form which need to be selected in some manner. While each covariance function will have its own set of hyperparameters, a few hyperparameters of specific forms are present in many common covariance functions.

Length Factor $\ell$ or $\vec{\ell}$

Frequently Kernels consider the distance between two input parameters $\vec{x}$ and $\vec{x}^\prime$ . For system of only a single parameter this distance often takes the form of

\frac{|x - x^\prime|}{\ell}.

In this form the $\ell$ factor set a relevant length scale for the distance measurements.

When multiple input parameters are to be considered, it may be advantageous to specify $n$ different length scales for each of the $n$ parameters, resulting in a vector $\vec{\ell}$ . For example distance may be calculated as

\sqrt{ \sum_{i=1}^n \left( \frac{x_i - x^\prime_i}{\ell_i} \right)^2}.

When used with standardized parameters, $\ell$ can be interpreted in units of standard deviation for the relevant parameter.

Signal Variance $\sigma_f^2$

This serves as an overall scaling parameter. Given a covariance function $\tilde{k}$ (which is not a function of $\sigma_f^2$ ), the multiplication of $\sigma_f^2$ yields a new valid covariance function.

k(x,x^\prime,\sigma_f) = \sigma_f^2 \, \tilde{k}(x,x^\prime)

This multiplication can also be pulled out of the covariance matrix formation, and simply multiply the matrix formed by $\tilde{k}$

\mathbf{K}(x,x^\prime,\sigma_f) = \sigma_f^2 \, \tilde{\mathbf{K}}(x,x^\prime)

Noise Variance $\sigma_n^2$

The $\sigma_n^2$ represents noise in the collected data, and is as a additional $\sigma_n^2$ factor on the variance terms (when $x=x^\prime$ ).

k(x,x^\prime,\sigma_f, \sigma_n) = \sigma_f^2 \, \tilde{k}(x,x^\prime) + \sigma_n^2 \, \delta_{x,x^\prime}

In the matrix representation this adds a factor of $\sigma_n^2$ to diagonal of the noiseless matrix $\tilde{\mathbf{K}}$

\mathbf{K}(x,x^\prime,\sigma_f, \sigma_n) = \sigma_f^2 \, \tilde{\mathbf{K}}(x,x^\prime) + \sigma_n^2 \mathbf{I}

Due to the addition of $\sigma_n^2$ along the diagonal of the $K$ matrix, this hyperparameter can aid in the inversion of the covariance matrix. For this reason adding a small amount of $\sigma_n^2$ may be preferable, even when you believe the data to be noise free.

Selected Covariance Functions

Table 1: Selected Covariance Functions

Covariance Function	Description
SquaredExponentialCovariance	Also referred to as a radial basis function (RBF) this is a widely used, general purpose covariance function. Serves as a common starting point for many. Used for single-output GPs
ExponentialCovariance	A simple exponential covariance function. Used for single-output GPs
MaternHalfIntCovariance	Implementation of the Matern class of covariance function, where the $\nu$ parameter takes on half-integer values. Used for single-output GPs
LMC	Covariance function built using the linear model of coregionalization. Used for multi-output GPs

Hyperparameter tuning options

The following options are available for tuning the hyperparameters:

Adaptive moment estimation (Adam)
Relies on the pseudocode provided in Kingma and Ba (2014). Adam permits stochastic optimization, wherein, a batch of the training data can be randomly chosen at each iteration.

The hyper-parameters of the GPs are inferred by optimizing the log-likelihood function. The MOGP log-likelihood function has a general form:

$\mathcal{L} = -\frac{1}{2}~\ln |\mathbf{K}| - \frac{1}{2}~\hat{\mathbf{y}}^T~\mathbf{K}^{-1}~\hat{\mathbf{y}}-\frac{1}{2}~N~\ln(2\pi)$

The optimization of GPs can be expensive. If there are $N$ training points each with $M$ outputs, each training iteration of Adam has a cost of $\mathcal{O} (M^3N^3)$ . Adam permits using $n < N$ random training points during each iteration (mini-batches) which has a cost of $\mathcal{O}(M^3n^3)<<\mathcal{O}(M^3N^3)$ .

Input Parameters

covariance_functionName of covariance function.
C++ Type:UserObjectName
Unit:(no unit assumed)
Controllable:No
Description:Name of covariance function.
responseReporter value of response results, can be vpp with / or sampler column with 'sampler/col_'.
C++ Type:ReporterName
Unit:(no unit assumed)
Controllable:No
Description:Reporter value of response results, can be vpp with / or sampler column with 'sampler/col_'.
samplerSampler used to create predictor and response data.
C++ Type:SamplerName
Unit:(no unit assumed)
Controllable:No
Description:Sampler used to create predictor and response data.

Required Parameters

batch_size0The batch size for Adam optimization
Default:0
C++ Type:unsigned int
Unit:(no unit assumed)
Controllable:No
Description:The batch size for Adam optimization
converged_reporterReporter value used to determine if a sample's multiapp solve converged.
C++ Type:ReporterName
Unit:(no unit assumed)
Controllable:No
Description:Reporter value used to determine if a sample's multiapp solve converged.
cv_n_trials1Number of repeated trials of cross-validation to perform.
Default:1
C++ Type:unsigned int
Unit:(no unit assumed)
Controllable:No
Description:Number of repeated trials of cross-validation to perform.
cv_seed4294967295Seed used to initialize random number generator for data splitting during cross validation.
Default:4294967295
C++ Type:unsigned int
Unit:(no unit assumed)
Controllable:No
Description:Seed used to initialize random number generator for data splitting during cross validation.
cv_splits10Number of splits (k) to use in k-fold cross-validation.
Default:10
C++ Type:unsigned int
Unit:(no unit assumed)
Controllable:No
Description:Number of splits (k) to use in k-fold cross-validation.
cv_surrogateName of Surrogate object used for model cross-validation.
C++ Type:UserObjectName
Unit:(no unit assumed)
Controllable:No
Description:Name of Surrogate object used for model cross-validation.
cv_typenoneCross-validation method to use for dataset. Options are 'none' or 'k_fold'.
Default:none
C++ Type:MooseEnum
Unit:(no unit assumed)
Options:none, k_fold
Controllable:No
Description:Cross-validation method to use for dataset. Options are 'none' or 'k_fold'.
execute_onTIMESTEP_ENDThe list of flag(s) indicating when this object should be executed. For a description of each flag, see https://mooseframework.inl.gov/source/interfaces/SetupInterface.html.
Default:TIMESTEP_END
C++ Type:ExecFlagEnum
Unit:(no unit assumed)
Options:NONE, INITIAL, LINEAR, NONLINEAR_CONVERGENCE, NONLINEAR, POSTCHECK, TIMESTEP_END, TIMESTEP_BEGIN, MULTIAPP_FIXED_POINT_END, MULTIAPP_FIXED_POINT_BEGIN, FINAL, CUSTOM
Controllable:No
Description:The list of flag(s) indicating when this object should be executed. For a description of each flag, see https://mooseframework.inl.gov/source/interfaces/SetupInterface.html.
filenameThe name of the file which will be associated with the saved/loaded data.
C++ Type:FileName
Unit:(no unit assumed)
Controllable:No
Description:The name of the file which will be associated with the saved/loaded data.
learning_rate0.001The learning rate for Adam optimization
Default:0.001
C++ Type:double
Unit:(no unit assumed)
Controllable:No
Description:The learning rate for Adam optimization
num_iters1000Tolerance value for Adam optimization
Default:1000
C++ Type:unsigned int
Unit:(no unit assumed)
Controllable:No
Description:Tolerance value for Adam optimization
predictor_colsSampler columns used as the independent random variables, If 'predictors' and 'predictor_cols' are both empty, all sampler columns are used.
C++ Type:std::vector<unsigned int>
Unit:(no unit assumed)
Controllable:No
Description:Sampler columns used as the independent random variables, If 'predictors' and 'predictor_cols' are both empty, all sampler columns are used.
predictorsReporter values used as the independent random variables, If 'predictors' and 'predictor_cols' are both empty, all sampler columns are used.
C++ Type:std::vector<ReporterName>
Unit:(no unit assumed)
Controllable:No
Description:Reporter values used as the independent random variables, If 'predictors' and 'predictor_cols' are both empty, all sampler columns are used.
prop_getter_suffixAn optional suffix parameter that can be appended to any attempt to retrieve/get material properties. The suffix will be prepended with a '_' character.
C++ Type:MaterialPropertyName
Unit:(no unit assumed)
Controllable:No
Description:An optional suffix parameter that can be appended to any attempt to retrieve/get material properties. The suffix will be prepended with a '_' character.
response_typerealResponse data type.
Default:real
C++ Type:MooseEnum
Unit:(no unit assumed)
Options:real, vector_real
Controllable:No
Description:Response data type.
show_every_nth_iteration0Switch to show Adam optimization loss values at every nth step. If 0, nothing is showed.
Default:0
C++ Type:unsigned int
Unit:(no unit assumed)
Controllable:No
Description:Switch to show Adam optimization loss values at every nth step. If 0, nothing is showed.
skip_unconverged_samplesFalseTrue to skip samples where the multiapp did not converge, 'stochastic_reporter' is required to do this.
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:True to skip samples where the multiapp did not converge, 'stochastic_reporter' is required to do this.
standardize_dataTrueStandardize (center and scale) training data (y values)
Default:True
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:Standardize (center and scale) training data (y values)
standardize_paramsTrueStandardize (center and scale) training parameters (x values)
Default:True
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:Standardize (center and scale) training parameters (x values)
tune_parametersSelect hyperparameters to be tuned
C++ Type:std::vector<std::string>
Unit:(no unit assumed)
Controllable:No
Description:Select hyperparameters to be tuned
tuning_maxMaximum allowable tuning value
C++ Type:std::vector<double>
Unit:(no unit assumed)
Controllable:No
Description:Maximum allowable tuning value
tuning_minMinimum allowable tuning value
C++ Type:std::vector<double>
Unit:(no unit assumed)
Controllable:No
Description:Minimum allowable tuning value
use_interpolated_stateFalseFor the old and older state use projected material properties interpolated at the quadrature points. To set up projection use the ProjectedStatefulMaterialStorageAction.
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:For the old and older state use projected material properties interpolated at the quadrature points. To set up projection use the ProjectedStatefulMaterialStorageAction.

Optional Parameters

allow_duplicate_execution_on_initialFalseIn the case where this UserObject is depended upon by an initial condition, allow it to be executed twice during the initial setup (once before the IC and again after mesh adaptivity (if applicable).
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:In the case where this UserObject is depended upon by an initial condition, allow it to be executed twice during the initial setup (once before the IC and again after mesh adaptivity (if applicable).
control_tagsAdds user-defined labels for accessing object parameters via control logic.
C++ Type:std::vector<std::string>
Unit:(no unit assumed)
Controllable:No
Description:Adds user-defined labels for accessing object parameters via control logic.
enableTrueSet the enabled status of the MooseObject.
Default:True
C++ Type:bool
Unit:(no unit assumed)
Controllable:Yes
Description:Set the enabled status of the MooseObject.
execution_order_group0Execution order groups are executed in increasing order (e.g., the lowest number is executed first). Note that negative group numbers may be used to execute groups before the default (0) group. Please refer to the user object documentation for ordering of user object execution within a group.
Default:0
C++ Type:int
Unit:(no unit assumed)
Controllable:No
Description:Execution order groups are executed in increasing order (e.g., the lowest number is executed first). Note that negative group numbers may be used to execute groups before the default (0) group. Please refer to the user object documentation for ordering of user object execution within a group.
force_postauxFalseForces the UserObject to be executed in POSTAUX
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:Forces the UserObject to be executed in POSTAUX
force_preauxFalseForces the UserObject to be executed in PREAUX
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:Forces the UserObject to be executed in PREAUX
force_preicFalseForces the UserObject to be executed in PREIC during initial setup
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:Forces the UserObject to be executed in PREIC during initial setup
use_displaced_meshFalseWhether or not this object should use the displaced mesh for computation. Note that in the case this is true but no displacements are provided in the Mesh block the undisplaced mesh will still be used.
Default:False
C++ Type:bool
Unit:(no unit assumed)
Controllable:No
Description:Whether or not this object should use the displaced mesh for computation. Note that in the case this is true but no displacements are provided in the Mesh block the undisplaced mesh will still be used.

Advanced Parameters

References

M. A. Alvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

@article{Alvarez2012gp,
    author = "Alvarez, M. A. and Rosasco, L. and Lawrence, N. D.",
    title = "Kernels for vector-valued functions: A review",
    journal = "Foundations and Trends in Machine Learning",
    volume = "4",
    number = "3",
    pages = "195--266",
    year = "2012"
}

D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. arXiv:1412.6980 [cs.LG], 2014. URL: https://doi.org/10.48550/arXiv.1412.6980.

@article{kingma2014adam,
    author = "Kingma, D. P. and Ba, J.",
    title = "Adam: A Method for Stochastic Optimization",
    journal = "arXiv:1412.6980 [cs.LG]",
    year = "2014",
    url = "https://doi.org/10.48550/arXiv.1412.6980"
}

H. Liu, J. Cai, and Y. S. Ong. Remarks on multi-output gaussian process regression. Knowledge-Based Systems, 144:102–112, 2018.

@article{Liu2018gp,
    author = "Liu, H. and Cai, J. and Ong, Y. S.",
    title = "Remarks on multi-output Gaussian process regression",
    journal = "Knowledge-Based Systems",
    volume = "144",
    pages = "102--112",
    year = "2018"
}

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005. ISBN 026218253X.

@book{rasmussen2005gaussian,
    author = "Rasmussen, Carl Edward and Williams, Christopher K. I.",
    title = "Gaussian Processes for Machine Learning",
    year = "2005",
    isbn = "026218253X",
    publisher = "The MIT Press"
}

Common Hyperparameters
Selected Covariance Functions
Hyperparameter tuning options
Input Parameters
References