Creating a Surrogate Model

This example goes through the process of creating a custom surrogate model, in this case the creation of NearestPointSurrogate.

Overview

Building a surrogate model requires the creation of two objects: a SurrogateTrainer and a SurrogateModel. The SurrogateTrainer uses information from samplers and results to construct variables that are saved into a .rd file at the conclusion of the training run. The SurrogateModel object loads the data from the .rd file and contains a function called evaluate that evaluates the surrogate model at a given input. The SurrogateTrainer and SurrogateModel are heavily tied together: each has the same member variables, the difference being that one saves the data and the other loads it. It might be beneficial to have an interface class that contains common functions for training and evaluating, to avoid duplicated code. This example will not go into the creation of this interface class.
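
As a rough sketch of this pairing (not taken from the repository; the class names MyTrainer and MySurrogate and the member _coeff are hypothetical, and registration and validParams are omitted), a trainer declares its data with declareModelData and the matching surrogate retrieves the same-named data with getModelData:

#include "SurrogateTrainer.h"
#include "SurrogateModel.h"

// Hypothetical trainer: declares the data that is written to the .rd file.
class MyTrainer : public SurrogateTrainer
{
public:
  MyTrainer(const InputParameters & parameters)
    : SurrogateTrainer(parameters),
      // The string key "_coeff" must match the key used by the surrogate.
      _coeff(declareModelData<std::vector<Real>>("_coeff"))
  {
  }

protected:
  virtual void train() override { /* fill _coeff from the training data */ }

  /// Writable reference; its contents are saved at the end of training
  std::vector<Real> & _coeff;
};

// Hypothetical surrogate: loads the same data back and evaluates the model.
class MySurrogate : public SurrogateModel
{
public:
  MySurrogate(const InputParameters & parameters)
    : SurrogateModel(parameters),
      _coeff(getModelData<std::vector<Real>>("_coeff"))
  {
  }

  Real evaluate(const std::vector<Real> & x) const
  {
    return 0.0; // a real implementation would combine _coeff with the input x
  }

protected:
  /// Read-only reference to the loaded data
  const std::vector<Real> & _coeff;
};

The duplicated _coeff member in both classes is exactly the kind of repetition the interface class mentioned above could remove.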

Creating a Trainer

This example will go over the creation of NearestPointTrainer. Trainers are derived from SurrogateTrainer, which performs a loop over the training data and calls virtual functions that derived classes are meant to override to perform the proper training.

This example makes use of SurrogateTrainer input parameters, API functions, and protected member variables for gathering and accessing common types of training data. For further details on these features, see Trainers System.

validParams

The trainer requires the input of a sampler so that it knows how many data points are included and how they are distributed across processors. The trainer also needs the predictor and response values from the full-order model, which are stored in a vector postprocessor or reporter.

InputParameters
SurrogateTrainer::validParams()
{
  InputParameters params = SurrogateTrainerBase::validParams();
  params.addRequiredParam<SamplerName>("sampler",
                                       "Sampler used to create predictor and response data.");
  params.addParam<ReporterName>(
      "converged_reporter",
      "Reporter value used to determine if a sample's multiapp solve converged.");
  params.addParam<bool>("skip_unconverged_samples",
                        false,
                        "True to skip samples where the multiapp did not converge, "
                        "'stochastic_reporter' is required to do this.");

  // Common Training Data
  MooseEnum data_type("real=0 vector_real=1", "real");
  params.addRequiredParam<ReporterName>(
      "response",
      "Reporter value of response results, can be vpp with <vpp_name>/<vector_name> or sampler "
      "column with 'sampler/col_<index>'.");
  params.addParam<MooseEnum>("response_type", data_type, "Response data type.");
  params.addParam<std::vector<ReporterName>>(
      "predictors",
      std::vector<ReporterName>(),
      "Reporter values used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  params.addParam<std::vector<unsigned int>>(
      "predictor_cols",
      std::vector<unsigned int>(),
      "Sampler columns used as the independent random variables, If 'predictors' and "
      "'predictor_cols' are both empty, all sampler columns are used.");
  // End Common Training Data

  MooseEnum cv_type("none=0 k_fold=1", "none");
  params.addParam<MooseEnum>(
      "cv_type",
      cv_type,
      "Cross-validation method to use for dataset. Options are 'none' or 'k_fold'.");
  params.addRangeCheckedParam<unsigned int>(
      "cv_splits", 10, "cv_splits > 1", "Number of splits (k) to use in k-fold cross-validation.");
  params.addParam<UserObjectName>("cv_surrogate",
                                  "Name of Surrogate object used for model cross-validation.");
  params.addParam<unsigned int>(
      "cv_n_trials", 1, "Number of repeated trials of cross-validation to perform.");
  params.addParam<unsigned int>("cv_seed",
                                std::numeric_limits<unsigned int>::max(),
                                "Seed used to initialize random number generator for data "
                                "splitting during cross validation.");

  return params;
}
(contrib/moose/modules/stochastic_tools/src/trainers/SurrogateTrainer.C)
InputParameters
NearestPointTrainer::validParams()
{
  InputParameters params = SurrogateTrainer::validParams();
  params.addClassDescription("Loops over and saves sample values for [NearestPointSurrogate.md].");

  return params;
}
(contrib/moose/modules/stochastic_tools/src/trainers/NearestPointTrainer.C)
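
A custom trainer will typically extend these parameters with its own options using the same addParam/addRequiredParam calls shown above. As a hedged sketch (MyTrainer and the "num_neighbors" parameter are hypothetical and not part of NearestPointTrainer):

InputParameters
MyTrainer::validParams()
{
  InputParameters params = SurrogateTrainer::validParams();
  params.addClassDescription("Example trainer illustrating additional input parameters.");
  // Hypothetical option controlling how many nearby samples the model would use
  params.addParam<unsigned int>("num_neighbors", 1, "Number of nearest neighbors to average.");
  return params;
}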

By default, classes inheriting from SurrogateTrainer support Real and std::vector<Real> response types. For responses of these types, a Reporter can be specified using the "response" and "response_type" input parameters. Additionally, child classes can support predictor data given as a combination of Sampler columns and Reporter values; these options are controlled using the "predictors" and "predictor_cols" input parameters.

Classes derived from SurrogateTrainer also support k-fold cross-validation. For details on the input parameters for this capability, see Cross Validation. To ensure compatibility with these features, some extra considerations are needed during implementation of a new Trainer. These will be discussed in the following sections.

Constructor

All trainers are based on SurrogateTrainer, which provides the necessary interface for saving the surrogate model data and gathering the response and predictor data. Any additional data meant to be saved and gathered is defined in the constructor of the training object. In NearestPointTrainer, the variables _sample_points and _sample_results are declared as the necessary surrogate data (see Trainers for more information on declaring model data). Because we will use several default inputs to retrieve training data, we only need to resize these variables for the number of dimensions in the training data. Each processor will contain a portion of the samples and results; we will gather all the samples in postTrain().

NearestPointTrainer::NearestPointTrainer(const InputParameters & parameters)
  : SurrogateTrainer(parameters),
    _sample_points(declareModelData<std::vector<std::vector<Real>>>("_sample_points")),
    _sample_results(declareModelData<std::vector<std::vector<Real>>>("_sample_results")),
    _predictor_row(getPredictorData())
{
  _sample_points.resize(_n_dims);
  _sample_results.resize(1);
}
(contrib/moose/modules/stochastic_tools/src/trainers/NearestPointTrainer.C)

The member variables _sample_points, _sample_results, and _predictor_row are defined in the header file:

/// Map containing sample points and the results
std::vector<std::vector<Real>> & _sample_points;

/// Container for results (y values).
std::vector<std::vector<Real>> & _sample_results;

/// Data from the current predictor row
const std::vector<Real> & _predictor_row;
(contrib/moose/modules/stochastic_tools/include/trainers/NearestPointTrainer.h)

preTrain

preTrain() is called before the sampler loop. This is typically used to initialize variables and allocate memory. For NearestPointTrainer, we will explicitly clear _sample_points and _sample_results. This is done because the implementation of k-fold cross-validation used in SurrogateTrainer requires that preTrain() reset the state of the trainer and clear any essential data related to prior training sets (for further details, see Trainers System).

void
NearestPointTrainer::preTrain()
{
  for (auto & it : _sample_points)
  {
    it.clear();
    it.reserve(getLocalSampleSize());
  }

  for (auto & it : _sample_results)
  {
    it.clear();
    it.reserve(getLocalSampleSize());
  }
}
(contrib/moose/modules/stochastic_tools/src/trainers/NearestPointTrainer.C)

train

train() is where the actual training occurs. This function is called during the sampler loop for each row; at each call, the member variables _row and _local_row, as well as any data gathered with getTrainingData, are updated. The base-class members _rval and _rvecval point to the response data for the Real and std::vector<Real> response types, respectively, so only the one matching the specified "response_type" is valid. In NearestPointTrainer, we will push_back predictor and response data as appropriate. Using push_back is convenient because some samples may be skipped (due to non-convergence, or intentionally during cross-validation).

void
NearestPointTrainer::train()
{
  if (_rvecval && (_sample_results.size() != _rvecval->size()))
    _sample_results.resize(_rvecval->size());

  // Get predictors from reporter values
  for (auto d : make_range(_n_dims))
    _sample_points[d].push_back(_predictor_row[d]);

  // Get responses
  if (_rval)
    _sample_results[0].push_back(*_rval);
  else if (_rvecval)
    for (auto r : make_range(_rvecval->size()))
      _sample_results[r].push_back((*_rvecval)[r]);
}
(contrib/moose/modules/stochastic_tools/src/trainers/NearestPointTrainer.C)
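
If a trainer needs training data beyond the default response and predictors, the base class also provides a templated getTrainingData interface (described in the Trainers System documentation) for gathering an arbitrary Reporter value that is kept up to date during the sampler loop. A hedged sketch of how this might look is below; the "weights" reporter parameter, the members _weight and _sample_weights, and the exact getTrainingData signature are assumptions to be checked against the Trainers System documentation, not part of NearestPointTrainer.

// In the constructor's initialization list of a hypothetical trainer:
//   _weight(getTrainingData<Real>(getParam<ReporterName>("weights")))
//
// The reference then holds the current sampler row's value inside train():
void
MyTrainer::train()
{
  // Store the extra per-sample quantity alongside the other training data.
  _sample_weights.push_back(_weight);
}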

postTrain

postTrain() is called after the sampler loop. This is typically where processor communication happens. Here, we use postTrain() to gather all the local _sample_points and _sample_results so that each processor has a full copy. _communicator.allgather gives every processor a copy of the full array, whereas _communicator.gather places the full copy on only one processor; the latter is typically used when the data is only needed for output, which happens on the root processor. See libMesh::Parallel::Communicator for more communication options.

void
NearestPointTrainer::postTrain()
{
  for (auto & it : _sample_points)
    _communicator.allgather(it);

  for (auto & it : _sample_results)
    _communicator.allgather(it);
}
(contrib/moose/modules/stochastic_tools/src/trainers/NearestPointTrainer.C)
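
For contrast, a trainer that only needs the assembled data on the root processor (for example, purely for output) could use gather instead of allgather. The following is a hedged sketch rather than part of NearestPointTrainer, assuming libMesh's Communicator::gather overload that takes a root rank and a vector:

void
MyTrainer::postTrain()
{
  // Collect each processor's local samples onto rank 0 only; the other ranks
  // keep just their local portion. NearestPointTrainer uses allgather instead,
  // so that every rank holds the full data set when evaluating the surrogate.
  for (auto & it : _sample_points)
    _communicator.gather(0, it);
}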

Creating a Surrogate

This example will go over the creation of NearestPointSurrogate. Surrogates are specialized MooseObjects that must provide the public member function evaluate. The validParams for a surrogate will generally define options for how the surrogate is evaluated; NearestPointSurrogate does not have any such options.

Constructor

In the constructor, the references for the model data are defined, taken from the training data:

NearestPointSurrogate::NearestPointSurrogate(const InputParameters & parameters)
  : SurrogateModel(parameters),
    _sample_points(getModelData<std::vector<std::vector<Real>>>("_sample_points")),
    _sample_results(getModelData<std::vector<std::vector<Real>>>("_sample_results"))
{
}
(contrib/moose/modules/stochastic_tools/src/surrogates/NearestPointSurrogate.C)

See Surrogates for more information on the getModelData function. _sample_points in the surrogate is a const reference, since we do not want to modify the training data during evaluation:

/// Array containing sample points
const std::vector<std::vector<Real>> & _sample_points;
(contrib/moose/modules/stochastic_tools/include/surrogates/NearestPointSurrogate.h)

evaluate

evaluate is a public member function required for all surrogate models. This is where the surrogate model is actually used: evaluate takes in parameter values and returns the surrogate's estimate of the quantity of interest. See EvaluateSurrogate for an example of how the evaluate function is used.

Real
NearestPointSurrogate::evaluate(const std::vector<Real> & x) const
{
  // Check whether input point has same dimensionality as training data
  mooseAssert(_sample_points.size() == x.size(),
              "Input point does not match dimensionality of training data.");

  return _sample_results[0][findNearestPoint(x)];
}
(contrib/moose/modules/stochastic_tools/src/surrogates/NearestPointSurrogate.C)
unsigned int
NearestPointSurrogate::findNearestPoint(const std::vector<Real> & x) const
{
  unsigned int idx = 0;

  // Container of current minimum distance during training sample loop
  Real dist_min = std::numeric_limits<Real>::max();

  for (dof_id_type p = 0; p < _sample_points[0].size(); ++p)
  {
    // Sum over the distance of each point dimension
    Real dist = 0;
    for (unsigned int i = 0; i < x.size(); ++i)
    {
      Real diff = (x[i] - _sample_points[i][p]);
      dist += diff * diff;
    }

    // Check if this training point distance is smaller than the current minimum
    if (dist < dist_min)
    {
      idx = p;
      dist_min = dist;
    }
  }
  return idx;
}
(contrib/moose/modules/stochastic_tools/src/surrogates/NearestPointSurrogate.C)
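
To make the lookup concrete, the following framework-free sketch reproduces the same squared-distance search on a small hand-made data set (the values are invented for illustration). A query point is matched to its nearest training sample and the stored response is returned, which is what evaluate does through findNearestPoint:

#include <iostream>
#include <limits>
#include <vector>

int
main()
{
  // Training data laid out as in the trainer: sample_points[d][p] is dimension d
  // of sample p, and sample_results[0][p] is the corresponding response value.
  std::vector<std::vector<double>> sample_points = {{0.0, 1.0, 2.0},   // first dimension
                                                    {0.0, 1.0, 2.0}};  // second dimension
  std::vector<std::vector<double>> sample_results = {{10.0, 20.0, 30.0}};

  const std::vector<double> x = {1.2, 0.9}; // query point

  std::size_t idx = 0;
  double dist_min = std::numeric_limits<double>::max();
  for (std::size_t p = 0; p < sample_points[0].size(); ++p)
  {
    // Squared Euclidean distance between the query point and training sample p
    double dist = 0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
      const double diff = x[i] - sample_points[i][p];
      dist += diff * diff;
    }
    if (dist < dist_min)
    {
      idx = p;
      dist_min = dist;
    }
  }

  // The query is closest to sample 1, so the "surrogate" prediction is 20.0.
  std::cout << sample_results[0][idx] << std::endl;
  return 0;
}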