Running with Cobaya#

GPry has two ways to interface with the general and cosmological Bayesian inference code Cobaya.

Note

Regardless of the interfacing method, the combination of GPry + Cobaya is MPI-parallelizable!

a) Passing a Cobaya `model` to the GPry `Runner`#

You can pass a Cobaya model as the first argument of the Runner, and run GPry normally.

In this case, GPry will use the prior specified in the Cobaya model, so there is no need to pass parameter bounds to the Runner. Parameter names and labels will also be automatically used for tables and plots.

b) Running Cobaya with GPry as a sampler#

Conversely, you can integrate GPry into a Cobaya-based inference pipeline, including for example calls to a minimizer, or tests with alternative samplers. As as GPry is installed, this is as simple as naming GPry as the sampler of choice in an input .yaml file or an input dictionary:

likelihood:
  gaussian_mixture:
    means: [0.2, 0]
    covs: [[0.1, 0.05], [0.05,0.2]]

params:
  a:
    prior: [-0.5, 3]
    latex: \alpha
  b:
    prior:
      dist: norm
      loc: 0
      scale: 1
    ref: 0
    latex: \beta

sampler:
  gpry:

output: chains/example

A call to the Cobaya run function or cobaya-run shell command will run the GPry’s learning loop and, after convergence or exhaustion of the evaluation budget, will run an MC sample of the surrogate model, and return/write it in the same way as other Cobaya-interfaced samplers:

As a Cobaya SampleCollection using the samples or products method of the Cobaya sampler wrapper (if in a script or a notebook).
Written into a [prefix].1.txt [TODO: CHECK SUFFIX] file if calling cobaya-run from the command line.

Note

As with other Cobaya samplers, you can resume a run with the -r/--resume argument (command line) or with resume=True (script).

Note

If a run has not converged before the evaluation budget has been exhausted, a warning will be printed at the end of the output, and an MC sample will be run and saved anyway.

The last state/checkpoint of the GPry relevant objects of the run can be found, if writing to the hard drive, in a [prefix]_gpry_output sub-folder, or via the Runner instance stored in a gpry_runner attribute of the Cobaya sampler object, which in this case is the GPry Cobaya wrapper, documented below:

input_dict = {...}

from cobaya.run import run

upd_input, gpry_wrapper = run(input_dict)

# The GPry runner is at:
gpry_wrapper.gpry_runner

Doing some plots#

A full set of plots can be generated and saved automatically by passing a plot: True option to the Cobaya gpry wrapper (see Customising the GPry run from Cobaya), or an any point later calling the do_plots() method of the wrapper.

The plot option can be assigned also individual arguments of run.Runner.plot_progress() (click for documentation), to control which plots are made, and in which format.

Individual plots can be generated by hand using the methods of the stored gpry_runner attribute (see above).

Refining/re-running the MC sample#

If you would like to re-run the MC sample of the surrogate model with a different sampler or with different settings, you can call the do_surrogate_sample() method of the wrapper by hand. The new samples will be written to the hard drive or returned by the samples() or products() method of the wrapper.

You can pass that method a different sampler to the one specified at initialisation, and/or different options for it using the add_options argument.

Customising the GPry run from Cobaya#

Same as for other Cobaya samplers, gpry can be customised setting its options as a dictionary. These options are the same, with small formatting differences, as for the Runner. An overview with default values can be seen below:

Note

Undefined values are left to their default for the Runner. In general, to disable a feature, set it to False instead of not None.

# Options regarding the bayesian optimization loop.
# NB: 'd' after a number means the dimensionality of the sampling space,
#     and a number following 'd' means the power of the dimensionality factor.

# General options for the main loop
options:
  # Number of finite initial truth evaluations before starting the learning loop
  n_initial: 3d
  # Maximum number of truth evaluations at initialization. If it is reached before
  # `n_initial` finite points have been found, the run will fail. To avoid that, try
  # decreasing the volume of your prior
  max_initial: 30d1.5
  # Maximum number of truth evaluations before the run stops. This is useful for e.g.
  # restricting the maximum computation resources.
  max_total: 70d1.5
  # Maximum number of sampling points accepted into the GP training set before the run
  # stops. If this limit is frequently saturated, try decreasing the prior volume.
  max_finite:  # default (undefined) = max_total
  # Number of points which are aquired with Kriging believer for every acquisition step.
  # Gets adjusted (with 20% tol.) to match a multiple of the num. of parallel processes.
  n_points_per_acq: d
  # Number of iterations between full GP hyperparameters fits, including several
  # restarts of the optimiser. Pass 'np.inf' or a large number to never refit with restarts.
  fit_full_every: 2d0.5
  # Similar to `fit_full_every`, but with a single optimiser run from the last optimum.
  # Overridden by `fit_full_every` if it applies. Pass np.inf or a large number to never
  # refit from last optimum (hyperparameters kept constant in that iteration).
  fit_simple_every: 1  # every iteration

# The GP regressor used for interpolating the posterior
gpr:
  # Spatial correlation kernel and params, e.g. RBF, Matern, {Matern: {nu: 2.5}}, ...
  kernel: RBF
  # Priors for the output and length scale, in normalised logp units
  output_scale_prior: [1e-2, 1e3]
  length_scale_prior: [1e-3, 1e1]
  # Noise level in logp units; increase for numerically noisy likelihoods
  noise_level: 1e-1
  # Factor used to clip the GPR from above, to avoid overshoots (undefined to disable)
  clip_factor: 1.1
  # Treatment of infinities and large negative values; False for no classifier
  account_for_inf: SVM
  # Difference in standard deviations ('s') for considering a value as -infinity
  inf_threshold: 20s
  # Hyperparameter fitting: optimizer (from scipy) and number of restarts for full fits
  optimizer: fmin_l_bfgs_b
  n_restarts_optimizer: 3d
  # Verbosity (set only if different from the overall verbosity)
  verbose:
  # Shrinkage of the model region, if necessary for speed and robustness
  trust_region_nstd:
  trust_region_factor:
  # [Not interfaced yet] Preprocessors for input and output scales
  # preprocessing_X: Normalize_bounds
  # preprocessing_y: Normalize_y

# The acquisition class, function and their options
gp_acquisition:
  # Acquisition function and its arguments
  acq_func:
    LogExp: {zeta_scaling: 0.85}
  # Verbosity (set only if different from the overall verbosity)
  verbose:
  # Acquisition engine: NORA or BatchOptimizer
  engine: BatchOptimizer
  # Options for the engine (only the relevant ones are kept in the .updated.yaml)
  options_BatchOptimizer:
    proposer:  # default (undefined): a mixture of uniform and centroids
    acq_optimizer: fmin_l_bfgs_b  # scipy optimiser to use
    n_restarts_optimizer: 5d  # number of restarts during hyperparameter fitting
    n_repeats_propose: 10  # number of starting points drawn from the proposer
  options_NORA:
    # nested sampler used for acquisition
    sampler:  # undefined: in order, of available: polychord > ultranest > nessai
    mc_every: 2d  # number of iterations between full NS runs
    nlive_per_training: 3  # number of live points per training sample
    nlive_max: 25d  # cap for the number of live points
    num_repeats: 5d  # number of steps of slice chains (polychord only)
    precision_criterion_target: 0.01  # precision criterion for the NS
    nprior_per_nlive: 10  # number of prior points in the initial sample, times nlive
    max_ncalls:  # maximum number of calls to the GPR model during NS

# Proposer used for drawing the initial training samples before running
# the acquisition loop. One of [reference, prior, uniform].
# Can be specified as dict with args, e.g. {reference: {max_tries: 1000}}
initial_proposer: reference

# Convergence criterion.
# Can be specified as a dict with args, e.g. {CorrectCounter: {abstol: 0.01s}}
convergence_criterion: CorrectCounter

# Cobaya sampler used to generate the final sample from the surrogate model
mc_sampler: mcmc  # default: mcmc with Cobaya defaults

# Produce progress plots (inside the gpry_output dir).
# One can specify options detailing which plots will be made, and in which format, e.g.:
# {timing: True, convergence: True, trace: False, slices: False, format: svg}
# (Adds overhead for very fast likelihoods.)
plots: False

# Function run each iteration after adapting the recently acquired points and
# the computation of the convergence criterion. See docs for implementation.
callback:

# Whether the callback function handles MPI-parallelization internally.
# Otherwise run only by the rank-0 process
callback_is_MPI_aware:

# Change to increase or reduce verbosity. If None, it is handled by Cobaya.
# '3' produces general progress output (default for Cobaya if None),
# and '4' debug-level output
verbose:

The Cobaya wrapper#

class gpry.cobaya.CobayaWrapper(*args: Any, **kwargs: Any)[source]#

GPry: a package for Bayesian inference of expensive likelihoods using GPs.

initialize()[source]#: Initializes GPry.

run()[source]#: Gets the initial training points and starts the acquistion loop.

do_surrogate_sample(sampler=None, add_options=None, resume=False, prefix=None)[source]#

Perform an MC sample of the surrogate model.

This function is called automatically at the end of a run, but it can be called by hand too e.g. if the initial one did not converge.

Parameters:

sampler (str) – An anternative sampler, if different from the one specified at initialisation
add_options (dict) – Configuration to be passed to the sampler.
resume (bool (default: False)) – Whether to try to resume a previous run
prefix (str, optional) – An alternative path where to save the sample. If not given, the sample will use the default one with suffix (_)gpr.

Returns:

surr_info (dict) – The dictionary that was used to run (or initialized) the sampler, corresponding to the surrogate model, and populated with the sampler input specification.
sampler (Sampler instance) – The sampler instance that has been run (or just initialised). The sampler products can be retrieved with the Sampler.products() method.

property is_mc_sampled#: Returns True if the MC sampling of the surrogate process has run and converged.

do_plots(ext='png')[source]#: Produces some results and diagnosis plots.

samples(combined: bool = False, skip_samples: float = 0, to_getdist: bool = False) → cobaya.collection.SampleCollection | MCSamples[source]#: Returns the last sample from the surrogate model.

products(combined: bool = False, skip_samples: float = 0, to_getdist: bool = False) → dict[source]#: Returns the products of the run: an MC sample of the surrogate posterior under sample, and the GPRy Runner object under runner.

classmethod get_checkpoint_dir_and_surr_prefix(output=None)[source]#

Folder where the checkpoint output of GPry is going to be saved, and prefix for the output object of the MC sample of the surrogate model, given a Cobaya OutputReadOnly instance.

These two are wrapped into a single classmethod in order to use the same temp folder if called with dummy output.

Parameters:

output (cobaya.output.Output, cobaya.output.DummyOutput, optional) – Cobaya output instance. Can be a dummy one or None, in which case a temporary folder will be created.

Returns:

checkpoint_dir (str) – Relative folder where the GPry checkpoint will be saved.
surrogate_prefix (str) – Prefix for surrogate MC chains for a Cobaya Output with relative path.

Examples

Assuming that cls._gpry_output_dir = "gpry_output" and cls._surrogate_suffix = "gpr":

>>> from cobaya.output import get_output
>>> cls.get_checkpoint_dir_and_surr_prefix(get_output("folder/"))
'folder/gpry_output', 'folder/gpr'
>>> cls.get_checkpoint_dir_and_surr_prefix(get_output("folder/prefix"))
'folder/prefix_gpry_output', 'folder/prefix_gpr'
>>> cls.get_checkpoint_dir_and_surr_prefix(get_output())  # dummy output
'[tmp_folder]/gpry_output', '[tmp_folder]/gpr'

classmethod output_files_regexps(output, info=None, minimal=False)[source]#

Returns a list of tuples (regexp, root) of output files potentially produced. If root in the tuple is None, output.folder is used.

If minimal=True, returns regexp’s for the files that should really not be there when we are not resuming: GPry checkpoint products and the MC sample from the surrogate.

static is_nora(info)[source]#: Returns True if NORA is being used.

Running with Cobaya

Contents

Running with Cobaya#

a) Passing a Cobaya model to the GPry Runner#

b) Running Cobaya with GPry as a sampler#

Doing some plots#

Refining/re-running the MC sample#

Customising the GPry run from Cobaya#

The Cobaya wrapper#

a) Passing a Cobaya `model` to the GPry `Runner`#