# Learning a suitable target distribution

While we have described estimators to compute the marginal likelihood and Bayes factors based on a learnt target distribution \(\varphi(\theta)\), we have yet to consider the critical task of learning the target distribution. As discussed, the ideal target distribution is the posterior itself. However, since the target must be normalised, use of the posterior would require knowledge of the marginal likelihood – precisely the quantity that we are attempting to estimate. Instead, one can learn a normalised approximation of the posterior. The approximation itself does not need to be highly accurate. More critically, the learnt target approximating the posterior must exhibit narrower tails than the posterior to avoid the problematic scenario of the original harmonic mean estimator, which can result in very large variance.

We present three examples of models that can be used to learn appropriate target distributions and discuss how to train them, although other models can of course be considered. Samples of the posterior are split into training and evaluation (*cf.* test) sets. The training set is used to learn the target distribution, after which the evaluation set, combined with the learnt target, is used to estimate the marginal likelihood. To train the models we construct an optimisation problem to minimise the variance of the estimator, while ensuring it is unbiased, which we typically solve by stochastic gradient descent. To set hyper-parameters, we advocate cross-validation.
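As a concrete illustration of this workflow, the minimal Python sketch below splits a set of posterior samples into training and evaluation sets. The array names (`samples`, `ln_posterior`, and the split variables) and the placeholder data are assumptions made purely for the example, not part of any particular interface.

```python
import numpy as np

# Placeholder posterior samples and (unnormalised) log-posterior values,
# included only to make the sketch self-contained.
rng = np.random.default_rng(seed=0)
n_samples, n_dim = 5000, 4
samples = rng.normal(size=(n_samples, n_dim))
ln_posterior = -0.5 * np.sum(samples**2, axis=1)

# Random 50/50 split into training and evaluation sets.
perm = rng.permutation(n_samples)
n_train = n_samples // 2
samples_train, ln_post_train = samples[perm[:n_train]], ln_posterior[perm[:n_train]]
samples_eval, ln_post_eval = samples[perm[n_train:]], ln_posterior[perm[n_train:]]

# The training set is used to fit the target phi(theta); the evaluation set,
# combined with the learnt target, is then used to estimate the marginal likelihood.
```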

## Learning from posterior samples

Here we cover several functional forms for the learnt model \(\varphi(\theta)\) which are used throughout the code. The parameters of these models play a role analogous to the weights of a conventional machine learning model: their values are learnt from a small sub-set of posterior samples, with any remaining hyper-parameters set by cross-validation.

The simplest model one may wish to consider is a hyper-sphere, much like the truncated harmonic mean estimator. However, here we learn the optimal radius of the hyper-sphere, rather than setting the radius based on arbitrary level-sets of the posterior as in Robert and Wraith (2009).

Consider the target distribution defined by the normalised hyper-sphere

\[
\varphi(\theta) = \frac{I_\mathcal{S}(\theta)}{V_\mathcal{S}} ,
\]

where the indicator function \(I_\mathcal{S}(\theta)\) is unity if \(\theta\) is within a hyper-sphere of radius \(R\), centred on \(\bar{\theta}\) with covariance \(\Sigma\), *i.e.*

\[
I_\mathcal{S}(\theta) =
\begin{cases}
1, & \bigl(\theta - \bar{\theta}\bigr)^\mathrm{T} \Sigma^{-1} \bigl(\theta - \bar{\theta}\bigr) < R^2 ; \\
0, & \text{otherwise} .
\end{cases}
\]

The values of \(\bar{\theta}\) and \(\Sigma\) can be computed directly from the training samples. Often, although not always, a diagonal approximation of \(\Sigma\) is considered for computational efficiency. The volume of the hyper-sphere required to normalise the distribution is given by

\[
V_\mathcal{S} = \frac{\pi^{d/2}}{\Gamma\bigl(d/2 + 1\bigr)} \, R^d \, \vert \Sigma \vert^{1/2} .
\]

Recall that \(d\) is the dimension of the parameter space, *i.e.* \(\theta \in \mathbb{R}^d\), and note that \(\Gamma(\cdot)\) is the Gamma function. To estimate the radius of the hyper-sphere we pose the following optimisation problem to minimise the variance of the learnt harmonic mean estimator, while also constraining it to be unbiased:

\[
\min_{R} \: \sigma^2\bigl(\hat{\rho}\bigr)
\quad \text{subject to} \quad
\mathbb{E}\bigl(\hat{\rho}\bigr) = \rho ,
\]

where \(\hat{\rho}\) denotes the learnt harmonic mean estimator of the reciprocal marginal likelihood \(\rho\).

By minimising the variance of the estimator we ensure, on the one hand, that the tails of the learnt target are not broader than those of the posterior and, on the other hand, that they are not so narrow that very few samples are effectively retained in the estimator.

This optimisation problem is equivalent to minimising the estimator of the second harmonic moment:

Writing out the cost function explicitly in terms of the posterior samples, the optimisation problem reads

with costs for each sample given by

This one-dimensional optimisation problem can be solved by straightforward techniques, such as the Brent hybrid root-finding algorithm.
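The sketch below illustrates this procedure for the hyper-sphere model under a diagonal covariance approximation. The per-sample cost assumed here is \(\varphi(\theta_i)\) divided by the (unnormalised) posterior value at each training sample, and the objective is the empirical second moment of these costs, minimised in log space for numerical stability; the function names, the cap on the search range and the use of SciPy's bounded Brent-type one-dimensional minimiser are choices made for the example rather than a prescription.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln, logsumexp

def ln_hypersphere_volume(radius, var_diag):
    """log V_S with V_S = pi^(d/2) / Gamma(d/2 + 1) * R^d * |Sigma|^(1/2),
    for a diagonal covariance approximation."""
    d = var_diag.size
    return (0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)
            + d * np.log(radius) + 0.5 * np.sum(np.log(var_diag)))

def fit_hypersphere_radius(samples_train, ln_post_train):
    """Learn the radius R of the hyper-sphere target (illustrative sketch).

    The objective assumed here is the empirical second moment of the terms
    phi(theta_i) / posterior(theta_i), evaluated in log space for stability.
    """
    centre = samples_train.mean(axis=0)
    var_diag = samples_train.var(axis=0)              # diagonal approximation of Sigma
    r2 = np.sum((samples_train - centre) ** 2 / var_diag, axis=1)

    def objective(radius):
        inside = r2 < radius ** 2
        if not inside.any():
            return 1e300                              # no samples retained: reject this radius
        # Assumed per-sample log cost: log phi(theta_i) - log posterior(theta_i).
        ln_cost = -ln_hypersphere_volume(radius, var_diag) - ln_post_train[inside]
        return logsumexp(2.0 * ln_cost)               # log of the second-moment sum

    # One-dimensional bounded (Brent-type) minimisation; the search is capped at
    # the radius of the furthest training sample, beyond which the empirical
    # second moment is no longer informative.
    res = minimize_scalar(objective, bounds=(1e-2, np.sqrt(r2.max())), method="bounded")
    return res.x, centre, var_diag

# e.g. radius, centre, var_diag = fit_hypersphere_radius(samples_train, ln_post_train)
```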

While the learnt hyper-sphere model is very simple, it is a good pedagogical illustration of the general procedure for learning target distributions. First, construct a normalised model. Second, train the model to learn its parameters by solving an optimisation problem to minimise the variance of the estimator while ensuring it is unbiased. If required, set hyper-parameters or compare alternative models by cross-validation.

While the simple learnt hyper-sphere model may be sufficient in some settings, it is not effective for multimodal posterior distributions or for posteriors with narrow curving degeneracies. For such scenarios we consider alternative learnt models.

Kernel density estimation (KDE) provides an alternative model with which to learn an effective target distribution. In particular, it can be used to model narrow curving posterior degeneracies effectively.

Consider the target distribution defined by the kernel density function

with kernel

where \(k(\theta) = 1\) if \(\vert \theta \vert < 1/2\) and 0 otherwise. The volume of the kernel is given by

The kernel covariance \(\Sigma_K\) can be computed directly from the training samples, for example from the sample covariance or, even more simply, from the separation between the lowest and highest samples in each dimension. A diagonal representation is often, although not always, considered for computational efficiency.

The kernel radius \(R\) can be estimated by following a similar procedure to that outlined above for the hyper-sphere (and below for the modified Gaussian mixture model), minimising the variance of the resulting estimator. Alternatively, since there is only a single parameter, cross-validation is also effective.
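A minimal sketch of such a KDE target follows, assuming a spherical top-hat kernel applied in per-dimension standardised coordinates (a diagonal kernel covariance) and selecting the kernel radius by a simple grid-based cross-validation on held-out samples. The kernel form, the function names and the cross-validation score (the empirical second moment of \(\varphi\) divided by the unnormalised posterior) are assumptions of the example.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def ln_kde_target(theta, samples_train, radius):
    """log phi(theta) for a top-hat KDE target (illustrative sketch).

    Assumes a spherical kernel of radius `radius` in coordinates scaled by the
    per-dimension standard deviation of the training samples."""
    n, d = samples_train.shape
    scale = samples_train.std(axis=0)
    dist2 = np.sum(((theta - samples_train) / scale) ** 2, axis=1)
    count = np.count_nonzero(dist2 < radius ** 2)
    if count == 0:
        return -np.inf
    # log volume of a single kernel: pi^(d/2) / Gamma(d/2 + 1) * radius^d * prod(scale).
    ln_vol = (0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)
              + d * np.log(radius) + np.sum(np.log(scale)))
    return np.log(count) - np.log(n) - ln_vol

def kde_cv_score(radius, samples_train, samples_val, ln_post_val):
    """Cross-validation score for the kernel radius: log of the empirical second
    moment of phi(theta)/posterior(theta) on held-out samples (smaller is better)."""
    ln_terms = np.array([ln_kde_target(t, samples_train, radius) for t in samples_val])
    ln_terms = ln_terms - ln_post_val
    if np.all(np.isneginf(ln_terms)):
        return np.inf                      # degenerate radius: no support on held-out samples
    return logsumexp(2.0 * ln_terms)

# Select the kernel radius on a coarse grid, e.g.
# radii = np.geomspace(0.1, 2.0, 10)
# best_radius = min(radii, key=lambda r: kde_cv_score(r, samples_train,
#                                                     samples_eval, ln_post_eval))
```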

A modified Gaussian mixture model provides greater flexibility than the simple hyper-sphere model. In particular, it is much more effective for multimodal posterior distributions.

Consider the target distribution defined by the modified Gaussian mixture model

for \(K\) components, with centres \(\bar{\theta}_k\) and covariances \(\Sigma_k\), where the relative scale of each component is controlled by \(s_k\) and the weights are specified by

\[
w_k = \frac{\exp(z_k)}{\sum_{k^\prime=1}^{K} \exp(z_{k^\prime})} ,
\]

which in turn depend on weights \(z_k\). Given \(K\), the posterior training samples can be clustered by \(K\)-means. The values of \(\bar{\theta}_k\) and \(\Sigma_k\) can then be computed from the samples in cluster \(k\). The model is modified relative to the usual Gaussian mixture model in that the cluster means and covariances are estimated from the samples of each cluster, while the relative cluster scales and weights are fitted. Moreover, as before, a bespoke training approach is adopted, tailored to the problem of learning an effective model for the learnt harmonic mean estimator.
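The following sketch illustrates this construction, using scikit-learn's `KMeans` for the clustering. The function name, the diagonal covariance choice and the initial values for \(z_k\) and \(s_k\) are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_modified_gmm(samples_train, K):
    """Initialise the modified Gaussian mixture model (illustrative sketch).

    Training samples are clustered by K-means; the mean and a diagonal
    covariance of each component are computed from the samples assigned to
    that cluster, while the relative scales s_k and weight parameters z_k are
    left as free parameters to be fitted (initialised to s_k = 1, equal weights)."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit(samples_train).labels_
    centres, covs = [], []
    for k in range(K):
        cluster = samples_train[labels == k]
        centres.append(cluster.mean(axis=0))
        covs.append(np.maximum(cluster.var(axis=0), 1e-12))  # guard against tiny clusters
    z = np.zeros(K)                                           # equal initial weights w_k
    s = np.ones(K)                                            # unit initial relative scales
    return np.array(centres), np.array(covs), z, s
```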

To estimate the weights \(z_k\), which in turn define the weights \(w_k\), and the relative scales \(s_k\), we again set up an optimisation problem to minimise the variance of the learnt harmonic mean estimator, while also constraining it to be unbiased. We also regularise the relative scale parameters, resulting in the following optimisation problem:

for regularisation parameter \(\lambda\). The problem may equivalently be written as

or explicitly in terms of the posterior samples by

The individual cost terms for each sample \(i\) are given by

which include the following component from cluster \(k\):

We solve this optimisation problem by stochastic gradient descent, which requires the gradients of the objective function. Denoting the total cost of the objective function by \(C = \sum_i C_i^2 + \frac{1}{2} \lambda \sum_{k=1}^K s_k^2\), it is straightforward to show that the gradients of the cost function with respect to the weights \(z_k\) and relative scales \(s_k\) are given by

and

respectively.
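A compact sketch of this training step is given below. It assumes that the scale \(s_k\) multiplies the covariance of component \(k\), that the weights \(w_k\) are a softmax of \(z_k\), and that the per-sample cost is \(C_i = \varphi(\theta_i)\) divided by the (unnormalised) posterior value; for brevity, a generic quasi-Newton minimiser with numerically estimated gradients stands in for the stochastic gradient descent with analytic gradients described above. All function and variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def ln_phi_mgmm(thetas, centres, covs, z, s):
    """log phi(theta) of the modified Gaussian mixture (illustrative sketch):
    component k is Gaussian with mean centres[k] and diagonal covariance
    s[k] * covs[k]; the weights w_k are a softmax of z (assumed forms)."""
    n, d = thetas.shape
    K = centres.shape[0]
    ln_w = z - logsumexp(z)
    comps = np.empty((K, n))
    for k in range(K):
        var = s[k] * covs[k]
        diff2 = np.sum((thetas - centres[k]) ** 2 / var, axis=1)
        ln_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)))
        comps[k] = ln_w[k] + ln_norm - 0.5 * diff2
    return logsumexp(comps, axis=0)

def fit_modified_gmm(samples_train, ln_post_train, centres, covs, lam=1e-2):
    """Fit z_k and s_k by minimising sum_i C_i^2 + (lambda/2) sum_k s_k^2,
    with per-sample cost C_i = phi(theta_i)/posterior(theta_i) (assumed form)."""
    K = centres.shape[0]
    ln_post = ln_post_train - ln_post_train.max()   # constant shift; effectively rescales lambda

    def objective(params):
        z, s = params[:K], np.exp(params[K:])       # log-parameterise s to keep it positive
        C_i = np.exp(ln_phi_mgmm(samples_train, centres, covs, z, s) - ln_post)
        return np.sum(C_i ** 2) + 0.5 * lam * np.sum(s ** 2)

    x0 = np.zeros(2 * K)                            # z_k = 0 (equal weights), s_k = 1
    res = minimize(objective, x0, method="L-BFGS-B")
    return res.x[:K], np.exp(res.x[K:])
```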

The general procedure to learn the target distribution is the same as before: first, construct a normalised model; second, train the model by solving an optimisation problem to minimise the variance of the resulting learnt harmonic mean estimator. In this case we regularise the relative scale parameters and solve by stochastic gradient descent. The number of clusters \(K\) can be determined by cross-validation (or other methods). While the modified Gaussian mixture model can effectively handle multimodal distributions, alternative models are better suited to narrow curving posterior degeneracies.
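To illustrate the cross-validation over \(K\) mentioned above, the brief sketch below builds on the `init_modified_gmm`, `fit_modified_gmm` and `ln_phi_mgmm` helpers sketched earlier, scoring each candidate \(K\) by the empirical second moment of \(\varphi\) divided by the unnormalised posterior on the held-out split; the score and the function names are, again, assumptions of the example.

```python
import numpy as np
from scipy.special import logsumexp

def mgmm_cv_score(K, samples_train, ln_post_train, samples_val, ln_post_val):
    """Held-out score for a candidate number of clusters K: the log empirical
    second moment of phi/posterior on the validation split (smaller is better)."""
    centres, covs, _, _ = init_modified_gmm(samples_train, K)
    z, s = fit_modified_gmm(samples_train, ln_post_train, centres, covs)
    ln_terms = (ln_phi_mgmm(samples_val, centres, covs, z, s)
                - (ln_post_val - ln_post_val.max()))   # constant shift, identical across K
    return logsumexp(2.0 * ln_terms)

# e.g. best_K = min([1, 2, 3, 4], key=lambda K: mgmm_cv_score(
#     K, samples_train, ln_post_train, samples_eval, ln_post_eval))
```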

**Note:** This list of models is by no means comprehensive, and bespoke models may be implemented that perform better (*i.e.* lower cross-validation variance) in specific use cases.