Learning a suitable target distribution

While we have described estimators to compute the marginal likelihood and Bayes factors based on a learnt target distribution \(\varphi(\theta)\), we have yet to consider the critical task of learning the target distribution. As discussed, the ideal target distribution is the posterior itself. However, since the target must be normalised, use of the posterior would require knowledge of the marginal likelihood – precisely the quantity that we are attempting to estimate. Instead, one can learn a normalised approximation of the posterior. The approximation itself does not need to be highly accurate. More critically, the learned target approximating the posterior must exhibit narrower tails than the posterior in order to avoid the pathology of the original harmonic mean estimator, which can result in very large variance.

We present three examples of models that can be used to learn appropriate target distributions and discuss how to train them, although other models can of course be considered. Samples of the posterior are split into training and evaluation (cf. test) sets. The training set is used to learn the target distribution, after which the evaluation set, combined with the learnt target, is used to estimate the marginal likelihood. To train the models we construct and solve an optimisation problem to minimise the variance of the estimator, while ensuring it is unbiased; the resulting optimisation problem is typically solved by stochastic gradient descent. To set hyper-parameters, we advocate cross-validation.
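As a minimal illustration of the sample split (the function name and the NumPy-array representation of the samples are our assumptions, not part of any particular package), one might write:

    import numpy as np

    def split_samples(samples, training_proportion=0.5, seed=0):
        """Randomly split posterior samples into training and evaluation sets.

        samples: array of shape (n_samples, n_dim) drawn from the posterior.
        Returns (training_samples, evaluation_samples).
        """
        rng = np.random.default_rng(seed)
        perm = rng.permutation(len(samples))
        n_train = int(training_proportion * len(samples))
        return samples[perm[:n_train]], samples[perm[n_train:]]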

Learning from posterior samples

Here we cover several functional forms for the learned model \(\varphi(\theta)\) that are used throughout the code. The parameters of these models play a role analogous to those of conventional machine learning models: their values are learnt from a small sub-set of the posterior samples, while any remaining hyper-parameters are set by cross-validation.

The simplest model one may wish to consider is a hyper-sphere, much like the truncated harmonic mean estimator. However, here we learn the optimal radius of the hyper-sphere, rather than setting the radius based on arbitrary level-sets of the posterior as in Robert and Wraith (2009).

Consider the target distribution defined by the normalised hyper-sphere

\[\varphi(\theta) = \frac{1}{V_\mathcal{S}} I_\mathcal{S}(\theta),\]

where the indicator function \(I_\mathcal{S}(\theta)\) is unity if \(\theta\) is within a hyper-sphere of radius \(R\), centred on \(\bar{\theta}\) with covariance \(\Sigma\), i.e.

\[\begin{split}I_\mathcal{S}(\theta) = \begin{cases} 1, & \bigl(\theta - \bar{\theta}\bigr)^\text{T} \Sigma^{-1} \bigl(\theta - \bar{\theta} \bigr) < R^2 \\ 0, & \text{otherwise} \end{cases}.\end{split}\]

The values of \(\bar{\theta}\) and \(\Sigma\) can be computed directly from the training samples. Often, although not always, a diagonal approximation of \(\Sigma\) is considered for computational efficiency. The volume of the hyper-sphere required to normalise the distribution is given by

\[V_\mathcal{S} = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} R^d \, \vert \Sigma \vert^{1/2}.\]
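For concreteness, the following minimal sketch evaluates the normalised hyper-sphere target defined above (the function and variable names are illustrative and not drawn from any particular implementation; NumPy and SciPy are assumed):

    import numpy as np
    from scipy.special import gammaln

    def hypersphere_density(theta, theta_bar, cov, radius):
        """Evaluate the normalised hyper-sphere target phi(theta).

        theta_bar and cov are the mean and (possibly diagonal) covariance
        estimated from the training samples.
        """
        d = theta_bar.size
        # log V_S = (d/2) log(pi) - log Gamma(d/2 + 1) + d log(R) + (1/2) log|Sigma|
        log_volume = (0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1.0)
                      + d * np.log(radius) + 0.5 * np.linalg.slogdet(cov)[1])
        diff = theta - theta_bar
        # Indicator: unity inside the Mahalanobis ball of radius R, zero outside.
        inside = diff @ np.linalg.solve(cov, diff) < radius**2
        return np.exp(-log_volume) if inside else 0.0

Here theta_bar and cov would be computed directly from the training samples, e.g. as training_samples.mean(axis=0) and np.cov(training_samples, rowvar=False).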

Recall that \(d\) is the dimension of the parameter space, i.e. \(\theta \in \mathbb{R}^d\), and note that \(\Gamma(\cdot)\) is the Gamma function. To estimate the radius of the hyper-sphere we pose the following optimisation problem to minimise the variance of the learnt harmonic mean estimator, while also constraining it to be unbiased:

\[\min_R \: \hat{\sigma}^2 \quad \text{s.t.} \quad \hat{\rho} = \hat{\mu}_1.\]

By minimising the variance of the estimator we ensure, on the one hand, that the tails of the learnt target are not broader than those of the posterior and, on the other hand, that they are not so narrow that very few samples are effectively retained in the estimator.

Since the unbiasedness constraint fixes the first harmonic moment, and the variance is the second harmonic moment minus the square of the first, this optimisation problem is equivalent to minimising the estimator of the second harmonic moment:

\[\min_R \: \hat{\mu}_2.\]

Writing out the cost function explicitly in terms of the posterior samples, the optimisation problem reads

\[\min_R \: \sum_i C_i^2,\]

with costs for each sample given by

\[\begin{split}C_i = \frac{\varphi(\theta_i)}{\mathcal{L}(\theta_i) \pi(\theta_i)} \propto \begin{cases} \frac{1}{\mathcal{L}(\theta_i) \pi(\theta_i) R^d}, & \bigl(\theta_i - \bar{\theta}\bigr)^\text{T} \Sigma^{-1} \bigl(\theta_i - \bar{\theta} \bigr) < R^2 \\ 0, & \text{otherwise} \end{cases}.\end{split}\]

This one-dimensional optimisation problem can be solved by straightforward techniques, such as the Brent hybrid root-finding algorithm.
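A minimal sketch of this radius optimisation follows, under the same illustrative conventions as above. It uses SciPy's bounded Brent-style minimiser as a stand-in for the root-finding variant mentioned, works in log space for numerical stability, and bounds the search (via quantiles of the Mahalanobis distances, our choice) so that the hyper-sphere retains a reasonable number of training samples:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import logsumexp

    def optimal_radius(training_samples, log_posterior_values):
        """Minimise sum_i C_i^2 over the hyper-sphere radius R.

        training_samples: array of shape (n, d) of posterior samples.
        log_posterior_values: array of log(L(theta_i) pi(theta_i)), up to a
        constant, evaluated at the training samples.
        """
        theta_bar = training_samples.mean(axis=0)
        cov = np.cov(training_samples, rowvar=False)
        n, d = training_samples.shape
        # Squared Mahalanobis distance of each sample from the centre.
        diff = training_samples - theta_bar
        maha2 = np.einsum('ij,ij->i', diff, np.linalg.solve(cov, diff.T).T)

        def log_cost(radius):
            inside = maha2 < radius**2
            if not inside.any():
                return np.inf  # guard against the degenerate empty-sphere solution
            # log sum_i C_i^2 (up to a constant); minimising this is equivalent
            # to minimising sum_i C_i^2 since log is monotone.
            return logsumexp(2.0 * (-log_posterior_values[inside] - d * np.log(radius)))

        # Bound the search so the sphere keeps a reasonable fraction of samples.
        r_lo = np.sqrt(np.quantile(maha2, 0.1))
        r_hi = np.sqrt(maha2.max())
        result = minimize_scalar(log_cost, bounds=(r_lo, r_hi), method='bounded')
        return result.x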

While the learnt hyper-sphere model is very simple, it is a good pedagogical illustration of the general procedure for learning target distributions. First, construct a normalised model. Second, train the model to learn its parameters by solving an optimisation problem to minimise the variance of the estimator while ensuring it is unbiased. If required, set hyper-parameters or compare alternative models by cross-validation.

While the simple learnt hyper-sphere model may be sufficient in some settings, it is not effective for multimodal posterior distributions or for posteriors with narrow curving degeneracies. For such scenarios we consider alternative learnt models.

Note

This list of models is by no means comprehensive, and bespoke models may be implemented that perform better (i.e. achieve lower cross-validation variance) in specific use-cases.