Learning a suitable target distribution

While we have described estimators to compute the marginal likelihood and Bayes factors based on a learnt target distribution \(\varphi(\theta)\), we have yet to consider the critical task of learning the target distribution. As discussed, the ideal target distribution is the posterior itself. However, since the target must be normalised, use of the posterior would require knowledge of the marginal likelihood – precisely the quantity that we attempting to estimate. Instead, one can learn an approximation of the posterior that is normalised. The approximation itself does not need to be highly accurate. More critically, the learned target approximating the posterior must exhibit narrower tails than the posterior to avoid the problematic scenario of the original harmonic mean that can result in very large variance.

We present three examples of models that can be used to learn appropriate target distributions and discuss how to train them, although other models can of course be considered. Samples of the posterior are split into training and evaluation (cf. test) sets. The training set is used to learn the target distribution, after which the evaluation set, combined with the learnt target, is used to estimate the marginal likelihood. To train the models we typically construct and solve an optimisation problem to minimise the variance of the estimator, while ensuring it is unbiased. We typically solve the resulting optimisation problem by stochastic gradient descent. To set hyperparameters, we advocate cross-validation.

Learning from posterior samples

Here we cover several functional forms for the learned flow models \(\varphi(\theta)\) which are used throughout the code. For these models no hyper-parameter optimisation is required, the flow should do all the heavy lifting for us!

We also provide support for legacy models, which are somewhat less expressive but nonetheless useful for simple posterior distributions. Hyper-parameters of these models can be considered nodes of a conventional network, the values of which are learnt from a small sub-set of posterior samples.

Normalizing flows are a class of probabilistic models that allow one to evaluate the density of and sample from a learned probability distribution (for a review see (Papamakarios et al., 2021)). They consist of a series of transformations that are applied to a simple base distribution. A vector \(\theta\) of an unknown distribution \(p(\theta)\), can be expressed through a transformation \(T\) of a vector \(z\) sampled from a base distribution \(q(z)\):

\[\theta = T(z), \text{ where } z \sim q(z).\]

Typically the base distribution is chosen so that its density can be evaluated simply and that it can be sampled from easily. Often a Gaussian is used for the base distribution. The unknown distribution can then be recovered by the change of variables formula:

\[p(\theta) = q(z) \vert \det J_{T}(z) \vert^{-1},\]

where \(J_{T}(z)\) is the Jacobian corresponding to transformation \(T\). In a flow-based model \(T\) consists of a series of learned transformations that are each invertible and differentiable, so that the full transformation is also invertible and differentiable. This allows us to compose multiple simple transformations with learned parameters, into what is called a flow, obtaining a normalized approximation of the unknown distribution that we can sample from and evaluate. Careful attention is given to construction of the transformations such that the determinant of the Jacobian can be computed easily.

Real NVP Flows

A relatively simple example of a normalizing flow is the real-valued non-volume preserving (real NVP) flow introduced in (Dinh et al., 2016). It consists of a series of bijective transformations given by affine coupling layers. Consider the \(D\) dimensional input \(z\), split into elements up to and following \(d\), respectively, \(z_{1:d}\) and \(z_{d+1:D}\), for \(d<D\). Given input \(z\), the output \(y\) of an affine couple layer is calculated by

\[\begin{split}y_{1:d} = & z_{1:d} ; \\ y_{d+1:D} = & z_{d+1:D} \odot \exp\bigl(s(z_{1:d})\bigr) + t(z_{1:d}),\end{split}\]

where \(\odot\) denotes Hadamard (elementwise) multiplication. The scale \(s\) and translation \(t\) are typically represented by neural networks with learnable parameters that take as input \(z_{1:d}\). This construction is easily invertible and ensures the Jacobian is a lower-triangular matrix, making its determinant efficient to calculate.

Rational Quadratic Spline Flows

A more complex and expressive class of flows are rational quadratic spline flows described in detail in (Durkan et al., 2019). The architecture is similar to Real NVP flows, but the layers include monotonic splines. These are piecewise functions consisting of multiple segments of monotonic rational-quadratics with learned parameters. Such layers are combined with alternating affine transformations to create the normalizing flow. Rational quadratic spline flows are well-suited to higher dimensional and more complex problems than real NVP flows.

The simplest model one may wish to consider is a hyper-sphere, much like the truncated harmonic mean estimator. However, here we learn the optimal radius of the hyper-sphere, rather than setting the radius based on arbitrary level-sets of the posterior as in Robert and Wraith (2009).

Consider the target distribution defined by the normalised hyper-sphere

\[\varphi(\theta) = \frac{1}{V_\mathcal{S}} I_\mathcal{S}(\theta),\]

where the indicator function \(I_\mathcal{S}(\theta)\) is unity if \(\theta\) is within a hyper-sphere of radius \(R\), centred on \(\bar{\theta}\) with covariance \(\Sigma\), i.e.

\[\begin{split}I_\mathcal{S}(\theta) = \begin{cases} 1, & \bigl(\theta - \bar{\theta}\bigr)^\text{T} \Sigma^{-1} \bigl(\theta - \bar{\theta} \bigr) < R^2 \\ 0, & \text{otherwise} \end{cases}.\end{split}\]

The values of \(\bar{\theta}\) and \(\Sigma\) can be computed directly from the training samples. Often, although not always, a diagonal approximation of \(\Sigma\) is considered for computational efficiency. The volume of the hyper-sphere required to normalise the distribution is given by

\[V_\mathcal{S} = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} R^d \, \vert \Sigma \vert^{1/2}.\]

Recall that \(d\) is the dimension of the parameter space, i.e. \(\theta \in \mathbb{R}^d\), and note that \(\Gamma(\cdot)\) is the Gamma function. To estimate the radius of the hyper-sphere we pose the following optimisation problem to minimise the variance of the learnt harmonic mean estimator, while also constraining it be be unbiased:

\[\min_R \: \hat{\sigma}^2 \quad \text{s.t.} \quad \hat{\rho} = \hat{\mu}_1.\]

By minimising the variance of the estimator we ensure, on one hand, that the tails of the learnt target are not so wide that they are broader than the posterior, and, on the other hand, that they are not so narrow that very few samples are effectively retained in the estimator.

This optimisation problem is equivalent to minimising the estimator of the second harmonic moment:

\[\min_R \: \hat{\mu}_2.\]

Writing out the cost function explicitly in terms of the posterior samples, the optimisation problem reads

\[\min_R \: \sum_i C_i^2,\]

with costs for each sample given by

\[\begin{split}C_i = \frac{\varphi(\theta_i)}{\mathcal{L}(\theta_i) \pi(\theta_i)} \propto \begin{cases} \frac{1}{\mathcal{L}(\theta_i) \pi(\theta_i) R^d}, & \bigl(\theta - \bar{\theta}\bigr)^\text{T} \Sigma^{-1} \bigl(\theta - \bar{\theta} \bigr) < R^2 \\ 0, & \text{otherwise} \end{cases}.\end{split}\]

This one-dimensional optimisation problem can be solved by straightforward techniques, such as the Brent hybrid root-finding algorithm.

While the learnt hyper-sphere model is very simple, it is good pedagogical illustration of the general procedure for learning target distributions. First, construct a normalised model. Second, train the model to learn its parameters by solving an optimisation problem to minimise the variance of the estimator while ensuring it is unbiased. If required, set hyper-parameters or compare alternative models by cross-validation.

While the simple learnt hyper-sphere model may be sufficient in some settings, it is not effective for multimodal posterior distributions or for posteriors with narrow curving degeneracies. For such scenarios we consider alternative learnt models.

A modified Gaussian mixture model provides greater flexibility than the simple hyper-sphere model. In particular, it is much more effective for multimodal posterior distributions.

Consider the target distribution defined by the modified Gaussian mixture model

\[\varphi(\theta) = \sum_{k=1}^K \frac{w_k}{(2\pi)^{d/2} \vert \Sigma_k \vert^{1/2} s_k^d} \exp \biggl( \frac{- \bigl(\theta - \bar{\theta}_k\bigr)^\text{T} \Sigma_k^{-1} \bigl(\theta - \bar{\theta}_k\bigr)}{2 s_k^2}\biggr),\]

for \(K\) components, with centres \(\bar{\theta}_k\) and covariances \(\Sigma_k^{-1}\), where the relative scale of each component is controlled by \(s_k\) and the weights are specified by

\[w_k = \frac{\exp(z_k)}{\sum_{k^\prime=1}^K \exp(z_{k^\prime})},\]

which in turn depend on weights \(z_k\). Given \(K\), the posterior training samples can be clustered by \(K\)-means. The values of \(\bar{\theta}_k\) and \(\Sigma_k\) can then be computed by the samples in cluster \(k\). The model is modified relative to the usual Gaussian mixture model in that the cluster mean and covariance are estimated from the samples of each cluster, while the relative cluster scale and weights are fitted. Moreover, as before, a bespoke training approach is adopted tailored to the problem of learning an effective model for the learnt harmonic mean estimator.

To estimate the the weights \(z_k\), which in turn define the weights \(w_k\), and the relative scales \(s_k\) we again set up an optimisation problem to minimise the variance of the learnt harmonic mean estimator, while also constraining it to be unbiased. We also regularise the relative scale parameters, resulting in the following optimisation problem:

\[\min_{\{z_k,s_k\}_{k=1}^K} \: \hat{\sigma}^2 + \frac{1}{2} \lambda \sum_{k=1}^K s_k^2 \quad \text{s.t.} \quad \hat{\rho} = \hat{\mu}_1,\]

for regularisation parameter \(\lambda\). The problem may equivalently be written as

\[\min_{\{z_k,s_k\}_{k=1}^K} \: \hat{\mu}_2 + \frac{1}{2} \lambda \sum_{k=1}^K s_k^2,\]

or explicitly in terms of the posterior samples by

\[\min_{\{z_k,s_k\}_{k=1}^K} \: \sum_i C_i^2 + \frac{1}{2} \lambda \sum_{k=1}^K s_k^2.\]

The individual cost terms for each sample \(i\) are given by

\[C_i = \frac{\varphi(\theta_i)}{\mathcal{L}(\theta_i) \pi(\theta_i)} = \sum_{k=1}^K C_{ik},\]

which include the following component from cluster \(k\):

\[C_{ik} = \frac{w_k}{(2\pi)^{d/2} \vert \Sigma_k \vert^{1/2} s_k^d}\, \exp \biggl(\frac{- \bigl(\theta_i - \bar{\theta}_k\bigr)^\text{T} \Sigma_k^{-1} \bigl(\theta_i - \bar{\theta}_k\bigr)}{2 s_k^2}\biggr)\frac{1}{\mathcal{L}(\theta_i) \pi(\theta_i)}.\]

We solve this optimisation problem by stochastic gradient decent, which requires the gradients of the objective function. Denoting the total cost of the objective function by \(C = \sum_i C_i^2 + \frac{1}{2} \lambda \sum_{k=1}^K s_k^2\), it is straightforward to show that the gradients of the cost function with respect to the weights \(z_k\) and relative scales \(s_k\) are given by

\[\frac{\partial C}{\partial z_k} = 2 \sum_i C_i (C_{ik} - w_k C_i),\]

and

\[\frac{\partial C}{\partial s_k} = 2 \sum_i \frac{C_i C_{ik}}{s_k^3}\Bigl( \bigl(\theta_i - \bar{\theta}_k\bigr)^\text{T} \Sigma_k^{-1} \bigl(\theta_i - \bar{\theta}_k\bigr) - d s_k^2 \Bigr),\]

respectively.

The general procedure to learn the target distribution is the same as before: first, construct a normalised model; second, train the model by solving an optimisation problem to minimise the variance of the resulting learnt harmonic mean estimator. In this case we regularise the relative scale parameters and then solve by stochastic gradient descent. The number of clusters \(K\) can be deteremined by cross-validation (or other methods). While the modified Gaussian mixture model can effectively handle multimodal distributions, alternative models are better suited to narrow curving posterior degeneracies.

Note

This list of models is by no means comprehensive, and bespoke models may be implemented which perform better in specific use-cases.