Learning a suitable target distribution

While we have described estimators to compute the marginal likelihood and Bayes factors based on a learnt target distribution \(\varphi(\theta)\), we have yet to consider the critical task of learning the target distribution. As discussed, the ideal target distribution is the posterior itself. However, since the target must be normalised, use of the posterior would require knowledge of the marginal likelihood – precisely the quantity that we are attempting to estimate. Instead, one can learn a normalised approximation of the posterior. The approximation itself does not need to be highly accurate. More critically, the learnt target approximating the posterior must exhibit narrower tails than the posterior in order to avoid the problematic scenario of the original harmonic mean estimator, which can result in very large variance.
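To illustrate why the tails of the target matter, consider the following minimal numerical sketch. It assumes a one-dimensional toy problem with a standard Gaussian posterior, a Gaussian target \(\varphi(\theta)\), and the learnt harmonic mean estimate of the reciprocal marginal likelihood (the target density divided by the unnormalised posterior, averaged over posterior samples); the function names are illustrative only. A target with narrower tails than the posterior yields small scatter, whereas a wider target remains unbiased but exhibits very large variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D problem: unnormalised posterior L(theta)pi(theta) = exp(-theta^2/2),
# so the posterior is N(0, 1) and the true evidence is z = sqrt(2*pi).
def unnorm_posterior(theta):
    return np.exp(-0.5 * theta**2)

def gauss_pdf(theta, sigma):
    return np.exp(-0.5 * (theta / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def reciprocal_evidence(theta, sigma_target):
    """Learnt harmonic mean estimate of 1/z with a Gaussian target varphi."""
    return np.mean(gauss_pdf(theta, sigma_target) / unnorm_posterior(theta))

true_rho = 1.0 / np.sqrt(2.0 * np.pi)  # true value of 1/z

# Repeat the estimate many times to compare the scatter of the two targets.
estimates_narrow, estimates_wide = [], []
for _ in range(200):
    theta = rng.standard_normal(1000)  # posterior samples
    estimates_narrow.append(reciprocal_evidence(theta, sigma_target=0.5))
    estimates_wide.append(reciprocal_evidence(theta, sigma_target=2.0))

print("true 1/z              :", true_rho)
print("narrow target mean/std:", np.mean(estimates_narrow), np.std(estimates_narrow))
print("wide target mean/std  :", np.mean(estimates_wide), np.std(estimates_wide))
```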

We present several examples of models that can be used to learn appropriate target distributions and discuss how to train them, although other models can of course be considered. Samples of the posterior are split into training and evaluation (cf. test) sets. The training set is used to learn the target distribution, after which the evaluation set, combined with the learnt target, is used to estimate the marginal likelihood. To train the models we construct an optimisation problem that minimises the variance of the estimator while ensuring it is unbiased, which we typically solve by stochastic gradient descent. To set hyper-parameters, we advocate cross-validation.
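The following sketch illustrates this workflow under simplifying assumptions: a Gaussian fitted to the training samples, with its covariance shrunk to give narrower tails, stands in for the learnt model, and the helper names are hypothetical rather than part of any package interface. The evaluation samples are then used to estimate the marginal likelihood from the ratio of the target to the unnormalised posterior.

```python
import numpy as np

def fit_concentrated_gaussian(train_samples, shrink=0.8):
    """Stand-in for a learnt target: a Gaussian fitted to the training samples,
    with its covariance shrunk so that the target has narrower tails than the
    posterior. (A flow or other model would normally be trained here.)"""
    mu = train_samples.mean(axis=0)
    cov = np.cov(train_samples, rowvar=False) * shrink**2
    return mu, cov

def log_gauss(theta, mu, cov):
    d = theta - mu
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet)

def estimate_evidence(eval_samples, log_unnorm_post, mu, cov):
    """Learnt harmonic mean: the mean of varphi(theta) / (L(theta) pi(theta))
    over the evaluation samples estimates 1/z."""
    log_ratios = np.array(
        [log_gauss(t, mu, cov) - log_unnorm_post(t) for t in eval_samples]
    )
    rho = np.mean(np.exp(log_ratios))
    return 1.0 / rho

# Example usage with a 2D standard-normal "posterior" (true evidence 2*pi).
rng = np.random.default_rng(1)
samples = rng.standard_normal((4000, 2))
log_unnorm_post = lambda t: -0.5 * t @ t

train, evaluation = samples[:2000], samples[2000:]  # training / evaluation split
mu, cov = fit_concentrated_gaussian(train)
print("estimated evidence:", estimate_evidence(evaluation, log_unnorm_post, mu, cov))
print("true evidence     :", 2.0 * np.pi)
```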

Learning from posterior samples

Here we cover several functional forms for the learned flow models \(\varphi(\theta)\) that are used throughout the code. For these models no hyper-parameter optimisation is required; the flow should do all the heavy lifting for us!

We also provide support for legacy models, which are somewhat less expressive but nonetheless useful for simple posterior distributions. Hyper-parameters of these models can be considered analogous to the nodes of a conventional network, the values of which are learnt from a small sub-set of posterior samples.

Normalizing flows are a class of probabilistic models that allow one to evaluate the density of, and sample from, a learned probability distribution (for a review see (Papamakarios et al., 2021)). They consist of a series of transformations that are applied to a simple base distribution. A vector \(\theta\) drawn from an unknown distribution \(p(\theta)\) can be expressed through a transformation \(T\) of a vector \(z\) sampled from a base distribution \(q(z)\):

\[\theta = T(z), \text{ where } z \sim q(z).\]

Typically the base distribution is chosen so that its density can be evaluated simply and so that it can be sampled from easily; a Gaussian is a common choice. The unknown distribution can then be recovered by the change of variables formula:

\[p(\theta) = q(z) \vert \det J_{T}(z) \vert^{-1},\]

where \(J_{T}(z)\) is the Jacobian corresponding to the transformation \(T\). In a flow-based model \(T\) consists of a series of learned transformations that are each invertible and differentiable, so that the full transformation is also invertible and differentiable. This allows us to compose multiple simple transformations with learned parameters into what is called a flow, obtaining a normalised approximation of the unknown distribution that we can both sample from and evaluate. Careful attention is given to the construction of the transformations so that the determinant of the Jacobian can be computed easily.
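As a concrete illustration of the change of variables formula, the sketch below uses a single affine transformation \(T(z) = \mu + e^{a} z\) in place of a learned flow (an assumption made purely for clarity), recovering the density of the transformed variable from the base Gaussian density and the Jacobian factor.

```python
import numpy as np

def gauss_pdf(z, loc=0.0, scale=1.0):
    return np.exp(-0.5 * ((z - loc) / scale) ** 2) / (scale * np.sqrt(2.0 * np.pi))

# A single invertible, differentiable transformation: theta = T(z) = mu + exp(a) * z,
# with a standard Gaussian base distribution q(z).
mu, a = 1.5, 0.3

def T(z):
    return mu + np.exp(a) * z

# Change of variables: p(theta) = q(z) |det J_T(z)|^{-1}, where J_T(z) = exp(a) here.
def p_theta(theta):
    z = (theta - mu) / np.exp(a)      # invert the transformation
    return gauss_pdf(z) * np.exp(-a)  # q(z) * |det J_T(z)|^{-1}

# Sampling from the flow: push base samples z ~ q(z) through T.
theta = T(np.random.default_rng(0).standard_normal(5))

# This simple flow represents N(mu, exp(2a)) exactly, so the two densities agree.
print(p_theta(1.0))
print(gauss_pdf(1.0, loc=mu, scale=np.exp(a)))
```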

Real NVP Flows

A relatively simple example of a normalizing flow is the real-valued non-volume preserving (real NVP) flow introduced in (Dinh et al., 2016). It consists of a series of bijective transformations given by affine coupling layers. Consider a \(D\)-dimensional input \(z\), split into elements up to and following \(d\), respectively \(z_{1:d}\) and \(z_{d+1:D}\), for \(d<D\). Given input \(z\), the output \(y\) of an affine coupling layer is calculated by

\[\begin{split}y_{1:d} = & z_{1:d} ; \\ y_{d+1:D} = & z_{d+1:D} \odot \exp\bigl(s(z_{1:d})\bigr) + t(z_{1:d}),\end{split}\]

where \(\odot\) denotes Hadamard (elementwise) multiplication. The scale \(s\) and translation \(t\) are typically represented by neural networks with learnable parameters that take as input \(z_{1:d}\). This construction is easily invertible and ensures the Jacobian is a lower-triangular matrix, making its determinant efficient to calculate.
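The sketch below implements a single affine coupling layer along these lines, with fixed random single-layer maps standing in for the learned scale and translation networks (an illustrative assumption); it demonstrates the forward transform, the analytic inverse and the log-determinant of the Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 4, 2  # input dimension and split point

# Stand-ins for the learned scale and translation networks s, t (here fixed
# single-layer maps with random weights; in practice these are trained MLPs).
Ws, Wt = rng.normal(size=(d, D - d)), rng.normal(size=(d, D - d))

def s(z1):
    return np.tanh(z1 @ Ws)

def t(z1):
    return z1 @ Wt

def coupling_forward(z):
    z1, z2 = z[:d], z[d:]
    y2 = z2 * np.exp(s(z1)) + t(z1)  # affine transform of the second block
    log_det = np.sum(s(z1))          # log|det J| = sum of the log-scales
    return np.concatenate([z1, y2]), log_det

def coupling_inverse(y):
    y1, y2 = y[:d], y[d:]
    z2 = (y2 - t(y1)) * np.exp(-s(y1))  # inverted without inverting s or t
    return np.concatenate([y1, z2])

# Round-trip check: the inverse recovers the input exactly.
z = rng.standard_normal(D)
y, log_det = coupling_forward(z)
print(np.allclose(coupling_inverse(y), z), log_det)
```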

Rational Quadratic Spline Flows

A more complex and expressive class of flows is given by rational quadratic spline flows, described in detail in (Durkan et al., 2019). The architecture is similar to that of real NVP flows, but the layers include monotonic splines: piecewise functions consisting of multiple segments of monotonic rational-quadratics with learned parameters. Such layers are combined with alternating affine transformations to create the normalizing flow. Rational quadratic spline flows are well-suited to higher-dimensional and more complex problems than real NVP flows.
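To make the spline building block concrete, the sketch below evaluates a single monotonic rational-quadratic segment, following the parameterisation of Durkan et al. (2019); the knot positions and derivatives are arbitrary illustrative values rather than learned quantities.

```python
import numpy as np

def rq_segment(x, xk, xk1, yk, yk1, dk, dk1):
    """Monotonic rational-quadratic segment between knots (xk, yk) and (xk1, yk1),
    with positive derivatives dk, dk1 at the knots (cf. Durkan et al., 2019)."""
    sk = (yk1 - yk) / (xk1 - xk)  # bin slope
    xi = (x - xk) / (xk1 - xk)    # position within the bin, in [0, 1]
    denom = sk + (dk1 + dk - 2.0 * sk) * xi * (1.0 - xi)
    g = yk + (yk1 - yk) * (sk * xi**2 + dk * xi * (1.0 - xi)) / denom
    dg = sk**2 * (dk1 * xi**2 + 2.0 * sk * xi * (1.0 - xi) + dk * (1.0 - xi) ** 2) / denom**2
    return g, dg

# Illustrative knots and derivatives (in a flow these would be network outputs).
x = np.linspace(0.0, 1.0, 201)
g, dg = rq_segment(x, xk=0.0, xk1=1.0, yk=-2.0, yk1=3.0, dk=0.5, dk1=4.0)

# The segment interpolates the knots, matches the knot derivatives and is monotonic,
# so it is analytically invertible and its derivative enters the Jacobian directly.
print(g[0], g[-1])       # -2.0, 3.0
print(dg[0], dg[-1])     # 0.5, 4.0
print(np.all(dg > 0.0))  # True
```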

Note

This list of models is by no means comprehensive, and bespoke models may be implemented which perform better in specific use-cases.