Normalizing Flows for Flexible Priors and Posteriors
Standard Variational Autoencoders rely on relatively simple distributional assumptions, typically Gaussian, for both the prior p(z) and the variational posterior qϕ(z∣x). While this choice ensures tractability, it can significantly cap the model's expressiveness. The true posterior p(z∣x) for complex data might exhibit multimodality or intricate dependencies between latent dimensions, far richer than a factorized Gaussian can capture. Similarly, forcing the aggregated posterior q(z)=∫qϕ(z∣x)pdata(x)dx to match a simple, fixed prior p(z) can restrict the model's capacity to learn representations and may contribute to issues like posterior collapse. Normalizing Flows (NFs) provide an effective and principled method to construct more flexible, learnable probability distributions, enabling VAEs to overcome these limitations.
The Essence of Normalizing Flows
At its core, a Normalizing Flow transforms a simple initial probability distribution p0(z0) (often termed the base distribution, typically a standard Gaussian N(0,I)) into a more complex target distribution pK(zK) by applying a sequence of K invertible and differentiable transformations f1,…,fK.
Imagine starting with a sample z0 drawn from p0(z0). This sample is then passed through the sequence:
$$ z_1 = f_1(z_0), \qquad z_2 = f_2(z_1), \qquad \ldots, \qquad z_K = f_K(z_{K-1}) $$
The remarkable part is that we can precisely calculate the probability density of the final transformed variable zK. This is achieved using the change of variables formula from probability theory. If we have a transformation z′ = f(z), the density of z′ is related to the density of z by $p_{Z'}(z') = p_Z\!\left(f^{-1}(z')\right)\,\left| \det \dfrac{\partial f^{-1}(z')}{\partial z'} \right|$. For a sequence of forward transformations zk = fk(zk−1), it is often more convenient to express the log-density of the final output zK in terms of the initial z0:

$$ \log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(z_{k-1}) \right| $$
Here, Jfk(zk−1) represents the Jacobian matrix of the transformation fk (i.e., the matrix of all first-order partial derivatives) evaluated at its input zk−1. The term ∣detJfk(zk−1)∣ accounts for how the transformation fk locally stretches or compresses space.
For this entire process to be computationally viable and useful, each transformation fk in the flow must satisfy three conditions:
It must be invertible, meaning we can recover zk−1 from zk via fk−1.
It must be differentiable, so that the Jacobian matrix exists.
Crucially, its Jacobian determinant det Jfk must be efficient to compute. This constraint heavily influences the design of suitable flow layers.
A sequence of invertible transformations f1,…,fK maps samples z0 from a simple base distribution p0(z0) to samples zK from a more complex target distribution pK(zK). The parameters of these transformations are typically learned.
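To make this recipe concrete, here is a minimal sketch (assuming PyTorch; the layer classes and names are illustrative, not taken from any particular library) of the composition pattern: each layer returns its output together with its log-determinant term, and the log-density of zK is the base log-density minus the accumulated terms.

```python
import torch
from torch.distributions import Normal

# Each flow layer maps z -> (z_next, log|det J|) for a batch of samples.
# Both layers below are illustrative toy transforms, not library components.

class ElementwiseAffine:
    """z' = a * z + b applied per dimension; the Jacobian is diag(a)."""
    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, z):
        log_det = torch.log(torch.abs(self.a)).expand_as(z).sum(dim=-1)
        return self.a * z + self.b, log_det

class Exp:
    """z' = exp(z); the Jacobian is diagonal with entries exp(z_i),
    so log|det J| = sum_i z_i."""
    def forward(self, z):
        return torch.exp(z), z.sum(dim=-1)

base = Normal(0.0, 1.0)                           # factorized p_0(z_0) = N(0, I)
flows = [ElementwiseAffine(torch.tensor(0.5), torch.tensor(1.0)), Exp()]

z = base.sample((4, 2))                           # 4 samples of a 2-dim z_0
log_p = base.log_prob(z).sum(dim=-1)              # log p_0(z_0)
for f in flows:
    z, log_det = f.forward(z)
    log_p = log_p - log_det                       # change of variables, one layer at a time
# z now holds samples z_K; log_p holds log p_K(z_K) for each of them.
print(z.shape, log_p.shape)                       # torch.Size([4, 2]) torch.Size([4])
```

The same accumulate-as-you-go structure reappears in every flow variant discussed in the rest of this section.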
Weaving Normalizing Flows into VAEs
The flexibility of Normalizing Flows can be harnessed within the VAE framework to enrich either the variational posterior qϕ(z∣x), the prior p(z), or even both.
More Expressive Variational Posteriors
The standard VAE often employs a factorized Gaussian for the variational posterior, such as qϕ(z∣x)=N(μϕ(x),diag(σϕ2(x))). This is a mean-field approximation, which assumes independence between the latent dimensions given x. This assumption can be overly restrictive if the true posterior p(z∣x) exhibits complex correlations or is multimodal.
Using NFs, we can construct a much richer qϕ(z∣x):
The encoder network (parameterized by ϕ) outputs parameters for a simple base distribution, say z0∼qbase(z0∣x) (e.g., a diagonal Gaussian whose mean and variance are functions of x).
This initial sample z0 is then passed through a sequence of K flow transformations f1,…,fK. The parameters of these flow layers can also be dependent on x or be global, learned parameters. This produces zK=fK(…f1(z0)).
The resulting variational posterior is qϕ(zK∣x). Its log-density is computed as:
$$ \log q_\phi(z_K \mid x) = \log q_{\text{base}}(z_0 \mid x) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(z_{k-1}) \right| $$
This more sophisticated qϕ(zK∣x) then replaces the simpler posterior in the Evidence Lower Bound (ELBO) calculation. Specifically, the KL divergence term Eqϕ(z∣x)[logqϕ(z∣x)−logp(z)] now involves this expressive density. The ability of qϕ(z∣x) to better approximate the true, often intractable, posterior p(z∣x) can lead to a tighter ELBO (a higher value, closer to the true log-likelihood logp(x)) and consequently, more informative and useful latent representations.
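As a sketch of how this plugs into the encoder, the snippet below (again assuming PyTorch; the planar-flow module and helper names are illustrative) draws z0 with the reparameterization trick and pushes it through a stack of planar flows, accumulating the log|det J| terms so that log qϕ(zK∣x) can be used directly inside the ELBO.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class PlanarFlow(nn.Module):
    """f(z) = z + u * tanh(w^T z + b). The constraint on u that guarantees
    invertibility (Rezende & Mohamed, 2015) is omitted here for brevity."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(0.01 * torch.randn(dim))
        self.w = nn.Parameter(0.01 * torch.randn(dim))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):
        lin = z @ self.w + self.b                                   # shape (batch,)
        f_z = z + self.u * torch.tanh(lin).unsqueeze(-1)            # planar transform
        psi = (1.0 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w   # psi(z) = h'(w^T z + b) w
        log_det = torch.log(torch.abs(1.0 + psi @ self.u) + 1e-8)   # log|det J_f|
        return f_z, log_det

def flow_posterior_sample(mu, log_var, flows):
    """Reparameterized z_0 ~ N(mu, diag(exp(log_var))), pushed through the flows.
    Returns z_K and log q(z_K | x), the quantities needed in the ELBO."""
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)                            # z_0 via reparameterization
    log_q = Normal(mu, std).log_prob(z).sum(dim=-1)                 # log q_base(z_0 | x)
    for f in flows:
        z, log_det = f(z)
        log_q = log_q - log_det                                     # subtract each log|det J_{f_k}|
    return z, log_q

# Hypothetical encoder outputs for a batch of 8 inputs and a 2-dim latent space:
mu, log_var = torch.zeros(8, 2), torch.zeros(8, 2)
flows = nn.ModuleList([PlanarFlow(dim=2) for _ in range(4)])
z_K, log_q = flow_posterior_sample(mu, log_var, flows)
print(z_K.shape, log_q.shape)                                       # torch.Size([8, 2]) torch.Size([8])
```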
Learnable and Flexible Priors
In many VAE implementations, the prior over latent variables, p(z), is fixed, commonly to a standard normal distribution N(0,I). This choice imposes a strong assumption on the structure of the latent space. If the data's intrinsic manifold doesn't naturally conform to an isotropic Gaussian shape when projected into the latent space, the model might struggle to learn effectively.
Normalizing Flows offer an elegant way to make the prior p(z) learnable and more adaptive:
Begin with a very simple base distribution, z0∼p0(z0) (e.g., N(0,I)).
Apply a sequence of M flow transformations g1,…,gM, whose parameters θ are learnable, to obtain zM=gM(…g1(z0)).
This construction defines the prior pθ(zM), and its log-density is:
$$ \log p_\theta(z_M) = \log p_0(z_0) - \sum_{m=1}^{M} \log \left| \det J_{g_m}(z_{m-1}) \right| $$
The parameters θ of these prior-transforming flow layers gm are optimized jointly with the VAE's encoder and decoder parameters during training. A more flexible prior allows the model to discover a latent space geometry that is better suited to the data. This can be particularly helpful in mitigating posterior collapse, a phenomenon where the KL divergence term is minimized by making qϕ(z∣x) nearly identical to p(z), rendering the latent variables uninformative. If p(z) itself can adapt, it may be "easier" for the encoder to map data to meaningful latent codes.
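A minimal sketch of a learnable flow prior follows (assuming PyTorch; the elementwise affine layers are a deliberately simple, hypothetical choice so that both directions of the flow stay trivial). Evaluating log pθ(z) at a posterior sample runs the flow in the inverse direction back to z0 and applies the same change-of-variables bookkeeping as before.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ElementwiseAffineFlow(nn.Module):
    """g(z) = exp(log_scale) * z + shift, per dimension; trivially invertible,
    with log|det J_g| = sum(log_scale). A deliberately simple, hypothetical layer."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, z):                          # generative direction: z_{m-1} -> z_m
        log_det = self.log_scale.sum() + torch.zeros(z.shape[0])
        return torch.exp(self.log_scale) * z + self.shift, log_det

    def inverse(self, z):                          # z_m -> z_{m-1}
        return (z - self.shift) * torch.exp(-self.log_scale)

class FlowPrior(nn.Module):
    """p_theta(z) defined by pushing N(0, I) through a stack of learnable flows."""
    def __init__(self, dim, num_layers):
        super().__init__()
        self.dim = dim
        self.base = Normal(0.0, 1.0)
        self.flows = nn.ModuleList([ElementwiseAffineFlow(dim) for _ in range(num_layers)])

    def log_prob(self, z):
        """Evaluate log p_theta(z) at arbitrary points (e.g., posterior samples) by
        inverting the flow back to z_0 and applying the change of variables."""
        log_det_total = torch.zeros(z.shape[0])
        for g in reversed(self.flows):
            z = g.inverse(z)                       # step back toward the base distribution
            _, log_det = g.forward(z)              # log|det J_g| evaluated at z_{m-1}
            log_det_total = log_det_total + log_det
        return self.base.log_prob(z).sum(dim=-1) - log_det_total

    def sample(self, n):
        z = self.base.sample((n, self.dim))
        for g in self.flows:
            z, _ = g.forward(z)
        return z

# The prior's parameters are trained jointly with the encoder and decoder, e.g. by
# using prior.log_prob(z_K) in the ELBO in place of log N(z_K; 0, I).
prior = FlowPrior(dim=2, num_layers=3)
print(prior.log_prob(prior.sample(5)).shape)       # torch.Size([5])
```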
Common Architectures for Flow Transformations
The practical utility of NFs relies on designing transformation layers fk that are both expressive and allow for efficient computation of their Jacobian determinants. Several families of such transformations have proven effective:
Planar Flows: These apply a transformation f(z)=z+uh(wTz+b), where u,w∈RD and b∈R are learnable parameters, and h is a smooth element-wise non-linearity like tanh. The Jacobian determinant is relatively simple: detJf=1+uTψ(z), where ψ(z)=h′(wTz+b)w. Planar flows are straightforward but might require stacking many layers to achieve high expressivity, as each layer essentially pushes and pulls density along a hyperplane.
Radial Flows: These transformations modify the density around a specific reference point zref: f(z) = z + β (z − zref) / (α + ∥z − zref∥). Parameters include zref∈RD, α∈R+, and β∈R. Radial flows can create more localized changes in density.
Coupling Layers (e.g., RealNVP, NICE, Glow): This class of transformations is particularly powerful and widely used, especially for high-dimensional z. The core idea is to split the input z into two (or more) parts, say zA and zB. One part is transformed based on the other, while the other part might be left unchanged or transformed independently:
zA′=zA (identity transformation for the first part)
zB′=zB⊙exp(s(zA))+t(zA) (the second part is scaled and shifted, where scaling s(⋅) and translation t(⋅) functions are complex maps, like neural networks, that only depend on zA).
The Jacobian of this transformation is lower triangular (or upper triangular if zB′ = zB), meaning its determinant is simply the product of its diagonal elements. For the form above, this is $\prod_i \exp\!\left(s(z_A)_i\right) = \exp\!\left(\sum_i s(z_A)_i\right)$. Inversion is also computationally efficient:
$$ z_A = z_A', \qquad z_B = \left( z_B' - t(z_A') \right) \odot \exp\!\left( -s(z_A') \right) $$
By stacking many such coupling layers and alternating which part of z is transformed (e.g., using permutations or by swapping the roles of zA and zB), very complex and expressive distributions can be modeled; a minimal coupling-layer sketch follows this list.
Autoregressive Flows (e.g., MAF, IAF): In these flows, the transformation for each dimension zi is conditioned on the preceding dimensions z<i=(z1,…,zi−1). Specifically, zi′=τ(zi;hi(z<i)), where τ is an invertible scalar transformation (like an affine transformation azi+b) whose parameters hi (e.g., a and b) are produced by functions of z<i.
Masked Autoregressive Flow (MAF): The parameters for transforming zi are generated from z<i. This structure makes density evaluation logp(z′) efficient (it can be done in a single pass), but sampling z′ is sequential (z1′ first, then z2′ using z1′, and so on) and thus slower in high dimensions.
Inverse Autoregressive Flow (IAF): Uses the inverse parameterization of MAF. Sampling z′ is fast and fully parallel, because each output zi′ depends only on the base noise zi and z<i, all of which are available at once; density evaluation of an arbitrary z′, however, requires recovering z dimension by dimension and is therefore slow.
Both MAF and IAF are highly expressive, particularly when the conditioning functions hi are themselves parameterized by neural networks (e.g., using the MADE architecture); a naive autoregressive sketch also follows this list.
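As referenced above, a RealNVP-style affine coupling layer might look like the sketch below (assuming PyTorch; the conditioner network, the fixed half-and-half split, and the flip between layers are simplifications). Note the triangular-Jacobian log-determinant and the cheap inverse.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Splits z into (z_A, z_B); z_A passes through unchanged, z_B is scaled and
    shifted by functions of z_A. A RealNVP-style sketch, not a full implementation."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Conditioner producing log-scale s(z_A) and shift t(z_A) for z_B.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, z):
        z_a, z_b = z[:, :self.d], z[:, self.d:]
        s, t = self.net(z_a).chunk(2, dim=-1)
        s = torch.tanh(s)                           # keep scales bounded for stability
        z_b_new = z_b * torch.exp(s) + t
        log_det = s.sum(dim=-1)                     # triangular Jacobian: det = prod_i exp(s_i)
        return torch.cat([z_a, z_b_new], dim=-1), log_det

    def inverse(self, z):
        z_a, z_b = z[:, :self.d], z[:, self.d:]
        s, t = self.net(z_a).chunk(2, dim=-1)
        s = torch.tanh(s)
        return torch.cat([z_a, (z_b - t) * torch.exp(-s)], dim=-1)

# Stack couplings with a flip in between so every dimension eventually gets transformed:
layers = [AffineCoupling(dim=4) for _ in range(3)]
z = torch.randn(8, 4)
log_det_total = torch.zeros(8)
for layer in layers:
    z, log_det = layer(z)
    z = torch.flip(z, dims=[-1])                    # alternate which half is conditioned on
    log_det_total = log_det_total + log_det         # flipping is volume-preserving (|det| = 1)
```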
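A deliberately naive affine autoregressive layer (again a PyTorch sketch; real MAF/IAF implementations use MADE-style masked networks rather than this per-dimension loop) makes the speed asymmetry explicit: the direction that conditions on already-known inputs is a single pass, while the other direction must proceed one dimension at a time.

```python
import torch
import torch.nn as nn

class NaiveAffineAutoregressive(nn.Module):
    """z'_i = z_i * exp(s_i(z_<i)) + t_i(z_<i). A toy per-dimension conditioner is
    used for clarity; MAF/IAF use masked networks (MADE) instead of this loop."""
    def __init__(self, dim):
        super().__init__()
        self.first = nn.Parameter(torch.zeros(2))                  # free (s_0, t_0) for dim 0
        self.conds = nn.ModuleList([nn.Linear(i, 2) for i in range(1, dim)])
        self.dim = dim

    def _params(self, z_prefix, i):
        if i == 0:
            return self.first[0], self.first[1]
        s, t = self.conds[i - 1](z_prefix).unbind(dim=-1)
        return s, t

    def forward(self, z):
        """z -> z' in one pass over dimensions: every s_i, t_i depends only on the
        known inputs z_<i, so nothing waits on previously produced outputs."""
        outs, log_det = [], torch.zeros(z.shape[0])
        for i in range(self.dim):
            s, t = self._params(z[:, :i], i)
            outs.append(z[:, i] * torch.exp(s) + t)
            log_det = log_det + s
        return torch.stack(outs, dim=-1), log_det

    def inverse(self, z_out):
        """z' -> z is inherently sequential: recovering z_i needs z_<i, which must
        themselves be recovered first."""
        zs = []
        for i in range(self.dim):
            prefix = torch.stack(zs, dim=-1) if zs else z_out[:, :0]
            s, t = self._params(prefix, i)
            zs.append((z_out[:, i] - t) * torch.exp(-s))
        return torch.stack(zs, dim=-1)

flow = NaiveAffineAutoregressive(dim=3)
z = torch.randn(5, 3)
z_out, log_det = flow(z)
print(torch.allclose(flow.inverse(z_out), z, atol=1e-5))   # round trip recovers z (up to float error)
```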
A simple 1D Gaussian base distribution (blue) transformed into a Log-Normal distribution (orange) by the function z′=exp(z). Note how the density changes: areas where the transformation expands space (large z) see reduced density, and areas where it contracts space (small z) see increased density, as dictated by the Jacobian of the transformation.
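The figure's one-dimensional example can also be checked numerically: pushing samples from N(0,1) through z′ = exp(z) and applying the change-of-variables correction reproduces the standard Log-Normal density. A quick sanity check, assuming PyTorch's distributions module:

```python
import torch
from torch.distributions import Normal, LogNormal

base = Normal(0.0, 1.0)
z = base.sample((100000,))
z_prime = torch.exp(z)                     # z' = exp(z), an invertible, differentiable map

# Change of variables: log p(z') = log p_0(z) - log|d exp(z)/dz| = log p_0(z) - z
log_p_flow = base.log_prob(z) - z
log_p_exact = LogNormal(0.0, 1.0).log_prob(z_prime)

# Expected: True (the two densities agree up to floating-point error)
print(torch.allclose(log_p_flow, log_p_exact, atol=1e-4))
```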
The Upshot for VAE Performance
Integrating Normalizing Flows into VAEs can yield substantial benefits:
Improved Density Modeling and Tighter ELBO: By allowing qϕ(z∣x) to better approximate the true posterior, or p(z) to better model the latent data manifold, NFs often lead to a higher (tighter) ELBO. This indicates that the VAE is learning a better model of the data distribution.
Enhanced Sample Quality: VAEs equipped with flow-based posteriors and/or priors frequently generate samples (from pθ(x)=∫pθ(x∣z)pθ(z)dz) that are sharper, more diverse, and more realistic than those from standard VAEs. This is especially noticeable in domains with complex data like high-resolution images.
Richer and More Meaningful Latent Representations: When not constrained by overly simplistic distributional forms, the VAE can learn latent variables z that capture more intricate and semantically meaningful factors of variation in the data. This is particularly true if a learnable flow-based prior allows the latent space to adapt its geometry.
Alleviation of Posterior Collapse: A more flexible posterior qϕ(z∣x) is less likely to become trivial (i.e., ignore x and collapse to the prior p(z)) because it has the capacity to model complex conditional dependencies. Similarly, a flexible prior can adapt to the aggregated posterior, reducing the KL-divergence pressure that sometimes causes this collapse.
However, these advantages come with certain trade-offs:
Increased Computational Demand: Each layer in a Normalizing Flow adds to the computational load, primarily due to the forward pass through the transformation and the calculation of its Jacobian determinant. Deeper or more complex flow architectures can significantly increase training and inference times.
Greater Model Complexity and Optimization Challenges: The overall VAE model becomes more complex due to the additional parameters within the flow networks. Optimizing these larger models can be more difficult, potentially requiring careful hyperparameter tuning, sophisticated optimization algorithms, or longer training schedules.
Choice of Flow Architecture: The degree of improvement often depends critically on the specific choice of flow architecture (e.g., coupling vs. autoregressive, the number of flow layers, the complexity of the neural networks within each layer). Selecting the optimal flow design for a given problem is not always straightforward and is an active area of research.
Moving Forward with Expressive Distributions
Normalizing Flows mark a significant advancement in the VAE toolkit, directly addressing some of the fundamental limitations related to distributional assumptions. They empower VAEs to learn intricate probability distributions for both the inference (posterior) and generative (prior) aspects of the model. When you're designing VAEs for challenging datasets or aiming for state-of-the-art generative performance and representation quality, evaluating whether the added expressiveness of NFs is worth the computational investment is an important consideration. Their successful integration into many cutting-edge generative models underscores their value in modern deep learning.
Variational Inference with Normalizing Flows. Danilo Jimenez Rezende, Shakir Mohamed. 2015. Proceedings of the 32nd International Conference on Machine Learning (ICML). Introduces normalizing flows for variational inference, proposing planar and radial flows to build more expressive approximate posteriors in VAEs.
Improving Variational Inference with Inverse Autoregressive Flows. Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling. 2016. Advances in Neural Information Processing Systems, Vol. 29 (NeurIPS). DOI: 10.48550/arXiv.1606.04934. Introduces inverse autoregressive flows (IAF) to create flexible posterior distributions, allowing efficient sampling and exact log-probability computation.
Density estimation using Real NVP. Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio. 2017. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1605.08803. Presents Real NVP, a type of coupling-layer flow that allows for both efficient density estimation and sampling, suitable for high-dimensional data.
Masked Autoregressive Flow for Density Estimation. George Papamakarios, Theo Pavlakou, Iain Murray. 2017. Advances in Neural Information Processing Systems, Vol. 30 (NeurIPS). DOI: 10.48550/arXiv.1705.07057. Details masked autoregressive flows (MAF) for density estimation, where log-probability evaluation is efficient while sampling is sequential.