As we've seen, standard Variational Autoencoders rely on relatively simple distributional assumptions, typically Gaussian, for both the prior $p(z)$ and the variational posterior $q_\phi(z \mid x)$. While this choice ensures tractability, it can significantly cap the model's expressiveness. The true posterior $p(z \mid x)$ for complex data might exhibit multimodality or intricate dependencies between latent dimensions, far richer than a factorized Gaussian can capture. Similarly, forcing the aggregated posterior $q(z) = \int q_\phi(z \mid x)\, p_{\text{data}}(x)\, dx$ to match a simple, fixed prior $p(z)$ can restrict the model's capacity to learn nuanced representations and may contribute to issues like posterior collapse. Normalizing Flows (NFs) provide a powerful and principled method to construct more flexible, learnable probability distributions, enabling VAEs to overcome these limitations.
At its core, a Normalizing Flow transforms a simple initial probability distribution $p_0(z_0)$ (often termed the base distribution, typically a standard Gaussian $\mathcal{N}(0, I)$) into a more complex target distribution $p_K(z_K)$ by applying a sequence of $K$ invertible and differentiable transformations $f_1, \dots, f_K$.
Imagine starting with a sample $z_0$ drawn from $p_0(z_0)$. This sample is then passed through the sequence:
$$z_1 = f_1(z_0), \quad z_2 = f_2(z_1), \quad \dots, \quad z_K = f_K(z_{K-1})$$
The remarkable part is that we can precisely calculate the probability density of the final transformed variable $z_K$. This is achieved using the change of variables formula from probability theory. If we have a transformation $z' = f(z)$, the density of $z'$ is related to the density of $z$ by
$$p_{Z'}(z') = p_Z\big(f^{-1}(z')\big)\, \left| \det \frac{\partial f^{-1}(z')}{\partial z'} \right|.$$
For a sequence of forward transformations $z_k = f_k(z_{k-1})$, it's often more convenient to express the log-density of the final output $z_K$ in terms of the initial $z_0$:
$$\log p_K(z_K) = \log p_0(z_0) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(z_{k-1}) \right|$$
Here, $J_{f_k}(z_{k-1})$ represents the Jacobian matrix of the transformation $f_k$ (i.e., the matrix of all first-order partial derivatives) evaluated at its input $z_{k-1}$. The term $\left| \det J_{f_k}(z_{k-1}) \right|$ accounts for how the transformation $f_k$ locally stretches or compresses space.
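As a quick sanity check of this formula, the sketch below pushes samples from a standard normal base distribution through the single transformation $f(z) = \exp(z)$ (the same transformation illustrated in the figure later in this section). The Jacobian of $f$ is $\exp(z_0)$, so the flow-style log-density is $\log p_0(z_0) - z_0$, which should match PyTorch's built-in Log-Normal density. This is an illustrative snippet, not part of any particular library's flow API.

```python
import torch
from torch.distributions import Normal, LogNormal

# Minimal numeric check of the change-of-variables formula for f(z) = exp(z).
# A standard normal pushed through exp() is exactly LogNormal(0, 1), so the
# flow-style density must agree with LogNormal's log_prob.
base = Normal(0.0, 1.0)
z0 = base.sample((5,))      # samples from the base distribution p_0
zK = torch.exp(z0)          # one-step "flow": z_K = f(z_0) = exp(z_0)

# log p_K(z_K) = log p_0(z_0) - log|det J_f(z_0)|, and here J_f(z_0) = exp(z_0),
# so log|det J| = z_0.
log_pK = base.log_prob(z0) - z0

print(torch.allclose(log_pK, LogNormal(0.0, 1.0).log_prob(zK)))  # expect: True
```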
For this entire process to be computationally viable and useful, each transformation $f_k$ in the flow must satisfy three conditions:
1. Invertibility: $f_k$ must have a well-defined inverse $f_k^{-1}$, so that samples and densities can be mapped between $z_{k-1}$ and $z_k$ in either direction.
2. Differentiability: both $f_k$ and its inverse must be differentiable, so that the Jacobian $J_{f_k}$ exists and gradients can flow through the transformation during training.
3. Tractable Jacobian determinant: $\det J_{f_k}$ must be cheap to compute; a naive determinant costs $O(D^3)$ for a $D$-dimensional latent space, so practical flow layers are designed so that this term reduces to something simple, such as a product of diagonal entries.
A sequence of invertible transformations f1,…,fK maps samples z0 from a simple base distribution p0(z0) to samples zK from a more complex target distribution pK(zK). The parameters of these transformations are typically learned.
The flexibility of Normalizing Flows can be harnessed within the VAE framework to enrich either the variational posterior $q_\phi(z \mid x)$, the prior $p(z)$, or even both.
The standard VAE often employs a factorized Gaussian for the variational posterior, such as $q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$. This is a mean-field approximation, which assumes independence between the latent dimensions given $x$. This assumption can be overly restrictive if the true posterior $p(z \mid x)$ exhibits complex correlations or is multimodal.
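For reference, here is a minimal sketch of this mean-field baseline (the module name and architecture are hypothetical, chosen only for illustration). Note how the log-density factorizes over latent dimensions, which is precisely the restriction discussed above.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Mean-field Gaussian encoder: q_phi(z|x) = N(mu(x), diag(sigma^2(x)))."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.log_var = nn.Linear(hidden, z_dim)

    def forward(self, x):
        h = self.net(x)
        mu, log_var = self.mu(h), self.log_var(h)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)   # reparameterization trick
        # The log-density is a sum of independent per-dimension terms:
        # this is exactly the mean-field assumption.
        log_q = torch.distributions.Normal(mu, std).log_prob(z).sum(dim=-1)
        return z, log_q
```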
Using NFs, we can construct a much richer $q_\phi(z \mid x)$:
1. The encoder network still maps $x$ to the parameters of a simple base distribution, e.g., $q_0(z_0 \mid x) = \mathcal{N}\big(\mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\big)$, from which $z_0$ is sampled via the reparameterization trick.
2. The sample $z_0$ is then passed through $K$ flow layers, $z_K = f_K \circ \dots \circ f_1(z_0)$, whose parameters may be shared across the model or produced by the encoder as functions of $x$.
3. The log-density of the final sample follows directly from the change of variables formula:
$$\log q_\phi(z_K \mid x) = \log q_0(z_0 \mid x) - \sum_{k=1}^{K} \log \left| \det J_{f_k}(z_{k-1}) \right|$$
This more sophisticated qϕ(zK∣x) then replaces the simpler posterior in the Evidence Lower Bound (ELBO) calculation. Specifically, the KL divergence term Eqϕ(z∣x)[logqϕ(z∣x)−logp(z)] now involves this expressive density. The ability of qϕ(z∣x) to better approximate the true, often intractable, posterior p(z∣x) can lead to a tighter ELBO (a higher value, closer to the true log-likelihood logp(x)) and consequently, more informative and useful latent representations.
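The sketch below shows one way this can be wired together, assuming an encoder like the one above and flow layers that each return a transformed sample together with their log-Jacobian-determinant (such as the planar, radial, and coupling layers sketched later in this section). Class and argument names are illustrative, not a specific library API.

```python
import torch
import torch.nn as nn

class FlowPosterior(nn.Module):
    """Wraps a mean-field encoder with K flow layers to form q_phi(z_K|x).
    Each flow layer is assumed to return (z_new, log_det)."""
    def __init__(self, encoder, flows):
        super().__init__()
        self.encoder = encoder             # e.g. the GaussianEncoder sketched above
        self.flows = nn.ModuleList(flows)  # e.g. [PlanarFlow(z_dim) for _ in range(K)]

    def forward(self, x):
        z, log_q = self.encoder(x)         # z_0 ~ q_0(z_0|x) and log q_0(z_0|x)
        for flow in self.flows:
            z, log_det = flow(z)
            log_q = log_q - log_det        # subtract log|det J_{f_k}| per layer
        return z, log_q                    # z_K and log q_phi(z_K|x)

# A single-sample estimate of the KL-like ELBO term is then simply:
# kl_estimate = log_q - prior_log_prob(z_K)
```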
In many VAE implementations, the prior over latent variables, $p(z)$, is fixed, commonly to a standard normal distribution $\mathcal{N}(0, I)$. This choice imposes a strong assumption on the structure of the latent space. If the data's intrinsic manifold doesn't naturally conform to an isotropic Gaussian shape when projected into the latent space, the model might struggle to learn effectively.
Normalizing Flows offer an elegant way to make the prior $p(z)$ learnable and more adaptive:
1. Start from a simple base distribution $p_0(\epsilon)$, typically $\mathcal{N}(0, I)$.
2. Transform it with a sequence of $M$ invertible flow layers $g_1, \dots, g_M$ with parameters $\theta$, so that a prior sample is $z = g_M \circ \dots \circ g_1(\epsilon)$.
3. The prior density follows the same change of variables formula, $\log p_\theta(z) = \log p_0(\epsilon) - \sum_{m=1}^{M} \log \left| \det J_{g_m} \right|$; evaluating it for a latent code produced by the encoder requires running the flow layers in the inverse direction to recover $\epsilon$.
The parameters $\theta$ of these prior-transforming flow layers $g_m$ are optimized jointly with the VAE's encoder and decoder parameters during training. A more flexible prior allows the model to discover a latent space geometry that is better suited to the data. This can be particularly helpful in mitigating posterior collapse, a phenomenon where the KL divergence term is minimized by making $q_\phi(z \mid x)$ nearly identical to $p(z)$, rendering the latent variables uninformative. If $p(z)$ itself can adapt, it may be "easier" for the encoder to map data to meaningful latent codes.
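A minimal sketch of such a learnable prior is shown below. It assumes each flow layer exposes an inverse(z) method that returns the previous variable together with the log-determinant of the forward map evaluated there (the coupling layer sketched later in this section could easily be extended to do this). The class and interface are illustrative assumptions rather than a standard API.

```python
import torch
import torch.nn as nn

class FlowPrior(nn.Module):
    """Learnable prior p_theta(z): a standard normal base distribution transformed
    by invertible layers g_1, ..., g_M. Evaluating log p_theta(z) for a latent code
    z (e.g. a posterior sample) runs the layers in the inverse direction."""
    def __init__(self, bijectors, z_dim):
        super().__init__()
        # Assumed interface: g.inverse(z) -> (z_prev, log_det), where log_det is
        # log|det J_{g_m}| of the *forward* map evaluated at z_prev.
        self.bijectors = nn.ModuleList(bijectors)
        self.base = torch.distributions.Normal(torch.zeros(z_dim), torch.ones(z_dim))

    def log_prob(self, z):
        total_log_det = 0.0
        for g in reversed(self.bijectors):      # invert g_M, ..., g_1 to recover epsilon
            z, log_det = g.inverse(z)
            total_log_det = total_log_det + log_det
        # log p_theta(z) = log p_0(epsilon) - sum_m log|det J_{g_m}|
        return self.base.log_prob(z).sum(dim=-1) - total_log_det
```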
The practical utility of NFs hinges on designing transformation layers $f_k$ that are both expressive and allow for efficient computation of their Jacobian determinants. Several families of such transformations have proven effective:
Planar Flows: These apply a transformation $f(z) = z + u\, h(w^T z + b)$, where $u, w \in \mathbb{R}^D$ and $b \in \mathbb{R}$ are learnable parameters, and $h$ is a smooth element-wise non-linearity like $\tanh$. The Jacobian determinant is relatively simple: $\det J_f = 1 + u^T \psi(z)$, where $\psi(z) = h'(w^T z + b)\, w$. Planar flows are straightforward but might require stacking many layers to achieve high expressivity, as each layer essentially pushes and pulls density along a hyperplane.
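A minimal PyTorch sketch of a planar flow layer might look as follows; the reparameterization of $u$ that keeps the layer invertible follows the construction proposed alongside planar flows (Rezende & Mohamed, 2015).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlanarFlow(nn.Module):
    """One planar flow layer: f(z) = z + u * tanh(w^T z + b)."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Parameter(torch.randn(dim) * 0.01)
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, z):                                  # z: (batch, dim)
        # Enforce w^T u >= -1 so the layer stays invertible.
        wtu = self.w @ self.u
        u_hat = self.u + (F.softplus(wtu) - 1 - wtu) * self.w / (self.w @ self.w)
        lin = z @ self.w + self.b                          # (batch,)
        z_new = z + u_hat * torch.tanh(lin).unsqueeze(-1)
        # log|det J| = log|1 + u_hat^T psi(z)|, with psi(z) = tanh'(lin) * w
        psi = (1 - torch.tanh(lin) ** 2).unsqueeze(-1) * self.w
        log_det = torch.log(torch.abs(1 + psi @ u_hat) + 1e-8)
        return z_new, log_det
```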
Radial Flows: These transformations modify the density around a specific reference point $z_{\text{ref}}$: $f(z) = z + \beta\, \big(\alpha + \lVert z - z_{\text{ref}} \rVert\big)^{-1} (z - z_{\text{ref}})$. Parameters include $z_{\text{ref}} \in \mathbb{R}^D$, $\alpha \in \mathbb{R}^+$, and $\beta \in \mathbb{R}$. Radial flows can create more localized changes in density.
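A corresponding sketch of a radial flow layer is below. The positivity constraint on $\alpha$ and the constraint $\beta \ge -\alpha$ (enforced via softplus) keep the layer invertible, and the log-determinant uses the closed form $(D-1)\log(1 + \beta h) + \log\big(1 + \beta h + \beta h'(r)\, r\big)$ with $h(\alpha, r) = 1/(\alpha + r)$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadialFlow(nn.Module):
    """One radial flow layer: f(z) = z + beta * h(alpha, r) * (z - z_ref),
    with r = ||z - z_ref|| and h(alpha, r) = 1 / (alpha + r)."""
    def __init__(self, dim):
        super().__init__()
        self.z_ref = nn.Parameter(torch.randn(dim) * 0.01)
        self.log_alpha = nn.Parameter(torch.zeros(1))   # alpha = exp(log_alpha) > 0
        self.beta_raw = nn.Parameter(torch.zeros(1))

    def forward(self, z):                               # z: (batch, dim)
        alpha = torch.exp(self.log_alpha)
        beta = -alpha + F.softplus(self.beta_raw)       # constrain beta >= -alpha
        diff = z - self.z_ref
        r = torch.linalg.norm(diff, dim=-1, keepdim=True)   # (batch, 1)
        h = 1.0 / (alpha + r)
        z_new = z + beta * h * diff
        # log|det J| = (D-1) log(1 + beta*h) + log(1 + beta*h + beta*h'(r)*r),
        # with h'(r) = -1 / (alpha + r)^2.
        d = z.shape[-1]
        bh = beta * h
        log_det = (d - 1) * torch.log(1 + bh) \
                  + torch.log(1 + bh - beta * r / (alpha + r) ** 2)
        return z_new, log_det.squeeze(-1)
```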
Coupling Layers (e.g., RealNVP, NICE, Glow): This class of transformations is particularly powerful and widely used, especially for high-dimensional $z$. The core idea is to split the input $z$ into two (or more) parts, say $z_A$ and $z_B$. One part is transformed based on the other, while the other part might be left unchanged or transformed independently:
$$z'_A = z_A \quad \text{(identity transformation for the first part)}$$
$$z'_B = z_B \odot \exp\big(s(z_A)\big) + t(z_A)$$
where the scaling $s(\cdot)$ and translation $t(\cdot)$ functions are complex maps, like neural networks, that only depend on $z_A$. The Jacobian of this transformation is lower triangular (or upper triangular if $z'_B = z_B$), meaning its determinant is simply the product of its diagonal elements. For the form above, this is $\prod_i \exp\big(s(z_A)_i\big) = \exp\big(\sum_i s(z_A)_i\big)$. Inversion is also computationally efficient:
$$z_A = z'_A, \qquad z_B = \big(z'_B - t(z'_A)\big) \odot \exp\big(-s(z'_A)\big)$$
By stacking many such coupling layers and alternating which part of $z$ is transformed (e.g., using permutations or by swapping the roles of $z_A$ and $z_B$), very complex and expressive distributions can be modeled.
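The sketch below shows a RealNVP-style affine coupling layer in this spirit; the small conditioning network, the tanh bounding of the log-scales, and the 50/50 split are illustrative design choices, not prescribed by the method.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling layer: the first half of z passes through unchanged;
    the second half is scaled and shifted conditioned on the first."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.split = dim // 2
        # Small network producing per-dimension log-scale s and shift t from z_A.
        self.net = nn.Sequential(
            nn.Linear(self.split, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.split)),
        )

    def forward(self, z):                                   # z: (batch, dim)
        z_a, z_b = z[:, :self.split], z[:, self.split:]
        s, t = self.net(z_a).chunk(2, dim=-1)
        s = torch.tanh(s)                                   # bound scales for stability
        z_b_new = z_b * torch.exp(s) + t
        log_det = s.sum(dim=-1)                             # log prod_i exp(s_i) = sum_i s_i
        return torch.cat([z_a, z_b_new], dim=-1), log_det

    def inverse(self, z_new):
        z_a, z_b_new = z_new[:, :self.split], z_new[:, self.split:]
        s, t = self.net(z_a).chunk(2, dim=-1)
        s = torch.tanh(s)
        z_b = (z_b_new - t) * torch.exp(-s)
        return torch.cat([z_a, z_b], dim=-1)
```

In practice, several such layers are stacked with a fixed permutation (or a simple swap of $z_A$ and $z_B$) between them, so that every dimension is eventually transformed.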
Autoregressive Flows (e.g., MAF, IAF): In these flows, the transformation for each dimension $z_i$ is conditioned on the preceding dimensions $z_{<i} = (z_1, \dots, z_{i-1})$. Specifically, $z'_i = \tau\big(z_i; h_i(z_{<i})\big)$, where $\tau$ is an invertible scalar transformation (like an affine transformation $a z_i + b$) whose parameters $h_i$ (e.g., $a$ and $b$) are produced by functions of $z_{<i}$. The resulting Jacobian is triangular, so its log-determinant is again a simple sum. The two named variants differ in which direction is cheap: MAF evaluates densities in a single pass but samples dimension by dimension, while IAF samples in a single pass but evaluates densities sequentially, which is why IAF is a common choice for enriching VAE posteriors.
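The sketch below implements the autoregressive idea with an explicit per-dimension loop so the conditioning structure is easy to see; practical MAF/IAF implementations replace the loop with masked networks (MADE) that produce all the scalar parameters in one pass. The names and the small conditioner architecture are illustrative.

```python
import torch
import torch.nn as nn

class AffineAutoregressiveFlow(nn.Module):
    """Illustrative autoregressive affine flow: z'_i = z_i * exp(s_i) + t_i,
    where (s_i, t_i) depend only on the preceding dimensions z_{<i}."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.dim = dim
        # One small conditioner per dimension; conditioner i sees z_{<i}.
        self.conditioners = nn.ModuleList([
            nn.Sequential(nn.Linear(max(i, 1), hidden), nn.ReLU(), nn.Linear(hidden, 2))
            for i in range(dim)
        ])

    def forward(self, z):                                   # z: (batch, dim)
        outs, log_det = [], 0.0
        for i in range(self.dim):
            # Dimension 0 has no predecessors; feed a constant zero input instead.
            ctx = z[:, :i] if i > 0 else torch.zeros(z.shape[0], 1, device=z.device)
            s, t = self.conditioners[i](ctx).chunk(2, dim=-1)   # each (batch, 1)
            s = torch.tanh(s)
            outs.append(z[:, i:i+1] * torch.exp(s) + t)
            log_det = log_det + s.squeeze(-1)               # triangular Jacobian: sum of s_i
        return torch.cat(outs, dim=-1), log_det
```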
A simple 1D Gaussian base distribution (blue) transformed into a Log-Normal distribution (orange) by the function $z' = \exp(z)$. Note how the density changes: areas where the transformation expands space (large $z$) see reduced density, and areas where it contracts space (small $z$) see increased density, as dictated by the Jacobian of the transformation.
Integrating Normalizing Flows into VAEs can yield substantial benefits:
- More expressive approximate posteriors that can capture correlations and multimodality, yielding a tighter ELBO and often better log-likelihoods.
- Learnable priors whose shape adapts to the data, rather than forcing the aggregated posterior toward a fixed isotropic Gaussian.
- Richer, more informative latent representations, with a reduced risk of posterior collapse.
However, these advantages come with certain trade-offs:
- Additional computation and memory per flow layer, for both the transformations themselves and their log-determinant terms.
- More hyperparameters to choose and tune, such as the number and type of flow layers.
- Greater implementation complexity, including invertibility constraints and numerical stability of the log-determinant computations.
- For some flow families, one direction (sampling or density evaluation) remains slow, which constrains where they can be used efficiently.
Normalizing Flows mark a significant advancement in the VAE toolkit, directly addressing some of the fundamental limitations related to distributional assumptions. They empower VAEs to learn intricate probability distributions for both the inference (posterior) and generative (prior) aspects of the model. When you're designing VAEs for challenging datasets or aiming for state-of-the-art generative performance and representation quality, evaluating whether the added expressiveness of NFs is worth the computational investment is an important consideration. Their successful integration into many cutting-edge generative models underscores their value in modern deep learning.