Normalizing Flows offer a distinct approach to generative modeling compared to Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs). Their primary strength lies in providing an exact computation of the data likelihood while simultaneously defining an invertible mapping between the data space and a latent space with a simple distribution (like a standard Gaussian). This makes them particularly suitable for tasks requiring precise density estimation or where invertibility is advantageous.
The core idea is to learn a transformation $f: \mathcal{X} \to \mathcal{Z}$ that maps complex data points $x \in \mathcal{X}$ to simpler latent variables $z \in \mathcal{Z}$, where $p_Z(z)$ is a known, tractable probability distribution (e.g., $\mathcal{N}(0, I)$). Because $f$ is designed to be invertible ($x = f^{-1}(z)$) and differentiable, we can use the change of variables theorem from probability theory to relate the density of the data $p_X(x)$ to the density of the latent variables $p_Z(z)$.
The change of variables formula states that for an invertible, differentiable function f mapping x to z, the relationship between their probability densities is:
$$p_X(x) = p_Z(f(x)) \left| \det\left( \frac{\partial f(x)}{\partial x^T} \right) \right|$$

Here, $\frac{\partial f(x)}{\partial x^T}$ is the Jacobian matrix of the transformation $f$ evaluated at $x$, and $|\det(\cdot)|$ denotes the absolute value of its determinant.
This formula is essential. It tells us that if we can compute f(x) and the determinant of its Jacobian, we can calculate the exact probability density pX(x) for any given data point x, using the known density pZ.
For generative modeling, we often work with the log-likelihood, which avoids numerical underflow and simplifies calculations:
$$\log p_X(x) = \log p_Z(f(x)) + \log \left| \det\left( \frac{\partial f(x)}{\partial x^T} \right) \right|$$

Training a normalizing flow involves maximizing this log-likelihood over a dataset. This requires the transformation $f$ to have two properties: it must be invertible, with an inverse we can actually evaluate, and the determinant of its Jacobian must be efficient to compute.
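Before looking at how such transformations are built, here is a minimal one-dimensional sanity check of the formula. The affine map $f(x) = (x - \mu)/\sigma$ and the specific numbers are illustrative choices, not anything prescribed above:

import torch
import torch.distributions as dist

# Illustrative 1D example: f(x) = (x - mu) / sigma maps N(mu, sigma^2) to N(0, 1)
mu, sigma = 2.0, 1.5
x = torch.tensor(3.0)

z = (x - mu) / sigma                                  # z = f(x)
log_pz = dist.Normal(0.0, 1.0).log_prob(z)            # log p_Z(f(x))
log_abs_det = -torch.log(torch.tensor(sigma))         # log |df/dx| = log(1 / sigma)

log_px = log_pz + log_abs_det                         # change of variables
print(torch.isclose(log_px, dist.Normal(mu, sigma).log_prob(x)))  # tensor(True)

The hand-computed log-density agrees with the closed-form Gaussian log-density, which is exactly what the change of variables formula promises.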
Complex transformations $f$ are typically constructed by composing simpler invertible functions, often called bijectors or coupling layers: $f = f_L \circ \cdots \circ f_2 \circ f_1$. If each $f_i$ is invertible and has a tractable Jacobian determinant, the composite function $f$ inherits these properties.
The Jacobian of the composite function $f$ is the product of the Jacobians of the individual layers:

$$\frac{\partial f(x)}{\partial x^T} = \frac{\partial f_L(z_{L-1})}{\partial z_{L-1}^T} \cdots \frac{\partial f_2(z_1)}{\partial z_1^T} \, \frac{\partial f_1(x)}{\partial x^T}$$

where $z_i = f_i(z_{i-1})$ and $z_0 = x$.
Due to the property $\det(AB) = \det(A)\det(B)$, the log-determinant of the overall Jacobian becomes a sum:

$$\log \left| \det\left( \frac{\partial f(x)}{\partial x^T} \right) \right| = \sum_{i=1}^{L} \log \left| \det\left( \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}^T} \right) \right|$$

This compositionality allows us to build expressive transformations from simpler blocks while maintaining computational tractability.
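As a quick check of this additivity, the sketch below composes two of PyTorch's built-in affine bijectors and confirms that the composite log-determinant equals the sum over the layers (the particular transforms and input are arbitrary demonstration values):

import torch
import torch.distributions as dist

# Two simple invertible maps and their composition f = f2(f1(x))
f1 = dist.AffineTransform(loc=0.0, scale=2.0)   # z1 = 2 * x
f2 = dist.AffineTransform(loc=1.0, scale=0.5)   # z2 = 0.5 * z1 + 1
f = dist.ComposeTransform([f1, f2])

x = torch.tensor(3.0)
z1 = f1(x)
lhs = f.log_abs_det_jacobian(x, f(x))                                       # composite log |det J|
rhs = f1.log_abs_det_jacobian(x, z1) + f2.log_abs_det_jacobian(z1, f2(z1))  # sum of layer terms
print(torch.isclose(lhs, rhs))  # tensor(True)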
Conceptually, the picture is as follows: data space $\mathcal{X}$ is transformed into a simple latent space $\mathcal{Z}$ (e.g., Gaussian) via a sequence of invertible functions $f_1, \dots, f_L$. The inverse transformation $f^{-1}$ allows sampling from $p_X(x)$ by drawing $z \sim p_Z(z)$ and computing $x = f^{-1}(z)$.
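The sampling direction can be illustrated with the same kind of simple affine bijector used earlier (an assumed toy setup, not code from this page): draw $z$ from the base Gaussian and push it through the inverse transform.

import torch
import torch.distributions as dist

# f(x) = (x - mu) / sigma, written as an affine transform
mu, sigma = 2.0, 1.5
f = dist.AffineTransform(loc=-mu / sigma, scale=1.0 / sigma)

z = dist.Normal(0.0, 1.0).sample((10000,))   # z ~ p_Z(z)
x = f.inv(z)                                 # x = f^{-1}(z) = sigma * z + mu
print(x.mean().item(), x.std().item())       # close to mu = 2.0 and sigma = 1.5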
Designing effective bijectors is central to normalizing flows. A popular and successful approach involves coupling layers.
Real Non-Volume Preserving (RealNVP) flows, building on NICE (Non-linear Independent Components Estimation), use a clever masking strategy. They split the input vector $x$ into two parts, $x_1$ and $x_2$. The forward transformation leaves $x_1$ unchanged and applies an affine map to $x_2$ whose parameters depend only on $x_1$:

$$z_1 = x_1$$
$$z_2 = x_2 \odot \exp(s(x_1)) + t(x_1)$$

where $s(\cdot)$ and $t(\cdot)$ are scale and translation functions (typically neural networks) and $\odot$ denotes element-wise multiplication.
The inverse transformation is also simple:
$$x_1 = z_1$$
$$x_2 = (z_2 - t(z_1)) \odot \exp(-s(z_1))$$

The Jacobian of this transformation is lower triangular (or upper triangular, depending on which part is transformed):
$$\frac{\partial z}{\partial x^T} = \begin{pmatrix} I & 0 \\ \frac{\partial z_2}{\partial x_1^T} & \frac{\partial z_2}{\partial x_2^T} \end{pmatrix}$$

Crucially, $\frac{\partial z_2}{\partial x_2^T}$ is a diagonal matrix with diagonal elements $\exp(s(x_1))$. The determinant is therefore simply the product of the diagonal elements: $\prod_j \exp(s(x_1))_j = \exp\left(\sum_j s(x_1)_j\right)$. The log-determinant is just the sum of the outputs of the scale network: $\sum_j s(x_1)_j$.
This structure ensures both easy inversion and efficient computation of the log-determinant. To ensure all dimensions are transformed, consecutive coupling layers typically swap the roles of x1 and x2 or use different masks.
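The structure of this Jacobian is easy to verify numerically. The sketch below uses a toy coupling transform with made-up stand-ins for the scale and translation networks and checks, via autograd, that the exact log-determinant equals the sum of the scale outputs:

import torch

D, d = 4, 2  # total dimensions, dimensions left unchanged

def coupling_forward(x):
    x1, x2 = x[:d], x[d:]
    s = torch.tanh(x1.sum()) * torch.ones(D - d)   # stand-in for the scale net s(x1)
    t = x1.mean() * torch.ones(D - d)              # stand-in for the shift net t(x1)
    return torch.cat([x1, x2 * torch.exp(s) + t])

x = torch.randn(D)
J = torch.autograd.functional.jacobian(coupling_forward, x)   # exact (4, 4) Jacobian
s = torch.tanh(x[:d].sum()) * torch.ones(D - d)
print(torch.allclose(torch.logdet(J), s.sum()))  # True: log |det J| = sum of s(x1)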
Another significant class is autoregressive flows. In these models, each dimension $z_i$ of the latent variable is conditioned only on the previous dimensions $z_{1:i-1}$ (or $x_{1:i-1}$, depending on the direction).
They often use masked neural networks to enforce the autoregressive property.
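Here is a minimal sketch of that masking idea, assuming a single masked linear layer with a strictly lower-triangular mask; real models such as MADE, MAF, or IAF stack several masked layers with carefully chosen connectivity:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Linear):
    """A linear layer whose weights are masked so that output i depends
    only on inputs with index < i (a simplified autoregressive mask)."""
    def __init__(self, dim):
        super().__init__(dim, dim)
        self.register_buffer('mask', torch.tril(torch.ones(dim, dim), diagonal=-1))

    def forward(self, x):
        return F.linear(x, self.weight * self.mask, self.bias)

layer = MaskedLinear(4)
x = torch.randn(1, 4, requires_grad=True)
y = layer(x)
# Output 0 depends on no inputs, so its gradient with respect to x is all zeros
print(torch.autograd.grad(y[0, 0], x)[0])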
PyTorch's torch.distributions module provides excellent tools for building normalizing flows. Key components include:

- torch.distributions.Distribution: the base class for probability distributions (like Normal).
- torch.distributions.Transform: the base class for invertible transformations. Implementations like AffineTransform, ExpTransform, etc., are available, and you often define custom transforms inheriting from this.
- torch.distributions.TransformedDistribution: creates a new distribution by applying a sequence of Transform objects to a base distribution. It automatically handles the change of variables calculation.
- torch.distributions.constraints: used to define the support of distributions and validity checks for transform parameters.

Let's sketch out how you might define a simple flow using coupling layers. You'd typically define a CouplingLayer class inheriting from torch.distributions.Transform (and from nn.Module, so that its networks' parameters and buffers are registered for training).
import torch
import torch.nn as nn
import torch.distributions as dist
class CouplingLayer(dist.Transform, nn.Module):
    def __init__(self, input_dim, hidden_dim, mask, base_transform_type='affine'):
        # Initialize both parent classes: Transform provides the bijector
        # interface, nn.Module registers parameters and buffers for training.
        nn.Module.__init__(self)
        dist.Transform.__init__(self)
self.input_dim = input_dim
# Ensure mask is a binary tensor (0s and 1s)
self.register_buffer('mask', mask)
# Define the network 's_t_network' that computes scale and translation params
self.s_t_network = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, input_dim * 2) # Output scale and shift for all dims
)
self.bijective = True
# For affine: domain = real, codomain = real
self.domain = dist.constraints.real_vector
self.codomain = dist.constraints.real_vector
# Optional: Use built-in transforms like AffineTransform
# if base_transform_type == 'affine' ... (implementation detail)
def _call(self, x):
""" Applies the transform: x -> z """
x_masked = x * self.mask
s_t_params = self.s_t_network(x_masked)
# Split the output into scale (s) and translation (t)
# Ensure scale is positive, e.g., using tanh for stability + scaling
s = torch.tanh(s_t_params[..., :self.input_dim])
t = s_t_params[..., self.input_dim:]
# Apply transformation only to unmasked elements
# z = x_masked + (x_unmasked * exp(s) + t) * (1 - mask)
z = self.mask * x + (1 - self.mask) * (x * torch.exp(s) + t)
return z
def _inverse(self, z):
""" Applies the inverse transform: z -> x """
z_masked = z * self.mask
s_t_params = self.s_t_network(z_masked)
s = torch.tanh(s_t_params[..., :self.input_dim])
t = s_t_params[..., self.input_dim:]
# Apply inverse transformation only to unmasked elements
# x = z_masked + ((z_unmasked - t) * exp(-s)) * (1 - mask)
x = self.mask * z + (1 - self.mask) * ((z - t) * torch.exp(-s))
return x
def log_abs_det_jacobian(self, x, z):
""" Computes log |det J(x)| """
x_masked = x * self.mask
s_t_params = self.s_t_network(x_masked)
s = torch.tanh(s_t_params[..., :self.input_dim])
# Log determinant is the sum of 's' for the transformed dimensions
log_det_jacobian = (1 - self.mask) * s
# Sum over the transformed dimensions
return log_det_jacobian.sum(-1)
# Example Usage:
input_dim = 10
hidden_dim = 64
num_flows = 5
# Define base distribution (standard Normal); Independent(..., 1) treats the
# last dimension as a single event so log_prob returns one value per sample
base_dist = dist.Independent(dist.Normal(torch.zeros(input_dim), torch.ones(input_dim)), 1)
# Create masks, alternating which half stays fixed from layer to layer so that
# every dimension is transformed by some layer (more sophisticated masking
# strategies also exist)
masks = []
for i in range(num_flows):
    mask = torch.zeros(input_dim)
    mask[i % 2::2] = 1  # even-indexed dims for even layers, odd-indexed for odd layers
    masks.append(mask)
# Create a sequence of transforms (coupling layers)
transforms = []
for i in range(num_flows):
    transforms.append(CouplingLayer(input_dim, hidden_dim, masks[i]))
# Optional: Add permutation/activation normalization layers between flows
# Build the transformed distribution
flow_dist = dist.TransformedDistribution(base_dist, transforms)
# --- Training ---
# Gather trainable parameters; wrapping the coupling layers in an nn.ModuleList
# exposes all of their networks' parameters to the optimizer.
# flow_modules = nn.ModuleList(transforms)
# optimizer = torch.optim.Adam(flow_modules.parameters(), lr=1e-4)
#
# for data_batch in dataloader:
# optimizer.zero_grad()
#
# # Calculate log probability of the data batch
# log_prob = flow_dist.log_prob(data_batch) # Shape: [batch_size]
#
# # Maximize log-likelihood -> Minimize negative log-likelihood
# loss = -log_prob.mean()
#
# loss.backward()
# optimizer.step()
# --- Sampling ---
# n_samples = 64
# samples = flow_dist.sample(torch.Size([n_samples])) # Shape: [n_samples, input_dim]
Advantages:

- Exact log-likelihood computation, unlike VAEs (which optimize a lower bound) or GANs (which provide no likelihood at all).
- A single invertible model supports both density estimation (via $f$) and sampling (via $f^{-1}$).
- Training by straightforward maximum likelihood tends to be stable compared to adversarial training.

Considerations:

- Every layer must be invertible with a tractable Jacobian determinant, which restricts the architectures you can use.
- The latent space must have the same dimensionality as the data, so there is no built-in dimensionality reduction.
- Many coupling or autoregressive layers may be needed to model complex data, increasing memory and compute costs.
Normalizing flows have found applications in various domains, including density estimation and anomaly detection, image generation (e.g., RealNVP and Glow), enriching approximate posteriors in variational inference, and audio synthesis models such as WaveGlow.
In summary, normalizing flows provide a mathematically elegant and powerful framework for generative modeling and density estimation. By leveraging invertible transformations and the change of variables formula, they allow for exact likelihood computation, offering unique advantages for specific probabilistic modeling tasks within the advanced deep learning toolkit.