Constructing a normalizing flow involves assembling architectural components like invertible layers and their Jacobian determinants into a complete training pipeline. This process includes building a PyTorch model, implementing the exact maximum likelihood objective, handling discrete data via dequantization, and generating synthetic samples through the inverse pass.
A standard normalizing flow model acts as a container for a sequence of invertible transformations. It requires a defined base distribution, which is usually a simple continuous distribution like an isotropic Gaussian.
When evaluating the model during training, we pass data through the sequence of layers. Each layer applies its transformation and returns the result along with the log-determinant of its Jacobian matrix. By accumulating these log-determinants across all layers, we obtain the total log change in volume induced by the entire network.
Here is a typical implementation of a container class for normalizing flows:
import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

class NormalizingFlow(nn.Module):
    def __init__(self, layers, num_features):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        # Define a standard normal base distribution
        self.register_buffer('loc', torch.zeros(num_features))
        self.register_buffer('cov', torch.eye(num_features))

    @property
    def base_distribution(self):
        return MultivariateNormal(self.loc, self.cov)

    def forward(self, x):
        # Accumulate the log-determinant of the Jacobian across all layers
        log_det_jacobian = torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            x, ldj = layer(x)
            log_det_jacobian += ldj
        return x, log_det_jacobian

    def inverse(self, z):
        # Pass samples backward through the network for generation
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z
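The container above only assumes that each layer returns its output together with a per-sample log-determinant and exposes an inverse method. To make that interface concrete, here is a minimal affine coupling layer in the spirit of RealNVP; the class name, masking scheme, and network sizes are assumptions made for this sketch rather than part of the model above, and it assumes an even number of features.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal sketch of an affine coupling layer: transforms one half of the
    features conditioned on the other half, so the Jacobian is triangular."""
    def __init__(self, num_features, hidden=64, flip=False):
        super().__init__()
        self.flip = flip  # alternate which half is transformed across layers
        self.d = num_features // 2
        # Small network predicting a log-scale and shift for the transformed half
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * (num_features - self.d)),
        )

    def _split(self, x):
        a, b = x[:, :self.d], x[:, self.d:]
        return (b, a) if self.flip else (a, b)

    def _merge(self, a, b):
        return torch.cat([b, a], dim=1) if self.flip else torch.cat([a, b], dim=1)

    def forward(self, x):
        a, b = self._split(x)                      # a conditions, b is transformed
        log_scale, shift = self.net(a).chunk(2, dim=1)
        log_scale = torch.tanh(log_scale)          # keep scales in a stable range
        y = b * log_scale.exp() + shift
        # Log-determinant of a triangular Jacobian: sum of the log scales
        return self._merge(a, y), log_scale.sum(dim=1)

    def inverse(self, z):
        a, y = self._split(z)
        log_scale, shift = self.net(a).chunk(2, dim=1)
        log_scale = torch.tanh(log_scale)
        b = (y - shift) * (-log_scale).exp()
        return self._merge(a, b)

Any layer with this forward/inverse contract can be dropped into the NormalizingFlow container.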
Normalizing flows are designed to model continuous probability density functions. Applying them directly to discrete data, such as digital image pixels which take integer values from 0 to 255, lets the model place arbitrarily tall density spikes on those specific integer points, driving the likelihood toward infinity without learning a meaningful density. This degenerates the training process.
To resolve this, we apply dequantization. Uniform dequantization adds continuous uniform noise to the discrete input data, effectively spreading the probability mass of each discrete value across a continuous unit interval.
def uniform_dequantize(x):
    """
    Adds continuous uniform noise to discrete data points.
    Assumes x contains discrete integer values.
    """
    noise = torch.rand_like(x)
    return x + noise
For more advanced applications, variational dequantization replaces the fixed uniform noise with a noise distribution learned by an auxiliary neural network, but uniform noise is the standard choice and works well for most entry-level flow models.
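To make the idea concrete, here is a minimal, hypothetical sketch of variational dequantization. The VariationalDequantizer class, its sigmoid-squashed Gaussian noise parameterization, and the hidden size are illustrative assumptions, not part of the pipeline built in this section; the returned log_q term would be subtracted from the flow's log-likelihood in the loss.

import math
import torch
import torch.nn as nn

class VariationalDequantizer(nn.Module):
    """Illustrative sketch: learns a conditional noise distribution q(u | x)
    by squashing a Gaussian through a sigmoid so that u lies in (0, 1)."""
    def __init__(self, num_features, hidden=64):
        super().__init__()
        # Small network predicting the mean and log-std of the noise Gaussian
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * num_features),
        )

    def forward(self, x):
        mean, log_std = self.net(x).chunk(2, dim=-1)
        eps = torch.randn_like(mean)
        g = mean + eps * log_std.exp()      # Gaussian sample
        u = torch.sigmoid(g)                # squash into (0, 1)
        # log q(u | x) = log N(g; mean, std) - log |du/dg|, with du/dg = u(1 - u)
        log_q = -0.5 * (eps ** 2 + math.log(2 * math.pi)) - log_std
        log_q = log_q - torch.log(u * (1 - u) + 1e-9)
        # The objective maximizes log p(x + u) - log q(u | x), i.e. subtract
        # log_q.sum(dim=-1) from the flow's log-likelihood during training.
        return x + u, log_q.sum(dim=-1)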
Training a normalizing flow relies on exact maximum likelihood estimation. We aim to maximize the probability of our training data under the learned distribution. In practice, optimization algorithms minimize an objective, so we minimize the negative log-likelihood.
The exact log-likelihood of a data point $x$ mapped to a base variable $z = f(x)$ by an invertible transformation $f$ follows from the change-of-variables formula:

$$\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|$$
Translating this equation into PyTorch involves evaluating the base distribution's density at the transformed output and adding the accumulated log-determinant of the Jacobian.
def compute_loss(model, x):
    # 1. Apply uniform dequantization to the discrete input
    x_continuous = uniform_dequantize(x)
    # 2. Pass data through the forward transformations
    z, log_det_jacobian = model(x_continuous)
    # 3. Evaluate the log probability of z under the base distribution
    log_prob_z = model.base_distribution.log_prob(z)
    # 4. Compute the exact log-likelihood
    log_likelihood = log_prob_z + log_det_jacobian
    # 5. Return the mean negative log-likelihood across the batch
    return -log_likelihood.mean()
With the architecture, dequantization, and loss function ready, the training loop follows standard PyTorch patterns. Because flows optimize an explicit, tractable likelihood, we do not need to juggle multiple competing networks as in Generative Adversarial Networks. We simply pass batches of data, compute the negative log-likelihood, and backpropagate to update the weights of our coupling layers.
import torch.optim as optim

def train_flow(model, dataloader, epochs=50, lr=1e-3):
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0.0
        for batch in dataloader:
            # Assuming batch is a tuple where the first element is the data
            x = batch[0]
            optimizer.zero_grad()
            loss = compute_loss(model, x)
            loss.backward()
            # Gradient clipping is recommended for flow models
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss / len(dataloader):.4f}")
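For orientation, a minimal end-to-end usage sketch could look like the following. The two-dimensional toy data, batch size, and the AffineCoupling layer sketched earlier are illustrative assumptions rather than part of the original pipeline.

from torch.utils.data import DataLoader, TensorDataset

# Toy "discrete" 2D dataset drawn from two integer-valued clusters (illustrative only)
data = torch.cat([
    torch.randint(0, 10, (500, 2)),
    torch.randint(20, 30, (500, 2)),
]).float()
loader = DataLoader(TensorDataset(data), batch_size=128, shuffle=True)

# Stack a few coupling layers, alternating which half of the features is transformed
layers = [AffineCoupling(num_features=2, flip=(i % 2 == 1)) for i in range(4)]
flow = NormalizingFlow(layers, num_features=2)

train_flow(flow, loader, epochs=20, lr=1e-3)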
The data processing pathways for training the normalizing flow via density estimation versus generating new samples via the inverse transformations.
Once the training loop converges, the parameters of the flow transformations have been optimized to stretch and squish the base normal distribution into the shape of your training data.
To evaluate the model's generative capabilities, we execute the inverse pass. We sample noise from the standard normal base distribution and push it backward through the layers.
@torch.no_grad()
def generate_samples(model, num_samples, num_features):
    model.eval()
    # Draw random noise from the standard normal distribution
    z = torch.randn(num_samples, num_features, device=next(model.parameters()).device)
    # Pass the noise through the inverse transformations
    generated_data = model.inverse(z)
    return generated_data
The success of your normalizing flow is evident when the generated samples closely match the statistical properties and visual distribution of your original training dataset. The transformations have learned to map low-density regions of the normal distribution to the sparse areas of your data space, and high-density regions to the concentrated areas.
A 2D visualization of synthetic data generated from a trained normalizing flow after sampling from a Gaussian base distribution and applying the inverse transformations.
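As a quick quantitative complement to the visualization, you can compare simple batch statistics of the training data and the generated samples. The snippet below assumes the flow model and data tensor from the toy usage sketch above.

# Compare per-feature statistics of real and generated samples
samples = generate_samples(flow, num_samples=1000, num_features=2)

print("Real mean:      ", data.mean(dim=0))
print("Generated mean: ", samples.mean(dim=0))
print("Real std:       ", data.std(dim=0))
print("Generated std:  ", samples.std(dim=0))

Keep in mind that the flow models the dequantized data, so generated values are continuous and their means sit roughly 0.5 above the raw integer means; flooring the samples recovers discrete values.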