Constructing a normalizing flow involves assembling architectural components like invertible layers and their Jacobian determinants into a complete training pipeline. This process includes building a PyTorch model, implementing the exact maximum likelihood objective, handling discrete data via dequantization, and generating synthetic samples through the inverse pass.
A standard normalizing flow model acts as a container for a sequence of invertible transformations. It requires a defined base distribution, which is usually a simple continuous distribution like an isotropic Gaussian.
When evaluating the model during training, we pass data through the sequence of layers. Each layer applies its transformation and returns the result along with the log-determinant of its Jacobian matrix. By accumulating these log-determinants across all layers, we obtain the total log change in volume induced by the entire network.
Here is a typical implementation of a container class for normalizing flows:
import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

class NormalizingFlow(nn.Module):
    def __init__(self, layers, num_features):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        # Define a standard normal base distribution
        self.register_buffer('loc', torch.zeros(num_features))
        self.register_buffer('cov', torch.eye(num_features))

    @property
    def base_distribution(self):
        return MultivariateNormal(self.loc, self.cov)

    def forward(self, x):
        # Accumulate the log-determinant of the Jacobian across all layers
        log_det_jacobian = torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            x, ldj = layer(x)
            log_det_jacobian += ldj
        return x, log_det_jacobian

    def inverse(self, z):
        # Pass samples backward through the network for generation
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z
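The container above only assumes that each layer returns its output together with a per-sample log-determinant and exposes an inverse method. To make that interface concrete, here is a minimal affine coupling layer in the spirit of RealNVP; the class name, masking scheme, and network sizes are assumptions made for this sketch rather than part of the model above, and it assumes an even number of features.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal sketch of an affine coupling layer: transforms one half of the
    features conditioned on the other half, so the Jacobian is triangular."""
    def __init__(self, num_features, hidden=64, flip=False):
        super().__init__()
        self.flip = flip  # alternate which half is transformed across layers
        self.d = num_features // 2
        # Small network predicting a log-scale and shift for the transformed half
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * (num_features - self.d)),
        )

    def _split(self, x):
        a, b = x[:, :self.d], x[:, self.d:]
        return (b, a) if self.flip else (a, b)

    def _merge(self, a, b):
        return torch.cat([b, a], dim=1) if self.flip else torch.cat([a, b], dim=1)

    def forward(self, x):
        a, b = self._split(x)                      # a conditions, b is transformed
        log_scale, shift = self.net(a).chunk(2, dim=1)
        log_scale = torch.tanh(log_scale)          # keep scales in a stable range
        y = b * log_scale.exp() + shift
        # Log-determinant of a triangular Jacobian: sum of the log scales
        return self._merge(a, y), log_scale.sum(dim=1)

    def inverse(self, z):
        a, y = self._split(z)
        log_scale, shift = self.net(a).chunk(2, dim=1)
        log_scale = torch.tanh(log_scale)
        b = (y - shift) * (-log_scale).exp()
        return self._merge(a, b)

Any layer with this forward/inverse contract can be dropped into the NormalizingFlow container.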
Normalizing flows are designed to model continuous probability density functions. Applying them directly to discrete data, such as digital image pixels which take integer values from 0 to 255, lets the model place arbitrarily tall density spikes on those specific integer points, driving the likelihood toward infinity without learning a meaningful density. This degenerates the training process.
To resolve this, we apply dequantization. Uniform dequantization adds continuous uniform noise to the discrete input data, effectively spreading the probability mass of each discrete value across a continuous unit interval.
def uniform_dequantize(x):
    """
    Adds continuous uniform noise to discrete data points.
    Assumes x contains discrete integer values.
    """
    noise = torch.rand_like(x)
    return x + noise
For more advanced applications, variational dequantization replaces the fixed uniform noise with a noise distribution learned by an auxiliary neural network, but uniform noise is the standard choice and works well for most entry-level flow models.
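To make the idea concrete, here is a minimal, hypothetical sketch of variational dequantization. The VariationalDequantizer class, its sigmoid-squashed Gaussian noise parameterization, and the hidden size are illustrative assumptions, not part of the pipeline built in this section; the returned log_q term would be subtracted from the flow's log-likelihood in the loss.

import math
import torch
import torch.nn as nn

class VariationalDequantizer(nn.Module):
    """Illustrative sketch: learns a conditional noise distribution q(u | x)
    by squashing a Gaussian through a sigmoid so that u lies in (0, 1)."""
    def __init__(self, num_features, hidden=64):
        super().__init__()
        # Small network predicting the mean and log-std of the noise Gaussian
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * num_features),
        )

    def forward(self, x):
        mean, log_std = self.net(x).chunk(2, dim=-1)
        eps = torch.randn_like(mean)
        g = mean + eps * log_std.exp()      # Gaussian sample
        u = torch.sigmoid(g)                # squash into (0, 1)
        # log q(u | x) = log N(g; mean, std) - log |du/dg|, with du/dg = u(1 - u)
        log_q = -0.5 * (eps ** 2 + math.log(2 * math.pi)) - log_std
        log_q = log_q - torch.log(u * (1 - u) + 1e-9)
        # The objective maximizes log p(x + u) - log q(u | x), i.e. subtract
        # log_q.sum(dim=-1) from the flow's log-likelihood during training.
        return x + u, log_q.sum(dim=-1)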
Training a normalizing flow relies on exact maximum likelihood estimation. We aim to maximize the probability of our training data under the learned distribution. In practice, optimization algorithms minimize an objective, so we minimize the negative log-likelihood.
The exact log-likelihood of a data point $x$ mapped to a base variable $z = f(x)$ by an invertible transformation $f$ follows from the change-of-variables formula:

$$\log p_X(x) = \log p_Z\bigl(f(x)\bigr) + \log \left| \det \frac{\partial f(x)}{\partial x} \right|$$
Translating this equation into PyTorch involves evaluating the base distribution's density at the transformed output and adding the accumulated log-determinant of the Jacobian.
def compute_loss(model, x):
    # 1. Apply uniform dequantization to the discrete input
    x_continuous = uniform_dequantize(x)
    # 2. Pass data through the forward transformations
    z, log_det_jacobian = model(x_continuous)
    # 3. Evaluate the log probability of z under the base distribution
    log_prob_z = model.base_distribution.log_prob(z)
    # 4. Compute the exact log-likelihood
    log_likelihood = log_prob_z + log_det_jacobian
    # 5. Return the mean negative log-likelihood across the batch
    return -log_likelihood.mean()
With the architecture, dequantization, and loss function ready, the training loop follows standard PyTorch patterns. Because flows optimize an explicit, tractable likelihood, we do not need to juggle multiple competing networks as in Generative Adversarial Networks. We simply pass batches of data, compute the negative log-likelihood, and backpropagate to update the weights of our coupling layers.
import torch.optim as optim

def train_flow(model, dataloader, epochs=50, lr=1e-3):
    optimizer = optim.Adam(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0.0
        for batch in dataloader:
            # Assuming batch is a tuple where the first element is the data
            x = batch[0]
            optimizer.zero_grad()
            loss = compute_loss(model, x)
            loss.backward()
            # Gradient clipping is recommended for flow models
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss / len(dataloader):.4f}")
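For orientation, a minimal end-to-end usage sketch could look like the following. The two-dimensional toy data, batch size, and the AffineCoupling layer sketched earlier are illustrative assumptions rather than part of the original pipeline.

from torch.utils.data import DataLoader, TensorDataset

# Toy "discrete" 2D dataset drawn from two integer-valued clusters (illustrative only)
data = torch.cat([
    torch.randint(0, 10, (500, 2)),
    torch.randint(20, 30, (500, 2)),
]).float()
loader = DataLoader(TensorDataset(data), batch_size=128, shuffle=True)

# Stack a few coupling layers, alternating which half of the features is transformed
layers = [AffineCoupling(num_features=2, flip=(i % 2 == 1)) for i in range(4)]
flow = NormalizingFlow(layers, num_features=2)

train_flow(flow, loader, epochs=20, lr=1e-3)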
The data processing pathways for training the normalizing flow via density estimation versus generating new samples via the inverse transformations.
Once the training loop converges, the parameters of the flow transformations have been optimized to stretch and squish the base normal distribution into the shape of your training data.
To evaluate the model's generative capabilities, we execute the inverse pass. We sample noise from the standard normal base distribution and push it backward through the layers.
@torch.no_grad()
def generate_samples(model, num_samples, num_features):
    model.eval()
    # Draw random noise from the standard normal distribution
    z = torch.randn(num_samples, num_features, device=next(model.parameters()).device)
    # Pass the noise through the inverse transformations
    generated_data = model.inverse(z)
    return generated_data
The success of your normalizing flow is evident when the generated samples closely match the statistical properties and visual distribution of your original training dataset. The transformations have learned to map low-density regions of the normal distribution to the sparse areas of your data space, and high-density regions to the concentrated areas.
A 2D visualization of synthetic data generated from a trained normalizing flow after sampling from a Gaussian base distribution and applying the inverse transformations.
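As a quick quantitative complement to the visualization, you can compare simple batch statistics of the training data and the generated samples. The snippet below assumes the flow model and data tensor from the toy usage sketch above.

# Compare per-feature statistics of real and generated samples
samples = generate_samples(flow, num_samples=1000, num_features=2)

print("Real mean:      ", data.mean(dim=0))
print("Generated mean: ", samples.mean(dim=0))
print("Real std:       ", data.std(dim=0))
print("Generated std:  ", samples.std(dim=0))

Keep in mind that the flow models the dequantized data, so generated values are continuous and their means sit roughly 0.5 above the raw integer means; flooring the samples recovers discrete values.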