Every normalizing flow requires a starting point. Within the mathematical framework of generative modeling, this starting point is the base distribution, often denoted $p_0(\mathbf{z}_0)$. When a random variable is sampled from this distribution and passed through a sequence of invertible transformations, the objective is to morph it into a highly complex target distribution that matches the training data. Selecting an appropriate base distribution is an important design decision because it determines the topology the transformations start from and the computational efficiency of the entire training process.
Forward pass mapping a base distribution to a complex target distribution through a sequence of intermediate transformations.
A good base distribution must satisfy several mathematical and practical requirements to function properly within a normalizing flow framework.
First, we must be able to evaluate its probability density function exactly and efficiently. During training, the exact maximum likelihood objective relies on computing the log-probability of the base distribution. If evaluating this density is computationally expensive, the entire training cycle will slow down.
Second, it must be easy to sample from. Generative modeling ultimately requires drawing fresh samples from the base and passing them through the flow to produce synthetic data. Fast sampling at the base level translates directly to fast generation times.
Finally, the log-density must be differentiable with respect to its inputs. The training process relies on backpropagation to optimize the parameters of the flow transformations. We need smooth gradients flowing all the way back through the log-probability calculation of the base distribution.
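As a quick sketch of this last requirement (using PyTorch's torch.distributions, which the example later in this section also relies on), the snippet below backpropagates through the log-density of a standard Gaussian; for this distribution the gradient with respect to the input is simply the negated input.
import torch
from torch.distributions import MultivariateNormal
base = MultivariateNormal(torch.zeros(2), torch.eye(2))
# Stand-in for the outputs of the inverse flow applied to a small batch of data
z = torch.randn(3, 2, requires_grad=True)
log_p = base.log_prob(z).sum()
log_p.backward()
print(z.grad)  # equals -z for a standard Gaussian, confirming smooth gradients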
The primary choice for a base distribution in most flow architectures is the standard multivariate normal distribution, also known as the isotropic Gaussian. It satisfies all the required computational properties and provides a well-understood mathematical base for the transformations.
For a $D$-dimensional space, the isotropic Gaussian is defined as:

$$p_0(\mathbf{z}_0) = \mathcal{N}(\mathbf{z}_0; \mathbf{0}, \mathbf{I}) = \frac{1}{(2\pi)^{D/2}} \exp\!\left(-\frac{1}{2}\,\mathbf{z}_0^\top \mathbf{z}_0\right)$$
Its log-density simplifies into a computationally efficient form that we evaluate directly in the loss function:

$$\log p_0(\mathbf{z}_0) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{D} z_{0,i}^2$$
Because the covariance matrix is the identity matrix, the individual dimensions of the sampled vectors are independent. This independence is highly advantageous. We start with completely uncorrelated noise, and we leave it entirely up to the trainable transformations in the flow model to build the complex correlations and dependencies observed in the target dataset.
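A short numerical check (a sketch, assuming the same PyTorch setup used later in this section) confirms both points: the closed-form log-density above matches MultivariateNormal.log_prob, and the joint log-density is simply the sum of the per-dimension univariate log-densities.
import math
import torch
from torch.distributions import MultivariateNormal, Normal
D = 2
base = MultivariateNormal(torch.zeros(D), torch.eye(D))
z = torch.randn(4, D)
# Closed-form log-density: -D/2 * log(2*pi) - 0.5 * ||z||^2
manual = -0.5 * D * math.log(2 * math.pi) - 0.5 * (z ** 2).sum(dim=1)
# Independence: the joint log-density is the sum of univariate standard-normal log-densities
per_dim = Normal(0.0, 1.0).log_prob(z).sum(dim=1)
print(torch.allclose(base.log_prob(z), manual))   # True
print(torch.allclose(base.log_prob(z), per_dim))  # True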
Density contour of a 2D isotropic Gaussian base distribution centered at the origin.
While the isotropic Gaussian is dominant, other continuous distributions are occasionally used depending on the characteristics of the data. If the target data exhibits heavy tails, meaning extreme values occur more frequently than a normal distribution would predict, a Logistic distribution or a Student's t-distribution might be selected. The Logistic distribution has thicker tails and an analytically tractable cumulative distribution function.
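As an illustrative sketch (an assumption on our part, not a prescription from this section), a heavy-tailed base can be built from PyTorch's StudentT distribution, wrapped in Independent so that log_prob sums over the event dimension exactly like the multivariate Gaussian does.
import torch
from torch.distributions import StudentT, Independent
D = 2
# Student's t with 3 degrees of freedom per dimension: noticeably heavier tails than a Gaussian
heavy_tailed_base = Independent(StudentT(df=3.0 * torch.ones(D)), 1)
z_0 = heavy_tailed_base.sample((5,))
print(heavy_tailed_base.log_prob(z_0))  # one log-density per sample, shape (5,)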
A uniform distribution is almost never used as the base distribution because its bounded support introduces problems. A normalizing flow is a diffeomorphism mapping one continuous, unconstrained space to another. If the base distribution is strictly bounded between 0 and 1, the transformations will struggle to map it to a target space that stretches to infinity without causing severe numerical instabilities in the Jacobian determinants.
Translating these mathematical requirements into code is straightforward using the built-in distribution classes in PyTorch. We typically use torch.distributions.MultivariateNormal or independent torch.distributions.Normal objects.
Here is how you define a standard isotropic Gaussian base distribution for a 2D flow model and compute the log-probabilities required for training:
import torch
from torch.distributions import MultivariateNormal
# Define the dimensionality of the data space
D = 2
# Mean vector (zeros) and covariance matrix (identity)
loc = torch.zeros(D)
covariance_matrix = torch.eye(D)
# Initialize the base distribution
base_distribution = MultivariateNormal(loc, covariance_matrix)
# Sample from the base distribution (e.g., a batch of 5 samples)
z_0 = base_distribution.sample((5,))
print("Samples z_0:\n", z_0)
# Compute the exact log probability of the samples
log_prob = base_distribution.log_prob(z_0)
print("Log probability:\n", log_prob)
By keeping the base distribution a simple, independent Gaussian, we isolate the computational complexity within the transformations themselves. This separation of concerns allows us to write clear code where the flow sequence acts directly on the standardized outputs of base_distribution.sample(), and the exact-likelihood training loop evaluates inverse-transformed data points with base_distribution.log_prob().
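To make the role of base_distribution.log_prob() in training concrete, here is a minimal sketch with a single hand-rolled element-wise affine transform, a hypothetical stand-in for a real flow layer rather than part of the original example. The data log-likelihood is the base log-density of the inverse-mapped point plus the log-absolute-determinant of the inverse Jacobian.
import torch
from torch.distributions import MultivariateNormal
D = 2
base_distribution = MultivariateNormal(torch.zeros(D), torch.eye(D))
# Hypothetical trainable parameters of a single element-wise affine "flow" layer
scale = torch.tensor([2.0, 0.5], requires_grad=True)
shift = torch.tensor([1.0, -1.0], requires_grad=True)
x = torch.randn(5, D)                    # stand-in batch of training data
z = (x - shift) / scale                  # inverse transform: data space -> base space
log_det = -torch.log(scale.abs()).sum()  # log|det dz/dx| for an element-wise affine map
log_px = base_distribution.log_prob(z) + log_det
loss = -log_px.mean()                    # negative log-likelihood objective
loss.backward()                          # gradients reach scale and shift through log_prob
print(loss.item(), scale.grad, shift.grad)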