Generative modeling requires a solid foundation in how we mathematically represent data. When building normalizing flows, we treat our dataset as a collection of samples drawn from an underlying probability distribution. To manipulate these distributions and build models capable of generating new data, we must first establish the rules governing continuous random variables.
A random variable assigns a numerical value to the outcome of a random process. In the context of this course, we focus exclusively on continuous random variables. Unlike discrete random variables, which take on specific isolated values, continuous variables can assume an infinite number of possible values within a given range. Because the space of possibilities is uncountably infinite, the probability of a continuous random variable taking on any one exact value is exactly zero.
Instead of absolute probabilities for single points, we describe continuous variables using a probability density function, or PDF. Let a continuous random variable be denoted as $X$. Its probability density function is written as $p_X(x)$. A valid PDF must satisfy two specific mathematical conditions. First, it must be non-negative everywhere, meaning $p_X(x) \geq 0$ for all $x$. Second, the total area under the density curve must integrate to exactly 1 over the entire domain:

$$\int_{-\infty}^{\infty} p_X(x)\,dx = 1$$
It is a common mistake to think of $p_X(x)$ as a direct probability. The value of a probability density function at a specific point can absolutely exceed 1. The density simply represents the relative likelihood of values falling within a small region around $x$.
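To see this concretely, here is a minimal sketch using the univariate Normal from torch.distributions: a narrow Gaussian has a density greater than 1 at its mean, even though the total area under its curve is still exactly 1.

import torch
from torch.distributions import Normal

# A narrow univariate Gaussian: mean 0, standard deviation 0.1
narrow = Normal(loc=torch.tensor(0.0), scale=torch.tensor(0.1))

# Density at the mean is 1 / (0.1 * sqrt(2 * pi)) ≈ 3.99, well above 1
density_at_mean = narrow.log_prob(torch.tensor(0.0)).exp()
print(density_at_mean)  # tensor(3.9894)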
Normalizing flows operate by moving data between two distinct mathematical spaces. The first is the latent space, which is populated by a base distribution. We represent a random variable in this latent space as $z$. The base distribution, denoted mathematically as $p_Z(z)$, is chosen to be intentionally simple. We need a distribution that is computationally cheap to sample from and has a known probability density function that we can evaluate exactly.
In almost all normalizing flow implementations, the standard multivariate Gaussian distribution serves as the base distribution. A standard Gaussian has a mean vector of zeros and an identity covariance matrix. We use it because its properties are highly predictable and its density can be computed efficiently even in thousands of dimensions.
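For reference, the density of a $d$-dimensional standard Gaussian has the closed form

$$p_Z(z) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{1}{2}\|z\|^2\right), \qquad \log p_Z(z) = -\frac{d}{2}\log(2\pi) - \frac{1}{2}\|z\|^2.$$

Evaluating it reduces to a squared norm plus a constant, which is why it remains cheap even in thousands of dimensions.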
The second space is the data space. This space contains the actual observable data we want our model to generate, such as the pixel values of an image or the spatial coordinates of a molecule. We denote a random variable in the data space as $x$, and its corresponding target distribution as $p_X(x)$.
Target distributions are inherently complex, highly correlated, and multimodal. The underlying structure of a dataset like human speech waveforms cannot be described by a simple Gaussian equation. The objective of a normalizing flow model is to learn an invertible mathematical transformation that reshapes the simple base distribution $p_Z(z)$ into the complex target distribution $p_X(x)$.
The left plot shows a simple unimodal base distribution evaluated in the latent space. The right plot represents a complex target distribution in the data space, characterized by multiple peaks and valleys.
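As a toy illustration of this idea, and not a trained flow, the sketch below pushes standard Gaussian samples through a hand-picked invertible affine map; the transformed samples land in a shifted, stretched, and correlated distribution rather than the original base distribution.

import torch
from torch.distributions import MultivariateNormal

# Base distribution: 2D standard Gaussian
base = MultivariateNormal(torch.zeros(2), torch.eye(2))
z = base.sample((1000,))

# Hand-picked invertible affine map x = A z + b (illustrative, not learned)
A = torch.tensor([[2.0, 0.0],
                  [1.5, 0.5]])  # determinant is 1.0, so the map is invertible
b = torch.tensor([1.0, -3.0])
x = z @ A.T + b

print("Sample mean:", x.mean(dim=0))           # close to b
print("Sample covariance:\n", torch.cov(x.T))  # close to A @ A.T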
To implement the base distribution in code, we rely on standard machine learning frameworks. PyTorch provides a comprehensive suite of statistical distributions in the torch.distributions module. Since most real datasets have multiple dimensions, we begin by defining a continuous multivariate normal distribution.
import torch
from torch.distributions import MultivariateNormal
# Define the dimensionality of the latent space
dimensions = 2
# Mean vector (zeros) and covariance matrix (identity)
loc = torch.zeros(dimensions)
covariance_matrix = torch.eye(dimensions)
# Initialize the base distribution p_Z(z)
base_distribution = MultivariateNormal(loc, covariance_matrix)
# Sample a batch of 5 latent variables
z_samples = base_distribution.sample((5,))
print("Samples z:\n", z_samples)
# Evaluate the log probability density of the samples
log_probs = base_distribution.log_prob(z_samples)
print("Log probabilities:\n", log_probs)
Notice that we use log_prob rather than calculating the raw probability density. In generative modeling, multiplying many small probability values together rapidly leads to numerical underflow, where the computer rounds tiny floating-point numbers to zero. By operating in logarithmic space, we convert multiplication into addition, which provides the numerical stability required to train deep neural networks. The value returned by this method is exactly $\log p_Z(z)$, a quantity we will use extensively when optimizing flow models via maximum likelihood estimation.
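A small sketch makes the failure mode concrete: multiplying the raw densities of a few hundred base-distribution samples underflows to exactly zero in single precision, while the sum of their log densities remains a well-behaved finite number.

import torch
from torch.distributions import MultivariateNormal

# Same 2D standard Gaussian base distribution as above
base = MultivariateNormal(torch.zeros(2), torch.eye(2))
batch = base.sample((500,))
batch_log_probs = base.log_prob(batch)

# Naive product of raw densities underflows to 0.0 in float32
print("Product of densities:", batch_log_probs.exp().prod())

# Sum of log densities stays finite and is a usable training signal
print("Sum of log densities:", batch_log_probs.sum())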