To generate complex data distributions from simple ones, normalizing flows rely on a specific mathematical rule. When we pass a random variable through a mathematical function, its probability density changes. The change of variables theorem tells us exactly how to calculate that new density. This theorem is the mathematical engine that drives all normalizing flow architectures.
We can build an understanding of this theorem by starting with a single dimension. Imagine a simple one-dimensional continuous random variable $z$. Let us assume $z$ follows a standard normal distribution, $p_z(z) = \mathcal{N}(z; 0, 1)$. We want to apply an invertible function $f$ to $z$ to create a new random variable $x$, giving the equation $x = f(z)$. Because $f$ is an invertible mapping, we can also map backward using the inverse function $z = f^{-1}(x)$.
What is the probability density of our new variable, denoted as $p_x(x)$? A common mistake is to assume that $p_x(x)$ is simply equal to the base density $p_z(z)$. However, probability is determined by the area under the density curve, and that area must always integrate to 1. When the function $f$ stretches or squishes the number line, the probability density must adjust accordingly to conserve total probability.
The conservation of probability dictates that the probability mass in a small region $dz$ around $z$ must equal the probability mass in the corresponding transformed region $dx$ around $x$:

$$p_x(x)\,|dx| = p_z(z)\,|dz|$$
By rearranging this equation, we get the change of variables formula for a single dimension:

$$p_x(x) = p_z(z)\left|\frac{dz}{dx}\right| = p_z\big(f^{-1}(x)\big)\left|\frac{d f^{-1}(x)}{dx}\right|$$
The term $\frac{d f^{-1}(x)}{dx}$ is the derivative of the inverse function with respect to $x$. We take the absolute value because probability densities must be non-negative. Without the absolute value, a function with a negative slope would produce a mathematically impossible negative probability density.
Flow of a random variable through an invertible transformation showing forward and inverse mappings.
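To make the one-dimensional formula concrete, here is a minimal PyTorch sketch. The specific affine map $x = 2z + 3$ is an illustrative assumption, chosen only because its inverse and the resulting density are known in closed form, which lets us check the formula directly:

```python
import torch
from torch.distributions import Normal

# Base distribution: z ~ N(0, 1)
base = Normal(loc=0.0, scale=1.0)

# An invertible map chosen purely for illustration: x = f(z) = 2z + 3
# Its inverse is f^{-1}(x) = (x - 3) / 2, with derivative d f^{-1}/dx = 1/2.
def f_inv(x):
    return (x - 3.0) / 2.0

x = torch.tensor(4.0)

# Change of variables: p_x(x) = p_z(f^{-1}(x)) * |d f^{-1}(x) / dx|
p_x = base.log_prob(f_inv(x)).exp() * abs(1.0 / 2.0)

# Check against the known result: x = 2z + 3 with z ~ N(0, 1) gives x ~ N(3, 2^2)
p_x_reference = Normal(loc=3.0, scale=2.0).log_prob(x).exp()

print(p_x.item(), p_x_reference.item())  # the two densities agree
```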
In machine learning, we rarely work with one-dimensional data. We deal with high-dimensional vectors representing images, audio, or text. If $\mathbf{x}$ and $\mathbf{z}$ are vectors in a $D$-dimensional space $\mathbb{R}^D$, the simple scalar derivative is no longer sufficient. We must evaluate how every dimension of $\mathbf{z}$ changes with respect to every dimension of $\mathbf{x}$. This multidimensional relationship is captured by a matrix of partial derivatives called the Jacobian matrix.
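Concretely, for the inverse mapping $\mathbf{z} = f^{-1}(\mathbf{x})$ in $D$ dimensions, the Jacobian collects all of these partial derivatives in a $D \times D$ matrix:

$$
\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial z_1}{\partial x_1} & \cdots & \frac{\partial z_1}{\partial x_D} \\
\vdots & \ddots & \vdots \\
\frac{\partial z_D}{\partial x_1} & \cdots & \frac{\partial z_D}{\partial x_D}
\end{bmatrix}
$$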
The change of variables theorem for multivariate distributions replaces the scalar derivative with the determinant of the Jacobian matrix:

$$p_x(\mathbf{x}) = p_z(\mathbf{z})\left|\det\!\left(\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right)\right|$$
We can also write this equation explicitly using the inverse function $f^{-1}$:

$$p_x(\mathbf{x}) = p_z\big(f^{-1}(\mathbf{x})\big)\left|\det\!\left(\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right)\right|$$
We can break down the components of this formula to understand how density estimation works in practice.
For this theorem to hold, the function $f$ must be a bijection. A bijection is a function that is both one-to-one and onto. This strict requirement guarantees two specific properties. First, every vector $\mathbf{z}$ maps to exactly one vector $\mathbf{x}$, and every $\mathbf{x}$ maps back to exactly one $\mathbf{z}$. Second, the input and output dimensions are identical, which ensures the Jacobian is a square matrix; determinants can only be computed for square matrices. If the function were not a bijection, probability mass could overlap or disappear.
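A small numerical sketch can verify the multivariate formula. The linear map $\mathbf{x} = A\mathbf{z}$ below is an illustrative choice (any invertible matrix $A$ would do), and the square Jacobian is computed with torch.autograd rather than by hand:

```python
import torch
from torch.distributions import MultivariateNormal

D = 2

# Base distribution: z ~ N(0, I) in D dimensions
base = MultivariateNormal(torch.zeros(D), torch.eye(D))

# An invertible linear map chosen purely for illustration: x = f(z) = A z
A = torch.tensor([[2.0, 0.5],
                  [0.0, 1.5]])

def f_inv(x):
    # z = f^{-1}(x) = A^{-1} x
    return torch.linalg.solve(A, x)

x = torch.tensor([1.0, -0.5])
z = f_inv(x)

# Jacobian of the inverse map at x: a square D x D matrix, as required
J_inv = torch.autograd.functional.jacobian(f_inv, x)

# p_x(x) = p_z(f^{-1}(x)) * |det J_{f^{-1}}(x)|
p_x = base.log_prob(z).exp() * torch.linalg.det(J_inv).abs()

# Check: x = A z with z ~ N(0, I) implies x ~ N(0, A A^T)
p_x_reference = MultivariateNormal(torch.zeros(D), A @ A.T).log_prob(x).exp()
print(p_x.item(), p_x_reference.item())  # the two densities agree
```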
Sometimes it is more convenient to define the change of variables using the forward transformation $f$ rather than the inverse. Because the Jacobian of $f^{-1}$ is the matrix inverse of the Jacobian of $f$, the determinant of the inverse Jacobian equals the reciprocal of the determinant of the forward Jacobian. This gives us an alternative representation:

$$p_x(\mathbf{x}) = p_z(\mathbf{z})\left|\det\!\left(\frac{\partial f(\mathbf{z})}{\partial \mathbf{z}}\right)\right|^{-1}$$
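As a quick check of this equivalence, the snippet below compares $|\det J_{f^{-1}}|$ with $1 / |\det J_f|$ for the same illustrative linear map used above:

```python
import torch

# The same illustrative linear map as before: x = f(z) = A z
A = torch.tensor([[2.0, 0.5],
                  [0.0, 1.5]])

# The forward Jacobian of f is A itself; the Jacobian of f^{-1} is A^{-1}
det_forward = torch.linalg.det(A).abs()                    # |det J_f|
det_inverse = torch.linalg.det(torch.linalg.inv(A)).abs()  # |det J_{f^{-1}}|

print(det_inverse.item(), (1.0 / det_forward).item())  # equal up to float error
```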
When training neural networks, multiplying many small probabilities together leads to numerical underflow. To maintain numerical stability, normalizing flows compute the log-likelihood of the data instead of the raw probability density. Taking the natural logarithm of both sides transforms the product into a sum:

$$\log p_x(\mathbf{x}) = \log p_z\big(f^{-1}(\mathbf{x})\big) + \log\left|\det\!\left(\frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}}\right)\right|$$
This logarithmic form is the exact objective function you will optimize when training flow models in PyTorch. The first term measures how well the model maps data into high-probability regions of the simple base distribution. The second term, the log-determinant, accounts for how the transformation expands or contracts volume; it prevents the model from trivially maximizing the first term by collapsing every input onto the highest-density point of the base distribution.
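The sketch below shows what that objective looks like for a toy one-dimensional affine flow. The AffineFlow1D module and its parameters are illustrative assumptions, not an API from any particular flow library; real flows stack many more expressive bijections:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class AffineFlow1D(nn.Module):
    """Toy one-dimensional flow: x = f(z) = scale * z + shift (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(1))  # parameterize scale > 0
        self.shift = nn.Parameter(torch.zeros(1))

    def log_prob(self, x, base):
        # z = f^{-1}(x) = (x - shift) / scale
        scale = self.log_scale.exp()
        z = (x - self.shift) / scale
        # log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1}(x) / dx|
        #            = log p_z(z) - log(scale)
        return base.log_prob(z) - self.log_scale

base = Normal(0.0, 1.0)
flow = AffineFlow1D()
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-2)

# Toy "data": samples from N(5, 2^2); the flow should learn shift ~ 5, scale ~ 2
data = 5.0 + 2.0 * torch.randn(1024, 1)

for step in range(500):
    optimizer.zero_grad()
    loss = -flow.log_prob(data, base).mean()  # negative log-likelihood
    loss.backward()
    optimizer.step()
```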