The mathematical foundation of mapping random variables through invertible functions relies on the change of variables theorem. A single transformation is rarely sufficient to model complex data distributions like images or audio. To build flexible generative models, multiple simple transformations can be chained together. This sequence of operations forms a normalizing flow.
A normalizing flow is formally defined as a sequence of invertible and differentiable mappings. Let $\mathbf{z}_0$ denote an initial random variable sampled from a simple, tractable base distribution $p_0(\mathbf{z}_0)$. Often, this base distribution is an isotropic Gaussian. We apply a sequence of $K$ invertible functions, denoted $f_1, f_2, \dots, f_K$. The sequence of transformations produces a final sample $\mathbf{z}_K$:

$$\mathbf{z}_k = f_k(\mathbf{z}_{k-1}), \qquad k = 1, \dots, K$$
Alternatively, using function composition notation, we write this generative path as:

$$\mathbf{z}_K = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0)$$
Let $\mathbf{x} = \mathbf{z}_K$ represent our final complex target variable, such as a generated image. The mapping from the simple base distribution $p_0(\mathbf{z}_0)$ to the complex distribution $p(\mathbf{x})$ constitutes the generative process.
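To make the generative path concrete, here is a minimal sketch in PyTorch that chains two toy invertible functions. The functions `f1` and `f2` and their parameters are illustrative choices for this example, not transformations prescribed by the text above.

```python
import torch

# A toy generative path z_0 -> x = f2(f1(z_0)) using two simple invertible maps.
def f1(z):
    # Elementwise affine map: invertible because the scale (2.0) is nonzero.
    return 2.0 * z + 1.0

def f2(z):
    # tanh is strictly monotonic, hence invertible onto its range (-1, 1).
    return torch.tanh(z)

z0 = torch.randn(5, 3)   # samples from the isotropic Gaussian base distribution
x = f2(f1(z0))           # generative path: x = (f2 ∘ f1)(z0)
print(x.shape)           # torch.Size([5, 3])
```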
The terminology reveals the mechanics of the model. The term "normalizing" indicates that if we invert the generative process, we map our complex data back through the inverse functions to evaluate its probability density under the simple "normal" base distribution. The term "flow" describes the path that a sample takes as it passes sequentially through the series of transformations.
A directed sequence of transformations in a normalizing flow mapping a simple base distribution to a complex target distribution.
To compute the probability density of our final variable $\mathbf{x}$, we apply the change of variables theorem repeatedly. The density of the final sample can be calculated by evaluating the base density at the corresponding latent $\mathbf{z}_0$ and multiplying it by the absolute value of the Jacobian determinant of each inverse transformation $f_k^{-1}$ in the sequence (equivalently, dividing by the forward Jacobian determinants).
Because working with products of probabilities usually leads to numerical underflow in software implementations, we compute the log-likelihood instead. Taking the logarithm converts the product of Jacobian determinants into a sum. Using the forward transformations, the log-probability of a data point $\mathbf{x}$ is defined as:

$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}} \right|$$
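As a quick sanity check of this formula, the following sketch applies it to a single assumed affine transformation $x = a z_0 + b$ of a standard normal base variable, where the true density is known in closed form. The values of `a`, `b`, and `x` are arbitrary choices for the example.

```python
import torch
from torch.distributions import Normal

# Single affine flow: x = a * z0 + b, with z0 ~ N(0, 1).
a, b = torch.tensor(2.0), torch.tensor(3.0)
base = Normal(loc=0.0, scale=1.0)

x = torch.tensor(4.0)
z0 = (x - b) / a                                  # invert the flow: z0 = f^{-1}(x)
log_det = torch.log(torch.abs(a))                 # log |det df/dz0| = log |a| for this map

log_px_flow = base.log_prob(z0) - log_det         # log p(x) = log p0(z0) - log|det J|
log_px_true = Normal(loc=b, scale=a).log_prob(x)  # x is exactly N(b, a^2) here

print(torch.allclose(log_px_flow, log_px_true))   # True
```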
For a normalizing flow to be mathematically sound and computationally practical, each function must satisfy specific criteria.
First, each $f_k$ must be bijective. It must map exactly one input vector to exactly one output vector. This guarantees that the inverse function $f_k^{-1}$ exists, allowing us to reconstruct the exact noise vector $\mathbf{z}_0$ that generated a given data point $\mathbf{x}$.
Second, both the function and its inverse must be differentiable. A bijective and differentiable function with a differentiable inverse is mathematically known as a diffeomorphism. This property ensures we can optimize the parameters of our functions using standard gradient descent techniques.
Third, we must be able to calculate the determinant of the Jacobian matrix efficiently. In machine learning tasks, $\mathbf{x}$ often has thousands of dimensions. Computing the determinant of a general $D \times D$ Jacobian matrix requires $O(D^3)$ operations, which is computationally prohibitive for large models. A major focus of normalizing flow architecture design is structuring the transformations so that their Jacobian matrices are triangular. The determinant of a triangular matrix is simply the product of its diagonal elements. This structural constraint reduces the time complexity from $O(D^3)$ down to $O(D)$, as illustrated in the sketch below.
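As an illustration of this idea, the following sketch follows the affine coupling design (in the spirit of RealNVP), where leaving part of the input unchanged makes the Jacobian triangular. The split point, hidden layer size, and network architecture are assumptions made for the example.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Illustrative affine coupling layer with a triangular Jacobian."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Small network predicting a scale and shift for the second half
        # of the input, conditioned on the (unchanged) first half.
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.net(z1).chunk(2, dim=1)
        y2 = z2 * torch.exp(s) + t            # only the second half is transformed
        y = torch.cat([z1, y2], dim=1)
        # The Jacobian is lower triangular with exp(s) on the diagonal,
        # so log|det J| is just the sum of s: an O(D) computation.
        log_det = s.sum(dim=1)
        return y, log_det
```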
In PyTorch, we implement a normalizing flow as a specialized neural network. Instead of just returning the transformed tensor, the forward pass of our flow layer must return both the output tensor and the log-determinant of its Jacobian. When we stack these layers, we accumulate the log-determinants across all layers in the sequence. This accumulated value is then passed directly into our loss function for maximum likelihood estimation.
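A minimal sketch of this pattern follows, reusing the illustrative `AffineCoupling` layer from above. Note that these layers are applied in the normalizing direction, mapping data $\mathbf{x}$ toward the base distribution, so the accumulated log-determinants are added rather than subtracted when evaluating $\log p(\mathbf{x})$. A practical model would also permute or swap the halves between layers so that every dimension eventually gets transformed; that detail is omitted here for brevity.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class NormalizingFlow(nn.Module):
    """Stack of flow layers; each layer returns (output, log_det)."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))
        self.base = Normal(torch.zeros(dim), torch.ones(dim))

    def forward(self, x):
        total_log_det = torch.zeros(x.shape[0])
        for layer in self.layers:
            x, log_det = layer(x)
            total_log_det = total_log_det + log_det   # accumulate across layers
        return x, total_log_det

# Maximum likelihood training step (sketch): the layers map data x back toward
# the base distribution, so log p(x) = log p0(z) + sum of log-determinants.
flow = NormalizingFlow(dim=6)
x = torch.randn(32, 6)                                # stand-in batch of data
z, log_det = flow(x)
log_px = flow.base.log_prob(z).sum(dim=1) + log_det
loss = -log_px.mean()                                 # negative log-likelihood
loss.backward()
```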