The change of variables theorem provides a mathematical mechanism to map a simple base probability distribution to a more complex one while keeping track of the exact probability density. However, a single parameterized function, such as an affine scaling or translation, can only reshape a distribution in limited ways. A single layer is rarely expressive enough to capture highly irregular and multimodal data distributions.
To achieve the necessary flexibility, we rely on the mathematical property of function composition. If we have two invertible functions $f_1$ and $f_2$, their composition $f_2 \circ f_1$ is also invertible. The inverse of this composed function is the composition of the individual inverses applied in reverse order, $(f_2 \circ f_1)^{-1} = f_1^{-1} \circ f_2^{-1}$.
This property extends to any number of functions. By chaining multiple parameterized transformations sequentially, we construct the architecture of a normalizing flow.
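As a minimal sketch of this property, the snippet below composes two illustrative invertible affine functions and checks that applying their individual inverses in reverse order recovers the original input. The functions `f1` and `f2` are hypothetical choices made for this example, not part of any particular library.

```python
import torch

# Two simple invertible transformations (illustrative affine maps).
def f1(z):        # scale by 2, shift by 1
    return 2.0 * z + 1.0

def f1_inv(z):
    return (z - 1.0) / 2.0

def f2(z):        # scale by -0.5, shift by 3
    return -0.5 * z + 3.0

def f2_inv(z):
    return (z - 3.0) / -0.5

z = torch.randn(4)
x = f2(f1(z))                      # forward composition f2 ∘ f1
z_recovered = f1_inv(f2_inv(x))    # inverses applied in reverse order

print(torch.allclose(z, z_recovered))  # True
```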
Let $\mathbf{z}_0$ be a random variable sampled from a known base distribution $p_0(\mathbf{z}_0)$. We apply a sequence of invertible transformations $f_1, f_2, \dots, f_K$. Each intermediate variable $\mathbf{z}_i$ in the sequence is defined by the output of the previous layer:

$$\mathbf{z}_i = f_i(\mathbf{z}_{i-1}), \quad i = 1, \dots, K$$

The final output is the generated sample $\mathbf{x} = \mathbf{z}_K$. The overall mapping from the base distribution to the target distribution is defined as:

$$\mathbf{x} = f_K \circ f_{K-1} \circ \cdots \circ f_1(\mathbf{z}_0)$$
Sequence of variables mapping a base distribution to a complex target distribution through stacked invertible transformations.
When we stack functions to warp the sample space, we must compute the total change in volume for the entire sequence to maintain exact density tracking. The chain rule of calculus states that the Jacobian matrix of a composed function is the product of the Jacobian matrices of the individual functions. Furthermore, a property of linear algebra states that the determinant of a product of matrices is the product of their individual determinants.
For our normalizing flow, the Jacobian determinant of the complete transformation from $\mathbf{z}_0$ to $\mathbf{x}$ is the product of the Jacobian determinants computed at each intermediate step $i$:

$$\det \frac{\partial \mathbf{x}}{\partial \mathbf{z}_0} = \prod_{i=1}^{K} \det \frac{\partial \mathbf{z}_i}{\partial \mathbf{z}_{i-1}}$$
In practice, multiplying many small determinant values together frequently results in numerical underflow, especially in deep neural networks trained with single-precision floating-point arithmetic. To avoid this, machine learning frameworks like PyTorch perform these computations in logarithmic space. Taking the logarithm of a product converts it into a sum, which is numerically stable.
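To see the underflow problem concretely, the toy example below (plain tensor arithmetic, not a specific framework API) multiplies 200 small per-layer determinant values in single precision: the raw product collapses to zero, while the equivalent sum of logarithms stays finite and usable.

```python
import torch

# 200 layers, each contributing a small Jacobian determinant of 0.01.
dets = torch.full((200,), 0.01)

raw_product = torch.prod(dets)          # 0.01**200 underflows to exactly 0.0
sum_of_logs = torch.log(dets).sum()     # 200 * log(0.01) ≈ -921.03

print(raw_product.item())   # 0.0
print(sum_of_logs.item())   # ≈ -921.034
```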
Substituting our product of determinants into the change of variables formula yields the exact log-likelihood of a data point $\mathbf{x}$ under our model:

$$\log p(\mathbf{x}) = \log p_0(\mathbf{z}_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial \mathbf{z}_i}{\partial \mathbf{z}_{i-1}} \right|$$
This specific equation dictates the mechanics of training and evaluating any normalizing flow. To calculate the exact probability density of a given data point $\mathbf{x}$, we must pass it backward through all inverse transformations to arrive at $\mathbf{z}_0$. During this backward pass, we accumulate the log determinant of the Jacobian at each step and subtract the total sum from the log probability of $\mathbf{z}_0$ evaluated under the base distribution.
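The sketch below follows this recipe under some simplifying assumptions: each layer is a toy element-wise affine transformation, and it exposes a hypothetical `inverse` method returning both the inverted variable and the log absolute Jacobian determinant of its forward map.

```python
import torch
from torch.distributions import Normal

class AffineLayer:
    """Toy invertible layer: forward map z_i = exp(log_scale) * z_{i-1} + shift."""
    def __init__(self, log_scale, shift):
        self.log_scale, self.shift = log_scale, shift

    def inverse(self, z):
        z_prev = (z - self.shift) * torch.exp(-self.log_scale)
        log_det = self.log_scale.sum()   # log|det J| of the forward map
        return z_prev, log_det

def log_prob(x, layers, base_dist):
    """Exact log-likelihood: invert each layer, accumulating log-determinants."""
    z, total_log_det = x, 0.0
    for layer in reversed(layers):       # backward pass through the flow
        z, log_det = layer.inverse(z)
        total_log_det = total_log_det + log_det
    # log p(x) = log p_0(z_0) - sum_i log|det J_i|
    return base_dist.log_prob(z).sum(dim=-1) - total_log_det

dim = 2
layers = [AffineLayer(0.1 * torch.randn(dim), torch.randn(dim)) for _ in range(3)]
base = Normal(torch.zeros(dim), torch.ones(dim))

x = torch.randn(5, dim)
print(log_prob(x, layers, base))  # one exact log-density per data point
```

Training a real flow simply maximizes this quantity (or minimizes its negative) with respect to the parameters of each layer.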
Stacking transformations allows us to build highly expressive generative models, but it imposes strict computational requirements on the design of the individual functions $f_i$. If a flow architecture consists of 50 stacked layers, evaluating a single data point requires 50 function evaluations and 50 Jacobian determinant computations.
For a stacked model to scale efficiently to high-dimensional data, each layer must satisfy two strict conditions. First, the function $f_i$ must be easily and analytically invertible. Second, the Jacobian determinant $\det J_{f_i}$ must be cheap to compute.
In general, computing the determinant of an $n \times n$ matrix requires $O(n^3)$ operations. This cubic time complexity is far too slow for processing high-dimensional data such as high-resolution images or audio signals. Therefore, modern flow architectures deliberately design each $f_i$ to produce a triangular Jacobian matrix. The determinant of a triangular matrix is simply the product of its diagonal elements, reducing the computational cost of the determinant operation from $O(n^3)$ to $O(n)$.
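As an illustration of why the triangular structure matters, the snippet below compares a general determinant routine against the diagonal-only computation on a lower-triangular Jacobian; the matrix here is random and purely illustrative.

```python
import torch

n = 512
# A lower-triangular Jacobian with a strictly positive diagonal (illustrative).
J = torch.tril(torch.randn(n, n, dtype=torch.float64))
J.diagonal().copy_(torch.rand(n, dtype=torch.float64) + 0.5)

# General case: O(n^3), via an LU-style factorization.
log_det_general = torch.logdet(J)

# Triangular case: O(n), just sum the log of the diagonal entries.
log_det_triangular = torch.log(J.diagonal().abs()).sum()

print(torch.allclose(log_det_general, log_det_triangular))  # True
```

In practice, flow layers typically produce these diagonal entries analytically from their parameters, so the full Jacobian matrix never needs to be materialized.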
By assembling multiple parameterized layers that enforce these specific constraints, we build neural networks capable of both exact density evaluation and fast, parallelized sampling. The subsequent sections will detail specific layer designs that achieve this triangular structure.