Mapping a simple probability distribution to a more complex one relies on the change of variables theorem, and the central component of that equation is the Jacobian determinant. When an invertible function $f$ is applied to a random variable $z$ to produce $x = f(z)$, the probability density does not just move around; it scales. The Jacobian determinant measures exactly how much the volume of the space expands or contracts during this transformation.
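For reference (introducing $p_Z$ and $p_X$ for the densities of the base variable and the output), the theorem states:

$$p_X(x) = p_Z(z)\,\left|\det J_f(z)\right|^{-1}, \quad \text{where } x = f(z).$$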
For multi-dimensional data, such as an image or a sequence of tokens, our variable $z$ is a vector of size $D$. The function $f$ outputs another vector $x$ of the same size. The Jacobian matrix, denoted as $J_f$, captures all the partial derivatives of the function with respect to the input vector $z$.
Every element $J_{ij} = \partial x_i / \partial z_j$ in this matrix represents how a small change in one specific input dimension $z_j$ affects one specific output dimension $x_i$.
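As a concrete illustration, here is a minimal NumPy sketch that approximates each entry $J_{ij}$ with central finite differences; the function `f` and the step size `eps` are arbitrary choices for this example:

```python
import numpy as np

def f(z):
    # An arbitrary invertible 2D map chosen just for illustration.
    return np.array([np.exp(z[0]), z[0] + z[1] ** 3])

def numerical_jacobian(func, z, eps=1e-6):
    """Approximate J_ij = d f_i / d z_j with central differences."""
    z = np.asarray(z, dtype=float)
    D = z.size
    J = np.zeros((D, D))
    for j in range(D):
        dz = np.zeros(D)
        dz[j] = eps
        # Column j: how every output reacts to a nudge in input j.
        J[:, j] = (func(z + dz) - func(z - dz)) / (2 * eps)
    return J

z = np.array([0.5, -1.0])
print(numerical_jacobian(f, z))
```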
The determinant of this matrix, $\det J_f$, quantifies the total change in volume across all dimensions. For example, take a simple 2D scaling transformation where $x_1 = 2z_1$ and $x_2 = 3z_2$. The Jacobian is a diagonal matrix with 2 and 3 on the diagonal, and its determinant is $2 \times 3 = 6$. Because the transformation expands the volume of the space by a factor of 6, the probability density must decrease by a factor of 6 to ensure the total probability still integrates to 1. We take the absolute value of the determinant, $|\det J_f|$, because volume scaling is always positive, even if the transformation flips the spatial orientation.
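A short NumPy check of this worked example, using a standard normal base density (any base density would scale the same way):

```python
import numpy as np

# The 2D scaling transformation x = f(z) with x1 = 2*z1, x2 = 3*z2.
J = np.diag([2.0, 3.0])
det = np.linalg.det(J)
print(det)  # 6.0: the map expands areas by a factor of 6

# Density of a standard normal base distribution at a point z ...
z = np.array([0.3, -0.7])
p_z = np.exp(-0.5 * z @ z) / (2 * np.pi)

# ... becomes p_z / |det J| at the transformed point x = f(z).
p_x = p_z / abs(det)
print(p_z, p_x)  # p_x is exactly 6x smaller than p_z
```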
When building machine learning models, multiplying many small probabilities or determinants together leads to numerical underflow. To maintain numerical stability, we almost always compute the log-determinant instead. Using standard logarithm properties, products become sums and the log-density equation reduces to simple addition and subtraction: $\log p_X(x) = \log p_Z(z) - \log\left|\det J_f(z)\right|$.
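A minimal sketch of why this matters, assuming NumPy; `np.linalg.slogdet` returns the sign and the log of the absolute determinant without ever forming the determinant itself, which can underflow:

```python
import numpy as np

# A Jacobian whose determinant underflows in float64 ...
J = 0.01 * np.eye(500)
print(np.linalg.det(J))          # 0.0 -- underflow
print(np.log(np.linalg.det(J)))  # -inf -- useless for training

# ... but whose log-determinant is perfectly representable.
sign, logabsdet = np.linalg.slogdet(J)
print(logabsdet)                 # 500 * log(0.01), about -2302.6
```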
Computing the determinant of a general $D \times D$ matrix is highly inefficient: it has a time complexity of $O(D^3)$. For high-dimensional data like images or audio, where $D$ can easily reach tens of thousands, calculating a dense Jacobian determinant is computationally prohibitive.
To make density estimation practical, normalizing flows restrict the mathematical functions they use. Flow architectures are specifically engineered so that their Jacobian matrices are triangular, meaning all entries either above or below the main diagonal are zero.
Figure: Structure of Jacobian matrices and the computational advantage of triangular designs in flow models.
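To make this concrete, here is a minimal sketch of an affine coupling layer in the style of RealNVP; the fixed random linear maps stand in for learned scale and shift networks. Because the first half of the input passes through unchanged, the Jacobian is lower triangular, and its log-determinant is simply the sum of the predicted log-scales:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 6  # even input dimension, split into two halves of size D // 2

# Stand-ins for the learned scale and shift networks: here, fixed
# random linear maps from the first half to the second half.
W_s = 0.1 * rng.standard_normal((D // 2, D // 2))
W_t = rng.standard_normal((D // 2, D // 2))

def coupling_forward(z):
    """Affine coupling: x1 = z1, x2 = z2 * exp(s(z1)) + t(z1)."""
    z1, z2 = z[: D // 2], z[D // 2 :]
    s, t = W_s @ z1, W_t @ z1          # log-scale and shift depend only on z1
    x = np.concatenate([z1, z2 * np.exp(s) + t])
    log_det = s.sum()                  # log|det J| = sum of log-scales, O(D)
    return x, log_det

def coupling_inverse(x):
    """Exact inverse: z2 = (x2 - t(x1)) * exp(-s(x1))."""
    x1, x2 = x[: D // 2], x[D // 2 :]
    s, t = W_s @ x1, W_t @ x1
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

z = rng.standard_normal(D)
x, log_det = coupling_forward(z)
print(np.allclose(coupling_inverse(x), z))  # True: the layer is invertible
```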
The determinant of a triangular matrix is simply the product of its diagonal elements. This mathematical property allows us to completely ignore the off-diagonal partial derivatives when computing the volume change.
By combining this property with the logarithm, we compute the log-determinant as a sum of log-diagonals: $\log|\det J| = \sum_{i=1}^{D} \log|J_{ii}|$. This optimization reduces the time complexity of the determinant calculation from $O(D^3)$ to $O(D)$, allowing normalizing flows to scale to high-dimensional datasets.
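A quick NumPy sanity check of this shortcut, using an arbitrary random lower-triangular matrix in place of a flow's Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000

# A random lower-triangular matrix standing in for a flow's Jacobian.
J = np.tril(rng.standard_normal((D, D)))

# O(D^3) general-purpose computation ...
_, logdet_full = np.linalg.slogdet(J)

# ... versus the O(D) shortcut: sum the logs of the diagonal.
logdet_diag = np.sum(np.log(np.abs(np.diag(J))))

print(np.allclose(logdet_full, logdet_diag))  # True
```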
The need to calculate these determinants efficiently dictates how we design the internal layers of a flow model: we want transformations that are highly expressive but also guarantee a triangular Jacobian. In the upcoming sections, we will use this mathematical foundation to evaluate exact probability densities during the forward pass and generate new data during the inverse pass.