Modeling high-dimensional probability distributions requires understanding how different variables interact with one another. When processing an image or a time series, the value of one feature often depends heavily on the values of preceding features. Autoregressive generative models capture this dependency by treating the data generation process as a sequence of steps.
At the foundation of autoregressive models is the product rule of probability. Any joint probability distribution of a $D$-dimensional random variable $\mathbf{x} = (x_1, x_2, \dots, x_D)$ can be factored into a product of conditional probabilities. Mathematically, this is expressed as:

$$p(x_1, x_2, \dots, x_D) = p(x_1) \prod_{d=2}^{D} p(x_d \mid x_1, \dots, x_{d-1})$$
In this formulation, the probability of the first variable $x_1$ is modeled unconditionally. The probability of the second variable $x_2$ is conditioned on $x_1$. The third variable $x_3$ is conditioned on both $x_1$ and $x_2$. This pattern continues until the final variable $x_D$ is conditioned on all preceding variables. This sequential dependency is what gives the autoregressive model its name.
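The practical benefit of this factorization is that the joint log-likelihood becomes a sum of per-dimension terms. The sketch below illustrates this with hypothetical Gaussian conditionals; `cond_mean` and `cond_log_std` are assumed stand-ins for whatever model maps the prefix $x_1, \dots, x_{d-1}$ to the conditional parameters, not part of any particular library.

```python
import numpy as np

def log_prob_autoregressive(x, cond_mean, cond_log_std):
    """Evaluate log p(x) as a sum of conditional Gaussian log-densities.

    cond_mean and cond_log_std are hypothetical callables mapping the
    prefix x[:d] to the mean and log standard deviation of
    p(x_d | x_1, ..., x_{d-1}).
    """
    log_p = 0.0
    for d in range(len(x)):
        mu = cond_mean(x[:d])              # parameters see only earlier dimensions
        log_sigma = cond_log_std(x[:d])
        z = (x[d] - mu) / np.exp(log_sigma)
        # log N(x_d | mu, sigma^2)
        log_p += -0.5 * z**2 - log_sigma - 0.5 * np.log(2.0 * np.pi)
    return log_p
```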
Let us represent this sequential dependency structure visually.
Autoregressive dependency structure where each variable conditions all subsequent variables in the sequence.
To use this autoregressive structure within a normalizing flow, we must frame it as an invertible transformation. Let $\mathbf{z}$ be a latent variable drawn from a simple base distribution, such as an isotropic Gaussian, and let $\mathbf{x}$ be the target data variable. An autoregressive transformation maps $\mathbf{z}$ to $\mathbf{x}$ by defining each output dimension $x_i$ as a function of the corresponding latent dimension $z_i$ and all previously generated data dimensions $x_1, \dots, x_{i-1}$:

$$x_i = \tau(z_i; \theta_i), \qquad \theta_i = c_i(x_1, \dots, x_{i-1})$$
Here, $\tau$ is an invertible mapping, and $\theta_i$ represents the parameters of that mapping. The parameters are computed by a neural network $c_i$ that only observes the previous dimensions $x_1, \dots, x_{i-1}$.
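As a concrete illustration, the sketch below assumes an affine choice of $\tau$, so that $x_i = \mu_i + \exp(\alpha_i)\, z_i$; the `conditioner` argument is a hypothetical function standing in for the neural network, and any other invertible $\tau$ could be substituted.

```python
import numpy as np

def sample_autoregressive(z, conditioner):
    """Map a latent vector z to a data vector x, one dimension at a time.

    conditioner is a hypothetical network: given the prefix x[:i] it returns
    (mu_i, alpha_i), the parameters of an affine tau with
    x_i = mu_i + exp(alpha_i) * z_i.
    """
    x = np.zeros_like(z)
    for i in range(len(z)):
        mu_i, alpha_i = conditioner(x[:i])    # observes only x_1, ..., x_{i-1}
        x[i] = mu_i + np.exp(alpha_i) * z[i]  # invertible in z_i for a fixed prefix
    return x
```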
This formulation leads to a highly advantageous mathematical property. When we compute the Jacobian matrix of this transformation, the derivative of $x_i$ with respect to $z_j$ (where $j > i$) is always zero. The resulting Jacobian matrix is lower-triangular.
Let us look at a three-dimensional example to see the structure of this Jacobian matrix:

$$\frac{\partial \mathbf{x}}{\partial \mathbf{z}} = \begin{pmatrix} \dfrac{\partial x_1}{\partial z_1} & 0 & 0 \\ \dfrac{\partial x_2}{\partial z_1} & \dfrac{\partial x_2}{\partial z_2} & 0 \\ \dfrac{\partial x_3}{\partial z_1} & \dfrac{\partial x_3}{\partial z_2} & \dfrac{\partial x_3}{\partial z_3} \end{pmatrix}$$
Because the Jacobian is lower-triangular, its determinant is simply the product of the terms on the main diagonal:

$$\det\left(\frac{\partial \mathbf{x}}{\partial \mathbf{z}}\right) = \prod_{i=1}^{D} \frac{\partial x_i}{\partial z_i}$$
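A quick numerical check makes both properties visible. The toy conditioner below is a hypothetical stand-in for a trained network; the finite-difference Jacobian of the resulting three-dimensional affine map is lower-triangular, and its determinant matches the product of its diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_conditioner(prefix):
    # Hypothetical fixed rules standing in for a trained neural network.
    mu = 0.3 * np.sum(prefix)
    alpha = 0.1 * len(prefix) - 0.2
    return mu, alpha

def forward(z):
    x = np.zeros_like(z)
    for i in range(len(z)):
        mu, alpha = toy_conditioner(x[:i])
        x[i] = mu + np.exp(alpha) * z[i]
    return x

z = rng.normal(size=3)
eps = 1e-6
J = np.zeros((3, 3))
for j in range(3):                     # finite-difference Jacobian, one column per z_j
    dz = np.zeros(3)
    dz[j] = eps
    J[:, j] = (forward(z + dz) - forward(z - dz)) / (2 * eps)

print(np.round(J, 4))                             # entries above the diagonal are zero
print(np.linalg.det(J), np.prod(np.diag(J)))      # determinant equals the diagonal product
```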
This property solves a major computational bottleneck in normalizing flows. Computing the determinant of a general $D \times D$ matrix takes $O(D^3)$ operations. By enforcing an autoregressive structure, the determinant calculation is reduced to $O(D)$ operations. This reduction makes it possible to scale normalizing flows to datasets with thousands or millions of dimensions.
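The saving is easy to see in code: for a triangular Jacobian, the log-determinant can be read off the diagonal instead of running a general matrix factorization. The matrix below is a randomly generated stand-in for the Jacobian of an autoregressive transformation, used only to compare the two routes.

```python
import numpy as np

D = 2000
rng = np.random.default_rng(1)

# A lower-triangular stand-in for the Jacobian of an autoregressive transformation.
J = np.tril(rng.normal(size=(D, D)))
np.fill_diagonal(J, np.abs(np.diag(J)) + 0.1)   # keep diagonal entries positive

# General route: factorize the full D x D matrix, roughly O(D^3) work.
sign, logdet_general = np.linalg.slogdet(J)

# Autoregressive shortcut: sum the log of D diagonal entries, O(D) work.
logdet_diagonal = np.log(np.diag(J)).sum()

print(np.allclose(logdet_general, logdet_diagonal))   # True
```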
While the mathematics of autoregressive models provide a clear path for scaling density estimation, implementing them efficiently requires careful architectural design. If we compute each $x_i$ sequentially using a standard recurrent loop, the sampling process becomes slow. The time required to generate a complete sample grows linearly with the number of dimensions $D$.
To make these models practical for machine learning workflows, we parameterize the conditional distributions using deep neural networks. By using specific network designs such as masking, we can compute the parameters for all dimensions simultaneously during training in a single forward pass. This allows us to evaluate the exact log-likelihood of training data efficiently. The following sections explain how to implement these masked neural network architectures effectively.
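As a schematic preview of that idea, and under the simplifying assumption of a single linear layer with one output per dimension, a strictly lower-triangular binary mask applied to the weight matrix already guarantees that output $d$ depends only on inputs $x_1, \dots, x_{d-1}$, while the whole computation remains one matrix product; practical masked architectures extend this to hidden layers and multiple parameters per dimension.

```python
import numpy as np

D = 4
rng = np.random.default_rng(2)

W = rng.normal(size=(D, D))
mask = np.tril(np.ones((D, D)), k=-1)   # strictly lower-triangular binary mask
x = rng.normal(size=D)

# All D outputs in one matrix product; output d uses only x_1, ..., x_{d-1}.
params = (W * mask) @ x
print(params)   # params[0] is 0 here: the first dimension gets no conditioning input
```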