Modeling high-dimensional probability distributions requires understanding how different variables interact with one another. When processing an image or a time series, the value of one feature often depends heavily on the values of preceding features. Autoregressive generative models capture this dependency by treating the data generation process as a sequence of steps.
At the foundation of autoregressive models is the product rule of probability. Any joint probability distribution of a $D$-dimensional random variable $\mathbf{x} = (x_1, x_2, \dots, x_D)$ can be factored into a product of conditional probabilities. Mathematically, this is expressed as:

$$p(x_1, x_2, \dots, x_D) = p(x_1) \prod_{d=2}^{D} p(x_d \mid x_1, \dots, x_{d-1})$$
In this formulation, the probability of the first variable $x_1$ is modeled unconditionally. The probability of the second variable $x_2$ is conditioned on $x_1$. The third variable $x_3$ is conditioned on both $x_1$ and $x_2$. This pattern continues until the final variable $x_D$ is conditioned on all preceding variables. This sequential dependency is what gives the autoregressive model its name.
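The practical benefit of this factorization is that the joint log-likelihood becomes a sum of per-dimension terms. The sketch below illustrates this with hypothetical Gaussian conditionals; `cond_mean` and `cond_log_std` are assumed stand-ins for whatever model maps the prefix $x_1, \dots, x_{d-1}$ to the conditional parameters, not part of any particular library.

```python
import numpy as np

def log_prob_autoregressive(x, cond_mean, cond_log_std):
    """Evaluate log p(x) as a sum of conditional Gaussian log-densities.

    cond_mean and cond_log_std are hypothetical callables mapping the
    prefix x[:d] to the mean and log standard deviation of
    p(x_d | x_1, ..., x_{d-1}).
    """
    log_p = 0.0
    for d in range(len(x)):
        mu = cond_mean(x[:d])              # parameters see only earlier dimensions
        log_sigma = cond_log_std(x[:d])
        z = (x[d] - mu) / np.exp(log_sigma)
        # log N(x_d | mu, sigma^2)
        log_p += -0.5 * z**2 - log_sigma - 0.5 * np.log(2.0 * np.pi)
    return log_p
```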
Let us represent this sequential dependency structure visually.
Autoregressive dependency structure where each variable conditions all subsequent variables in the sequence.
To use this autoregressive structure within a normalizing flow, we must frame it as an invertible transformation. Let $\mathbf{z}$ be a latent variable drawn from a simple base distribution, such as an isotropic Gaussian, and let $\mathbf{x}$ be the target data variable. An autoregressive transformation maps $\mathbf{z}$ to $\mathbf{x}$ by defining each output dimension $x_i$ as a function of the corresponding latent dimension $z_i$ and all previously generated data dimensions $x_1, \dots, x_{i-1}$:

$$x_i = \tau(z_i; \theta_i), \qquad \theta_i = c_i(x_1, \dots, x_{i-1})$$
Here, $\tau$ is an invertible mapping, and $\theta_i$ represents the parameters of that mapping. The parameters are computed by a neural network $c_i$ that only observes the previous dimensions $x_1, \dots, x_{i-1}$.
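As a concrete illustration, the sketch below assumes an affine choice of $\tau$, so that $x_i = \mu_i + \exp(\alpha_i)\, z_i$; the `conditioner` argument is a hypothetical function standing in for the neural network, and any other invertible $\tau$ could be substituted.

```python
import numpy as np

def sample_autoregressive(z, conditioner):
    """Map a latent vector z to a data vector x, one dimension at a time.

    conditioner is a hypothetical network: given the prefix x[:i] it returns
    (mu_i, alpha_i), the parameters of an affine tau with
    x_i = mu_i + exp(alpha_i) * z_i.
    """
    x = np.zeros_like(z)
    for i in range(len(z)):
        mu_i, alpha_i = conditioner(x[:i])    # observes only x_1, ..., x_{i-1}
        x[i] = mu_i + np.exp(alpha_i) * z[i]  # invertible in z_i for a fixed prefix
    return x
```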
This formulation leads to a highly advantageous mathematical property. When we compute the Jacobian matrix of this transformation, the derivative of $x_i$ with respect to $z_j$ (where $j > i$) is always zero. The resulting Jacobian matrix is lower-triangular.
Let us look at a three-dimensional example to see the structure of this Jacobian matrix:

$$\frac{\partial \mathbf{x}}{\partial \mathbf{z}} = \begin{pmatrix} \dfrac{\partial x_1}{\partial z_1} & 0 & 0 \\ \dfrac{\partial x_2}{\partial z_1} & \dfrac{\partial x_2}{\partial z_2} & 0 \\ \dfrac{\partial x_3}{\partial z_1} & \dfrac{\partial x_3}{\partial z_2} & \dfrac{\partial x_3}{\partial z_3} \end{pmatrix}$$
Because the Jacobian is lower-triangular, its determinant is simply the product of the terms on the main diagonal:

$$\det\left(\frac{\partial \mathbf{x}}{\partial \mathbf{z}}\right) = \prod_{i=1}^{D} \frac{\partial x_i}{\partial z_i}$$
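A quick numerical check makes both properties visible. The toy conditioner below is a hypothetical stand-in for a trained network; the finite-difference Jacobian of the resulting three-dimensional affine map is lower-triangular, and its determinant matches the product of its diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_conditioner(prefix):
    # Hypothetical fixed rules standing in for a trained neural network.
    mu = 0.3 * np.sum(prefix)
    alpha = 0.1 * len(prefix) - 0.2
    return mu, alpha

def forward(z):
    x = np.zeros_like(z)
    for i in range(len(z)):
        mu, alpha = toy_conditioner(x[:i])
        x[i] = mu + np.exp(alpha) * z[i]
    return x

z = rng.normal(size=3)
eps = 1e-6
J = np.zeros((3, 3))
for j in range(3):                     # finite-difference Jacobian, one column per z_j
    dz = np.zeros(3)
    dz[j] = eps
    J[:, j] = (forward(z + dz) - forward(z - dz)) / (2 * eps)

print(np.round(J, 4))                             # entries above the diagonal are zero
print(np.linalg.det(J), np.prod(np.diag(J)))      # determinant equals the diagonal product
```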
This property solves a major computational bottleneck in normalizing flows. Computing the determinant of a general $D \times D$ matrix takes $O(D^3)$ operations. By enforcing an autoregressive structure, the determinant calculation is reduced to $O(D)$ operations. This reduction makes it possible to scale normalizing flows to datasets with thousands or millions of dimensions.
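The saving is easy to see in code: for a triangular Jacobian, the log-determinant can be read off the diagonal instead of running a general matrix factorization. The matrix below is a randomly generated stand-in for the Jacobian of an autoregressive transformation, used only to compare the two routes.

```python
import numpy as np

D = 2000
rng = np.random.default_rng(1)

# A lower-triangular stand-in for the Jacobian of an autoregressive transformation.
J = np.tril(rng.normal(size=(D, D)))
np.fill_diagonal(J, np.abs(np.diag(J)) + 0.1)   # keep diagonal entries positive

# General route: factorize the full D x D matrix, roughly O(D^3) work.
sign, logdet_general = np.linalg.slogdet(J)

# Autoregressive shortcut: sum the log of D diagonal entries, O(D) work.
logdet_diagonal = np.log(np.diag(J)).sum()

print(np.allclose(logdet_general, logdet_diagonal))   # True
```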
While the mathematics of autoregressive models provide a clear path for scaling density estimation, implementing them efficiently requires careful architectural design. If we compute each $x_i$ sequentially using a standard recurrent loop, the sampling process becomes slow. The time required to generate a complete sample grows linearly with the number of dimensions $D$.
To make these models practical for machine learning workflows, we parameterize the conditional distributions using deep neural networks. By using specific network designs such as masking, we can compute the parameters for all dimensions simultaneously during training in a single forward pass. This allows us to evaluate the exact log-likelihood of training data efficiently. The following sections explain how to implement these masked neural network architectures effectively.
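As a schematic preview of that idea, and under the simplifying assumption of a single linear layer with one output per dimension, a strictly lower-triangular binary mask applied to the weight matrix already guarantees that output $d$ depends only on inputs $x_1, \dots, x_{d-1}$, while the whole computation remains one matrix product; practical masked architectures extend this to hidden layers and multiple parameters per dimension.

```python
import numpy as np

D = 4
rng = np.random.default_rng(2)

W = rng.normal(size=(D, D))
mask = np.tril(np.ones((D, D)), k=-1)   # strictly lower-triangular binary mask
x = rng.normal(size=D)

# All D outputs in one matrix product; output d uses only x_1, ..., x_{d-1}.
params = (W * mask) @ x
print(params)   # params[0] is 0 here: the first dimension gets no conditioning input
```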