Masked Autoregressive Flow (MAF) excels at density estimation because computing the exact likelihood of a data point requires only a single pass through the network. However, generating new data with MAF requires a sequential loop over all dimensions. When dealing with high-dimensional data like audio or images, this sequential sampling process becomes computationally expensive. Inverse Autoregressive Flow (IAF) solves this problem by restructuring the autoregressive transformation to allow for highly efficient, parallel sampling.
Let us look at the mathematics of this transformation. In an Inverse Autoregressive Flow, the target variable $x$ is generated from a base variable $z$ drawn from a simple distribution, typically a standard Gaussian. The autoregressive transformation is defined as:

$$x_i = \mu_i(z_{1:i-1}) + \sigma_i(z_{1:i-1}) \cdot z_i$$
Notice the important difference between IAF and MAF. In MAF, the scale and shift parameters depend on the previously generated target variables $x_{1:i-1}$. In IAF, the parameters $\mu_i$ and $\sigma_i$ depend on the previously sampled base variables $z_{1:i-1}$.
Because the entire vector $z$ is sampled independently from the base distribution all at once, every element of $z_{1:i-1}$ is already known before we compute $x_i$. This means we can pass the entire vector $z$ through our neural network in a single parallel pass to output all $\mu_i$ and $\sigma_i$ values simultaneously.
Once the network outputs the parameters, the final variables are calculated using simple element-wise operations. Generating a complete sample takes a single step regardless of the data dimensionality.
The sampling pass of an Inverse Autoregressive Flow calculates all shift and scale parameters in parallel from the base sample, eliminating sequential loops.
The cost of this fast sampling architecture is slow density estimation. Suppose you have a data point $x$ and want to evaluate its exact likelihood. To do this, you must invert the transformation to find the corresponding base variable $z$. The inverse operation is:

$$z_i = \frac{x_i - \mu_i(z_{1:i-1})}{\sigma_i(z_{1:i-1})}$$
Here lies the computational bottleneck. To compute $z_i$, you must already know $z_{1:i-1}$ to calculate $\mu_i$ and $\sigma_i$. You cannot compute the parameters in parallel because the inputs to the neural network are the very values you are actively trying to solve for. You must compute $z_1$, pass it through the network to find $\mu_2$ and $\sigma_2$, use those to compute $z_2$, and repeat this process for all $D$ dimensions. Evaluating the density therefore requires a sequential loop, making IAF inefficient to train via exact maximum likelihood estimation.
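To make this bottleneck concrete, here is a minimal sketch of the sequential inversion loop. The ToyMaskedNet below is a hypothetical stand-in for a real MADE-style conditioner: a single masked linear layer whose strictly lower-triangular mask enforces the autoregressive property, so that output position i depends only on inputs before i.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMaskedNet(nn.Module):
    """Hypothetical stand-in for a MADE-style autoregressive network:
    a single masked linear layer whose strictly lower-triangular mask
    ensures output position i depends only on inputs 1..i-1."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, 2 * dim)
        mask = torch.tril(torch.ones(dim, dim), diagonal=-1)
        # Same mask for the mu rows and the log_sigma rows of the weight
        self.register_buffer("mask", torch.cat([mask, mask], dim=0))

    def forward(self, z):
        return F.linear(z, self.linear.weight * self.mask, self.linear.bias)

def density_pass(net, x):
    """Invert the IAF transform: recover z from x one dimension at a time."""
    z = torch.zeros_like(x)
    log_det = torch.zeros(x.shape[0])
    for i in range(x.shape[-1]):
        # Parameters at position i depend only on z[:, :i], which were
        # already filled in by earlier iterations of this loop.
        mu, log_sigma = torch.chunk(net(z), 2, dim=-1)
        z[:, i] = (x[:, i] - mu[:, i]) / torch.exp(log_sigma[:, i])
        log_det = log_det - log_sigma[:, i]  # accumulates log |det dz/dx|
    return z, log_det
```

Note that the loop body calls the full network once per dimension: D sequential forward passes, versus the single pass needed for sampling.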
To make the mechanics of the sampling pass clear, let's see how you might implement the forward generation step in PyTorch. Assume we have an autoregressive network initialized as autoregressive_net that takes $z$ as input and outputs the concatenated shift and scale parameters.
import torch
import torch.nn as nn

class IAFLayer(nn.Module):
    def __init__(self, autoregressive_net):
        super().__init__()
        # The network could be a MADE architecture
        self.net = autoregressive_net

    def forward_sample(self, z):
        # z is sampled from the base distribution: z ~ N(0, I)
        # Pass z through the network in a single step
        params = self.net(z)
        # Split parameters into shift (mu) and log scale (log_sigma)
        mu, log_sigma = torch.chunk(params, 2, dim=-1)
        # Constrain scale to be positive
        sigma = torch.exp(log_sigma)
        # Compute the final sample x in parallel
        x = mu + sigma * z
        # Calculate the log determinant of the Jacobian
        log_det_jacobian = log_sigma.sum(dim=-1)
        return x, log_det_jacobian
In this code snippet, the forward_sample method executes without any loops. We use the exponential function on log_sigma to ensure the scale parameter is strictly positive, which is required for the transformation to be invertible. The log-determinant of the Jacobian is simply the sum of the log-scale parameters, just as in MAF.
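To see how that log-determinant is used, here is a small self-contained sketch of the change-of-variables computation. The fixed mu and log_sigma tensors are placeholders standing in for the network's output; in a real IAF they would come from the autoregressive net.

```python
import torch

torch.manual_seed(0)
dim = 4
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

# Placeholder parameters: in a real IAF these come from the network
z = base.sample((2,))
mu = torch.zeros(2, dim)
log_sigma = torch.full((2, dim), 0.5)

# Forward sample and log-determinant, as in forward_sample
x = mu + torch.exp(log_sigma) * z
log_det = log_sigma.sum(dim=-1)

# Change of variables: log q(x) = log p(z) - log |det dx/dz|
log_prob_x = base.log_prob(z).sum(dim=-1) - log_det
```

Because mu and log_sigma are constant here, the resulting density is just a Gaussian with scale exp(0.5) per dimension, which gives an easy way to sanity-check the formula.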
MAF and IAF form a clean architectural duality. If you train a generative model using exact maximum likelihood estimation, MAF is the standard choice. If your primary goal is the rapid generation of synthetic data, IAF is the superior model. In applied settings, researchers sometimes combine the strengths of both architectures. For example, some training procedures first train a MAF model, which evaluates densities quickly, and then use the trained MAF as a teacher to distill its knowledge into an IAF student. This technique, known as probability density distillation, allows the IAF model to learn the target distribution without relying on slow sequential loops during training. The final result is a deployed model that both trained efficiently and samples instantly.
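As a rough illustration of the distillation idea, the sketch below trains a single affine IAF-style transform (the student) to match a fixed Gaussian teacher by minimizing a Monte Carlo estimate of the reverse KL divergence. Both the Gaussian teacher and the one-layer student are simplifying assumptions made here for brevity; in practice the teacher would be a trained MAF and the student a deep stack of IAF layers.

```python
import torch

torch.manual_seed(0)
dim = 2
# Teacher: stands in for a trained density model (assumed Gaussian here)
teacher = torch.distributions.Normal(torch.full((dim,), 2.0),
                                     torch.full((dim,), 0.5))
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))

# Student: one affine IAF-style transform with learnable parameters
mu = torch.zeros(dim, requires_grad=True)
log_sigma = torch.zeros(dim, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(300):
    z = base.sample((256,))
    x = mu + torch.exp(log_sigma) * z                   # parallel student sample
    log_q = (base.log_prob(z) - log_sigma).sum(dim=-1)  # student density of x
    log_p = teacher.log_prob(x).sum(dim=-1)             # teacher density of x
    loss = (log_q - log_p).mean()                       # reverse KL estimate
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key property exploited here is that the student both samples and scores its own samples in a single parallel pass, so every training step is fast even though the teacher was trained separately by maximum likelihood.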