The Masked Autoregressive Flow (MAF) architecture uses Masked Autoencoder for Distribution Estimation (MADE) blocks to construct a complete normalizing flow. While MADE provides an efficient way to compute autoregressive conditionals using a single network pass, a single MADE layer is generally not expressive enough to model complex distributions. MAF solves this by stacking multiple autoregressive transformations, treating each MADE block as a single invertible layer within a larger model.
In a Masked Autoregressive Flow, the transformation is defined using location and scale parameters. For a given data point $x = (x_1, \dots, x_D)$, we map it to a latent variable $z$ in the base distribution. The forward transformation, which we use for density estimation, is defined as:

$$z_i = (x_i - \mu_i) \cdot \exp(-\alpha_i)$$
Here, $\mu_i$ represents the mean and $\alpha_i$ represents the logarithm of the scale. In an autoregressive model, both $\mu_i$ and $\alpha_i$ are functions of only the previous data dimensions $x_{1:i-1}$.
Because the parameters $\mu_i$ and $\alpha_i$ depend strictly on the observed data $x_{1:i-1}$, and $x$ is fully known during the forward pass, we can compute all $z_i$ simultaneously. A single forward pass through a MADE network yields all the $\mu_i$ and $\alpha_i$ parameters at once. This parallel execution makes evaluating the exact probability density very fast.
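To make this concrete, here is a minimal NumPy sketch of one MAF layer in the density-estimation direction. The `made` function below is only a stand-in for a full MADE network: a single masked linear map whose strictly lower-triangular mask enforces the autoregressive property. The names (`made`, `W_mu`, `W_alpha`) and the single-linear-layer simplification are illustrative assumptions, not the actual MADE architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5                                  # data dimensionality

# Stand-in for a MADE network: one masked linear map per parameter. The
# strictly lower-triangular mask guarantees that mu_i and alpha_i depend
# only on x_1, ..., x_{i-1}. A real MADE would be a deeper masked MLP,
# but the autoregressive property is the same.
mask = np.tril(np.ones((D, D)), k=-1)
W_mu = rng.normal(size=(D, D)) * mask
W_alpha = rng.normal(size=(D, D)) * mask
b_mu = rng.normal(size=D)
b_alpha = rng.normal(size=D) * 0.1

def made(x):
    """One pass yields all mu_i and alpha_i simultaneously."""
    return W_mu @ x + b_mu, W_alpha @ x + b_alpha

# Density-estimation direction: x is fully known, so everything is parallel.
x = rng.normal(size=D)
mu, alpha = made(x)                    # single network pass
z = (x - mu) * np.exp(-alpha)          # z_i = (x_i - mu_i) * exp(-alpha_i)
```

Note that `mu[0]` and `alpha[0]` reduce to bias terms here, matching the statement that the first dimension's parameters depend on no previous variables.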
Calculating the probability density requires the determinant of the Jacobian matrix for this transformation. Because each $z_i$ depends only on $x_1, \dots, x_i$, the Jacobian matrix $\partial z / \partial x$ is lower triangular. The diagonal elements are simply the derivatives of $z_i$ with respect to $x_i$:

$$\frac{\partial z_i}{\partial x_i} = \exp(-\alpha_i)$$
The determinant of a lower triangular matrix is the product of its diagonal elements. Therefore, the log determinant of the Jacobian is extremely efficient to calculate:

$$\log \left| \det \frac{\partial z}{\partial x} \right| = -\sum_{i=1}^{D} \alpha_i$$
This simple sum is a significant improvement over the $O(D^3)$ cost of computing the determinant of a dense Jacobian matrix. It allows MAF to scale effectively to high-dimensional datasets.
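Putting the transformation and the log-determinant together, the exact log-density of a single MAF layer can be sketched as follows. The function name `maf_log_prob` and the `made` callable are assumptions for illustration; `made` is expected to return all $\mu_i$ and $\alpha_i$ in one pass, as in the sketch above.

```python
import numpy as np

def maf_log_prob(x, made):
    """Exact log p(x) for one MAF layer with a standard-normal base.

    `made` is assumed to be a callable returning (mu, alpha) in a single
    pass, with mu_i and alpha_i depending only on x_1, ..., x_{i-1}.
    """
    mu, alpha = made(x)
    z = (x - mu) * np.exp(-alpha)                          # forward transformation
    log_base = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))   # log N(z; 0, I)
    log_det = -np.sum(alpha)                               # log|det dz/dx| = -sum_i alpha_i
    return log_base + log_det
```

With the toy `made` from the earlier sketch, `maf_log_prob(x, made)` evaluates the exact log-likelihood with a single network pass, which is what makes maximum likelihood training fast.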
Figure: Execution flow comparison between density estimation and sampling operations in Masked Autoregressive Flows.
While MAF is highly optimized for density estimation, it faces a significant limitation during the sampling phase. To generate new data, we must compute the inverse transformation. We start by sampling $z$ from our base distribution, typically a standard Gaussian, and then solve for $x$:

$$x_i = z_i \cdot \exp(\alpha_i) + \mu_i$$
The parameters $\mu_i$ and $\alpha_i$ are generated by the MADE network, which requires the previous data dimensions $x_{1:i-1}$ as input. To generate $x_1$, we need only the initial parameters $\mu_1$ and $\alpha_1$, which depend on no previous variables. However, to generate $x_2$, we must first compute $x_1$, feed it back into the MADE network to get $\mu_2$ and $\alpha_2$, and then compute $x_2$.
This creates a sequential dependency loop. We must pass data through the MADE network $D$ times to generate a single $D$-dimensional sample. For a 1024-dimensional image, this requires 1024 sequential network passes, making sampling operations very slow.
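The sequential nature of sampling is easiest to see in code. The sketch below assumes the same kind of `made` callable as in the earlier examples; the key point is that the network must be re-run inside the loop, once per dimension.

```python
import numpy as np

def maf_sample(made, D, rng=None):
    """Draw one sample by inverting the MAF layer dimension by dimension.

    `made` is assumed to return (mu, alpha) for the current partial x; at
    step i only mu[i] and alpha[i] are valid, because they depend solely on
    the dimensions already filled in. This is why D sequential passes are
    required.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.normal(size=D)                       # sample from the standard-normal base
    x = np.zeros(D)
    for i in range(D):
        mu, alpha = made(x)                      # full pass, but only entry i is usable yet
        x[i] = z[i] * np.exp(alpha[i]) + mu[i]   # x_i = z_i * exp(alpha_i) + mu_i
    return x
```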
A single MAF layer applies an autoregressive transformation according to a specific ordering of the variables. For example, $x_D$ might depend on $x_1$ through $x_{D-1}$, but $x_1$ depends on nothing. If we leave this ordering unchanged, $x_1$ will only ever be a simple marginal distribution modeled directly from $z_1$, limiting the model's capacity.
To build a complete MAF model, we stack multiple autoregressive layers and permute the order of variables between each layer. Reversing the order of variables is a common strategy. If the first layer uses the standard order $(x_1, x_2, \dots, x_D)$, the second layer uses $(x_D, x_{D-1}, \dots, x_1)$. This ensures that variables with simple conditional distributions in the first layer receive complex conditional distributions in subsequent layers. By stacking multiple layers with alternating variable orders, the model can capture highly complex dependencies across all dimensions.
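A sketch of how stacking and order reversal fit together is shown below, continuing the same conventions; the list of `made`-style callables is an assumption for illustration. Because reversing the order is a fixed permutation, it contributes nothing to the log-determinant, so only the per-layer scale terms accumulate.

```python
import numpy as np

def stacked_maf_log_prob(x, made_layers):
    """Log-density under a stack of MAF layers, reversing the variable
    order between layers. Each element of `made_layers` is assumed to be
    a callable returning (mu, alpha) from one pass over its input.
    """
    log_det_total = 0.0
    h = x
    for k, made in enumerate(made_layers):
        if k > 0:
            h = h[::-1]                      # permute: reverse variable order between layers
        mu, alpha = made(h)
        h = (h - mu) * np.exp(-alpha)        # forward pass of this layer
        log_det_total += -np.sum(alpha)      # accumulate log|det| contributions
    log_base = -0.5 * np.sum(h**2 + np.log(2.0 * np.pi))   # log N(z; 0, I) at the final output
    return log_base + log_det_total
```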
MAF is an excellent choice for maximum likelihood estimation because the forward pass is evaluated in a single step. You can compute the exact log-likelihood of your training data very efficiently, leading to fast and stable training. The trade-off is the sequential and computationally expensive sampling procedure. If your primary goal is generating high-quality synthetic data quickly, you need a model that optimizes the inverse pass.