Evaluating the joint probability of an autoregressive model using traditional sequential networks requires computing each conditional probability one step at a time. A 100-dimensional dataset requires 100 sequential operations per data point. This sequential dependency creates a significant computational bottleneck, making training slow on modern hardware designed for parallel execution.
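Concretely, the quantity being evaluated is the autoregressive factorization of the joint density:

$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid x_1, \dots, x_{i-1})$$

where $D$ is the data dimensionality, so a sequential model must compute $D$ conditionals one after another.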
The Masked Autoencoder for Distribution Estimation (MADE) offers an elegant mathematical solution to this problem. Instead of using recurrent loops, MADE modifies a standard feedforward neural network to process all dimensions simultaneously in a single forward pass. It achieves this by applying binary masks to the weight matrices of the network. These masks zero out specific connections, ensuring that the prediction for dimension $x_i$ depends only on dimensions $x_j$ with $j$ strictly less than $i$.
In a standard dense layer, every output node is connected to every input node. This fully connected structure violates the autoregressive property because the output representing $x_i$ would have access to information from $x_i$ itself and from future dimensions like $x_{i+1}$.
MADE fixes this by assigning an integer, called a degree, to every node in the network. The degree determines which connections are permitted and which must be severed.
The assignment follows specific rules:
- Each input node $x_i$ is assigned degree $i$, for $i = 1, \dots, D$.
- Each hidden node receives a degree $m(k)$ from $\{1, \dots, D-1\}$, for example by sampling uniformly or by cycling through the values so that every degree appears.
- Each output node is assigned the degree of the dimension it predicts, so the output for $x_i$ has degree $i$.
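As a minimal sketch, degrees for a small network might be assigned as follows (the dimensionality D and hidden widths are illustrative values, not fixed choices):

import torch

D = 4                    # data dimensionality (illustrative)
hidden_sizes = [8, 8]    # illustrative hidden layer widths

# Input degrees: 1..D, one per input dimension
degrees = [torch.arange(1, D + 1)]

# Hidden degrees: cycle through 1..D-1 so every value is represented
for h in hidden_sizes:
    degrees.append(torch.arange(h) % (D - 1) + 1)

# Output degrees mirror the input degrees (one per output dimension)
degrees.append(torch.arange(1, D + 1))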
Once degrees are assigned, we construct binary mask matrices for the network weights. For a connection from node $k$ in layer $l-1$ to node $k'$ in layer $l$, the mask entry is defined by comparing their degrees.
For connections between hidden layers, information can flow to nodes with an equal or higher degree. The mask is computed as:

$$M^{l}_{k',k} = \mathbf{1}\!\left[\, m^{l}(k') \ge m^{l-1}(k) \,\right]$$
For connections to the final output layer, the condition becomes strictly greater than. This ensures the output for $x_d$ does not receive information about itself:

$$M^{\text{out}}_{d,k} = \mathbf{1}\!\left[\, d > m^{L}(k) \,\right]$$
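A sketch of how these comparisons translate into mask matrices, assuming the degrees list from the snippet above (shapes follow PyTorch's out_features x in_features weight layout):

# Hidden-layer masks: connection allowed when the destination degree
# is greater than or equal to the source degree.
masks = [
    (degrees[l + 1].unsqueeze(1) >= degrees[l].unsqueeze(0)).float()
    for l in range(len(degrees) - 2)
]

# Output-layer mask: strictly greater than, so x_d never sees itself.
masks.append(
    (degrees[-1].unsqueeze(1) > degrees[-2].unsqueeze(0)).float()
)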
During the forward pass, the weight matrix is multiplied element-wise by the mask before the linear transformation is computed:

$$\mathbf{h} = g\!\left( (W \odot M)\,\mathbf{x} + \mathbf{b} \right)$$

where $\odot$ denotes element-wise multiplication and $g$ is the layer's activation function.
Directed graph of a simple MADE architecture. The binary masks restrict information flow so that the output for $x_3$ relies only on hidden nodes carrying information from $x_1$ and $x_2$. The output for $x_1$ depends on no inputs, effectively modeling the marginal distribution $p(x_1)$.
To build a MADE model, the first step is creating a custom PyTorch module that applies these binary masks to a standard linear layer. Subclassing torch.nn.Linear allows us to utilize optimized native operations while injecting our masking logic.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        # Register the mask as a buffer so PyTorch manages its device placement
        # without treating it as a trainable parameter.
        self.register_buffer('mask', torch.ones(out_features, in_features))

    def set_mask(self, mask):
        """Updates the binary mask for this layer."""
        self.mask.data.copy_(mask)

    def forward(self, x):
        # Apply the mask element-wise to the weight matrix
        masked_weight = self.weight * self.mask
        return F.linear(x, masked_weight, self.bias)
By registering the mask as a buffer, PyTorch automatically moves it to the GPU when you call .to('cuda') on the model, but the optimizer knows not to update it during backpropagation. To complete a MADE network, you would stack multiple MaskedLinear layers, assign degrees to each layer's nodes, generate the boolean matrices using the degree comparison rules, and call the set_mask method for each layer.
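Putting these pieces together, a minimal MADE built from MaskedLinear layers might look like the following sketch. The MADE class name, layer sizes, and the helper logic for degrees and masks are illustrative choices, not a fixed API:

class MADE(nn.Module):
    def __init__(self, D, hidden_sizes, params_per_dim=2):
        super().__init__()
        sizes = [D] + hidden_sizes + [D * params_per_dim]
        # Assign degrees: inputs get 1..D, hidden nodes cycle over 1..D-1,
        # outputs repeat 1..D once per parameter (e.g. mean and log-std).
        degrees = [torch.arange(1, D + 1)]
        for h in hidden_sizes:
            degrees.append(torch.arange(h) % (D - 1) + 1)
        degrees.append(torch.arange(1, D + 1).repeat(params_per_dim))

        layers = []
        for i in range(len(sizes) - 1):
            layer = MaskedLinear(sizes[i], sizes[i + 1])
            is_output = (i == len(sizes) - 2)
            if is_output:
                # Output layer uses the strict inequality.
                mask = (degrees[i + 1].unsqueeze(1) > degrees[i].unsqueeze(0)).float()
            else:
                mask = (degrees[i + 1].unsqueeze(1) >= degrees[i].unsqueeze(0)).float()
            layer.set_mask(mask)
            layers.append(layer)
            if not is_output:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)
        self.D = D

    def forward(self, x):
        # Returns (mu, log_sigma), each of shape (batch, D)
        out = self.net(x)
        return out.chunk(2, dim=-1)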
In the context of normalizing flows, the MADE network does not directly output the probability values. Instead, it outputs the parameters of the base distribution. For example, if you assume each conditional distribution is Gaussian, the MADE network will output two values for every dimension $i$: the mean $\mu_i$ and the log-standard deviation $\log \sigma_i$.
Because the network produces all parameters in a single forward pass, calculating the log-likelihood of a training batch becomes highly efficient. You pass the data $\mathbf{x}$ into the network, receive $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}$, and evaluate the probability density of $\mathbf{x}$ using those parameters.
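Under the Gaussian assumption above, the batch log-likelihood can be computed in a single pass. This sketch uses the illustrative MADE class defined earlier:

import math

def gaussian_log_likelihood(model, x):
    # One forward pass yields the parameters for every conditional at once.
    mu, log_sigma = model(x)
    # log N(x; mu, sigma^2), summed over dimensions, averaged over the batch.
    log_prob = (-0.5 * ((x - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    return log_prob.sum(dim=-1).mean()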
This single-pass property makes MADE an excellent building block for density estimation tasks. However, generating new samples requires an inverse pass. Sampling is strictly sequential because you must first sample $x_1$, feed it back into the network to obtain the parameters for $x_2$, sample $x_2$, and repeat. This characteristic defines the trade-offs you will encounter when selecting specific autoregressive flow architectures.
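A sketch of this sequential sampling loop, again assuming the illustrative Gaussian-parameter MADE defined above (and torch already imported):

@torch.no_grad()
def sample(model, num_samples):
    x = torch.zeros(num_samples, model.D)
    # Each dimension requires a full forward pass, so sampling takes D passes.
    for i in range(model.D):
        mu, log_sigma = model(x)
        # Only dimension i is filled in at this step; earlier dimensions stay fixed.
        x[:, i] = mu[:, i] + log_sigma[:, i].exp() * torch.randn(num_samples)
    return x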