Evaluating the joint probability of an autoregressive model using traditional sequential networks requires computing each conditional probability one step at a time. A 100-dimensional dataset requires 100 sequential operations per data point. This sequential dependency creates a significant computational bottleneck, making training slow on modern hardware designed for parallel execution.
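Concretely, the quantity being evaluated is the autoregressive factorization of the joint density:

$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid x_1, \dots, x_{i-1})$$

where $D$ is the data dimensionality, so a sequential model must compute $D$ conditionals one after another.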
The Masked Autoencoder for Distribution Estimation (MADE) offers an elegant mathematical solution to this problem. Instead of using recurrent loops, MADE modifies a standard feedforward neural network to process all dimensions simultaneously in a single forward pass. It achieves this by applying binary masks to the weight matrices of the network. These masks zero out specific connections, ensuring that the prediction for dimension $x_i$ depends only on dimensions $x_j$ with $j$ strictly less than $i$.
In a standard dense layer, every output node is connected to every input node. This fully connected structure violates the autoregressive property because the output representing $x_i$ would have access to information from $x_i$ itself and from future dimensions like $x_{i+1}$.
MADE fixes this by assigning an integer, called a degree, to every node in the network. The degree determines which connections are permitted and which must be severed.
The assignment follows specific rules:
- Each input node $x_i$ is assigned degree $i$, for $i = 1, \dots, D$.
- Each hidden node receives a degree $m(k)$ from $\{1, \dots, D-1\}$, for example by sampling uniformly or by cycling through the values so that every degree appears.
- Each output node is assigned the degree of the dimension it predicts, so the output for $x_i$ has degree $i$.
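As a minimal sketch, degrees for a small network might be assigned as follows (the dimensionality D and hidden widths are illustrative values, not fixed choices):

import torch

D = 4                    # data dimensionality (illustrative)
hidden_sizes = [8, 8]    # illustrative hidden layer widths

# Input degrees: 1..D, one per input dimension
degrees = [torch.arange(1, D + 1)]

# Hidden degrees: cycle through 1..D-1 so every value is represented
for h in hidden_sizes:
    degrees.append(torch.arange(h) % (D - 1) + 1)

# Output degrees mirror the input degrees (one per output dimension)
degrees.append(torch.arange(1, D + 1))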
Once degrees are assigned, we construct binary mask matrices for the network weights. For a connection from node $k$ in layer $l-1$ to node $k'$ in layer $l$, the mask entry is defined by comparing their degrees.
For connections between hidden layers, information can flow to nodes with an equal or higher degree. The mask is computed as:

$$M^{l}_{k',k} = \mathbf{1}\!\left[\, m^{l}(k') \ge m^{l-1}(k) \,\right]$$
For connections to the final output layer, the condition becomes strictly greater than. This ensures the output for $x_d$ does not receive information about itself:

$$M^{\text{out}}_{d,k} = \mathbf{1}\!\left[\, d > m^{L}(k) \,\right]$$
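A sketch of how these comparisons translate into mask matrices, assuming the degrees list from the snippet above (shapes follow PyTorch's out_features x in_features weight layout):

# Hidden-layer masks: connection allowed when the destination degree
# is greater than or equal to the source degree.
masks = [
    (degrees[l + 1].unsqueeze(1) >= degrees[l].unsqueeze(0)).float()
    for l in range(len(degrees) - 2)
]

# Output-layer mask: strictly greater than, so x_d never sees itself.
masks.append(
    (degrees[-1].unsqueeze(1) > degrees[-2].unsqueeze(0)).float()
)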
During the forward pass, the weight matrix is multiplied element-wise by the mask before the linear transformation is computed:

$$\mathbf{h} = g\!\left( (W \odot M)\,\mathbf{x} + \mathbf{b} \right)$$

where $\odot$ denotes element-wise multiplication and $g$ is the layer's activation function.
Directed graph of a simple MADE architecture. The binary masks restrict information flow so that the output for $x_3$ relies only on hidden nodes carrying information from $x_1$ and $x_2$. The output for $x_1$ depends on no inputs, effectively modeling the marginal distribution $p(x_1)$.
To build a MADE model, the first step is creating a custom PyTorch module that applies these binary masks to a standard linear layer. Subclassing torch.nn.Linear allows us to utilize optimized native operations while injecting our masking logic.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        # Register the mask as a buffer so PyTorch manages its device placement
        # without treating it as a trainable parameter.
        self.register_buffer('mask', torch.ones(out_features, in_features))

    def set_mask(self, mask):
        """Updates the binary mask for this layer."""
        self.mask.data.copy_(mask)

    def forward(self, x):
        # Apply the mask element-wise to the weight matrix
        masked_weight = self.weight * self.mask
        return F.linear(x, masked_weight, self.bias)
By registering the mask as a buffer, PyTorch automatically moves it to the GPU when you call .to('cuda') on the model, but the optimizer knows not to update it during backpropagation. To complete a MADE network, you would stack multiple MaskedLinear layers, assign degrees to each layer's nodes, generate the boolean matrices using the degree comparison rules, and call the set_mask method for each layer.
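Putting these pieces together, a minimal MADE built from MaskedLinear layers might look like the following sketch. The MADE class name, layer sizes, and the helper logic for degrees and masks are illustrative choices, not a fixed API:

class MADE(nn.Module):
    def __init__(self, D, hidden_sizes, params_per_dim=2):
        super().__init__()
        sizes = [D] + hidden_sizes + [D * params_per_dim]
        # Assign degrees: inputs get 1..D, hidden nodes cycle over 1..D-1,
        # outputs repeat 1..D once per parameter (e.g. mean and log-std).
        degrees = [torch.arange(1, D + 1)]
        for h in hidden_sizes:
            degrees.append(torch.arange(h) % (D - 1) + 1)
        degrees.append(torch.arange(1, D + 1).repeat(params_per_dim))

        layers = []
        for i in range(len(sizes) - 1):
            layer = MaskedLinear(sizes[i], sizes[i + 1])
            is_output = (i == len(sizes) - 2)
            if is_output:
                # Output layer uses the strict inequality.
                mask = (degrees[i + 1].unsqueeze(1) > degrees[i].unsqueeze(0)).float()
            else:
                mask = (degrees[i + 1].unsqueeze(1) >= degrees[i].unsqueeze(0)).float()
            layer.set_mask(mask)
            layers.append(layer)
            if not is_output:
                layers.append(nn.ReLU())
        self.net = nn.Sequential(*layers)
        self.D = D

    def forward(self, x):
        # Returns (mu, log_sigma), each of shape (batch, D)
        out = self.net(x)
        return out.chunk(2, dim=-1)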
In the context of normalizing flows, the MADE network does not directly output the probability values. Instead, it outputs the parameters of the base distribution. For example, if you assume each conditional distribution is Gaussian, the MADE network will output two values for every dimension $i$: the mean $\mu_i$ and the log-standard deviation $\log \sigma_i$.
Because the network produces all parameters in a single forward pass, calculating the log-likelihood of a training batch becomes highly efficient. You pass the data $\mathbf{x}$ into the network, receive $\boldsymbol{\mu}$ and $\log \boldsymbol{\sigma}$, and evaluate the probability density of $\mathbf{x}$ using those parameters.
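Under the Gaussian assumption above, the batch log-likelihood can be computed in a single pass. This sketch uses the illustrative MADE class defined earlier:

import math

def gaussian_log_likelihood(model, x):
    # One forward pass yields the parameters for every conditional at once.
    mu, log_sigma = model(x)
    # log N(x; mu, sigma^2), summed over dimensions, averaged over the batch.
    log_prob = (-0.5 * ((x - mu) / log_sigma.exp()) ** 2
                - log_sigma - 0.5 * math.log(2 * math.pi))
    return log_prob.sum(dim=-1).mean()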
This single-pass property makes MADE an excellent building block for density estimation tasks. However, generating new samples requires an inverse pass. Sampling is strictly sequential because you must first sample $x_1$, feed it back into the network to obtain the parameters for $x_2$, sample $x_2$, and repeat. This characteristic defines the trade-offs you will encounter when selecting specific autoregressive flow architectures.
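A sketch of this sequential sampling loop, again assuming the illustrative Gaussian-parameter MADE defined above (and torch already imported):

@torch.no_grad()
def sample(model, num_samples):
    x = torch.zeros(num_samples, model.D)
    # Each dimension requires a full forward pass, so sampling takes D passes.
    for i in range(model.D):
        mu, log_sigma = model(x)
        # Only dimension i is filled in at this step; earlier dimensions stay fixed.
        x[:, i] = mu[:, i] + log_sigma[:, i].exp() * torch.randn(num_samples)
    return x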