Autoregressive models factorize the joint distribution, which guarantees a valid probability distribution and, when these models are used inside normalizing flows, yields triangular Jacobians whose determinants are cheap to compute. In PyTorch, the core component of these architectures is the masked linear layer. This building block of the Masked Autoencoder for Distribution Estimation (MADE) applies a binary mask to the weights of a standard fully connected layer, enforcing the autoregressive property directly through matrix multiplication rather than sequential loops.
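Concretely, the factorization in question is the chain rule of probability applied under a fixed ordering of the $D$ variables:

$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid x_1, \dots, x_{i-1})$$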
To implement this, we need a matrix of zeros and ones. A value of one indicates an allowed connection, while a zero indicates a blocked connection. For an autoregressive model processing a sequence of variables $x_1, \dots, x_D$, the output predicting $x_i$ can only depend on the previous inputs $x_1$ through $x_{i-1}$. This strict dependency naturally forms a lower-triangular matrix structure.
Heatmap of a strictly lower-triangular binary mask for an input layer. Blue squares represent active weights and gray squares represent masked weights.
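The same pattern can be printed directly. As a quick sketch using torch.tril (the same utility the mask generator later in this section relies on), a strictly lower-triangular mask over four variables looks like this:

import torch

# Entry (i, j) is 1 only when j < i, so output i may see inputs 1 .. i-1.
print(torch.tril(torch.ones(4, 4), diagonal=-1))
# tensor([[0., 0., 0., 0.],
#         [1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.]])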
We will write a custom PyTorch module that inherits from torch.nn.Linear. The implementation requires registering the binary mask as a buffer. In PyTorch, a buffer is a tensor that is part of the module's state but is not treated as a learnable parameter. This ensures the mask is moved to the appropriate device, such as a GPU, along with the model weights, but remains completely unchanged by the optimizer during the training process.
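As a brief aside, the following small sketch (independent of the layer we build below) shows this buffer behavior with standard PyTorch calls: a registered buffer appears in the module's state_dict but not among its trainable parameters.

import torch
import torch.nn as nn

class BufferDemo(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(3, 3))               # learnable
        self.register_buffer('mask', torch.tril(torch.ones(3, 3)))  # fixed

demo = BufferDemo()
print([name for name, _ in demo.named_parameters()])  # ['weight'] only
print(list(demo.state_dict().keys()))                 # includes 'mask'
# Calling demo.to(device) moves the buffer together with the parameter,
# but an optimizer built from demo.parameters() never updates the mask.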
The linear transformation computes the output as:

$$\mathbf{y} = \mathbf{x}\,\hat{W}^\top + \mathbf{b}$$

where the masked weight matrix $\hat{W}$ is calculated by taking the Hadamard product (element-wise multiplication) of the learnable weights $W$ and the binary mask $M$:

$$\hat{W} = W \odot M$$
Here is the implementation of the masked linear layer:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Linear):
    def __init__(self, in_features, out_features, mask, bias=True):
        # Initialize the standard linear layer properties
        super().__init__(in_features, out_features, bias)
        # Register the mask as a persistent buffer
        self.register_buffer('mask', mask)

    def forward(self, x):
        # Multiply weights by the mask element-wise
        masked_weight = self.weight * self.mask
        # Perform the linear transformation using the masked weights
        return F.linear(x, masked_weight, self.bias)
Next, we need a function to generate the mask tensor itself. For the very first layer of the network (the input layer), the mask must be strictly lower-triangular. The diagonal must be set to zero because an output variable cannot depend on its corresponding input variable. For subsequent hidden layers, the mask can be standard lower-triangular, allowing nodes with equal autoregressive degrees to connect.
def create_autoregressive_mask(in_features, out_features, is_input_layer=False):
    """
    Generates a lower-triangular binary mask for autoregressive models.
    """
    if is_input_layer:
        # Strictly lower-triangular (diagonal is 0)
        mask = torch.tril(torch.ones(out_features, in_features), diagonal=-1)
    else:
        # Standard lower-triangular (diagonal is 1)
        mask = torch.tril(torch.ones(out_features, in_features), diagonal=0)
    return mask
Let us verify the behavior of our custom layer. We will define a small dimensionality, generate an input mask, and inspect the resulting layer weights. By multiplying the raw randomly initialized weights by the mask, we can confirm the autoregressive constraints are active.
# Define input and output dimensions
D = 5
# Create the mask for the input layer
mask = create_autoregressive_mask(in_features=D, out_features=D, is_input_layer=True)
# Instantiate the masked linear layer
masked_layer = MaskedLinear(in_features=D, out_features=D, mask=mask)
# Print the masked weights to confirm the upper triangle and diagonal are zeroed out
print("Effective Weights:\n", masked_layer.weight * masked_layer.mask)
With the MaskedLinear component complete, you have the building blocks for an entire MADE architecture. By stacking multiple masked layers and separating them with non-linear activation functions such as ReLU, you can model highly complex joint probability distributions, as sketched below. In practice, hidden layers in a full MADE model require assigning a degree (a connectivity index) to each neuron so that the strict ordering is preserved across multiple transformations. This stacked architecture serves as the underlying density estimator for both Masked Autoregressive Flow (MAF) and Inverse Autoregressive Flow (IAF) models.
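As a sketch of that stacking idea (not the full MADE construction: it reuses the simplified lower-triangular masks from this section and therefore assumes every layer keeps the same width D), a small network could look like this:

# Minimal MADE-style stack, assuming hidden layers of width D so the simple
# lower-triangular masks remain valid; a full MADE assigns per-neuron degrees
# and supports arbitrary hidden sizes.
input_mask = create_autoregressive_mask(D, D, is_input_layer=True)
hidden_mask = create_autoregressive_mask(D, D, is_input_layer=False)

made_sketch = nn.Sequential(
    MaskedLinear(D, D, input_mask),    # input -> hidden, strictly lower-triangular
    nn.ReLU(),
    MaskedLinear(D, D, hidden_mask),   # hidden -> hidden, lower-triangular
    nn.ReLU(),
    MaskedLinear(D, D, hidden_mask),   # hidden -> output, lower-triangular
)

out = made_sketch(torch.randn(8, D))
print(out.shape)  # torch.Size([8, 5]); output i depends only on inputs 1..i-1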