Optimizing a neural network in PyTorch requires translating the exact log-likelihood equation into a computable loss function. Since deep learning frameworks are designed to minimize objectives using gradient descent, the maximum likelihood estimation problem is converted into a minimization problem by taking the negative of the log-likelihood. This yields the Negative Log-Likelihood (NLL) loss.
For a single data point $x$, an invertible transformation $z = f(x)$, and a base distribution $p_Z$, the mathematical formulation for the negative log-likelihood is:

$$\text{NLL}(x) = -\log p_X(x) = -\log p_Z(f(x)) - \log\left|\det\frac{\partial f(x)}{\partial x}\right|$$
This loss function consists of two distinct components that the model must balance during training. The first term is the log-probability of the transformed data point under the base distribution. The second term is the log-determinant of the Jacobian, which accounts for the change in volume caused by the transformation.
Let us examine the base distribution term first. In practice, we almost always choose a standard multivariate Gaussian for our base distribution $p_Z$. For a $D$-dimensional vector $z$, the log-density of a standard normal distribution is calculated as:

$$\log p_Z(z) = -\frac{D}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{D} z_i^2$$
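As a quick sanity check, the short sketch below compares this closed-form expression against torch.distributions.Normal for a randomly drawn vector; the dimensionality D = 4 and the sample z are chosen purely for illustration.

```python
import math
import torch

# Sanity check: the closed-form log-density of a D-dimensional standard
# normal should match the sum of per-dimension log-probabilities from
# torch.distributions. D and z here are arbitrary illustrative values.
D = 4
z = torch.randn(D)

# log p(z) = -(D/2) * log(2*pi) - 0.5 * sum(z_i^2)
log_p_manual = -0.5 * D * math.log(2 * math.pi) - 0.5 * torch.sum(z ** 2)

# Reference value: D independent standard normals
log_p_reference = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum()

print(torch.allclose(log_p_manual, log_p_reference))  # True
```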
Minimizing the negative of this term encourages the network to map data points close to the origin in the latent space. If we only optimized this specific term, the network would lazily map every input to $z = 0$ to achieve the lowest possible loss.
This is where the log-determinant of the Jacobian acts as an exact regularizer against volume collapse. If the network tries to compress the entire input space into a single point, the log-determinant term will penalize the model heavily because the computed volume of the space is shrinking drastically. The network is forced to find a mapping that distributes the data points according to a Gaussian shape without overlapping them into a single location.
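To make this interplay concrete, consider a toy flow that simply rescales its input by a factor $s$, so $z = s \cdot x$ and $\log|\det J| = D \log s$. The sketch below uses made-up values to show that as $s$ shrinks toward zero, the gain in $\log p_Z(z)$ saturates while the log-determinant diverges to negative infinity, so collapsing everything onto the origin is penalized rather than rewarded.

```python
import math
import torch

# Toy illustration (values are arbitrary): a pure scaling flow z = s * x.
# Its Jacobian is s * I, so log|det J| = D * log(s).
D = 2
x = torch.tensor([1.0, -0.5])

for s in [1.0, 0.1, 0.01]:
    z = s * x
    log_p_z = -0.5 * D * math.log(2 * math.pi) - 0.5 * torch.sum(z ** 2)
    log_det_J = D * math.log(s)
    total = log_p_z + log_det_J
    # As s -> 0, log p(z) approaches its maximum but log|det J| -> -inf,
    # so the total log-likelihood gets worse, not better.
    print(f"s={s:5.2f}  log p(z)={log_p_z:7.3f}  log|det J|={log_det_J:8.3f}  total={total:8.3f}")
```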
Forward pass computation graph for the negative log-likelihood loss.
When implementing this in PyTorch, your normalizing flow architecture should be designed so that the forward pass returns both the transformed output tensor and a 1D tensor containing the accumulated log-determinant of the Jacobian for each sample in the batch.
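A minimal sketch of that interface is shown below. The AffineFlow class and its parameterization are hypothetical placeholders rather than a specific library API; a real flow would stack many layers, but each one follows the same contract of returning the transformed tensor together with a per-sample log-determinant.

```python
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """Hypothetical single-layer flow: z = x * exp(log_scale) + shift."""
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = x * torch.exp(self.log_scale) + self.shift
        # The Jacobian is diagonal, so log|det J| is the sum of the log
        # scales, replicated for every sample in the batch.
        log_det_J = self.log_scale.sum().expand(x.shape[0])
        return z, log_det_J
```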
Here is how you can write a custom PyTorch module to compute the negative log-likelihood loss:
```python
import torch
import torch.nn as nn
import math


class FlowLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, z, log_det_J):
        # z shape: (batch_size, D)
        # log_det_J shape: (batch_size,)
        D = z.shape[1]

        # Compute log probability of standard normal
        # log p(z) = -(D/2) * log(2*pi) - 0.5 * sum(z^2)
        log_p_z = -0.5 * D * math.log(2 * math.pi) - 0.5 * torch.sum(z ** 2, dim=1)

        # Compute the exact log-likelihood
        log_likelihood = log_p_z + log_det_J

        # Return the mean negative log-likelihood across the batch
        loss = -torch.mean(log_likelihood)
        return loss
```
Notice the argument dim=1 in the torch.sum operation. Because we process data in batches, the variable z has the shape (batch_size, D). We sum the squared values across the feature dimension to calculate the independent log-probability for each individual sample in the batch. The variable log_det_J is expected to have the shape (batch_size,), representing the accumulated volume change for each respective sample.
We sum these two tensors to get the final log-likelihood for every sample. Finally, we take the mean across the batch using -torch.mean(...). Using the mean instead of the sum ensures that your learning rate remains stable even if you modify the batch size later in the training process.
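Putting the pieces together, a single training step might look like the following sketch. The flow module and the data here are placeholders (the hypothetical AffineFlow from earlier); the only requirement is that the forward pass returns (z, log_det_J) in the shapes described above.

```python
flow = AffineFlow(dim=2)                # any module returning (z, log_det_J)
criterion = FlowLoss()
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

batch = torch.randn(64, 2)              # placeholder data of shape (batch_size, D)

z, log_det_J = flow(batch)
loss = criterion(z, log_det_J)

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```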
While the raw negative log-likelihood is sufficient for gradient descent optimization, its exact value depends heavily on the dimensionality of the input data. When evaluating flow models on high-dimensional datasets like digital images, researchers report the loss using a standardized metric called Bits Per Dimension (BPD). To calculate bits per dimension, we divide the negative log-likelihood by the total number of dimensions and convert the natural logarithm to base 2:

$$\text{BPD}(x) = \frac{-\log p_X(x)}{D \cdot \ln 2}$$
This equation provides a normalized metric to compare the density estimation performance of different models across datasets with varying sizes and resolutions. Lower bits per dimension indicate a better fit to the underlying data distribution.
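A small helper for this conversion might look like the sketch below, assuming the negative log-likelihood is measured in nats per sample, as returned by FlowLoss; the function name and the CIFAR-10 figure in the comment are illustrative.

```python
import math

def bits_per_dim(nll, num_dims):
    # nll: mean negative log-likelihood per sample, in nats
    # num_dims: total dimensions per sample, e.g. 3 * 32 * 32 = 3072 for a CIFAR-10 image
    return nll / (num_dims * math.log(2))
```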