Training a generative model requires a reliable method to measure how well it approximates the true distribution of the data. Many popular generative frameworks, such as Generative Adversarial Networks or Variational Autoencoders, rely on implicit generation techniques or approximate lower bounds to train their networks. Normalizing flows offer a significant mathematical advantage in this area. Because the transformations in a flow are strictly invertible and exact volume changes can be tracked through the Jacobian determinant, normalizing flows allow for exact density evaluation. This property means these models can be trained directly using exact maximum likelihood estimation.
Maximum likelihood estimation is a standard statistical method used to find the parameters of a probability distribution that best explain the observed data. The objective is to identify the specific weights and biases for our neural network that maximize the probability of generating the training set. Because we map data points into a continuous space, maximizing this probability is equivalent to maximizing the probability density assigned to the training examples.
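Concretely, if $p_\theta$ denotes the density our model assigns to a point, maximum likelihood estimation seeks the parameters

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{N} p_\theta(x_i) = \arg\max_{\theta} \sum_{i=1}^{N} \log p_\theta(x_i),$$

where the logarithm turns the product over $N$ independent samples into a sum that is numerically stable and convenient to differentiate.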
To apply maximum likelihood estimation, we evaluate the log-likelihood of our data points under the model. Let $x$ be a single data point from our training set, $f_\theta$ be the invertible transformation parameterized by our neural network, and $p_z$ be the simple base probability distribution. Based on the change of variables theorem, the exact log-likelihood of observing $x$ is defined as:

$$\log p_x(x) = \log p_z(f_\theta(x)) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|$$
This equation contains two distinct parts that work together during training. The first term, $\log p_z(f_\theta(x))$, measures the log-likelihood of the transformed data point under the base distribution. If we use a standard multivariate Gaussian as our base distribution, this term is simply the Gaussian log-density evaluated at $f_\theta(x)$. It penalizes the network if it maps the input data far away from the origin in the latent space. The network learns to map complex input patterns into a neat, standardized distribution.
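To see why distance from the origin is penalized, consider a standard Gaussian base distribution in $D$ dimensions, whose log-density is

$$\log p_z(z) = -\frac{1}{2} \|z\|^2 - \frac{D}{2} \log(2\pi).$$

The quadratic term $-\frac{1}{2}\|z\|^2$ shrinks the log-likelihood the farther $z = f_\theta(x)$ lands from the origin.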
The second term, $\log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|$, is the log absolute Jacobian determinant. This term acts as a volume correction factor. It measures how much the transformation $f_\theta$ stretches or compresses the space around the data point $x$. Without this term, the network could trivially maximize the first term by aggressively shrinking all data points into a tiny region near the origin of the base distribution. The log Jacobian determinant penalizes excessive compression, ensuring that probability mass is conserved and the density mapping remains valid.
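A one-dimensional toy calculation makes this tension concrete. The sketch below uses a hypothetical pure-scaling flow $z = s \cdot x$, whose Jacobian is the scalar $s$, so its volume correction is $\log|s|$:

```python
import torch

# Toy 1D flow z = s * x (illustrative, not a real library layer).
# Its Jacobian is the scalar s, so log|det J| = log|s|.
base = torch.distributions.Normal(0.0, 1.0)
x = torch.tensor(3.0)

for s in (1.0, 0.01):
    z = x * s
    base_term = base.log_prob(z)                      # log p_z(f(x))
    log_det = torch.log(torch.abs(torch.tensor(s)))   # volume correction
    total = base_term + log_det
    print(f"s={s}: base={base_term:.2f}, log|det|={log_det:.2f}, "
          f"log-likelihood={total:.2f}")
# s=1.0  -> base ~ -5.42, log|det| =  0.00, total ~ -5.42
# s=0.01 -> base ~ -0.92, log|det| ~ -4.61, total ~ -5.52
```

Aggressive compression makes the base term much larger, but the log-determinant penalty offsets the gain almost exactly, so collapsing the data toward the origin does not improve the likelihood.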
Deep learning frameworks like PyTorch are designed to minimize loss functions rather than maximize them. We therefore flip the sign of our log-likelihood to create the objective function known as the negative log-likelihood. For a training dataset containing $N$ independent samples $x_1, \dots, x_N$, the training objective is the average negative log-likelihood across all samples:

$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log p_z(f_\theta(x_i)) + \log \left| \det \frac{\partial f_\theta(x_i)}{\partial x_i} \right| \right]$$

In this context, $\theta$ represents the trainable weights of the neural network layers making up the flow. By minimizing $\mathcal{L}(\theta)$ using gradient descent algorithms like Adam, we force the model to assign higher probability density to the training examples.
Figure: Computation graph for calculating the negative log-likelihood loss during the forward pass of a normalizing flow.
When implementing this in code, you process the data in mini-batches. The forward pass of your normalizing flow must return two outputs for every batch. The first output is the transformed latent variable $z$. The second output is the sum of the log Jacobian determinants for all the transformations applied. Normalizing flows are typically constructed by stacking multiple simpler layers. The total log Jacobian determinant for the entire network is simply the sum of the log determinants from each individual layer.
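A minimal sketch of this accumulation pattern, assuming each layer's forward method returns its output together with a per-sample log-determinant (the class and method names are illustrative, not from a specific library):

```python
import torch
import torch.nn as nn

class NormalizingFlow(nn.Module):
    """Stacks invertible layers; each layer is assumed to return
    (z, log_det), where log_det has shape (batch_size,)."""
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        z = x
        log_det_total = torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            z, log_det = layer(z)
            log_det_total = log_det_total + log_det  # sum over layers
        return z, log_det_total
```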
Once you have the final latent variable $z$ and the accumulated log determinant from the network, you pass $z$ to your base distribution object to evaluate its log probability density. You add these two tensors together elementwise to obtain the exact log-likelihood of each sample in the batch. Finally, you compute the mean across the batch dimension and multiply by negative one. The resulting scalar is the final loss value passed to the backpropagation engine to update the network weights.
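Putting the pieces together, one training step might look like the following sketch, where `flow` is the module from the previous snippet, `dim` is the data dimensionality, and `x_batch` is a mini-batch of shape `(batch_size, dim)`; these names are assumptions for illustration:

```python
# Standard multivariate Gaussian base distribution.
base_dist = torch.distributions.MultivariateNormal(
    loc=torch.zeros(dim), covariance_matrix=torch.eye(dim))
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3)

z, log_det = flow(x_batch)                        # forward pass
log_likelihood = base_dist.log_prob(z) + log_det  # per-sample, shape (batch_size,)
loss = -log_likelihood.mean()                     # negative log-likelihood

optimizer.zero_grad()
loss.backward()   # backpropagate through both terms
optimizer.step()  # update the flow parameters
```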