While fixed noise schedules, whether linear, cosine, or custom-designed, provide a predefined path for the diffusion process, they operate under the assumption that a single, predetermined variance trajectory for the reverse steps is optimal. However, the ideal variance $\sigma_t^2$ in the reverse process step $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$ might vary depending on the timestep $t$ or even the specific data instance. Fixing the variance, typically to choices derived from the forward process noise schedule such as $\beta_t$ or $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$, limits the model's flexibility.
This limitation motivates the development of learned variance schedules, a technique in which the diffusion model itself learns to predict the appropriate variance for each reverse step. This approach, notably explored by Nichol and Dhariwal in "Improved Denoising Diffusion Probabilistic Models" (2021), allows the model to dynamically adjust the stochasticity of the reverse process.
Why Learn the Variance?
Learning the variance grants the model more expressive power. Consider these points:
- Optimal Noise Levels: Different stages of the reverse process might benefit from different amounts of noise. Early steps (large $t$) might need larger variance to make significant changes from pure noise, while later steps (small $t$) might need smaller variance for fine detail refinement. A learned variance allows the model to tailor this.
- Improved Likelihood: The original DDPM paper fixed the reverse process variance primarily for simplicity, optimizing a surrogate objective related to noise prediction ($L_{\text{simple}}$). Learning the variance allows for direct optimization related to the variational lower bound (VLB) on the data log-likelihood, potentially leading to models that better capture the true data distribution and achieve superior log-likelihood scores.
- Data-Dependent Adaptation: While typically learned as a function of timestep $t$, the mechanism could potentially adapt based on $x_t$ as well, although this is less common.
Parameterizing and Predicting the Variance
Instead of fixing $\sigma_t^2$, we parameterize it and have the model predict the parameters. A common approach involves interpolating between the two standard fixed choices, $\beta_t$ and $\tilde{\beta}_t$. Recall that $\beta_t$ corresponds to the forward process variance at step $t$, and $\tilde{\beta}_t$ is derived to match the variance of the posterior $q(x_{t-1} \mid x_t, x_0)$ when $x_0$ is known.
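As a concrete reference point, both fixed choices can be computed directly from the forward schedule. The sketch below uses NumPy and a linear schedule; the constants (1000 steps, the $10^{-4}$ to $0.02$ range) are the common DDPM defaults, not requirements.

```python
import numpy as np

# Linear forward schedule (DDPM defaults; any schedule works here)
T = 1000
betas = np.linspace(1e-4, 0.02, T)

alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                   # cumulative product \bar{alpha}_t
alpha_bar_prev = np.append(1.0, alpha_bar[:-1])  # \bar{alpha}_{t-1}, with \bar{alpha}_0 = 1

# Posterior variance: beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * betas

# Note: beta_tilde at t = 1 is exactly zero, so implementations typically
# clip it (e.g., replace it with the t = 2 value) before taking logarithms.
```

Note that $\tilde{\beta}_t < \beta_t$ for all $t > 1$, so the two choices bracket a range of plausible variances.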
The learned variance $\sigma_{\theta,t}^2$ can be parameterized as:

$$\sigma_{\theta,t}^2 = \exp\left(v \log \beta_t + (1 - v) \log \tilde{\beta}_t\right)$$
Here, $v$ is a parameter predicted by the neural network, constrained between 0 and 1. The network architecture (e.g., U-Net or Transformer) is modified to output additional channels representing $v$, usually one component per dimension (matching the shape of the noise prediction) so the variance can vary spatially, alongside the prediction used for the mean $\mu_\theta(x_t, t)$ (which is typically derived from the noise prediction $\epsilon_\theta(x_t, t)$).
The diffusion model takes the noisy input $x_t$ and the timestep $t$ embedding. Its output is split to predict both the noise $\epsilon_\theta$ (determining the reverse process mean) and the variance parameter $v_\theta$.
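A minimal sketch of that split, assuming the model's channel axis has been doubled; the function name, shapes, and sigmoid squashing are illustrative choices, not from a specific codebase:

```python
import numpy as np

def split_and_compute_variance(model_out, betas, beta_tilde, t):
    """Split the doubled model output into the noise prediction and the
    interpolation coefficient v, then form sigma^2_{theta,t}.

    model_out: array of shape (2C, H, W); betas, beta_tilde: per-step schedules.
    Assumes beta_tilde[t] > 0 (i.e., t > 0 or a clipped schedule).
    """
    eps, raw_v = np.split(model_out, 2, axis=0)   # (C, H, W) each
    v = 1.0 / (1.0 + np.exp(-raw_v))              # squash raw output to (0, 1)
    log_var = v * np.log(betas[t]) + (1.0 - v) * np.log(beta_tilde[t])
    return eps, np.exp(log_var)
```

Because $v$ lies in $(0, 1)$, the predicted variance is guaranteed to stay between $\tilde{\beta}_t$ and $\beta_t$, which keeps the interpolation numerically well-behaved.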
Adjusting the Training Objective
When learning the variance, the training objective needs to account for this prediction. The original DDPM $L_{\text{simple}}$ objective only focuses on predicting the noise $\epsilon$. To train the variance prediction $v$, the loss function incorporates a term derived from the VLB, often denoted $L_{\text{vlb}}$. This term directly involves the predicted variance $\sigma_{\theta,t}^2$.
A common practice is to use a hybrid objective:
$$L_{\text{hybrid}} = L_{\text{simple}} + \lambda L_{\text{vlb}}$$
where $L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$ is the standard noise prediction loss, and $L_{\text{vlb}}$ is the term encouraging accurate variance prediction. The hyperparameter $\lambda$ balances the two objectives. Setting $\lambda = 0$ recovers the standard DDPM training with fixed variance. Nichol and Dhariwal found that a small, non-zero $\lambda$ (e.g., $\lambda = 0.001$) worked well, and they additionally applied a stop-gradient to the mean inside $L_{\text{vlb}}$ so that this term trains only the variance prediction, preserving the sample quality benefits of $L_{\text{simple}}$ while gaining the likelihood improvements from $L_{\text{vlb}}$.
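The hybrid objective can be sketched as below. The helper names (`normal_kl`, `hybrid_loss`) are illustrative; the KL term is the standard closed form between diagonal Gaussians, which is the core of $L_{\text{vlb}}$ at intermediate timesteps. Plain NumPy carries no gradients, so the stop-gradient on the mean is noted only as a comment.

```python
import numpy as np

def normal_kl(mean1, logvar1, mean2, logvar2):
    """Elementwise KL( N(mean1, exp(logvar1)) || N(mean2, exp(logvar2)) )."""
    return 0.5 * (logvar2 - logvar1
                  + np.exp(logvar1 - logvar2)
                  + (mean1 - mean2) ** 2 * np.exp(-logvar2)
                  - 1.0)

def hybrid_loss(eps, eps_pred, q_mean, q_logvar, p_mean, p_logvar, lam=0.001):
    """L_hybrid = L_simple + lambda * L_vlb (a sketch).

    q_*: parameters of the true posterior q(x_{t-1} | x_t, x_0);
    p_*: parameters of the model's reverse step, with p_logvar the learned
         log-variance. In a real implementation p_mean is detached
         (stop-gradient) here so L_vlb trains only the variance head.
    """
    l_simple = np.mean((eps - eps_pred) ** 2)
    l_vlb = np.mean(normal_kl(q_mean, q_logvar, p_mean, p_logvar))
    return l_simple + lam * l_vlb
```

A full implementation would also handle the $t = 1$ term (a discretized log-likelihood rather than a KL), which is omitted here for brevity.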
Implementation Considerations
Implementing learned variance involves these primary modifications:
- Model Output: Adjust the final layer of your network (U-Net or Transformer) to output twice the number of channels compared to standard $\epsilon$-prediction. One half represents $\epsilon_\theta$, and the other half represents the parameters needed to compute $\sigma_{\theta,t}^2$ (e.g., the value $v$ used to interpolate between $\beta_t$ and $\tilde{\beta}_t$).
- Loss Function: Implement the hybrid loss $L_{\text{hybrid}}$, calculating both the MSE loss on $\epsilon_\theta$ and the $L_{\text{vlb}}$ term based on the predicted variance.
- Sampling: During sampling, calculate the reverse step $x_{t-1}$ using the standard mean calculation derived from the predicted $\epsilon_\theta$, but draw the noise component $z \sim \mathcal{N}(0, I)$ scaled by the predicted standard deviation $\sigma_{\theta,t}$ instead of a fixed $\sigma_t$.
$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_{\theta,t} z, \quad \text{where } z \sim \mathcal{N}(0, I) \text{ if } t > 1, \text{ else } z = 0$$
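The sampling step above can be sketched in a few lines; the function and argument names are illustrative:

```python
import numpy as np

def reverse_step(mean, sigma, t, rng):
    """One reverse step: x_{t-1} = mu_theta + sigma_theta * z, with z = 0 at t = 1.

    mean:  predicted reverse-process mean mu_theta(x_t, t)
    sigma: learned standard deviation sigma_{theta,t} (scalar or per-pixel array)
    rng:   a numpy Generator supplying the Gaussian noise z
    """
    z = rng.standard_normal(mean.shape) if t > 1 else np.zeros_like(mean)
    return mean + sigma * z
```

The only change from fixed-variance sampling is where `sigma` comes from: it is computed from the network's predicted $v$ rather than read off a precomputed schedule.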
Benefits and Trade-offs
- Benefits:
- Can significantly improve log-likelihood scores compared to fixed variance models.
- May lead to modest improvements in sample quality (e.g., FID scores), although the primary gain is often in likelihood.
- Provides the model with greater flexibility to adapt the generation process.
- Trade-offs:
- Increases model complexity as the network must predict additional parameters.
- The hybrid loss function adds complexity to the training setup.
- Requires careful tuning of the hyperparameter λ in the hybrid loss.
Learned variance schedules represent a step beyond fixed or manually designed schedules, empowering the diffusion model to optimize a fundamental aspect of the reverse process. While adding some complexity, the potential gains in likelihood and adaptability make it a valuable technique in the advanced diffusion modeling toolkit, particularly when accurate density estimation is as important as sample quality.