While the standard diffusion model training objective, predicting the noise $\epsilon$ added at timestep $t$, works remarkably well, research has explored alternative formulations for the model's prediction target and loss function. These alternatives can offer benefits in sample quality, training stability, and how consistently the model behaves across different noise levels. Two significant approaches are the widely used simplified loss $L_{\text{simple}}$ (based on $\epsilon$-prediction) and the $v$-prediction objective.
Most diffusion models are trained to predict the noise $\epsilon$ that was added to the original data $x_0$ to produce the noisy sample $x_t$. Recall the forward process:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
The model, typically a U-Net or Transformer denoted by $\epsilon_\theta$, takes the noisy input $x_t$ and the timestep $t$ (usually via an embedding) and outputs a prediction of the noise: $\hat{\epsilon} = \epsilon_\theta(x_t, t)$.
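To make the forward process concrete, here is a minimal PyTorch sketch. The beta schedule is the standard DDPM linear schedule; the data shape and the `model(x_t, t)` call at the end are illustrative assumptions, not a specific library API.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear beta schedule (DDPM)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # cumulative product: alpha_bar_t

def q_sample(x0, t, eps):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bars[t].sqrt().view(-1, 1, 1, 1)          # signal scale per sample
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)  # noise scale per sample
    return a * x0 + s * eps

x0 = torch.randn(4, 3, 32, 32)       # dummy batch of image-shaped data
t = torch.randint(0, T, (4,))        # one timestep per sample
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)
# eps_hat = model(x_t, t)            # hypothetical network predicting the added noise
```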
The theoretical loss derived from the variational lower bound (VLB) includes weighting terms that depend on the timestep $t$. However, the DDPM paper found that a simplified, unweighted version of the loss often yields better results in practice:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
Here, $t$ is sampled uniformly from $\{1, \dots, T\}$, where $T$ is the total number of timesteps. This objective is computationally straightforward and forms the backbone of many successful diffusion model implementations. It directly optimizes the model to denoise the input by predicting the added noise.
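A minimal sketch of one $L_{\text{simple}}$ training step is shown below, reusing the `alpha_bars` schedule from the earlier snippet and assuming a generic `model(x_t, t)` callable; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def simple_loss(model, x0, alpha_bars):
    """One L_simple step: sample t and eps, noise x0, and regress the noise."""
    alpha_bars = alpha_bars.to(x0.device)
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)  # uniform t
    eps = torch.randn_like(x0)
    a = alpha_bars[t].sqrt().view(-1, 1, 1, 1)          # sqrt(alpha_bar_t)
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)  # sqrt(1 - alpha_bar_t)
    x_t = a * x0 + s * eps
    eps_hat = model(x_t, t)          # network predicts the added noise
    return F.mse_loss(eps_hat, eps)  # unweighted mean squared error
```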
Despite its success, predicting $\epsilon$ can sometimes face challenges, particularly near the boundaries of the diffusion process (very small or very large $t$), where the signal-to-noise ratio varies drastically.
An alternative formulation proposed to address some limitations of $\epsilon$-prediction is $v$-prediction. Instead of predicting the noise $\epsilon$, the model predicts a different target, $v$, defined as:

$$v = \alpha_t \epsilon - \sigma_t x_0$$
Here, we use the common notation $\alpha_t = \sqrt{\bar{\alpha}_t}$ (signal scale) and $\sigma_t = \sqrt{1 - \bar{\alpha}_t}$ (noise scale), consistent with the forward process equation $x_t = \alpha_t x_0 + \sigma_t \epsilon$.
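In code, the $v$ target can be built from exactly the same quantities used for the forward process. The sketch below assumes the `alpha_bars` schedule defined earlier; the function name is illustrative.

```python
import torch

def v_target(x0, eps, alpha_bars, t):
    """v = alpha_t * eps - sigma_t * x0, with alpha_t = sqrt(alpha_bar_t)."""
    a = alpha_bars[t].sqrt().view(-1, 1, 1, 1)          # alpha_t (signal scale)
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)  # sigma_t (noise scale)
    return a * eps - s * x0
```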
Why predict $v$? The intuition is related to signal scaling during the diffusion process. The noise target $\epsilon$ always has unit variance, but the mix of signal and noise in the input $x_t$ changes drastically with $t$: at large $t$, $x_t$ is almost entirely noise, so an $\epsilon$-prediction network can do well by nearly copying its input while learning little about $x_0$. Predicting $v$ can be interpreted as predicting a target that remains informative and well scaled across different timesteps, potentially making the learning task easier or more stable for the network. It effectively combines predicting the noise $\epsilon$ (closely related to the score function) and the data $x_0$ in a balanced way.
The model architecture remains largely the same (e.g., a U-Net), but it is now parameterized to output $\hat{v} = v_\theta(x_t, t)$. The corresponding loss function is analogous to $L_{\text{simple}}$, but with $v$ as the target:

$$L_v = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| v - v_\theta(x_t, t) \right\|^2\right]$$
Often, weighting terms similar to those in the full VLB might be reintroduced, or the loss might be formulated based on score matching principles, but the core idea is to predict this target.
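A minimal sketch of one unweighted $v$-prediction training step, mirroring the $L_{\text{simple}}$ step above; it assumes the same hypothetical `model(x_t, t)` callable (now trained to output $v$) and the `alpha_bars` schedule from earlier.

```python
import torch
import torch.nn.functional as F

def v_loss(model, x0, alpha_bars):
    """One v-prediction step: same noising as L_simple, but regress v."""
    alpha_bars = alpha_bars.to(x0.device)
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a = alpha_bars[t].sqrt().view(-1, 1, 1, 1)          # alpha_t
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)  # sigma_t
    x_t = a * x0 + s * eps
    v = a * eps - s * x0               # the v target
    v_hat = model(x_t, t)              # the network now outputs v
    return F.mse_loss(v_hat, v)        # unweighted, mirroring L_simple
```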
When using $v$-prediction, you need to adjust how the model's output is used during sampling. If the model predicts $\hat{v}$, you can recover predictions for $x_0$ and $\epsilon$ as needed for the sampling steps (like DDIM or DDPM):

$$\hat{x}_0 = \alpha_t x_t - \sigma_t \hat{v}, \qquad \hat{\epsilon} = \sigma_t x_t + \alpha_t \hat{v}$$
These recovered values can then be plugged into the standard sampling equations.
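The conversion is a couple of lines in practice. Below is a small sketch, again assuming the `alpha_bars` schedule from earlier; the function name is illustrative and the outputs can be fed into whichever sampler you use.

```python
import torch

def from_v(x_t, v_hat, alpha_bars, t):
    """Recover x0_hat and eps_hat from a v prediction for use in sampling."""
    a = alpha_bars[t].sqrt().view(-1, 1, 1, 1)          # alpha_t
    s = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)  # sigma_t
    x0_hat = a * x_t - s * v_hat
    eps_hat = s * x_t + a * v_hat
    return x0_hat, eps_hat
```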
The choice between $\epsilon$-prediction ($L_{\text{simple}}$) and $v$-prediction ($L_v$) isn't always clear-cut and can depend on the specific application, dataset, and model architecture.
$L_{\text{simple}}$ ($\epsilon$-prediction): simple to implement, well understood, and the default in most reference implementations; however, at very high noise levels the $\epsilon$ target carries little information about $x_0$, which relates to the boundary issues mentioned earlier.
$L_v$ ($v$-prediction): keeps the target well conditioned across the full range of noise levels, which can improve training stability and is particularly useful for techniques such as progressive distillation; it requires converting $\hat{v}$ back to $\hat{x}_0$ or $\hat{\epsilon}$ during sampling.
The diagram below illustrates the different targets the network aims to predict in each formulation:
Diagram comparing the prediction targets and loss calculation for the $\epsilon$-prediction and $v$-prediction formulations. Both take the noisy data $x_t$ and timestep $t$ as input.
Ultimately, both $L_{\text{simple}}$ and $L_v$ are effective objectives for training diffusion models. While $L_{\text{simple}}$ remains a strong baseline, $v$-prediction offers a valuable alternative, particularly when pushing for higher sample quality or improving training stability at extreme noise levels. Experimentation is often necessary to determine the best choice for a given project. Understanding these different formulations gives you more tools to optimize and control the behavior of your diffusion models during training.