While the standard diffusion model training objective, predicting the noise $\epsilon$ added at timestep $t$, works remarkably well, research has explored alternative formulations for the model's prediction target and loss function. These alternatives can offer benefits in terms of sample quality, training stability, and how well the model behaves across different noise levels. Let's examine two significant approaches: the widely used simplified loss $L_{\text{simple}}$ (based on $\epsilon$-prediction) and the $v$-prediction objective.
Most diffusion models are trained to predict the noise $\epsilon$ that was added to the original data $x_0$ to produce the noisy sample $x_t$. Recall the forward process:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I)$$

The model, typically a U-Net or Transformer denoted by $\epsilon_\theta$, takes the noisy input $x_t$ and the timestep $t$ (usually via an embedding) and outputs a prediction of the noise: $\epsilon_\theta(x_t, t)$.
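To make this concrete, here is a minimal PyTorch sketch of the forward (noising) step. The schedule values and the helper names (`alpha_bar`, `q_sample`) are illustrative choices, not taken from any particular library:

```python
import torch

# Illustrative noise schedule: linear betas over T steps, as in the DDPM setup.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar[t] = cumulative product up to step t

def q_sample(x0, t, noise):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    abar = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * noise
```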
The theoretical loss derived from the variational lower bound (VLB) includes weighting terms that depend on the timestep t. However, the DDPM paper found that a simplified, unweighted version of the loss often yields better results in practice:
$$L_{\text{simple}} = \mathbb{E}_{t \sim \mathcal{U}(1, T),\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I)}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$

Here, $t$ is sampled uniformly from the total number of timesteps $T$. This $L_{\text{simple}}$ objective is computationally straightforward and forms the backbone of many successful diffusion model implementations. It directly optimizes the model to denoise the input by predicting the added noise.
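In code, a training step for this objective reduces to a single mean-squared error between the sampled noise and the model's output. The sketch below reuses the `q_sample` helper and schedule from above and assumes a model `eps_model(x_t, t)` that takes a batch of noisy inputs and integer timesteps; these names are placeholders for your own components:

```python
import torch
import torch.nn.functional as F

def simple_loss_step(eps_model, x0):
    """One L_simple step: sample t and epsilon, noise x0, then regress the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)  # t sampled uniformly over timesteps
    eps = torch.randn_like(x0)                       # epsilon ~ N(0, I)
    x_t = q_sample(x0, t, eps)
    eps_pred = eps_model(x_t, t)
    return F.mse_loss(eps_pred, eps)                 # ||epsilon - epsilon_theta(x_t, t)||^2
```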
Despite its success, predicting ϵ can sometimes face challenges, particularly near the boundaries of the diffusion process (very small t or very large t) where the signal-to-noise ratio varies drastically.
An alternative formulation proposed to address some limitations of ϵ-prediction is v-prediction. Instead of predicting the noise ϵ, the model predicts a different target, v, defined as:
$$v = \alpha_t\, \epsilon - \sigma_t\, x_0$$

Here, we use the common notation where $\alpha_t = \sqrt{\bar{\alpha}_t}$ (signal scale) and $\sigma_t = \sqrt{1 - \bar{\alpha}_t}$ (noise scale), consistent with the forward process equation $x_t = \alpha_t x_0 + \sigma_t \epsilon$.
Why predict $v$? The intuition is related to signal scaling during the diffusion process. The target $\epsilon$ always has unit variance, while the proportion of signal to noise in the input $x_t$ changes drastically with $t$. Predicting $v$ can be interpreted as predicting a target that remains informative and well scaled across different timesteps, potentially making the learning task easier or more stable for the network. It effectively blends predicting the noise $\epsilon$ (which is closely related to the score function $\nabla_{x_t} \log p(x_t)$) with predicting the data $x_0$ in a balanced way.
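Computing the $v$ target requires nothing beyond the quantities already used in the forward process. Here is a small sketch, reusing the illustrative `alpha_bar` schedule from the earlier snippet (`v_target` is a placeholder name):

```python
import torch

def v_target(x0, t, noise):
    """v = alpha_t * eps - sigma_t * x0, with alpha_t = sqrt(alpha_bar_t), sigma_t = sqrt(1 - alpha_bar_t)."""
    abar = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    alpha_t = torch.sqrt(abar)
    sigma_t = torch.sqrt(1.0 - abar)
    return alpha_t * noise - sigma_t * x0
```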
The model architecture remains largely the same (e.g., a U-Net), but it is now parameterized to output $v_\theta(x_t, t)$. The corresponding loss function is analogous to $L_{\text{simple}}$, but with $v$ as the target:
$$L_v = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\lVert v - v_\theta(x_t, t) \rVert^2\right]$$

Often, weighting terms similar to those in the full VLB might be reintroduced, or the loss might be formulated based on score matching principles, but the core idea is to predict this $v$ target.
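A training step then mirrors the $\epsilon$-prediction step; only the regression target changes. The sketch below assumes the `q_sample` and `v_target` helpers from above and a model `v_model(x_t, t)` parameterized to output $v_\theta$ (again, the names are illustrative):

```python
import torch
import torch.nn.functional as F

def v_loss_step(v_model, x0):
    """One L_v step: same sampling of t and epsilon, but regress v instead of the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    v = v_target(x0, t, eps)
    return F.mse_loss(v_model(x_t, t), v)
```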
When using $v$-prediction, you need to adjust how the model's output is used during sampling. If the model $v_\theta(x_t, t)$ predicts $v$, you can recover the predictions for $\epsilon$ and $x_0$ needed for the sampling steps (like DDIM or DDPM):
$$\hat{\epsilon} = \alpha_t\, v_\theta(x_t, t) + \sigma_t\, x_t$$
$$\hat{x}_0 = \alpha_t\, x_t - \sigma_t\, v_\theta(x_t, t)$$

These recovered values can then be plugged into the standard sampling equations.
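As a sketch, the conversion looks as follows, again reusing the illustrative `alpha_bar` schedule (the helper name is hypothetical):

```python
import torch

def eps_and_x0_from_v(x_t, t, v_pred):
    """Recover predicted epsilon and x0 from a v-prediction, for use in DDPM/DDIM sampling steps."""
    abar = alpha_bar.to(x_t.device)[t].view(-1, 1, 1, 1)
    alpha_t = torch.sqrt(abar)
    sigma_t = torch.sqrt(1.0 - abar)
    eps_pred = alpha_t * v_pred + sigma_t * x_t
    x0_pred = alpha_t * x_t - sigma_t * v_pred
    return eps_pred, x0_pred
```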
The choice between $\epsilon$-prediction ($L_{\text{simple}}$) and $v$-prediction ($L_v$) isn't always clear-cut and can depend on the specific application, dataset, and model architecture.
$L_{\text{simple}}$ ($\epsilon$-prediction): computationally straightforward, well understood, and a robust, widely used default. However, the signal-to-noise ratio of the denoising task varies drastically across timesteps, which can cause difficulties near the boundaries of the diffusion process.

$L_v$ ($v$-prediction): the target is better behaved across timesteps, which can make training more stable and improve behavior at extreme noise levels. It requires converting the model's output back to $\hat{\epsilon}$ or $\hat{x}_0$ during sampling.
The diagram below illustrates the different targets the network aims to predict in each formulation:
Diagram comparing the prediction targets and loss calculation for the $\epsilon$-prediction and $v$-prediction formulations. Both take the noisy data $x_t$ and the timestep $t$ as input.
Ultimately, both $L_{\text{simple}}$ and $L_v$ are effective objectives for training diffusion models. While $L_{\text{simple}}$ remains a robust and widely used baseline, $v$-prediction offers a valuable alternative, particularly when pushing for state-of-the-art sample quality or dealing with specific training dynamics. Experimentation is often necessary to determine the best choice for a given project. Understanding these different formulations provides you with more tools to optimize and control the behavior of your diffusion models during training.