A fundamental choice when training a diffusion model is deciding what the underlying neural network should predict. Given the noisy input $x_t$ and the timestep $t$, the model needs to estimate some quantity related to the denoising process. The two most common parameterizations are predicting the noise $\epsilon$ that was added, or predicting the original clean data $x_0$. This choice impacts the loss function, training dynamics, and potentially the final sample quality.
Predicting the noise is the standard approach introduced in the original Denoising Diffusion Probabilistic Models (DDPM) paper. The model, denoted $f_\theta(x_t, t)$, is trained to predict the noise $\epsilon$ that was sampled from a standard Gaussian distribution and combined with the original data $x_0$ to create $x_t$ according to the forward process equation:
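$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Here, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of the per-step factors $\alpha_s = 1 - \beta_s$ up to time $t$, where $\beta_s$ are the noise schedule variances. The objective function is typically a simplified mean squared error (MSE) loss between the predicted noise $\hat{\epsilon} = f_\theta(x_t, t)$ and the actual noise $\epsilon$ used to generate $x_t$: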
$$L_\epsilon = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - f_\theta(x_t, t) \rVert^2\right]$$

The expectation is taken over random timesteps $t$, initial data samples $x_0$, and the sampled noise $\epsilon$.
Advantages:
During sampling (the reverse process), the predicted noise $\hat{\epsilon}$ is used to estimate the direction towards a less noisy state, often by first computing the implied estimate of $x_0$ (denoted $\hat{x}_0$) and then using it in the DDPM or DDIM update step.
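As a rough illustration of that sampling path, the sketch below performs one deterministic DDIM update (the $\eta = 0$ case) starting from an epsilon prediction. The function name `ddim_step_from_eps`, the linear schedule, and the tensor shapes are assumptions made for this example, and the true noise stands in for a trained model's output so the result can be checked.

```python
import torch

def ddim_step_from_eps(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from an epsilon prediction.

    Hypothetical helper for illustration; alpha_bar_* are cumulative products.
    """
    # Estimate the clean data implied by the predicted noise.
    x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)
    # Move to the less noisy level, reusing the same predicted noise direction.
    x_prev = torch.sqrt(alpha_bar_prev) * x0_hat + torch.sqrt(1.0 - alpha_bar_prev) * eps_hat
    return x_prev, x0_hat

# Sanity check with an "oracle" prediction: the true noise replaces the model.
torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)       # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x0)
t, t_prev = 500, 450
x_t = torch.sqrt(alpha_bar[t]) * x0 + torch.sqrt(1.0 - alpha_bar[t]) * eps

x_prev, x0_hat = ddim_step_from_eps(x_t, eps, alpha_bar[t], alpha_bar[t_prev])
```

With a perfect noise prediction, `x0_hat` recovers `x0` up to floating-point error, and `x_prev` is the same sample noised to level `t_prev`. A trained model would only approximate this.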
An alternative approach is to parameterize the model, let's call it $g_\theta(x_t, t)$, to directly predict the original clean data $x_0$ from the noisy input $x_t$ and timestep $t$. The corresponding MSE loss function aims to minimize the difference between the predicted $\hat{x}_0 = g_\theta(x_t, t)$ and the true $x_0$:
$$L_{x_0} = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert x_0 - g_\theta(x_t, t) \rVert^2\right]$$

Advantages:
Connection between Parameterizations:
The two parameterizations are mathematically related. Given the forward process equation $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, we can express one prediction in terms of the other:

If the model predicts noise $\hat{\epsilon} = f_\theta(x_t, t)$, the corresponding prediction for $x_0$ is:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon}\right)$$

If the model predicts the clean data $\hat{x}_0 = g_\theta(x_t, t)$, the corresponding prediction for $\epsilon$ is:

$$\hat{\epsilon} = \frac{1}{\sqrt{1 - \bar{\alpha}_t}}\left(x_t - \sqrt{\bar{\alpha}_t}\, \hat{x}_0\right)$$

These relationships show that choosing one parameterization implicitly defines the other. However, training the network to predict one quantity versus the other directly affects the gradients and loss landscape, leading to potentially different training dynamics and final model performance.
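One way to see why the two targets behave differently is to substitute the conversion formulas into the errors: for a fixed $x_t$, the squared noise error equals the squared $x_0$ error scaled by $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$, so the same mistake is weighted differently at different noise levels. The sketch below checks this numerically. The helper names `x0_from_eps` and `eps_from_x0`, the linear schedule, and the synthetic "imperfect prediction" are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)       # assumed linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def x0_from_eps(x_t, eps_hat, a_bar):
    """Clean-data estimate implied by a noise prediction."""
    return (x_t - torch.sqrt(1.0 - a_bar) * eps_hat) / torch.sqrt(a_bar)

def eps_from_x0(x_t, x0_hat, a_bar):
    """Noise estimate implied by a clean-data prediction."""
    return (x_t - torch.sqrt(a_bar) * x0_hat) / torch.sqrt(1.0 - a_bar)

x0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x0)
a_bar = alpha_bar[300]
x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps

# Simulate an imperfect x0 prediction and convert it to a noise prediction.
x0_hat = x0 + 0.1 * torch.randn_like(x0)
eps_hat = eps_from_x0(x_t, x0_hat, a_bar)

# Converting back recovers the same x0 estimate (up to float error).
roundtrip_ok = torch.allclose(x0_from_eps(x_t, eps_hat, a_bar), x0_hat, atol=1e-4)

# The two squared errors differ by the factor alpha_bar / (1 - alpha_bar).
eps_err = ((eps - eps_hat) ** 2).mean()
x0_err = ((x0 - x0_hat) ** 2).mean()
ratio = a_bar / (1.0 - a_bar)
relation_ok = torch.allclose(eps_err, ratio * x0_err, rtol=1e-3)
```

Since the factor $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$ shrinks as $t$ grows, an unweighted $\epsilon$ loss puts relatively less weight on $x_0$ errors at high noise levels than an unweighted $x_0$ loss does.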
Comparison of epsilon-prediction and x0-prediction approaches. The core model architecture is often the same, but the target variable for the loss calculation differs.
import torch
import torch.nn.functional as F
def get_sqrt_alphas_cumprod(alphas):
    """Helper to get sqrt(alpha_bar), the square root of the cumulative product of alphas"""
    return torch.sqrt(torch.cumprod(alphas, dim=0))

def get_sqrt_one_minus_alphas_cumprod(alphas):
    """Helper to get sqrt(1 - alpha_bar)"""
    return torch.sqrt(1.0 - torch.cumprod(alphas, dim=0))
# --- Example Setup ---
T = 1000
betas = torch.linspace(0.0001, 0.02, T) # Example linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
# --- Assume we have these variables per batch item ---
# x_start: Original clean image [B, C, H, W]
# noise: Sampled Gaussian noise (epsilon) [B, C, H, W]
# t: Sampled timesteps [B]
# model: Your U-Net or Transformer model
# Extract schedule values for batch timesteps t
sqrt_alpha_bar_t = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
sqrt_one_minus_alpha_bar_t = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
# Calculate noisy image x_t
x_t = sqrt_alpha_bar_t * x_start + sqrt_one_minus_alpha_bar_t * noise
# Get model prediction
# model_output shape: [B, C, H, W]
model_output = model(x_t, t)
# --- Loss Calculation ---
# 1. Epsilon-Prediction Loss
target_eps = noise
loss_eps = F.mse_loss(model_output, target_eps)
# Use loss_eps for backpropagation
# 2. x0-Prediction Loss
target_x0 = x_start
loss_x0 = F.mse_loss(model_output, target_x0)
# Use loss_x0 for backpropagation
# --- Sampling Consideration (Example DDIM step) ---
# If model predicts epsilon (model_output = predicted_eps):
predicted_x0_from_eps = (x_t - sqrt_one_minus_alpha_bar_t * model_output) / sqrt_alpha_bar_t
# Use predicted_x0_from_eps in the DDIM update formula
# If model predicts x0 (model_output = predicted_x0):
predicted_eps_from_x0 = (x_t - sqrt_alpha_bar_t * model_output) / sqrt_one_minus_alpha_bar_t
# Use predicted_eps_from_x0 in the DDIM update formula (or modify DDIM to use predicted_x0 directly)
Python code snippet illustrating the difference in target calculation for ϵ-prediction and x0-prediction losses, and how to derive one prediction from the other for sampling.
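The snippet above assumes `x_start`, `noise`, `t`, and `model` already exist. For readers who want something that runs end to end, here is a minimal sketch in which a single model is trained against exactly one target, chosen by a switch. `ToyDenoiser`, `diffusion_loss`, and the `prediction_type` values `"epsilon"` and `"x0"` are hypothetical names introduced for this example, and the one-layer model is only a stand-in for a real U-Net or Transformer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Hypothetical stand-in for a U-Net: one conv layer, timestep fed as an extra channel."""
    def __init__(self, channels=3, num_steps=1000):
        super().__init__()
        self.num_steps = num_steps
        self.conv = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, x_t, t):
        # Broadcast the normalized timestep to a constant feature map [B, 1, H, W].
        t_map = (t.float() / self.num_steps).view(-1, 1, 1, 1)
        t_map = t_map.expand(-1, 1, *x_t.shape[2:])
        return self.conv(torch.cat([x_t, t_map], dim=1))

def diffusion_loss(model, x_start, alphas_cumprod, prediction_type="epsilon"):
    """MSE loss against either the noise or the clean data, depending on prediction_type."""
    batch = x_start.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,), device=x_start.device)
    noise = torch.randn_like(x_start)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward process: mix clean data and noise at the sampled timesteps.
    x_t = torch.sqrt(a_bar) * x_start + torch.sqrt(1.0 - a_bar) * noise
    model_output = model(x_t, t)
    if prediction_type == "epsilon":
        target = noise
    elif prediction_type == "x0":
        target = x_start
    else:
        raise ValueError(f"unknown prediction_type: {prediction_type}")
    return F.mse_loss(model_output, target)

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)       # same example linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = ToyDenoiser()
x_start = torch.randn(8, 3, 16, 16)            # random stand-in for a batch of images

loss_eps = diffusion_loss(model, x_start, alphas_cumprod, "epsilon")
loss_x0 = diffusion_loss(model, x_start, alphas_cumprod, "x0")
loss_eps.backward()                            # gradients for the epsilon objective only
```

In practice a model is trained with one of the two objectives, not both; computing both losses here only shows that the switch changes nothing except the regression target.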
While ϵ-prediction is the well-established and often more stable default, x0-prediction provides a valid alternative. Experimenting with x0-prediction might be worthwhile if:
It's also important to be aware of other parameterizations like v-prediction, discussed in the next section, which attempts to combine the benefits of both ϵ and x0 prediction, particularly improving how the model scales its output across different noise levels. Understanding the implications of predicting ϵ versus x0 provides a foundation for appreciating these more advanced techniques.
© 2025 ApX Machine Learning