A fundamental choice when training a diffusion model is deciding what the underlying neural network should predict. Given the noisy input $x_t$ and the timestep $t$, the model needs to estimate some quantity related to the denoising process. The two most common parameterizations are predicting the noise ($\epsilon$) that was added, or predicting the original clean data ($x_0$). This choice impacts the loss function, training dynamics, and potentially the final sample quality.
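Predicting the noise ($\epsilon$-prediction) is the standard approach introduced in the original Denoising Diffusion Probabilistic Models (DDPM) paper. The model, denoted as $\epsilon_\theta(x_t, t)$, is trained to predict the noise $\epsilon$ that was sampled from a standard Gaussian distribution and added to the original data $x_0$ to create $x_t$ according to the forward process equation:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$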
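Here, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of $\alpha_s = 1 - \beta_s$ up to time $t$, where $\beta_s$ are the noise schedule variances. The objective function is typically a simplified mean squared error (MSE) loss between the predicted noise and the actual noise used to generate $x_t$:

$$L_{\epsilon} = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$$

The expectation is taken over random timesteps $t$, initial data samples $x_0$, and the sampled noise $\epsilon$.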
Advantages:
During sampling (the reverse process), the predicted noise $\epsilon_\theta(x_t, t)$ is used to estimate the direction towards a less noisy state, often by first estimating the predicted $x_0$ (denoted $\hat{x}_0$) and then using it in the DDPM or DDIM update step.
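As a rough illustration of that two-stage use of the predicted noise, the sketch below implements one deterministic DDIM step (the $\eta = 0$ case). The function name `ddim_step_from_eps` and the toy timesteps are assumptions made for this example, and a "perfect" noise prediction stands in for a trained network.

```python
import torch


def ddim_step_from_eps(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM step (eta = 0) from a noise prediction.

    alpha_bar_t and alpha_bar_prev are the cumulative products at the current
    and the previous (less noisy) timestep, broadcastable to x_t.
    """
    # Step 1: estimate the clean sample from the predicted noise.
    x0_hat = (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
    # Step 2: move to the previous noise level, reusing the same noise direction.
    x_prev = torch.sqrt(alpha_bar_prev) * x0_hat + torch.sqrt(1.0 - alpha_bar_prev) * eps_pred
    return x_prev, x0_hat


# Toy check with a linear schedule and a "perfect" noise prediction.
torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x0)
t, t_prev = 500, 490  # hypothetical pair of timesteps
a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
x_t = torch.sqrt(a_t) * x0 + torch.sqrt(1.0 - a_t) * eps

x_prev, x0_hat = ddim_step_from_eps(x_t, eps, a_t, a_prev)
```

With an exact noise prediction, `x0_hat` recovers `x0` up to floating-point error, and `x_prev` is the same sample noised to the level of `t_prev`. With a real model the estimate is only approximate, which is why the step is repeated over many timesteps.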
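An alternative approach is to parameterize the model, let's call it $x_\theta(x_t, t)$, to directly predict the original clean data $x_0$ from the noisy input $x_t$ and timestep $t$. The corresponding MSE loss function aims to minimize the difference between the predicted $x_0$ and the true $x_0$:

$$L_{x_0} = \mathbb{E}_{t, x_0, \epsilon}\left[\lVert x_0 - x_\theta(x_t, t) \rVert^2\right]$$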
Advantages:
Connection between Parameterizations:
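The two parameterizations are mathematically related. Given the forward process equation $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, we can express one prediction in terms of the other:

If the model predicts noise $\epsilon_\theta(x_t, t)$, the corresponding prediction for $x_0$ is:

$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}$$

If the model predicts the clean data $x_\theta(x_t, t)$, the corresponding prediction for $\epsilon$ is:

$$\hat{\epsilon} = \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$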
These relationships show that choosing one parameterization implicitly defines the other. However, the quantity the network is trained to predict determines how errors are weighted across noise levels, and therefore the gradients it receives, which can lead to different training dynamics and final model performance.
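A small numerical check can make this difference concrete. Substituting the forward process equation into the conversion formula shows that, for the same underlying prediction, the squared noise error equals the squared $x_0$ error scaled by $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$. The sketch below verifies this at a few timesteps; the linear schedule, the tensor shapes, and the perturbed `x0_pred` (standing in for a network output) are all assumptions for illustration.

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(4, 3, 8, 8)
eps = torch.randn_like(x0)
# Hypothetical imperfect x0 prediction, standing in for a network output.
x0_pred = x0 + 0.1 * torch.randn_like(x0)

ratios = {}
for t in (50, 500, 950):
    a_bar = alphas_cumprod[t]
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    # Convert the x0 prediction into the noise prediction it implies.
    eps_pred = (x_t - torch.sqrt(a_bar) * x0_pred) / torch.sqrt(1.0 - a_bar)
    mse_x0 = torch.mean((x0_pred - x0) ** 2)
    mse_eps = torch.mean((eps_pred - eps) ** 2)
    # The two errors differ by the factor alpha_bar / (1 - alpha_bar).
    ratios[t] = (mse_eps / mse_x0).item()
    expected = (a_bar / (1.0 - a_bar)).item()
    print(f"t={t}: mse_eps / mse_x0 = {ratios[t]:.4f}, "
          f"alpha_bar / (1 - alpha_bar) = {expected:.4f}")
```

The factor is large at low noise levels and close to zero at high noise levels, so an unweighted noise loss and an unweighted $x_0$ loss emphasize different parts of the noise schedule even though the predictions are interchangeable.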
Comparison of epsilon-prediction and x0-prediction approaches. The core model architecture is often the same, but the target variable for the loss calculation differs.
import torch
import torch.nn.functional as F
def get_sqrt_alphas_cumprod(alphas):
    """Helper to get sqrt(alpha_bar), the square root of the cumulative products"""
    return torch.sqrt(torch.cumprod(alphas, dim=0))


def get_sqrt_one_minus_alphas_cumprod(alphas):
    """Helper to get sqrt(1 - alpha_bar)"""
    return torch.sqrt(1.0 - torch.cumprod(alphas, dim=0))
# --- Example Setup ---
T = 1000
betas = torch.linspace(0.0001, 0.02, T) # Example linear schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)
sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)
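# --- Stand-in batch so the snippet runs; replace with real data and your model ---
B, C, H, W = 4, 3, 32, 32
x_start = torch.randn(B, C, H, W)         # Original clean image [B, C, H, W]
noise = torch.randn_like(x_start)         # Sampled Gaussian noise (epsilon) [B, C, H, W]
t = torch.randint(0, T, (B,))             # Sampled timesteps [B]
model = lambda x, t: torch.zeros_like(x)  # Placeholder for your U-Net or Transformer model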
# Extract schedule values for batch timesteps t
sqrt_alpha_bar_t = sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
sqrt_one_minus_alpha_bar_t = sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
# Calculate noisy image x_t
x_t = sqrt_alpha_bar_t * x_start + sqrt_one_minus_alpha_bar_t * noise
# Get model prediction
# model_output shape: [B, C, H, W]
model_output = model(x_t, t)
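# --- Loss Calculation (train with ONE of the two targets, not both) ---
# 1. Epsilon-Prediction Loss: model_output is interpreted as the predicted noise
target_eps = noise
loss_eps = F.mse_loss(model_output, target_eps)
# Use loss_eps for backpropagation
# 2. x0-Prediction Loss: model_output is interpreted as the predicted clean data
target_x0 = x_start
loss_x0 = F.mse_loss(model_output, target_x0)
# Use loss_x0 for backpropagation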
# --- Sampling (Example DDIM step) ---
# If model predicts epsilon (model_output = predicted_eps):
predicted_x0_from_eps = (x_t - sqrt_one_minus_alpha_bar_t * model_output) / sqrt_alpha_bar_t
# Use predicted_x0_from_eps in the DDIM update formula
# If model predicts x0 (model_output = predicted_x0):
predicted_eps_from_x0 = (x_t - sqrt_alpha_bar_t * model_output) / sqrt_one_minus_alpha_bar_t
# Use predicted_eps_from_x0 in the DDIM update formula (or modify DDIM to use predicted_x0 directly)
Python code snippet illustrating the difference in target calculation for $\epsilon$-prediction and $x_0$-prediction losses, and how to derive one prediction from the other for sampling.
While $\epsilon$-prediction is the well-established and often more stable default, $x_0$-prediction provides a valid alternative. Experimenting with $x_0$-prediction might be worthwhile if:
It's also important to be aware of other parameterizations like $v$-prediction, discussed in the next section, which attempts to combine the benefits of both $\epsilon$ and $x_0$ prediction, particularly improving how the model scales its output across different noise levels. Understanding the implications of predicting $\epsilon$ versus $x_0$ provides a foundation for appreciating these more advanced techniques.
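As a brief preview, and only as a sketch under the commonly used definition $v = \sqrt{\bar{\alpha}_t}\, \epsilon - \sqrt{1 - \bar{\alpha}_t}\, x_0$ (an assumption here, since this section does not define it), the snippet below builds the $v$ target and shows that both $x_0$ and $\epsilon$ can be recovered from it. The schedule, shapes, and timestep are illustrative choices.

```python
import torch

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(2, 3, 8, 8)
eps = torch.randn_like(x0)
t = 300  # hypothetical timestep
sqrt_ab = torch.sqrt(alphas_cumprod[t])
sqrt_1mab = torch.sqrt(1.0 - alphas_cumprod[t])
x_t = sqrt_ab * x0 + sqrt_1mab * eps

# v target: a mix of noise and clean data whose weights change with the noise level.
v = sqrt_ab * eps - sqrt_1mab * x0

# Both of the other quantities can be recovered from x_t and v.
x0_from_v = sqrt_ab * x_t - sqrt_1mab * v
eps_from_v = sqrt_1mab * x_t + sqrt_ab * v
```

At low noise levels the target is dominated by $\epsilon$, and at high noise levels by $-x_0$, which is one way to read the claim that it blends the two parameterizations.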
© 2026 ApX Machine Learning