As we established, directly computing the true reverse probability $p(x_{t-1} \mid x_t)$ is generally intractable. Our strategy is to approximate this reverse transition with a parameterized distribution, typically a Gaussian, learned by a neural network: $p_\theta(x_{t-1} \mid x_t)$. The network must estimate the parameters of this Gaussian, specifically its mean $\mu_\theta(x_t, t)$ and variance $\Sigma_\theta(x_t, t)$.
While the network could be trained to predict the mean $\mu_\theta(x_t, t)$ directly, or even the denoised sample $x_{t-1}$, a different approach has proven remarkably effective in practice: predicting the noise component $\epsilon$ that was added during the forward process at timestep $t$.
Let's see why this is useful. Recall the closed-form expression for sampling $x_t$ directly from $x_0$ in the forward process:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Here, $\epsilon \sim \mathcal{N}(0, I)$ is a standard Gaussian noise sample, and $\bar{\alpha}_t$ is derived from the noise schedule. This equation links the original data $x_0$, the noisy version $x_t$, and the noise $\epsilon$ added to reach that state.
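To make this concrete, here is a minimal sketch of the forward sampling step in PyTorch. The linear schedule bounds ($10^{-4}$ to $0.02$ over 1000 steps) follow common DDPM-style defaults, but the exact schedule is a design choice, not something fixed by the equation above.

```python
import torch

# A minimal sketch of the forward (noising) process, assuming a linear
# beta schedule with common DDPM-style defaults.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule beta_t
alphas = 1.0 - betas                       # alpha_t = 1 - beta_t
alpha_bar = torch.cumprod(alphas, dim=0)   # bar{alpha}_t = product of alpha_s up to t

def q_sample(x0, t, eps):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bar[t]
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

x0 = torch.randn(1, 3, 32, 32)   # stand-in for a real data sample
eps = torch.randn_like(x0)       # epsilon ~ N(0, I)
xt = q_sample(x0, t=500, eps=eps)
```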
If our neural network, call it $\epsilon_\theta$, can accurately predict the noise $\epsilon$ from the noisy input $x_t$ and the timestep $t$, we can use this prediction, $\epsilon_\theta(x_t, t)$, to inform our estimate of the previous state $x_{t-1}$.
How does predicting noise help estimate the mean $\mu_\theta(x_t, t)$ of the reverse step $p_\theta(x_{t-1} \mid x_t)$? The original Denoising Diffusion Probabilistic Models (DDPM) paper showed that the mean of the forward-process posterior $q(x_{t-1} \mid x_t, x_0)$, the tractable distribution the reverse step aims to match, can be expressed as:
$$\tilde{\mu}_t(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon\right)$$

where $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t-1}$. Notice that this expression depends on the noise $\epsilon$, and hence on the original data $x_0$, neither of which we have during generation.
However, we can rearrange the first equation to get an estimate of $x_0$ given $x_t$ and $\epsilon$:
$$x_0 \approx \hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)\right)$$

By substituting the network's noise prediction $\epsilon_\theta(x_t, t)$ for $\epsilon$ in the equation for $\tilde{\mu}_t$, we arrive at an expression for the mean of our approximate reverse transition $p_\theta(x_{t-1} \mid x_t)$:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

This establishes a direct link: if our network $\epsilon_\theta(x_t, t)$ successfully predicts the noise added at step $t$, we can calculate the mean required for the denoising step $x_t \to x_{t-1}$. The variance $\Sigma_\theta(x_t, t)$ is often fixed to a value related to the forward process variances, or sometimes learned as well; the noise prediction is primarily used to determine the mean.
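Continuing the sketch above (reusing the `alphas` and `alpha_bar` tensors), both formulas translate directly into code: one helper recovers the estimate $\hat{x}_0$, the other computes the reverse-step mean $\mu_\theta$. Here `eps_pred` stands in for the network output $\epsilon_\theta(x_t, t)$, assumed to come from some trained model.

```python
def predict_x0(xt, t, eps_pred):
    """x0_hat = (x_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t)."""
    ab = alpha_bar[t]
    return (xt - (1.0 - ab).sqrt() * eps_pred) / ab.sqrt()

def reverse_mean(xt, t, eps_pred):
    """mu_theta = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps_pred) / sqrt(alpha_t)."""
    a, ab = alphas[t], alpha_bar[t]
    return (xt - (1.0 - a) / (1.0 - ab).sqrt() * eps_pred) / a.sqrt()
```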
Parameterizing the reverse process by predicting noise offers several practical advantages: the training objective reduces to a simple mean-squared error between the true and predicted noise, the prediction target has a consistent scale (standard Gaussian) at every timestep, and the objective connects naturally to denoising score matching.
Therefore, the standard approach involves training a neural network $\epsilon_\theta$ that takes the noisy data $x_t$ and the timestep $t$ as input and outputs a prediction of the noise $\epsilon$ that was used to generate $x_t$ from $x_0$. This predicted noise $\epsilon_\theta(x_t, t)$ then allows us to compute the parameters (specifically the mean) of the approximate reverse distribution $p_\theta(x_{t-1} \mid x_t)$, enabling the step-by-step generation process.
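Training such a network reduces to a simple regression loop. The sketch below assumes a hypothetical `model(xt, t)` that returns a noise prediction with the same shape as `xt`, and reuses the schedule tensors defined earlier.

```python
import torch.nn.functional as F

def ddpm_loss(model, x0):
    """One training step's loss: MSE between true and predicted noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                  # random timestep per example
    eps = torch.randn_like(x0)                     # noise the model must recover
    ab = alpha_bar[t].view(b, 1, 1, 1)             # broadcast over image dimensions
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps  # forward sample x_t
    return F.mse_loss(model(xt, t), eps)           # compare eps_theta(x_t, t) to eps
```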
In summary, the neural network $\epsilon_\theta$ takes the current noisy sample $x_t$ and the timestep $t$ as input, and its goal is to predict the noise $\epsilon$ that was added to the original data to produce $x_t$. This prediction is the core component used to guide the reverse denoising step.
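Putting the pieces together, generation runs the reverse chain from pure noise down to $t = 0$. The following is a sketch of ancestral sampling with the variance fixed to $\beta_t$ (one common choice); it reuses `reverse_mean` and the schedule from the sketches above, and `model` is again a hypothetical trained noise predictor.

```python
@torch.no_grad()
def sample(model, shape):
    """Start from x_T ~ N(0, I) and apply T reverse denoising steps."""
    xt = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t)       # same timestep for the whole batch
        eps_pred = model(xt, t_batch)              # eps_theta(x_t, t)
        mu = reverse_mean(xt, t, eps_pred)         # mean of p_theta(x_{t-1} | x_t)
        if t > 0:
            sigma = betas[t].sqrt()                # fixed variance choice: sigma_t^2 = beta_t
            xt = mu + sigma * torch.randn_like(xt)
        else:
            xt = mu                                # no noise added at the final step
    return xt
```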