The reverse diffusion process aims to undo the effects of noising, starting from a noisy sample $x_t$ and estimating the slightly less noisy sample $x_{t-1}$. The true reverse probability $q(x_{t-1} \mid x_t)$ is intractable, but it can be approximated with a neural network, $p_\theta(x_{t-1} \mid x_t)$. A common approach trains a network $\epsilon_\theta(x_t, t)$ to predict the noise $\epsilon$ that was added to the original data $x_0$ to obtain $x_t$. This predicted noise then defines the mathematical operation for one step of denoising.
The goal is to define the distribution $p_\theta(x_{t-1} \mid x_t)$. It turns out that if we knew the original data point $x_0$, the true posterior distribution $q(x_{t-1} \mid x_t, x_0)$ could be calculated analytically, and it is a Gaussian. While we won't go through the full derivation here (it involves applying Bayes' theorem to the Gaussian distributions of the forward process, as detailed in Appendix B of the original DDPM paper by Ho et al., 2020), the result is important. The posterior $q(x_{t-1} \mid x_t, x_0)$ is given by:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big)$$

where the mean $\tilde{\mu}_t(x_t, x_0)$ and variance $\tilde{\beta}_t$ are:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$

Remember that $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ are derived from the noise schedule $\beta_t$.
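To make these quantities concrete, here is a minimal PyTorch sketch that precomputes them for a linear noise schedule. The specific values ($T = 1000$, $\beta_t$ ranging from $10^{-4}$ to $0.02$) are assumptions borrowed from common DDPM configurations, not requirements of the derivation. Note that the tensors are indexed from 0, so index $t-1$ in code corresponds to timestep $t$ in the formulas.

```python
import torch

# Linear noise schedule (illustrative values; T and the beta range are assumptions).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # beta_t
alphas = 1.0 - betas                                 # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)            # alpha_bar_t = prod_{i<=t} alpha_i
alphas_bar_prev = torch.cat([torch.tensor([1.0]), alphas_bar[:-1]])  # alpha_bar_{t-1}

# Posterior variance: beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
posterior_variance = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas

# Coefficients of x_0 and x_t in the posterior mean mu_tilde_t
coef_x0 = torch.sqrt(alphas_bar_prev) * betas / (1.0 - alphas_bar)
coef_xt = torch.sqrt(alphas) * (1.0 - alphas_bar_prev) / (1.0 - alphas_bar)
```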
Now, during the reverse (generation) process, we don't have access to the original $x_0$. This is where our trained neural network $\epsilon_\theta(x_t, t)$ comes in. We use it to approximate $x_0$. Recall the forward process equation that allows sampling $x_t$ directly from $x_0$:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$. We can rearrange this to express $x_0$ in terms of $x_t$ and $\epsilon$:
$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon\big)$$

Since our network $\epsilon_\theta(x_t, t)$ is trained to predict $\epsilon$, we can get an estimate of $x_0$, let's call it $\hat{x}_0$, using the network's output:
$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)\big)$$

This gives us an approximation of the original image based on the current noisy image $x_t$ and the predicted noise $\epsilon_\theta(x_t, t)$.
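A small helper illustrating this estimate might look like the sketch below. The function name `estimate_x0` and the argument `eps_pred` (standing in for the output of $\epsilon_\theta(x_t, t)$) are our own labels, and the schedule tensors are those defined in the earlier sketch.

```python
def estimate_x0(x_t, t, eps_pred, alphas_bar):
    """Estimate the clean sample x0 from x_t and the predicted noise.

    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)
    """
    a_bar_t = alphas_bar[t]
    return (x_t - torch.sqrt(1.0 - a_bar_t) * eps_pred) / torch.sqrt(a_bar_t)
```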
Now we substitute this estimate $\hat{x}_0$ back into the equation for the posterior mean $\tilde{\mu}_t(x_t, x_0)$. This gives us the mean for our parameterized reverse step, $\mu_\theta(x_t, t)$:
$$\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\left(\frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)\big)\right) + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t$$

This looks complicated, but after some algebraic simplification (using the relationships $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$), it reduces to a much cleaner form:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

This equation is the heart of the denoising step. It tells us how to calculate the mean of the distribution for the previous timestep $x_{t-1}$, given the current state $x_t$ and the noise $\epsilon_\theta(x_t, t)$ predicted by our U-Net model for that timestep.
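In code, the simplified mean is a direct translation of this formula. Again, the helper name is illustrative, and the schedule tensors are those from the earlier sketch.

```python
def posterior_mean(x_t, t, eps_pred, betas, alphas, alphas_bar):
    """mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)."""
    beta_t = betas[t]
    alpha_t = alphas[t]
    a_bar_t = alphas_bar[t]
    return (x_t - beta_t / torch.sqrt(1.0 - a_bar_t) * eps_pred) / torch.sqrt(alpha_t)
```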
For the variance of the reverse step $p_\theta(x_{t-1} \mid x_t)$, we need to choose a value $\sigma_t^2$. The DDPM paper proposes using the variance derived from the posterior $q(x_{t-1} \mid x_t, x_0)$, which is $\sigma_t^2 = \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$. Another option, explored later, is simply setting $\sigma_t^2 = \beta_t$. This variance term introduces stochasticity into the generation process; sampling from this distribution involves adding Gaussian noise scaled by $\sigma_t$.
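Using the schedule tensors from the earlier sketch, both variance options are a single line each:

```python
# Two common choices for the reverse-step variance sigma_t^2:
sigma_sq_posterior = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas  # beta_tilde_t
sigma_sq_simple = betas                                                     # just beta_t
```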
So, a single step in the reverse diffusion process is defined by sampling $x_{t-1}$ from the Gaussian distribution:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 I\big)$$

where $\mu_\theta(x_t, t)$ is the mean computed from the predicted noise as above, and $\sigma_t^2$ is the chosen variance ($\tilde{\beta}_t$ or $\beta_t$). In practice, sampling from this distribution means computing $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$ with $z \sim \mathcal{N}(0, I)$.
By repeatedly applying this formula, starting from pure noise $x_T \sim \mathcal{N}(0, I)$ and stepping backwards from $t = T$ down to $t = 1$, we can generate a sample $x_0$ that should resemble the data the model was trained on. This iterative process, guided at each step by the neural network's noise prediction, is how diffusion models generate new data. We will explore the full sampling algorithms in Chapter 5.
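As a preview of what Chapter 5 covers in detail, a minimal sketch of the full reverse loop might look like the following. It reuses the helpers defined above; `eps_model` stands in for a trained U-Net that returns predicted noise with the same shape as its input, and the dummy model at the end exists only so the sketch runs end to end.

```python
@torch.no_grad()
def sample(eps_model, shape, betas, alphas, alphas_bar, posterior_variance):
    """Generate samples by iterating the reverse step from index T-1 down to 0 (0-based timesteps)."""
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = eps_model(x, t_batch)                # epsilon_theta(x_t, t)
        mean = posterior_mean(x, t, eps_pred, betas, alphas, alphas_bar)
        if t > 0:
            noise = torch.randn_like(x)
            x = mean + torch.sqrt(posterior_variance[t]) * noise
        else:
            x = mean                                    # no noise is added at the final step
    return x

# Usage with a stand-in network (a real trained U-Net would replace this):
dummy_eps_model = lambda x, t: torch.zeros_like(x)
samples = sample(dummy_eps_model, (4, 3, 32, 32), betas, alphas, alphas_bar, posterior_variance)
```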