Okay, we've established that the reverse diffusion process aims to reverse the noising effects, starting from a noisy sample $x_t$ and trying to estimate the slightly less noisy sample $x_{t-1}$. We also know that the true reverse probability $q(x_{t-1} \mid x_t)$ is intractable, but we can approximate it using a neural network, $p_\theta(x_{t-1} \mid x_t)$. Our network $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ that was added to the original data $x_0$ to get $x_t$. Now, let's see how we use this predicted noise to define the mathematical operation for one step of denoising.
The goal is to define the distribution $p_\theta(x_{t-1} \mid x_t)$. It turns out that if we knew the original data point $x_0$, the true posterior distribution $q(x_{t-1} \mid x_t, x_0)$ could be calculated analytically and is a Gaussian distribution. While we won't go through the full derivation here (it involves applying Bayes' theorem to the Gaussian distributions of the forward process, as detailed in the original DDPM paper by Ho et al., 2020), the result is important. The posterior $q(x_{t-1} \mid x_t, x_0)$ is given by:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\right)$$

where the mean $\tilde{\mu}_t(x_t, x_0)$ and variance $\tilde{\beta}_t$ are:
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t \qquad\qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Remember that $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ are derived from the noise schedule $\beta_t$.
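To make these quantities concrete, here is a small sketch in PyTorch of how the schedule terms and the posterior coefficients could be precomputed. The linear schedule endpoints ($10^{-4}$ to $0.02$ over $T = 1000$ steps) follow the original DDPM paper; the variable names themselves are just illustrative choices, not part of any fixed API.

```python
import torch

# Linear noise schedule, as in Ho et al. (2020): beta_t for t = 1..T.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

alphas = 1.0 - betas                               # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)          # alpha_bar_t = prod_{i<=t} alpha_i
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])  # alpha_bar_{t-1}, with alpha_bar_0 = 1

# Coefficients of x_0 and x_t in the posterior mean mu_tilde_t,
# and the posterior variance beta_tilde_t, matching the equations above.
coef_x0 = torch.sqrt(alpha_bars_prev) * betas / (1.0 - alpha_bars)
coef_xt = torch.sqrt(alphas) * (1.0 - alpha_bars_prev) / (1.0 - alpha_bars)
posterior_variance = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```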
Now, during the reverse (generation) process, we don't have access to the original $x_0$. This is where our trained neural network $\epsilon_\theta(x_t, t)$ comes in. We use it to approximate $x_0$. Recall the forward process equation that allows sampling $x_t$ directly from $x_0$:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$. We can rearrange this to express $x_0$ in terms of $x_t$ and $\epsilon$:
$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right)$$

Since our network $\epsilon_\theta(x_t, t)$ is trained to predict $\epsilon$, we can get an estimate of $x_0$, let's call it $\hat{x}_0$, using the network's output:
$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right)$$

This gives us an approximation of the original image based on the current noisy image $x_t$ and the predicted noise $\epsilon_\theta(x_t, t)$.
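As a quick illustration, here is how that estimate might look in code, reusing the `alpha_bars` tensor from the earlier snippet. The `model` argument stands in for the trained $\epsilon_\theta(x_t, t)$ network; its exact interface (and the assumption of a single scalar timestep shared across the batch) is only for illustration.

```python
def estimate_x0(model, x_t, t, alpha_bars):
    """Estimate x_hat_0 from x_t using the predicted noise (the x_hat_0 equation above)."""
    eps_pred = model(x_t, t)           # eps_theta(x_t, t), the network's noise prediction
    a_bar = alpha_bars[t]              # alpha_bar_t for this (scalar) timestep
    return (x_t - torch.sqrt(1.0 - a_bar) * eps_pred) / torch.sqrt(a_bar)
```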
Now we substitute this estimate $\hat{x}_0$ back into the equation for the posterior mean $\tilde{\mu}_t(x_t, x_0)$. This gives us the mean for our parameterized reverse step, $\mu_\theta(x_t, t)$:
$$\mu_\theta(x_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\left(\frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right)\right) + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t$$

This looks complicated, but after some algebraic simplification (using the relationships $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$), it reduces to a much cleaner form:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

This equation is the heart of the denoising step. It tells us how to calculate the mean of the distribution for the previous timestep $x_{t-1}$, given the current state $x_t$ and the noise $\epsilon_\theta(x_t, t)$ predicted by our U-Net model for that timestep.
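If you want to verify the simplification yourself, one way to sketch it is to collect the $x_t$ and $\epsilon_\theta$ terms and then use $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ and $\beta_t + \alpha_t(1-\bar{\alpha}_{t-1}) = 1 - \bar{\alpha}_t$:

$$
\begin{aligned}
\mu_\theta(x_t, t)
&= \left[\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{(1-\bar{\alpha}_t)\sqrt{\bar{\alpha}_t}}
   + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\right] x_t
   - \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{\sqrt{\bar{\alpha}_t}\,\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \\
&= \frac{\beta_t + \alpha_t(1-\bar{\alpha}_{t-1})}{(1-\bar{\alpha}_t)\sqrt{\alpha_t}}\, x_t
   - \frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \\
&= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right).
\end{aligned}
$$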
For the variance of the reverse step $p_\theta(x_{t-1} \mid x_t)$, we need to choose a value $\sigma_t^2$. The DDPM paper proposes using the variance derived from the posterior $q(x_{t-1} \mid x_t, x_0)$, which is $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$. Another option explored later is simply setting $\sigma_t^2 = \beta_t$. This variance term introduces stochasticity into the generation process; sampling from this distribution involves adding Gaussian noise scaled by $\sigma_t$.
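As a sketch of how the two fixed variance choices could be expressed in code (the `option` flag and function name are illustrative, not part of any standard API), reusing the schedule tensors from before:

```python
def sigma_squared(t, betas, alpha_bars, alpha_bars_prev, option="posterior"):
    """Return sigma_t^2 for the reverse step at (scalar) timestep t."""
    if option == "posterior":
        # sigma_t^2 = beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
        return (1.0 - alpha_bars_prev[t]) / (1.0 - alpha_bars[t]) * betas[t]
    # sigma_t^2 = beta_t
    return betas[t]
```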
So, a single step in the reverse diffusion process is defined by sampling $x_{t-1}$ from the Gaussian distribution:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \sigma_t^2 I\right)$$

where $\mu_\theta(x_t, t)$ is the mean computed from the predicted noise as above, and $\sigma_t^2$ is the chosen variance ($\tilde{\beta}_t$ or $\beta_t$).
By repeatedly applying this formula, starting from pure noise $x_T \sim \mathcal{N}(0, I)$ and stepping backwards from $t = T$ down to $t = 1$, we can generate a sample $x_0$ that should resemble the data the model was trained on. This iterative process, guided at each step by the neural network's noise prediction, is how diffusion models generate new data. We will explore the full sampling algorithms in Chapter 5.
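To tie the pieces together, here is a minimal sketch of one reverse step and the overall loop, under the same assumptions as the earlier snippets (a scalar timestep `t` shared across the batch, a `model` implementing $\epsilon_\theta$, and 0-indexed timesteps in code, so the loop runs from $t = T-1$ down to $t = 0$). It uses the posterior variance $\tilde{\beta}_t$; swapping in $\beta_t$ works the same way. Chapter 5 covers the full sampling algorithms in more detail.

```python
@torch.no_grad()
def p_sample(model, x_t, t, betas, alphas, alpha_bars, alpha_bars_prev):
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I) for one reverse step."""
    eps_pred = model(x_t, t)
    # mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_theta) / sqrt(alpha_t)
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps_pred) / torch.sqrt(alphas[t])
    if t == 0:
        return mean                                # no noise is added at the final step
    var = (1.0 - alpha_bars_prev[t]) / (1.0 - alpha_bars[t]) * betas[t]  # beta_tilde_t
    return mean + torch.sqrt(var) * torch.randn_like(x_t)

# Usage sketch: start from pure noise x_T and step backwards to x_0.
# The image shape here is only illustrative.
x = torch.randn(1, 3, 32, 32)
for t in reversed(range(T)):
    x = p_sample(model, x, t, betas, alphas, alpha_bars, alpha_bars_prev)
```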