In the previous chapter, we established the forward process, a fixed procedure for gradually adding noise to data. Our generative goal, however, is to reverse this: starting from noise xT∼N(0,I), we want to sample xT−1, then xT−2, and so on, until we arrive at a sample x0 that looks like it came from our original data distribution. This requires calculating the reverse transition probabilities, p(xt−1∣xt).
If we could compute p(xt−1∣xt) exactly, we could sample from it iteratively to generate data. However, calculating this distribution directly poses a significant challenge. Why? Because the probability of transitioning back to a less noisy state xt−1 depends not just on the current noisy state xt, but implicitly on the entire distribution of possible starting data points q(x0). Mathematically, evaluating p(xt−1∣xt) involves marginalizing over all possible initial data points x0:
$$p(x_{t-1} \mid x_t) = \int p(x_{t-1} \mid x_t, x_0)\, q(x_0 \mid x_t)\, dx_0$$

The term q(x0∣xt) represents the probability of a specific starting data point x0 given the noisy version xt. Calculating this requires knowledge of the (unknown) true data distribution q(x0) and involves complex integration, making the true reverse probability p(xt−1∣xt) intractable to compute for complex datasets.
Interestingly, the situation changes if we know the starting point x0 that led to xt. The reverse conditional probability p(xt−1∣xt,x0) is tractable. Using Bayes' theorem and the properties of the Gaussian noise added during the forward process q(xt∣xt−1) (defined by the noise schedule βt), we can show that this distribution is also a Gaussian:
$$p(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t I\big)$$

Here, N(x;μ,σ2I) denotes a Gaussian distribution with mean μ and diagonal covariance σ2I. The variance β~t and the mean μ~t(xt,x0) depend only on the known forward process noise schedule parameters (βt, or equivalently αt=1−βt and αˉt=∏s=1tαs) and the specific values of xt and x0. Specifically:
$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t, \qquad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t$$

These equations tell us that if we knew the original image x0 corresponding to a noisy image xt, we could precisely calculate the distribution of the slightly less noisy image xt−1 it came from.
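To make these formulas concrete, here is a minimal sketch of computing β~t and μ~t(xt,x0) in PyTorch. It assumes a simple linear βt schedule with T=1000 steps; the schedule values and the helper name posterior_parameters are illustrative choices, not part of the derivation.

```python
import torch

# A minimal sketch, assuming a linear beta schedule with T = 1000 steps.
# The schedule values and the helper name are illustrative, not prescribed.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t for t = 1..T
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s=1}^{t} alpha_s

def posterior_parameters(x_t, x_0, t):
    """Mean and variance of p(x_{t-1} | x_t, x_0) for a 1-indexed timestep t."""
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    a_bar_t = alphas_bar[t - 1]
    a_bar_prev = alphas_bar[t - 2] if t > 1 else torch.tensor(1.0)

    # Variance and mean of the tractable reverse conditional, as in the equations above.
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * beta_t
    mu_tilde = (torch.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)) * x_0 \
             + (torch.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)) * x_t
    return mu_tilde, beta_tilde
```

Note that both quantities depend only on the schedule and on the pair (xt, x0), which is exactly why this distribution is tractable once x0 is known.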
But here's the catch: during the actual generative process (sampling), we don't have x0. Our goal is precisely to generate x0 starting from pure noise xT. Knowing x0 would defeat the purpose.
Since the true reverse transition p(xt−1∣xt) is intractable, and the conditional reverse transition p(xt−1∣xt,x0) requires the unknown x0, we need an alternative. The solution is to approximate the true reverse transition using a learned model.
We introduce a parameterized distribution, typically represented by a neural network and denoted pθ(xt−1∣xt), to approximate the intractable true reverse transition p(xt−1∣xt). The parameters θ of this network are learned from data.
Diagram illustrating the relationship between the intractable true reverse transition, the tractable conditional reverse transition (requiring x0), and the learned approximation pθ(xt−1∣xt) using a neural network.
This neural network takes the current noisy state xt and the current timestep t as input and outputs the parameters of the approximate reverse distribution pθ(xt−1∣xt). Since the tractable conditional reverse distribution p(xt−1∣xt,x0) is Gaussian, it's convenient and effective to also model our approximation pθ(xt−1∣xt) as a Gaussian:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big)$$

Here, μθ(xt,t) is the mean and Σθ(xt,t) is the covariance matrix, both predicted by the neural network with parameters θ. The network's task is therefore to learn functions μθ and Σθ that accurately predict the parameters of the distribution for xt−1 given xt.
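As a rough illustration of this interface, the sketch below defines a toy network that maps (xt, t) to a predicted mean μθ(xt,t). The MLP layers, the class name ReverseMeanModel, and the timestep encoding are placeholder assumptions; practical image models typically use a U-Net backbone instead.

```python
import torch
import torch.nn as nn

# A toy sketch of the interface only: a network mapping (x_t, t) to the
# predicted mean mu_theta(x_t, t). The layers, class name, and timestep
# encoding are placeholder assumptions, not a reference architecture.
class ReverseMeanModel(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.SiLU(),
            nn.Linear(128, 128), nn.SiLU(),
            nn.Linear(128, dim),
        )

    def forward(self, x_t, t):
        # Condition on the timestep by appending a scaled copy of t to the input.
        t_feat = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))  # mu_theta(x_t, t)
```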
In many successful diffusion models, like the original Denoising Diffusion Probabilistic Models (DDPM), the covariance matrix Σθ(xt,t) is not learned directly. Instead, it's often fixed to a value related to the forward process variance, such as Σθ(xt,t)=β~tI or Σθ(xt,t)=βtI. This simplifies the learning problem significantly: the neural network only needs to learn the mean μθ(xt,t) of the reverse transition.
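With the covariance fixed, a single reverse sampling step might look like the sketch below, which reuses the schedule tensors and model interface from the earlier snippets and fixes the variance to β~tI; the function name reverse_step is illustrative.

```python
import torch

# A sketch of one reverse step x_t -> x_{t-1}, reusing the schedule tensors
# (betas, alphas, alphas_bar) and the model interface defined above. The
# variance is fixed to beta_tilde_t * I rather than learned.
@torch.no_grad()
def reverse_step(model, x_t, t):
    a_bar_t = alphas_bar[t - 1]
    a_bar_prev = alphas_bar[t - 2] if t > 1 else torch.tensor(1.0)
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - a_bar_t) * betas[t - 1]

    t_batch = torch.full((x_t.shape[0],), t)      # same timestep for the whole batch
    mu = model(x_t, t_batch)                      # predicted mean mu_theta(x_t, t)
    if t == 1:
        return mu                                 # no noise is added at the final step
    noise = torch.randn_like(x_t)
    return mu + torch.sqrt(beta_tilde) * noise    # sample from N(mu, beta_tilde * I)
```

Iterating this step from t=T down to t=1, starting from xT∼N(0,I), is the generative procedure described at the start of this section.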
The next step is to understand how we parameterize this neural network, specifically how it predicts the mean μθ(xt,t), and how we train it to effectively reverse the diffusion process. As we will see, a common and effective strategy is to train the network to predict the noise that was added at timestep t.