As established previously, the true reverse transition probability p(xt−1∣xt) is what we need to sample from to generate data, moving backward from noise xT towards a clean sample x0. However, calculating this distribution requires knowing the original data distribution q(x0), making it intractable.
Our solution is to approximate this reverse transition with a neural network. We define a parameterized distribution pθ(xt−1∣xt) that our model will learn. Because the forward process adds small amounts of Gaussian noise at each step, the true reverse transitions are themselves approximately Gaussian, which makes a Gaussian a natural choice for the parameterized reverse transition:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big)$$

Here, μθ(xt,t) is the mean and Σθ(xt,t) is the covariance matrix of the Gaussian distribution for xt−1, given xt at timestep t. Both the mean and covariance could, in principle, depend on the noisy data xt and the timestep t, and they are determined by the parameters θ of our neural network.
The task of the neural network is therefore to predict the parameters of this Gaussian distribution for each timestep t.
There are two common ways to parameterize the mean μθ(xt,t):

1. Train the network to predict the mean μθ(xt,t) directly.
2. Train the network to predict the noise ϵ that was added during the forward process, and compute the mean from that prediction.
Let's focus on the second approach, predicting the noise. We design a neural network, often denoted as ϵθ, which takes the current noisy sample xt and the timestep t as input and outputs a predicted noise component ϵθ(xt,t).
Why is predicting the noise useful? Recall from the forward process (Chapter 2, "Sampling from Intermediate Steps") that the posterior distribution q(xt−1∣xt,x0) is tractable and also Gaussian. Its mean is a specific weighted combination of xt and x0. While we don't know x0 during the reverse process, we can use the relationship derived for q(xt−1∣xt,x0) and substitute x0 with an estimate derived from xt and the predicted noise ϵθ(xt,t). This leads to the following expression for the mean of our parameterized reverse distribution pθ(xt−1∣xt):
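As a small illustration, the x0 estimate follows from inverting the forward-process relation xt = √ᾱt·x0 + √(1−ᾱt)·ϵ. The PyTorch sketch below uses hypothetical names (`estimate_x0` and its arguments are not from the text):

```python
import torch

def estimate_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean sample x_0 from the noisy sample x_t and the predicted
    noise, by inverting x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    alpha_bar_t is a (0-dim or broadcastable) tensor holding the value at timestep t."""
    return (x_t - torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_bar_t)
```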
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

Where:

- $\beta_t$ is the forward process variance at timestep $t$, taken from the noise schedule.
- $\alpha_t = 1 - \beta_t$.
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ is the cumulative product of the $\alpha_s$ up to timestep $t$.
- $\epsilon_\theta(x_t, t)$ is the noise predicted by the network.
This equation provides a direct way to calculate the mean of the denoising step xt−1 using the current state xt and the network's noise prediction ϵθ(xt,t). The network's primary job is to learn the correct noise ϵ that was added at step t, given the result xt.
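The mapping from the predicted noise to the mean translates directly into a few tensor operations. This is only a sketch (PyTorch; the function name and argument layout are assumptions, and the schedule values are passed in as tensors for timestep t):

```python
import torch

def reverse_mean(x_t, eps_pred, alpha_t, alpha_bar_t, beta_t):
    """Mean of p_theta(x_{t-1} | x_t), computed from the network's noise
    prediction according to the equation above. alpha_t, alpha_bar_t and
    beta_t are (0-dim) tensors holding the schedule values at timestep t."""
    coef = beta_t / torch.sqrt(1.0 - alpha_bar_t)
    return (x_t - coef * eps_pred) / torch.sqrt(alpha_t)
```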
What about the covariance matrix Σθ(xt,t)? While the network could also learn the variance, Ho et al. (the authors of DDPM) found that fixing the variance works well and simplifies the model. The covariance is typically set to be diagonal, Σθ(xt,t)=σt²I, where I is the identity matrix. The variance σt² is usually chosen in one of two ways, based on the forward process variances:

- $\sigma_t^2 = \beta_t$, the forward process variance at timestep $t$.
- $\sigma_t^2 = \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$, the variance of the true posterior $q(x_{t-1} \mid x_t, x_0)$.
Both options connect the reverse process variance to the noise schedule used in the forward process. Using β̃t corresponds to the variance of the true posterior q(xt−1∣xt,x0) when x0 is known, while using βt performs well empirically, especially for the Lsimple objective function we'll discuss later. For simplicity, σt²=βt is a common choice. Fixing the variance means the neural network only needs to predict the noise ϵθ(xt,t) to define the reverse transition.
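Both variants can be precomputed once from the noise schedule. The sketch below assumes a linear βt schedule (the endpoints and T = 1000 are only illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # forward process variances beta_t
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative products alpha_bar_t

# Option 1: sigma_t^2 = beta_t (the simple, commonly used choice).
sigma2_beta = betas

# Option 2: sigma_t^2 = beta_tilde_t, the true posterior variance
#   (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t, with alpha_bar_0 = 1.
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])
sigma2_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```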
The neural network ϵθ(xt,t) needs to process an input xt (which usually has the same dimensions as the original data, e.g., an image) and the timestep t (a scalar). It must then output a tensor ϵθ of the same shape as xt, representing the predicted noise.
A very successful architecture for this task is the U-Net, which we will explore in detail in Chapter 4. The U-Net architecture is well-suited for image-like data and allows for effective integration of the timestep information t.
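Before looking at the U-Net, the interface alone can be sketched with a toy model. The module below (a hypothetical MLP operating on flattened inputs, not the architecture used in practice) only demonstrates the required signature: it consumes xt and t and returns a tensor with the same shape as xt:

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Toy stand-in for epsilon_theta: any architecture is acceptable as long
    as its output matches the shape of x_t. A U-Net replaces this in practice."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        # Feed the (roughly normalized) timestep in as one extra input feature.
        t_feat = t.float().view(-1, 1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))

# Example: a batch of 8 flattened inputs with 64 values each, at timestep 500.
model = NoisePredictor(dim=64)
eps_pred = model(torch.randn(8, 64), torch.full((8,), 500))
assert eps_pred.shape == (8, 64)
```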
Diagram illustrating the role of the neural network ϵθ in parameterizing the reverse process. It takes the noisy data xt and timestep t to predict the noise ϵθ. This prediction is then used, along with fixed parameters from the noise schedule (αt,βt,αˉt), to calculate the mean μθ of the approximate reverse transition pθ(xt−1∣xt). The variance σt2 is typically fixed.
In summary, we approximate the intractable reverse distribution p(xt−1∣xt) with a learned Gaussian distribution pθ(xt−1∣xt)=N(xt−1;μθ(xt,t),σt2I). We use a neural network ϵθ(xt,t) to predict the noise component, which then allows us to compute the mean μθ(xt,t). The variance σt2 is usually fixed based on the forward process noise schedule. This setup forms the basis for training the diffusion model, which we will cover next.
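Combining the pieces, one reverse step under this parameterization looks roughly as follows (a sketch with hypothetical names; the full sampling loop and the training procedure come later):

```python
import torch

def reverse_step(model, x_t, t, alphas, alpha_bars, betas):
    """Sample x_{t-1} from p_theta(x_{t-1} | x_t): predict the noise, compute
    the mean, then add Gaussian noise with the fixed variance sigma_t^2 = beta_t.
    `t` is an integer timestep; the schedule tensors are indexed by it."""
    t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
    eps_pred = model(x_t, t_batch)

    alpha_t, alpha_bar_t, beta_t = alphas[t], alpha_bars[t], betas[t]
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)

    if t == 0:
        return mean  # by convention, no noise is added at the final step
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
```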