The true reverse transition probability $q(x_{t-1} \mid x_t)$ represents the distribution needed for sampling to generate data, moving backward from noise towards a clean sample $x_0$. Calculating this distribution, however, requires knowing the original data distribution $q(x_0)$, which makes it intractable.
Our solution is to approximate this reverse transition using a neural network. We define a parameterized distribution $p_\theta(x_{t-1} \mid x_t)$ that our model will learn. Since the forward process adds Gaussian noise, a reasonable choice for the reverse transition is also a Gaussian distribution:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big)$$
Here, $\mu_\theta(x_t, t)$ is the mean and $\Sigma_\theta(x_t, t)$ is the covariance matrix of the Gaussian distribution for $x_{t-1}$, given $x_t$ at timestep $t$. Both the mean and covariance could, in principle, depend on the noisy data $x_t$ and the timestep $t$, and they are determined by the parameters $\theta$ of our neural network.
The task of the neural network is therefore to predict the parameters of this Gaussian distribution for each timestep $t$.
There are two common ways to parameterize the mean $\mu_\theta(x_t, t)$:

1. Predict the mean $\mu_\theta(x_t, t)$ directly with the network.
2. Predict the noise $\epsilon$ that was added during the forward process, and compute the mean from that prediction.
Let's focus on the second approach, predicting the noise. We design a neural network, often denoted as $\epsilon_\theta$, which takes the current noisy sample $x_t$ and the timestep $t$ as input and outputs a predicted noise component $\epsilon_\theta(x_t, t)$.
Why is predicting the noise useful? Recall from the forward process (Chapter 2, "Sampling from Intermediate Steps") that the posterior distribution $q(x_{t-1} \mid x_t, x_0)$ is tractable and also Gaussian. Its mean has a specific form. While we don't know $x_0$ during the reverse process, we can use the relationship derived for $x_t$, namely $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, and substitute $x_0$ with an estimate derived from $x_t$ and the predicted noise $\epsilon_\theta(x_t, t)$. This leads to the following expression for the mean of our parameterized reverse distribution $p_\theta(x_{t-1} \mid x_t)$:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$
Where:

- $\alpha_t = 1 - \beta_t$, with $\beta_t$ the forward process variance at step $t$
- $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$
- $\epsilon_\theta(x_t, t)$ is the noise predicted by the network
This equation provides a direct way to calculate the mean $\mu_\theta(x_t, t)$ of the denoising step using the current state $x_t$ and the network's noise prediction $\epsilon_\theta(x_t, t)$. The network's primary job is therefore to learn the noise $\epsilon$ that was mixed into $x_0$ to produce $x_t$, given only the result $x_t$ and the timestep $t$.
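The mean computation above is a simple closed-form expression. Here is a minimal NumPy sketch, assuming a linear $\beta_t$ schedule; the schedule values and the name `posterior_mean` are illustrative, not part of any particular library:

```python
import numpy as np

# Hypothetical linear noise schedule (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # beta_t
alphas = 1.0 - betas                     # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)          # alpha_bar_t = prod_{s<=t} alpha_s

def posterior_mean(x_t, eps_pred, t):
    """Mean mu_theta(x_t, t) computed from x_t and the predicted noise.
    Arrays are 0-indexed: index t corresponds to timestep t+1 in the text."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])  # (1 - alpha_t) / sqrt(1 - alpha_bar_t)
    return (x_t - coef * eps_pred) / np.sqrt(alphas[t])
```

Note that if the predicted noise is zero, the mean reduces to $x_t / \sqrt{\alpha_t}$, a pure rescaling of the current sample.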
What about the covariance matrix $\Sigma_\theta(x_t, t)$? While the network could also learn the variance, Ho et al. (authors of DDPM) found that fixing the variance works well and simplifies the model. The covariance is typically set to be diagonal, $\Sigma_\theta(x_t, t) = \sigma_t^2 I$, where $I$ is the identity matrix. The variance $\sigma_t^2$ is often chosen based on the forward process variances:

1. $\sigma_t^2 = \beta_t$
2. $\sigma_t^2 = \tilde{\beta}_t = \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$
Both options connect the reverse process variance to the noise schedule $\beta_t$ used in the forward process. Using $\tilde{\beta}_t$ corresponds to the variance of the true posterior $q(x_{t-1} \mid x_t, x_0)$ when $x_0$ is known, while using $\beta_t$ performs well empirically, especially for the objective function we'll discuss later. For simplicity, $\sigma_t^2 = \beta_t$ is a common choice. Fixing the variance means the neural network only needs to predict the noise $\epsilon_\theta(x_t, t)$ to define the reverse transition.
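Both variance options can be computed directly from the noise schedule. A short NumPy sketch, again assuming an illustrative linear schedule (the convention $\bar{\alpha}_0 = 1$ handles the first step):

```python
import numpy as np

# Hypothetical linear noise schedule (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
# alpha_bar_{t-1}, with the convention alpha_bar_0 = 1
alpha_bars_prev = np.append(1.0, alpha_bars[:-1])

# Option 1: sigma_t^2 = beta_t
sigma2_beta = betas
# Option 2: sigma_t^2 = beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
sigma2_beta_tilde = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas
```

Since $\bar{\alpha}_{t-1} \geq \bar{\alpha}_t$, the posterior variance $\tilde{\beta}_t$ is never larger than $\beta_t$; the two options bracket the variance of the true reverse transition.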
The neural network needs to process an input $x_t$ (which usually has the same dimensions as the original data, e.g., an image) and the timestep $t$ (a scalar). It must then output a tensor of the same shape as $x_t$, representing the predicted noise $\epsilon_\theta(x_t, t)$.
A very successful architecture for this task is the U-Net, which we will explore in detail in Chapter 4. The U-Net architecture is well-suited for image-like data and allows for effective integration of the timestep information $t$.
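Whatever architecture is used, the input/output contract is fixed. This placeholder (an untrained, illustrative stand-in, not a real U-Net) shows only the shape requirements a real $\epsilon_\theta$ must satisfy:

```python
import numpy as np

def toy_eps_model(x_t, t):
    """Stand-in for eps_theta(x_t, t): any real architecture (such as the
    U-Net of Chapter 4) must map (x_t, t) to a tensor with the same shape
    as x_t. The function body here is purely illustrative."""
    # Mimic time conditioning by scaling a function of x_t with t.
    return np.tanh(x_t) * (t / 1000.0)
```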
Diagram illustrating the role of the neural network in parameterizing the reverse process. It takes the noisy data $x_t$ and timestep $t$ to predict the noise $\epsilon_\theta(x_t, t)$. This prediction is then used, along with fixed parameters from the noise schedule ($\alpha_t$, $\bar{\alpha}_t$), to calculate the mean $\mu_\theta(x_t, t)$ of the approximate reverse transition $p_\theta(x_{t-1} \mid x_t)$. The variance is typically fixed.
In summary, we approximate the intractable reverse distribution $q(x_{t-1} \mid x_t)$ with a learned Gaussian distribution $p_\theta(x_{t-1} \mid x_t)$. We use a neural network $\epsilon_\theta(x_t, t)$ to predict the noise component, which then allows us to compute the mean $\mu_\theta(x_t, t)$. The variance is usually fixed based on the forward process noise schedule. This setup forms the basis for training the diffusion model, which we will cover next.
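Putting the pieces together, one reverse step draws $x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 I)$. A minimal NumPy sketch with $\sigma_t^2 = \beta_t$, assuming an illustrative linear schedule and any noise-prediction function `eps_pred` (here passed in as a precomputed array):

```python
import numpy as np

# Hypothetical linear noise schedule (values are illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(x_t, eps_pred, t, rng):
    """One sample x_{t-1} ~ p_theta(x_{t-1} | x_t) with sigma_t^2 = beta_t.
    Arrays are 0-indexed: index t corresponds to timestep t+1 in the text."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # final step: return the mean, no noise added
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

Iterating `reverse_step` from $t = T$ down to $t = 1$, starting from pure Gaussian noise, is exactly the sampling procedure we will train the network to support.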