The forward process in diffusion models involves gradually adding noise to data through a fixed procedure. For generative AI, the objective is to reverse this process: starting from pure noise $x_T$, the aim is to sequentially sample $x_{T-1}$, then $x_{T-2}$, and so on, until a clean data sample $x_0$ is obtained that resembles the original data distribution. Achieving this requires calculating the reverse transition probabilities, $q(x_{t-1} \mid x_t)$.
If we could compute $q(x_{t-1} \mid x_t)$ exactly, we could sample from it iteratively to generate data. However, calculating this distribution directly poses a significant challenge. Why? Because the probability of transitioning back to a less noisy state depends not just on the current noisy state $x_t$, but implicitly on the entire distribution of possible starting data points $x_0$. Mathematically, evaluating $q(x_{t-1} \mid x_t)$ involves marginalizing over all possible initial data points $x_0$:

$$q(x_{t-1} \mid x_t) = \int q(x_{t-1} \mid x_t, x_0) \, q(x_0 \mid x_t) \, dx_0$$
The term $q(x_0 \mid x_t)$ represents the probability of a specific starting data point $x_0$ given the noisy version $x_t$. Calculating it requires knowledge of the (unknown) true data distribution $q(x_0)$ and involves complex integration, making the true reverse probability $q(x_{t-1} \mid x_t)$ intractable to compute for complex datasets.
Interestingly, the situation changes if we know the starting point $x_0$ that led to $x_t$. The reverse conditional probability $q(x_{t-1} \mid x_t, x_0)$ is tractable. Using Bayes' theorem and the properties of the Gaussian noise added during the forward process (defined by the noise schedule $\beta_t$), we can show that this distribution is also a Gaussian:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t \mathbf{I}\right)$$
Here, $\mathcal{N}(x_{t-1}; \tilde{\mu}_t, \tilde{\beta}_t \mathbf{I})$ denotes a Gaussian distribution with mean $\tilde{\mu}_t$ and diagonal covariance $\tilde{\beta}_t \mathbf{I}$. The variance $\tilde{\beta}_t$ and the mean $\tilde{\mu}_t(x_t, x_0)$ depend only on the known forward process noise schedule parameters ($\beta_t$, or equivalently $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$) and the specific values of $x_t$ and $x_0$. Specifically:

$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$$
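To make these formulas concrete, here is a minimal NumPy sketch that evaluates $\tilde{\beta}_t$ and $\tilde{\mu}_t(x_t, x_0)$ for a given timestep. It assumes a simple linear $\beta_t$ schedule; the schedule values, array layout, and function name are illustrative choices, not prescribed by the text.

```python
import numpy as np

# Assumed linear noise schedule (T = 1000, beta in [1e-4, 0.02], as commonly used in DDPM).
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_t for t = 1..T (stored at index t-1)
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} alpha_s

def posterior_mean_variance(x_t, x_0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for a timestep t in {2, ..., T}."""
    beta_t = betas[t - 1]
    alpha_t = alphas[t - 1]
    alpha_bar_t = alpha_bars[t - 1]
    alpha_bar_prev = alpha_bars[t - 2]  # alpha_bar_{t-1}

    # tilde_beta_t: depends only on the schedule, not on the data.
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t

    # tilde_mu_t: a fixed-weight combination of x_0 and x_t.
    mean = (np.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)) * x_0 \
         + (np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)) * x_t
    return mean, var
```

Note that the variance comes entirely from the schedule, while the mean mixes $x_0$ and $x_t$ with weights that are fixed once the schedule is chosen.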
These equations tell us that if we knew the original image $x_0$ corresponding to a noisy image $x_t$, we could precisely calculate the distribution of the slightly less noisy image $x_{t-1}$ it came from.
But here's the catch: during the actual generative process (sampling), we don't have $x_0$. Our goal is precisely to generate $x_0$ starting from pure noise $x_T$. Knowing $x_0$ in advance would defeat the purpose.
Since the true reverse transition $q(x_{t-1} \mid x_t)$ is intractable, and the conditional reverse transition $q(x_{t-1} \mid x_t, x_0)$ requires the unknown $x_0$, we need an alternative. The solution is to approximate the true reverse transition using a learned model.
We introduce a parameterized distribution, typically represented by a neural network and denoted $p_\theta(x_{t-1} \mid x_t)$, to approximate the intractable true posterior $q(x_{t-1} \mid x_t)$. The parameters $\theta$ of this network will be learned from data.
Diagram illustrating the relationship between the intractable true reverse transition $q(x_{t-1} \mid x_t)$, the tractable conditional reverse transition $q(x_{t-1} \mid x_t, x_0)$ (requiring $x_0$), and the learned approximation $p_\theta(x_{t-1} \mid x_t)$ using a neural network.
This neural network takes the current noisy state $x_t$ and the current timestep $t$ as input and outputs the parameters of the approximate reverse distribution $p_\theta(x_{t-1} \mid x_t)$. Since the tractable conditional reverse distribution $q(x_{t-1} \mid x_t, x_0)$ is Gaussian, it's convenient and effective to also model our approximation as a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$$
Here, $\mu_\theta(x_t, t)$ is the mean and $\Sigma_\theta(x_t, t)$ is the covariance matrix, both predicted by the neural network with parameters $\theta$. The network's task is therefore to learn functions $\mu_\theta$ and $\Sigma_\theta$ that accurately predict the parameters of the reverse distribution for a given $x_t$ and $t$.
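As a rough sketch of this parameterization, the toy PyTorch module below maps $(x_t, t)$ to a predicted mean $\mu_\theta(x_t, t)$. Real diffusion models typically use a U-Net with timestep embeddings, so the architecture, layer sizes, and names here are placeholder assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ReverseMeanModel(nn.Module):
    """Toy stand-in for mu_theta(x_t, t): a small MLP over flattened x_t plus the timestep."""

    def __init__(self, data_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim + 1, hidden_dim),  # +1 input feature for the (normalized) timestep
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, data_dim),      # predicted mean has the same shape as x_t
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch, data_dim), t: (batch,) integer timesteps
        t_input = t.float().unsqueeze(-1) / 1000.0   # crude timestep conditioning (assumed T = 1000)
        return self.net(torch.cat([x_t, t_input], dim=-1))
```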
In many successful diffusion models, like the original Denoising Diffusion Probabilistic Models (DDPM), the covariance matrix $\Sigma_\theta(x_t, t)$ is not learned directly. Instead, it's often fixed to a value related to the forward process variance, such as $\sigma_t^2 \mathbf{I} = \beta_t \mathbf{I}$ or $\sigma_t^2 \mathbf{I} = \tilde{\beta}_t \mathbf{I}$. This simplifies the learning problem significantly: the neural network only needs to learn the mean $\mu_\theta(x_t, t)$ of the reverse transition.
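With a fixed variance, one reverse sampling step reduces to drawing from a Gaussian centered at the predicted mean. The sketch below assumes the hypothetical `ReverseMeanModel` and the `betas` / `alpha_bars` schedule arrays from the earlier sketches, and uses the choice $\sigma_t^2 = \tilde{\beta}_t$; it is one possible implementation under those assumptions, not the definitive one.

```python
import math

import torch

def reverse_step(model, x_t, t):
    """One reverse step: sample x_{t-1} ~ p_theta(x_{t-1} | x_t) with fixed sigma_t^2 = tilde_beta_t."""
    with torch.no_grad():
        t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
        mean = model(x_t, t_batch)                        # mu_theta(x_t, t)

    # Fixed variance taken from the forward-process schedule (tilde_beta_t).
    alpha_bar_t = alpha_bars[t - 1]
    alpha_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
    sigma = math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * betas[t - 1])

    noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)  # no noise when producing x_0
    return mean + sigma * noise
```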
The next step is to understand how we parameterize this neural network, specifically how it predicts the mean $\mu_\theta(x_t, t)$, and how we train it to effectively reverse the diffusion process. As we will see, a common and effective strategy is to train the network to predict the noise $\epsilon$ that was added at timestep $t$.