While Generative Adversarial Networks (GANs) represent one major family of generative models, characterized by their adversarial training dynamic, Diffusion Models offer a distinct and increasingly powerful approach. Introduced more recently, these models have rapidly achieved state-of-the-art results, particularly in image synthesis, often producing high-fidelity samples with remarkable diversity. Unlike the sometimes unstable min-max game of GANs, diffusion models typically offer a more stable training process, although often at the cost of slower sampling speeds.
The fundamental idea behind diffusion models involves two processes: a fixed forward process and a learned reverse process.
Imagine taking a clean data sample, like an image $x_0$ drawn from the true data distribution $q(x_0)$. The forward process systematically degrades this sample over a series of $T$ timesteps by gradually adding small amounts of Gaussian noise. This defines a Markov chain where the state at timestep $t$, denoted $x_t$, depends only on the state at the previous timestep $x_{t-1}$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$
Here, $\beta_t$ represents a small positive constant determining the variance of the noise added at step $t$. The sequence $\beta_1, \dots, \beta_T$ is known as the variance schedule, which is typically predefined (e.g., increasing linearly or quadratically).
A useful property of this process is that we can directly sample $x_t$ conditioned on the original data $x_0$ without iterating through all intermediate steps. If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, then the distribution $q(x_t \mid x_0)$ is also Gaussian:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$
As $t$ increases towards $T$, the cumulative effect of adding noise means $\bar{\alpha}_t$ approaches zero. Consequently, $x_T$ loses almost all information about the original $x_0$ and effectively becomes indistinguishable from pure Gaussian noise, $\mathcal{N}(0, \mathbf{I})$. The forward process is fixed; it requires no training.
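Because $q(x_t \mid x_0)$ has this closed form, the forward process takes only a few lines of code. The sketch below is a minimal PyTorch illustration; the linear schedule endpoints ($10^{-4}$ to $0.02$ over $1000$ steps) follow common DDPM defaults and are assumptions here, not values fixed by this section.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear variance schedule beta_1..beta_T
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # alpha_bar_t = product of alpha_s up to t

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) in a single step using the closed form above."""
    if noise is None:
        noise = torch.randn_like(x0)        # epsilon ~ N(0, I)
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```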
Diagram illustrating the fixed forward process, where data $x_0$ is gradually noised over $T$ steps according to the transition probability $q(x_t \mid x_{t-1})$ until it resembles pure noise $x_T$.
The core generative task lies in the reverse process. We want to start with random noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ and reverse the noising steps to gradually denoise it, eventually obtaining a sample that looks like it came from the original data distribution $q(x_0)$. This involves learning the transition probabilities $p(x_{t-1} \mid x_t)$ for $t = T, \dots, 1$.
If the noise steps in the forward process are sufficiently small, the true reverse transition $q(x_{t-1} \mid x_t)$ can also be shown to be approximately Gaussian. However, calculating this true reverse transition requires knowing the original data $x_0$, which is precisely what we want to generate.
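To see where the dependence on $x_0$ enters, note that the forward posterior conditioned on $x_0$ has a closed form (a standard result in the diffusion literature):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right), \qquad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Without access to $x_0$, these quantities cannot be computed directly, which motivates the learned approximation below.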
Therefore, we approximate these true reverse transitions using a neural network, parameterized by $\theta$. This network takes the noisy sample $x_t$ and the current timestep $t$ as input and learns to predict the parameters (typically the mean and variance) of the distribution $p_\theta(x_{t-1} \mid x_t)$:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Instead of directly predicting the mean $\mu_\theta(x_t, t)$, a common and effective parameterization is to train the network, often denoted $\epsilon_\theta(x_t, t)$, to predict the noise $\epsilon$ that was added to get from $x_0$ to $x_t$ (or more accurately, the noise component corresponding to $x_t$ based on $q(x_t \mid x_0)$). Given the predicted noise $\epsilon_\theta(x_t, t)$, we can derive the parameters for $p_\theta(x_{t-1} \mid x_t)$.
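Concretely, under the standard DDPM parameterization, the predicted noise determines the mean of the reverse transition via:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$

with the variance often fixed to a schedule-dependent constant such as $\sigma_t^2 = \beta_t$ rather than learned.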
The parameters $\theta$ of the reverse process network are optimized by maximizing the likelihood of the training data. This often involves optimizing a variational lower bound (ELBO) on the log-likelihood. For diffusion models, this objective can often be simplified to a much more tractable form. A common simplified objective aims to minimize the mean squared error between the actual noise $\epsilon$ added during the forward process (which can be easily sampled given $x_0$ and $t$) and the noise predicted by the network $\epsilon_\theta$:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ is generated using the direct forward sampling formula. Training involves repeatedly sampling a data point $x_0 \sim q(x_0)$, a timestep $t \sim \text{Uniform}(\{1, \dots, T\})$, sampling noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, constructing the corresponding $x_t$, and performing a gradient descent step on this loss.
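A single training step is compact in code. The sketch below assumes `model` is any network mapping $(x_t, t)$ to predicted noise (a U-Net in practice); the function name and signature are illustrative, and `T` and `alpha_bars` are reused from the forward-process snippet above.

```python
def training_step(model, x0, optimizer):
    """One gradient step on the simplified objective L_simple."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,))                    # t ~ Uniform over timesteps (0-indexed)
    eps = torch.randn_like(x0)                           # epsilon ~ N(0, I)
    a_bar = alpha_bars[t].view(batch, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # direct forward sample of x_t
    loss = ((eps - model(x_t, t)) ** 2).mean()           # MSE between true and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```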
Once the model is trained, generation (sampling) proceeds by:

1. Sampling pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$.
2. Iterating from $t = T$ down to $t = 1$: predicting the noise $\epsilon_\theta(x_t, t)$, computing the mean $\mu_\theta(x_t, t)$, and sampling $x_{t-1} \sim \mathcal{N}(\mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})$, with no noise added at the final step.
3. Returning $x_0$ as the generated sample.
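The loop below is a minimal DDPM-style sampler under the same assumptions as the earlier snippets (`model`, `T`, `betas`, `alphas`, `alpha_bars` as defined above); fixing the reverse variance to $\sigma_t^2 = \beta_t$ is one common choice, not the only one.

```python
@torch.no_grad()
def sample(model, shape):
    """Generate a sample by iterating the learned reverse process."""
    x = torch.randn(shape)                                # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((shape[0],), t)              # timestep input for the network
        eps = model(x, t_batch)                           # predicted noise epsilon_theta
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # add sigma_t * z
        else:
            x = mean                                      # no noise at the final step
    return x                                              # approximate sample from q(x_0)
```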
Diagram comparing the fixed forward noising process $q(x_t \mid x_{t-1})$ with the learned reverse denoising process $p_\theta(x_{t-1} \mid x_t)$. The reverse process starts from noise $x_T$ and uses a trained model (optimized via the training objective $L_{\text{simple}}$) to iteratively sample less noisy states, ultimately producing a generated sample $x_0$.
This iterative noise-and-denoise framework forms the basis of diffusion models. While the framework itself is straightforward, the specific choices of variance schedules, network architectures (often U-Nets), and training objectives lead to different model variants, such as Denoising Diffusion Probabilistic Models (DDPMs) and score-based models, which we will explore in detail in Chapter 4. Understanding this core mechanism is essential before implementing and optimizing these advanced generative techniques.