Diffusion models offer a specific approach within generative AI for creating new data. Their core strategy is surprisingly intuitive, involving a pair of processes: one that systematically destroys structure in data and another that learns to undo the destruction.
Imagine you have a clear, high-resolution image. This is your starting point, let's call it $x_0$. The first process, called the forward process or diffusion process, gradually adds a small amount of noise (typically Gaussian noise) to this image over a large number of discrete time steps, $T$. At each step $t$, we add just enough noise that the change is subtle. If you watched this process unfold, you'd see the image slowly lose its features and structure, becoming progressively noisier. After many steps (where $T$ might be hundreds or thousands), the resulting image, $x_T$, bears no resemblance to the original $x_0$. It is effectively pure, unstructured noise, similar to a sample from a standard Gaussian distribution. This forward process is fixed; it involves no learning. It is simply a predefined mechanism for degrading data into noise.
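To make this concrete, here is a minimal sketch of the forward process in PyTorch. It assumes a linear variance schedule for the per-step noise (a common but not the only choice) and uses the standard closed-form expression that lets you jump directly from $x_0$ to a noisy $x_t$; the names (`betas`, `alpha_bars`, `q_sample`) are illustrative, not from any particular library, and the exact formulation is derived in later chapters.

```python
import torch

# Illustrative linear variance schedule: beta_t grows from 1e-4 to 0.02 over T steps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product, often written alpha_bar_t

def q_sample(x0, t, noise=None):
    """Sample x_t from q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_ab = alpha_bars[t].sqrt().view(-1, 1, 1, 1)              # broadcast over image dims
    sqrt_one_minus_ab = (1.0 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

# Usage sketch: noise a batch of images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)                 # stand-in for a batch of real images
t = torch.randint(0, T, (8,))                  # a random timestep per image
xt = q_sample(x0, t)
```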
The magic happens in the second process, the reverse process or denoising process. Here, the goal is to learn how to undo the noising procedure. We start with the pure noise sample $x_T$ (which, importantly, we can easily draw from a known distribution such as a Gaussian). The model then attempts to invert the forward process: starting from $x_T$, it iteratively predicts a slightly less noisy version $x_{T-1}$, then uses that to predict $x_{T-2}$, and so on, all the way back to $x_0$. If the model can learn this step-by-step denoising procedure, it can generate a realistic-looking data sample starting from random noise.
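Conceptually, sampling is just a loop that walks the chain backwards. The sketch below assumes a `denoise_step` function (sketched after the next paragraph) that maps a noisy sample $x_t$ to an estimate of $x_{t-1}$; everything else is plumbing.

```python
@torch.no_grad()
def sample(model, shape, T):
    """Generate a sample by starting from pure Gaussian noise x_T
    and applying the learned reverse step T times."""
    x = torch.randn(shape)             # x_T ~ N(0, I)
    for t in reversed(range(T)):       # t = T-1, ..., 0
        x = denoise_step(model, x, t)  # estimate x_{t-1} from x_t
    return x                           # approximate sample of x_0
```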
This reverse process is where the learning occurs. A neural network is trained to predict the noise that was added at each step of the forward process, given the noisy data $x_t$. More precisely, the network typically takes the noisy data $x_t$ and the current timestep $t$ as input and outputs an estimate of the noise component that was added to get from $x_{t-1}$ to $x_t$. By subtracting this predicted noise (or using it to estimate the mean of the previous state), the model can approximate the transition from $x_t$ back to $x_{t-1}$. Repeating this procedure $T$ times, starting from random noise $x_T$, generates a new data sample $x_0$.
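One common way to turn the predicted noise into the previous state is the DDPM-style update: recover the mean of $x_{t-1}$ from $x_t$ and the network's noise estimate, then add a small amount of fresh noise (except at the final step). The sketch below assumes the `betas`/`alphas`/`alpha_bars` schedule from the forward-process snippet and a hypothetical network `model(x, t)` that returns a noise estimate with the same shape as `x`; the exact coefficients are derived in the later chapters on the probabilistic framework.

```python
@torch.no_grad()
def denoise_step(model, x, t):
    """One reverse step: use the predicted noise eps_theta(x_t, t) to
    estimate the mean of x_{t-1}, then add noise scaled by sigma_t."""
    t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
    eps = model(x, t_batch)                  # network's estimate of the added noise
    beta_t = betas[t]
    alpha_t = alphas[t]
    alpha_bar_t = alpha_bars[t]
    # Mean of the reverse transition under the standard noise-prediction parameterization.
    mean = (x - beta_t / (1.0 - alpha_bar_t).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                          # no noise is added at the final step
    sigma_t = beta_t.sqrt()                  # one common choice for the reverse variance
    return mean + sigma_t * torch.randn_like(x)
```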
The diagram below illustrates this two-part structure:
This diagram shows the fixed forward process transforming data $x_0$ into noise $x_T$ by adding noise incrementally. The learned reverse process starts from noise $x_T$ and uses a neural network at each step to predict and remove noise, eventually generating a sample $x_0$.
This noise-and-denoise approach differs significantly from VAEs, which use an encoder-decoder structure to map data to and from a latent space, or GANs, which rely on a generator and discriminator competing against each other. Diffusion models directly learn to reverse a data destruction process, which often leads to stable training and high-quality sample generation, addressing some limitations of earlier methods.
The forward process is mathematically well-defined and tractable. The core challenge, and where the neural network comes in, is learning the reverse denoising steps. In the following chapters, we will examine the precise mathematical formulation of both the forward and reverse processes, explore the neural network architectures commonly used (like the U-Net), understand the training objective derived from a probabilistic framework, and finally, see how to implement the sampling procedure to generate new data.