Having reviewed other generative model families like VAEs and GANs and considered some of the challenges they present, we now introduce the specific approach taken by diffusion models. At its heart, the diffusion model strategy is surprisingly intuitive, involving a pair of processes: one that systematically destroys structure in data and another that learns to undo the destruction.
Imagine you have a clear, high-resolution image. This is your starting point; let's call it $x_0$. The first process, called the forward process or diffusion process, gradually adds a small amount of noise (typically Gaussian noise) to this image over a large number of discrete time steps, $t = 1, 2, \dots, T$. At each step $t$, we add just enough noise so that the change is subtle. If you watched this process unfold, you'd see the image slowly lose its features and structure, becoming progressively noisier. After many steps (where $T$ might be hundreds or thousands), the resulting image, $x_T$, bears no resemblance to the original $x_0$. It effectively becomes pure, unstructured noise, similar to sampling from a standard Gaussian distribution. This forward process is fixed; it doesn't involve any learning. It's simply a predefined mechanism for degrading data into noise.
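In code, the forward process is just a loop that repeatedly mixes the current image with fresh Gaussian noise. Below is a minimal PyTorch sketch; the function name, the linear variance schedule, and its endpoint values are illustrative assumptions rather than anything prescribed so far:

```python
import torch

def forward_diffusion(x0, T=1000, beta_start=1e-4, beta_end=0.02):
    """Gradually corrupt a clean sample x0 into near-pure Gaussian noise.

    Each step mixes the previous state with a small amount of fresh
    Gaussian noise, scaled by a variance schedule beta_t. The linear
    schedule used here is an illustrative choice, not a canonical one.
    """
    betas = torch.linspace(beta_start, beta_end, T)
    x = x0
    trajectory = [x0]
    for t in range(T):
        noise = torch.randn_like(x)
        # Shrink the signal slightly and add a matching amount of noise,
        # so the overall variance stays controlled as t grows.
        x = torch.sqrt(1.0 - betas[t]) * x + torch.sqrt(betas[t]) * noise
        trajectory.append(x)
    return trajectory  # trajectory[-1] is x_T, close to pure noise
```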
The magic happens in the second process, the reverse process or denoising process. Here, the goal is to learn how to reverse the noising procedure. We start with the pure noise sample $x_T$ (which, importantly, we can easily sample from a known distribution like a Gaussian). The model then attempts to perform the opposite of the forward process: starting from $x_T$, it iteratively predicts a slightly less noisy version $x_{T-1}$, then uses that to predict $x_{T-2}$, and so on, all the way back to $x_0$. If the model can successfully learn this step-by-step denoising procedure, it can generate a realistic-looking data sample starting from random noise.
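Generation, then, is a loop running in the opposite direction: start from pure noise and repeatedly apply a learned denoising transition. Here is a sketch of that skeleton, where `denoise_step` is a hypothetical placeholder for the learned transition described next:

```python
import torch

@torch.no_grad()
def sample(denoise_step, shape, T=1000):
    """Generate data by iteratively denoising pure Gaussian noise.

    `denoise_step` stands in for the learned reverse transition that maps
    a noisy x_t (and the timestep t) to a slightly less noisy x_{t-1}.
    """
    x = torch.randn(shape)        # x_T: a sample of pure Gaussian noise
    for t in reversed(range(T)):  # t = T-1, T-2, ..., 0 (0-indexed steps)
        x = denoise_step(x, t)    # one learned denoising step
    return x                      # approximately a sample x_0
```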
This reverse process is where the learning occurs. A neural network is trained to predict the noise that was added at each step $t$ of the forward process, given the noisy data $x_t$. More precisely, the network typically takes the noisy data $x_t$ and the current timestep $t$ as input and outputs an estimate of the noise component that was added to get $x_t$ from $x_{t-1}$. By subtracting this predicted noise (or using it to estimate the mean of the previous state), the model can approximate the transition from $x_t$ back to $x_{t-1}$. Repeating this procedure $T$ times, starting from random noise $x_T$, generates a new data sample $x_0$.
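As a preview of the formulation developed in later chapters, the sketch below shows one such reverse step in the style used by denoising diffusion probabilistic models (DDPM). All names here are assumptions for illustration: `eps_model` stands for the trained noise-prediction network, and `betas` and `alpha_bars` for precomputed schedule tensors.

```python
import torch

def ddpm_denoise_step(eps_model, x_t, t, betas, alpha_bars):
    """One reverse step in the DDPM style.

    Assumed inputs (hypothetical names, not from the text):
      - eps_model(x_t, t): trained network predicting the added noise
      - betas: the forward variance schedule, as a tensor
      - alpha_bars[t]: cumulative product of (1 - beta) up to step t
    """
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alpha_bars[t]

    # Predict the noise component present in x_t ...
    eps_pred = eps_model(x_t, t)
    # ... and use it to estimate the mean of the previous, less noisy state.
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps_pred) / torch.sqrt(alpha_t)

    if t > 0:
        # Intermediate steps add back a small amount of fresh noise.
        return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
    return mean  # final step (t = 0) returns the estimated clean sample x_0
```

Plugging this into the earlier loop, e.g. `sample(lambda x, t: ddpm_denoise_step(eps_model, x, t, betas, alpha_bars), shape)`, gives the complete generation procedure; where these particular coefficients come from is derived in the following chapters.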
The two processes fit together as follows: the fixed forward process transforms data $x_0$ into noise $x_T$ by adding noise incrementally, while the learned reverse process starts from noise $x_T$ and uses a neural network at each step to predict and remove noise, eventually generating a sample $x_0$.
This noise-and-denoise approach differs significantly from VAEs, which use an encoder-decoder structure to map data to and from a latent space, and from GANs, which rely on a generator and discriminator competing against each other. Diffusion models instead learn to reverse an explicit data-destruction process, which tends to make training more stable than an adversarial game and yields high-quality samples, addressing some of the limitations of these earlier methods.
The forward process is mathematically well-defined and tractable. The core challenge, and where the neural network comes in, is learning the reverse denoising steps. In the following chapters, we will examine the precise mathematical formulation of both the forward and reverse processes, explore the neural network architectures commonly used (like the U-Net), understand the training objective derived from a probabilistic framework, and finally, see how to implement the sampling procedure to generate new data.