While Generative Adversarial Networks (GANs) represent one major family of generative models, characterized by their adversarial training dynamic, Diffusion Models offer a distinct and increasingly powerful approach. Introduced more recently, these models have rapidly achieved state-of-the-art results, particularly in image synthesis, often producing high-fidelity samples with remarkable diversity. Unlike the sometimes unstable min-max game of GANs, diffusion models typically offer a more stable training process, although often at the cost of slower sampling speeds.
The fundamental idea behind diffusion models involves two processes: a fixed forward process and a learned reverse process.
Imagine taking a clean data sample, such as an image $x_0$ drawn from the true data distribution $p_{\text{data}}(x)$. The forward process systematically degrades this sample over a series of $T$ timesteps by gradually adding small amounts of Gaussian noise. This defines a Markov chain where the state at timestep $t$, denoted $x_t$, depends only on the state at the previous timestep $x_{t-1}$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
Here, $\beta_t$ is a small positive constant determining the variance of the noise added at step $t$. The sequence $\beta_1, \beta_2, \dots, \beta_T$ is known as the variance schedule, which is typically predefined (e.g., increasing linearly or quadratically).
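A variance schedule is simple to compute in practice. The sketch below builds a linear schedule with NumPy; the endpoint values ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps) are a common choice but are assumptions here, not values prescribed by this text:

```python
import numpy as np

# Linear variance schedule (endpoint values are a common choice, assumed here).
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_1, ..., beta_T
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{i=1}^{t} alpha_i

print(alpha_bars[0])    # near 1: almost no noise at t = 1
print(alpha_bars[-1])   # near 0: almost pure noise at t = T
```

Precomputing `alpha_bars` once up front is what makes the direct forward sampling formula (introduced next) cheap to evaluate for any $t$.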
A useful property of this process is that we can sample $x_t$ conditioned on the original data $x_0$ directly, without iterating through all intermediate steps. If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, then the distribution $q(x_t \mid x_0)$ is also Gaussian:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\, I\right)$$
As $t$ increases towards $T$, the cumulative effect of adding noise means $\bar{\alpha}_t$ approaches zero. Consequently, $x_T$ loses almost all information about the original $x_0$ and becomes effectively indistinguishable from pure Gaussian noise, $x_T \approx \mathcal{N}(0, I)$. The forward process is fixed; it requires no training.
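The closed-form expression for $q(x_t \mid x_0)$ can be sketched as a one-line function. This is a minimal NumPy illustration assuming the same linear schedule as above; the toy $4 \times 4$ "image" is purely for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Schedule as before (linear endpoints are an assumption).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) directly:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, t in 1..T."""
    ab = alpha_bars[t - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

x0 = rng.standard_normal((4, 4))   # toy "image"
eps = rng.standard_normal(x0.shape)
x_small_t = q_sample(x0, 1, eps)   # barely noised, still close to x0
x_large_t = q_sample(x0, T, eps)   # essentially the pure noise eps
```

Evaluating at $t = 1$ and $t = T$ makes the two limits concrete: the former is nearly $x_0$, the latter nearly pure noise.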
Diagram illustrating the fixed forward process, where data $x_0$ is gradually noised over $T$ steps according to the transition probability $q$ until it resembles pure noise $x_T$.
The core generative task lies in the reverse process. We want to start with random noise $x_T \sim \mathcal{N}(0, I)$ and reverse the noising steps, gradually denoising until we obtain a sample $x_0$ that looks as if it came from the original data distribution $p_{\text{data}}$. This requires learning the transition probabilities $p_\theta(x_{t-1} \mid x_t)$ for $t = T, T-1, \dots, 1$.
If the noise steps $\beta_t$ in the forward process are sufficiently small, the true reverse transition $q(x_{t-1} \mid x_t, x_0)$ can be shown to be approximately Gaussian as well. However, computing this true reverse transition requires knowing the original data $x_0$, which is precisely what we want to generate.
Therefore, we approximate these true reverse transitions with a neural network parameterized by $\theta$. This network takes the noisy sample $x_t$ and the current timestep $t$ as input and learns to predict the parameters (typically the mean and variance) of the distribution $p_\theta(x_{t-1} \mid x_t)$:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Instead of directly predicting the mean $\mu_\theta(x_t, t)$, a common and effective parameterization trains the network, often denoted $\epsilon_\theta(x_t, t)$, to predict the total noise $\epsilon$ that relates $x_t$ to the original $x_0$ through the direct forward sampling formula. Given the predicted noise $\epsilon_\theta(x_t, t)$, the parameters of $p_\theta(x_{t-1} \mid x_t)$ follow in closed form.
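The closed-form recovery of the mean from a noise prediction can be sketched as below. The schedule is the same assumed linear one; `eps_hat` stands in for the network's output $\epsilon_\theta(x_t, t)$, and the formula used is the standard DDPM expression $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\bigl(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\bigr)$:

```python
import numpy as np

# Assumed linear schedule, as elsewhere in this section.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean(x_t, t, eps_hat):
    """Mean of p_theta(x_{t-1} | x_t) derived from a noise prediction:
    mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)."""
    a, ab, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
    return (x_t - b / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a)
```

A useful sanity check: at $t = 1$, feeding in the *true* noise recovers $x_0$ exactly, since the formula simply inverts the single forward step.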
The parameters $\theta$ of the reverse process network are optimized by maximizing the likelihood of the training data, typically via a variational lower bound (ELBO) on the log-likelihood. For diffusion models, this objective can often be simplified to a much more tractable form. A common simplified objective minimizes the mean squared error between the actual noise $\epsilon$ added during the forward process (which can be easily sampled given $x_0$) and the noise predicted by the network $\epsilon_\theta$:
$$L_{\text{simple}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[1, T],\ x_0 \sim p_{\text{data}},\ \epsilon \sim \mathcal{N}(0, I)}\left[\left\lVert \epsilon - \epsilon_\theta(x_t, t) \right\rVert^2\right]$$
where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is generated using the direct forward sampling formula. Training involves repeatedly sampling a data point $x_0$, a timestep $t$, and noise $\epsilon$, constructing the corresponding $x_t$, and performing a gradient descent step on this loss.
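One iteration of this procedure can be sketched as follows. This is a minimal NumPy illustration of the loss computation only: `eps_theta` is a hypothetical placeholder that returns zeros (a real implementation would be a trained neural network, and the gradient step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear schedule, as before.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_theta(x_t, t):
    return np.zeros_like(x_t)   # placeholder standing in for the network

def training_loss(x0):
    t = rng.integers(1, T + 1)                # sample t uniformly from {1, ..., T}
    eps = rng.standard_normal(x0.shape)       # sample eps ~ N(0, I)
    ab = alpha_bars[t - 1]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps   # direct forward sample
    return np.mean((eps - eps_theta(x_t, t)) ** 2)     # ||eps - eps_theta||^2

loss = training_loss(rng.standard_normal((4, 4)))      # one toy "image"
```

With the zero placeholder the loss is simply the mean square of the sampled noise; in a real training loop this scalar would be backpropagated through the network.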
Once the model $\epsilon_\theta(x_t, t)$ is trained, generation (sampling) proceeds by:

1. Sampling pure noise $x_T \sim \mathcal{N}(0, I)$.
2. For $t = T, T-1, \dots, 1$: predicting the noise $\epsilon_\theta(x_t, t)$, computing the mean $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$, and sampling $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$ with $z \sim \mathcal{N}(0, I)$ for $t > 1$ and $z = 0$ at the final step.
3. Returning $x_0$ as the generated sample.
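This loop can be sketched directly in NumPy. As before, `eps_theta` is a hypothetical zero-returning stand-in for the trained network, so the code exercises only the mechanics of ancestral sampling, not real generation; $\sigma_t^2 = \beta_t$ is one common choice of reverse variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear schedule, as elsewhere in this section.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(x_t, t):
    return np.zeros_like(x_t)   # placeholder for the trained model

def sample(shape):
    x = rng.standard_normal(shape)            # step 1: x_T ~ N(0, I)
    for t in range(T, 0, -1):                 # step 2: t = T, T-1, ..., 1
        a, ab, b = alphas[t - 1], alpha_bars[t - 1], betas[t - 1]
        mean = (x - b / np.sqrt(1.0 - ab) * eps_theta(x, t)) / np.sqrt(a)
        z = rng.standard_normal(shape) if t > 1 else 0.0   # no noise at the last step
        x = mean + np.sqrt(b) * z             # sigma_t^2 = beta_t (common choice)
    return x                                  # step 3: the generated x_0

x_gen = sample((4, 4))
```

Note that sampling requires $T$ sequential network evaluations, which is the source of the slower sampling speed mentioned at the start of this section.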
Diagram comparing the fixed forward noising process $q$ with the learned reverse denoising process $p_\theta$. The reverse process starts from noise $x_T$ and uses a trained model $\epsilon_\theta$ (optimized via the training objective $L_{\text{simple}}$) to iteratively sample less noisy states, ultimately producing a generated sample $x_0$.
This iterative noise-and-denoise framework forms the basis of diffusion models. While conceptually straightforward, the specific choices of variance schedules, network architectures (often U-Nets), and training objectives lead to different model variants like Denoising Diffusion Probabilistic Models (DDPMs) and score-based models, which we will explore in detail in Chapter 4. Understanding this core mechanism is essential before implementing and optimizing these advanced generative techniques.
© 2025 ApX Machine Learning