Let's begin by revisiting the Denoising Diffusion Probabilistic Model (DDPM), a foundational framework upon which many advanced techniques are built. Understanding its core mechanics is essential before we explore more sophisticated architectures and training methods. DDPMs operate on a simple yet effective principle: systematically destroy structure in data through a forward diffusion process and then learn a reverse process to restore it, effectively generating new data.
The forward process, denoted by $q$, gradually adds Gaussian noise to an initial data point $x_0$ (e.g., an image) over $T$ discrete timesteps. This process is defined as a Markov chain, where the state $x_t$ at timestep $t$ depends only on the state $x_{t-1}$ at the previous step.
Specifically, the transition is defined by a conditional Gaussian distribution:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
Here, $\beta_t$ represents the variance of the noise added at step $t$. These variances are determined by a predefined variance schedule $\{\beta_t\}_{t=1}^{T}$, typically small values increasing over time (e.g., from $10^{-4}$ to $0.02$). As $t$ increases, the data progressively loses its distinguishing features, eventually approaching an isotropic Gaussian distribution $x_T \sim \mathcal{N}(0, I)$ if $T$ is sufficiently large.
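For reference, the short sketch below builds such a linear schedule in PyTorch. The endpoint values $10^{-4}$ and $0.02$ follow the example above, while the number of steps $T = 1000$ and the helper name `make_beta_schedule` are illustrative assumptions rather than fixed choices.

```python
import torch

def make_beta_schedule(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    """Linear variance schedule {beta_t} for t = 1..T (illustrative defaults)."""
    return torch.linspace(beta_start, beta_end, T)

betas = make_beta_schedule()   # betas[0] is about 1e-4, betas[-1] is 0.02
```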
A useful property of this process is that we can sample $x_t$ directly from $x_0$ without iterating through intermediate steps. Letting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the distribution $q(x_t \mid x_0)$ is also Gaussian:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$
This allows us to express $x_t$ using the reparameterization trick as:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$$
where $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise. This formulation is what makes training efficient: $x_t$ for any timestep $t$ can be produced from $x_0$ in a single step.
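This closed form translates directly into a one-step sampling routine. The sketch below is a minimal PyTorch version, assuming images stored as (batch, channels, height, width) tensors and zero-based timestep indices; the function name `q_sample` is an illustrative choice, not a fixed API.

```python
import torch

# Schedule quantities from the linear schedule sketched earlier.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)     # alpha_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, for a batch of timesteps t."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)        # broadcast over image dimensions
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(8, 3, 32, 32)                         # stand-in batch of "images"
t = torch.randint(0, 1000, (8,))                       # one timestep per example
xt = q_sample(x0, t, torch.randn_like(x0))
```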
Diagram illustrating the forward diffusion process, adding noise incrementally from data $x_0$ to approximate noise $x_T$.
The generative power comes from the reverse process, $p_\theta$, which aims to undo the diffusion. Starting from pure noise $x_T \sim \mathcal{N}(0, I)$, the goal is to learn the transitions $p_\theta(x_{t-1} \mid x_t)$ that gradually remove noise and eventually yield a realistic data point $x_0$.
The true reverse transition $q(x_{t-1} \mid x_t, x_0)$ is tractable and also Gaussian when conditioned on $x_0$. However, $x_0$ is unknown during generation. DDPM approximates the true reverse transition $q(x_{t-1} \mid x_t)$ with a parameterized Gaussian distribution learned by a neural network:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The neural network, typically a U-Net architecture (which we will examine in detail in Chapter 2), takes the noisy data $x_t$ and the timestep $t$ as input. It is trained to predict the parameters of the reverse transition. For DDPM, the variance $\Sigma_\theta(x_t, t)$ is often fixed to a value related to the forward process variances ($\sigma_t^2 I$, where $\sigma_t^2$ is typically $\beta_t$ or $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$). The network then primarily focuses on learning the mean $\mu_\theta(x_t, t)$.
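To make this concrete, the sketch below computes the reverse-step mean from a noise prediction using the standard DDPM parameterization $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$; the function name and the `eps_pred` placeholder for the U-Net's output are illustrative.

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def reverse_mean(xt: torch.Tensor, eps_pred: torch.Tensor, t: int) -> torch.Tensor:
    """mu_theta(x_t, t) recovered from the predicted noise, with the variance held fixed."""
    coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
    return (xt - coef * eps_pred) / alphas[t].sqrt()
```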
Training involves maximizing the log-likelihood $\log p_\theta(x_0)$ of the data. This is achieved by optimizing a variational lower bound (the ELBO) on the log-likelihood. Through mathematical derivation (involving properties of the forward and reverse processes), the objective can be significantly simplified. A common and effective objective function used in practice is:
$$L_{\text{simple}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(1,T),\ x_0 \sim \text{data},\ \epsilon \sim \mathcal{N}(0,I)}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and $\epsilon_\theta$ is the output of the neural network. In this formulation, the network is trained to predict the noise $\epsilon$ that was added to $x_0$ to obtain $x_t$. Predicting the noise $\epsilon$ has been found to be more stable and effective than directly predicting the mean $\mu_\theta(x_t, t)$ or the denoised image $x_0$.
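In code, one training step under $L_{\text{simple}}$ reduces to a few lines. The sketch below assumes `model` is any network with a U-Net-style call signature `model(x_t, t)` returning a tensor shaped like its input; the optimizer and data loading are omitted.

```python
import torch
import torch.nn.functional as F

betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """L_simple: predict the noise that was mixed into x0 at a uniformly drawn timestep."""
    t = torch.randint(0, len(betas), (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise    # forward process in one step
    return F.mse_loss(model(xt, t), noise)
```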
This objective function connects DDPMs to score-based generative modeling. The optimal $\epsilon_\theta(x_t, t)$ is related to the score function $\nabla_{x_t} \log q(x_t)$, which points in the direction of higher density in data space. Effectively, the neural network learns to estimate this score function, scaled appropriately. We will return to this connection when discussing score matching and ODEs later in this chapter.
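Concretely, the scaling follows from the Gaussian form of $q(x_t \mid x_0)$: differentiating its log-density and substituting the reparameterization $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ gives
$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{1-\bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}},$$
so a well-trained noise predictor provides the score estimate $\nabla_{x_t} \log q(x_t) \approx -\epsilon_\theta(x_t, t)/\sqrt{1-\bar{\alpha}_t}$.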
Once the model $p_\theta$ is trained, generating a new sample involves the following steps (a code sketch follows the list):
1. Sample pure noise $x_T \sim \mathcal{N}(0, I)$.
2. For $t = T, T-1, \dots, 1$: use the network's noise prediction $\epsilon_\theta(x_t, t)$ to form the mean $\mu_\theta(x_t, t)$, then sample $x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$, adding no noise at the final step ($t = 1$).
3. Return the result as the generated sample $x_0$.
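A minimal sampling loop implementing these steps could look like the sketch below. Here `model` again stands in for the trained noise-prediction network, $\sigma_t^2 = \beta_t$ is one of the two fixed-variance choices mentioned earlier, and the output shape is purely illustrative.

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape=(8, 3, 32, 32)) -> torch.Tensor:
    """Ancestral DDPM sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps = model(x, torch.full((shape[0],), t))           # predicted noise epsilon_theta(x_t, t)
        coef = betas[t] / (1.0 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()           # mu_theta(x_t, t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)   # no noise at the last step
        x = mean + betas[t].sqrt() * noise                   # sigma_t^2 = beta_t
    return x
```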
This iterative process requires $T$ sequential evaluations of the neural network, making standard DDPM sampling relatively slow compared to other generative models like GANs or VAEs. Techniques like DDIM (which we recap next) and the advanced samplers discussed in Chapter 6 aim to accelerate this.
This recap provides the necessary context for DDPMs. While simple in principle, the choices of noise schedule, network architecture, and objective formulation offer many avenues for improvement, forming the basis for the advanced topics covered in subsequent chapters. Keep this foundational structure in mind as we proceed to analyze noise schedules and explore more complex model variations.