Denoising Diffusion Probabilistic Models (DDPMs) represent a prominent and highly effective framework within the broader family of diffusion models. Introduced by Ho et al. (2020), DDPMs leverage a fixed, multi-step forward process that gradually adds Gaussian noise to data, paired with a learned reverse process that iteratively removes the noise to generate new data samples.
The Forward Process: Adding Noise
The forward process, also known as the diffusion process, is defined as a Markov chain that progressively corrupts an initial data point $x_0 \sim q(x_0)$ by adding small amounts of Gaussian noise over $T$ discrete timesteps. The noise level at each step $t$ is controlled by a predefined variance schedule $\{\beta_t\}_{t=1}^{T}$, where $0 < \beta_1 < \beta_2 < \dots < \beta_T < 1$. Typically, $T$ is large (e.g., $T = 1000$), and the $\beta_t$ values are small, ensuring that the change at each step is subtle.
The distribution of the noisy sample $x_t$ given the previous sample $x_{t-1}$ is a Gaussian:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$
Here, $\mathcal{N}(x; \mu, \sigma^2 I)$ denotes a Gaussian distribution over the variable $x$ with mean $\mu$ and diagonal covariance $\sigma^2 I$.
A significant property of this process is that we can sample $x_t$ directly from the original data $x_0$ in closed form, avoiding the need to iterate through all intermediate steps. Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. Then the distribution of $x_t$ conditioned on $x_0$ is also Gaussian:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$

This equation can be interpreted as scaling the original image $x_0$ by $\sqrt{\bar{\alpha}_t}$ and adding Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ scaled by $\sqrt{1 - \bar{\alpha}_t}$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$
As $t$ approaches $T$, $\bar{\alpha}_t$ approaches 0 (since $0 < \alpha_t < 1$). Consequently, $x_T$ becomes almost entirely independent of $x_0$ and resembles an isotropic Gaussian distribution $\mathcal{N}(0, I)$. The variance schedule $\beta_t$ is often chosen to be linearly increasing, quadratically increasing, or based on a cosine schedule.
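As a concrete illustration, the linear schedule and the one-shot noising step can be sketched in a few lines of NumPy. The endpoints $10^{-4}$ and $0.02$ are the linear-schedule values reported by Ho et al.; the "data point" below is random toy data, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear variance schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # alpha_bar_t = product of alpha_i for i <= t

# alpha_bar decays monotonically from ~1 toward ~0, so x_T is nearly pure noise.
print(alpha_bar[0], alpha_bar[-1])

# One-shot noising: sample x_t directly from x_0 for any timestep t.
x0 = rng.normal(size=(8,))           # stand-in "data point"
eps = rng.normal(size=x0.shape)      # Gaussian noise epsilon
t = 500
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
```

Printing the endpoints of `alpha_bar` makes the limiting behavior visible: the first value is essentially 1 (almost no corruption), while the last is nearly 0 (almost pure noise).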
The Reverse Process: Learning to Denoise
The generative power of DDPMs comes from learning the reverse process: starting from pure noise $x_T \sim \mathcal{N}(0, I)$ and gradually removing noise to obtain a realistic sample $x_0$. This involves learning the transition probabilities $p_\theta(x_{t-1} \mid x_t)$ for $t = T, T-1, \dots, 1$.
If $\beta_t$ is sufficiently small, the true reverse transition $q(x_{t-1} \mid x_t)$ is also approximately Gaussian. The challenge is that computing this transition exactly requires marginalizing over all possible data points, which is intractable. However, the posterior $q(x_{t-1} \mid x_t, x_0)$ is tractable and can be shown to be Gaussian:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$$

where

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\, (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$
Since we don't have $x_0$ during the reverse sampling process (that's what we want to generate!), we approximate this distribution using a neural network. DDPM parameterizes the reverse transitions $p_\theta(x_{t-1} \mid x_t)$ as Gaussians:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The goal is to train the parameters $\theta$ such that this distribution closely matches the true (but unknown) $q(x_{t-1} \mid x_t)$.
Network Parameterization and the Simplified Objective
In the original DDPM paper, the variance of the reverse process is fixed to untrained constants, $\Sigma_\theta(x_t, t) = \sigma_t^2 I$. Two common choices are $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t$. The neural network therefore only needs to predict the mean $\mu_\theta(x_t, t)$.
Instead of directly predicting the mean $\mu_\theta$, DDPMs reparameterize the network to predict the noise $\epsilon$ that was added at step $t$. Recall the forward process equation $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. We can rearrange this to express $x_0$ in terms of $x_t$ and $\epsilon$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon\right)$$

Substituting this expression for $x_0$ into the formula for the ideal mean $\tilde{\mu}_t(x_t, x_0)$ allows us to express the mean $\mu_\theta$ in terms of the predicted noise $\epsilon_\theta(x_t, t)$:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right)$$
This reparameterization means that minimizing the difference between the predicted mean $\mu_\theta$ and the true mean $\tilde{\mu}_t$ is equivalent (up to a time-dependent weighting) to minimizing the difference between the predicted noise $\epsilon_\theta$ and the actual noise $\epsilon$.
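The substitution above can be verified numerically: if the network predicted the noise perfectly ($\epsilon_\theta = \epsilon$), the noise-parameterized mean $\mu_\theta$ would coincide exactly with the posterior mean $\tilde{\mu}_t$. A quick NumPy sanity check with arbitrary toy values (the data, noise, and timestep below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

# Toy data, noise, and an arbitrary timestep; the array index t plays the role
# of the paper's timestep, so alpha_bar[t - 1] is the previous cumulative product.
x0 = rng.normal(size=5)
eps = rng.normal(size=5)
t = 300
xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Posterior mean mu_tilde_t(x_t, x_0), which requires access to x_0.
mu_tilde = (np.sqrt(alpha_bar[t - 1]) * betas[t] / (1 - alpha_bar[t]) * x0
            + np.sqrt(alphas[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * xt)

# Model mean mu_theta(x_t, t), assuming a "perfect" prediction eps_theta = eps.
mu_theta = (xt - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])

print(np.allclose(mu_tilde, mu_theta))  # True: the two means agree
```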
The authors found that training the model using a simplified objective function, derived from the variational lower bound (VLB) on the data log-likelihood, works very well in practice and is easier to implement. This simplified objective minimizes the mean squared error between the true noise $\epsilon$ added during the forward process and the noise predicted by the network $\epsilon_\theta$:

$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ is the noisy sample generated using the original data $x_0$ and the sampled noise $\epsilon$ at a randomly chosen timestep $t$.
Essentially, the network $\epsilon_\theta$ takes a noisy sample $x_t$ and the timestep $t$ as input and tries to predict the noise component $\epsilon$ that was used to create $x_t$ from $x_0$. Training with this objective effectively teaches the network to denoise samples at various noise levels.
[Figure: the fixed forward (noising) process from data $x_0$ to noise $x_T$, and the learned reverse (denoising) process generating data $x_0$ from noise $x_T$ by predicting the noise $\epsilon_\theta$ at each step.]
DDPM Training and Sampling Algorithms
Training:
1. Sample a real data point $x_0 \sim q(x_0)$.
2. Sample a timestep $t$ uniformly from $\{1, \dots, T\}$.
3. Sample noise $\epsilon \sim \mathcal{N}(0, I)$.
4. Compute the noisy sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$.
5. Train the neural network $\epsilon_\theta$ by minimizing the loss $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ using gradient descent. Repeat this process over many iterations and data points.
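The training steps above can be sketched end-to-end in NumPy. Here `eps_model` is a deliberately trivial linear stand-in for the real network $\epsilon_\theta$ (in practice a timestep-conditioned U-Net trained with autodiff), so this only illustrates the mechanics of the training loop, not a usable model:

```python
import numpy as np

rng = np.random.default_rng(2)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

D = 16                       # toy data dimensionality
W = np.zeros((D, D))         # parameters of the stand-in linear "network"

def eps_model(xt, t):
    # Hypothetical stand-in for eps_theta(x_t, t); it ignores t for simplicity.
    return xt @ W

def training_step(x0_batch, lr=0.05):
    """One DDPM training iteration: sample t and eps, form x_t, regress eps."""
    t = rng.integers(0, T)                           # uniform random timestep
    eps = rng.normal(size=x0_batch.shape)            # the noise to predict
    xt = np.sqrt(alpha_bar[t]) * x0_batch + np.sqrt(1.0 - alpha_bar[t]) * eps
    pred = eps_model(xt, t)
    loss = np.mean((eps - pred) ** 2)                # simplified objective
    grad = -2.0 * xt.T @ (eps - pred) / eps.size     # MSE gradient for the linear model
    W[:] -= lr * grad                                # plain gradient descent step
    return loss

x0_batch = rng.normal(size=(32, D))                  # toy "dataset" batch
losses = [training_step(x0_batch) for _ in range(300)]
```

Even this trivial model drives the loss down somewhat, because $x_t$ is strongly correlated with $\epsilon$ at large $t$; the point is only to show how the five steps compose into one iteration.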
Sampling (Generation):
Start with pure noise $x_T \sim \mathcal{N}(0, I)$.
Iterate backwards from $t = T$ down to $1$:
a. Sample $z \sim \mathcal{N}(0, I)$ if $t > 1$; otherwise $z = 0$.
b. Predict the noise $\epsilon_\theta(x_t, t)$ using the trained network.
c. Compute the denoised sample $x_{t-1}$ using the reverse transition formula, which incorporates the predicted noise:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right) + \sigma_t z$$

(where $\sigma_t$ is the fixed standard deviation, e.g., $\sigma_t = \sqrt{\beta_t}$ or $\sqrt{\tilde{\beta}_t}$).
The final result $x_0$ is the generated sample.
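The sampling procedure can likewise be sketched in NumPy. The `eps_model` below is a hypothetical zero-output placeholder for the trained network, so the loop demonstrates the ancestral-sampling mechanics rather than producing meaningful samples:

```python
import numpy as np

rng = np.random.default_rng(3)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    # Hypothetical placeholder for the trained network eps_theta(x_t, t);
    # a real sampler would call a timestep-conditioned U-Net here.
    return np.zeros_like(xt)

def sample(shape):
    """Ancestral sampling: start from x_T ~ N(0, I) and denoise step by step."""
    x = rng.normal(size=shape)                      # x_T
    for t in reversed(range(T)):                    # 0-indexed t = T-1, ..., 0
        z = rng.normal(size=shape) if t > 0 else np.zeros(shape)  # no noise at last step
        eps = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        sigma = np.sqrt(betas[t])                   # fixed choice sigma_t^2 = beta_t
        x = mean + sigma * z
    return x

x0 = sample((2, 8))   # two toy 8-dimensional "samples"
```

Note that the added noise $z$ is switched off at the final step, matching step (a) of the algorithm.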
DDPMs provide a solid foundation for understanding modern diffusion-based generative modeling. Their ability to generate high-fidelity samples, particularly in image synthesis, stems from this carefully defined noising process and the learned denoising network trained via the intuitive noise-prediction objective. Subsequent sections will explore variations like score-based interpretations and techniques to improve sampling efficiency and control.