As mentioned in the chapter introduction, unconditional generation produces samples representative of the overall training data, but we often desire more specific outputs. Imagine wanting to generate images only of cats, or perhaps digits corresponding to the number '8'. How can we steer the diffusion model, which has learned the distribution p(x), to generate samples from a conditional distribution p(x∣y), where y represents the desired condition (e.g., y= 'cat' or y= '8')?
One of the first successful approaches to achieve this is Classifier Guidance. The core idea is to leverage a separate, pre-trained classifier model, let's call it pϕ(y∣x), where ϕ represents the classifier's parameters. This classifier is trained to predict the class label y given an input x.
However, during the diffusion model's reverse sampling process, we are not dealing with clean data x0 but with noisy intermediate samples xt at various timesteps t. Therefore, for classifier guidance to be effective, the classifier pϕ(y∣xt) must be trained to recognize the class y even from noisy inputs xt. This means training the classifier not just on the original dataset (like clean images) but also on noisy versions of the data, consistent with the noise levels encountered during the diffusion process.
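To make this concrete, here is a minimal training sketch under assumed names: `classifier(x_t, t)` returning class logits, a precomputed `alpha_bar` schedule (cumulative product of the forward-process α's), a `data_loader` of clean images and labels, and a timestep count `T` are all illustrative assumptions, not definitions from this text.

```python
import torch
import torch.nn.functional as F

def train_noise_aware_classifier(classifier, data_loader, alpha_bar, T, optimizer, device="cpu"):
    """Sketch: train a classifier on noisy samples x_t drawn from q(x_t | x_0)."""
    classifier.train()
    for x0, y in data_loader:
        x0, y = x0.to(device), y.to(device)

        # Pick a random timestep per example and apply the same forward
        # noising process the diffusion model was trained with.
        t = torch.randint(0, T, (x0.shape[0],), device=device)
        noise = torch.randn_like(x0)
        a_bar = alpha_bar[t].view(-1, 1, 1, 1)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

        # The classifier sees the noisy sample (and the timestep), so it learns
        # to predict y at every noise level it will encounter during sampling.
        logits = classifier(x_t, t)
        loss = F.cross_entropy(logits, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```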
How does this classifier guide the generation? Recall that the reverse process aims to approximate p(xt−1∣xt). We want to modify this step to sample from p(xt−1∣xt,y). Using Bayes' theorem, we can write:
$$p(x_t \mid y) = \frac{p(y \mid x_t)\, p(x_t)}{p(y)}$$

Taking the logarithm and then the gradient with respect to x_t:
$$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(y \mid x_t) + \nabla_{x_t} \log p(x_t)$$

The log p(y) term drops out because it does not depend on x_t. The term ∇_{x_t} log p(x_t) is the score function of the marginal data distribution at time t; the diffusion model's noise prediction network ε_θ(x_t, t) approximates this score up to a noise-level scaling, ∇_{x_t} log p(x_t) ≈ −ε_θ(x_t, t) / √(1 − ᾱ_t). The term ∇_{x_t} log p(y ∣ x_t) is the gradient of the log-likelihood of the desired class y according to the classifier, evaluated at the current noisy sample x_t. This gradient points in the direction in input space that makes the sample look more like class y to the classifier.
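In practice this gradient is obtained by backpropagating the classifier's log-probability for the target class with respect to the noisy input. A small sketch, again assuming a `classifier(x_t, t)` interface that returns logits:

```python
import torch

def classifier_log_grad(classifier, x_t, t, y):
    """Sketch: compute grad_{x_t} log p_phi(y | x_t) via autograd."""
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Select log p_phi(y | x_t) for each sample's target class.
    selected = log_probs[torch.arange(x_in.shape[0], device=x_in.device), y]
    # Gradient of the summed log-probabilities w.r.t. the noisy input.
    return torch.autograd.grad(selected.sum(), x_in)[0]
```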
Classifier guidance modifies the sampling step by incorporating this classifier gradient. Specifically, when calculating the mean μθ(xt,t) for the reverse step pθ(xt−1∣xt), we perturb it using the classifier's gradient. The update rule for the mean in the DDPM sampling step can be adjusted as follows:
The original predicted mean is derived from the predicted noise ϵθ:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

The guided mean μ̂_θ(x_t, t, y) becomes:
$$\hat{\mu}_\theta(x_t, t, y) = \mu_\theta(x_t, t) + s \cdot \Sigma_t \, \nabla_{x_t} \log p_\phi(y \mid x_t)$$

Here:

- s is the guidance scale, a hyperparameter controlling how strongly the classifier gradient steers the sample toward class y (larger values yield samples the classifier finds more convincing, typically at the cost of diversity).
- Σ_t is the variance of the reverse transition p_θ(x_{t−1} ∣ x_t), for example β_t I.
- ∇_{x_t} log p_φ(y ∣ x_t) is the classifier's gradient evaluated at the current noisy sample x_t.
The sampling process then proceeds using this adjusted mean μ̂_θ:

$$x_{t-1} = \hat{\mu}_\theta(x_t, t, y) + \sqrt{\Sigma_t}\, z, \qquad z \sim \mathcal{N}(0, I)$$

with no noise added at the final step (t = 1).
This process is repeated from t=T down to t=1.
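Putting the pieces together, the following is a minimal sketch of the guided DDPM sampling loop. It assumes the `classifier_log_grad` helper above, a noise-prediction network `eps_model(x_t, t)`, and precomputed 1-D schedules `alphas`, `alpha_bar`, and `betas` of length T; Σ_t is taken as β_t I for simplicity.

```python
import torch

@torch.no_grad()
def guided_ddpm_sample(eps_model, classifier, shape, y,
                       alphas, alpha_bar, betas, T, s=1.0, device="cpu"):
    """Sketch: DDPM sampling with classifier guidance (Sigma_t = beta_t * I)."""
    x_t = torch.randn(shape, device=device)
    for step in reversed(range(T)):
        t = torch.full((shape[0],), step, device=device, dtype=torch.long)

        # Unguided predicted mean from the noise estimate.
        eps = eps_model(x_t, t)
        mean = (x_t - betas[step] / (1 - alpha_bar[step]).sqrt() * eps) / alphas[step].sqrt()

        # The classifier gradient needs autograd, so re-enable it locally.
        with torch.enable_grad():
            grad = classifier_log_grad(classifier, x_t, t, y)

        # Shift the mean by the scaled classifier gradient.
        sigma_sq = betas[step]
        mean = mean + s * sigma_sq * grad

        # Sample x_{t-1}; no noise is added at the final step.
        noise = torch.randn_like(x_t) if step > 0 else torch.zeros_like(x_t)
        x_t = mean + sigma_sq.sqrt() * noise
    return x_t
```

For example, calling this with every entry of y set to the label for the digit '8' steers the whole batch toward that class, with s controlling how aggressively.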
Diagram illustrating the classifier guidance mechanism within a single reverse step. The U-Net predicts noise, while a separate classifier provides a gradient based on the target class y. These are combined, scaled by s, to produce a guided mean used for sampling xt−1.
Advantages of Classifier Guidance:

- The diffusion model itself does not need to be retrained for conditioning; an unconditional model can be steered at sampling time.
- The guidance scale s provides an explicit knob for trading off adherence to the condition against sample diversity.
- Different classifiers, and therefore different conditioning signals, can be swapped in without modifying the diffusion model.
Disadvantages of Classifier Guidance:

- It requires training and maintaining a separate classifier that is robust across all noise levels; off-the-shelf classifiers trained on clean data are not suitable.
- Every sampling step needs an additional forward and backward pass through the classifier, which slows generation.
- On highly noisy inputs the classifier's gradients can be weak or misleading, and overly large guidance scales can reduce sample quality and diversity.
While effective, the need for a separate, noise-aware classifier led researchers to develop methods that achieve similar guidance without this external dependency. This motivates the next technique we will discuss: Classifier-Free Guidance (CFG).