Unconditional generation produces samples representative of the overall training data. However, there is often a need for more specific outputs. Imagine wanting to generate images only of cats, or perhaps digits corresponding to the number '8'. How can we steer a diffusion model, which has learned the distribution p(x), to generate samples from a conditional distribution p(x∣y), where y represents the desired condition (e.g., y= 'cat' or y= '8')?
One of the first successful approaches to achieve this is Classifier Guidance. The core idea is to leverage a separate, pre-trained classifier model, call it p_φ(y | x), where φ denotes the classifier's parameters. This classifier is trained to predict the class label y given an input x.
However, during the diffusion model's reverse sampling process, we are not dealing with clean data x_0 but with noisy intermediate samples x_t at various timesteps t. Therefore, for classifier guidance to be effective, the classifier p_φ(y | x_t) must be trained to recognize the class y even from noisy inputs x_t. This means training the classifier not just on the original dataset (like clean images) but also on noisy versions of the data, consistent with the noise levels encountered during the diffusion process.
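As a sketch, the classifier's training inputs can be noised with the same forward process the diffusion model uses, q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t) I). The schedule and helper name below are illustrative assumptions, not taken from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # standard linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative product: \bar{alpha}_t

def noisy_training_input(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# During classifier training, each (image, label) pair is noised at a random t,
# so p_phi(y | x_t) learns to classify at every noise level it will see.
x0 = rng.standard_normal((32, 32))      # stand-in for a clean training image
t = rng.integers(0, T)
x_t = noisy_training_input(x0, t)       # feed (x_t, t, y) to the classifier
```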
How does this classifier guide the generation? Recall that the reverse process aims to approximate p(x_{t-1} | x_t). We want to modify this step to sample from p(x_{t-1} | x_t, y). Using Bayes' theorem, we can write:
p(x_t | y) = p(y | x_t) p(x_t) / p(y)
Taking the logarithm and then the gradient with respect to x_t:
∇_{x_t} log p(x_t | y) = ∇_{x_t} log p(y | x_t) + ∇_{x_t} log p(x_t)
The ∇_{x_t} log p(y) term vanishes because p(y) does not depend on x_t.
The term ∇_{x_t} log p(x_t) is the score function of the marginal data distribution at time t. The diffusion model's noise prediction network ε_θ(x_t, t) is trained to approximate −σ_t ∇_{x_t} log p(x_t), where σ_t = √(1 − ᾱ_t) is the noise level at time t. The term ∇_{x_t} log p(y | x_t) is the gradient of the log-likelihood of the desired class y according to the classifier, evaluated at the current noisy sample x_t. This gradient points in the direction in input space that makes the sample look more like class y to the classifier.
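In code, the score can be recovered from the predicted noise under the standard DDPM parameterization, ∇_{x_t} log p(x_t) ≈ −ε_θ(x_t, t) / √(1 − ᾱ_t). A minimal sketch, where `eps_pred` stands in for the network output and the helper name is hypothetical:

```python
import numpy as np

def score_from_noise(eps_pred, alpha_bar_t):
    """Convert a predicted noise into an estimate of the score grad_x log p(x_t)."""
    sigma_t = np.sqrt(1.0 - alpha_bar_t)   # noise level at time t
    return -eps_pred / sigma_t

# Toy check: with alpha_bar_t = 0.75, sigma_t = 0.5, so eps = 0.5 maps to score -1.
eps_pred = np.ones((2, 2)) * 0.5
score = score_from_noise(eps_pred, alpha_bar_t=0.75)
```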
Classifier guidance modifies the sampling step by incorporating this classifier gradient. Specifically, when calculating the mean μ_θ(x_t, t) for the reverse step p_θ(x_{t-1} | x_t), we perturb it using the classifier's gradient. The update rule for the mean in the DDPM sampling step can be adjusted as follows:
The original predicted mean is derived from the predicted noise ε_θ:
μ_θ(x_t, t) = (1/√α_t) (x_t − (β_t / √(1 − ᾱ_t)) ε_θ(x_t, t))
The guided mean μ̂_θ(x_t, t, y) becomes:
μ̂_θ(x_t, t, y) = μ_θ(x_t, t) + s · Σ_t · ∇_{x_t} log p_φ(y | x_t)
Here:
Σ_t is the covariance of the reverse step p_θ(x_{t-1} | x_t), typically set to β_t I or a learned variance.
s is the guidance scale, a hyperparameter that controls the strength of the classifier's influence. A higher s pushes the generation more strongly towards the target class y.
∇_{x_t} log p_φ(y | x_t) is the gradient computed by the classifier p_φ.
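The perturbation itself is a one-line shift of the mean. The sketch below assumes a scalar variance Σ_t = σ_t² I; all names are illustrative:

```python
import numpy as np

def guided_mean(mu, sigma_t_sq, class_grad, s=1.0):
    """Shift the reverse-step mean toward the classifier's preferred direction:
    mu_hat = mu + s * Sigma_t * grad_x log p_phi(y | x_t)."""
    return mu + s * sigma_t_sq * class_grad

# Toy example: zero mean, a stand-in classifier gradient, guidance scale s = 2.
mu = np.zeros(4)
grad = np.array([1.0, -1.0, 0.5, 0.0])
mu_hat = guided_mean(mu, sigma_t_sq=0.04, class_grad=grad, s=2.0)
```

Increasing s scales the shift linearly, which is why overly large values can push samples off the data manifold.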
The sampling process then proceeds using this adjusted mean μ̂_θ:
1. Predict the noise ε_θ(x_t, t) using the U-Net.
2. Calculate the original mean μ_θ(x_t, t).
3. Compute the classifier gradient ∇_{x_t} log p_φ(y | x_t) for the desired class y.
4. Calculate the guided mean μ̂_θ(x_t, t, y) using the guidance scale s.
5. Sample x_{t-1} from N(x_{t-1}; μ̂_θ(x_t, t, y), Σ_t).
This process is repeated from t = T down to t = 1.
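The steps above can be sketched as a single guided reverse step. The networks are replaced by toy stand-in callables, and every name here is an assumption rather than a real API:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def guided_reverse_step(x_t, t, eps_model, classifier_grad, s=1.0):
    # 1. Predict the noise with the diffusion model (a U-Net in practice).
    eps = eps_model(x_t, t)
    # 2. Original DDPM mean.
    mu = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    # 3. Classifier gradient for the target class at the current noisy sample.
    grad = classifier_grad(x_t, t)
    # 4. Guided mean: shift by s * Sigma_t * grad (here Sigma_t = beta_t I).
    mu_hat = mu + s * betas[t] * grad
    # 5. Sample x_{t-1}; no noise is added at the final step t = 0.
    noise = rng.standard_normal(x_t.shape) if t > 0 else 0.0
    return mu_hat + np.sqrt(betas[t]) * noise

# Toy stand-ins for demonstration only: a "model" predicting zero noise and a
# "classifier gradient" that pulls samples toward the origin.
eps_model = lambda x, t: np.zeros_like(x)
classifier_grad = lambda x, t: -x

x = rng.standard_normal((8,))
for t in reversed(range(T)):
    x = guided_reverse_step(x, t, eps_model, classifier_grad, s=1.0)
```

In a real implementation the classifier gradient is obtained by backpropagating log p_φ(y | x_t) to the input, and Σ_t may be a learned variance instead of β_t I.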
Diagram illustrating the classifier guidance mechanism within a single reverse step. The U-Net predicts noise, while a separate classifier provides a gradient based on the target class y. These are combined, scaled by s, to produce a guided mean used for sampling x_{t-1}.
Advantages of Classifier Guidance:
Explicit Control: Provides a direct way to steer generation towards specific attributes that the classifier is trained on.
Uses Existing Classifiers: Can potentially use well-trained classifiers off-the-shelf (though they need noise robustness).
Disadvantages of Classifier Guidance:
Requires a Separate Classifier: Need to train and maintain an additional model p_φ(y | x_t).
Classifier Training: The classifier must be robust to the noise levels seen during diffusion, adding training complexity.
Computational Cost: Requires running inference on both the diffusion model and the classifier at each sampling step.
Guidance Strength Tuning: Finding the right guidance scale s often requires experimentation. Too low, and the guidance is ineffective; too high, and samples may become unrealistic, since the gradient can exploit the classifier adversarially rather than improve sample fidelity.
While effective, the need for a separate, noise-aware classifier led researchers to develop methods that achieve similar guidance without this external dependency. This motivates the next technique we will discuss: Classifier-Free Guidance (CFG).