While the diffusion models discussed so far excel at learning the underlying distribution of the training data and generating diverse samples, we often require more control over the generation process. For instance, we might want to generate an image of a specific object class (like a "cat" or a "dog") or synthesize data possessing particular attributes. This is the domain of conditional generation. Two prominent techniques for achieving this control in diffusion models are Classifier Guidance and Classifier-Free Guidance (CFG).
Classifier Guidance leverages a separate, pre-trained classifier model to steer the diffusion sampling process towards samples that exhibit desired characteristics, typically defined by a class label $y$. The core idea is to modify the sampling steps to not only denoise the image but also make it more recognizable as class $y$ according to the classifier.
Recall that the reverse diffusion process aims to approximate the score function $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$, which guides the sampling from noise towards data. To incorporate conditioning on a class $y$, we want to sample from the conditional distribution $p(\mathbf{x}_t \mid y)$. Using Bayes' theorem, we can relate the conditional score to the unconditional score and the classifier's prediction:
$$\log p(\mathbf{x}_t \mid y) = \log p(y \mid \mathbf{x}_t) + \log p(\mathbf{x}_t) - \log p(y)$$

Taking the gradient with respect to $\mathbf{x}_t$ (the $\log p(y)$ term drops out because it does not depend on $\mathbf{x}_t$) gives:
$$\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid y) = \nabla_{\mathbf{x}_t} \log p(y \mid \mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$$

Here, $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$ is the score estimated by the unconditional diffusion model, and $\nabla_{\mathbf{x}_t} \log p(y \mid \mathbf{x}_t)$ is the gradient of the log-likelihood provided by a classifier $p_\phi(y \mid \mathbf{x}_t)$ trained to predict the class $y$ from a noisy input $\mathbf{x}_t$.
In practice, for models parameterized via noise prediction $\epsilon_\theta$, the update direction during sampling is adjusted. Recall that the predicted noise relates to the score via $\epsilon_\theta(\mathbf{x}_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$, so adding the classifier gradient to the score corresponds to subtracting it (scaled by $\sqrt{1 - \bar{\alpha}_t}$) from the predicted noise. A common formulation for the guided noise prediction $\hat{\epsilon}_\theta$ is:
$$\hat{\epsilon}_\theta(\mathbf{x}_t, t, y) = \epsilon_\theta(\mathbf{x}_t, t) - s \cdot \sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log p_\phi(y \mid \mathbf{x}_t)$$

Here, $s$ is the guidance scale, a hyperparameter that controls the strength of the conditioning. A higher value of $s$ pushes the generation process more strongly towards samples that the classifier $p_\phi$ recognizes as belonging to class $y$.
Mechanism: At each step of the reverse diffusion process, the classifier examines the current noisy sample $\mathbf{x}_t$ and computes how changes to $\mathbf{x}_t$ would increase the probability of the target class $y$. This gradient information is then used to nudge the denoising step, effectively biasing the generation towards the desired class.
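As a concrete illustration, here is a minimal sketch of such a guided step in PyTorch, assuming a noise-prediction model and a classifier that both accept a noisy sample and a timestep. All function and argument names are placeholders for illustration, not a specific library's API.

```python
import torch

def classifier_guided_eps(eps_model, classifier, x_t, t, y, alpha_bar_t, scale):
    """Adjust the predicted noise using gradients from a classifier trained on noisy inputs.

    eps_model   -- unconditional diffusion model, eps_theta(x_t, t)
    classifier  -- classifier giving logits for p_phi(y | x_t) on noisy inputs
    alpha_bar_t -- cumulative alpha-bar at timestep t (scalar tensor)
    scale       -- guidance scale s
    """
    eps = eps_model(x_t, t)  # unconditional noise prediction eps_theta(x_t, t)

    # grad_{x_t} log p_phi(y | x_t): how to change x_t to make class y more likely.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(y.shape[0], device=y.device), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]

    # eps_hat = eps_theta(x_t, t) - s * sqrt(1 - alpha_bar_t) * grad log p_phi(y | x_t)
    return eps - scale * torch.sqrt(1.0 - alpha_bar_t) * grad
```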
Advantages:

- Conditioning can be added on top of an already trained unconditional diffusion model; only the classifier needs to be trained.
- Different classifiers can be swapped in (if trained on noisy inputs) to guide generation towards different attributes without modifying the diffusion model.
Disadvantages:

- Requires training a separate classifier $p_\phi(y \mid \mathbf{x}_t)$ on noisy inputs at every noise level; an off-the-shelf classifier trained on clean data will not work well.
- Sample quality is sensitive to the classifier's accuracy, and its gradients can introduce artifacts.
- Every sampling step needs an additional classifier forward and backward pass, slowing inference.
Classifier-Free Guidance (CFG) emerged as a way to achieve conditional generation without relying on a separate classifier model. It has become a widely adopted and highly effective technique, particularly prominent in large-scale models like those used for text-to-image synthesis.
Mechanism: The central idea is to train a single conditional diffusion model, typically parameterized by $\epsilon_\theta(\mathbf{x}_t, t, y)$, which takes the conditioning information $y$ (e.g., a class label, a text embedding) as an additional input. During training, the conditioning input $y$ is randomly replaced with a special null token $\varnothing$ (representing unconditional generation) with some probability (e.g., 10-20% of the time). This forces the model to learn both the conditional noise prediction $\epsilon_\theta(\mathbf{x}_t, t, y)$ and the unconditional noise prediction $\epsilon_\theta(\mathbf{x}_t, t, \varnothing)$ within the same set of weights $\theta$.
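A minimal sketch of this conditional dropout for integer class labels, assuming a reserved label id serves as the null token (names are illustrative, not a fixed API):

```python
import torch

def apply_condition_dropout(y, null_token, drop_prob=0.1):
    """Randomly replace labels with a reserved null id so the same network
    learns both conditional and unconditional noise prediction."""
    drop = torch.rand(y.shape[0], device=y.device) < drop_prob
    return torch.where(drop, torch.full_like(y, null_token), y)

# Inside a standard diffusion training step (sketch):
#   y_in = apply_condition_dropout(y, null_token=num_classes, drop_prob=0.1)
#   loss = F.mse_loss(eps_model(x_t, t, y_in), noise)
```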
During sampling, both the conditional and unconditional noise predictions are computed at each step. The final noise prediction used for the denoising step is then calculated by extrapolating from the unconditional prediction in the direction of the conditional prediction:
$$\hat{\epsilon}_\theta(\mathbf{x}_t, t, y) = \epsilon_\theta(\mathbf{x}_t, t, \varnothing) + s \cdot \big( \epsilon_\theta(\mathbf{x}_t, t, y) - \epsilon_\theta(\mathbf{x}_t, t, \varnothing) \big)$$

Again, $s$ is the guidance scale (often denoted as $w$ in the literature).
Intuition: The term $\epsilon_\theta(\mathbf{x}_t, t, y) - \epsilon_\theta(\mathbf{x}_t, t, \varnothing)$ can be seen as implicitly representing the direction associated with the condition $y$ in the noise prediction space. CFG effectively learns this direction directly from the data during training, rather than relying on an external classifier's gradient. Scaling this difference by $s > 1$ strengthens the influence of the condition $y$ on the generation outcome.
Diagram illustrating the Classifier-Free Guidance mechanism during a single sampling step. Both unconditional (∅) and conditional (y) noise predictions are computed from the current state xt. The final guided prediction ϵ^θ is an extrapolation based on these two predictions and the guidance scale s.
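The combination itself is only a few lines. The sketch below assumes a conditional noise-prediction model that accepts the condition, or a null token, as an extra argument; the names are illustrative.

```python
def cfg_eps(eps_model, x_t, t, y, null_y, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards the conditional one, weighted by the guidance scale s."""
    eps_uncond = eps_model(x_t, t, null_y)  # eps_theta(x_t, t, null)
    eps_cond = eps_model(x_t, t, y)         # eps_theta(x_t, t, y)
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

In practice, the conditional and unconditional inputs are often concatenated into one batch so that both predictions come from a single (larger) forward pass per sampling step.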
Advantages:

- No separate classifier is needed; a single model provides both conditional and unconditional predictions.
- Generally yields higher sample quality with fewer artifacts, and extends naturally to rich conditioning signals such as text embeddings.
Disadvantages:

- Each sampling step requires two forward passes of the diffusion model (conditional and unconditional), roughly doubling the per-step compute.
- The conditional dropout must be built into training, so guidance cannot be retrofitted onto an existing unconditional model without retraining.
- Very large guidance scales reduce diversity and can produce over-saturated or unrealistic samples.
Feature | Classifier Guidance | Classifier-Free Guidance (CFG) |
---|---|---|
External Model | Yes (classifier $p_\phi(y \mid \mathbf{x}_t)$) | No |
Training | Standard diffusion model + separate classifier training (on noisy data) | Modified diffusion model training (with conditional dropout) |
Inference Speed | Needs diffusion model + classifier evaluation per step | Needs diffusion model evaluation twice per step (cond + uncond) |
Typical Quality | Good, but sensitive to classifier quality & can have artifacts | Often state-of-the-art, generally higher quality and fewer artifacts |
Implementation | Requires integrating two models | Single model, modified training loop |
Flexibility | Can swap classifiers (if trained) | Guidance baked into the model |
The Guidance Scale ($s$): In both methods, the guidance scale $s$ (or $w$) plays a significant role. It controls the trade-off between fidelity to the condition and sample diversity/realism: low values yield diverse samples that may only weakly reflect the condition, while high values yield samples that follow the condition closely but with reduced diversity and, at extreme settings, degraded realism.
Finding an optimal value for s usually requires empirical tuning for a specific model and task. It provides a powerful knob to adjust the generation behavior at inference time without retraining the model.
Guidance techniques are essential for directing the output of diffusion models towards specific desired properties, moving beyond simple unconditional generation. Classifier guidance uses an external classifier to inject conditioning information via gradients, while Classifier-Free Guidance achieves this more effectively by modifying the training process of the diffusion model itself, enabling it to learn conditional and unconditional generation simultaneously. CFG has become the standard approach due to its superior performance and elimination of the need for a separate, potentially problematic classifier model. Understanding and utilizing these guidance mechanisms is fundamental for applying diffusion models to practical conditional synthesis tasks.