Diffusion models excel at learning the underlying distribution of training data and generating diverse samples. However, there is often a need for more control over the generation process. For instance, generating an image of a specific object class (like a "cat" or a "dog"), or synthesizing data with particular attributes, falls within the domain of conditional generation. Two prominent techniques for achieving this control in diffusion models are Classifier Guidance and Classifier-Free Guidance (CFG).

## Classifier Guidance

Classifier Guidance uses a separate, pre-trained classifier model to steer the diffusion sampling process towards samples that exhibit desired characteristics, typically defined by a class label $y$. The core idea is to modify the sampling steps to not only denoise the image but also make it more recognizable as class $y$ according to the classifier.

Recall that the reverse diffusion process aims to approximate the score function $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$, which guides the sampling from noise towards data. To incorporate conditioning on a class $y$, we want to sample from the conditional distribution $p(\mathbf{x}_t | y)$.
Using Bayes' theorem, we can relate the conditional score to the unconditional score and the classifier's prediction:

$$ \log p(\mathbf{x}_t | y) = \log p(y | \mathbf{x}_t) + \log p(\mathbf{x}_t) - \log p(y) $$

Taking the gradient with respect to $\mathbf{x}_t$ (the $\log p(y)$ term does not depend on $\mathbf{x}_t$, so its gradient vanishes) gives:

$$ \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | y) = \nabla_{\mathbf{x}_t} \log p(y | \mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) $$

Here, $\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$ is the score estimated by the unconditional diffusion model, and $\nabla_{\mathbf{x}_t} \log p(y | \mathbf{x}_t)$ is the gradient of the log-likelihood provided by a classifier $p_\phi(y | \mathbf{x}_t)$ trained to predict the class $y$ from a noisy input $\mathbf{x}_t$.

In practice, for models parameterized via noise prediction $\boldsymbol{\epsilon}_\theta$, the update direction during sampling is adjusted. The standard noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ is modified to incorporate the classifier's gradient. A common formulation for the guided noise prediction $\hat{\boldsymbol{\epsilon}}_\theta$ is:

$$ \hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - s \cdot \sqrt{1-\bar{\alpha}_t} \, \nabla_{\mathbf{x}_t} \log p_\phi(y | \mathbf{x}_t) $$

Here, $s$ is the guidance scale, a hyperparameter that controls the strength of the conditioning. A higher value of $s$ pushes the generation process more strongly towards samples that the classifier $p_\phi$ recognizes as belonging to class $y$.

**Mechanism:** At each step of the reverse diffusion process, the classifier examines the current noisy sample $\mathbf{x}_t$ and calculates how changes to $\mathbf{x}_t$ would increase the probability of the target class $y$.
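The guided noise prediction above can be sketched in PyTorch using automatic differentiation to obtain the classifier gradient. This is a minimal illustration, not a full sampler; `eps_model` and `classifier` are hypothetical stand-ins for a noise-prediction network and a classifier trained on noisy inputs.

```python
import torch

def guided_eps(eps_model, classifier, x_t, t, y, alpha_bar_t, s=1.0):
    """Classifier-guided noise prediction (sketch).

    eps_model(x_t, t) -> predicted noise; classifier(x_t, t) -> class logits
    for the noisy input. Both are hypothetical callables for illustration.
    """
    # Unconditional noise prediction from the diffusion model.
    eps = eps_model(x_t, t)

    # Gradient of log p_phi(y | x_t) with respect to x_t, via autograd.
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)
    log_prob = torch.log_softmax(logits, dim=-1)[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(log_prob, x_in)[0]

    # eps_hat = eps - s * sqrt(1 - alpha_bar_t) * grad log p(y | x_t)
    return eps - s * (1.0 - alpha_bar_t) ** 0.5 * grad
```

Note that the subtraction of the gradient matches the sign convention of the formula: moving $\mathbf{x}_t$ uphill on $\log p_\phi(y | \mathbf{x}_t)$ corresponds to subtracting from the predicted noise.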
This gradient information is then used to nudge the denoising step, effectively biasing the generation towards the desired class.

**Advantages:**

- Can potentially leverage powerful, independently trained classifiers.
- Straightforward: use an external "expert" to guide the generation.

**Disadvantages:**

- Requires a classifier $p_\phi(y | \mathbf{x}_t)$ trained specifically on noisy data at various timesteps $t$, which might not be readily available or easy to train effectively.
- The quality of the generated samples can be highly dependent on the quality of the classifier. Poor or non-robust classifiers can introduce artifacts.
- The classifier gradients can sometimes be noisy or adversarial, degrading sample quality.
- The guidance scale $s$ introduces another hyperparameter requiring careful tuning.

## Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) emerged as a way to achieve conditional generation without relying on a separate classifier model. It has become a widely adopted and highly effective technique, particularly prominent in large-scale models like those used for text-to-image synthesis.

**Mechanism:** The central idea is to train a single conditional diffusion model, typically parameterized by $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$, which takes the conditioning information $y$ (e.g., a class label, a text embedding) as an additional input. During training, the conditioning input $y$ is randomly replaced with a special null token $\emptyset$ (representing unconditional generation) with some probability (e.g., 10-20% of the time). This forces the model to learn both the conditional noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$ and the unconditional noise prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$ within the same set of weights $\theta$.

During sampling, both the conditional and unconditional noise predictions are computed at each step.
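The conditional-dropout step of CFG training can be sketched in a few lines. Here `NULL_TOKEN` is a hypothetical label index reserved for the unconditional case; the drop probability follows the 10-20% range mentioned above.

```python
import torch

NULL_TOKEN = 0    # hypothetical index reserved for the "unconditional" null label
DROP_PROB = 0.15  # probability of replacing y with the null token (10-20% is typical)

def apply_conditional_dropout(y: torch.Tensor, drop_prob: float = DROP_PROB) -> torch.Tensor:
    """Randomly replace class labels with the null token so the same network
    also learns the unconditional noise prediction."""
    mask = torch.rand(y.shape[0]) < drop_prob
    return torch.where(mask, torch.full_like(y, NULL_TOKEN), y)
```

During training, `apply_conditional_dropout` would be called on each batch of labels before they are fed to the noise-prediction network; the rest of the training loop is unchanged.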
The final noise prediction used for the denoising step is then calculated by extrapolating from the unconditional prediction in the direction of the conditional prediction:

$$ \hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset) + s \cdot (\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)) $$

Again, $s$ is the guidance scale (often denoted as $w$ in the literature).

- If $s = 0$, then $\hat{\boldsymbol{\epsilon}}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset)$, resulting in unconditional generation.
- If $s = 1$, then $\hat{\boldsymbol{\epsilon}}_\theta = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$, corresponding to standard conditional generation using the learned conditional model.
- If $s > 1$, the guidance effect is amplified. The formula takes the unconditional prediction and adds a scaled version of the difference vector pointing from the unconditional to the conditional prediction, pushing the result further in the direction indicated by the condition $y$.

**Intuition:** The term $(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \emptyset))$ can be seen as implicitly representing the direction related to the condition $y$ in the noise prediction space. CFG effectively learns this direction directly from the data during training, rather than relying on an external classifier's gradient.
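The extrapolation itself is a one-liner; a minimal sketch, given the two noise predictions as tensors:

```python
import torch

def cfg_eps(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, s: float) -> torch.Tensor:
    """Classifier-free guidance combination:
    eps_hat = eps_uncond + s * (eps_cond - eps_uncond)."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```

Setting `s=0.0` recovers the unconditional prediction, `s=1.0` the plain conditional one, and `s>1` extrapolates past it, mirroring the three cases above.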
Scaling this difference by $s > 1$ strengthens the influence of the condition $y$ on the generation outcome.

```dot
digraph G {
  rankdir=TB;
  node [shape=record, style=filled, fillcolor="#ffffff", fontname="Arial", fontsize=10];
  edge [arrowhead=vee, arrowsize=0.7, fontname="Arial", fontsize=9, color="#495057"];
  xt [label="xt", fillcolor="#dee2e6"];
  eps_unc [label="epsilon_theta(xt, t, null)\nUnconditional", fillcolor="#a5d8ff"];
  eps_cond [label="epsilon_theta(xt, t, y)\nConditional", fillcolor="#b2f2bb"];
  eps_guided [label="epsilon_hat_theta(xt, t, y)\nGuided", fillcolor="#ffd8a8"];
  next_step [label="Use epsilon_hat to compute xt_minus_1", fillcolor="#ffffff", shape=ellipse];
  xt -> eps_unc [label="compute"];
  xt -> eps_cond [label="compute"];
  eps_unc -> eps_guided [label="1.0x"];
  eps_cond -> eps_guided [label="s x"];
  eps_guided -> next_step;
  formula [shape=plaintext, label="epsilon_hat = epsilon_null + s * (epsilon_y - epsilon_null)"];
  eps_guided -> formula [style=dashed, arrowhead=none];
  eps_unc -> eps_cond [style=dotted, arrowhead=none, label="direction: epsilon_y - epsilon_null", fontsize=8];
}
```

Diagram illustrating the Classifier-Free Guidance mechanism during a single sampling step. Both unconditional ($\emptyset$) and conditional ($y$) noise predictions are computed from the current state $\mathbf{x}_t$.
The final guided prediction $\hat{\boldsymbol{\epsilon}}_\theta$ is an extrapolation based on these two predictions and the guidance scale $s$.

**Advantages:**

- Eliminates the need for a separate classifier, simplifying the overall pipeline.
- Often produces higher-quality conditional samples compared to classifier guidance.
- Training involves only the diffusion model itself, potentially leading to better interaction between the generative and conditioning aspects.
- Guidance strength $s$ is easily adjustable at inference time, allowing control over the trade-off between sample quality and adherence to the condition.

**Disadvantages:**

- Requires modifying the training process to include conditional dropout.
- Involves running the model forward twice per sampling step (once with $y$, once with $\emptyset$) if not optimized, increasing inference time compared to a purely conditional or unconditional model (though often faster than classifier guidance, which also needs a classifier evaluation).

## Comparison and Practical Notes

| Feature | Classifier Guidance | Classifier-Free Guidance (CFG) |
| --- | --- | --- |
| External model | Yes (classifier $p_\phi(y \mid \mathbf{x}_t)$) | No |
| Training | Standard diffusion model + separate classifier training (on noisy data) | Modified diffusion model training (with conditional dropout) |
| Inference speed | Needs diffusion model + classifier evaluation per step | Needs diffusion model evaluation twice per step (cond + uncond) |
| Typical quality | Good, but sensitive to classifier quality; can have artifacts | Often state-of-the-art; generally higher quality and fewer artifacts |
| Implementation | Requires integrating two models | Single model, modified training loop |
| Flexibility | Can swap classifiers (if trained) | Guidance baked into the model |

**The Guidance Scale ($s$):** In both methods, the guidance scale $s$ (or $w$) plays a significant role. It controls the trade-off between sample fidelity to the condition and sample diversity/realism.

- Low $s$ (e.g., 0 or 1): Samples are diverse but may not strongly reflect the condition $y$.
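In practice, the two forward passes per step noted above are often fused into a single batched call by concatenating the conditional and null inputs. A minimal sketch, assuming a hypothetical `model(x, t, y)` that accepts a batch of labels:

```python
import torch

def cfg_eps_batched(model, x_t, t, y, null_y, s):
    """Compute conditional and unconditional noise predictions in one batched
    forward pass, then apply the CFG combination. `model` and `null_y` are
    hypothetical stand-ins for the network and its null-condition labels."""
    x_in = torch.cat([x_t, x_t], dim=0)    # duplicate the latent batch
    y_in = torch.cat([y, null_y], dim=0)   # conditional labels + null labels
    eps = model(x_in, t, y_in)
    eps_cond, eps_uncond = eps.chunk(2, dim=0)
    return eps_uncond + s * (eps_cond - eps_uncond)
```

This halves the number of model invocations per step (one larger batch instead of two), a common optimization in text-to-image pipelines.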
  ($s = 0$ yields unconditional samples in CFG.)
- High $s$ (e.g., 5-15): Samples strongly adhere to the condition $y$, but diversity may decrease, and samples can sometimes become oversaturated or develop artifacts, appearing less realistic.

Finding an optimal value for $s$ usually requires empirical tuning for a specific model and task. It provides a powerful knob to adjust the generation behavior at inference time without retraining the model.

## Summary

Guidance techniques are essential for directing the output of diffusion models towards specific desired properties, moving past simple unconditional generation. Classifier guidance uses an external classifier to inject conditioning information via gradients, while Classifier-Free Guidance achieves this more effectively by modifying the training process of the diffusion model itself, enabling it to learn conditional and unconditional generation simultaneously. CFG has become the standard approach due to its superior performance and elimination of the need for a separate, potentially problematic classifier model. Understanding and utilizing these guidance mechanisms is fundamental for applying diffusion models to practical conditional synthesis tasks.