Classifier-Free Guidance (CFG) offers a powerful way to steer the generation process without relying on a separate, pre-trained classifier model. As discussed previously, this avoids potential issues tied to classifier accuracy or domain mismatch and has become a standard technique in modern diffusion models. Let's look at how to implement it.
The core idea is to train a single diffusion model ϵθ that can operate both conditionally and unconditionally. This is achieved through a modification during the training phase.
Training with Conditional Dropout
During training, for each data sample x0 and its corresponding conditioning information y (like a class label or text embedding), we do the following:
Sample a timestep t∼U(1,T).
Sample noise ϵ∼N(0,I).
Compute the noisy sample xt using the forward process equation: xt = √(αˉt) x0 + √(1 − αˉt) ϵ.
With a certain probability puncond (e.g., 10-20%), replace the actual conditioning information y with a special null or unconditional token, denoted as ∅. This replacement acts like a form of "conditional dropout".
Feed the noisy sample xt, the timestep t, and the (potentially replaced) conditioning information into the U-Net model: ϵpred=ϵθ(xt,t,yeff), where yeff is either y or ∅.
Calculate the loss, typically the Mean Squared Error (MSE), between the predicted noise ϵpred and the true noise ϵ: L = ∥ϵ − ϵpred∥².
Update the model parameters θ using gradient descent on this loss.
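To make this concrete, here is a minimal PyTorch sketch of one such training step. The model interface model(x_t, t, y), the alphas_cumprod schedule tensor, and the null_token value are assumptions for illustration; the essential detail is the random replacement of y with the null token before the forward pass.

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, y, alphas_cumprod, null_token, p_uncond=0.1):
    """One CFG training step with conditional dropout (sketch).

    model:          eps_theta(x_t, t, y) -> predicted noise
    x0:             clean samples, shape (B, ...)
    y:              conditioning (e.g. class labels), shape (B,)
    alphas_cumprod: tensor of length T holding the cumulative products alpha_bar_t
    null_token:     index representing the unconditional case (assumed convention)
    """
    B = x0.shape[0]
    T = alphas_cumprod.shape[0]

    # Steps 1-2: sample timesteps and noise.
    t = torch.randint(0, T, (B,), device=x0.device)
    eps = torch.randn_like(x0)

    # Step 3: forward process, x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # Step 4: conditional dropout, replace y with the null token with probability p_uncond.
    drop_mask = torch.rand(B, device=x0.device) < p_uncond
    y_eff = torch.where(drop_mask, torch.full_like(y, null_token), y)

    # Steps 5-6: predict the noise and compute the MSE loss.
    eps_pred = model(x_t, t, y_eff)
    loss = F.mse_loss(eps_pred, eps)
    return loss
```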
By randomly omitting the conditioning information during training, the model learns two things simultaneously:
How to predict the noise given the condition: ϵθ(xt,t,y).
How to predict the noise unconditionally: ϵθ(xt,t,∅).
The null token ∅ needs a specific representation. For class labels, it might be a dedicated "unconditional" class index. For text conditioning (e.g., CLIP text embeddings), it is often either the encoder's embedding of an empty prompt or a dedicated learnable embedding trained alongside the model to represent the absence of text conditioning.
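As an illustration, a class-conditional model might reserve one extra row of a learnable embedding table for ∅. The module below is a hypothetical sketch of that convention, not the only way to represent the null token.

```python
import torch.nn as nn

class ClassConditioning(nn.Module):
    """Class-label conditioning with a reserved 'null' index (sketch)."""

    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        # Index num_classes is reserved for the unconditional (null) token,
        # so the table holds num_classes + 1 learnable embeddings.
        self.embedding = nn.Embedding(num_classes + 1, embed_dim)
        self.null_token = num_classes

    def forward(self, y):
        # y may already contain the null index for dropped conditions.
        return self.embedding(y)
```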
Sampling with Guidance
During the generation (sampling) process, we leverage the model's ability to make both conditional and unconditional predictions at each step. For a given timestep t and the current noisy sample xt, we compute two noise predictions:
Unconditional Prediction: ϵuncond = ϵθ(xt, t, ∅)
Conditional Prediction: ϵcond = ϵθ(xt, t, y) (where y is the desired condition for the output)
Instead of using just ϵcond to perform the denoising step, we combine these two predictions using a guidance scale parameter w (often called guidance strength or scale, sometimes denoted s or γ). The combined noise prediction ϵ~t is calculated as:
ϵ~t=ϵuncond+w⋅(ϵcond−ϵuncond)
This formula has a clear interpretation:
Start with the unconditional noise prediction ϵuncond.
Calculate the "guidance direction": the difference between the conditional and unconditional predictions (ϵcond−ϵuncond). This vector points from the unconditional generation path towards the conditional one.
Scale this direction by the guidance scale w and add it to the unconditional prediction.
An equivalent way to write this is:
ϵ~t=(1−w)ϵuncond+w⋅ϵcond
From this form, we can see:
If w=0, we get ϵ~t=ϵuncond, resulting in purely unconditional generation.
If w=1, we get ϵ~t=ϵcond, which corresponds to standard conditional generation without extra guidance amplification.
If w>1, we extrapolate along the guidance direction, pushing the generation more strongly towards the condition y.
This combined noise estimate ϵ~t is then used in the standard denoising step (e.g., the DDPM or DDIM update rule) to compute the less noisy sample xt−1. The process repeats from t=T down to t=1.
The diagram below illustrates the computation at a single sampling step:
Flow diagram for calculating the guided noise prediction ϵ~t using Classifier-Free Guidance at a single denoising step t.
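A PyTorch sketch of this per-step computation is shown below. The model interface and null_token follow the earlier training sketch, and ddpm_step stands in for whatever standard denoising update (DDPM, DDIM, etc.) you are using; all of these are assumptions for illustration, not a fixed API.

```python
import torch

@torch.no_grad()
def guided_noise_prediction(model, x_t, t, y, null_token, w):
    """Compute the CFG noise estimate eps_tilde at one denoising step (sketch)."""
    # Two forward passes through the same model: unconditional and conditional.
    y_null = torch.full_like(y, null_token)
    eps_uncond = model(x_t, t, y_null)
    eps_cond = model(x_t, t, y)

    # eps_tilde = eps_uncond + w * (eps_cond - eps_uncond)
    return eps_uncond + w * (eps_cond - eps_uncond)

@torch.no_grad()
def sample(model, shape, y, null_token, w, T, ddpm_step):
    """Generate samples with CFG; ddpm_step is an assumed helper applying
    the usual x_t -> x_{t-1} update with the given noise estimate."""
    x_t = torch.randn(shape, device=y.device)
    for step in reversed(range(T)):
        t = torch.full((shape[0],), step, device=y.device, dtype=torch.long)
        eps_tilde = guided_noise_prediction(model, x_t, t, y, null_token, w)
        x_t = ddpm_step(x_t, eps_tilde, t)  # eps_tilde simply replaces the usual prediction
    return x_t
```

The guided estimate just replaces the model's usual noise prediction; the rest of the sampler is unchanged.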
Practical Considerations
Guidance Scale (w): This is a critical hyperparameter. Typical values range from 1.5 to 15. Higher values enforce the condition more strongly, which can improve alignment (e.g., making an image look more like the text prompt) but may lead to saturation, artifacts, or reduced diversity in the generated samples. Lower values yield more diverse but potentially less condition-aligned results. You often need to experiment to find a good balance for your specific task and model.
Conditioning Input: How y and ∅ are integrated into the U-Net architecture is important. Common methods include adding their embeddings to the timestep embeddings or using cross-attention layers within the U-Net blocks, allowing the model to attend to relevant parts of the conditioning information. We'll discuss architecture modifications in more detail later.
Computational Cost: CFG requires running the model forward twice per sampling step (once conditional, once unconditional). This roughly doubles the computational cost of sampling compared to standard conditional or unconditional generation. This trade-off is often acceptable given the significant improvement in control and quality.
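A common mitigation, sketched below under the same assumed model interface, is to batch the conditional and unconditional inputs together so both predictions come from a single forward pass. This does not reduce the total compute, but it can improve wall-clock time by running one larger, better-utilized batch instead of two smaller ones.

```python
import torch

@torch.no_grad()
def guided_noise_prediction_batched(model, x_t, t, y, null_token, w):
    """CFG noise estimate using one batched forward pass instead of two (sketch)."""
    y_null = torch.full_like(y, null_token)

    # Stack [conditional, unconditional] inputs along the batch dimension.
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    y_in = torch.cat([y, y_null], dim=0)

    # Split the output back into the two predictions and combine them.
    eps_cond, eps_uncond = model(x_in, t_in, y_in).chunk(2, dim=0)
    return eps_uncond + w * (eps_cond - eps_uncond)
```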
Implementing CFG involves modifying both the training loop (to handle conditional dropout) and the sampling loop (to perform the two forward passes and combine the results). The next sections will delve into specific architectural changes often used for conditioning and provide practical examples.