This section shows how to implement Classifier-Free Guidance (CFG) during the sampling process of diffusion models, with practical examples and code structure. A pre-trained U-Net model capable of accepting conditioning information is assumed throughout.

## Recap: The CFG Mechanism

Remember, CFG guides the generation towards a condition $y$ (like a class label or text embedding) without needing a separate classifier model. It achieves this by leveraging the diffusion model's ability to perform both conditional and unconditional predictions. During sampling at each timestep $t$, we calculate:

- **Conditional noise prediction:** $\epsilon_\theta(x_t, t, y)$, the noise predicted by the model when given the current noisy image $x_t$, the timestep $t$, and the specific condition $y$.
- **Unconditional noise prediction:** $\epsilon_\theta(x_t, t, \emptyset)$, the noise predicted by the model when given $x_t$ and $t$, but with a generic "null" or empty condition $\emptyset$. This represents what the model would generate without specific guidance.

These two predictions are then combined using a guidance scale $w$:

$$ \hat{\epsilon}_t = \epsilon_\theta(x_t, t, \emptyset) + w \cdot \left( \epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \emptyset) \right) $$

This $\hat{\epsilon}_t$ is the guided noise estimate used to compute the next, less noisy state $x_{t-1}$. The guidance scale $w$ controls how strongly the generation adheres to the condition $y$. A value of $w=0$ ignores the condition, resulting in unconditional sampling. Increasing $w$ pushes the generation more strongly towards the condition $y$.

## Preparing Conditioning Information

Before starting the sampling loop, you need to prepare your conditioning input $y$ and the null condition $\emptyset$.

- **Class labels:** If conditioning on class labels (e.g., for MNIST or CIFAR-10), you typically convert the integer label into an embedding vector. This might involve a simple embedding layer within your model or a fixed encoding.
- **Text descriptions:** For text-to-image models, $y$ is usually an embedding derived from the text prompt using a pre-trained text encoder such as CLIP.
- **Null condition:** The null condition $\emptyset$ needs a corresponding representation. This is often a specific learned embedding vector trained to represent the absence of conditioning. Sometimes a simple vector of zeros is used, or it may be tied to a specific token in the text encoder's vocabulary when using text conditioning. The point is that the model was trained to recognize this specific input as the signal for unconditional generation (via conditioning dropout during training).

In what follows, assume `y_cond` holds the conditioning vector for your desired output (e.g., the embedding for "cat" or class 7) and `y_null` holds the vector for the null condition; a short sketch of preparing these is shown below.
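What `y_cond` and `y_null` actually look like depends on how the model was trained. As a minimal sketch, assuming class-label conditioning where the embedding table reserves one extra index for the null condition (the same convention used for conditioning dropout later), preparing both vectors might look like the following. In practice this embedding layer usually lives inside the model itself; it is shown standalone here only for illustration.

```python
import torch
import torch.nn as nn

# Assumed setup: 10 classes (e.g., MNIST digits) plus one reserved "null" index.
num_classes = 10
embed_dim = 128
device = "cuda" if torch.cuda.is_available() else "cpu"

# In practice this table is part of the trained model; shown standalone here.
label_embedding = nn.Embedding(num_classes + 1, embed_dim).to(device)

batch_size = 4

# Condition: ask for the digit 7 in every sample of the batch.
labels = torch.full((batch_size,), 7, dtype=torch.long, device=device)
y_cond = label_embedding(labels)        # shape: (batch_size, embed_dim)

# Null condition: the reserved index the model saw during conditioning dropout.
null_labels = torch.full((batch_size,), num_classes, dtype=torch.long, device=device)
y_null = label_embedding(null_labels)   # shape: (batch_size, embed_dim)
```

For text conditioning, the same roles are typically filled by the text encoder: `y_cond` comes from encoding the prompt, and `y_null` from encoding an empty (or dedicated null) prompt.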
## Implementing the Guided Sampling Loop

The core modification happens inside the sampling loop (whether using DDPM or DDIM). Here's a simplified structure, assuming a PyTorch-like framework and a helper (here `scheduler.step`) that performs the standard reverse step (like Eq. 11 or 12 from the DDPM paper, or the DDIM update) given $x_t$, $t$, and a predicted noise $\epsilon$:

```python
import torch

# Assume: model is your U-Net, scheduler holds the noise schedule info
# sample_shape is e.g. (batch_size, channels, height, width)
# timesteps is a list/tensor of timesteps, e.g., [999, 998, ..., 0]
# y_cond is the conditioning vector for the desired output
# y_null is the null conditioning vector
# w is the guidance scale (e.g., 7.5)

x_t = torch.randn(sample_shape, device=device)  # Start with random noise x_T ~ N(0, I)

for t_val in timesteps:
    t_tensor = torch.full((batch_size,), t_val, dtype=torch.long, device=x_t.device)

    # No gradients are needed during inference; torch.no_grad() saves memory and time.
    with torch.no_grad():
        # 1. Predict noise for the conditional input
        pred_noise_cond = model(x_t, t_tensor, y_cond)

        # 2. Predict noise for the unconditional input
        pred_noise_uncond = model(x_t, t_tensor, y_null)

    # 3. Combine predictions using the CFG formula
    guided_noise = pred_noise_uncond + w * (pred_noise_cond - pred_noise_uncond)

    # 4. Use the guided noise to compute x_{t-1}
    #    This step depends on whether you use DDPM or DDIM sampling logic.
    #    Example assuming a function encapsulating the reverse step:
    x_t = scheduler.step(guided_noise, t_val, x_t)  # Updates x_t to x_{t-1}

# After the loop, x_t holds x_0 (the generated sample)
generated_sample = x_t
```

Steps in the loop:

1. **Get timestep:** Get the current timestep $t$.
2. **Predict conditional noise:** Pass $x_t$, $t$, and the target condition `y_cond` to the model.
3. **Predict unconditional noise:** Pass $x_t$, $t$, and the null condition `y_null` to the model.
4. **Apply CFG formula:** Calculate the `guided_noise` from the unconditional prediction, the conditional prediction, and the guidance scale `w`.
5. **Perform denoising step:** Use this `guided_noise` in your chosen sampler's (DDPM or DDIM) reverse diffusion equation to calculate $x_{t-1}$, and update `x_t` for the next iteration.

Repeat this for all timesteps from $T-1$ down to 0. The final `x_t` will be your generated sample $x_0$.

## Observing the Effect of Guidance Scale (w)

The choice of $w$ significantly impacts the output.

- **Low $w$ (e.g., 0 or 1):** Generation is less constrained by the condition. If $w=0$, it's purely unconditional. If $w=1$, it follows the learned conditional distribution but might lack strong adherence. Samples may be diverse but less aligned with the prompt $y$.
- **Moderate $w$ (e.g., 3 to 10):** Often the sweet spot. Balances adherence to the condition $y$ with overall sample quality and diversity. The generated image clearly reflects the condition.
- **High $w$ (e.g., 15+):** Strong adherence to the condition, but samples might become less diverse, potentially exhibiting saturation or artifacts. The model might over-emphasize features related to the condition.

Experimenting with different values of $w$ is common to find the best trade-off for a specific model and task; a small sweep sketch is shown below.
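To support that kind of experimentation, one option is to wrap the guided loop above in a small helper and generate the same batch at several guidance scales. This is only a sketch: `cfg_sample` is a hypothetical wrapper around the loop from the previous section, not a library API, and fixing the random seed keeps the starting noise identical so that only the guidance differs between runs.

```python
import torch

def cfg_sample(model, scheduler, y_cond, y_null, w, sample_shape, timesteps, device):
    """Hypothetical wrapper around the guided sampling loop shown above."""
    x_t = torch.randn(sample_shape, device=device)
    for t_val in timesteps:
        t_tensor = torch.full((sample_shape[0],), t_val, dtype=torch.long, device=device)
        with torch.no_grad():
            pred_cond = model(x_t, t_tensor, y_cond)
            pred_uncond = model(x_t, t_tensor, y_null)
        guided_noise = pred_uncond + w * (pred_cond - pred_uncond)
        x_t = scheduler.step(guided_noise, t_val, x_t)
    return x_t

# Generate with several guidance scales for side-by-side comparison.
samples_by_scale = {}
for w in [0.0, 1.0, 3.0, 7.5, 15.0]:
    torch.manual_seed(0)  # same starting noise for every w
    samples_by_scale[w] = cfg_sample(model, scheduler, y_cond, y_null,
                                     w, sample_shape, timesteps, device)
```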
Let's visualize how changing $w$ might affect generating, say, the digit '8' using a diffusion model trained on MNIST with CFG.

*(Illustrative chart: "Effect of Guidance Scale (w) on Digit Generation". Adherence to the condition '8' and sample diversity/quality are plotted against guidance scale values w = 0, 1, 7, and 15, with the region around w = 5 to 10 shaded as the often-optimal range.)*

This illustrative plot shows the typical trade-off. As the guidance scale $w$ increases, adherence to the condition (generating an '8') generally improves, but sample diversity, and potentially overall quality, may decrease past a certain point, sometimes leading to artifacts at very high values. The shaded region indicates a common range where a good balance is often found.

## Training Requirement: Conditioning Dropout

It's important to remember that CFG during sampling relies on the model being specifically trained to handle both conditional and unconditional inputs. This is typically achieved using conditioning dropout during the training phase:

1. During each training step, select a batch of data $x_0$ and corresponding conditions $y$.
2. For a fraction of the samples in the batch (e.g., 10-20%), replace the true condition $y$ with the null condition embedding $\emptyset$.
3. Train the model using the standard diffusion loss (predicting the noise $\epsilon$), providing either the true condition $y$ or the null condition $\emptyset$ as input alongside $x_t$ and $t$.

This forces the model to learn how to predict noise both when guided by a specific condition and when no condition is provided (using the null embedding). Without this training strategy, the model wouldn't know how to interpret the null condition $\emptyset$, and the CFG formula wouldn't produce meaningful guidance. A sketch of this training-time dropout step is shown at the end of this section.

By implementing the guided sampling loop described here, leveraging a model trained with conditioning dropout, you can effectively steer the diffusion process to generate outputs that match your desired conditions. This significantly expands the creative control offered by diffusion models.
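As referenced above, here is a minimal sketch of the conditioning-dropout step inside a training loop. Names such as `null_index`, `drop_prob`, `num_timesteps`, and the `scheduler.add_noise` helper are assumptions for illustration, consistent with the earlier snippets rather than a specific library API.

```python
import torch
import torch.nn.functional as F

drop_prob = 0.15           # fraction of samples trained unconditionally (e.g., 10-20%)
null_index = num_classes   # reserved embedding index for the null condition

for x_0, labels in dataloader:
    x_0, labels = x_0.to(device), labels.to(device)

    # Conditioning dropout: randomly replace some labels with the null index.
    drop_mask = torch.rand(labels.shape[0], device=device) < drop_prob
    labels = torch.where(drop_mask, torch.full_like(labels, null_index), labels)
    y = label_embedding(labels)

    # Standard diffusion training step: noise x_0 to a random timestep t,
    # then train the model to predict that noise given (x_t, t, y).
    t = torch.randint(0, num_timesteps, (x_0.shape[0],), device=device)
    noise = torch.randn_like(x_0)
    x_t = scheduler.add_noise(x_0, noise, t)   # assumed forward-noising helper

    pred_noise = model(x_t, t, y)
    loss = F.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```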