Classifier-Free Guidance (CFG), covered in Chapter 4, provides a powerful way to steer the diffusion sampling process towards a desired condition (like a text prompt) without needing a separate classifier model. By interpolating between a conditional prediction and an unconditional prediction, CFG allows control over the strength of the conditioning signal via the guidance scale, often denoted as w or γ.
While effective, using a fixed guidance scale throughout the entire sampling process can sometimes be suboptimal. Very high guidance scales, while potentially improving prompt adherence, can lead to oversaturation, unnatural contrast, or visual artifacts in the generated samples. Conversely, low scales might ignore the conditioning too much. Guided sampling refinements aim to mitigate these issues and gain finer control over the generation process.
Instead of using a constant guidance scale w for all timesteps t, dynamic guidance involves adjusting the scale during the reverse diffusion process. The intuition is that the optimal level of guidance might differ depending on how much structure or noise is present in the image at a given step.
For instance, one common strategy is to start with a higher guidance scale in the initial steps (when the image is mostly noise) to strongly establish the conditioned features, and then gradually decrease the scale in later steps as details are refined. This can help prevent the oversaturation associated with maintaining high guidance when the image structure is already largely formed.
The adjustment can be based on the timestep t or the noise level σt. A simple schedule might linearly or non-linearly decay the guidance scale from a maximum value wmax to a minimum value wmin over the sampling steps.
```python
# Dynamic guidance scaling: CFG with a time-varying scale
def dynamic_cfg_prediction(model_output_cond, model_output_uncond, w_schedule, t):
    """Applies CFG with a time-varying guidance scale."""
    guidance_scale = w_schedule(t)  # Get scale for current step t
    return model_output_uncond + guidance_scale * (model_output_cond - model_output_uncond)

# Example schedule function (linear decay).
# Here t counts completed sampling steps: 0 at the start, T at the end.
def linear_decay_schedule(t, T, w_max, w_min):
    """Linearly decay guidance from w_max to w_min over T steps."""
    progress = t / T  # 0.0 at the first step, 1.0 at the last
    return w_max - (w_max - w_min) * progress

# During sampling loop:
# noise_pred = dynamic_cfg_prediction(cond_pred, uncond_pred, my_schedule, current_t)
```
Here's a visualization comparing a fixed guidance scale to a simple linear decay schedule:
Comparison of a fixed guidance scale (e.g., 7.0) versus a dynamic scale linearly decaying from 10.0 to 2.0 over 50 sampling steps.
Experimenting with different schedules (e.g., cosine decay, step-based changes) can yield different trade-offs between prompt alignment and image naturalness.
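As one concrete alternative, a cosine decay schedule can be sketched as follows. It holds the scale near w_max longer during the early, high-noise steps and then drops off more quickly than linear decay; the function name and the step-counting convention (t from 0 to T) are illustrative choices, not a fixed API.

```python
import math

def cosine_decay_schedule(t, T, w_max, w_min):
    """Decay guidance from w_max to w_min along a half-cosine curve.

    t counts completed sampling steps, from 0 (start) to T (end).
    """
    progress = t / T  # 0.0 at the first step, 1.0 at the last
    return w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(math.pi * progress))
```

Swapping this in for the linear schedule changes only the shape of the decay; the CFG blending step itself is unchanged.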
Another refinement, particularly useful with high guidance scales, is thresholding. High guidance can sometimes push the predicted x0 (the estimated clean image) far outside the typical data range (e.g., [-1, 1] for normalized images). This can lead to clamping artifacts when the values are clipped back into the valid range later in the sampling step.
Thresholding techniques aim to correct these predicted x0 values before they are used to compute the next step's latent xt−1.
Static Thresholding: A simple approach involves clamping the predicted x0 values to a fixed percentile range of the data distribution. For example, if 99% of the training data pixel values fall within [-1.5, 1.5], you might clamp the predicted x0 to this range at each step.
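Static thresholding amounts to a single clamp. A minimal sketch, where the bounds [-1.5, 1.5] are illustrative and should be chosen from the percentiles of your own training data:

```python
import numpy as np

def static_thresholding(predicted_x0, lo=-1.5, hi=1.5):
    """Clamp the predicted clean image to a fixed range.

    The default bounds are examples only; pick them from your data
    distribution (e.g., the range covering 99% of pixel values).
    """
    return np.clip(predicted_x0, lo, hi)
```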
Dynamic Thresholding: A more sophisticated method adjusts the threshold based on the statistics of the predicted x0 at the current step. Proposed in the Imagen paper, dynamic thresholding computes a percentile p of the absolute pixel values in the predicted x0. If this percentile exceeds a threshold s (often set slightly above 1.0, e.g., 1.2 to 1.5), the entire predicted x0 is rescaled to bring that percentile value down to s.
```python
import numpy as np

# Dynamic thresholding (simplified)
def dynamic_thresholding(predicted_x0, percentile=99.0, threshold_scale=1.5):
    """Applies dynamic thresholding to the predicted clean image.

    If the given percentile of absolute pixel values exceeds
    threshold_scale, the whole prediction is rescaled so that
    this percentile lands exactly at threshold_scale.
    """
    abs_pixels = np.abs(predicted_x0)
    # Calculate the specified percentile of absolute pixel values
    p_value = np.percentile(abs_pixels, percentile)
    if p_value > threshold_scale:
        # If the percentile exceeds the threshold, scale down the entire prediction
        scaling_factor = threshold_scale / p_value
        predicted_x0 = predicted_x0 * scaling_factor
    # Optional: clamp to [-1, 1] after scaling if needed by the sampler logic
    # predicted_x0 = np.clip(predicted_x0, -1.0, 1.0)
    return predicted_x0

# Usage within a sampling step that predicts x0:
# 1. Get conditional and unconditional model outputs (e.g., predicted noise eps)
# 2. Calculate the CFG blended prediction: eps_cfg = uncond_eps + w * (cond_eps - uncond_eps)
# 3. Calculate the corresponding predicted x0: pred_x0 = get_x0_from_noise(xt, eps_cfg, t)
# 4. Apply thresholding: pred_x0_thresholded = dynamic_thresholding(pred_x0)
# 5. Use pred_x0_thresholded to calculate the next latent x_{t-1} (or the noise to use)
```
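For step 3 above, a minimal sketch of `get_x0_from_noise` under the standard DDPM parameterization x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, solved for x_0. The `alphas_cumprod` argument (the cumulative product of the noise-schedule alphas, indexed by timestep) is an assumed input; your sampler may store this quantity under a different name.

```python
import numpy as np

def get_x0_from_noise(xt, eps, t, alphas_cumprod):
    """Recover the predicted clean image x0 from a noise prediction.

    Inverts x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    where alphas_cumprod[t] is alpha_bar at timestep t.
    """
    alpha_bar_t = alphas_cumprod[t]
    return (xt - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
```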
Dynamic thresholding helps prevent saturation and maintain image fidelity even at very high guidance scales, allowing for stronger prompt adherence without severe degradation in image quality.
These guidance refinements often interact with the choice of sampler, so a schedule tuned for one sampler may need adjustment for another. Implementing dynamic guidance and thresholding can improve prompt adherence at high guidance scales while reducing oversaturation and clamping artifacts, though it introduces additional hyperparameters to tune.
Experimentation is key. The optimal dynamic schedule, thresholding parameters (p, s), and their combination often depend on the specific model architecture, dataset, sampler, and desired output characteristics. These refinements provide valuable tools for pushing the quality and controllability of guided diffusion sampling.
© 2025 ApX Machine Learning