Leveraging the power of pre-trained diffusion models provides an effective route for training consistency models. This method, known as consistency distillation (CD), treats the existing diffusion model as a "teacher" that guides the training of the "student" consistency model. The goal is to transfer the generative capabilities learned by the iterative teacher model into a student model capable of fast, potentially single-step, generation.
The Teacher-Student Framework
In this setup:
- Teacher Model (ϕ): This is a pre-trained, high-performing diffusion model (like a DDPM or DDIM-trained model). Its role is to provide accurate estimates of the solution paths defined by the probability flow ODE associated with the diffusion process. It doesn't generate the final output directly but provides the necessary intermediate steps or score estimates. The teacher model's parameters (ϕ) are frozen during consistency distillation.
- Student Model (θ): This is the consistency model fθ(x,t) we aim to train. It takes a noisy input xt and a timestep t and directly predicts the estimated origin of the trajectory, $\hat{x}_0$.
- Target Model (θ−): To stabilize training and improve performance, a separate target network fθ−(x,t) is typically used. This network's parameters (θ−) are an exponential moving average (EMA) of the student model's parameters (θ). It provides the target values for the student model's predictions during training.
The core idea is to train the student model fθ such that its output remains consistent along the trajectories defined by the teacher model ϕ.
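A minimal sketch of how these three networks might be set up in PyTorch follows; the `TeacherUNet` class and the checkpoint path are placeholders for whichever pre-trained architecture is being distilled, not a specific library API.

```python
import copy
import torch

# Illustrative setup; `TeacherUNet` and the checkpoint path are placeholders
# for whatever architecture/weights the pre-trained diffusion model uses.
teacher = TeacherUNet()
teacher.load_state_dict(torch.load("teacher_checkpoint.pt"))
teacher.requires_grad_(False)
teacher.eval()                     # frozen teacher, phi

# The student typically reuses the teacher's architecture and is often
# initialized from its weights; it is the only network receiving gradients.
student = copy.deepcopy(teacher)   # f_theta
student.requires_grad_(True)
student.train()

# The target network f_{theta^-} starts as a copy of the student and is
# updated only via an EMA of the student's parameters, never by backprop.
target = copy.deepcopy(student)
target.requires_grad_(False)
target.eval()
```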
The Consistency Distillation Objective
Recall the consistency property: for any pair of points (xt,xt′) on the same ODE trajectory where t′<t, we want f(xt,t)≈f(xt′,t′). Distillation enforces this by minimizing the difference between the student model's output at a later time t and the target model's output at an earlier time t′ on the same trajectory, where the step from xt to xt′ is estimated using the teacher model.
The training process involves sampling pairs of adjacent timesteps $(t_{n+1}, t_n)$ from a discretization $0 < \epsilon = t_1 < t_2 < \cdots < t_N = T$ of the time interval $[\epsilon, T]$. For each pair (the full procedure is sketched in code after this list):
- Sample a data point x0∼pdata(x).
- Sample Gaussian noise z∼N(0,I).
- Generate the noisy sample $x_{t_{n+1}}$ corresponding to time $t_{n+1}$ using the standard forward process, e.g. $x_{t_{n+1}} = \alpha_{t_{n+1}} x_0 + \sigma_{t_{n+1}} z$.
- Use the teacher model ϕ and a one-step ODE solver (like Euler or Heun) to estimate the point $x_{t_n}$ on the trajectory that would precede $x_{t_{n+1}}$. This step typically involves using the teacher's noise prediction $\hat{\epsilon}_\phi(x_{t_{n+1}}, t_{n+1})$ or score estimate $s_\phi(x_{t_{n+1}}, t_{n+1})$. For instance, using the DDIM update rule:
$$\hat{x}_0 = \frac{x_{t_{n+1}} - \sigma_{t_{n+1}}\,\hat{\epsilon}_\phi(x_{t_{n+1}}, t_{n+1})}{\alpha_{t_{n+1}}}$$
$$x_{t_n} = \alpha_{t_n}\,\hat{x}_0 + \sigma_{t_n}\,\hat{\epsilon}_\phi(x_{t_{n+1}}, t_{n+1})$$
(Note: More sophisticated ODE solvers can be used here for better accuracy).
- Compute the consistency distillation loss:
$$\mathcal{L}_{CD}(\theta, \theta^-; \phi) = \mathbb{E}_{n,\, x_0,\, z}\big[\lambda(t_n)\, d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}(x_{t_n}, t_n)\big)\big]$$
Here:
- n is sampled uniformly from {1,…,N−1}.
- fθ(xtn+1,tn+1) is the student model's prediction using the "later" noisy sample.
- fθ−(xtn,tn) is the target model's prediction using the "earlier" sample estimated via the teacher. Crucially, gradients are not propagated through the target network fθ− or the teacher model ϕ.
- d(⋅,⋅) is a distance function measuring the difference between the predictions. Common choices include L2 distance, L1 distance, or perceptual metrics like LPIPS.
- λ(tn) is an optional positive weighting function, often set to 1.
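Putting the steps above together, here is a minimal PyTorch-style sketch of one consistency distillation training step. It assumes an epsilon-prediction teacher, a student and target network that map (x_t, t) directly to an $\hat{x}_0$ estimate, precomputed tensors `alphas`, `sigmas`, and `timesteps` for the discretization, an L2 distance, and λ(t_n) = 1; it illustrates the procedure rather than reproducing any particular codebase.

```python
import torch
import torch.nn.functional as F

def cd_training_step(student, target, teacher, x0, alphas, sigmas, timesteps):
    """One consistency-distillation step (illustrative sketch).
    `alphas`, `sigmas`, `timesteps` are 1-D tensors of length N holding the
    discretization t_1 < ... < t_N; `teacher` predicts noise epsilon, while
    `student` and `target` map (x_t, t) directly to an estimate of x_0.
    Inputs are assumed to be image batches of shape (B, C, H, W)."""
    B = x0.shape[0]
    # Pick a random index n (0-indexed: 0 .. N-2) per sample.
    n = torch.randint(0, len(timesteps) - 1, (B,), device=x0.device)

    a_np1 = alphas[n + 1].view(-1, 1, 1, 1)
    s_np1 = sigmas[n + 1].view(-1, 1, 1, 1)
    a_n = alphas[n].view(-1, 1, 1, 1)
    s_n = sigmas[n].view(-1, 1, 1, 1)

    # Forward process: perturb x0 to time t_{n+1}.
    z = torch.randn_like(x0)
    x_np1 = a_np1 * x0 + s_np1 * z

    with torch.no_grad():
        # Teacher + one DDIM step: estimate x_{t_n} on the same ODE trajectory.
        eps = teacher(x_np1, timesteps[n + 1])
        x0_hat = (x_np1 - s_np1 * eps) / a_np1
        x_n = a_n * x0_hat + s_n * eps

        # Target network prediction at the earlier point (no gradient flow).
        tgt = target(x_n, timesteps[n])

    # Student prediction at the later point; d = L2, lambda(t_n) = 1.
    pred = student(x_np1, timesteps[n + 1])
    return F.mse_loss(pred, tgt)
```

In the original consistency models formulation the student is parameterized as $f_\theta(x, t) = c_{\text{skip}}(t)\,x + c_{\text{out}}(t)\,F_\theta(x, t)$ with $c_{\text{skip}}(\epsilon) = 1$ and $c_{\text{out}}(\epsilon) = 0$, so that the boundary condition $f_\theta(x, \epsilon) = x$ holds by construction; the sketch above assumes this is handled inside the `student` and `target` modules.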
Target Network Updates
The target network parameters θ− are updated periodically using an exponential moving average (EMA) of the student parameters θ:
$$\theta^- \leftarrow \mu\,\theta^- + (1 - \mu)\,\theta$$
The momentum parameter μ is typically close to 1 (e.g., 0.99, 0.999). This slow update provides stable targets for the student model, preventing oscillations and improving convergence, similar to techniques used in reinforcement learning and self-supervised learning.
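A possible implementation of this update in PyTorch, applied after each optimizer step (parameter and function names are illustrative):

```python
import torch

@torch.no_grad()
def ema_update(target, student, mu=0.999):
    # theta^- <- mu * theta^- + (1 - mu) * theta, applied parameter-wise.
    for p_tgt, p_stu in zip(target.parameters(), student.parameters()):
        p_tgt.mul_(mu).add_(p_stu, alpha=1.0 - mu)
```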
Implementation Considerations
- Timestep Discretization (N): The number of discrete steps N used during training sets how finely consistency is enforced along the trajectory. Larger N places adjacent timesteps $(t_{n+1}, t_n)$ closer together, which makes the teacher's one-step ODE estimate more accurate, but each individual consistency pair then provides a weaker training signal (see the discretization sketch after this list).
- ODE Solver: The choice of ODE solver used to estimate xtn from xtn+1 with the teacher model impacts the accuracy of the target. Higher-order solvers might yield better results at the cost of computation.
- Distance Metric (d): L2 loss is common, but L1 can be more robust to outliers. Perceptual losses like LPIPS can sometimes yield results that align better with human perception, especially for images.
- Architecture: The architecture of the student model fθ often mirrors the teacher model's architecture (e.g., a U-Net or DiT) but is trained with the consistency objective.
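To make the discretization point concrete, the sketch below constructs the N boundary timesteps with the Karras-style spacing commonly used for teachers in the EDM/σ parameterization; the defaults ε = 0.002, T = 80, ρ = 7 are assumed values, and a uniform spacing over [ε, T] is a simpler alternative.

```python
import torch

def karras_timesteps(N, eps=0.002, T=80.0, rho=7.0):
    # Boundaries t_1 = eps < ... < t_N = T, spaced so that steps concentrate
    # near the low-noise end of the trajectory (rho controls the skew).
    i = torch.arange(N, dtype=torch.float64)
    return (eps ** (1 / rho) + i / (N - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho

timesteps = karras_timesteps(N=18)  # e.g. N = 18; larger N enforces consistency on a finer grid
```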
Diagram illustrating the consistency distillation training process. Data x0, noise z, and a timestep tn+1 produce xtn+1. The teacher model ϕ helps estimate the prior point xtn on the trajectory. The student model fθ predicts the origin from xtn+1, while the target model fθ− predicts the origin from xtn. The loss minimizes the distance between these predictions, updating only the student model parameters θ. The target parameters θ− are updated via EMA from θ.
Advantages and Disadvantages
Advantages:
- Leverages Powerful Teachers: Effectively transfers the knowledge of state-of-the-art diffusion models, rather than learning the data distribution from scratch.
- Potentially Faster Convergence: Compared to training from scratch (consistency training), distillation can sometimes converge faster as it starts with strong guidance from the teacher.
- High-Quality Results: Distilled consistency models have demonstrated the ability to generate high-fidelity samples in significantly fewer steps than their teacher models.
Disadvantages:
- Dependency on Teacher Model: The performance of the distilled consistency model is inherently limited by the quality of the teacher diffusion model. Any flaws or biases in the teacher may be transferred.
- Requires Pre-trained Model: This approach necessitates having a well-trained diffusion model available, which itself requires significant computational resources and data.
Consistency distillation provides a practical and effective method for obtaining fast generative models by building upon the successes of established diffusion models. It represents a significant step towards mitigating the slow sampling speed that often limits the applicability of diffusion-based generative approaches.