Sampling from Consistency Models (Single-step and Multi-step)
Once a consistency model fθ(xt,t) is trained, either through distillation or standalone training, it provides a direct mapping from any point (xt,t) on a probability flow ODE trajectory to an estimate of the trajectory's origin, x^0. This capability allows for significantly accelerated sampling compared to traditional diffusion models that require simulating the reverse process iteratively. We will now examine how to generate samples using both single-step and multi-step approaches.
Single-Step Generation
The most direct application of a consistency model is single-step generation. This method aims to produce a sample in a single forward pass of the model, offering maximum inference speed.
The procedure is straightforward:
1. Sample an initial noise vector xT from the prior distribution. Here, T represents the maximum noise level (time) in the diffusion process. Under the variance-exploding parameterization used by consistency models, xT∼N(0, T²·I), i.e., standard Gaussian noise scaled by T.
2. Evaluate the consistency model fθ at this noise vector and the corresponding maximum time T:
x^0=fθ(xT,T)
3. The output x^0 is the generated sample.
This single evaluation replaces the hundreds or thousands of steps often required by DDPM or DDIM samplers. The underlying principle relies on the trained model's ability to approximate the function that maps any point (xt,t) directly to the starting point x0 of its ODE trajectory.
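The procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `f_theta` stands in for the trained consistency model, and the prior follows the variance-exploding convention (standard Gaussian noise scaled by T).

```python
import numpy as np

def sample_single_step(f_theta, shape, T=80.0, rng=None):
    """One-network-evaluation sampling with a consistency model.

    f_theta(x, t) is a placeholder for the trained model: it maps a
    noisy state x at time t to an estimate of the trajectory origin.
    T=80.0 is a common maximum-time default, not a requirement.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_T = rng.standard_normal(shape) * T  # prior sample: x_T ~ N(0, T^2 I)
    return f_theta(x_T, T)                # single forward pass -> sample
```

Swapping in a real network for `f_theta` (e.g., a model wrapper that accepts an array and a scalar time) is all that is needed to use this in practice.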
Advantages:
Extreme Speed: Single-step generation is the fastest possible method, requiring only one network evaluation per sample.
Simplicity: The sampling algorithm is trivial to implement.
Disadvantages:
Potential Quality Trade-off: The quality of samples generated in a single step might be lower than those produced by multi-step consistency sampling or traditional diffusion models, especially if the consistency property is not perfectly learned across all timesteps. Errors in the single mapping from xT to x^0 are not corrected.
Single-step sampling is particularly useful in applications where generation speed is the primary concern, such as real-time interactive systems, even if it means a slight compromise on the highest possible fidelity.
Multi-Step Generation
To improve sample quality while still maintaining significant speed advantages over traditional methods, consistency models can be used in a multi-step generation process. This approach leverages the consistency property iteratively, adding intermediate denoising and noise injection steps to refine the estimate.
The multi-step algorithm typically involves the following:
1. Choose the number of sampling steps, N (e.g., 2, 5, or 10), which is significantly smaller than typical diffusion sampling step counts.
2. Define a decreasing sequence of timesteps T=tN>tN−1>⋯>t1≥ϵ, where ϵ is a small positive constant representing the minimum noise level (e.g., ϵ=0.002). A common choice is uniform spacing on some scale (e.g., linear or log-linear).
3. Sample the initial noise vector: xtN∼N(0, tN²·I).
4. Iterate from i=N down to 1:
a. Get the current state xti and time ti.
b. Estimate the origin using the consistency model: x^0(i)=fθ(xti,ti).
c. If i>1:
i. Determine the next timestep ti−1.
ii. Sample Gaussian noise: zi∼N(0,I).
iii. Add noise to the estimated origin to obtain the state at the next timestep:
xti−1 = x^0(i) + √(ti−1² − ϵ²)·zi
This step takes the estimate x^0(i) and injects exactly enough noise for xti−1 to lie at noise level ti−1 on the trajectory. The ϵ² is subtracted because the model's output already carries the residual noise of the minimum level ϵ.
d. If i=1: the final estimate x^0(1) is the result.
5. Return the final estimate x^0=x^0(1).
This process uses the consistency model at each step i to get an estimate x^0(i) of the origin from the current state xti. It then 'jumps' to the next time step ti−1 by adding appropriately scaled noise back to this estimate, effectively correcting the path towards the origin at multiple points along the trajectory.
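The estimate-then-renoise loop described above can be sketched as follows. Again this is an illustrative skeleton under the variance-exploding convention: `f_theta` is a placeholder for the trained model, and `timesteps` is any decreasing schedule with its last entry at least `eps`.

```python
import numpy as np

def sample_multistep(f_theta, shape, timesteps, eps=0.002, rng=None):
    """Multi-step consistency sampling.

    timesteps: decreasing sequence [t_N, ..., t_1] with t_1 >= eps.
    Each iteration re-estimates the origin, then re-noises that
    estimate down to the next (smaller) time on the schedule.
    """
    rng = np.random.default_rng() if rng is None else rng
    t_N = timesteps[0]
    x = rng.standard_normal(shape) * t_N   # initial state at maximum time
    x0_hat = f_theta(x, t_N)               # first origin estimate
    for t_next in timesteps[1:]:
        z = rng.standard_normal(shape)
        # Re-noise: the model output sits at noise level eps, so adding
        # noise with std sqrt(t_next^2 - eps^2) yields total std t_next.
        x = x0_hat + np.sqrt(t_next**2 - eps**2) * z
        x0_hat = f_theta(x, t_next)        # refined origin estimate
    return x0_hat
```

With `timesteps = [T]` this degenerates to single-step generation, which makes the speed/quality trade-off a one-parameter knob.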
Advantages:
Improved Sample Quality: Compared to single-step generation, the iterative refinement generally leads to higher fidelity samples that more closely match the quality of the original diffusion model (if distilled) or the target distribution.
Flexible Trade-off: The number of steps N allows tuning the balance between computational cost and sample quality. Even with a small N (e.g., 2-10), significant quality improvements over single-step are often observed.
Disadvantages:
Increased Computation: Requires N network evaluations, making it slower than single-step generation, though still much faster than traditional diffusion samplers.
To summarize the two sampling paths: single-step sampling directly maps the initial noise xT to the estimate x^0, while multi-step sampling follows an iterative path, using intermediate estimates and noise-injection steps (xt3→xt2→xt1) to reach the final x^0.
Practical Considerations
Timestep Schedule: The choice of the sequence tN,…,t1 in multi-step sampling can impact performance. Schedules that place more steps at lower noise levels might be beneficial, similar to findings in accelerated diffusion sampling.
Noise Term: The √(ti−1² − ϵ²) factor correctly scales the injected noise; ensure t1>ϵ so the quantity under the square root stays positive. The exact form might vary slightly depending on the specific noise schedule parameterization used during training.
Boundary Condition ϵ: This small value prevents numerical instability as the time approaches zero and keeps the injected-noise standard deviation real. Sample quality is usually not highly sensitive to its exact value, as long as it is small and positive.
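The timestep-schedule consideration above is often addressed with a power-law spacing that concentrates steps at low noise levels. The sketch below assumes an EDM/Karras-style schedule; the defaults (t_min=0.002, t_max=80, rho=7) are common choices, not requirements.

```python
import numpy as np

def karras_schedule(n, t_min=0.002, t_max=80.0, rho=7.0):
    """Decreasing timestep schedule interpolated in t^(1/rho) space.

    Larger rho packs more of the n steps near t_min, mirroring the
    observation that extra steps at low noise levels help quality.
    Returns n timesteps from t_max down to t_min.
    """
    ramp = np.linspace(0.0, 1.0, n)
    ts = (t_max**(1/rho) + ramp * (t_min**(1/rho) - t_max**(1/rho)))**rho
    return ts
```

A uniform schedule on t (or log t) is the simpler alternative; comparing both on a validation metric is a cheap way to pick one for a given model.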
By choosing between single-step and multi-step generation, you can balance the need for inference speed with the desired level of sample quality, making consistency models a versatile tool for efficient generative modeling.