As highlighted in the chapter introduction, and recalling the connection between diffusion models and Ordinary Differential Equations (ODEs) discussed in Chapter 1, the iterative nature of standard diffusion samplers poses a practical challenge due to its high computational cost. Consistency models offer a distinct approach that circumvents this multi-step generation process. The foundation of this approach lies in the consistency property, derived from the underlying continuous-time perspective of diffusion.
Diffusion models can be interpreted through the lens of continuous stochastic processes and their corresponding ODEs. Specifically, the generation process can be viewed as reversing a forward process described by an ODE, often referred to as the Probability Flow (PF) ODE. Let x(t) represent the state (e.g., an image) at time t, where t ranges from t=0 (real data) to t=T (approximately pure noise). The PF ODE describes the path, or trajectory, that transforms a data point x(0) into noise x(T) and vice-versa. Standard diffusion samplers essentially approximate the solution to this ODE by taking many small, discrete steps backward from x(T) to estimate x(0).
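To make this concrete, the sketch below shows a minimal multi-step sampler that integrates the PF ODE backward with plain Euler steps. It assumes the variance-exploding (EDM-style) parameterization in which $\sigma(t) = t$ and the PF ODE drift is $-t\,\nabla_x \log p_t(x)$; `score_fn` is a placeholder for a trained score network, and all names here are illustrative rather than taken from any specific library.

```python
import numpy as np

def pf_ode_drift(x, t, score_fn):
    # PF ODE drift dx/dt = -t * score(x, t) under the sigma(t) = t (EDM-style) schedule.
    return -t * score_fn(x, t)

def euler_sample(score_fn, x_T, t_max=80.0, t_min=0.002, num_steps=100):
    """Approximate x(0) by integrating the PF ODE backward with many small Euler steps.

    This is the iterative procedure that consistency models aim to replace with a
    single evaluation of f(x, T).
    """
    ts = np.linspace(t_max, t_min, num_steps + 1)
    x = x_T
    for i in range(num_steps):
        dt = ts[i + 1] - ts[i]          # negative step: moving from noise toward data
        x = x + pf_ode_drift(x, ts[i], score_fn) * dt
    return x
```

Each of the `num_steps` iterations calls the network once, which is exactly the cost that consistency models seek to avoid.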
The central idea behind consistency models revolves around a specific property of functions defined on these ODE trajectories. Imagine a function, let's call it f, that takes a noisy state x(t) at any time t along a particular PF ODE trajectory and directly maps it back to the origin of that trajectory, x(0).
Mathematically, let $\{x(t)\}_{t \in [t_{\min}, T]}$ be the solution trajectory of the PF ODE starting from $x(0)$, where $t_{\min}$ is a small positive value close to zero (acting as the lower bound of integration) and $T$ is the maximum noise schedule time. A function $f(x, t)$ satisfies the consistency property if, for all $t$ within the interval $[t_{\min}, T]$ along this specific trajectory:
$$f(x(t), t) = x(0)$$

This equation signifies that no matter where you are on the trajectory (at any time $t > t_{\min}$), the consistency function $f$ always yields the initial data point $x(0)$ from which that trajectory originated.
Furthermore, this implies a form of self-consistency: for any two points $x(t_1)$ and $x(t_2)$ on the same trajectory (where $t_1, t_2 \in [t_{\min}, T]$):
$$f(x(t_1), t_1) = f(x(t_2), t_2) = x(0)$$

It is this characteristic self-consistency across time for points on the same trajectory that gives these models their name.
Diagram illustrating the consistency property. Points $x(t_{\min})$, $x(t_1)$, $x(t_2)$, and $x(T)$ lie on the same Probability Flow ODE trajectory originating from $x(0)$. The consistency function $f(x, t)$ maps each of these points back to the original data point $x(0)$.
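The self-consistency property can be stated as a simple numerical check: every point on one trajectory, passed through $f$ with its corresponding time, should map to the same output. The helper below is a minimal illustrative sketch (the names `f`, `trajectory`, and `times` are hypothetical), not part of any training procedure.

```python
import numpy as np

def check_self_consistency(f, trajectory, times, atol=1e-3):
    """Verify that f maps every point on a single PF ODE trajectory to the same output.

    trajectory[i] is the state x(times[i]); a valid consistency function satisfies
    f(x(t1), t1) == f(x(t2), t2) for all t1, t2 in [t_min, T] on that trajectory.
    """
    outputs = [f(x_t, t) for x_t, t in zip(trajectory, times)]
    reference = outputs[0]
    return all(np.allclose(out, reference, atol=atol) for out in outputs[1:])
```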
The existence of such a consistency function $f$ has profound implications for generation. If we could learn or accurately approximate $f(x, t)$, we could perform generation in potentially a single step. We would start with a sample $x_T$ drawn from the noise distribution $p_T(x)$ (typically a standard Gaussian) and simply compute:
$$\hat{x}_0 = f(x_T, T)$$

This directly estimates the data point $x_0$ corresponding to the trajectory ending at the noise sample $x_T$, bypassing the need for iterative refinement used in DDPM, DDIM, or other ODE solvers.
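In code, one-step generation reduces to a single function evaluation. The sketch below assumes the same $\sigma(t) = t$ convention as the earlier Euler example, so $p_T(x)$ is approximately $\mathcal{N}(0, T^2 I)$; `consistency_fn` stands in for a learned approximation of $f(x, t)$ and is hypothetical.

```python
import numpy as np

def one_step_sample(consistency_fn, shape, t_max=80.0, seed=0):
    """Generate a sample with a single evaluation of the consistency function."""
    rng = np.random.default_rng(seed)
    x_T = t_max * rng.standard_normal(shape)   # draw x_T from p_T ~ N(0, T^2 I) when sigma(t) = t
    return consistency_fn(x_T, t_max)          # x_hat_0 = f(x_T, T), no iterative refinement
```

Compare this with the `euler_sample` loop earlier: the network is queried once here versus `num_steps` times there.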
This is fundamentally different from standard diffusion model sampling. Traditional methods learn a model (often a U-Net or Transformer) to predict the score function $\nabla_x \log p_t(x)$ or the noise $\epsilon$ added at time $t$. Sampling then involves using these predictions within a numerical solver (like Euler-Maruyama for the SDE or DDIM/DPM-Solver for the ODE) to take many small steps $\Delta t$ backward in time, gradually transforming noise $x_T$ into a sample $\hat{x}_0$.
In contrast, consistency models aim to learn the result of the entire integration process from time $T$ back to $t_{\min} \approx 0$, encapsulated within the function $f$. The subsequent sections will explore how we can train neural networks to approximate this consistency function $f(x, t)$, enabling extremely fast generation.