As introduced, diffusion models operate by progressively adding noise to data and then learning to reverse this process. While models like DDPM are often presented using discrete time steps, a more general and powerful mathematical framework arises when we consider the continuous-time limit of this noising process. This leads us to the language of Stochastic Differential Equations (SDEs).
Understanding the SDE formulation provides deeper insights into why diffusion models work and unifies various discrete-time diffusion approaches under a single mathematical structure. It also opens doors to more flexible noise scheduling and sampling techniques.
Recall the discrete forward process in DDPM, where noise is added incrementally at each step $t$. If we consider infinitesimally small time steps, this sequence of transformations converges to a continuous stochastic process. An SDE describes the evolution of a variable over continuous time, incorporating both deterministic change (drift) and random fluctuations (diffusion).
A general Itô SDE takes the form:
$$dx_t = f(x_t, t)\,dt + g(t)\,dw_t$$

Here:

- $x_t$ is the state of the system at time $t$.
- $f(x_t, t)$ is the drift coefficient, describing the deterministic part of the evolution.
- $g(t)$ is the diffusion coefficient, scaling the random fluctuations.
- $w_t$ is a standard Wiener process (Brownian motion), whose increment $dw_t$ injects Gaussian noise.
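To make this concrete, the sketch below simulates a generic Itô SDE with the Euler–Maruyama scheme, the simplest discretization: each step applies the drift over a small interval $\Delta t$ and adds Gaussian noise scaled by $g(t)\sqrt{\Delta t}$. This is a minimal illustration; the function and argument names (`euler_maruyama`, `drift`, `diffusion`, `n_steps`) are chosen here for clarity and are not from the text.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t0=0.0, t1=1.0, n_steps=1000):
    """Simulate dx_t = f(x_t, t) dt + g(t) dw_t with the Euler-Maruyama scheme.

    drift:     callable f(x, t) returning an array shaped like x
    diffusion: callable g(t) returning a scalar
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = (t1 - t0) / n_steps
    for i in range(n_steps):
        t = t0 + i * dt
        z = np.random.randn(*x.shape)
        # Deterministic drift step plus a Gaussian increment with variance g(t)^2 * dt.
        x = x + drift(x, t) * dt + diffusion(t) * np.sqrt(dt) * z
    return x
```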
In the context of diffusion models, the forward process transforms complex data $x_0$ into a simple noise distribution (typically Gaussian) as time $t$ progresses from $0$ to $T$. This "information destruction" process can be modeled by a specific SDE. A common choice, corresponding to the Variance Preserving (VP) SDE often linked to DDPM, is:
$$dx_t = -\tfrac{1}{2}\beta(t)\,x_t\,dt + \sqrt{\beta(t)}\,dw_t$$

Here, $\beta(t)$ is a positive, time-dependent function often called the noise schedule.
As $t$ increases from $0$ to $T$, the influence of the initial data $x_0$ diminishes, and $x_T$ approaches a standard Gaussian distribution, irrespective of $x_0$.
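As a small illustration of this behavior, the following sketch integrates the VP forward SDE with the same Euler–Maruyama update. The linear $\beta(t)$ schedule and its endpoint values are arbitrary illustrative choices, not values prescribed by the text.

```python
import numpy as np

# Illustrative linear noise schedule beta(t) on t in [0, 1]; the endpoints are arbitrary.
beta_min, beta_max = 0.1, 20.0
beta = lambda t: beta_min + t * (beta_max - beta_min)

x = np.array([0.8, -0.3, 1.2, 0.5])     # a toy "data" point x_0
n_steps = 1000
dt = 1.0 / n_steps
for i in range(n_steps):
    t = i * dt
    z = np.random.randn(*x.shape)
    # One VP forward step: drift -0.5 * beta(t) * x, diffusion sqrt(beta(t)).
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * z
# After integrating to t = 1, x is approximately a sample from N(0, I),
# with essentially no dependence on the starting point.
```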
The generative power of diffusion models comes from reversing this process. We start with a sample $x_T$ from the simple noise distribution and evolve it backward in time from $T$ to $0$ to generate a data sample $x_0$. A remarkable result from stochastic calculus (Anderson, 1982) states that the reverse trajectory of a diffusion process defined by a forward SDE also follows an SDE, provided we know the score function of the marginal distributions $p_t(x_t)$.
The reverse SDE corresponding to the forward process above is given by:
$$dx_t = \left[-\tfrac{1}{2}\beta(t)\,x_t - \beta(t)\,\nabla_{x_t}\log p_t(x_t)\right]dt + \sqrt{\beta(t)}\,d\bar{w}_t$$

Here:

- $\nabla_{x_t}\log p_t(x_t)$ is the score function of the marginal distribution $p_t$ at time $t$.
- $\bar{w}_t$ is a Wiener process running backward in time, and $dt$ is an infinitesimal negative time step.
This reverse SDE tells us how to infinitesimally adjust the current state $x_t$ so that it becomes slightly more likely under the intermediate distribution $p_t$. The drift term now includes the score function, effectively guiding the process away from pure noise and towards plausible data structures.
The central challenge in using the reverse SDE for generation is that we don't know the true score function $\nabla_{x_t}\log p_t(x_t)$ for the intermediate distributions $p_t$. This is where neural networks come in. We train a time-dependent neural network, often denoted $s_\theta(x_t, t)$, to approximate the true score function:
$$s_\theta(x_t, t) \approx \nabla_{x_t}\log p_t(x_t)$$

This network is typically trained using objectives derived from score matching, or objectives equivalent to those used in DDPMs (like the $L_{\text{simple}}$ objective mentioned in the chapter introduction, which implicitly learns the score). Once trained, $s_\theta(x_t, t)$ can be plugged into the reverse SDE:
$$dx_t = \left[-\tfrac{1}{2}\beta(t)\,x_t - \beta(t)\,s_\theta(x_t, t)\right]dt + \sqrt{\beta(t)}\,d\bar{w}_t$$

Simulating this SDE backward in time, starting from $x_T \sim \mathcal{N}(0, I)$, allows us to generate new data samples $x_0$.
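A minimal sampler sketch is given below, assuming a trained score network `score_model(x, t)` and a `beta(t)` schedule like the one above (all names here are placeholders): it integrates the reverse SDE from $t = T$ down to a small cutoff using a reverse-time Euler–Maruyama step.

```python
import numpy as np

def sample_reverse_vp_sde(score_model, shape, beta, T=1.0, n_steps=1000, eps=1e-3):
    """Generate a sample by integrating the reverse VP SDE from t = T down to t = eps.

    score_model: callable s_theta(x, t) approximating grad_x log p_t(x)
    beta:        callable noise schedule with beta(t) > 0
    """
    x = np.random.randn(*shape)                    # start from x_T ~ N(0, I)
    dt = (T - eps) / n_steps
    for i in range(n_steps):
        t = T - i * dt
        score = score_model(x, t)                  # s_theta(x_t, t)
        # Reverse-time drift: f(x, t) - g(t)^2 * score, with f = -0.5 * beta(t) * x.
        drift = -0.5 * beta(t) * x - beta(t) * score
        z = np.random.randn(*shape)
        # Step backward in time: subtract the drift and add scaled Gaussian noise.
        x = x - drift * dt + np.sqrt(beta(t) * dt) * z
    return x
```

In practice the noise term is often omitted on the final step, and the plain Euler–Maruyama update can be replaced by more accurate solvers or predictor-corrector schemes; the sketch above only illustrates the structure of the reverse-time simulation.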
Diagram illustrating the forward SDE destroying data structure over time and the learned reverse SDE reconstructing data from noise by following the estimated score function.
Viewing diffusion models through the lens of SDEs offers several advantages:

- It unifies discrete-time approaches such as DDPM as particular discretizations of a single continuous-time process.
- It decouples the model from any fixed number of steps, allowing more flexible noise schedules and sampling procedures.
- It connects diffusion models to well-studied tools from stochastic calculus, such as reverse-time SDEs and numerical SDE solvers.
This continuous-time perspective sets the stage for understanding score-based generative modeling and advanced techniques like DDIM, which leverages properties of the underlying SDEs for efficient sampling. We will build upon these foundations as we examine the implementation details and improvements of diffusion models in the subsequent sections.