Having revisited the discrete-time formulations of DDPM and DDIM, we now look at the mathematical foundations that provide a deeper understanding and link diffusion models to other areas of generative modeling. This connection, particularly through score matching and differential equations, explains why these models work and opens doors to more advanced techniques, including the custom noise schedules discussed later in this chapter and the sophisticated samplers explored later in the course.
At the heart of diffusion models lies the idea of reversing a process that gradually adds noise to data. The crucial element needed for this reversal is knowing, at any noise level $t$, which direction to step in to make the data $x_t$ slightly less noisy, moving it closer to the original data distribution. This direction is mathematically captured by the gradient of the log-probability density of the noisy data, known as the score function:
$$\nabla_{x_t} \log q(x_t)$$

The score function points towards regions of higher data density for the noisy data distribution $q(x_t)$ at time $t$. If we had access to this score function for all $t$, we could use it to guide a reverse process, starting from pure noise and gradually removing it to generate realistic data.
Score matching is a principle for training a model, let's call it $s_\theta(x_t, t)$, to approximate this true score function. The goal is to minimize the difference between the model's output and the true score:
$$\mathbb{E}_t \, \mathbb{E}_{x_t \sim q(x_t)} \left[ \left\| s_\theta(x_t, t) - \nabla_{x_t} \log q(x_t) \right\|_2^2 \right]$$

While directly computing the true score $\nabla_{x_t} \log q(x_t)$ is often intractable, techniques like denoising score matching show that this objective is equivalent to training a model to denoise samples.
Specifically, for the Gaussian noise perturbations commonly used in diffusion models, it can be shown that the denoising objective used in DDPMs, where a model $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ added to obtain $x_t$ from $x_0$, is implicitly learning the score function. The relationship is remarkably simple:
$$s_\theta(x_t, t) \approx \nabla_{x_t} \log q(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}$$

Here, $\sigma_t$ represents the standard deviation of the noise at time $t$. This fundamental connection reveals that the neural network in a DDPM, while framed as predicting noise, is effectively learning the score of the noise-perturbed data distribution. This perspective provides a solid theoretical grounding for the model's architecture and objective function.
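As a concrete illustration, the minimal sketch below converts a noise prediction into a score estimate using this relationship. It assumes the usual DDPM parameterization; the names `eps_model` and `alpha_bar` are placeholders for a trained network and its cumulative-product noise schedule, not specific objects defined in this chapter.

```python
import torch

def score_from_eps(eps_model, x_t, t, alpha_bar):
    """Estimate the score of the noisy marginal q(x_t) from a noise-prediction model.

    Assumes the DDPM parameterization x_t = sqrt(alpha_bar_t) * x_0
    + sqrt(1 - alpha_bar_t) * eps, so the noise std is sigma_t = sqrt(1 - alpha_bar_t).
    `eps_model` and `alpha_bar` are placeholder names for a trained network
    and its schedule.
    """
    sigma_t = torch.sqrt(1.0 - alpha_bar[t])  # noise standard deviation at step t
    eps_pred = eps_model(x_t, t)              # epsilon_theta(x_t, t)
    return -eps_pred / sigma_t                # s_theta(x_t, t) ~= -eps_theta / sigma_t
```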
The discrete steps of the DDPM forward process $q(x_t \mid x_{t-1})$ can be seen as approximations of a continuous-time process described by a Stochastic Differential Equation (SDE). An SDE describes how a variable evolves over continuous time under the influence of both deterministic forces (drift) and random fluctuations (diffusion).
A common SDE formulation corresponding to many DDPM variants is the Variance Preserving (VP) SDE:
$$dx = f(x, t)\, dt + g(t)\, dw$$

where $f(x, t) = -\frac{1}{2}\beta(t)\, x$ is the drift coefficient, $g(t) = \sqrt{\beta(t)}$ is the diffusion coefficient, $w$ is a standard Wiener process (Brownian motion), and $\beta(t)$ plays the role of the continuous-time noise schedule.
This SDE describes a continuous noising process where data gradually drifts towards zero mean while accumulating Gaussian noise, with the rate controlled by $\beta(t)$. The discrete forward process $q(x_t \mid x_0)$ of a DDPM can be recovered by solving this SDE from $t=0$ to $t=T$.
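To make the continuous-time picture concrete, here is a minimal Euler-Maruyama simulation of the forward VP-SDE. The linear $\beta(t)$ schedule, step count, and function name are illustrative assumptions for the sketch, not anything prescribed above.

```python
import torch

def simulate_vp_sde_forward(x0, n_steps=1000, beta_min=0.1, beta_max=20.0, T=1.0):
    """Euler-Maruyama simulation of dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw.

    Starts from clean data x0 at t = 0 and integrates forward to t = T,
    producing an increasingly noisy trajectory whose marginals match the
    DDPM forward process for the same beta(t).
    """
    x = x0.clone()
    dt = T / n_steps
    for i in range(n_steps):
        t = i * dt
        beta_t = beta_min + (beta_max - beta_min) * (t / T)  # illustrative linear schedule
        drift = -0.5 * beta_t * x                            # f(x, t) dt term
        diffusion = (beta_t ** 0.5) * torch.randn_like(x)    # g(t) dw term
        x = x + drift * dt + diffusion * (dt ** 0.5)
    return x
```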
The real power of the SDE perspective comes when considering the reverse process. It's a known result from SDE theory (attributed to Anderson, 1982) that the forward SDE has a corresponding reverse-time SDE that maps noise back to data. This reverse SDE also has drift and diffusion terms, and crucially, its drift term depends directly on the score function $\nabla_{x_t} \log q_t(x_t)$ of the marginal distributions $q_t(x)$ generated by the forward SDE.
The reverse SDE for the VP-SDE example is:
$$dx = \left[ f(x, t) - g(t)^2 \nabla_{x_t} \log q_t(x_t) \right] dt + g(t)\, d\bar{w}$$

Here, $d\bar{w}$ is a reverse-time Wiener process, and we integrate time backwards from $T$ to $0$. This SDE tells us how to stochastically remove noise using the score.
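A single reverse-time step of this SDE can be sketched with the Euler-Maruyama method as below. The names `score_model` and `beta` are assumed placeholders for the learned score network and the noise schedule; the step size `dt` is positive, with the minus sign accounting for time running backwards.

```python
import torch

def reverse_sde_step(x, t, dt, score_model, beta):
    """One Euler-Maruyama step of the reverse VP-SDE, moving from time t to t - dt.

    Reverse SDE: dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar,
    with f(x, t) = -0.5 * beta(t) * x and g(t) = sqrt(beta(t)).
    `score_model(x, t)` approximates the score; `beta(t)` returns the noise rate.
    """
    beta_t = beta(t)
    drift = -0.5 * beta_t * x - beta_t * score_model(x, t)  # f(x, t) - g(t)^2 * score
    noise = (beta_t ** 0.5) * torch.randn_like(x)           # g(t) * Gaussian increment
    return x - drift * dt + noise * (dt ** 0.5)
```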
Furthermore, Song et al. (ICLR 2021) showed that, associated with any diffusion process defined by an SDE, there exists a deterministic process described by an Ordinary Differential Equation (ODE) whose trajectories share the same marginal densities $q_t(x)$ over time. This is often called the Probability Flow ODE. For the VP-SDE, this ODE is:
$$dx = \left[ f(x, t) - \frac{1}{2} g(t)^2 \nabla_{x_t} \log q_t(x_t) \right] dt$$

Notice the absence of the stochastic term $dw$. This ODE provides a deterministic path from noise to data.
This ODE is immensely significant:
Deterministic Generation: If we start with a sample $x_T$ from the noise distribution $q_T(x)$ (usually a standard Gaussian) and solve this ODE backwards in time from $t=T$ to $t=0$, the resulting $x_0$ is a sample from the data distribution.
Requires the Score: Evaluating the ODE's drift requires the score function $\nabla_{x_t} \log q_t(x_t)$.
Practical Implementation: We don't know the true score, but we have trained a model $s_\theta(x_t, t)$ (or equivalently, $\epsilon_\theta(x_t, t)$) to approximate it! By plugging our learned score model $s_\theta$ into the ODE, we get a practical way to generate samples:
$$dx \approx \left[ f(x, t) - \frac{1}{2} g(t)^2 s_\theta(x_t, t) \right] dt$$

Substituting $f(x, t) = -\frac{1}{2}\beta(t)\, x$, $g(t)^2 = \beta(t)$, and $s_\theta(x_t, t) \approx -\epsilon_\theta(x_t, t)/\sigma_t$ (where $\sigma_t^2$ is the noise variance at time $t$) connects this directly back to the operations performed in diffusion model sampling.
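Putting these substitutions together, a minimal Euler discretization of the probability flow ODE looks like the sketch below. The names `eps_model`, `beta`, and `sigma` are assumed placeholders for a trained noise-prediction network and its schedule; this is a bare-bones illustration, not a production sampler.

```python
import torch

@torch.no_grad()
def sample_probability_flow_ode(eps_model, shape, n_steps, beta, sigma, T=1.0):
    """Deterministic sampling by Euler-integrating the probability flow ODE
    backwards from t = T (pure noise) to t ~ 0 (data).

    dx/dt = f(x, t) - 0.5 * g(t)^2 * s_theta(x, t)
          = -0.5 * beta(t) * x + 0.5 * beta(t) * eps_theta(x, t) / sigma(t)
    """
    x = torch.randn(shape)                    # sample from the prior, approximately N(0, I)
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt                        # time runs backwards from T
        score = -eps_model(x, t) / sigma(t)   # learned score via the eps/score relation
        dxdt = -0.5 * beta(t) * x - 0.5 * beta(t) * score
        x = x - dxdt * dt                     # Euler step with a negative time increment
    return x
```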
Foundation for Advanced Samplers: DDIM sampling can be interpreted as a particular numerical method (specifically, a first-order discretization) for solving this Probability Flow ODE. This explains its deterministic nature and ability to use larger step sizes than DDPM. Moreover, this ODE formulation allows us to apply more sophisticated numerical ODE solvers (like Runge-Kutta methods, DPM-Solver, UniPC, etc.) to potentially achieve higher accuracy with fewer function evaluations (i.e., fewer model inference steps), leading to faster sampling, which we will explore in Chapter 6.
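As a taste of what higher-order solvers offer, the sketch below upgrades the Euler step to a Heun (second-order) step for the same probability flow ODE. This is only an illustration of the general idea behind more advanced solvers such as DPM-Solver or UniPC, not their actual algorithms; `ode_drift` is an assumed helper that evaluates dx/dt as in the sampler above.

```python
def heun_step(x, t, dt, ode_drift):
    """One Heun (improved Euler) step of the probability flow ODE, integrating backwards.

    `ode_drift(x, t)` should return dx/dt, e.g.
    -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t).
    Averaging the slopes at the current and predicted points gives second-order
    accuracy, so fewer (though costlier, two model evaluations) steps are needed.
    """
    d1 = ode_drift(x, t)             # slope at the current point
    x_pred = x - d1 * dt             # Euler predictor at time t - dt
    d2 = ode_drift(x_pred, t - dt)   # slope at the predicted point
    return x - 0.5 * (d1 + d2) * dt  # average the two slopes (trapezoidal correction)
```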
Diagram illustrating the relationship between the forward SDE, the score function, the learned score model, the reverse SDE, and the probability flow ODE used for sampling.
This perspective, linking discrete diffusion steps to underlying continuous-time SDEs and ODEs via score matching, is a cornerstone for understanding many advanced techniques. It clarifies the role of the noise schedule (related to $\beta(t)$ or $g(t)$) and motivates the search for better ODE solvers for faster and more accurate sampling. As we proceed to discuss custom noise schedules and learned variances, keep this SDE/ODE framework in mind as the continuous-time ideal that discrete implementations aim to approximate effectively.