Mathematical foundations provide a thorough understanding of diffusion models and link them to other areas of generative modeling. They connect discrete-time formulations such as DDPM and DDIM to continuous-time score matching and differential equations, and this connection explains why these models work and enables advanced techniques, including custom noise schedules and sophisticated samplers.
At the core of diffusion models lies the idea of reversing a process that gradually adds noise to data. The main element needed for this reversal is knowing, at any noise level $t$, which direction to step in to make the data slightly less noisy, moving it closer to the original data distribution. This direction is mathematically captured by the gradient of the log-probability density of the noisy data, known as the score function:

$$\nabla_{x_t} \log p_t(x_t)$$
The score function points towards regions of higher data density for the noisy data distribution $p_t(x_t)$ at time $t$. If we had access to this score function for all $t$, we could use it to guide a reverse process, starting from pure noise and gradually removing it to generate realistic data.
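To make "points towards regions of higher density" concrete, here is a small illustrative sketch (not from the text above): for a 1-D Gaussian $\mathcal{N}(\mu, v)$, the score has the closed form $-(x - \mu)/v$, so it always points from $x$ back toward the mode.

```python
# Closed-form score of a 1-D Gaussian N(mu, var):
#   log p(x) = -0.5 * (x - mu)^2 / var + const
#   d/dx log p(x) = -(x - mu) / var
# mu and var are arbitrary illustrative values.
mu, var = 2.0, 0.5

def score(x):
    return -(x - mu) / var

# The score points from x back toward the mode at mu:
print(score(3.0))  # -2.0: negative, i.e. step left toward mu = 2
print(score(1.0))  #  2.0: positive, i.e. step right toward mu = 2
```

Following the score in small steps therefore moves samples toward high-density regions, which is exactly the role it plays in the reverse process.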
Score matching is a principle for training a model, let's call it $s_\theta(x_t, t)$, to approximate this true score function. The goal is to minimize the difference between the model's output and the true score:

$$\mathbb{E}_{t}\,\mathbb{E}_{x_t \sim p_t}\left[\left\lVert s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t) \right\rVert_2^2\right]$$
While directly computing the true score is often intractable, techniques like denoising score matching show that this objective is equivalent to training a model to denoise samples.
Specifically, for Gaussian noise perturbations commonly used in diffusion models, it can be shown that the denoising objective used in DDPMs, where a model $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ added to obtain $x_t$ from $x_0$, is implicitly learning the score function. The relationship is remarkably simple:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}$$
Here, $\sigma_t$ represents the standard deviation of the noise at time $t$. This fundamental connection reveals that the neural network in a DDPM, while framed as predicting noise, is effectively learning the score of the noise-perturbed data distribution. This perspective provides a solid theoretical grounding for the model's architecture and objective function.
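The noise-to-score conversion is a one-liner. The sketch below (function name and values are illustrative, not from the text) verifies it on a case where the true score is known in closed form: if $x_t = x_0 + \sigma_t \epsilon$ with $x_0$ fixed, then $p_t = \mathcal{N}(x_0, \sigma_t^2)$ and the true score is $-(x_t - x_0)/\sigma_t^2$.

```python
import numpy as np

def score_from_eps(eps_pred, sigma_t):
    # s_theta(x_t, t) = -eps_theta(x_t, t) / sigma_t
    return -eps_pred / sigma_t

# Sanity check against the analytic score of a point mass perturbed by
# Gaussian noise: x_t = x0 + sigma_t * eps implies p_t = N(x0, sigma_t^2),
# whose score at x_t is -(x_t - x0) / sigma_t^2 = -eps / sigma_t.
rng = np.random.default_rng(0)
x0, sigma_t = 1.5, 0.3
eps = rng.standard_normal()
x_t = x0 + sigma_t * eps

true_score = -(x_t - x0) / sigma_t**2
# A perfect noise predictor would output exactly eps here:
assert np.isclose(score_from_eps(eps, sigma_t), true_score)
```

In practice `eps_pred` comes from the trained network $\epsilon_\theta(x_t, t)$; the algebra above is why no separate score network is needed.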
The discrete steps of the DDPM forward process can be seen as approximations of a continuous-time process described by a Stochastic Differential Equation (SDE). An SDE describes how a variable evolves over continuous time under the influence of both deterministic forces (drift) and random fluctuations (diffusion).
A common SDE formulation corresponding to many DDPM variants is the Variance Preserving (VP) SDE:

$$dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw$$

where:

- $x$ is the data variable evolving over continuous time $t \in [0, T]$,
- $\beta(t)$ is a positive function controlling the rate of noise injection (the continuous-time analogue of the discrete noise schedule),
- $dw$ is an increment of a standard Wiener process (Brownian motion), supplying the random fluctuations.
This SDE describes a continuous noising process where data gradually drifts towards zero mean while accumulating Gaussian noise, with the rate controlled by $\beta(t)$. The discrete forward process of a DDPM can be recovered by discretizing this SDE from $t = 0$ to $t = T$.
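As an illustrative sketch (the linear $\beta(t)$ schedule, its endpoint values, and the function names are assumptions, not prescribed by the text), the forward VP-SDE can be simulated with the Euler–Maruyama method; starting from identical data points, the terminal samples end up approximately standard normal:

```python
import numpy as np

# Euler-Maruyama simulation of the VP-SDE forward process:
#   dx = -0.5 * beta(t) * x dt + sqrt(beta(t)) dw
def beta(t, beta_min=0.1, beta_max=20.0):
    # Linear noise schedule on [0, 1] -- one common illustrative choice.
    return beta_min + t * (beta_max - beta_min)

def forward_vp_sde(x0, n_steps=1000, T=1.0, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * x * dt                 # deterministic pull to 0
        noise = np.sqrt(beta(t) * dt) * rng.standard_normal(x.shape)
        x = x + drift + noise
    return x

# A batch of identical data points diffuses to (approximately) N(0, 1):
xT = forward_vp_sde(np.full(10000, 2.0))
print(xT.mean(), xT.std())  # roughly 0 and 1
```

The variance-preserving name is visible here: the drift shrinks the signal at the same rate the diffusion term adds variance, so the marginal variance stays near 1.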
The real power of the SDE perspective comes when considering the reverse process. It's a known result from SDE theory (attributed to Anderson, 1982) that the forward SDE has a corresponding reverse-time SDE that maps noise back to data. This reverse SDE also has drift and diffusion terms, and crucially, its drift term depends directly on the score function of the marginal distributions generated by the forward SDE.
The reverse SDE for the VP-SDE example is:

$$dx = \left[-\tfrac{1}{2}\beta(t)\,x - \beta(t)\,\nabla_x \log p_t(x)\right] dt + \sqrt{\beta(t)}\,d\bar{w}$$
Here, $d\bar{w}$ is a reverse-time Wiener process, and we integrate time backwards from $t = T$ to $t = 0$. This SDE tells us how to stochastically remove noise using the score.
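A minimal sketch of integrating this reverse SDE, again with Euler–Maruyama. To keep it runnable without a trained network, it uses the *analytic* score of Gaussian data $\mathcal{N}(\mu_0, s_0^2)$, for which the VP-SDE marginals stay Gaussian; the schedule values, function names, and this analytic stand-in are all assumptions for illustration, and a learned $s_\theta(x, t)$ would replace `score` in practice:

```python
import numpy as np

beta_min, beta_max = 0.1, 20.0   # illustrative linear schedule
mu0, s0 = 2.0, 0.5               # illustrative Gaussian "data" distribution

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha(t):
    # exp(-0.5 * integral_0^t beta(s) ds), closed form for the linear schedule
    return np.exp(-0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def score(x, t):
    # Analytic score: under the VP-SDE, N(mu0, s0^2) data gives Gaussian
    # marginals with mean mu0*alpha(t) and variance s0^2*alpha^2 + 1 - alpha^2.
    m = mu0 * alpha(t)
    v = (s0**2) * alpha(t)**2 + 1.0 - alpha(t)**2
    return -(x - m) / v

def sample_reverse_sde(n=10000, n_steps=1000, T=1.0, seed=1):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n)                 # start from pure noise at t = T
    for i in range(n_steps, 0, -1):            # integrate backwards in time
        t = i * dt
        drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
        x = x - drift * dt + np.sqrt(beta(t) * dt) * rng.standard_normal(n)
    return x

samples = sample_reverse_sde()
print(samples.mean(), samples.std())  # close to mu0 = 2.0 and s0 = 0.5
```

Despite starting from pure noise, the samples recover the data distribution, which is exactly what the reverse SDE guarantees when the score is correct.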
Furthermore, Song et al. (ICLR 2021) showed that associated with any diffusion process defined by an SDE, there exists a deterministic process described by an Ordinary Differential Equation (ODE) whose trajectories share the same marginal densities $p_t(x)$ over time. This is often called the Probability Flow ODE. For the VP-SDE, this ODE is:

$$\frac{dx}{dt} = -\tfrac{1}{2}\beta(t)\,x - \tfrac{1}{2}\beta(t)\,\nabla_x \log p_t(x)$$
Notice the absence of the stochastic term $dw$. This ODE provides a deterministic path from noise to data.
This ODE is immensely significant:
Deterministic Generation: If we start with a sample from the noise distribution (usually a standard Gaussian) and solve this ODE backwards in time from $t = T$ to $t = 0$, the resulting $x_0$ is a sample from the data distribution.
Requires the Score: The ODE requires the score function $\nabla_x \log p_t(x)$.
Practical Implementation: We don't know the true score, but we have trained a model $s_\theta(x_t, t)$ (or equivalently, $\epsilon_\theta(x_t, t)$) to approximate it! By plugging our learned score model into the ODE, we get a practical way to generate samples:

$$\frac{dx}{dt} = -\tfrac{1}{2}\beta(t)\,x - \tfrac{1}{2}\beta(t)\,s_\theta(x, t)$$
Substituting the learned score $s_\theta(x_t, t) = -\epsilon_\theta(x_t, t)/\sigma_t$ (where $\sigma_t^2$ is the variance of the noise at time $t$) connects this directly back to the operations performed in diffusion model sampling.
Foundation for Advanced Samplers: DDIM sampling can be interpreted as a particular numerical method (specifically, a first-order discretization) for solving this Probability Flow ODE. This explains its deterministic nature and ability to use larger step sizes than DDPM. Moreover, this ODE formulation allows us to apply more sophisticated numerical ODE solvers (like Runge-Kutta methods, DPM-Solver, UniPC, etc.) to potentially achieve higher accuracy with fewer function evaluations (i.e., fewer model inference steps), leading to faster sampling, which we will explore in Chapter 6.
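The points above can be sketched with the simplest possible solver, a first-order Euler discretization of the probability flow ODE; DDIM and solvers like DPM-Solver refine this same recipe with better discretizations. As in the reverse-SDE sketch, the analytic Gaussian score, the linear $\beta(t)$ schedule, and the function names are illustrative assumptions standing in for a trained $s_\theta(x, t)$:

```python
import numpy as np

beta_min, beta_max = 0.1, 20.0   # illustrative linear schedule
mu0, s0 = 2.0, 0.5               # illustrative Gaussian "data" distribution

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def alpha(t):
    return np.exp(-0.5 * (beta_min * t + 0.5 * (beta_max - beta_min) * t**2))

def score(x, t):
    # Analytic stand-in for a trained score model s_theta(x, t).
    m = mu0 * alpha(t)
    v = (s0**2) * alpha(t)**2 + 1.0 - alpha(t)**2
    return -(x - m) / v

def sample_probability_flow(n=10000, n_steps=100, T=1.0, seed=2):
    # Euler solver for dx/dt = -0.5*beta(t)*x - 0.5*beta(t)*score(x, t).
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = rng.standard_normal(n)                 # only source of randomness
    for i in range(n_steps, 0, -1):
        t = i * dt
        dxdt = -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)
        x = x - dxdt * dt                      # deterministic backward step
    return x

samples = sample_probability_flow()
print(samples.mean(), samples.std())  # close to mu0 = 2.0 and s0 = 0.5
```

Note that 100 steps suffice here, versus the 1000 used for the stochastic reverse SDE: because no noise is injected, the deterministic trajectory tolerates much larger steps, which is the intuition behind DDIM's speedups.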
Diagram illustrating the relationship between the forward SDE, the score function, the learned score model, the reverse SDE, and the probability flow ODE used for sampling.
This perspective, linking discrete diffusion steps to underlying continuous-time SDEs and ODEs via score matching, is a foundation for understanding many advanced techniques. It clarifies the role of the noise schedule (related to $\beta(t)$ or $\sigma_t$) and motivates the search for better ODE solvers for faster and more accurate sampling. As we proceed to discuss custom noise schedules and learned variances, keep this SDE/ODE framework in mind as the continuous-time ideal that discrete implementations aim to approximate effectively.