While Denoising Diffusion Probabilistic Models (DDPMs) offer one perspective on diffusion models, focusing on parameterized forward and reverse Markov chains, Score-Based Generative Modeling provides an alternative, yet deeply connected, viewpoint. This approach centers on directly modeling the gradient of the logarithm of the data probability density function, known as the score function, at various noise levels. Understanding this perspective illuminates the connection between diffusion models and established techniques like score matching and Langevin dynamics.
For a given data distribution p(x), the score function is defined as the gradient of its log-probability with respect to the data x:
$$\nabla_x \log p(x)$$

This vector field is significant because it points in the direction of steepest ascent of the log-probability density at any given point x. Intuitively, it tells us how to modify x to make it slightly more probable under the distribution p(x).
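To make this concrete, here is a small NumPy sketch using an illustrative Gaussian, for which the score has the closed form −(x − μ)/σ²; it checks that formula against a finite-difference gradient of the log-density:

```python
import numpy as np

# Closed-form score of a Gaussian N(mu, sigma^2 * I):
#   grad_x log p(x) = -(x - mu) / sigma^2
# It points from x back toward the mean, i.e. toward higher density.
def gaussian_score(x, mu, sigma):
    return -(x - mu) / sigma**2

def gaussian_log_prob(x, mu, sigma):
    d = x.size
    return (-0.5 * np.sum((x - mu) ** 2) / sigma**2
            - 0.5 * d * np.log(2 * np.pi * sigma**2))

# Sanity check: compare against a central finite-difference gradient of log p.
mu, sigma = np.zeros(2), 1.5
x = np.array([2.0, -1.0])
h = 1e-5
numeric = np.array([
    (gaussian_log_prob(x + h * e, mu, sigma)
     - gaussian_log_prob(x - h * e, mu, sigma)) / (2 * h)
    for e in np.eye(2)
])
print(gaussian_score(x, mu, sigma))   # analytic score
print(numeric)                        # numerically matches the analytic score
```

For a general data distribution no such closed form exists, which is exactly why the score has to be learned.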
In the context of diffusion models, we are interested in the score function of the perturbed data distributions pt(xt) at different noise levels t. Recall that xt is the data point x0 after t steps of the forward noising process. The score function ∇xtlogpt(xt) indicates how to adjust a noisy sample xt to increase its likelihood under the marginal distribution at time t.
The reverse process in diffusion models aims to denoise a sample xt to estimate xt−1. It turns out that the optimal reverse transition, which maximizes the likelihood of the data, is closely related to the score function ∇xtlogpt(xt).
Specifically, Tweedie's formula provides a link showing that the conditional expectation of the original data point x0 given the noisy version xt involves the score of the noisy distribution pt. The mean of the reverse transition pθ(xt−1∣xt) depends on estimating this score.
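Concretely, for the DDPM forward process with $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, Tweedie's formula can be written as:

$$\mathbb{E}[x_0 \mid x_t] = \frac{x_t + (1 - \bar\alpha_t)\,\nabla_{x_t}\log p_t(x_t)}{\sqrt{\bar\alpha_t}}$$

so an estimate of the score immediately yields an estimate of the denoised sample, and hence the mean of the reverse transition.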
The neural network ϵθ(xt,t) trained in DDPMs using the simplified objective:
$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right) \right\|^2\right]$$

is essentially learning a re-scaled version of the score function of the noisy data distribution pt(xt). The relationship is approximately:
$$\epsilon_\theta(x_t, t) \approx -\sqrt{1-\bar\alpha_t}\;\nabla_{x_t}\log p_t(x_t)$$

This shows that even when training a DDPM to predict the noise ϵ, the model is implicitly learning the score of the intermediate noisy data distributions.
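This relationship is easy to verify in a setting where the marginal is known exactly. The sketch below (NumPy, illustrative single-data-point case) converts a noise prediction into a score estimate; with only one data point, pt is Gaussian and the conversion recovers its exact score:

```python
import numpy as np

def score_from_eps(eps_pred, alpha_bar_t):
    """Turn a DDPM-style noise prediction into a score estimate:
    s(x_t, t) = -eps(x_t, t) / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=3)                  # a single "dataset" point
alpha_bar_t = 0.6
eps = rng.normal(size=3)
xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# With one data point, p_t = N(sqrt(ab)*x0, (1-ab)*I), whose exact score is:
exact = -(xt - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# Using the true noise as the "prediction" recovers that score exactly.
print(np.allclose(score_from_eps(eps, alpha_bar_t), exact))  # True
```

With a trained network, the same one-line rescaling turns ϵθ(xt, t) into a usable score estimate at every noise level.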
Instead of implicitly learning the score via noise prediction, score-based models aim to train a neural network, often denoted as sθ(x,t), to directly estimate the score function ∇xlogpt(x). A naive approach would be to minimize the squared difference between the model's output and the true score:
$$L(\theta) = \mathbb{E}_{t}\,\mathbb{E}_{p_t(x_t)}\left[\left\| s_\theta(x_t, t) - \nabla_{x_t}\log p_t(x_t) \right\|_2^2\right]$$

However, computing the true score ∇xtlogpt(xt) is generally intractable because pt(xt) involves an unknown normalization constant and requires integrating over all possible x0.
Score Matching provides a way around this. The original score matching objective, equivalent to the one above, involves the trace of the Jacobian of the score model (the Hessian of the model's log-density), which can be computationally expensive in high dimensions. More practical variants exist:

- Denoising Score Matching (DSM): perturb the data with a known noise kernel q(xt|x0) and regress sθ(xt, t) onto the tractable conditional score ∇xt log q(xt|x0). With Gaussian noise, this objective matches the DDPM noise-prediction loss up to scaling.
- Sliced Score Matching (SSM): project the scores onto random directions, replacing the expensive trace term with cheap Jacobian-vector products.
Training a score-based model sθ(xt,t) involves sampling timesteps t, sampling data points x0, generating corresponding noisy samples xt using the forward process, and then optimizing θ using a suitable score matching objective like DSM.
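A minimal sketch of this training loop, using NumPy, a toy 1-D dataset, and an illustrative per-timestep linear score model (the schedule, learning rate, and model are assumptions for the example, not a real architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: x0 ~ N(2, 1); score model s_theta(x, t) = w[t]*x + b[t]
# with a separate (w, b) pair per timestep. Hyperparameters are illustrative.
T = 4
alpha_bar = np.linspace(0.8, 0.2, T)    # assumed noise schedule
w, b = np.zeros(T), np.zeros(T)
data = rng.normal(loc=2.0, scale=1.0, size=5000)

for step in range(4000):
    lr = 0.1 / (1 + step / 1000)        # simple decaying step size
    t = rng.integers(T)                 # sample a timestep
    x0 = rng.choice(data, size=256)     # sample data points
    eps = rng.normal(size=256)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

    # DSM target: the score of the Gaussian perturbation kernel q(x_t | x_0)
    target = -(xt - np.sqrt(alpha_bar[t]) * x0) / (1 - alpha_bar[t])
    err = w[t] * xt + b[t] - target

    # Gradient step on the squared error w.r.t. w[t] and b[t]
    w[t] -= lr * np.mean(err * xt)
    b[t] -= lr * np.mean(err)

# Since Var(x0) = 1, each marginal p_t is N(2*sqrt(ab_t), 1), with score
# -(x - 2*sqrt(ab_t)); so w[t] should approach -1 and b[t] 2*sqrt(ab_t).
print(w)
print(b)
```

In practice sθ is a deep network trained with the same recipe: sample t, noise the data, regress onto the conditional score.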
Once a time-dependent score model sθ(x,t) has been trained, we can generate new samples by simulating the reverse diffusion process. A common algorithm for this is Annealed Langevin Dynamics.
Langevin dynamics is an iterative method originally used in physics to sample from a probability distribution p(x) using its score function ∇xlogp(x). The update rule combines a gradient ascent step on the log probability (following the score) with Gaussian noise injection:
$$x_{i+1} = x_i + \frac{\eta}{2}\,\nabla_x \log p(x_i) + \sqrt{\eta}\; z_i$$

Here, η is the step size and zi ∼ N(0, I) is standard Gaussian noise.
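This update can be run directly when the score is known in closed form. A minimal NumPy sketch, with an illustrative 1-D Gaussian target and step size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x) = N(mu, sigma^2); its score is exactly -(x - mu) / sigma^2.
mu, sigma = 3.0, 0.5
def score(x):
    return -(x - mu) / sigma**2

eta = 0.01                       # illustrative step size
x = rng.normal(size=10000)       # arbitrary initialization; many chains at once
for _ in range(2000):
    z = rng.normal(size=x.shape)
    x = x + 0.5 * eta * score(x) + np.sqrt(eta) * z

# The empirical mean and std approach mu = 3.0 and sigma = 0.5
# (up to a small discretization bias from the finite step size).
print(x.mean(), x.std())
```

With enough steps and a small enough η, the chain's samples are approximately distributed according to p(x), using nothing but the score.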
For score-based generative models, we apply this idea across decreasing noise levels (annealing). We start with a sample from the prior distribution, typically pure Gaussian noise xT∼N(0,I), and iteratively refine it using the learned score function sθ(xt,t) for t=T,T−1,…,1. A typical update step looks like:
$$x_{t-1} = x_t + \alpha_t\, s_\theta(x_t, t) + \sqrt{2\alpha_t}\; z_t$$

where αt > 0 is a step size that depends on the noise level t, and zt ∼ N(0, I). This process gradually moves the sample away from noise and towards regions of high probability under the data distribution, guided by the learned score field. More sophisticated sampling procedures, often involving predictor-corrector steps inspired by numerical methods for solving SDEs, are commonly used to improve sample quality and efficiency. These samplers often resemble the reverse process sampler used in DDPMs.
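As a sketch, here is annealed Langevin sampling on a toy problem where the true time-dependent score is available in closed form (data x0 ~ N(2, 1), an assumed linear noise schedule, and illustrative step sizes), standing in for a learned sθ(xt, t):

```python
import numpy as np

rng = np.random.default_rng(1)

# With x0 ~ N(2, 1) and xt = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps, each marginal
# is p_t = N(2*sqrt(ab_t), 1), so its exact score is -(x - 2*sqrt(ab_t)).
T = 50
alpha_bar = np.linspace(0.99, 0.01, T)   # t=0: low noise, t=T-1: high noise

def score(x, t):
    return -(x - 2.0 * np.sqrt(alpha_bar[t]))

x = rng.normal(size=5000)                # start from the noise prior
step = 0.1                               # illustrative fixed step size
for t in reversed(range(T)):             # anneal from high to low noise
    for _ in range(10):                  # a few Langevin steps per level
        z = rng.normal(size=x.shape)
        x = x + step * score(x, t) + np.sqrt(2 * step) * z

# Samples end up near the t=0 marginal N(2*sqrt(0.99), 1),
# up to discretization bias from the finite step size.
print(x.mean(), x.std())
```

Replacing the closed-form score with a trained sθ(xt, t) gives the practical sampler; real implementations also scale the step size with the noise level rather than keeping it fixed.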
The score-based perspective highlights the fundamental role of the score function in generative modeling via diffusion. It reveals that:

- the noise-prediction network in a DDPM is, up to a scale factor, a score estimator for the noisy marginals pt(xt);
- training can be framed as score matching, with denoising score matching recovering the familiar DDPM objective;
- sampling can be framed as following the learned score field, with Langevin-type updates playing the role of reverse diffusion steps.
The Stochastic Differential Equation (SDE) formulation, discussed previously, provides a continuous-time framework that naturally encompasses both DDPMs and score-based models. Both approaches can be seen as different ways to discretize and implement the same underlying continuous diffusion process.
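For reference, writing the forward process as an SDE, $dx = f(x, t)\,dt + g(t)\,dw$, the corresponding reverse-time SDE (due to Anderson) depends on the noisy marginals only through their score:

$$dx = \left[\,f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\,\right]dt + g(t)\, d\bar{w}$$

where $\bar{w}$ is a Brownian motion running backwards in time. Estimating $\nabla_x \log p_t(x)$ is therefore exactly what both a DDPM's noise predictor and a score model provide.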
Score-Based Generative Modeling provides a powerful theoretical lens for understanding diffusion models. By focusing on the score function ∇xlogpt(xt), it connects diffusion processes to score matching for training and Langevin dynamics for sampling. This perspective not only clarifies the mechanisms behind DDPMs but also opens avenues for designing new models and samplers based on directly estimating and utilizing the score function of the evolving data distribution under noise. The close relationship between noise prediction in DDPMs and score estimation underscores the deep connections between these seemingly different formulations.
© 2025 ApX Machine Learning