Having introduced the core idea of diffusion models (systematically adding noise and then learning to reverse that process), let's place this within a more formal probabilistic setting. This framework helps us understand the mathematical underpinnings and sets the stage for deriving the training objectives and sampling algorithms discussed in later chapters.
Generative modeling aims to learn the underlying probability distribution q(x0) of some data, like images. We want to build a model, parameterized by θ, that can generate new samples x0 from an approximate distribution pθ(x0).
The forward process, where we progressively add noise, can be formally defined as a sequence of latent variables x1,x2,...,xT. We start with an original data sample x0∼q(x0). Each subsequent step xt is obtained by applying a fixed noise process to the previous step xt−1. This process is defined as a Markov chain, meaning that xt depends only on xt−1:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

The transition probability q(xt∣xt−1) typically involves adding a small amount of Gaussian noise, scaled according to a predefined schedule. The total number of steps, T, is usually large (e.g., 1000). As t increases, the data xt gradually loses its original structure, eventually becoming indistinguishable from pure noise. By design, the final state xT should approximate a simple, tractable distribution, often a standard Gaussian, xT≈N(0,I). This forward process is fixed and does not involve any learning.
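To make the fixed noising chain concrete, here is a minimal sketch of one forward transition. It assumes the common Gaussian choice q(xt∣xt−1) = N(√(1−βt) xt−1, βt I) with a hypothetical linear schedule for the βt values; the exact parameterization and schedules are covered in Chapter 3.

```python
import torch

# Hypothetical linear variance schedule beta_1, ..., beta_T
# (concrete schedules are discussed in Chapter 3); each beta_t is small.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """One fixed transition q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) x_{t-1}, beta_t I)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Applying all T steps to a data sample x0 yields x_T, which is close to N(0, I).
x = torch.randn(3, 32, 32)  # stand-in for a data sample x0 (e.g., a 3x32x32 image)
for t in range(T):
    x = forward_step(x, t)
```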
The forward process adds noise step-by-step, transforming data x0 into noise xT through a fixed Markov chain q.
The generative power comes from the reverse process. Our goal is to start with a sample from the noise distribution, xT∼N(0,I), and reverse the noising steps to obtain a sample x0 that looks like it came from the original data distribution q(x0). This involves learning the reverse Markov chain transitions pθ(xt−1∣xt):
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$

Here, p(xT) is the prior noise distribution (e.g., N(0,I)), and pθ(xt−1∣xt) represents the learned denoising step, parameterized by a neural network (often a U-Net, as we'll see later) with parameters θ.
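Structurally, generation is then an ancestral sampling loop: start from xT∼N(0,I) and repeatedly apply the learned transition pθ(xt−1∣xt). The sketch below is only illustrative; `denoise_model` and `sigmas` are placeholders for the network's predicted mean and a chosen per-step standard deviation, whose actual parameterization appears in later chapters.

```python
import torch

@torch.no_grad()
def sample(denoise_model, shape, T, sigmas):
    """Sketch of ancestral sampling through the learned reverse chain p_theta.

    Assumes denoise_model(x_t, t) returns the mean of the Gaussian
    p_theta(x_{t-1} | x_t), and sigmas[t] is its standard deviation.
    """
    x = torch.randn(shape)            # x_T ~ N(0, I), the prior p(x_T)
    for t in reversed(range(T)):      # t = T-1, ..., 0
        mean = denoise_model(x, t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise on the final step
        x = mean + sigmas[t] * noise
    return x                          # an approximate sample from p_theta(x_0)
```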
The central challenge is that the true reverse probability q(xt−1∣xt) is intractable to compute directly, because it depends on the entire data distribution. However, it becomes tractable if we condition it on the original data point x0, and diffusion models exploit exactly this insight. While x0 is unavailable during generation, it is available during training, where it lets us formulate a tractable objective: the network pθ(xt−1∣xt) is trained to approximate the true posterior q(xt−1∣xt,x0). As we will explore in Chapter 3, this approximation often simplifies to predicting the noise that was added at step t.
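For reference, under the Gaussian forward process with variances βt sketched above, this conditioned posterior has a closed form (writing αt = 1 − βt and ᾱt = ∏s≤t αs); Chapter 3 derives it step by step:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right)$$

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t, \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$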
The reverse process starts from noise xT and learns to denoise step-by-step using a parameterized Markov chain pθ to generate data x0.
This setup provides a structured way to transform a complex data distribution q(x0) into a simple noise distribution p(xT) and then learn the reverse transformation pθ(x0∣xT) implicitly through the sequence of denoising steps pθ(xt−1∣xt). The intermediate states x1,...,xT−1 act as latent variables, guiding the generation from pure noise back to structured data.
The ultimate goal is to train the parameters θ such that the distribution pθ(x0), obtained by running the full reverse process starting from xT∼p(xT), closely matches the true data distribution q(x0). The training objective, which we'll examine in detail in Chapter 4, typically involves maximizing the likelihood of the observed data x0 under the model. This is often achieved by optimizing a lower bound on the log-likelihood (the ELBO), which conveniently breaks down into terms related to predicting the noise at each step of the diffusion process.
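For orientation, one standard way to write this lower bound, which Chapter 4 derives and simplifies, is:

$$\log p_\theta(x_0) \;\ge\; \mathbb{E}_q\!\left[\log p_\theta(x_0 \mid x_1) \;-\; \sum_{t=2}^{T} D_{\mathrm{KL}}\!\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) \;-\; D_{\mathrm{KL}}\!\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)\right]$$

Each KL term compares a learned denoising step against the tractable posterior q(xt−1∣xt,x0), which is why the objective ultimately reduces to per-step noise-prediction losses.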
This probabilistic view provides the foundation for understanding how diffusion models operate, how they are trained, and how samples are generated. The next chapters will delve into the specific mathematical formulations of the forward and reverse steps.