Diffusion models operate by systematically adding noise to data and then learning to reverse that process. A more formal probabilistic setting provides the mathematical underpinnings for this approach. Understanding this framework is essential for deriving the training objectives and sampling algorithms that will be discussed in later chapters.
Generative modeling aims to learn the underlying probability distribution of some data, like images. We want to build a model, parameterized by $\theta$, that can generate new samples from an approximate distribution $p_\theta(x)$.
The forward process, where we progressively add noise, can be formally defined as a sequence of latent variables $x_1, \dots, x_T$. We start with an original data sample $x_0 \sim q(x_0)$. Each subsequent step $x_t$ is obtained by applying a fixed noise process to the previous step $x_{t-1}$. This process is defined as a Markov chain, meaning that $x_t$ depends only on $x_{t-1}$:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
The transition probability $q(x_t \mid x_{t-1})$ typically involves adding a small amount of Gaussian noise, scaled according to a predefined schedule. The total number of steps, $T$, is usually large (e.g., 1000). As $t$ increases, the data gradually loses its original structure, eventually becoming indistinguishable from pure noise. By design, the final state $x_T$ should approximate a simple, tractable distribution, often a standard Gaussian, $\mathcal{N}(0, \mathbf{I})$. This forward process is fixed and does not involve any learning.
The forward process adds noise step-by-step, transforming data $x_0$ into noise $x_T$ through a fixed Markov chain $q(x_t \mid x_{t-1})$.
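To make the forward chain concrete, here is a minimal sketch in PyTorch. It assumes the common DDPM-style Gaussian transition $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1}, \beta_t \mathbf{I})$ with a linear $\beta$ schedule; the schedule values, tensor shapes, and the random tensor standing in for $x_0$ are purely illustrative, and the exact transition form is developed in the following chapters.

```python
import torch

# Number of diffusion steps and a simple linear beta schedule (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Running all T steps gradually destroys the structure of x_0,
# leaving something indistinguishable from N(0, I).
x = torch.randn(4, 3, 32, 32)   # stand-in for a batch of data samples x_0
for t in range(T):
    x = forward_step(x, t)
```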
The generative power comes from the reverse process. Our goal is to start with a sample from the noise distribution, $x_T \sim \mathcal{N}(0, \mathbf{I})$, and reverse the noising steps to obtain a sample that looks like it came from the original data distribution $q(x_0)$. This involves learning the reverse Markov chain transitions $p_\theta(x_{t-1} \mid x_t)$:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
Here, $p(x_T)$ is the prior noise distribution (e.g., $\mathcal{N}(0, \mathbf{I})$), and $p_\theta(x_{t-1} \mid x_t)$ represents the learned denoising step, parameterized by a neural network (often a U-Net, as we'll see later) with parameters $\theta$.
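The generative loop itself is simple. The sketch below assumes the learned transition is a Gaussian whose mean comes from the network and whose per-step standard deviation $\sigma_t$ is fixed; `denoise_model` and `sigmas` are hypothetical placeholders for the trained network and the chosen variance schedule.

```python
import torch

@torch.no_grad()
def sample(denoise_model, shape, T, sigmas):
    """Ancestral sampling: run the learned reverse chain from x_T down to x_0."""
    x = torch.randn(shape)                          # x_T ~ p(x_T) = N(0, I)
    for t in reversed(range(T)):
        mean = denoise_model(x, t)                  # mean of p_theta(x_{t-1} | x_t)
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + sigmas[t] * noise                # x_{t-1} ~ N(mean, sigma_t^2 * I)
    return x                                        # approximate sample from p_theta(x_0)
```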
The central challenge is that the true reverse probability $q(x_{t-1} \mid x_t)$ is intractable to compute directly because it depends on the entire data distribution. However, it becomes tractable if we condition it on the original data point $x_0$. Diffusion models cleverly use this insight. While we don't know $x_0$ during generation, we can use it during training to formulate a tractable objective. The network is trained to approximate the true posterior $q(x_{t-1} \mid x_t, x_0)$. As we will explore in Chapter 3, this approximation often simplifies to predicting the noise that was added at step $t$.
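For reference, conditioning on $x_0$ does yield a closed-form Gaussian. In the common DDPM notation $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ (introduced formally in the following chapters), this posterior takes the form:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\big)$$

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$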
The reverse process starts from noise $x_T$ and learns to denoise step-by-step using a parameterized Markov chain $p_\theta(x_{t-1} \mid x_t)$ to generate data $x_0$.
This setup provides a structured way to transform a complex data distribution into a simple noise distribution and then learn the reverse transformation implicitly through the sequence of denoising steps $p_\theta(x_{t-1} \mid x_t)$. The intermediate states $x_1, \dots, x_{T-1}$ act as latent variables, guiding the generation from pure noise back to structured data.
The ultimate goal is to train the parameters $\theta$ such that the distribution $p_\theta(x_0)$, obtained by running the full reverse process starting from $x_T \sim \mathcal{N}(0, \mathbf{I})$, closely matches the true data distribution $q(x_0)$. The training objective, which we'll examine in detail in Chapter 4, typically involves maximizing the likelihood of the observed data under the model. This is often achieved by optimizing a lower bound on the log-likelihood (the ELBO), which conveniently breaks down into terms related to predicting the noise at each step of the diffusion process.
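In practice, these per-step terms are often replaced by a simplified noise-prediction loss. The sketch below assumes the closed-form forward marginal $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, which is derived in later chapters; `noise_model` and `alpha_bars` are hypothetical placeholders for the network $\epsilon_\theta$ and the precomputed $\bar{\alpha}_t$ values.

```python
import torch
import torch.nn.functional as F

def training_loss(noise_model, x0, alpha_bars):
    """Simplified noise-prediction loss: a surrogate for the per-step ELBO terms."""
    batch = x0.shape[0]
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (batch,))                # random timestep per sample
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)          # broadcast over image dimensions
    eps = torch.randn_like(x0)                       # the noise actually added
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    eps_pred = noise_model(x_t, t)                   # network predicts the added noise
    return F.mse_loss(eps_pred, eps)
```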
This probabilistic view provides the foundation for understanding how diffusion models operate, how they are trained, and how samples are generated. The next chapters will get into the specific mathematical formulations of the forward and reverse steps.