The core task of a neural network, often a U-Net, in diffusion models is to learn the reverse diffusion process. This involves estimating the original data distribution by progressively removing noise. While the full probabilistic derivation of diffusion models involves maximizing the data likelihood via the Evidence Lower Bound (ELBO), a common and effective training objective simplifies this significantly.
The full ELBO for diffusion models contains several terms related to the transitions in the reverse Markov chain. Optimizing it directly can be intricate. However, researchers (notably Ho et al. in the original DDPM paper) found that the training objective can be greatly simplified without sacrificing performance.
Instead of modeling the complex reverse transition probabilities directly, we reparameterize the network to predict the noise $\epsilon$ that was added to obtain the noisy image $x_t$ from the original image $x_0$ at timestep $t$. Recall the forward process equation that allows sampling $x_t$ directly from $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Here, $\epsilon$ is a standard Gaussian noise sample, $\epsilon \sim \mathcal{N}(0, I)$, and $\bar{\alpha}_t$ is derived from the noise schedule.
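To make this concrete, here is a minimal sketch of sampling $x_t$ directly from $x_0$ with this equation. It uses PyTorch and a linear $\beta$ schedule purely for illustration; the specific values, shapes, and the `q_sample` helper name are assumptions, not fixed requirements.

```python
import torch

# Illustrative linear beta schedule; the exact schedule is a design choice.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t for each timestep

def q_sample(x0, t, noise):
    """Closed-form forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise."""
    abar_t = alpha_bars[t].view(-1, 1, 1, 1)        # broadcast over image dimensions
    return torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * noise

# Example: noise a batch of 8 images at random timesteps.
x0 = torch.randn(8, 3, 32, 32)      # stand-in for real training images
t = torch.randint(0, T, (8,))       # one timestep per image
noise = torch.randn_like(x0)        # epsilon ~ N(0, I)
x_t = q_sample(x0, t, noise)
```

Because $x_t$ can be produced in a single step like this, training never needs to simulate the forward chain one timestep at a time.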
The insight is that if our network can accurately predict this noise given the noisy image $x_t$ and the timestep $t$, it effectively learns how to reverse the diffusion step.
This leads to a much simpler training objective: we minimize the Mean Squared Error (MSE) between the actual noise $\epsilon$ used to create $x_t$ and the noise predicted by our network, $\epsilon_\theta(x_t, t)$.
The loss function is formulated as an expectation over random timesteps $t$, initial data samples $x_0$, and noise samples $\epsilon$:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \right]$$
Let's break this down:

- $t$ is a timestep sampled uniformly from $\{1, \dots, T\}$.
- $x_0$ is a clean sample drawn from the training data.
- $\epsilon \sim \mathcal{N}(0, I)$ is the noise used to construct the noisy input $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$.
- $\epsilon_\theta(x_t, t)$ is the network's prediction of that noise, given the noisy input and the timestep.
- The squared norm $\|\cdot\|^2$ measures how far the prediction deviates from the actual noise.
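As a rough illustration, the loss computation is just an MSE between the sampled noise and the model's prediction. The sketch below assumes a `model(x_t, t)` callable that returns a tensor shaped like its input and an `alpha_bars` tensor like the one built earlier; the function name and interface are hypothetical.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, alpha_bars):
    """Simplified DDPM objective: MSE between the true noise and the predicted noise."""
    batch_size = x0.shape[0]
    T = alpha_bars.shape[0]

    t = torch.randint(0, T, (batch_size,), device=x0.device)   # random timesteps
    noise = torch.randn_like(x0)                                # epsilon ~ N(0, I)

    # Forward process: produce the noisy input x_t in closed form.
    abar_t = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(abar_t) * x0 + torch.sqrt(1.0 - abar_t) * noise

    # The network predicts the noise; the loss is plain MSE.
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```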
Minimizing this MSE loss encourages the network to become an effective noise predictor at any given noise level $t$. Intuitively, if the model can precisely identify the noise component within $x_t$, it implicitly understands the structure of the underlying data required to reverse the noising process.
While we've skipped the detailed mathematical derivation from the ELBO, it can be shown that this simplified MSE loss corresponds to a specific weighting of the terms in the variational lower bound. This provides theoretical grounding for why this simpler objective is effective for training high-quality diffusion models.
The practical advantage is significant: training boils down to a standard regression problem where the network learns to map a noisy input $x_t$ and timestep $t$ to the noise $\epsilon$ that was added. This is far more straightforward to implement and optimize than dealing directly with the complex distributions of the full ELBO. This simplified loss is the foundation upon which we build the training algorithm described in the next section.
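As a preview of that algorithm, a single optimization step under the assumptions above looks like any other supervised regression step. This minimal sketch reuses the hypothetical `diffusion_loss` helper from earlier; `model`, `optimizer`, and `data_loader` are placeholders for whatever setup is actually used.

```python
# One optimization step per batch: sample noise, predict it, regress with MSE.
for x0, _ in data_loader:          # labels (if any) are ignored
    loss = diffusion_loss(model, x0, alpha_bars)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```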