As discussed in the previous chapter, the core task of our neural network, often a U-Net, is to learn the reverse diffusion process: progressively removing noise to recover samples from the data distribution. While the full probabilistic derivation of diffusion models involves maximizing the data likelihood via the Evidence Lower Bound (ELBO), a common and effective training objective simplifies this significantly.
The full ELBO for diffusion models contains several terms related to the transitions in the reverse Markov chain. Optimizing it directly can be intricate. However, researchers (notably Ho et al. in the original DDPM paper) found that the training objective can be greatly simplified without sacrificing performance.
Instead of modeling the complex reverse transition probabilities $p_\theta(x_{t-1} \mid x_t)$ directly, we reparameterize the network $\epsilon_\theta$ to predict the noise $\epsilon$ that was added to the original image $x_0$ to obtain $x_t$ at timestep $t$. Recall the forward process equation that allows sampling $x_t$ directly from $x_0$:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

Here, $\epsilon$ is a standard Gaussian noise sample ($\epsilon \sim \mathcal{N}(0, I)$), and $\bar{\alpha}_t$ is derived from the noise schedule.
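This closed-form sampling step is straightforward to implement. The sketch below assumes a linear beta schedule (the specific values 1e-4 to 0.02 and T = 1000 are illustrative defaults, not mandated by the equation):

```python
import numpy as np

# Assumed toy noise schedule: linear betas, as one common choice.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t for t = 0..T-1

def q_sample(x0, t, eps):
    """Sample x_t directly from x_0 via the closed-form forward process."""
    a_bar = alpha_bars[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))   # stand-in for an image
eps = rng.standard_normal(x0.shape)     # the noise the network must predict
x_t = q_sample(x0, t=500, eps=eps)
```

Note that `alpha_bars` decreases monotonically, so larger `t` means a noisier `x_t`, exactly as the forward process requires.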
The insight is that if our network can accurately predict this noise ϵ given the noisy image xt and the timestep t, it effectively learns how to reverse the diffusion step.
This leads to a much simpler training objective: we minimize the Mean Squared Error (MSE) between the actual noise $\epsilon$ used to create $x_t$ and the noise predicted by our network, $\epsilon_\theta(x_t, t)$.
The loss function $L_{\text{simple}}$ is formulated as an expectation over random timesteps $t$, initial data samples $x_0$, and noise samples $\epsilon$:
$$L_{\text{simple}} = \mathbb{E}_{t \sim \mathcal{U}[1, T],\, x_0 \sim q(x_0),\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \right]$$

Let's break this down:

- $t$ is a timestep sampled uniformly from $\{1, \dots, T\}$, so the network is trained across all noise levels.
- $x_0 \sim q(x_0)$ is a clean sample drawn from the training data.
- $\epsilon \sim \mathcal{N}(0, I)$ is the Gaussian noise used to corrupt $x_0$; the network's input $\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ is simply the noisy sample $x_t$ from the forward process equation.
- The loss is the squared error between this true noise $\epsilon$ and the network's prediction $\epsilon_\theta(x_t, t)$.
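A single Monte Carlo estimate of this expectation is just an MSE computation. In the sketch below, `eps_theta` is a hypothetical stand-in for the real U-Net (it is not a trained model; a real network would output its estimate of the noise), and the linear schedule is again an assumed toy choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy noise schedule (same linear betas as before).
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def eps_theta(x_t, t):
    # Placeholder for the real U-Net: it just echoes its input.
    # A trained network would instead return its estimate of eps.
    return x_t

def simple_loss(x0, t, eps):
    """One-sample Monte Carlo estimate of L_simple."""
    a_bar = alpha_bars[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    pred = eps_theta(x_t, t)
    return np.mean((eps - pred) ** 2)  # MSE over all elements

x0 = rng.standard_normal((3, 8, 8))   # one training sample
t = int(rng.integers(0, T))           # uniformly sampled timestep
eps = rng.standard_normal(x0.shape)   # the noise to be predicted
loss = simple_loss(x0, t, eps)
```

In practice the loss is averaged over a minibatch, with an independent $t$ and $\epsilon$ drawn for each sample.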
Minimizing this MSE loss encourages the network $\epsilon_\theta$ to become an effective noise predictor for any given noise level $t$. Intuitively, if the model can precisely identify the noise component within $x_t$, it implicitly understands the structure of the underlying data $x_0$ required to reverse the noising process.
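This intuition can be made concrete: inverting the forward process equation turns a noise prediction into an estimate $\hat{x}_0 = (x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon}) / \sqrt{\bar{\alpha}_t}$. The sketch below checks that a perfect noise prediction recovers $x_0$ exactly (the schedule values are again assumed toy choices):

```python
import numpy as np

# Assumed toy noise schedule (same linear betas as before).
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

rng = np.random.default_rng(1)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal(x0.shape)
t = 300
a_bar = alpha_bars[t]

# Forward process: corrupt x0 into x_t.
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

# Invert the forward equation using the (here: exact) noise prediction.
# With the true eps, this recovers x0 up to floating-point error.
x0_hat = (x_t - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)
```

A trained network's $\hat{\epsilon}$ is only approximate, so $\hat{x}_0$ is an estimate rather than an exact reconstruction, but this identity is what makes noise prediction equivalent to understanding the clean data.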
While we've skipped the detailed mathematical derivation from the ELBO, it can be shown that this simplified MSE loss corresponds to a specific weighting of the terms in the variational lower bound. This provides theoretical grounding for why this simpler objective is effective for training high-quality diffusion models.
The practical advantage is significant: training boils down to a standard regression problem where the network learns to map a noisy input $x_t$ and timestep $t$ to the noise $\epsilon$ that was added. This is far more straightforward to implement and optimize than dealing directly with the complex distributions of the full ELBO. This simplified loss is the foundation upon which we build the training algorithm described in the next section.
© 2025 ApX Machine Learning