Okay, we've established that our goal is to train a neural network, typically a U-Net architecture denoted by ϵθ, to reverse the diffusion process. How do we actually train it? The network needs a clear target, a specific quantity to predict.
Recall the forward process: we start with clean data x0 and progressively add Gaussian noise ϵ at each timestep t to get xt. The core idea for training is surprisingly direct: we train the network ϵθ to predict the exact noise ϵ that was added to x0 to create xt, given the noisy input xt and the timestep t.
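To see this concretely, here is the standard closed-form expression for the forward process under the usual DDPM formulation. The notation ᾱt (the cumulative product of the per-step noise-schedule coefficients) is assumed here rather than defined in this section:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This means we can jump directly from x0 to any xt in a single step, which is what makes the training procedure below efficient.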
The network takes xt and t as input and outputs a prediction, let's call it ϵθ(xt,t). Our objective is to make this prediction match the actual noise ϵ that was sampled from a standard Gaussian distribution N(0,I) during the forward step calculation for xt.
To quantify the difference between the predicted noise ϵθ(xt,t) and the true noise ϵ, a common and effective choice is the Mean Squared Error (MSE). Minimizing this error forces the network's output to closely match the target noise.
The training objective, or loss function L, is formulated as the expectation over all possible inputs: the initial data x0, the randomly chosen timestep t (uniformly sampled between 1 and the maximum timestep T), and the randomly sampled noise ϵ. Mathematically, we express this as:
$$L = \mathbb{E}_{t \sim U(1, T),\; x_0 \sim q(x_0),\; \epsilon \sim \mathcal{N}(0, I)}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2 \,\right]$$

Let's break this down:

- x0 is a clean sample drawn from the data distribution q(x0).
- t is a timestep sampled uniformly between 1 and the maximum timestep T.
- ϵ is the Gaussian noise, sampled from N(0,I), that was used to produce xt from x0.
- ϵθ(xt,t) is the network's prediction of that noise, given the noisy sample xt and the timestep t.
- ∣∣⋅∣∣2 is the squared Euclidean norm, which gives the mean squared error between the true and predicted noise.
In practice, during training, we approximate this expectation by averaging the loss over mini-batches of training data. For each item in the batch, we sample a random t, sample a random ϵ, compute xt, feed xt and t to the network ϵθ, and calculate the MSE between the network's output ϵθ(xt,t) and the original ϵ.
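Below is a minimal sketch of one such training step, assuming a PyTorch setup. The names model, alpha_bar, and T are illustrative: model stands in for the network ϵθ, and alpha_bar for a precomputed tensor of cumulative noise-schedule products (ᾱt). This is a sketch under those assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bar, T):
    """One DDPM-style training step on a batch of clean data x0.

    model:     network eps_theta(x_t, t) that predicts the added noise
    x0:        batch of clean samples, shape (B, C, H, W)
    alpha_bar: cumulative products of the noise schedule, shape (T,)
    """
    B = x0.shape[0]

    # Sample a random timestep for each item in the batch
    # (0-indexed in code; the text counts timesteps 1..T).
    t = torch.randint(0, T, (B,), device=x0.device)

    # Sample the true noise epsilon ~ N(0, I).
    eps = torch.randn_like(x0)

    # Build x_t from x0 and eps using the closed-form forward process.
    a_bar = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

    # The network predicts the noise; the loss is the MSE against eps.
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, eps)
```

In a full training loop, this loss would simply be backpropagated and the optimizer stepped, exactly as for any other supervised regression objective.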
Diagram illustrating the flow for calculating the training loss for a single data point. The network ϵθ tries to predict the original noise ϵ based on the noisy data xt and timestep t. The MSE loss measures the difference between the actual and predicted noise.
While this MSE loss on the noise seems intuitive and works well empirically, it's not just a heuristic. It arises naturally from the more rigorous mathematical objective of diffusion models: maximizing the evidence lower bound (ELBO) on the data log-likelihood.
The full derivation involves variational inference and Bayes' theorem; we walk through a simplified version in the next section. For now, it is sufficient to understand that minimizing L = ∣∣ϵ − ϵθ(xt,t)∣∣² is directly related to maximizing the ELBO. This connection provides a solid theoretical foundation for why such a simple objective enables the model to learn the complex data distribution required for generation.
By training the network ϵθ with this objective, we equip it with the ability to estimate the noise present in xt. This noise estimate is precisely what we need during the reverse process (sampling) to iteratively denoise a random signal back into a sample that resembles our training data.
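As a preview of how this estimate is used (the sampling procedure is covered in detail later), under the standard DDPM reverse update the predicted noise enters each denoising step as follows, where αt, ᾱt, and σt come from the noise schedule and z is fresh Gaussian noise:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$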