The objective involves training a neural network, often a U-Net architecture denoted by $\epsilon_\theta$, to reverse the diffusion process. For training, this network requires a clear target: a specific quantity to predict.
Recall the forward process: we start with clean data $x_0$ and progressively add Gaussian noise at each timestep $t$ to get $x_t$. The core idea for training is surprisingly direct: we train the network to predict the exact noise $\epsilon$ that was added to $x_0$ to create $x_t$, given the noisy input $x_t$ and the timestep $t$.
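This setup is efficient because $x_t$ does not need to be built step by step. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ accumulating the noise schedule, the forward process admits a closed form that jumps directly from $x_0$ to $x_t$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

The same $\epsilon$ that appears here is exactly the prediction target for the network.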
The network takes $x_t$ and $t$ as input and outputs a prediction, let's call it $\epsilon_\theta(x_t, t)$. Our objective is to make this prediction match the actual noise $\epsilon$ that was sampled from a standard Gaussian distribution during the forward step calculation for $x_t$.
To quantify the difference between the predicted noise $\epsilon_\theta(x_t, t)$ and the true noise $\epsilon$, a common and effective choice is the Mean Squared Error (MSE). Minimizing this error forces the network's output to closely match the target noise.
The training objective, or loss function $L_{\text{simple}}$, is formulated as the expectation over all possible inputs: the initial data $x_0$, the randomly chosen timestep $t$ (uniformly sampled between 1 and the maximum timestep $T$), and the randomly sampled noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. Mathematically, we express this as:

$$L_{\text{simple}} = \mathbb{E}_{x_0, t, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\|^2 \right]$$
Let's break this down:

- $\epsilon$: the actual noise sampled from $\mathcal{N}(0, \mathbf{I})$ and used to construct $x_t$ from $x_0$.
- $\epsilon_\theta(\cdot, t)$: the network's prediction of that noise, given the noisy input and the timestep $t$.
- $\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$: the closed-form expression for $x_t$ shown earlier.
- $\| \cdot \|^2$: the squared error between the true and predicted noise, averaged over all elements (the MSE).
- $\mathbb{E}_{x_0, t, \epsilon}$: the expectation, averaging this error over data samples, timesteps, and noise draws.
In practice, during training, we approximate this expectation by averaging the loss over mini-batches of training data. For each item in the batch, we sample a random $t$, sample a random $\epsilon$, compute $x_t$, feed $x_t$ and $t$ to the network $\epsilon_\theta$, and calculate the MSE between the network's output and the original $\epsilon$.
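The following PyTorch sketch makes this procedure concrete. It is a minimal illustration, not a definitive implementation: it assumes a hypothetical `model(x_t, t)` that returns a noise prediction with the same shape as its input, and a precomputed 1-D tensor `alphas_cumprod` holding $\bar{\alpha}_1, \dots, \bar{\alpha}_T$.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, alphas_cumprod):
    """One training step of the noise-prediction objective (a sketch).

    Assumes `model(x_t, t)` returns a noise prediction shaped like `x_t`,
    and `alphas_cumprod` is a 1-D tensor of length T with the cumulative
    products of the noise schedule (indexed from 0 here).
    """
    batch_size = x0.shape[0]
    T = alphas_cumprod.shape[0]

    # Sample a random timestep for each item in the batch.
    t = torch.randint(0, T, (batch_size,), device=x0.device)

    # Sample the target noise from a standard Gaussian.
    noise = torch.randn_like(x0)

    # Compute x_t in closed form: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise.
    # Reshape the schedule terms so they broadcast over the data dimensions.
    a_bar = alphas_cumprod[t].view(batch_size, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # The network predicts the noise; the loss is the MSE against the true noise.
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```

In a full training loop, the returned loss would be backpropagated and the optimizer stepped once per mini-batch.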
Diagram illustrating the flow for calculating the training loss for a single data point. The network tries to predict the original noise $\epsilon$ based on the noisy data $x_t$ and timestep $t$. The MSE loss measures the difference between the actual and predicted noise.
While this MSE loss on the noise seems intuitive and works well empirically, it's not just a heuristic. It arises naturally from the more rigorous mathematical objective of diffusion models: maximizing the evidence lower bound (ELBO) on the data log-likelihood.
The full derivation involves variational inference and Bayes' theorem, which we will walk through in simplified form in the next section. For now, it's sufficient to understand that minimizing $L_{\text{simple}}$ is directly related to optimizing the ELBO. This connection provides a solid theoretical foundation for why this seemingly simple objective enables the model to learn the complex data distribution required for generation.
By training the network with this objective, we equip it with the ability to estimate the noise present in $x_t$. This noise estimate is precisely what we need during the reverse process (sampling) to iteratively denoise a random signal back into a sample that resembles our training data.
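As a brief preview of how the estimate gets used, the standard DDPM sampling rule plugs $\epsilon_\theta$ into the mean of each reverse step, adding a small amount of fresh noise $z$ at all but the final step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \quad z \sim \mathcal{N}(0, \mathbf{I})$$

Here $\sigma_t$ controls the variance of each step (a common choice is $\sigma_t^2 = \beta_t$).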