As we outlined previously, the core of the reverse diffusion process is a neural network, typically a U-Net, tasked with predicting the noise ϵ that was added to an image x0 to produce a noisy version xt at a specific timestep t. A significant aspect of this process is that the same network parameters θ must be used for all possible timesteps t (from 0 to T).
How can a single network adapt its behavior based on the current timestep? It needs to be informed about which timestep t it is currently processing. Simply feeding the integer t directly into the network is often suboptimal. Neural networks generally work better with normalized inputs, and a raw integer doesn't easily convey the "position" within the diffusion process in a way the network can effectively use. Furthermore, different timesteps require vastly different denoising behavior; the noise pattern and magnitude at t=50 are distinct from those at t=800.
Therefore, we need an effective way to represent the timestep t and inject this information into the model. The most common and effective technique, inspired by the Transformer architecture (Vaswani et al., 2017), is to use sinusoidal positional embeddings.
Instead of using the raw integer t, we transform it into a high-dimensional vector e(t). This embedding vector provides a richer, more structured representation of the timestep that the network can better utilize. The standard approach uses sine and cosine functions of varying frequencies:
Let d be the desired dimensionality of the embedding vector. For each dimension i from 0 to d/2−1, the embedding components are calculated as:
e(t)_{2i} = sin(t / 10000^{2i/d})
e(t)_{2i+1} = cos(t / 10000^{2i/d})

Here, the base 10000 sets the range of wavelengths: the exponent 2i/d produces a geometric progression of frequencies across the dimension pairs, from a period of 2π at i = 0 down to very long periods at i = d/2 − 1.
This formulation creates a vector where each pair of dimensions (2i,2i+1) corresponds to a sinusoid with a specific frequency. Lower dimensions (small i) vary slowly with t, while higher dimensions (large i) vary rapidly. This multi-frequency representation allows the network to easily distinguish between different timesteps and potentially generalize better to unseen timesteps if needed.
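The formula above can be sketched directly in NumPy. This is a minimal illustration (the function name and argument order are our own choices, not part of a standard API):

```python
import numpy as np

def sinusoidal_embedding(t, d):
    """Compute the d-dimensional sinusoidal embedding e(t) for timestep t.

    Assumes d is even; each pair of dimensions (2i, 2i+1) shares the
    frequency 1 / 10000^(2i/d), so low i varies slowly with t and
    high i oscillates rapidly.
    """
    i = np.arange(d // 2)                   # frequency index: 0 .. d/2 - 1
    freqs = 1.0 / (10000.0 ** (2 * i / d))  # one frequency per dimension pair
    angles = t * freqs
    emb = np.empty(d)
    emb[0::2] = np.sin(angles)              # even dimensions: sine
    emb[1::2] = np.cos(angles)              # odd dimensions: cosine
    return emb

# For example, at t = 0 every sine component is 0 and every cosine is 1:
e0 = sinusoidal_embedding(0, 128)
```

In practice this computation is vectorized over a whole batch of timesteps at once, but the per-timestep logic is the same.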
Example values of sinusoidal embeddings for different dimensions across timesteps 0 to 1000, assuming an embedding dimension d greater than 51. Lower dimensions change slowly, while higher dimensions oscillate rapidly.
Once we have the timestep embedding vector e(t) (which typically has a dimension like 256 or 512), we need to integrate it into the U-Net architecture. The U-Net processes the noisy image xt through a series of convolutional layers, downsampling, and upsampling blocks. The timestep information needs to influence these computations.
A common strategy involves:
1. Passing the sinusoidal embedding e(t) through a small MLP (typically two linear layers with a nonlinearity such as SiLU) to produce a learned conditioning vector.
2. Projecting this vector, within each residual block of the U-Net, to match that block's channel count.
3. Adding the projected vector to the block's feature maps, broadcasting it across the spatial dimensions.
Diagram illustrating how a processed time embedding is typically added to feature maps within a U-Net block. The sinusoidal embedding is first transformed by an MLP before being injected.
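The injection step can be sketched as follows. This is a simplified NumPy illustration with hypothetical MLP weights passed in explicitly (a real implementation would hold them as learned parameters inside the block):

```python
import numpy as np

def inject_time_embedding(feature_maps, t_emb, w1, b1, w2, b2):
    """Condition a block's feature maps on a timestep embedding.

    feature_maps: (batch, channels, height, width) activations
    t_emb:        (batch, emb_dim) sinusoidal embedding
    w1/b1, w2/b2: weights of a small two-layer MLP (hypothetical
                  parameters); w2 projects to `channels` so the
                  result can be added to the feature maps.
    """
    h = t_emb @ w1 + b1                   # first linear layer
    h = h * (1.0 / (1.0 + np.exp(-h)))    # SiLU nonlinearity
    h = h @ w2 + b2                       # project to (batch, channels)
    # Broadcast over the spatial dimensions: (batch, channels, 1, 1)
    return feature_maps + h[:, :, None, None]
```

Because the conditioning vector is added per channel and broadcast over height and width, every spatial location in the block receives the same timestep signal.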
By incorporating timestep information in this way, the single U-Net model ϵθ(xt,t) learns to adapt its noise prediction based on the level of noise present, indicated by t. This conditioning on time is fundamental to the operation of diffusion models, allowing the network to perform the appropriate denoising operation at each step of the reverse process.
© 2025 ApX Machine Learning