Diffusion models operate across a continuous spectrum of noise levels, represented by the timestep t. The model's objective, whether predicting the added noise ϵ or the original data x0, is fundamentally conditioned on t. A model tasked with denoising an image with minimal noise (t≈0) behaves very differently from one denoising nearly pure noise (t≈T). Therefore, effectively informing the neural network, typically a U-Net, about the current timestep t is essential for successful diffusion modeling. Simply feeding the scalar value t directly as an input is insufficient: neural networks struggle to extract meaningful structure from the raw magnitude of a single scalar.
Early attempts might involve normalizing t to a specific range (e.g., [0, 1]) and concatenating it channel-wise to the input feature maps. However, this often fails to provide a sufficiently expressive signal. A more effective approach, inspired by the positional encodings used in Transformer models, is to use sinusoidal time embeddings.
Given a timestep t (often an integer from 0 to T), we first map it into a high-dimensional vector using functions of varying frequencies:
$$PE(t)_{2i} = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad PE(t)_{2i+1} = \cos\!\left(\frac{t}{10000^{2i/d}}\right)$$

Here, i indexes the dimension of the embedding vector, ranging from 0 to d/2−1, where d is the desired embedding dimension (e.g., 128, 256, 512). This formulation provides a unique vector representation for each timestep. The sinusoidal nature creates smooth transitions between embeddings of adjacent timesteps and potentially allows the network to generalize to timesteps not seen during training, although diffusion models typically operate on a fixed, discrete set of timesteps.
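The formula above can be sketched directly in NumPy. The function name sinusoidal_embedding is an illustrative choice, not part of any particular library, and d is assumed to be even:

```python
import numpy as np

def sinusoidal_embedding(t, d):
    """Map integer timesteps t (shape [B]) to sinusoidal embeddings (shape [B, d])."""
    t = np.asarray(t, dtype=np.float64)[:, None]      # [B, 1]
    i = np.arange(d // 2, dtype=np.float64)[None, :]  # [1, d/2], one index per sin/cos pair
    freqs = 1.0 / (10000.0 ** (2.0 * i / d))          # geometrically spaced frequencies
    angles = t * freqs                                # [B, d/2]
    emb = np.empty((t.shape[0], d))
    emb[:, 0::2] = np.sin(angles)                     # even dimensions: sin
    emb[:, 1::2] = np.cos(angles)                     # odd dimensions: cos
    return emb
```

Note that for t = 0 every sin component is 0 and every cos component is 1, and embeddings of adjacent timesteps differ smoothly, which is the property the prose above highlights.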
While the sinusoidal embedding PE(t) provides a rich representation, it's typically processed further before being injected into the U-Net. A small Multi-Layer Perceptron (MLP), often consisting of two linear layers with a non-linear activation function like SiLU (Sigmoid Linear Unit, also known as Swish), transforms the fixed sinusoidal embedding into a learned feature representation tailored to the model's needs.
Let e=PE(t) be the sinusoidal embedding. The MLP processes it as:
$$t_{\text{emb}} = \text{Linear}_2(\text{SiLU}(\text{Linear}_1(e)))$$

The resulting vector t_emb captures the timestep information in a format that can be effectively integrated into the U-Net's convolutional blocks.
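A minimal NumPy sketch of this two-layer MLP follows. In a real model W1, b1, W2, b2 are learned by gradient descent; the random initialization and the class name TimeMLP here are purely illustrative:

```python
import numpy as np

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class TimeMLP:
    """Two linear layers with a SiLU in between, mapping PE(t) to a learned t_emb."""
    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, d_in ** -0.5, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0.0, d_hidden ** -0.5, (d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, e):
        # t_emb = Linear_2(SiLU(Linear_1(e)))
        return silu(e @ self.W1 + self.b1) @ self.W2 + self.b2
```

A common pattern is to make d_hidden and d_out a multiple (e.g., 4x) of the base channel count of the U-Net, though the exact widths vary between implementations.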
Once we have the processed time embedding temb, we need to integrate it into the U-Net's architecture. Several strategies exist:
A common and straightforward method is to add the time embedding to the intermediate feature maps within the U-Net's residual blocks. Typically, the time embedding temb is first projected to match the number of channels C of the feature map h it will be added to, often using another linear layer.
$$h' = \text{Conv}(h) + \text{Linear}_{\text{proj}}(t_{\text{emb}})[:, \text{None}, \text{None}, :]$$

The time embedding projection is reshaped to be broadcastable before element-wise addition: singleton dimensions are added for spatial height and width, written [:, None, None, :] in a channels-last layout (or [:, :, None, None] in a channels-first layout such as PyTorch's). This addition usually happens after the first convolution within a residual block.
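The broadcasting step can be sketched as follows, assuming a channels-first [B, C, H, W] layout and random stand-in tensors in place of real convolution outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
B, C, H, W = 2, 64, 8, 8

h = rng.normal(size=(B, C, H, W))     # feature map after Conv, channels-first
t_emb_proj = rng.normal(size=(B, C))  # Linear_proj(t_emb): one C-vector per sample

# Add singleton spatial dimensions so the [B, C] embedding broadcasts over H and W.
h_prime = h + t_emb_proj[:, :, None, None]
```

Every spatial location of a given channel receives the same per-channel offset, so the timestep shifts feature statistics uniformly across the spatial grid.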
A more sophisticated and often more effective technique uses the time embedding to modulate the parameters of normalization layers. Instead of merely adding the information, the time embedding dynamically controls the scale and shift parameters of a Group Normalization (AdaGN) or Layer Normalization (AdaLN) layer.
Recall that a normalization layer like GroupNorm computes:
$$\text{GroupNorm}(h) = \gamma \cdot \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

In adaptive normalization, the scale γ and shift β are no longer fixed learned parameters but are instead generated by the time embedding t_emb. A linear layer maps t_emb to produce separate γ(t) and β(t) vectors.
$$[\gamma(t), \beta(t)] = \text{Linear}_{\text{AdaNorm}}(t_{\text{emb}})$$

These generated parameters then apply an affine transformation to the normalized feature map:
$$\text{AdaGN}(h, t) = \gamma(t) \cdot \text{GroupNorm}(h) + \beta(t)$$

This allows the timestep t to exert fine-grained control over the feature statistics within each block, effectively modulating the "style" of the computation based on the noise level. This is analogous to style modulation techniques seen in generative models like StyleGAN and has proven highly effective in diffusion models. AdaLN (Adaptive Layer Normalization) works similarly but uses Layer Normalization as the base.
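A NumPy sketch of AdaGN, assuming a channels-first [B, C, H, W] layout. The group_norm helper here omits the usual learned per-channel affine, since γ(t) and β(t) take its place; the names and the choice of 8 groups are illustrative:

```python
import numpy as np

def group_norm(h, num_groups, eps=1e-5):
    """Normalize [B, C, H, W] features per group of channels, without a learned affine."""
    B, C, H, W = h.shape
    g = h.reshape(B, num_groups, C // num_groups, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(B, C, H, W)

def ada_gn(h, t_emb, W_ada, b_ada, num_groups=8):
    """AdaGN: a linear layer maps t_emb to 2C values, split into gamma(t) and beta(t)."""
    C = h.shape[1]
    params = t_emb @ W_ada + b_ada                    # [B, 2C]
    gamma, beta = params[:, :C], params[:, C:]        # per-sample, per-channel
    return gamma[:, :, None, None] * group_norm(h, num_groups) + beta[:, :, None, None]
```

Because γ(t) and β(t) are computed per sample, two images at different timesteps in the same batch are modulated differently, which is exactly the fine-grained control described above.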
The following diagram illustrates the AdaGN mechanism within a typical U-Net residual block:
Diagram illustrating how a processed time embedding temb is used to generate scale γ(t) and shift β(t) parameters, which then modulate the output of a Group Normalization layer within a U-Net residual block (AdaGN).
Time embeddings, whether added or used for adaptive normalization, are typically injected into multiple layers of the U-Net, often within each residual block in both the encoder and decoder paths. This ensures the timestep context is available throughout the network's computation.
Adaptive normalization methods like AdaLN or AdaGN generally provide superior performance compared to simple addition, as they offer a more expressive way for the timestep to influence feature distributions. However, they also introduce slightly more parameters via the projection layers that generate γ(t) and β(t). The choice often depends on the specific model scale and performance requirements.
In summary, converting the scalar timestep t into a high-dimensional embedding using sinusoidal functions, processing it through an MLP, and integrating it into U-Net blocks via addition or, more effectively, adaptive normalization, are standard and important techniques for building high-performing diffusion models. This ensures the network is always aware of its position along the diffusion trajectory, enabling it to perform the correct denoising operation for any given noise level.
© 2025 ApX Machine Learning