While classifier guidance and classifier-free guidance (CFG) offer ways to steer generation towards predefined categories, text conditioning allows for much richer and more flexible control. Instead of just specifying a class label like "cat" or "dog," we can provide detailed descriptions like "a photorealistic image of a Siberian husky playing in the snow" and have the diffusion model attempt to generate a corresponding image. This capability is fundamental to modern text-to-image systems like DALL-E 2, Imagen, and Stable Diffusion.
The core challenge is bridging the gap between human language (text) and the numerical world of neural networks. How can the U-Net model, which operates on tensors representing images and noise, understand the meaning of a sentence? The answer lies in text embeddings.
Just as we represent images as grids of pixel values, we need a way to convert text prompts into meaningful numerical vectors, or embeddings. The goal is to create embeddings where prompts with similar meanings result in vectors that are close together in the embedding space.
Various techniques exist for creating text embeddings, from earlier methods like TF-IDF and Word2Vec to more advanced transformer-based models like BERT. However, for conditioning generative image models, a particularly effective approach involves models trained specifically to connect text and images.
A prominent example is CLIP (Contrastive Language-Image Pre-training). CLIP is trained on a massive dataset of image-text pairs. Its objective is to learn transformations for both images and text such that the embedding of a text description is close to the embedding of its corresponding image in a shared latent space. This joint training makes CLIP's text encoder particularly well-suited for image generation tasks, as its embeddings capture visual concepts described in the text. When we input a text prompt like "a red apple" into CLIP's text encoder, it outputs a vector $y$ that represents the semantic meaning of that phrase in a way that is aligned with visual data.
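As a concrete illustration, the sketch below shows one way to obtain such an embedding using the Hugging Face `transformers` implementation of the CLIP text encoder. The checkpoint name and variable names are illustrative choices, not part of any specific system described above.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Load a pretrained CLIP text encoder (checkpoint choice is illustrative).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = ["a red apple"]
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    output = text_encoder(**tokens)

# Per-token embeddings, shape (batch, sequence_length, hidden_dim);
# cross-attention-based U-Nets typically consume this sequence as y.
y = output.last_hidden_state

# A single pooled vector summarizing the whole prompt is also available.
y_pooled = output.pooler_output
```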
Once we have a text embedding $y$ representing our desired output, we need to incorporate it into the diffusion model's U-Net. Recall that the standard U-Net in a diffusion model typically takes the noisy image $x_t$ and the current timestep $t$ as input to predict the noise $\epsilon$. For text conditioning, the U-Net must be adapted to accept the text embedding $y$ as an additional input.
The prediction task of the network then becomes estimating the noise conditioned on the text:
$$\epsilon_\theta(x_t, t, y)$$

Here, $\epsilon_\theta$ represents the U-Net parameterized by weights $\theta$.
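To illustrate the extra input, here is a minimal sketch of one simple way a U-Net block could consume a pooled text embedding: project it and add it to the timestep embedding. This is only a schematic; real text-to-image U-Nets typically fuse per-token embeddings via cross-attention, which the next section covers. All class and dimension names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """Sketch: inject a pooled text embedding y into a U-Net block by adding
    its projection to the timestep embedding's projection."""

    def __init__(self, channels: int, time_dim: int, text_dim: int):
        super().__init__()
        self.time_proj = nn.Linear(time_dim, channels)
        self.text_proj = nn.Linear(text_dim, channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, h: torch.Tensor, t_emb: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, H, W), t_emb: (batch, time_dim), y: (batch, text_dim)
        cond = self.time_proj(t_emb) + self.text_proj(y)   # (batch, channels)
        h = h + cond[:, :, None, None]                     # broadcast over spatial dims
        return self.conv(h)
```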
During the reverse diffusion process (sampling), the workflow typically looks like this:

1. Encode the text prompt once with the text encoder (e.g., CLIP) to obtain the embedding $y$.
2. Start from pure noise $x_T$.
3. At each timestep $t$, pass the current noisy image $x_t$, the timestep $t$, and the embedding $y$ to the U-Net to predict the noise $\epsilon_\theta(x_t, t, y)$.
4. Use this prediction to compute the slightly less noisy image $x_{t-1}$, and repeat until $x_0$ is reached.
The following diagram illustrates how the text embedding is used during the noise prediction step within the reverse process:
Flow showing how a text prompt is encoded into an embedding $y$, which is then used alongside the noisy image $x_t$ and timestep $t$ as input to the U-Net for predicting the conditioned noise.
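The loop can be sketched in code as follows. Here `model` is assumed to implement $\epsilon_\theta(x_t, t, y)$, and `scheduler` is assumed to expose `.timesteps` and a `.step(...)` update returning an object with `.prev_sample` (a diffusers-style API); both are assumptions for illustration, not a specific library's guaranteed interface.

```python
import torch

@torch.no_grad()
def sample_with_text(model, scheduler, y, shape, device="cuda"):
    x_t = torch.randn(shape, device=device)            # start from pure Gaussian noise x_T
    for t in scheduler.timesteps:                      # iterate t = T-1, ..., 0
        eps = model(x_t, t, y)                         # noise prediction conditioned on the prompt
        x_t = scheduler.step(eps, t, x_t).prev_sample  # one reverse-diffusion update
    return x_t                                         # approximate sample x_0
```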
Text conditioning is powerful on its own, but it is often combined with Classifier-Free Guidance (CFG), which we discussed previously. To enable CFG, the diffusion model is trained with the text conditioning $y$ most of the time, and with the conditioning omitted a small fraction of the time; the omitted case is usually represented by a null or empty-prompt embedding, $y_\varnothing$. A sketch of this conditioning dropout follows.
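The snippet below is a minimal sketch of that training-time dropout, assuming `y` is a batch of prompt embeddings and `y_null` is a single null-prompt embedding broadcastable to the same shape; the 10% drop rate and all names are illustrative assumptions.

```python
import torch

def maybe_drop_text(y: torch.Tensor, y_null: torch.Tensor, p_drop: float = 0.1) -> torch.Tensor:
    """Sketch of conditioning dropout for CFG training: with probability p_drop,
    replace each prompt embedding in the batch with the null-prompt embedding."""
    batch = y.shape[0]
    drop = torch.rand(batch, device=y.device) < p_drop   # which samples lose their prompt
    drop = drop.view(batch, *([1] * (y.dim() - 1)))      # reshape for broadcasting
    return torch.where(drop, y_null.expand_as(y), y)
```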
During sampling, the U-Net performs two predictions at each step:

- A conditioned prediction, $\epsilon_\theta(x_t, t, y)$, using the text embedding.
- An unconditional prediction, $\epsilon_\theta(x_t, t, y_\varnothing)$, using the null or empty-prompt embedding.
The final noise estimate used for the denoising step is then extrapolated from these two predictions, typically as:
$$\hat{\epsilon} = \epsilon_\theta(x_t, t, y_\varnothing) + w \cdot \left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, y_\varnothing)\right)$$

where $w$ is the guidance scale. This allows the sampling process to more strongly emphasize the text prompt, often leading to generated images that better align with the description.
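A direct translation of this formula into code might look like the sketch below, again assuming `model` implements $\epsilon_\theta(x_t, t, y)$; the default guidance scale of 7.5 is an illustrative value, not prescribed by the text above.

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(model, x_t, t, y, y_null, guidance_scale: float = 7.5):
    """Combine conditioned and unconditional predictions per the CFG formula."""
    eps_uncond = model(x_t, t, y_null)   # prediction with the null/empty prompt
    eps_cond = model(x_t, t, y)          # prediction with the actual prompt
    # Extrapolate away from the unconditional prediction toward the conditioned one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In practice the two forward passes are often batched together (stacking $x_t$ with itself and $y$ with $y_\varnothing$) so the U-Net runs once per step.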
Incorporating text conditioning requires specific modifications to the U-Net architecture to effectively fuse the text embedding information with the image and timestep information. The next section, "Architecture Modifications for Conditioning," will examine common techniques like cross-attention that enable this integration.