Generating images that correspond to detailed text descriptions represents a significant step towards more intuitive human-computer interaction and creative content generation. While basic conditional GANs (cGANs), as discussed earlier, can generate images based on simple labels (like classes), generating high-resolution, photorealistic images from complex free-form text descriptions presents a much greater challenge. The text provides semantic information but lacks the pixel-level detail required for a high-fidelity image. Generating everything in one go often leads to blurry results or images that only vaguely match the description.
StackGAN addresses this challenge with a hierarchical approach, breaking down the complex text-to-image problem into more manageable sub-problems handled by stacked generative networks.
The core idea behind StackGAN is to use a two-stage generative process:
Stage-I GAN: This network focuses on sketching a preliminary, low-resolution version of the image. It takes a text embedding (derived from the input description using a pre-trained text encoder like a character-level CNN-RNN or BERT) and a random noise vector z as input. Its goal is to generate an image that captures the rough layout, colors, and basic shapes described in the text, ignoring fine-grained details and focusing on semantic coherence. The output is typically a low-resolution image (e.g., 64x64 pixels).
Stage-II GAN: This network takes the low-resolution image generated by Stage-I and the original text embedding as input. Its objective is to refine the initial sketch, correct defects, add details, and significantly increase the image resolution (e.g., to 256x256 pixels). By conditioning on both the Stage-I image and the text embedding, Stage-II can focus on adding realism and detail consistent with the initial structure and the textual description.
This staged approach allows the model to first focus on the global structure and semantics (Stage-I) before concentrating on pixel-level refinement and detail (Stage-II).
Figure: StackGAN architecture overview. Text is encoded, augmented, and used with noise $z$ in Stage-I to produce a low-resolution sketch. Stage-II takes this sketch and the text embedding to generate a refined, high-resolution image. Both stages involve adversarial training.
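To make this division of labor concrete, the sketch below outlines one possible shape for the two generators in PyTorch: Stage-I maps noise and a condition vector to a 64×64 image, while Stage-II encodes that sketch, fuses it with a spatially broadcast text embedding, and upsamples to 256×256. The dimensions, layer counts, and names (`Z_DIM`, `C_DIM`, `TXT_DIM`, `GF`) are illustrative assumptions, not the published StackGAN configuration, and residual blocks and other refinements from the paper are omitted to keep the data flow visible.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: noise, CA condition, raw text embedding, base feature maps.
# These values are assumptions, not the published configuration.
Z_DIM, C_DIM, TXT_DIM, GF = 100, 128, 1024, 64

def up_block(in_ch, out_ch):
    """Nearest-neighbor upsampling followed by a 3x3 convolution."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, 1, 1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class StageIGenerator(nn.Module):
    """Sketches a coarse 64x64 image from noise z and the sampled condition c_hat."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(Z_DIM + C_DIM, GF * 8 * 4 * 4)
        self.upsample = nn.Sequential(
            up_block(GF * 8, GF * 4),   # 4x4   -> 8x8
            up_block(GF * 4, GF * 2),   # 8x8   -> 16x16
            up_block(GF * 2, GF),       # 16x16 -> 32x32
            up_block(GF, GF),           # 32x32 -> 64x64
            nn.Conv2d(GF, 3, 3, 1, 1),
            nn.Tanh(),
        )

    def forward(self, z, c_hat):
        h = self.fc(torch.cat([z, c_hat], dim=1)).view(-1, GF * 8, 4, 4)
        return self.upsample(h)         # (B, 3, 64, 64) coarse sketch

class StageIIGenerator(nn.Module):
    """Refines the 64x64 sketch into a 256x256 image, reusing the text embedding."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(    # downsample the sketch to 16x16 feature maps
            nn.Conv2d(3, GF, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(GF, GF * 2, 4, 2, 1), nn.BatchNorm2d(GF * 2), nn.ReLU(inplace=True),
        )
        self.refine = nn.Sequential(    # fuse text features, then upsample to 256x256
            nn.Conv2d(GF * 2 + TXT_DIM, GF * 2, 3, 1, 1),
            nn.BatchNorm2d(GF * 2), nn.ReLU(inplace=True),
            up_block(GF * 2, GF * 2),   # 16  -> 32
            up_block(GF * 2, GF * 2),   # 32  -> 64
            up_block(GF * 2, GF),       # 64  -> 128
            up_block(GF, GF),           # 128 -> 256
            nn.Conv2d(GF, 3, 3, 1, 1),
            nn.Tanh(),
        )

    def forward(self, low_res, phi_t):
        feat = self.encode(low_res)                             # (B, GF*2, 16, 16)
        txt = phi_t[:, :, None, None].expand(-1, -1, 16, 16)    # broadcast text spatially
        return self.refine(torch.cat([feat, txt], dim=1))       # (B, 3, 256, 256)
```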
A potential issue in conditional generation is that the space of text embeddings might be sparse or not sufficiently smooth. If the embeddings for similar descriptions are far apart, the generator might struggle to generalize. StackGAN introduces Conditioning Augmentation (CA) to address this.
Instead of directly using the text embedding $\phi_t$ produced by the encoder, CA models the embedding as a Gaussian distribution $\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t))$, where the mean $\mu(\phi_t)$ and diagonal covariance matrix $\Sigma(\phi_t)$ are functions of the original embedding $\phi_t$ (often implemented with small fully connected layers). During training, a conditioning variable $\hat{c}$ is sampled from this distribution:

$$\hat{c} \sim \mathcal{N}\big(\mu(\phi_t), \Sigma(\phi_t)\big)$$

This sampled $\hat{c}$ is then used as the condition for the Stage-I generator $G_1(z, \hat{c})$. This sampling process encourages robustness to small variations in the text embedding and creates a smoother conditioning manifold, making the generator's learning task easier. A regularization term, the Kullback-Leibler (KL) divergence between the learned distribution $\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t))$ and a standard Gaussian $\mathcal{N}(0, I)$, is added to the generator's objective to prevent the variance from collapsing:

$$\mathcal{L}_{CA} = D_{KL}\big(\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t)) \,\|\, \mathcal{N}(0, I)\big)$$

The Stage-II GAN typically uses the original text embedding $\phi_t$ directly for conditioning, as it primarily focuses on adding detail based on the structure already established by Stage-I and the semantic guidance from the text.
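In code, Conditioning Augmentation amounts to the reparameterization trick familiar from VAEs: a small fully connected layer predicts the mean and log-variance, a condition $\hat{c}$ is sampled from the resulting Gaussian, and the KL term above is returned so it can be added to the Stage-I generator loss. The module below is a minimal sketch; the single-layer design and the default dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Maps a text embedding phi_t to N(mu(phi_t), Sigma(phi_t)) and samples
    c_hat with the reparameterization trick. Dimensions are illustrative."""
    def __init__(self, txt_dim=1024, c_dim=128):
        super().__init__()
        # A single linear layer predicts both the mean and the log-variance
        # of the diagonal Gaussian.
        self.fc = nn.Linear(txt_dim, c_dim * 2)

    def forward(self, phi_t):
        mu, logvar = self.fc(phi_t).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        c_hat = mu + torch.randn_like(std) * std   # c_hat ~ N(mu, Sigma), differentiable
        # KL( N(mu, Sigma) || N(0, I) ), averaged over the batch; added to the
        # Stage-I generator objective as L_CA.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
        return c_hat, kl
```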
StackGAN is trained in two phases: Stage-I is trained first, and Stage-II is then trained on the outputs of the already-trained Stage-I generator.
Stage-I: The generator $G_1$ and discriminator $D_1$ are trained adversarially. $G_1$ tries to generate realistic low-resolution images matching the sampled condition $\hat{c}$, with the $\mathcal{L}_{CA}$ regularization added to its objective. $D_1$ tries to distinguish real low-resolution images (downsampled from the dataset) paired with their corresponding condition $\hat{c}$ from fake low-resolution images produced by $G_1$, also paired with $\hat{c}$.
Stage-II: The generator $G_2$ and discriminator $D_2$ are trained adversarially. $G_2$ takes the low-resolution output $s_0 = G_1(z, \hat{c})$ and the original text embedding $\phi_t$ as input, aiming to produce a realistic high-resolution image $s_1 = G_2(s_0, \phi_t)$. The discriminator $D_2$ learns to distinguish real high-resolution images paired with their text embedding $\phi_t$ from the fake high-resolution images $s_1$ generated by $G_2$, also conditioned on $\phi_t$.
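A per-batch update for each stage might look like the following sketch. It assumes the generator and CA modules from the earlier snippets, hypothetical conditional discriminators `d1` and `d2`, their optimizers, and a plain binary cross-entropy GAN loss; the paper's exact losses and hyperparameters (such as the KL weight) may differ.

```python
import torch
import torch.nn.functional as F

def gan_loss(logits, is_real):
    """Plain binary cross-entropy GAN loss on discriminator logits."""
    target = torch.ones_like(logits) if is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def stage1_step(g1, d1, ca, opt_g1, opt_d1, real_256, phi_t, z_dim=100, lambda_kl=2.0):
    """One Stage-I update; opt_g1 is assumed to cover both G1 and the CA network."""
    real_64 = F.interpolate(real_256, size=64, mode="bilinear", align_corners=False)
    c_hat, kl = ca(phi_t)                       # sampled condition and L_CA term
    z = torch.randn(real_256.size(0), z_dim, device=real_256.device)
    fake_64 = g1(z, c_hat)

    # D1: real (image, condition) pairs vs. generated pairs
    d_loss = gan_loss(d1(real_64, c_hat.detach()), True) \
           + gan_loss(d1(fake_64.detach(), c_hat.detach()), False)
    opt_d1.zero_grad(); d_loss.backward(); opt_d1.step()

    # G1: fool D1, plus the KL regularizer on the CA distribution
    g_loss = gan_loss(d1(fake_64, c_hat), True) + lambda_kl * kl
    opt_g1.zero_grad(); g_loss.backward(); opt_g1.step()

def stage2_step(g1, g2, d2, ca, opt_g2, opt_d2, real_256, phi_t, z_dim=100):
    """One Stage-II update; G1 and CA act as a fixed, already-trained sketcher."""
    with torch.no_grad():
        c_hat, _ = ca(phi_t)
        s0 = g1(torch.randn(real_256.size(0), z_dim, device=real_256.device), c_hat)
    s1 = g2(s0, phi_t)                          # refined 256x256 output

    # D2: real high-resolution images vs. refined fakes, both paired with phi_t
    d_loss = gan_loss(d2(real_256, phi_t), True) + gan_loss(d2(s1.detach(), phi_t), False)
    opt_d2.zero_grad(); d_loss.backward(); opt_d2.step()

    g_loss = gan_loss(d2(s1, phi_t), True)      # fool D2 while staying text-consistent
    opt_g2.zero_grad(); g_loss.backward(); opt_g2.step()
```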
Both stages use standard GAN loss functions (or variants such as the Wasserstein loss) combined with the conditioning information. The key point is that Stage-II's discriminator must evaluate not only the realism of the high-resolution image but also its consistency with the text embedding $\phi_t$.
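One simple way to give the discriminator this dual role is to fuse the image features with a spatially broadcast copy of the text embedding before producing a single logit, so a low score can reflect either poor realism or a text/image mismatch. The class below is an illustrative $D_2$-style design under that assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Illustrative D2-style discriminator: scores a 256x256 image jointly with
    the text embedding, so one logit reflects both realism and text consistency.
    Layer sizes are assumptions, not the paper's exact design."""
    def __init__(self, txt_dim=1024, df=64):
        super().__init__()
        layers = []
        in_ch = 3
        for i in range(6):                      # 256 -> 4 spatial resolution
            out_ch = df * 2 ** i
            layers += [nn.Conv2d(in_ch, out_ch, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True)]
            in_ch = out_ch
        self.img_encoder = nn.Sequential(*layers)
        self.joint = nn.Sequential(             # fuse image features with the text embedding
            nn.Conv2d(df * 32 + txt_dim, df * 32, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(df * 32, 1, 4),           # one realism/consistency logit per image
        )

    def forward(self, img, phi_t):
        feat = self.img_encoder(img)                            # (B, df*32, 4, 4)
        txt = phi_t[:, :, None, None].expand(-1, -1, 4, 4)      # broadcast text spatially
        return self.joint(torch.cat([feat, txt], dim=1)).view(-1)
```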
StackGAN demonstrated a significant improvement in generating high-resolution images (e.g., 256x256) from text descriptions compared to previous single-stage approaches. The hierarchical refinement strategy proved effective for tackling the complexity of mapping abstract text to detailed pixels. Subsequent work, like StackGAN++, further refined this approach with improved loss functions and training stability techniques.