While the original StyleGAN architecture represented a significant leap forward in generating high-resolution, visually appealing images with disentangled controls, practitioners quickly identified certain characteristic visual artifacts and limitations. StyleGAN2 was developed specifically to address these shortcomings, refining the generator and training process for improved fidelity and smoother results.
One of the most noticeable issues in StyleGAN outputs was the appearance of blob-like or "water droplet" artifacts. These were traced back to the Adaptive Instance Normalization (AdaIN) operation used within the synthesis network. Recall that AdaIN first normalizes the feature map activations (zero mean, unit variance) within each channel, effectively erasing magnitude information, and then applies learned scale and bias parameters derived from the style vector w.
$$\text{AdaIN}(x_i, y) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

The hypothesis was that the network struggled to reintroduce magnitude information solely through the learned bias $y_{b,i}$ after normalization had removed it. This sometimes led to the generator creating excessively strong, localized signals (the "droplets") to compensate.
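To make the operation concrete, here is a minimal PyTorch sketch of AdaIN as written above. The function name and tensor shapes are illustrative, not taken from any particular StyleGAN codebase:

```python
import torch

def adain(x, y_s, y_b, eps=1e-8):
    """Adaptive Instance Normalization, a minimal sketch.

    x:   feature maps of shape (N, C, H, W)
    y_s: per-channel style scales of shape (N, C)
    y_b: per-channel style biases of shape (N, C)
    """
    # Normalize each channel of each sample to zero mean, unit variance,
    # discarding the original magnitude information.
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    x_norm = (x - mu) / (sigma + eps)
    # Re-introduce scale and bias from the style vector.
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]
```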
StyleGAN2 replaces AdaIN with a technique called Modulated Convolution followed by Demodulation (ModDemod).
Modulation: Instead of applying style information after normalization, StyleGAN2 incorporates the style directly into the convolutional weights before the convolution. For a convolutional layer with weights $w_{ijk}$ (output channel $j$, input channel $i$, spatial kernel position $k$) and an incoming style scale $s_i$ derived from $w$ for input channel $i$, the modulated weights $w'_{ijk}$ become:

$$w'_{ijk} = s_i \cdot w_{ijk}$$

This scales the influence of each input feature map based on the style.
Demodulation: Applying modulated weights can drastically change the overall scale of the output activations. To counteract this and prevent the signal from escalating or vanishing, demodulation is applied after the modulated convolution. It normalizes the output feature map $x'$ based on the $L_2$ norm of the modulated weights used to produce it. For each output feature map $j$:

$$x''_j = \frac{x'_j}{\sqrt{\sum_{i,k} \left(w'_{ijk}\right)^2 + \epsilon}}$$

Here, $\epsilon$ is a small constant for numerical stability. This demodulation step keeps the standard deviation of the output activations approximately at unit scale, standardizing the signal magnitude based on the learned weights themselves rather than erasing information via instance normalization.
This combined ModDemod operation removes the need for AdaIN and, consequently, eliminates the primary source of the droplet artifacts, leading to cleaner feature maps.
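In practice, both steps can be fused into a single weight transformation applied before the convolution. Below is a hedged PyTorch sketch of this idea; the shapes and the grouped-convolution trick for handling per-sample weights are illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn.functional as F

def mod_demod_conv(x, weight, style, eps=1e-8):
    """Modulated convolution with demodulation (a sketch).

    x:      input features, shape (N, C_in, H, W)
    weight: base conv weights, shape (C_out, C_in, k, k)
    style:  per-sample input-channel scales s_i, shape (N, C_in)
    """
    n, c_in, h, w_spatial = x.shape
    c_out, _, k, _ = weight.shape

    # Modulation: w'_{ijk} = s_i * w_{ijk}, done per sample in the batch.
    w = weight[None] * style[:, None, :, None, None]  # (N, C_out, C_in, k, k)

    # Demodulation: scale by the inverse L2 norm of the modulated weights
    # over input channels and kernel positions, per output channel.
    demod = torch.rsqrt(w.pow(2).sum(dim=(2, 3, 4)) + eps)  # (N, C_out)
    w = w * demod[:, :, None, None, None]

    # Grouped convolution applies a different weight set to each sample.
    x = x.reshape(1, n * c_in, h, w_spatial)
    w = w.reshape(n * c_out, c_in, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=n)
    return out.reshape(n, c_out, h, w_spatial)
```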
Another artifact observed in StyleGAN was related to phase. During interpolations in the latent space (or when generating animations), certain features like eyes or teeth might appear "stuck" to the image canvas instead of moving naturally with the apparent pose or viewpoint change. This suggested issues with how the network represented spatial frequencies and transformations.
The StyleGAN2 authors identified the progressive growing technique (inherited from ProGAN) as a potential contributor. While progressive growing helps stabilize training for high resolutions initially, dynamically changing the network architecture during training might interfere with the generator learning consistent phase behavior across different feature scales.
StyleGAN2 abandons progressive growing. Instead, it trains the full-resolution network from the beginning. To manage the flow of information from lower to higher resolutions effectively (which progressive growing aimed to help with), StyleGAN2 employs skip connections: `toRGB` layers map intermediate feature resolutions directly to RGB outputs, and these intermediate RGB outputs are then upsampled and summed together to form the final image. This allows gradients to flow more directly to earlier layers and encourages the network to utilize features from all resolution levels simultaneously.

Simplified view of the StyleGAN2 generator architecture, highlighting skip connections (`toRGB` outputs from multiple resolutions) which replace progressive growing. Intermediate outputs are combined to form the final image.
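A schematic of this skip-connection design in PyTorch might look like the following. The channel counts, activation, and nearest-neighbor upsampling are placeholder choices; StyleGAN2 itself uses filtered upsampling and modulated convolutions in each block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGenerator(nn.Module):
    """Schematic skip-connection generator: each resolution contributes
    an RGB image via its own toRGB layer; contributions are upsampled
    and summed. Layer sizes here are placeholders, not StyleGAN2's."""

    def __init__(self, channels=(512, 256, 128, 64)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv2d(c_in, c_out, 3, padding=1)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.to_rgb = nn.ModuleList(
            nn.Conv2d(c, 3, 1) for c in channels[1:]
        )

    def forward(self, x):
        rgb = None
        for block, to_rgb in zip(self.blocks, self.to_rgb):
            x = F.interpolate(x, scale_factor=2)   # go up one resolution
            x = F.leaky_relu(block(x), 0.2)
            skip = to_rgb(x)                       # RGB output at this scale
            # Upsample the running image and add this scale's contribution.
            rgb = skip if rgb is None else F.interpolate(rgb, scale_factor=2) + skip
        return rgb
```

With these placeholder sizes, `SkipGenerator()(torch.randn(1, 512, 4, 4))` yields a 3-channel 32x32 image assembled from summed per-resolution contributions.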
Removing progressive growing and using this alternative multi-scale architecture design significantly reduced the phase artifacts, leading to more natural-looking transformations and interpolations.
To further improve image quality and the smoothness of the latent space $W$, StyleGAN2 introduces path length regularization ($L_{\text{path}}$) during training. The goal is to encourage the mapping from $W$ to the generated image to be more regular, meaning that a step of a fixed size in $W$ should correspond to a roughly fixed-magnitude change in the image, regardless of the location in $W$ or the direction of the step. Abrupt changes in the image for small steps in $W$ are undesirable and correlate with lower visual quality.
The path length regularizer penalizes the deviation of the magnitude of the image space gradient (Jacobian) from a constant value. It's formulated as:
$$L_{\text{path}} = \mathbb{E}_{w,\, y \sim \mathcal{N}(0, I)}\left[\left(\left\lVert J_w^{T} y \right\rVert_2 - a\right)^2\right]$$

Where:

- $J_w = \partial G(w) / \partial w$ is the Jacobian of the generator output with respect to the intermediate latent $w$.
- $y$ is an image whose pixel intensities are drawn from a normal distribution, giving a random direction in image space.
- $a$ is the target path length, maintained in practice as an exponential moving average of $\lVert J_w^{T} y \rVert_2$ over training, so the penalty adapts to the current scale rather than a fixed constant.
By minimizing this loss, the generator is incentivized to make the mapping $G: W \rightarrow \text{image}$ less "curvy" and more locally predictable. This regularization improves model conditioning, enhances image quality (often measured by FID), and leads to perceptually smoother interpolations.
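A sketch of how this penalty can be computed with automatic differentiation is shown below. It assumes $w$ has shape (N, dim) with gradients enabled; the helper name and EMA decay are illustrative. The product $J_w^{T} y$ is obtained without materializing the full Jacobian by backpropagating the scalar $\langle G(w), y \rangle$:

```python
import torch

def path_length_penalty(fake_images, w, a, decay=0.01):
    """Path length regularizer sketch (argument names are illustrative).

    fake_images: generator output G(w), shape (N, C, H, W)
    w:           latent codes used to produce the images (requires_grad=True)
    a:           running average of path lengths (a Python float)
    """
    n, c, h, width = fake_images.shape
    # Random image-space direction y ~ N(0, I); scaling by 1/sqrt(H*W)
    # keeps the expected magnitude independent of resolution.
    y = torch.randn_like(fake_images) / (h * width) ** 0.5
    # J_w^T y as the gradient of <G(w), y> with respect to w.
    grad, = torch.autograd.grad(
        outputs=(fake_images * y).sum(), inputs=w, create_graph=True
    )
    lengths = grad.pow(2).sum(dim=1).sqrt()  # ||J_w^T y||_2 per sample
    # Update the target a as an exponential moving average of the lengths.
    a = a + decay * (lengths.mean().item() - a)
    penalty = (lengths - a).pow(2).mean()
    return penalty, a
```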
Regularization terms like Lpath or the R1 gradient penalty for the discriminator can add significant computational overhead to each training iteration. StyleGAN2 introduces the concept of "lazy regularization," where these computationally intensive regularization terms are computed and applied less frequently than the main generator and discriminator losses. For instance, the path length regularization might only be calculated once every 16 mini-batches, while the main adversarial loss is computed every mini-batch. This simple trick drastically reduces the training time cost associated with these regularizers without substantially impacting their effectiveness or the final model quality.
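As an illustration, the training-loop skeleton below applies the path length penalty from the earlier sketch only every 16th step, scaling it by the interval so its time-averaged strength is preserved. The tiny generator and placeholder loss exist only to make the snippet self-contained:

```python
import torch

# Minimal stand-ins so the loop runs; a real setup would use the full
# StyleGAN2 generator, discriminator, and adversarial losses.
w_dim, batch_size, num_steps, reg_interval = 64, 4, 64, 16
generator = torch.nn.Sequential(torch.nn.Linear(w_dim, 3 * 8 * 8),
                                torch.nn.Unflatten(1, (3, 8, 8)))
g_optim = torch.optim.Adam(generator.parameters(), lr=1e-3)
pl_mean = 0.0  # running average `a` for the path length target

for step in range(num_steps):
    w = torch.randn(batch_size, w_dim, requires_grad=True)
    fake = generator(w)
    loss = torch.nn.functional.softplus(-fake.mean())  # placeholder G loss

    # Lazy regularization: compute the expensive penalty only every
    # reg_interval steps, scaled so its average effect is unchanged.
    if step % reg_interval == 0:
        penalty, pl_mean = path_length_penalty(fake, w, pl_mean)
        loss = loss + penalty * reg_interval

    g_optim.zero_grad()
    loss.backward()
    g_optim.step()
```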
Collectively, these refinements in StyleGAN2 led to substantial improvements: the droplet artifacts were eliminated, phase artifacts during interpolation were greatly reduced, FID scores improved markedly, and the training overhead of the regularizers was kept low through lazy regularization.
These enhancements solidified StyleGAN2 as a benchmark architecture for high-resolution image synthesis and further demonstrated the importance of careful architectural design and training regularization in achieving top-tier generative results. Understanding these improvements provides valuable insights into tackling common challenges in GAN development.