While Progressive GANs (ProGANs) significantly improved the generation of high-resolution images, controlling the characteristics of these images remained a challenge. The standard GAN latent space Z often exhibits entanglement, meaning that changing a single dimension of the latent vector z can affect multiple unrelated attributes in the generated image simultaneously. StyleGAN, introduced by Karras et al. (NVIDIA), addresses this by redesigning the generator architecture to promote a more disentangled representation of styles.
The first major innovation in StyleGAN is the introduction of a mapping network, denoted as f. Instead of directly feeding the input latent vector z (typically drawn from a standard normal distribution) into the synthesis network, StyleGAN first transforms z into an intermediate latent space W using a non-linear mapping network:
w=f(z)
This mapping network f is usually implemented as a multi-layer perceptron (MLP). The key idea is that, unlike Z, the intermediate space W is not tied to a fixed sampling distribution: the mapping network learns a transformation that "unwarps" z into a space where factors of variation are more likely to be linearly separable. The intermediate latent vectors w ∈ W are therefore argued to be less entangled than points in the input space Z. Because W is learned, it can adapt to the probability density of the features in the training data rather than conforming to a predefined distribution like Z.
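As a concrete illustration, the following PyTorch sketch shows what such a mapping network might look like. The 8-layer depth and 512-dimensional latents mirror the original paper, but the exact layer sizes and the z-normalization step are illustrative assumptions rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps z from the input latent space Z to the intermediate space W.
    A minimal sketch: depth and widths here are illustrative assumptions."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = []
        in_dim = z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalizing z to unit length is common practice in StyleGAN-style models.
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.net(z)  # w = f(z)

# Usage: draw z from a standard normal and map it to w.
z = torch.randn(4, 512)
w = MappingNetwork()(z)   # shape: (4, 512)
```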
The second core component is the synthesis network, g, which generates the image based on the intermediate latent code w. Unlike traditional GAN generators that consume the latent vector at the input layer, the StyleGAN synthesis network starts from a learned constant tensor. The style information, derived from w, is injected at multiple points throughout the network.
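A minimal sketch of this learned starting tensor is shown below; the 4x4 resolution and 512 channels are assumptions matching the coarsest synthesis block, not a fixed requirement.

```python
import torch
import torch.nn as nn

class SynthesisInput(nn.Module):
    """The synthesis network starts from a learned constant rather than from z.
    Resolution and channel count here are illustrative assumptions."""
    def __init__(self, channels=512, resolution=4):
        super().__init__()
        self.const = nn.Parameter(torch.randn(1, channels, resolution, resolution))

    def forward(self, batch_size):
        # The same learned constant is broadcast to every sample in the batch;
        # all per-image variation comes from the styles (w) and the noise inputs.
        return self.const.expand(batch_size, -1, -1, -1)
```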
This injection happens via Adaptive Instance Normalization (AdaIN) layers. Recall that Instance Normalization normalizes each feature map to zero mean and unit standard deviation, with the statistics computed independently for each sample and each channel. AdaIN extends this by modulating the normalized features using style information derived from w.
Specifically, for each feature map $x_i$ at a given layer $i$, AdaIN first normalizes it per channel:

$$\text{Normalized}(x_i) = \frac{x_i - \mu(x_i)}{\sigma(x_i)}$$

where $\mu(x_i)$ and $\sigma(x_i)$ are the mean and standard deviation computed spatially for each channel of each sample in the batch. AdaIN then applies a learned affine transformation (scale $y_{s,i}$ and bias $y_{b,i}$) derived from the intermediate latent vector w to inject the style:

$$\text{AdaIN}(x_i, w) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
The scale ($y_{s,i}$) and bias ($y_{b,i}$) parameters are obtained by passing w through a separate learned affine transformation (a dense layer) for each layer where AdaIN is applied. This mechanism allows w to control the style (pose, identity features, lighting, texture, and so on) represented in the feature maps at different resolutions within the synthesis network g. Coarser resolutions (early layers) typically control larger structures, while finer resolutions (later layers) control finer details.
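The sketch below shows one way an AdaIN layer with its per-layer affine transformation could be written in PyTorch. The 512-dimensional w and the use of nn.InstanceNorm2d are assumptions chosen for clarity; production implementations differ in initialization and other details.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization driven by the intermediate latent w.
    A sketch assuming a 512-dimensional w; each AdaIN layer owns its own
    learned affine map from w to per-channel scale and bias."""
    def __init__(self, num_channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels)        # per-sample, per-channel normalization
        self.affine = nn.Linear(w_dim, 2 * num_channels)   # the 'A' block: w -> (y_s, y_b)

    def forward(self, x, w):
        y = self.affine(w)                                  # (batch, 2 * channels)
        y_s, y_b = y.chunk(2, dim=1)
        y_s = y_s[:, :, None, None]                         # reshape for broadcasting over H, W
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(x) + y_b                     # AdaIN(x, w) = y_s * normalize(x) + y_b

# Usage: modulate a 64-channel feature map with a style vector.
x = torch.randn(4, 64, 16, 16)
w = torch.randn(4, 512)
styled = AdaIN(64)(x, w)   # shape: (4, 64, 16, 16)
```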
Real-world images contain many stochastic details (e.g., exact placement of hairs, freckles, textures) that don't correlate strongly with high-level attributes. To model this, StyleGAN introduces explicit noise inputs directly into the synthesis network. Gaussian noise is added per-pixel to the feature maps after each convolution but before the activation function. The magnitude of the noise effect is controlled by learned per-channel scaling factors.
This explicit noise injection provides a way for the network to generate convincing non-structural variations without needing to encode this information within the latent code w, further freeing up w to focus on higher-level style attributes.
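A minimal sketch of such a noise injection layer follows. The zero-initialized per-channel scale and the single-channel noise broadcast across channels are assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds per-pixel Gaussian noise scaled by a learned per-channel factor.
    In StyleGAN this sits after each convolution, before the nonlinearity."""
    def __init__(self, num_channels):
        super().__init__()
        # One learned scaling factor per feature-map channel, initialized to zero
        # so the network starts by ignoring the noise.
        self.scale = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:
            # Single-channel spatial noise, broadcast across all channels.
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise
```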
The use of the intermediate latent space W and AdaIN enables a powerful technique called style mixing. Instead of using a single w vector to control all AdaIN layers, we can use two different vectors, w1 and w2 (derived from z1 and z2), and switch between them at some point in the synthesis network.
For example, w1 might control the styles applied to the coarse resolution layers (e.g., layers 4x4 to 8x8), while w2 controls the styles for the finer resolution layers (e.g., 16x16 to 1024x1024). This often results in an image that inherits coarse attributes (like pose, general shape, hair style) from w1 and finer attributes (like texture, color scheme, fine facial details) from w2.
Style mixing acts as a form of regularization during training. By training the network to handle combinations of styles from different latent codes, it discourages the network from assuming adjacent layers are strongly correlated in style space, further improving disentanglement.
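At inference time, style mixing amounts to choosing which synthesis layers receive which w. The snippet below sketches only that selection logic; the layer count, the crossover point, and the placeholder mapping_net and synthesis_net names are assumptions for illustration, not part of any particular implementation.

```python
import torch

num_layers = 18      # assumed: two style inputs per resolution from 4x4 up to 1024x1024
crossover = 4        # layers 0..3 (coarse, 4x4-8x8) use w1; the remaining layers use w2

mapping_net = lambda z: z          # placeholder standing in for a real mapping network
z1, z2 = torch.randn(1, 512), torch.randn(1, 512)
w1, w2 = mapping_net(z1), mapping_net(z2)

# One style vector per synthesis layer: coarse layers from w1, finer layers from w2.
w_per_layer = torch.stack([w1 if i < crossover else w2 for i in range(num_layers)], dim=1)
print(w_per_layer.shape)           # (1, 18, 512)
# image = synthesis_net(w_per_layer)   # hypothetical call to the synthesis network
```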
Diagram illustrating the StyleGAN generator architecture. The mapping network transforms input latent z to intermediate latent w. The synthesis network uses w to generate style parameters (via affine transformations 'A') applied using AdaIN within each synthesis block. Learned noise (scaled via 'B') is also added at each resolution level.
The original StyleGAN architecture was highly influential but suffered from some image artifacts, particularly droplet-like blobs appearing in generated images. StyleGAN2 addressed these by redesigning the normalization: AdaIN is replaced with weight demodulation, in which the style modulates the convolution weights directly instead of normalizing feature map statistics. StyleGAN2 also moved away from progressive growing in favor of skip and residual connections and introduced path length regularization to encourage a smoother mapping from latents to images.
More recently, StyleGAN3 focused on addressing aliasing issues inherent in traditional CNN architectures. By redesigning the signal processing within the generator layers to be alias-free, StyleGAN3 achieves true rotation and translation equivariance for certain configurations. This means transformations applied to the input latent code w correspond precisely to transformations in the output image, fixing the issue where details might appear "stuck" to image coordinates regardless of the object's pose.
In summary, StyleGAN and its successors represent a significant leap in generative modeling, particularly for image synthesis. By introducing the mapping network, AdaIN-based style modulation, and explicit noise inputs, they provide enhanced control over the generation process and enable the synthesis of highly realistic and detailed images. Understanding these architectural innovations is fundamental for anyone working with advanced generative models. The hands-on practical later in this chapter will involve implementing some of these core components.