While Progressive GANs (ProGANs) significantly improved the generation of high-resolution images, controlling the characteristics of these images remained a challenge. The standard GAN latent space Z often exhibits entanglement, meaning that changing a single dimension of the latent vector z can affect multiple unrelated attributes in the generated image simultaneously. StyleGAN, introduced by Karras et al. (NVIDIA), addresses this by redesigning the generator architecture to promote a more disentangled representation of styles.
The first major innovation in StyleGAN is the introduction of a mapping network, denoted as f. Instead of directly feeding the input latent vector z (typically drawn from a standard normal distribution) into the synthesis network, StyleGAN first transforms z into an intermediate latent space W using a non-linear mapping network:
w=f(z)
This mapping network f is usually implemented as a multi-layer perceptron (MLP). The key idea is that, unlike Z, the intermediate space W is not tied to a fixed sampling distribution: the mapping network learns a transformation that "unwarps" z into a space where factors of variation are more likely to be linearly separable. The intermediate latent vectors w ∈ W are therefore argued to be less entangled than points in the input space Z. Because W is learned, it can adapt to the probability density of the features in the training data rather than conforming to a predefined distribution like Z.
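As a concrete illustration, the following PyTorch sketch shows what such a mapping network might look like. The 8-layer depth and 512-dimensional latents mirror the original paper, but the exact layer sizes and the z-normalization step are illustrative assumptions rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps z from the input latent space Z to the intermediate space W.
    A minimal sketch: depth and widths here are illustrative assumptions."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = []
        in_dim = z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalizing z to unit length is common practice in StyleGAN-style models.
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.net(z)  # w = f(z)

# Usage: draw z from a standard normal and map it to w.
z = torch.randn(4, 512)
w = MappingNetwork()(z)   # shape: (4, 512)
```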
The second core component is the synthesis network, g, which generates the image based on the intermediate latent code w. Unlike traditional GAN generators that consume the latent vector at the input layer, the StyleGAN synthesis network starts from a learned constant tensor. The style information, derived from w, is injected at multiple points throughout the network.
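A minimal sketch of this learned starting tensor is shown below; the 4x4 resolution and 512 channels are assumptions matching the coarsest synthesis block, not a fixed requirement.

```python
import torch
import torch.nn as nn

class SynthesisInput(nn.Module):
    """The synthesis network starts from a learned constant rather than from z.
    Resolution and channel count here are illustrative assumptions."""
    def __init__(self, channels=512, resolution=4):
        super().__init__()
        self.const = nn.Parameter(torch.randn(1, channels, resolution, resolution))

    def forward(self, batch_size):
        # The same learned constant is broadcast to every sample in the batch;
        # all per-image variation comes from the styles (w) and the noise inputs.
        return self.const.expand(batch_size, -1, -1, -1)
```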
This injection happens via Adaptive Instance Normalization (AdaIN) layers. Recall that Instance Normalization normalizes each feature map to zero mean and unit standard deviation, with the statistics computed independently for each sample and each channel. AdaIN extends this by modulating the normalized features using style information derived from w.
Specifically, for each feature map $x_i$ at a given layer $i$, AdaIN first normalizes it per channel:

$$\text{Normalized}(x_i) = \frac{x_i - \mu(x_i)}{\sigma(x_i)}$$

where $\mu(x_i)$ and $\sigma(x_i)$ are the mean and standard deviation computed spatially for each channel of each sample in the batch. AdaIN then applies a learned affine transformation (scale $y_{s,i}$ and bias $y_{b,i}$) derived from the intermediate latent vector w to inject the style:

$$\text{AdaIN}(x_i, w) = y_{s,i} \, \frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$
The scale ($y_{s,i}$) and bias ($y_{b,i}$) parameters are obtained by passing w through a separate learned affine transformation (a dense layer) for each layer where AdaIN is applied. This mechanism allows w to control the style (pose, identity features, lighting, texture, and so on) represented in the feature maps at different resolutions within the synthesis network g. Coarser resolutions (early layers) typically control larger structures, while finer resolutions (later layers) control finer details.
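The sketch below shows one way an AdaIN layer with its per-layer affine transformation could be written in PyTorch. The 512-dimensional w and the use of nn.InstanceNorm2d are assumptions chosen for clarity; production implementations differ in initialization and other details.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization driven by the intermediate latent w.
    A sketch assuming a 512-dimensional w; each AdaIN layer owns its own
    learned affine map from w to per-channel scale and bias."""
    def __init__(self, num_channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels)        # per-sample, per-channel normalization
        self.affine = nn.Linear(w_dim, 2 * num_channels)   # the 'A' block: w -> (y_s, y_b)

    def forward(self, x, w):
        y = self.affine(w)                                  # (batch, 2 * channels)
        y_s, y_b = y.chunk(2, dim=1)
        y_s = y_s[:, :, None, None]                         # reshape for broadcasting over H, W
        y_b = y_b[:, :, None, None]
        return y_s * self.norm(x) + y_b                     # AdaIN(x, w) = y_s * normalize(x) + y_b

# Usage: modulate a 64-channel feature map with a style vector.
x = torch.randn(4, 64, 16, 16)
w = torch.randn(4, 512)
styled = AdaIN(64)(x, w)   # shape: (4, 64, 16, 16)
```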
Real-world images contain many stochastic details (e.g., exact placement of hairs, freckles, textures) that don't correlate strongly with high-level attributes. To model this, StyleGAN introduces explicit noise inputs directly into the synthesis network. Gaussian noise is added per-pixel to the feature maps after each convolution but before the activation function. The magnitude of the noise effect is controlled by learned per-channel scaling factors.
This explicit noise injection provides a way for the network to generate convincing non-structural variations without needing to encode this information within the latent code w, further freeing up w to focus on higher-level style attributes.
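A minimal sketch of such a noise injection layer follows. The zero-initialized per-channel scale and the single-channel noise broadcast across channels are assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Adds per-pixel Gaussian noise scaled by a learned per-channel factor.
    In StyleGAN this sits after each convolution, before the nonlinearity."""
    def __init__(self, num_channels):
        super().__init__()
        # One learned scaling factor per feature-map channel, initialized to zero
        # so the network starts by ignoring the noise.
        self.scale = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:
            # Single-channel spatial noise, broadcast across all channels.
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise
```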
The use of the intermediate latent space W and AdaIN enables a powerful technique called style mixing. Instead of using a single w vector to control all AdaIN layers, we can use two different vectors, w1 and w2 (derived from z1 and z2), and switch between them at some point in the synthesis network.
For example, w1 might control the styles applied to the coarse resolution layers (e.g., layers 4x4 to 8x8), while w2 controls the styles for the finer resolution layers (e.g., 16x16 to 1024x1024). This often results in an image that inherits coarse attributes (like pose, general shape, hair style) from w1 and finer attributes (like texture, color scheme, fine facial details) from w2.
Style mixing acts as a form of regularization during training. By training the network to handle combinations of styles from different latent codes, it discourages the network from assuming adjacent layers are strongly correlated in style space, further improving disentanglement.
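At inference time, style mixing amounts to choosing which synthesis layers receive which w. The snippet below sketches only that selection logic; the layer count, the crossover point, and the placeholder mapping_net and synthesis_net names are assumptions for illustration, not part of any particular implementation.

```python
import torch

num_layers = 18      # assumed: two style inputs per resolution from 4x4 up to 1024x1024
crossover = 4        # layers 0..3 (coarse, 4x4-8x8) use w1; the remaining layers use w2

mapping_net = lambda z: z          # placeholder standing in for a real mapping network
z1, z2 = torch.randn(1, 512), torch.randn(1, 512)
w1, w2 = mapping_net(z1), mapping_net(z2)

# One style vector per synthesis layer: coarse layers from w1, finer layers from w2.
w_per_layer = torch.stack([w1 if i < crossover else w2 for i in range(num_layers)], dim=1)
print(w_per_layer.shape)           # (1, 18, 512)
# image = synthesis_net(w_per_layer)   # hypothetical call to the synthesis network
```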
Diagram illustrating the StyleGAN generator architecture. The mapping network transforms input latent z to intermediate latent w. The synthesis network uses w to generate style parameters (via affine transformations 'A') applied using AdaIN within each synthesis block. Learned noise (scaled via 'B') is also added at each resolution level.
The original StyleGAN architecture was highly influential but suffered from some image artifacts, particularly droplet-like blobs appearing in generated images. StyleGAN2 addressed these by redesigning the normalization: AdaIN is replaced with weight demodulation, in which the style modulates the convolution weights directly instead of normalizing feature map statistics. StyleGAN2 also moved away from progressive growing in favor of skip and residual connections and introduced path length regularization to encourage a smoother mapping from latents to images.
More recently, StyleGAN3 focused on addressing aliasing issues inherent in traditional CNN architectures. By redesigning the signal processing within the generator layers to be alias-free, StyleGAN3 achieves true rotation and translation equivariance for certain configurations. This means transformations applied to the input latent code w correspond precisely to transformations in the output image, fixing the issue where details might appear "stuck" to image coordinates regardless of the object's pose.
In summary, StyleGAN and its successors represent a significant leap in generative modeling, particularly for image synthesis. By introducing the mapping network, AdaIN-based style modulation, and explicit noise inputs, they provide enhanced control over the generation process and enable the synthesis of highly realistic and detailed images. Understanding these architectural innovations is fundamental for anyone working with advanced generative models. The hands-on practical later in this chapter will involve implementing some of these core components.