While architectures like ProGAN demonstrated effective strategies for generating high-resolution images by progressively increasing network depth, controlling the specific attributes of the generated images remained a challenge. The standard latent space z often exhibits entanglement, meaning that changing one dimension in z can affect multiple, unrelated features in the output image. StyleGAN, introduced by Karras et al. (NVIDIA), represents a significant architectural shift designed specifically to address this issue and provide more intuitive control over the synthesis process.
Instead of feeding the initial latent code z directly into the generator network, StyleGAN introduces several innovative components that work together to disentangle style features at different levels of granularity.
The first major change is the introduction of a mapping network, denoted as f. This is typically an 8-layer Multi-Layer Perceptron (MLP). Its purpose is to transform the input latent code z, drawn from a standard distribution (e.g., Gaussian), into an intermediate latent space W. So, w=f(z), where w∈W.
A simplified view of the mapping network transforming the input latent z into the intermediate latent w.
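As a concrete reference point, here is a minimal sketch of the mapping network in PyTorch. The 512-dimensional width, LeakyReLU activations, and the input normalization step are illustrative defaults, not a reproduction of the official implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the mapping network f: an 8-layer MLP taking
# z ~ N(0, I) to the intermediate latent w = f(z).
class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize z before the MLP (a pixel-norm-style step used in StyleGAN).
        z = z / (z.pow(2).mean(dim=1, keepdim=True) + 1e-8).sqrt()
        return self.net(z)

z = torch.randn(4, 512)      # latent codes from Z
w = MappingNetwork()(z)      # intermediate latents in W, shape (4, 512)
```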
Why introduce this intermediate space W? The mapping network f does not have to preserve the distribution of the input space Z. This freedom allows f to "unwarp" the latent space, making W potentially less entangled than Z. Factors of variation in the data might be more linearly represented in W, making it easier to control specific attributes during generation. For instance, disentangling factors like identity, pose, and lighting might be more feasible in W.
The second major component is the synthesis network, g. Unlike traditional GAN generators that receive the latent code z as direct input at the first layer, the StyleGAN synthesis network starts from a learned constant tensor. The intermediate latent code w is then used to control the features generated by g at multiple points throughout the network.
Noise is also explicitly added at each resolution level of the synthesis network. This noise is sampled independently for each layer and provides a mechanism for the network to generate stochastic details, like the exact placement of hairs or freckles, without relying solely on the latent code w.
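The two ideas in the preceding paragraphs, the learned constant starting tensor and per-layer noise injection, can be sketched as small PyTorch modules. Shapes and module names here are assumptions made for illustration, not the official code.

```python
import torch
import torch.nn as nn

class ConstantInput(nn.Module):
    def __init__(self, channels=512, size=4):
        super().__init__()
        # The synthesis network begins from this learned 4x4 tensor, not from z.
        self.const = nn.Parameter(torch.randn(1, channels, size, size))

    def forward(self, batch_size):
        return self.const.expand(batch_size, -1, -1, -1)

class NoiseInjection(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Learned per-channel scaling of the injected noise.
        self.weight = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        # Fresh single-channel Gaussian noise, broadcast across channels.
        noise = torch.randn(x.size(0), 1, x.size(2), x.size(3), device=x.device)
        return x + self.weight * noise

x = ConstantInput()(batch_size=2)    # (2, 512, 4, 4) learned starting tensor
x = NoiseInjection(512)(x)           # stochastic detail added at this layer
```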
The mechanism for injecting the style information from w into the synthesis network g is Adaptive Instance Normalization (AdaIN). AdaIN modifies the normalized activations of the synthesis network based on style information derived from w.
Recall that Instance Normalization (IN) normalizes the feature statistics (mean and standard deviation) of each channel per sample. AdaIN extends this by modulating the normalized features with a style scale ys and bias yb derived from the intermediate latent vector w through a learned affine transformation. For an activation map xi at a specific layer i:

AdaIN(xi, w) = ys,i · (xi − μ(xi)) / σ(xi) + yb,i

Here, μ(xi) and σ(xi) are the per-channel mean and standard deviation of xi, and ys,i and yb,i are the style scale and bias computed from w.
Essentially, AdaIN first normalizes the feature map xi to have zero mean and unit variance per channel, removing the original style information encoded in these statistics. Then, it scales and shifts the normalized map using parameters derived from w, effectively injecting the target style specified by w into that layer's features.
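A minimal AdaIN module might look like the sketch below, assuming a 512-dimensional w and a learned affine layer that produces the per-channel scale and bias. The "+ 1" offset on the scale, which starts the modulation near identity, is a simplification of the official initialization.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)       # zero mean, unit variance per channel
        self.affine = nn.Linear(w_dim, channels * 2)  # w -> (ys, yb)

    def forward(self, x, w):
        y_s, y_b = self.affine(w).chunk(2, dim=1)     # each of shape (B, C)
        y_s = y_s.view(-1, x.size(1), 1, 1) + 1.0     # start the scale near identity
        y_b = y_b.view(-1, x.size(1), 1, 1)
        return y_s * self.norm(x) + y_b

x = torch.randn(2, 256, 16, 16)   # activations at some layer
w = torch.randn(2, 512)           # intermediate latents
styled = AdaIN(256)(x, w)         # same shape as x, restyled by w
```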
Diagram of the StyleGAN synthesis network (g). The intermediate latent w controls styles via AdaIN layers, and independent noise inputs add stochastic details at different resolutions.
By applying AdaIN after each convolution layer (or block) in the synthesis network, StyleGAN controls the visual style features (like color scheme, texture, lighting) represented at different scales using the single latent vector w. The constant input ensures that the network learns all spatial information from scratch, guided only by the style inputs from w and the injected noise.
The architecture promotes disentanglement because the mapping network f can learn to map z to a less entangled W space, and the synthesis network g uses w globally via AdaIN rather than just at the input layer. This structure facilitates powerful control mechanisms:
Style Mixing: During training, a percentage of images are generated using two different intermediate latent codes, w1 and w2. One code (w1) controls the styles for a subset of layers (e.g., coarse spatial resolutions, 4x4 to 8x8), and the other code (w2) controls the styles for the remaining layers (e.g., finer resolutions, 16x16 to 1024x1024). This technique encourages the network to localize style control to specific subsets of layers and prevents it from assuming correlations between styles at different levels. At inference time, this allows for creative mixing of styles: taking the coarse structure (pose, face shape) from an image generated with w1 and combining it with the finer details (hair texture, skin color) from an image generated with w2.
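The sketch below illustrates the style-mixing idea. It assumes a mapping network and a synthesis network that accepts one w per style layer (a list of length num_layers); the crossover index decides where control switches from w1 to w2. These interfaces are assumptions for illustration, not the official API.

```python
def mix_styles(mapping, synthesis, z1, z2, crossover, num_layers=18):
    w1, w2 = mapping(z1), mapping(z2)
    # Coarse layers (below the crossover) follow w1; finer layers follow w2.
    ws = [w1 if i < crossover else w2 for i in range(num_layers)]
    return synthesis(ws)
```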
Perceptual Path Length (PPL) Regularization: StyleGAN training often incorporates a regularization term that encourages smoothness in the mapping from W to the image space. Small steps in W should correspond to small perceptual changes in the generated image, further improving disentanglement and the quality of interpolations. (PPL is discussed further in Chapter 5 on evaluation.)
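A rough sketch of the PPL idea in W space is shown below, assuming the third-party lpips package for the perceptual distance and generator outputs in the range lpips expects. Averaging over many samples and the cropping used in the paper are omitted.

```python
import torch
import lpips  # third-party perceptual distance package (assumed available)

perceptual = lpips.LPIPS(net='vgg')

def ppl_sample(mapping, synthesis, z1, z2, eps=1e-4):
    # Perturb a random interpolation point in W by a small epsilon and
    # measure the perceptual change in the output, scaled by 1/eps^2.
    w1, w2 = mapping(z1), mapping(z2)
    t = torch.rand(())
    img_a = synthesis(torch.lerp(w1, w2, t))
    img_b = synthesis(torch.lerp(w1, w2, t + eps))
    return perceptual(img_a, img_b) / eps ** 2
```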
Truncation Trick: While not unique to StyleGAN, it's often used effectively here. By sampling w vectors and then moving them closer to the average w (calculated over many samples), one can trade diversity for average sample quality. This is done via w′ = w̄ + ψ(w − w̄), where w̄ is the average w and ψ ∈ [0, 1] is the truncation factor. A smaller ψ increases fidelity but reduces variety.
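The truncation step itself is a one-liner; the sketch below assumes w̄ is maintained as a running average of w vectors seen during training.

```python
import torch

def truncate(w, w_avg, psi=0.7):
    # psi = 1 keeps the original w; smaller psi trades diversity for fidelity.
    return w_avg + psi * (w - w_avg)

w_avg = torch.randn(512)               # in practice, a running mean over many w samples
w = torch.randn(4, 512)
w_trunc = truncate(w, w_avg, psi=0.5)  # closer to w̄: higher fidelity, less variety
```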
In summary, the StyleGAN architecture, with its mapping network, AdaIN-based style modulation, noise injection, and constant input, provides a powerful framework for generating high-resolution, high-quality images with significantly improved control over style attributes compared to previous GAN architectures. It laid the groundwork for subsequent improvements in StyleGAN2 and other state-of-the-art generative models.