In this hands-on section, you will implement the primary building blocks of the StyleGAN generator in PyTorch: the mapping network, adaptive instance normalization (AdaIN), noise injection, and a simplified synthesis block. Understanding these components at the code level gives you a solid grasp of the architectural concepts and prepares you for working with or modifying advanced generative models.

We assume you have a solid grasp of PyTorch, convolutional layers, and neural network fundamentals. Our focus here is on the unique aspects of StyleGAN's architecture.

## The Mapping Network

The mapping network's primary function is to transform the initial latent code $z$, typically sampled from a standard normal distribution $\mathcal{N}(0, I)$, into an intermediate latent space $W$. This intermediate space is often less entangled, allowing for more intuitive style control. The mapping network is usually implemented as a Multi-Layer Perceptron (MLP).

Let's implement a simplified mapping network consisting of several fully connected layers with LeakyReLU activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    def __init__(self, z_dim, w_dim, num_layers=8):
        """
        Initializes the Mapping Network.

        Args:
            z_dim (int): Dimensionality of the input latent code z.
            w_dim (int): Dimensionality of the output intermediate latent code w.
            num_layers (int): Number of linear layers in the mapping network.
        """
        super().__init__()
        self.z_dim = z_dim
        self.w_dim = w_dim
        self.num_layers = num_layers

        layers = []
        # Input layer normalization (optional but common)
        layers.append(nn.BatchNorm1d(z_dim))  # Or PixelNorm from the StyleGAN paper

        in_features = z_dim
        for i in range(num_layers):
            layers.append(nn.Linear(in_features, w_dim))
            layers.append(nn.LeakyReLU(0.2))
            in_features = w_dim  # Subsequent layers have w_dim features

        self.network = nn.Sequential(*layers)

    def forward(self, z):
        """
        Forward pass through the mapping network.

        Args:
            z (torch.Tensor): Input latent codes (Batch Size, z_dim).

        Returns:
            torch.Tensor: Output intermediate latent codes w (Batch Size, w_dim).
        """
        # Normalize z if needed (PixelNorm is often used in StyleGAN)
        # Example simple normalization:
        # z = z / torch.sqrt(torch.mean(z**2, dim=1, keepdim=True) + 1e-8)
        w = self.network(z)
        return w

# Example usage:
z_dim = 512
w_dim = 512
mapping_net = MappingNetwork(z_dim, w_dim)

# Generate a batch of random latent codes
z_input = torch.randn(16, z_dim)  # Batch size 16

# Obtain the intermediate latent codes
w_output = mapping_net(z_input)

print(f"Input z shape: {z_input.shape}")
print(f"Output w shape: {w_output.shape}")
```

In this implementation:

- We start with an optional normalization layer for $z$. StyleGAN often uses PixelNorm, but Batch Normalization is shown here for simplicity (a PixelNorm sketch follows this list).
- A sequence of `nn.Linear` layers transforms the input from the $z$ dimension to the target $w$ dimension.
- LeakyReLU activations are used between layers.

The final output `w` represents the vector in the intermediate latent space $W$. This $w$ will be used to control the styles in the synthesis network via AdaIN.
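For reference, PixelNorm normalizes each latent vector by its own root mean square rather than by batch statistics. Here is a minimal sketch matching the commented-out normalization in the `forward` pass above; you could swap it in for `nn.BatchNorm1d` in `MappingNetwork`.

```python
class PixelNorm(nn.Module):
    """Normalizes each sample's feature vector to unit average magnitude."""
    def __init__(self, eps=1e-8):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        # Divide each sample by the root mean square of its features
        return x / torch.sqrt(torch.mean(x ** 2, dim=1, keepdim=True) + self.eps)
```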
## Adaptive Instance Normalization (AdaIN)

Adaptive Instance Normalization is the mechanism StyleGAN uses to inject style information (derived from $w$) into the synthesis network at each resolution level. Recall the formula:

$$ \text{AdaIN}(x, y) = y_s \left( \frac{x - \mu(x)}{\sigma(x)} \right) + y_b $$

Here, $x$ is the activation map from a convolutional layer, and $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation of $x$ computed per channel, per sample (Instance Normalization). The scale $y_s$ and bias $y_b$ are derived from the intermediate latent code $w$ through learned affine transformations (typically linear layers).

Let's implement the AdaIN operation.

```python
class AdaIN(nn.Module):
    def __init__(self, num_channels, w_dim):
        """
        Initializes the AdaIN layer.

        Args:
            num_channels (int): Number of channels in the input feature map x.
            w_dim (int): Dimensionality of the intermediate latent code w.
        """
        super().__init__()
        # affine=False because we apply our own scale/bias
        self.instance_norm = nn.InstanceNorm2d(num_channels, affine=False)

        # Learned affine transformations to map w to style scales and biases
        self.style_scale_transform = nn.Linear(w_dim, num_channels)
        self.style_bias_transform = nn.Linear(w_dim, num_channels)

    def forward(self, x, w):
        """
        Forward pass for AdaIN.

        Args:
            x (torch.Tensor): Input feature map (Batch Size, Channels, Height, Width).
            w (torch.Tensor): Intermediate latent code (Batch Size, w_dim).

        Returns:
            torch.Tensor: Feature map modulated by style w
                (Batch Size, Channels, Height, Width).
        """
        # Normalize the input feature map per channel/sample
        normalized_x = self.instance_norm(x)

        # Compute style scales and biases from w
        # Shape of w: (Batch Size, w_dim)
        style_scale = self.style_scale_transform(w)  # Shape: (Batch Size, num_channels)
        style_bias = self.style_bias_transform(w)    # Shape: (Batch Size, num_channels)

        # Reshape scales and biases to match feature map dimensions for broadcasting
        # Target shape: (Batch Size, num_channels, 1, 1)
        style_scale = style_scale.unsqueeze(-1).unsqueeze(-1)
        style_bias = style_bias.unsqueeze(-1).unsqueeze(-1)

        # Apply the learned scale and bias
        transformed_x = style_scale * normalized_x + style_bias
        return transformed_x

# Example usage:
num_channels = 64
w_dim = 512
height, width = 32, 32
batch_size = 16

adain_layer = AdaIN(num_channels, w_dim)

# Dummy feature map and intermediate latent code
feature_map = torch.randn(batch_size, num_channels, height, width)
w_code = torch.randn(batch_size, w_dim)  # Usually comes from MappingNetwork

# Apply AdaIN
stylized_feature_map = adain_layer(feature_map, w_code)

print(f"Input feature map shape: {feature_map.shape}")
print(f"Input w shape: {w_code.shape}")
print(f"Output stylized feature map shape: {stylized_feature_map.shape}")
```

Important points about this implementation:

- `nn.InstanceNorm2d` with `affine=False` performs the normalization $(x - \mu(x)) / \sigma(x)$.
- Two separate `nn.Linear` layers learn to map the global style vector $w$ to per-channel scale ($y_s$) and bias ($y_b$) values specific to this layer in the synthesis network.
- The scales and biases are reshaped to `(Batch Size, num_channels, 1, 1)` so they can be broadcast correctly during the element-wise multiplication and addition.
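Since the scale and bias are functions of $w$, feeding the same feature map through AdaIN with two different latent codes yields differently styled outputs. A quick illustrative check, reusing the objects defined above:

```python
# Two different style codes restyle the same content differently
w_a = torch.randn(batch_size, w_dim)
w_b = torch.randn(batch_size, w_dim)

out_a = adain_layer(feature_map, w_a)
out_b = adain_layer(feature_map, w_b)

# The outputs differ because each w produces its own per-channel scale and bias
print(torch.allclose(out_a, out_b))  # False (with overwhelming probability)
```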
""" super().__init__() # Learnable scaling factor for the noise, one per channel # Initialized to zero, so noise has no effect at the start of training self.noise_weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1)) def forward(self, x): """ Adds scaled noise to the input feature map. Args: x (torch.Tensor): Input feature map (Batch Size, Channels, Height, Width). Returns: torch.Tensor: Feature map with added noise. """ batch_size, _, height, width = x.shape # Generate noise on the correct device, matching input tensor type noise = torch.randn(batch_size, 1, height, width, device=x.device, dtype=x.dtype) # Scale noise by learned weights and add to feature map noisy_x = x + self.noise_weight * noise return noisy_x # Example usage: noise_layer = AddNoise(num_channels=64) # Using the feature map from previous example output_with_noise = noise_layer(feature_map) # Can be applied before or after AdaIN/activation print(f"Shape after adding noise: {output_with_noise.shape}")This simple module creates noise of the same spatial resolution as the input x, scales it using a learnable weight (noise_weight), and adds it.Simplified Synthesis Network BlockNow, let's combine these components into a representative block of the StyleGAN synthesis network. A typical block might involve:Upsampling (for increasing resolution, except in the first block).Convolutional layer.Noise Injection.Activation function (e.g., LeakyReLU).AdaIN (using $w$).Another Convolutional layer.Noise Injection.Activation function.AdaIN (using $w$).Here's a simplified block structure:class SynthesisBlock(nn.Module): def __init__(self, in_channels, out_channels, w_dim, kernel_size=3, upsample=True): """ Initializes a simplified StyleGAN Synthesis Block. Args: in_channels (int): Input channels. out_channels (int): Output channels. w_dim (int): Dimension of intermediate latent code w. kernel_size (int): Kernel size for convolutions. upsample (bool): Whether to perform upsampling at the beginning of the block. """ super().__init__() self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False) if upsample else None padding = kernel_size // 2 self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=padding) self.noise1 = AddNoise(out_channels) self.adain1 = AdaIN(out_channels, w_dim) self.activation1 = nn.LeakyReLU(0.2) self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=kernel_size, padding=padding) self.noise2 = AddNoise(out_channels) self.adain2 = AdaIN(out_channels, w_dim) self.activation2 = nn.LeakyReLU(0.2) def forward(self, x, w): """ Forward pass through the synthesis block. Args: x (torch.Tensor): Input feature map. w (torch.Tensor): Intermediate latent code w. Returns: torch.Tensor: Output feature map from the block. 
""" if self.upsample: x = self.upsample(x) # First convolution sequence x = self.conv1(x) x = self.noise1(x) x = self.activation1(x) x = self.adain1(x, w) # Second convolution sequence x = self.conv2(x) x = self.noise2(x) x = self.activation2(x) x = self.adain2(x, w) return x # Example usage: # Assuming we have output from a previous block or initial constant input # Start with a constant learned input for the first block (e.g., 4x4 resolution) initial_input = torch.randn(batch_size, 512, 4, 4) # Example: 512 channels at 4x4 w_code = torch.randn(batch_size, w_dim) # From Mapping Network # Example: Block going from 512 channels (4x4) to 256 channels (8x8) block_4x4_to_8x8 = SynthesisBlock(in_channels=512, out_channels=256, w_dim=w_dim, upsample=True) output_8x8 = block_4x4_to_8x8(initial_input, w_code) print(f"Output shape of 8x8 block: {output_8x8.shape}") # Example: Block maintaining 256 channels (8x8 to 8x8 - maybe first block doesn't upsample) # block_8x8_to_8x8 = SynthesisBlock(in_channels=256, out_channels=256, w_dim=w_dim, upsample=False) # output_8x8_v2 = block_8x8_to_8x8(output_8x8, w_code) # print(f"Output shape of next 8x8 block: {output_8x8_v2.shape}")This block structure demonstrates how convolution, noise injection, activation, and AdaIN are interleaved. Notice how the same $w$ vector is used in both AdaIN layers within the block, providing consistent style modulation at this resolution level.Data FlowThe following diagram illustrates the flow within a single synthesis block, highlighting how the intermediate latent code $w$ influences the process via AdaIN and potentially noise scaling (though our AddNoise used learnable weights independent of $w$ for simplicity; some variations might scale noise based on $w$ too).digraph G { rankdir=LR; node [shape=box, style=filled, fillcolor="#a5d8ff", fontname="helvetica"]; edge [fontname="helvetica"]; subgraph cluster_mapping { label = "Mapping Network"; style=dashed; fillcolor="#e9ecef"; node [fillcolor="#ced4da"]; z [label="z (Latent Code)"]; map_net [label="MLP", shape=oval]; w [label="w (Intermediate Latent)"]; z -> map_net -> w; } subgraph cluster_synthesis { label = "Synthesis Block"; style=dashed; fillcolor="#e9ecef"; node [fillcolor="#96f2d7"]; x_in [label="Input Feature Map (x)", shape=note, fillcolor="#ffec99"]; upsample [label="Upsample (Optional)"]; conv1 [label="Conv2D"]; noise1 [label="Add Noise"]; act1 [label="LeakyReLU"]; adain1 [label="AdaIN"]; conv2 [label="Conv2D"]; noise2 [label="Add Noise"]; act2 [label="LeakyReLU"]; adain2 [label="AdaIN"]; x_out [label="Output Feature Map", shape=note, fillcolor="#ffec99"]; x_in -> upsample [style=dotted]; // Optional path upsample -> conv1; x_in -> conv1 [style=dotted]; // If no upsample conv1 -> noise1; noise1 -> act1; act1 -> adain1; adain1 -> conv2; conv2 -> noise2; noise2 -> act2; act2 -> adain2; adain2 -> x_out; w -> adain1 [label=" Style Control", color="#f06595", fontcolor="#f06595", style=dashed, arrowhead=open]; w -> adain2 [label=" Style Control", color="#f06595", fontcolor="#f06595", style=dashed, arrowhead=open]; # Noise inputs n1_src [label="Noise Source", shape=cds, fillcolor="#d0bfff"]; n2_src [label="Noise Source", shape=cds, fillcolor="#d0bfff"]; n1_src -> noise1 [style=dashed, arrowhead=open, color="#7950f2"]; n2_src -> noise2 [style=dashed, arrowhead=open, color="#7950f2"]; } w [ peripheries=2 ]; # Make w stand out slightly }Simplified data flow within one StyleGAN synthesis block. 
## Further Notes

- **Weight Initialization:** Proper initialization (e.g., He initialization) is important for stable training.
- **Weight Demodulation (StyleGAN2):** StyleGAN2 replaced AdaIN and instance normalization with "weight demodulation" to address normalization artifacts. This involves scaling the convolution weights directly based on $w$. Implementing it is a more advanced step.
- **Learning Rate Equalization:** StyleGAN uses equalized learning rates, scaling weights at runtime to normalize their dynamic range, which helps when using optimizers like Adam with parameters of varying scale. A minimal sketch appears at the end of this section.
- **Progressive Growing / Resolution Handling:** A full implementation needs a mechanism to handle increasing resolutions, either through progressive growing (adding layers during training) or by designing the network to output multiple resolutions simultaneously.
- **Constant Input:** The synthesis network typically starts not from $z$ but from a learned constant tensor (e.g., a $4 \times 4 \times C$ tensor), which is then styled and processed through the blocks.

This practical exercise focused on the mechanics of the Mapping Network, AdaIN, and Noise Injection. By implementing these core parts, you gain a deeper appreciation for how StyleGAN achieves its fine-grained control over the generation process. Building and training a full StyleGAN model requires integrating these components carefully and applying advanced training stabilization techniques, as discussed in later chapters.
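To make the learning rate equalization note above concrete, here is a minimal sketch of an equalized linear layer. The scaling constant follows He initialization; treat this as an illustration of the idea rather than the exact StyleGAN implementation.

```python
import math

class EqualizedLinear(nn.Module):
    """Linear layer with runtime weight scaling (equalized learning rate)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # Weights are stored at unit scale and rescaled at every forward pass,
        # so Adam sees parameters with a uniform dynamic range
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # He-initialization constant applied at runtime instead of at init
        self.scale = math.sqrt(2.0 / in_features)

    def forward(self, x):
        # F was imported as torch.nn.functional at the top of this section
        return F.linear(x, self.weight * self.scale, self.bias)

# Could serve as a drop-in replacement for nn.Linear, e.g., in MappingNetwork or AdaIN
eq_linear = EqualizedLinear(512, 512)
print(eq_linear(torch.randn(16, 512)).shape)  # torch.Size([16, 512])
```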