Having explored the architectural concepts behind StyleGAN, including its mapping network and style-based generator, let's transition from theory to practice. This section provides hands-on guidance for implementing some core building blocks of the StyleGAN generator using PyTorch. Understanding these components at the code level solidifies the concepts and prepares you for working with or modifying advanced generative models.
We assume you have a solid grasp of PyTorch, convolutional layers, and neural network fundamentals. Our focus here is on the unique aspects of StyleGAN's architecture.
The mapping network's primary function is to transform the initial latent code z, typically sampled from a standard normal distribution N(0,I), into an intermediate latent space W. This intermediate space W is often less entangled, allowing for more intuitive style control. The mapping network is usually implemented as a Multi-Layer Perceptron (MLP).
Let's implement a simplified mapping network. It will consist of several fully connected layers with LeakyReLU activation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    def __init__(self, z_dim, w_dim, num_layers=8):
        """
        Initializes the Mapping Network.

        Args:
            z_dim (int): Dimensionality of the input latent code z.
            w_dim (int): Dimensionality of the output intermediate latent code w.
            num_layers (int): Number of linear layers in the mapping network.
        """
        super().__init__()
        self.z_dim = z_dim
        self.w_dim = w_dim
        self.num_layers = num_layers

        layers = []
        # Input normalization (optional but common)
        layers.append(nn.BatchNorm1d(z_dim))  # Or PixelNorm from the StyleGAN paper

        in_features = z_dim
        for i in range(num_layers):
            layers.append(nn.Linear(in_features, w_dim))
            layers.append(nn.LeakyReLU(0.2))
            in_features = w_dim  # Subsequent layers have w_dim features

        self.network = nn.Sequential(*layers)

    def forward(self, z):
        """
        Forward pass through the mapping network.

        Args:
            z (torch.Tensor): Input latent codes (Batch Size, z_dim).

        Returns:
            torch.Tensor: Output intermediate latent codes w (Batch Size, w_dim).
        """
        # Normalize z if needed (PixelNorm is often used in StyleGAN)
        # Example simple normalization:
        # z = z / torch.sqrt(torch.mean(z**2, dim=1, keepdim=True) + 1e-8)
        w = self.network(z)
        return w

# Example usage:
z_dim = 512
w_dim = 512
mapping_net = MappingNetwork(z_dim, w_dim)

# Generate a batch of random latent codes
z_input = torch.randn(16, z_dim)  # Batch size 16

# Obtain the intermediate latent codes
w_output = mapping_net(z_input)

print(f"Input z shape: {z_input.shape}")
print(f"Output w shape: {w_output.shape}")
In this implementation, w represents the vector in the intermediate latent space W. This w will be used to control the styles in the synthesis network via AdaIN.

Adaptive Instance Normalization (AdaIN) is the mechanism StyleGAN uses to inject style information (derived from w) into the synthesis network at each resolution level. Recall the formula:

AdaIN(x, y) = y_s · ((x − μ(x)) / σ(x)) + y_b

Here, x is the activation map from a convolutional layer, and μ(x) and σ(x) are the mean and standard deviation of x computed per channel, per sample (Instance Normalization). The scale y_s and bias y_b are derived from the intermediate latent code w through learned affine transformations (typically linear layers).
Let's implement the AdaIN operation.
class AdaIN(nn.Module):
    def __init__(self, num_channels, w_dim):
        """
        Initializes the AdaIN layer.

        Args:
            num_channels (int): Number of channels in the input feature map x.
            w_dim (int): Dimensionality of the intermediate latent code w.
        """
        super().__init__()
        # affine=False because we apply our own learned scale/bias
        self.instance_norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Learned affine transformations that map w to per-channel style scales and biases
        self.style_scale_transform = nn.Linear(w_dim, num_channels)
        self.style_bias_transform = nn.Linear(w_dim, num_channels)

    def forward(self, x, w):
        """
        Forward pass for AdaIN.

        Args:
            x (torch.Tensor): Input feature map (Batch Size, Channels, Height, Width).
            w (torch.Tensor): Intermediate latent code (Batch Size, w_dim).

        Returns:
            torch.Tensor: Feature map modulated by style w (Batch Size, Channels, Height, Width).
        """
        # Normalize the input feature map per channel, per sample
        normalized_x = self.instance_norm(x)

        # Compute style scales and biases from w
        # Shape of w: (Batch Size, w_dim)
        style_scale = self.style_scale_transform(w)  # Shape: (Batch Size, num_channels)
        style_bias = self.style_bias_transform(w)    # Shape: (Batch Size, num_channels)

        # Reshape scales and biases to (Batch Size, num_channels, 1, 1) for broadcasting
        style_scale = style_scale.unsqueeze(-1).unsqueeze(-1)
        style_bias = style_bias.unsqueeze(-1).unsqueeze(-1)

        # Apply the learned scale and bias
        transformed_x = style_scale * normalized_x + style_bias
        return transformed_x

# Example usage:
num_channels = 64
w_dim = 512
height, width = 32, 32
batch_size = 16

adain_layer = AdaIN(num_channels, w_dim)

# Dummy feature map and intermediate latent code
feature_map = torch.randn(batch_size, num_channels, height, width)
w_code = torch.randn(batch_size, w_dim)  # Usually comes from the MappingNetwork

# Apply AdaIN
stylized_feature_map = adain_layer(feature_map, w_code)

print(f"Input feature map shape: {feature_map.shape}")
print(f"Input w shape: {w_code.shape}")
print(f"Output stylized feature map shape: {stylized_feature_map.shape}")
Key points about this implementation:

- nn.InstanceNorm2d with affine=False performs the normalization (x − μ(x)) / σ(x).
- The nn.Linear layers learn to map the global style vector w to per-channel scale (y_s) and bias (y_b) values specific to this layer in the synthesis network.
- The scales and biases are reshaped to (Batch Size, num_channels, 1, 1) so they can be broadcast correctly during the element-wise multiplication and addition.

StyleGAN introduces explicit noise inputs at different layers of the synthesis network. This noise provides a way for the generator to model stochastic details (like hair placement or freckles) that aren't easily controlled by the global style vector w. The noise is typically Gaussian, scaled by learned per-channel weights, and added directly to the feature maps.
class AddNoise(nn.Module):
    def __init__(self, num_channels):
        """
        Initializes the Noise Injection layer.

        Args:
            num_channels (int): Number of channels in the feature map where noise is added.
        """
        super().__init__()
        # Learnable scaling factor for the noise, one per channel.
        # Initialized to zero, so noise has no effect at the start of training.
        self.noise_weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        """
        Adds scaled noise to the input feature map.

        Args:
            x (torch.Tensor): Input feature map (Batch Size, Channels, Height, Width).

        Returns:
            torch.Tensor: Feature map with added noise.
        """
        batch_size, _, height, width = x.shape
        # Generate noise on the correct device, matching the input tensor's dtype
        noise = torch.randn(batch_size, 1, height, width, device=x.device, dtype=x.dtype)
        # Scale noise by the learned per-channel weights and add it to the feature map
        noisy_x = x + self.noise_weight * noise
        return noisy_x

# Example usage:
noise_layer = AddNoise(num_channels=64)

# Using the feature map from the previous example
output_with_noise = noise_layer(feature_map)  # Can be applied before or after AdaIN/activation

print(f"Shape after adding noise: {output_with_noise.shape}")
This simple module creates noise with the same spatial resolution as the input x, scales it using a learnable weight (noise_weight), and adds it to the feature map.
Now, let's combine these components into a representative block of the StyleGAN synthesis network. A typical block might involve:

- Optional upsampling of the incoming feature map to move to the next resolution.
- A convolution, followed by noise injection, an activation, and AdaIN.
- A second convolution, again followed by noise injection, an activation, and AdaIN.

Here's a simplified block structure:
class SynthesisBlock(nn.Module):
    def __init__(self, in_channels, out_channels, w_dim, kernel_size=3, upsample=True):
        """
        Initializes a simplified StyleGAN Synthesis Block.

        Args:
            in_channels (int): Input channels.
            out_channels (int): Output channels.
            w_dim (int): Dimension of intermediate latent code w.
            kernel_size (int): Kernel size for convolutions.
            upsample (bool): Whether to perform upsampling at the beginning of the block.
        """
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False) if upsample else None
        padding = kernel_size // 2

        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=padding)
        self.noise1 = AddNoise(out_channels)
        self.adain1 = AdaIN(out_channels, w_dim)
        self.activation1 = nn.LeakyReLU(0.2)

        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=kernel_size, padding=padding)
        self.noise2 = AddNoise(out_channels)
        self.adain2 = AdaIN(out_channels, w_dim)
        self.activation2 = nn.LeakyReLU(0.2)

    def forward(self, x, w):
        """
        Forward pass through the synthesis block.

        Args:
            x (torch.Tensor): Input feature map.
            w (torch.Tensor): Intermediate latent code w.

        Returns:
            torch.Tensor: Output feature map from the block.
        """
        if self.upsample is not None:
            x = self.upsample(x)

        # First convolution sequence
        x = self.conv1(x)
        x = self.noise1(x)
        x = self.activation1(x)
        x = self.adain1(x, w)

        # Second convolution sequence
        x = self.conv2(x)
        x = self.noise2(x)
        x = self.activation2(x)
        x = self.adain2(x, w)

        return x

# Example usage:
# Assuming we have output from a previous block or an initial constant input.
# Start with a constant learned input for the first block (e.g., 4x4 resolution).
initial_input = torch.randn(batch_size, 512, 4, 4)  # Example: 512 channels at 4x4
w_code = torch.randn(batch_size, w_dim)             # From the Mapping Network

# Example: Block going from 512 channels (4x4) to 256 channels (8x8)
block_4x4_to_8x8 = SynthesisBlock(in_channels=512, out_channels=256, w_dim=w_dim, upsample=True)
output_8x8 = block_4x4_to_8x8(initial_input, w_code)
print(f"Output shape of 8x8 block: {output_8x8.shape}")

# Example: Block maintaining 256 channels (8x8 to 8x8 - e.g., the first block doesn't upsample)
# block_8x8_to_8x8 = SynthesisBlock(in_channels=256, out_channels=256, w_dim=w_dim, upsample=False)
# output_8x8_v2 = block_8x8_to_8x8(output_8x8, w_code)
# print(f"Output shape of next 8x8 block: {output_8x8_v2.shape}")
This block structure demonstrates how convolution, noise injection, activation, and AdaIN are interleaved. Notice how the same w vector is used in both AdaIN layers within the block, providing consistent style modulation at this resolution level.
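To see how these pieces fit together end to end, here is a minimal sketch (not the full StyleGAN architecture) that chains the MappingNetwork with a learned constant input, a few SynthesisBlocks, and a final 1x1 convolution to produce an RGB image. The TinyGenerator name, the channel counts, and the single "toRGB" layer are illustrative choices, not the original implementation.

class TinyGenerator(nn.Module):
    """Illustrative sketch: constant input -> several SynthesisBlocks -> RGB image."""
    def __init__(self, z_dim=512, w_dim=512):
        super().__init__()
        self.mapping = MappingNetwork(z_dim, w_dim)
        # Learned constant input at 4x4 resolution, as in StyleGAN's synthesis network
        self.const_input = nn.Parameter(torch.randn(1, 512, 4, 4))
        self.blocks = nn.ModuleList([
            SynthesisBlock(512, 256, w_dim, upsample=True),   # 4x4   -> 8x8
            SynthesisBlock(256, 128, w_dim, upsample=True),   # 8x8   -> 16x16
            SynthesisBlock(128, 64,  w_dim, upsample=True),   # 16x16 -> 32x32
        ])
        # Simple 1x1 convolution mapping features to RGB (a single "toRGB" for brevity)
        self.to_rgb = nn.Conv2d(64, 3, kernel_size=1)

    def forward(self, z):
        w = self.mapping(z)                                   # (Batch, w_dim)
        x = self.const_input.expand(z.shape[0], -1, -1, -1)   # Broadcast constant to batch
        for block in self.blocks:
            x = block(x, w)                                   # Same w modulates every block here
        return self.to_rgb(x)                                 # (Batch, 3, 32, 32)

generator = TinyGenerator()
images = generator(torch.randn(4, 512))
print(f"Generated image batch shape: {images.shape}")  # Expected: torch.Size([4, 3, 32, 32])

In this sketch every block receives the same w; a fuller implementation would typically allow different w vectors at different resolutions, which is what enables style mixing between coarse and fine attributes.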
The following diagram illustrates the flow within a single synthesis block, highlighting how the intermediate latent code w influences the process via AdaIN and, potentially, noise scaling (our AddNoise uses learnable weights independent of w for simplicity; some variations scale the noise based on w as well, as sketched below).
Simplified data flow within one StyleGAN synthesis block. The intermediate latent code w, derived from the initial latent z via the mapping network, modulates the feature maps x through AdaIN layers. Noise is added independently.
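As a concrete illustration of that variation, here is a hedged sketch of a noise layer whose per-channel scaling is derived from w through an extra linear layer. This is not part of the original StyleGAN design; the class name NoiseFromW and its structure are purely illustrative.

class NoiseFromW(nn.Module):
    """Illustrative variant: per-channel noise scale computed from w instead of a free parameter."""
    def __init__(self, num_channels, w_dim):
        super().__init__()
        # Maps w to one noise-scaling factor per channel
        self.scale_transform = nn.Linear(w_dim, num_channels)

    def forward(self, x, w):
        batch_size, _, height, width = x.shape
        noise = torch.randn(batch_size, 1, height, width, device=x.device, dtype=x.dtype)
        # (Batch, num_channels) -> (Batch, num_channels, 1, 1) for broadcasting
        scale = self.scale_transform(w).unsqueeze(-1).unsqueeze(-1)
        return x + scale * noise

# Usage mirrors AddNoise, but the forward pass also takes w:
# noisy = NoiseFromW(num_channels=64, w_dim=512)(feature_map, w_code)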
This practical exercise focused on the mechanics of the Mapping Network, AdaIN, and Noise Injection. By implementing these core parts, you gain a deeper appreciation for how StyleGAN achieves its fine-grained control over the generation process. Building and training a full StyleGAN model requires integrating these components carefully and applying advanced training stabilization techniques, as discussed in later chapters.