This practical section focuses on implementing the fundamental building blocks of the StyleGAN generator architecture. We will build upon the discussions of the Mapping Network, the Synthesis Network, and the Adaptive Instance Normalization (AdaIN) mechanism covered earlier in the chapter. Understanding these components is essential for grasping how StyleGAN achieves its high-fidelity results and controllable synthesis.
We assume you are comfortable with PyTorch (or TensorFlow, though examples will be in PyTorch) and have experience implementing custom neural network layers and architectures, including convolutional networks.
The first step in StyleGAN's generation process is transforming the initial latent code z, typically drawn from a standard normal distribution N(0,I), into an intermediate latent space W. This transformation is performed by the Mapping Network, usually a multi-layer perceptron (MLP). Its purpose is to disentangle the latent space, meaning that variations in W should correspond more directly to distinct semantic attributes in the generated image, compared to variations in Z.
A typical Mapping Network consists of several fully connected layers with activation functions like LeakyReLU or ReLU.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingNetwork(nn.Module):
    def __init__(self, z_dim, w_dim, num_layers=8):
        """
        Initializes the Mapping Network.

        Args:
            z_dim (int): Dimensionality of the input latent code z.
            w_dim (int): Dimensionality of the output intermediate latent code w.
            num_layers (int): Number of linear layers in the network.
        """
        super().__init__()
        self.z_dim = z_dim
        self.w_dim = w_dim
        self.num_layers = num_layers

        layers = []
        # Input layer normalization (optional but common)
        # layers.append(nn.LayerNorm(z_dim))  # Or PixelNorm

        # Hidden layers
        current_dim = z_dim
        for _ in range(num_layers):
            layers.append(nn.Linear(current_dim, w_dim))
            layers.append(nn.LeakyReLU(0.2))
            current_dim = w_dim  # Subsequent layers use w_dim

        self.network = nn.Sequential(*layers)

    def forward(self, z):
        """
        Forward pass through the Mapping Network.

        Args:
            z (torch.Tensor): Input latent codes (batch_size, z_dim).

        Returns:
            torch.Tensor: Output intermediate latent codes w (batch_size, w_dim).
        """
        # Normalize z (optional, depends on specific StyleGAN version)
        # z = F.normalize(z, dim=1) * (self.z_dim ** 0.5)
        w = self.network(z)
        return w

# Example Usage:
z_dim = 512
w_dim = 512
batch_size = 4

mapping_net = MappingNetwork(z_dim, w_dim)
z_input = torch.randn(batch_size, z_dim)
w_output = mapping_net(z_input)

print(f"Input z shape: {z_input.shape}")
print(f"Output w shape: {w_output.shape}")
This implementation creates an 8-layer MLP. In practice, normalization techniques might be applied to z before the network, and different activation functions or layer normalization schemes could be employed within the MLP itself. The key idea is the non-linear transformation from Z to W.
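One such normalization is pixel-wise feature vector normalization (PixelNorm), which the official StyleGAN implementation applies to z before the MLP. The minimal sketch below shows one way it could look; the class name and the suggestion of where to insert it are assumptions for illustration, not part of the code above.

class PixelNorm(nn.Module):
    """Normalizes each sample's feature vector to unit average magnitude."""
    def __init__(self, eps=1e-8):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        # x: (batch_size, dim). Divide by the root-mean-square of the features.
        return x * torch.rsqrt(torch.mean(x ** 2, dim=1, keepdim=True) + self.eps)

# It could be prepended to the layer list in MappingNetwork.__init__,
# e.g. layers.insert(0, PixelNorm()), so z is normalized before the first Linear layer.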
AdaIN is the mechanism StyleGAN uses to inject the style information, encoded in the intermediate latent vector w, into the Synthesis Network. Like Instance Normalization (and unlike Batch Normalization, which normalizes across the batch), AdaIN normalizes the activations x of a convolutional layer per channel, per sample, across the spatial dimensions. The difference is what happens afterwards: instead of learned constant affine parameters, AdaIN scales and shifts the normalized activations using parameters derived from the style vector w.
The AdaIN operation is defined as:
$$\mathrm{AdaIN}(x_i, w) = \gamma_i(w)\,\frac{x_i - \mu(x_i)}{\sqrt{\sigma^2(x_i) + \epsilon}} + \beta_i(w)$$

Here, x_i represents the activations for the i-th channel. μ(x_i) and σ(x_i) are the mean and standard deviation calculated across the spatial dimensions (height and width) for that channel. ϵ is a small constant for numerical stability. The important part is that the scale γ_i(w) and bias β_i(w) parameters are learned functions of the intermediate latent code w. Typically, these are produced by applying separate learned affine transformations (linear layers) to w for each channel i at each layer where AdaIN is used.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, channels, w_dim):
        """
        Initializes the Adaptive Instance Normalization layer.

        Args:
            channels (int): Number of channels in the input activation map.
            w_dim (int): Dimensionality of the intermediate latent code w.
        """
        super().__init__()
        self.channels = channels
        self.w_dim = w_dim
        # affine=False because we provide our own scale/bias below
        self.instance_norm = nn.InstanceNorm2d(channels, affine=False)
        # Learned affine transformations to map w to scale (gamma) and bias (beta)
        self.style_scale_transform = nn.Linear(w_dim, channels)
        self.style_bias_transform = nn.Linear(w_dim, channels)

    def forward(self, x, w):
        """
        Forward pass for AdaIN.

        Args:
            x (torch.Tensor): Input activation map (batch_size, channels, height, width).
            w (torch.Tensor): Intermediate latent code (batch_size, w_dim).

        Returns:
            torch.Tensor: Activation map after applying AdaIN (batch_size, channels, height, width).
        """
        # Normalize the input activations per channel
        normalized_x = self.instance_norm(x)

        # Generate scale and bias parameters from w
        # Shape: (batch_size, channels)
        style_scale = self.style_scale_transform(w)
        style_bias = self.style_bias_transform(w)

        # Reshape scale and bias for broadcasting: (batch_size, channels, 1, 1)
        style_scale = style_scale.view(-1, self.channels, 1, 1)
        style_bias = style_bias.view(-1, self.channels, 1, 1)

        # Apply the learned scale and bias
        transformed_x = style_scale * normalized_x + style_bias
        return transformed_x

# Example Usage:
batch_size = 4
channels = 64
height, width = 32, 32
w_dim = 512

adain_layer = AdaIN(channels, w_dim)
x_input = torch.randn(batch_size, channels, height, width)  # Example activation map
w_vector = torch.randn(batch_size, w_dim)                   # Example intermediate latent

output_x = adain_layer(x_input, w_vector)

print(f"Input x shape: {x_input.shape}")
print(f"Input w shape: {w_vector.shape}")
print(f"Output x shape: {output_x.shape}")
Notice how InstanceNorm2d is used with affine=False because the affine parameters (γ, β) are generated dynamically from w via the style_scale_transform and style_bias_transform layers.
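To connect the code back to the formula, the normalization that InstanceNorm2d(affine=False) performs can be reproduced by hand using the per-channel spatial mean and (biased) variance. The short check below is an illustrative sketch, not part of the AdaIN module itself.

# Sketch: verify that InstanceNorm2d(affine=False) matches the manual
# per-channel normalization (x - mu) / sqrt(var + eps) from the AdaIN formula.
x_check = torch.randn(2, 8, 16, 16)
inorm = nn.InstanceNorm2d(8, affine=False)  # default eps=1e-5

mu = x_check.mean(dim=(2, 3), keepdim=True)                   # per-sample, per-channel mean
var = x_check.var(dim=(2, 3), keepdim=True, unbiased=False)   # per-sample, per-channel variance
manual = (x_check - mu) / torch.sqrt(var + 1e-5)

print(torch.allclose(inorm(x_check), manual, atol=1e-4))  # Expected: True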
The Synthesis Network progressively generates the image, typically starting from a learned constant tensor and gradually increasing resolution. Each resolution level usually involves one or more blocks containing upsampling, convolution, noise injection, AdaIN, and activation functions.
Let's focus on implementing a single StyleGAN block. This block typically takes an activation map from the previous layer and the style vector w, applies a convolution, injects learned noise, applies AdaIN, and then uses an activation function. Upsampling might occur before or within the block depending on the specific StyleGAN version and resolution level.
Noise Injection: After each convolution and before the activation function, StyleGAN adds a per-pixel Gaussian noise map to the feature maps, scaled by a learned per-channel factor. This introduces stochastic details (like hair placement or freckles) that do not need to be controlled by w.
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    def __init__(self, channels):
        """
        Initializes the Noise Injection layer.

        Args:
            channels (int): Number of channels to apply noise to.
        """
        super().__init__()
        # Learnable scaling factor per channel
        # Initialized to 0 so initially noise has no effect
        self.weight = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x, noise=None):
        """
        Forward pass for Noise Injection.

        Args:
            x (torch.Tensor): Input activation map (batch_size, channels, height, width).
            noise (torch.Tensor, optional): Pre-generated noise tensor.
                If None, generate noise. Defaults to None.

        Returns:
            torch.Tensor: Activation map with added scaled noise.
        """
        if noise is None:
            batch, _, height, width = x.shape
            # Generate a single-channel noise map on the correct device;
            # it is broadcast across channels and scaled per channel by self.weight
            noise = torch.randn(batch, 1, height, width, device=x.device, dtype=x.dtype)
        # Add scaled noise to the input
        return x + self.weight * noise

# Combining into a Synthesis Block
class SynthesisBlock(nn.Module):
    def __init__(self, in_channels, out_channels, w_dim, use_upsample=True):
        """
        Initializes a basic StyleGAN Synthesis Block.

        Args:
            in_channels (int): Input channels.
            out_channels (int): Output channels.
            w_dim (int): Dimensionality of the intermediate latent code w.
            use_upsample (bool): Whether to include nearest-neighbor upsampling.
        """
        super().__init__()
        self.use_upsample = use_upsample
        if self.use_upsample:
            self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.noise_injection = NoiseInjection(out_channels)
        self.adain = AdaIN(out_channels, w_dim)
        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x, w, noise=None):
        """
        Forward pass for the Synthesis Block.

        Args:
            x (torch.Tensor): Input activation map (batch_size, in_channels, H, W).
            w (torch.Tensor): Intermediate latent code (batch_size, w_dim).
            noise (torch.Tensor, optional): Optional pre-generated noise for this block.

        Returns:
            torch.Tensor: Output activation map (batch_size, out_channels, H or 2H, W or 2W).
        """
        if self.use_upsample:
            x = self.upsample(x)
        x = self.conv(x)
        x = self.noise_injection(x, noise=noise)  # Pass optional noise
        x = self.adain(x, w)
        x = self.activation(x)
        return x
# Example Usage:
batch_size = 4
in_channels = 128
out_channels = 64
height, width = 16, 16
w_dim = 512
# Block with upsampling
synth_block_upsample = SynthesisBlock(in_channels, out_channels, w_dim, use_upsample=True)
x_in_low_res = torch.randn(batch_size, in_channels, height, width)
w_vector = torch.randn(batch_size, w_dim)
output_high_res = synth_block_upsample(x_in_low_res, w_vector)
print(f"Input x shape (low res): {x_in_low_res.shape}")
print(f"Output x shape (high res): {output_high_res.shape}") # Should be 32x32
# Block without upsampling (e.g., first block or later StyleGAN2 versions)
synth_block_no_upsample = SynthesisBlock(out_channels, out_channels, w_dim, use_upsample=False)
output_same_res = synth_block_no_upsample(output_high_res, w_vector)
print(f"Input x shape (high res): {output_high_res.shape}")
print(f"Output x shape (same res): {output_same_res.shape}") # Should be 32x32
The diagram below illustrates the high-level flow within the generator, emphasizing the role of the Mapping Network and AdaIN within a Synthesis Block.
High-level data flow in StyleGAN. The Mapping Network transforms z to w. The Synthesis Network uses w via AdaIN layers within its blocks to control the style of the generated features, while injected noise adds stochastic details.
A complete Synthesis Network stacks these blocks, starting from a learned constant tensor (e.g., 4x4x512) and progressively increasing resolution (e.g., 4x4 -> 8x8 -> ... -> 1024x1024). Each resolution typically has two blocks (one without upsampling, one with).

This hands-on section provided implementations for the core components enabling StyleGAN's unique capabilities. Building a full, optimized StyleGAN requires careful integration of these parts, along with training strategies like progressive growing (or equivalent fixed-architecture techniques) and regularization methods like style mixing. Referencing the official implementations and the associated papers is highly recommended for tackling a complete build.
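As a rough sketch of how these pieces could be wired together, a minimal synthesis network might start from a learned 4x4 constant and chain the SynthesisBlock modules defined above, each consuming the same w. The class below, including its channel schedule and block count, is an illustrative assumption rather than the official architecture.

class SimpleSynthesisNetwork(nn.Module):
    """Illustrative sketch: a learned 4x4 constant followed by a chain of SynthesisBlocks."""
    def __init__(self, w_dim=512, base_channels=512, img_channels=3, num_upsample_blocks=3):
        super().__init__()
        # Learned constant starting tensor (broadcast across the batch)
        self.const = nn.Parameter(torch.randn(1, base_channels, 4, 4))
        self.blocks = nn.ModuleList()
        channels = base_channels
        for _ in range(num_upsample_blocks):
            next_channels = max(channels // 2, 32)
            # One upsampling block per resolution step (a full network usually has two blocks per resolution)
            self.blocks.append(SynthesisBlock(channels, next_channels, w_dim, use_upsample=True))
            channels = next_channels
        # Project the final feature maps to RGB
        self.to_rgb = nn.Conv2d(channels, img_channels, kernel_size=1)

    def forward(self, w):
        batch_size = w.shape[0]
        x = self.const.expand(batch_size, -1, -1, -1)
        for block in self.blocks:
            x = block(x, w)
        return self.to_rgb(x)

# Example: z -> w -> image
synthesis_net = SimpleSynthesisNetwork(w_dim=w_dim)
w = mapping_net(torch.randn(4, z_dim))
fake_images = synthesis_net(w)
print(f"Generated image batch shape: {fake_images.shape}")  # (4, 3, 32, 32) with 3 upsampling blocks

Style mixing would then amount to passing different w vectors to different blocks instead of the same w throughout, which is one of the regularization strategies mentioned above.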