Conditional GANs (cGANs) allow you to direct the generator's output by providing additional information, typically a class label or some other attribute, denoted $y$. Building and training a cGAN involves a handful of practical steps, the central one being how to integrate this conditional input $y$ into both the generator and the discriminator networks.

We'll use the familiar MNIST dataset as our example. It consists of grayscale images of handwritten digits (0-9), making the digit label the natural choice for our condition $y$. Our goal is to train a generator that produces an image of a specific digit when prompted with the corresponding label.

## Preparing Data with Conditional Labels

First, load your dataset (e.g., MNIST) using your preferred deep learning framework's utilities. Unlike standard GAN training, where we only need the images $x$, a cGAN also needs the corresponding labels $y$. Ensure your data loader provides pairs $(x, y)$.

The labels $y$ are typically integers (0 to 9 for MNIST). Since neural networks work best with numerical vectors, we need to convert these integer labels into a suitable format. A common and effective approach is to use an embedding layer, which represents each label as a learnable vector. Alternatively, for discrete labels like those in MNIST, one-hot encoding is a straightforward option, although an embedding often provides more flexibility and potentially better performance, especially with a large number of classes.

Let's assume we use an embedding. If we have $N_c$ classes, we create an embedding layer that maps each integer label $i \in \{0, 1, ..., N_c-1\}$ to a dense vector of a chosen dimension, say $d_e$.
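As a concrete starting point, here is a minimal sketch of this data and embedding setup in PyTorch. The batch size and the embedding dimension ($d_e = 50$) are arbitrary illustrative choices, not values prescribed by the recipe above.

```python
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Scale images to [-1, 1] so they match a Tanh output activation in the generator.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

# MNIST already ships with labels, so the loader yields (x, y) pairs directly.
train_data = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True)

num_classes = 10    # N_c for MNIST
embedding_dim = 50  # d_e, an arbitrary choice for illustration

# Maps each integer label in {0, ..., 9} to a learnable 50-dimensional vector.
label_embedding = nn.Embedding(num_classes, embedding_dim)

images, labels = next(iter(train_loader))  # images: (128, 1, 28, 28), labels: (128,)
label_vectors = label_embedding(labels)    # label_vectors: (128, 50)
```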
## Modifying the Generator Architecture

The generator, $G$, must now accept two inputs: the random noise vector $z$ and the conditional information $y$. The core idea is to combine these inputs effectively so that the generator learns to use $y$ to shape its output.

1. **Embed the condition:** Pass the input label $y$ through an embedding layer to get its vector representation, $y_{emb}$.
2. **Combine inputs:** A simple and common method is to concatenate the noise vector $z$ and the embedded condition $y_{emb}$. If $z$ has dimension $d_z$ and $y_{emb}$ has dimension $d_e$, the combined input vector has dimension $d_z + d_e$.
3. **Process the combined input:** Feed this concatenated vector into the rest of the generator network (e.g., transposed convolutional layers in a DCGAN-style architecture) to produce the fake image $G(z, y)$.

Here's a structure (PyTorch-like):

```python
import torch
from torch import nn

# Generator structure
class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim, num_classes, embedding_dim, output_channels):
        super().__init__()
        self.label_embedding = nn.Embedding(num_classes, embedding_dim)
        # Define the main generator network body.
        # The input dimension of the first layer must accommodate noise_dim + embedding_dim.
        self.main = nn.Sequential(
            # Example: transposed conv layers, batch norm, ReLU
            # nn.ConvTranspose2d(noise_dim + embedding_dim, ...)
            # ... other layers ...
            # nn.ConvTranspose2d(..., output_channels, ..., bias=False),
            # nn.Tanh()  # Output activation is often Tanh for images scaled to [-1, 1]
        )

    def forward(self, noise, labels):
        # Embed the labels.
        label_embedding_vector = self.label_embedding(labels)  # (batch_size, embedding_dim)
        # Noise is assumed to have shape (batch_size, noise_dim, 1, 1) for ConvTranspose2d,
        # so reshape the embedding to match spatially before concatenating.
        label_embedding_reshaped = label_embedding_vector.view(
            label_embedding_vector.size(0), label_embedding_vector.size(1), 1, 1
        )
        # Concatenate along the channel dimension.
        combined_input = torch.cat([noise, label_embedding_reshaped], dim=1)
        # Shape: (batch_size, noise_dim + embedding_dim, 1, 1)
        # Generate the image.
        generated_image = self.main(combined_input)
        return generated_image
```
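The `self.main` block above is intentionally left as a skeleton. As one possible way to fill it in for 28x28 MNIST images, the following sketch picks concrete layer sizes (a 100-dimensional noise vector, 256 and 128 channels); these are illustrative assumptions, not values fixed by the text.

```python
import torch
from torch import nn

noise_dim, embedding_dim, output_channels = 100, 50, 1  # illustrative sizes for MNIST

# One possible concrete body for `self.main`, producing 28x28 images.
main = nn.Sequential(
    nn.ConvTranspose2d(noise_dim + embedding_dim, 256, 7, 1, 0, bias=False),  # -> (B, 256, 7, 7)
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),                        # -> (B, 128, 14, 14)
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, output_channels, 4, 2, 1, bias=False),            # -> (B, 1, 28, 28)
    nn.Tanh(),                                                                 # outputs in [-1, 1]
)

# Quick shape check with a dummy concatenated input of shape (batch, d_z + d_e, 1, 1).
dummy = torch.randn(4, noise_dim + embedding_dim, 1, 1)
print(main(dummy).shape)  # torch.Size([4, 1, 28, 28])
```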
## Modifying the Discriminator Architecture

Similarly, the discriminator, $D$, must now evaluate not just the image $x$ but the pair $(x, y)$. It needs to determine whether $x$ is a real image corresponding to label $y$ or a fake image generated for label $y$.

1. **Embed the condition:** As in the generator, use an embedding layer to get the vector representation $y_{emb}$ for the input label $y$.
2. **Process the image:** Pass the input image $x$ through the initial convolutional layers of the discriminator to extract features. Let the output feature map from a certain layer be $f_x$.
3. **Combine image features and condition:** There are multiple ways to combine $f_x$ and $y_{emb}$ (an early-concatenation sketch follows after the code below):
   - *Concatenation (early):* Reshape $y_{emb}$ to match the spatial dimensions of an early feature map $f_x$ (e.g., by tiling it) and concatenate them along the channel dimension.
   - *Concatenation (late):* Flatten the image feature map $f_x$ after several convolutional layers and concatenate it with $y_{emb}$ before feeding it into fully connected layers.
   - *Projection discriminator:* A more advanced technique that uses an inner product between $y_{emb}$ and $f_x$.
4. **Final output:** Process the combined representation through the remaining layers of the discriminator to produce a single scalar output (the probability or score indicating real/fake).

Here's a structure using late concatenation (PyTorch-like):

```python
import torch
from torch import nn

# Discriminator structure
class ConditionalDiscriminator(nn.Module):
    def __init__(self, num_classes, embedding_dim, input_channels):
        super().__init__()
        self.label_embedding = nn.Embedding(num_classes, embedding_dim)
        # Define the image-processing part (e.g., conv layers).
        self.image_processor = nn.Sequential(
            # Example: conv layers, batch norm, LeakyReLU
            # nn.Conv2d(input_channels, ...)
            # ... other conv layers ...
        )
        # Define the final classifier part.
        # Its input dimension must accommodate the flattened image features + embedding_dim.
        self.classifier = nn.Sequential(
            # Example: Flatten, linear layers, LeakyReLU
            # nn.Flatten(),
            # nn.Linear(feature_dim + embedding_dim, ...),
            # nn.LeakyReLU(0.2, inplace=True),
            # nn.Linear(..., 1)  # Output layer (no sigmoid if using BCEWithLogitsLoss or Wasserstein loss)
        )
        # Calculate feature_dim from the output shape of image_processor.

    def forward(self, image, labels):
        # Process the image.
        image_features = self.image_processor(image)  # Shape depends on the layers
        image_features_flat = image_features.view(image_features.size(0), -1)  # Flatten features
        # Embed the labels.
        label_embedding_vector = self.label_embedding(labels)  # (batch_size, embedding_dim)
        # Concatenate flattened features and label embedding.
        combined_input = torch.cat([image_features_flat, label_embedding_vector], dim=1)
        # Classify the (image, label) pair.
        validity = self.classifier(combined_input)
        return validity
```
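The structure above uses late concatenation. For the early-concatenation option listed earlier, a small sketch of the tiling idea is shown below; the tensor sizes (64 feature channels, a 14x14 map, a 50-dimensional embedding) are placeholders chosen only for illustration.

```python
import torch

batch_size, emb_dim, height, width = 4, 50, 14, 14

feature_map = torch.randn(batch_size, 64, height, width)  # f_x from an early conv layer
label_vectors = torch.randn(batch_size, emb_dim)           # y_emb from the embedding layer

# Tile the embedding across the spatial dimensions so it can act as extra channels.
label_map = label_vectors.view(batch_size, emb_dim, 1, 1).expand(-1, -1, height, width)

combined = torch.cat([feature_map, label_map], dim=1)
print(combined.shape)  # torch.Size([4, 114, 14, 14]) -> 64 image channels + 50 label channels
```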
The following diagram illustrates the data flow in a cGAN, highlighting where the conditional label $y$ is incorporated.

```dot
digraph CGAN {
    rankdir=LR;
    node [shape=box, style=filled, fillcolor="#e9ecef", fontname="sans-serif"];
    edge [fontname="sans-serif"];

    subgraph cluster_G {
        label = "Generator (G)";
        style=filled;
        color="#dee2e6";
        node [fillcolor="#ffffff"];
        z [label="Noise (z)", shape=ellipse, fillcolor="#a5d8ff"];
        y_in_g [label="Condition (y)", shape=ellipse, fillcolor="#ffec99"];
        g_embed [label="Embed y"];
        g_concat [label="Combine z, y_emb"];
        g_net [label="Generator Network"];
        fake_img [label="Generated Image G(z,y)", fillcolor="#b2f2bb"];
        z -> g_concat;
        y_in_g -> g_embed -> g_concat;
        g_concat -> g_net -> fake_img;
    }

    subgraph cluster_D {
        label = "Discriminator (D)";
        style=filled;
        color="#dee2e6";
        node [fillcolor="#ffffff"];
        x [label="Real Image (x)", shape=ellipse, fillcolor="#b2f2bb"];
        y_in_d_real [label="Condition (y)", shape=ellipse, fillcolor="#ffec99"];
        y_in_d_fake [label="Condition (y)", shape=ellipse, fillcolor="#ffec99"];
        d_embed_real [label="Embed y"];
        d_embed_fake [label="Embed y"];
        d_proc_real [label="Process x"];
        d_proc_fake [label="Process G(z,y)"];
        d_combine_real [label="Combine x_feat, y_emb"];
        d_combine_fake [label="Combine G_feat, y_emb"];
        d_net [label="Discriminator Network"];
        decision_real [label="Decision (Real/Fake)", shape=diamond, fillcolor="#ffc9c9"];
        decision_fake [label="Decision (Real/Fake)", shape=diamond, fillcolor="#ffc9c9"];
        x -> d_proc_real;
        y_in_d_real -> d_embed_real;
        d_proc_real -> d_combine_real;
        d_embed_real -> d_combine_real;
        d_combine_real -> d_net -> decision_real;
        fake_img -> d_proc_fake [style=dashed];
        y_in_d_fake -> d_embed_fake;
        d_proc_fake -> d_combine_fake;
        d_embed_fake -> d_combine_fake;
        d_combine_fake -> d_net -> decision_fake [style=dashed];
    }

    y_in_g -> y_in_d_real [style=invis];  // Help layout
    y_in_g -> y_in_d_fake [style=invis];  // Help layout

    label = "Conditional GAN Data Flow";
    fontsize=14;
}
```

Data flow in a Conditional GAN. The condition $y$ (yellow) is embedded and combined with the noise $z$ (blue) in the generator, and with image features (green) in the discriminator. The discriminator outputs a decision (red) based on both the image and its supposed condition.

## The Conditional Loss Function

The objective function remains a minimax game, but now both $D$ and $G$ also depend on $y$. The value function $V(D, G)$ is:

$$ \min_G \max_D V(D, G) = \mathbb{E}_{(x, y) \sim p_{data}(x, y)}[\log D(x, y)] + \mathbb{E}_{z \sim p_z(z),\, y \sim p_y(y)}[\log(1 - D(G(z, y), y))] $$

Here, $p_{data}(x, y)$ is the joint distribution of real data and labels, and $p_y(y)$ is the distribution of labels (often sampled uniformly or according to the training set distribution).

In practice, when using the standard binary cross-entropy loss (often implemented with `BCEWithLogitsLoss` for numerical stability), the discriminator tries to output high values for real pairs $(x, y)$ and low values for fake pairs $(G(z, y), y)$. The generator tries to fool the discriminator by making $D(G(z, y), y)$ output high values. Remember to use the same label $y$ both when generating $G(z, y)$ and when passing the result to the discriminator.

## The Training Loop

The cGAN training loop follows the standard GAN pattern, with the important addition of handling the labels $y$. A minimal training-step sketch follows after the two update procedures below.

**Update the discriminator:**

1. Sample a mini-batch of real images $x$ and their corresponding labels $y$ from the dataset.
2. Pass $(x, y)$ through the discriminator $D$ to get real scores $D(x, y)$.
3. Sample a mini-batch of noise vectors $z$ and a corresponding mini-batch of labels $y'$ (these can be the same labels as the real batch or sampled independently).
4. Generate fake images $x_{fake} = G(z, y')$. Use `.detach()` on $x_{fake}$ when training $D$.
5. Pass $(x_{fake}, y')$ through the discriminator $D$ to get fake scores $D(x_{fake}, y')$.
6. Calculate the discriminator loss (e.g., binary cross-entropy comparing the real scores to 1s and the fake scores to 0s).
7. Compute gradients and update $D$'s parameters.

**Update the generator:**

1. Sample a mini-batch of noise vectors $z$ and a mini-batch of labels $y''$.
2. Generate fake images $x_{fake} = G(z, y'')$. Note: no `.detach()` here.
3. Pass $(x_{fake}, y'')$ through the discriminator $D$ to get fake scores $D(x_{fake}, y'')$.
4. Calculate the generator loss (e.g., binary cross-entropy comparing the fake scores to 1s, so that $G$ is pushed to maximize $D$'s output on fake samples).
5. Compute gradients and update $G$'s parameters.

Repeat these steps for the desired number of epochs. Remember standard GAN training practices such as appropriate optimizers (e.g., Adam) and learning rates, and apply the stabilization techniques discussed in Chapter 3 if needed.
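Here is a minimal sketch of one training step following the procedure above. It assumes instances `G` and `D` of the skeleton classes shown earlier (with `D` returning raw logits) and optimizers `opt_G` and `opt_D`; the function name and argument list are illustrative, not part of the original recipe.

```python
import torch
from torch import nn

bce = nn.BCEWithLogitsLoss()  # assumes the discriminator outputs raw logits

def cgan_train_step(G, D, opt_G, opt_D, real_images, real_labels,
                    noise_dim, num_classes, device):
    """One discriminator update followed by one generator update."""
    batch_size = real_images.size(0)
    ones = torch.ones(batch_size, 1, device=device)    # targets for "real"
    zeros = torch.zeros(batch_size, 1, device=device)  # targets for "fake"

    # --- Update the discriminator ---
    opt_D.zero_grad()
    real_scores = D(real_images, real_labels)                                   # D(x, y)
    noise = torch.randn(batch_size, noise_dim, 1, 1, device=device)             # z
    fake_labels = torch.randint(0, num_classes, (batch_size,), device=device)   # y'
    fake_images = G(noise, fake_labels)                                          # G(z, y')
    fake_scores = D(fake_images.detach(), fake_labels)  # detach: no gradients into G
    loss_D = bce(real_scores, ones) + bce(fake_scores, zeros)
    loss_D.backward()
    opt_D.step()

    # --- Update the generator ---
    opt_G.zero_grad()
    fake_scores = D(fake_images, fake_labels)  # no detach: gradients flow back into G
    loss_G = bce(fake_scores, ones)            # G wants D to call its samples real
    loss_G.backward()
    opt_G.step()

    return loss_D.item(), loss_G.item()
```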
## Generating Conditional Samples

Once training is complete, you can generate images conditioned on specific labels. Simply:

1. Choose the desired label(s) $y_{target}$.
2. Sample noise vectors $z$.
3. Pass $z$ and $y_{target}$ to the trained generator $G$.

The output $G(z, y_{target})$ will be images synthesized to match the characteristics associated with $y_{target}$. For instance, to generate only images of the digit '7', you would repeatedly call $G(z, \text{label}=7)$ with different noise vectors $z$.

## Evaluation Techniques

Evaluating a cGAN involves assessing not only the quality and diversity of the generated images (using metrics like FID or IS, discussed in Chapter 5) but also their conditional consistency: did the generator produce an image that actually matches the requested label $y$? This can be checked qualitatively by visual inspection, or quantitatively by feeding the generated images $G(z, y)$ into a pre-trained classifier (independent of the cGAN's discriminator) and measuring its accuracy in predicting $y$. A small sketch of both steps closes this section.

This practical exercise provides the blueprint for implementing cGANs. By carefully integrating conditional information into both the generator and discriminator, you gain significant control over the generation process, enabling targeted synthesis based on specific attributes. Experiment with different embedding dimensions and concatenation strategies to see how they affect performance on your chosen dataset.
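As a closing illustration of the sampling and consistency-check steps described above, here is a minimal sketch. It assumes the trained `ConditionalGenerator` from earlier and some external pre-trained MNIST classifier; `pretrained_classifier` is a placeholder, not defined in this section.

```python
import torch

@torch.no_grad()
def sample_and_check(G, pretrained_classifier, noise_dim, device, digit=7, n=64):
    """Generate `n` images of `digit` and measure label consistency with an external classifier."""
    noise = torch.randn(n, noise_dim, 1, 1, device=device)             # z
    labels = torch.full((n,), digit, dtype=torch.long, device=device)  # y_target
    images = G(noise, labels)                                          # G(z, y_target)
    predictions = pretrained_classifier(images).argmax(dim=1)
    accuracy = (predictions == labels).float().mean().item()
    return images, accuracy
```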