Standard Generative Adversarial Networks, as reviewed in Chapter 1, learn to map a random noise vector z to a sample x that resembles data from a target distribution p_data(x). While powerful, this process offers little explicit control over the type of sample generated. If you train a GAN on images of handwritten digits (like MNIST), it will generate realistic digits, but you cannot simply ask it to produce, say, a '7'. You get whichever digit the sampled z happens to map to.
Conditional GANs (cGANs) extend the basic GAN framework to address this limitation by incorporating auxiliary information, often called a condition or label, denoted by y. This condition y provides context or guidance to the generation process, allowing us to direct the output. Instead of learning the overall data distribution p(x), a cGAN learns the conditional distribution p(x∣y).
The Core Idea: Conditioning the Generation
The fundamental change in a cGAN is that both the generator and the discriminator receive the conditional information y as an additional input.
- Generator: The generator's task is now to produce a sample x_fake that is both realistic and consistent with the given condition y. Its input is no longer just the noise vector z but the pair (z, y), and we denote the generator function as G(z, y).
- Discriminator: The discriminator must still distinguish real data samples x_real from generated samples x_fake, but now in the context of the condition y: it must decide whether the input pair (x, y) is a valid pair, meaning x is realistic and corresponds to the label y. Real pairs (x_real, y) come from the true data distribution, while fake pairs (G(z, y), y) come from the generator. We denote the discriminator function as D(x, y).
This conditioning mechanism allows for targeted generation. If y represents a digit class in MNIST, providing y=7 to G(z,y) instructs the generator to specifically synthesize an image of a '7'.
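The interfaces above can be sketched with a deliberately tiny toy model. This is a minimal illustration of the (z, y) and (x, y) pairings, not a trainable architecture: the single linear layers, the dimensions, and the names `G`, `D`, `one_hot` are all assumptions chosen for clarity, with MNIST-like shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 10   # e.g. MNIST digit labels
NOISE_DIM = 64
DATA_DIM = 784     # a flattened 28x28 image

def one_hot(label, num_classes=NUM_CLASSES):
    """Encode an integer class label as a one-hot vector."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def G(z, y, W_g):
    """Toy generator: one linear layer over the concatenated (z, y) input."""
    return np.tanh(W_g @ np.concatenate([z, y]))

def D(x, y, w_d):
    """Toy discriminator: a logistic score over the concatenated (x, y) pair."""
    logit = w_d @ np.concatenate([x, y])
    return 1.0 / (1.0 + np.exp(-logit))

# Untrained, randomly initialised parameters -- for shape-checking only.
W_g = rng.normal(0.0, 0.01, size=(DATA_DIM, NOISE_DIM + NUM_CLASSES))
w_d = rng.normal(0.0, 0.01, size=DATA_DIM + NUM_CLASSES)

z = rng.normal(size=NOISE_DIM)
y = one_hot(7)                 # ask the generator for a '7'
x_fake = G(z, y, W_g)          # shape: (784,)
score = D(x_fake, y, w_d)      # D's probability that (x, y) is a real pair
```

The key structural point is that the same condition y is fed to both networks: the generator receives it alongside z, and the discriminator receives it alongside the sample it is judging.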
Figure: Data flow in a standard GAN versus a Conditional GAN (cGAN). The cGAN incorporates the conditional information y into both the generator and the discriminator.
Modifying the Objective Function
The introduction of the condition y requires a modification to the standard GAN minimax objective function. The objective now reflects the conditional nature of the generation and discrimination tasks. The value function V(D,G) for a cGAN becomes:
min_G max_D V(D, G) = E_{(x,y)∼p_data(x,y)}[log D(x, y)] + E_{z∼p_z(z), y∼p_y(y)}[log(1 − D(G(z, y), y))]
Let's break this down:
- E_{(x,y)∼p_data(x,y)}[log D(x, y)]: The discriminator D aims to maximize this term. It tries to assign a high probability (close to 1) to real data samples x when paired with their correct corresponding condition y. The expectation is taken over real samples and their associated conditions drawn from the joint data distribution p_data(x, y).
- E_{z∼p_z(z), y∼p_y(y)}[log(1 − D(G(z, y), y))]: The generator G aims to minimize this term, while D tries to maximize it by making D(G(z, y), y) small. G takes random noise z and a condition y (sampled from its distribution p_y(y), in practice often the empirical label distribution of the training data) and produces a fake sample G(z, y). The discriminator then evaluates this generated sample together with its intended condition y. D tries to assign a low probability (close to 0) to these fake pairs, while G tries to fool D into assigning a high probability.
Essentially, the minimax game remains, but it is now played conditionally on y. The discriminator learns to tell real pairs (x, y) from generated ones, implicitly comparing the real conditional distribution p_data(x∣y) with the generator's p_g(x∣y), and the generator learns to produce samples that are indistinguishable from real samples given the condition y.
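The two terms of the value function translate directly into per-pair losses. The sketch below, a simplified illustration rather than a production training loop, negates the discriminator term so that "maximize V" becomes the usual "minimize a loss"; the names `d_loss` and `g_loss` and the `EPS` guard are assumptions.

```python
import numpy as np

EPS = 1e-8  # guard against log(0)

def d_loss(d_real, d_fake):
    """Negated discriminator term of V(D, G) for one pair of samples, so that
    maximising V corresponds to minimising this loss:
        -[log D(x_real, y) + log(1 - D(G(z, y), y))]
    d_real and d_fake are the discriminator's outputs in (0, 1)."""
    return -(np.log(d_real + EPS) + np.log(1.0 - d_fake + EPS))

def g_loss(d_fake):
    """Generator term: G minimises log(1 - D(G(z, y), y))."""
    return np.log(1.0 - d_fake + EPS)

# A confident, correct discriminator makes d_loss small; a fooled
# discriminator (d_fake near 1) makes g_loss strongly negative,
# which is exactly what the generator wants.
print(d_loss(0.99, 0.01))   # close to 0: D is doing well
print(g_loss(0.99))         # strongly negative: G is fooling D
```

In practice the generator is usually trained with the non-saturating variant (maximizing log D(G(z, y), y)) because the literal minimax generator loss gives weak gradients early in training, but the conditional structure is identical.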
How is Conditioning Information Incorporated?
The specific mechanism for feeding y into G and D depends on the nature of y and the network architectures. Common approaches, which we will explore further in the "Architectures for cGANs" section, include:
- Concatenation: If y is a simple vector (e.g., a one-hot encoded class label or an embedding), it can be directly concatenated to the noise vector z for the generator, or concatenated to the input x (or intermediate feature maps) for the discriminator.
- Embedding Layers: Categorical labels y are often passed through embedding layers to obtain dense vector representations before being combined with z or x.
- Modulation Techniques: More advanced architectures might use y to modulate weights or activations within the networks (e.g., Conditional Batch Normalization).
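The first two approaches combine naturally: look up a dense embedding for the label, then concatenate it to the noise vector. The following sketch assumes a numpy lookup table standing in for a trainable embedding layer; the dimensions and the name `condition_input` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, EMBED_DIM, NOISE_DIM = 10, 16, 64

# A lookup table with one dense row per class label. In a real model these
# rows are parameters trained jointly with the rest of the network.
embedding = rng.normal(0.0, 0.02, size=(NUM_CLASSES, EMBED_DIM))

def condition_input(z, label):
    """Replace the raw label with its dense embedding, then concatenate
    it to the noise vector to form the generator's input."""
    return np.concatenate([z, embedding[label]])

z = rng.normal(size=NOISE_DIM)
g_input = condition_input(z, label=3)   # shape: (NOISE_DIM + EMBED_DIM,)
```

The same trick works on the discriminator side, where the embedding can be concatenated to the flattened input x or broadcast across intermediate feature maps.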
Why Use cGANs?
Conditional GANs provide a powerful handle for controlling generative processes. Instead of sampling randomly from the learned distribution, you can explicitly request specific types of outputs. This is immensely useful in various applications:
- Class-Conditional Image Synthesis: Generating images belonging to specific categories (e.g., generating a 'Siamese cat' image instead of just any cat).
- Text-to-Image Synthesis: Generating images based on textual descriptions (e.g., StackGAN, covered later in this chapter). Here, y is a vector representation of the input text.
- Image-to-Image Translation: Translating an image from one domain to another based on a target domain label or example (e.g., converting sketches to photos).
- Attribute Manipulation: Modifying specific attributes of an image (e.g., adding glasses to a face image).
By conditioning the generation, cGANs move beyond simple mimicry towards controllable and directed synthesis, opening up a wider range of practical applications for generative modeling. The next sections will detail specific architectures and related techniques like InfoGAN, which seeks to learn meaningful conditioning factors automatically.