The minimax objective function is the mechanism driving the adversarial training of the Generator (G) and Discriminator (D). This function mathematically formalizes the competing goals of these two networks.
The Value Function V(D,G)
The core of the GAN framework is expressed through a value function, V(D,G), which represents the payoff in a two-player minimax game. The goal is for the generator G to minimize this value, while the discriminator D simultaneously tries to maximize it. The original GAN paper by Goodfellow et al. (2014) proposed the following objective:
$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
Let's break down this expression:
- $x \sim p_{\text{data}}(x)$: This denotes that x is a sample drawn from the true data distribution, $p_{\text{data}}$. These are the real examples we want the generator to eventually mimic (e.g., real images of faces).
- $z \sim p_z(z)$: This denotes that z is a sample drawn from a prior noise distribution, typically a simple distribution like a Gaussian or uniform distribution. This vector z serves as the input seed for the generator.
- G(z): This is the output of the generator network when given the noise vector z. It represents a generated, or "fake," sample (e.g., a synthesized face image). The distribution of these generated samples is denoted $p_g$.
- D(x): This is the output of the discriminator network when given an input x. It represents the probability that x is a real sample from $p_{\text{data}}$ rather than a fake sample from $p_g$. Ideally, D(x) should be close to 1 for real samples and close to 0 for fake samples.
- $\mathbb{E}[\cdot]$: This denotes the expected value, i.e., the average over all samples drawn from the specified distribution.
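To make these expectations concrete, the following sketch estimates V(D,G) by Monte Carlo. The one-dimensional setup, the fixed logistic "discriminator," and the affine "generator" are hypothetical stand-ins for trained networks, chosen only to make the two terms computable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D setup: real data ~ N(2, 0.5), noise prior ~ N(0, 1).
# D is a fixed logistic function and G a fixed affine map -- stand-ins
# for trained networks, just to illustrate the two expectations.
def D(x):
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))  # assumed P(x is real)

def G(z):
    return 0.5 * z + 1.5                     # generated samples define p_g

x_real = rng.normal(2.0, 0.5, size=100_000)  # x ~ p_data
z = rng.normal(0.0, 1.0, size=100_000)       # z ~ p_z

# Monte Carlo estimate of
# V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
print(f"V(D, G) ≈ {V:.4f}")
```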
Understanding the Terms
The value function V(D,G) consists of two main terms:
- $\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)]$: This term measures the discriminator's ability to correctly classify real samples. D wants to maximize this term by outputting values close to 1 for real data x, making $\log D(x)$ close to $\log(1) = 0$.
- $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$: This term measures the discriminator's ability to correctly identify fake samples. For a fake sample G(z), D wants to output a value close to 0. This makes $1 - D(G(z))$ close to 1 and $\log(1 - D(G(z)))$ close to $\log(1) = 0$. Maximizing this term therefore also corresponds to D correctly identifying fake samples.
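From the discriminator's perspective, maximizing these two terms is the same as minimizing the familiar binary cross-entropy loss with label 1 for real samples and label 0 for fakes. A small check with hypothetical batch outputs makes the equivalence explicit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discriminator outputs (probabilities of "real") on a batch.
d_real = rng.uniform(0.6, 0.99, size=8)   # D(x) on real samples
d_fake = rng.uniform(0.01, 0.4, size=8)   # D(G(z)) on fake samples

# Batch estimate of V(D, G): E[log D(x)] + E[log(1 - D(G(z)))].
V = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def bce(p, y):
    """Binary cross-entropy per sample: -(y log p + (1 - y) log(1 - p))."""
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# D's loss with label 1 for real and 0 for fake is exactly -V.
loss_D = bce(d_real, 1.0).mean() + bce(d_fake, 0.0).mean()
assert np.isclose(loss_D, -V)
```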
The Minimax Game
The objective function captures the adversarial nature of GAN training:
- Maximizing D ($\max_D V(D,G)$): For a fixed generator G, the discriminator D is trained to maximize V(D,G). This involves adjusting D's parameters to become better at distinguishing real samples ($D(x) \to 1$) from fake samples ($D(G(z)) \to 0$). In practice, this is achieved by ascending the stochastic gradient of V(D,G) with respect to D's parameters.
- Minimizing G ($\min_G \max_D V(D,G)$): Simultaneously (or, in practice, alternately), for a fixed discriminator D, the generator G is trained to minimize V(D,G). Notice that G influences only the second term, $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. The generator tries to produce samples G(z) that the discriminator classifies as real ($D(G(z)) \to 1$). If D(G(z)) is close to 1, then $\log(1 - D(G(z)))$ becomes a large negative number (approaching $-\infty$), minimizing the objective from G's perspective. This is achieved by descending the stochastic gradient of V(D,G) with respect to G's parameters, as in the sketch below.
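In practice, these two updates are interleaved: one (or more) gradient steps on D, then one on G. The following PyTorch sketch shows the alternating loop for hypothetical tiny MLPs fitting a 1-D Gaussian; it follows the minimax losses literally and is meant as an illustration, not a robust training recipe:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical tiny networks on 1-D data; real samples come from N(2, 0.5).
G = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_D = torch.optim.SGD(D.parameters(), lr=0.05)
opt_G = torch.optim.SGD(G.parameters(), lr=0.05)
eps = 1e-8  # guards against log(0) in this toy setup

for step in range(1000):
    x = torch.randn(64, 1) * 0.5 + 2.0   # x ~ p_data
    z = torch.randn(64, 1)               # z ~ p_z

    # Discriminator step: ascend V(D, G), i.e. descend -V(D, G).
    # detach() keeps this update from flowing back into G.
    loss_D = -(torch.log(D(x) + eps).mean()
               + torch.log(1 - D(G(z).detach()) + eps).mean())
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator step: descend E[log(1 - D(G(z)))], the only term G affects.
    loss_G = torch.log(1 - D(G(z)) + eps).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```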
Figure: Flow of the minimax game in a GAN. The Generator tries to produce realistic data ($p_g$) from noise ($p_z$) to fool the Discriminator; the Discriminator tries to maximize its ability to distinguish real data ($p_{\text{data}}$) from generated data ($p_g$). The objective function drives the parameter updates for both networks.
Theoretical Equilibrium and Connection to Divergence
Theoretically, this minimax game has a global optimum when the generator's distribution exactly matches the real data distribution, i.e., $p_g = p_{\text{data}}$. At this point, the optimal discriminator $D^*$ cannot distinguish real from fake samples better than chance, meaning $D^*(x) = 1/2$ for all x. The value function then evaluates to:
$$V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log(1/2)] + \mathbb{E}_{z \sim p_z}[\log(1 - 1/2)] = \log\tfrac{1}{2} + \log\tfrac{1}{2} = -2\log 2$$
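A one-line arithmetic check confirms the value at this equilibrium:

```python
import numpy as np

# At p_g = p_data the optimal discriminator outputs 1/2 everywhere, so both
# expectations collapse to constants: V(D*, G) = log(1/2) + log(1/2).
V_star = np.log(0.5) + np.log(0.5)
print(V_star, -2 * np.log(2))  # both ≈ -1.3863
```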
It can be shown that, for a fixed G, the optimal discriminator is $D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. Substituting this back into the objective function reveals a connection to a measure of difference between probability distributions. Specifically, when D is optimal, the objective relates to the Jensen-Shannon Divergence (JSD) between the real data distribution $p_{\text{data}}$ and the generated data distribution $p_g$:
$$\max_D V(D,G) = V(D^*_G, G) = 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - 2\log 2$$
The JSD is a symmetric measure of similarity between two probability distributions, with $\mathrm{JSD}(P \,\|\, Q) = 0$ if and only if $P = Q$. Therefore, minimizing the objective with respect to G (after maximizing with respect to D) is equivalent to minimizing the Jensen-Shannon divergence between the generator's distribution and the real data distribution. This provides a theoretical justification for why training a GAN with this objective should drive the generator toward producing realistic samples.
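This identity is easy to verify numerically. The sketch below uses two hypothetical discrete distributions, plugs the optimal discriminator $D^*_G$ into V(D,G) with the expectations written as sums, and checks the result against $2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g) - 2\log 2$:

```python
import numpy as np

# Two hypothetical discrete distributions over 4 outcomes.
p_data = np.array([0.10, 0.40, 0.30, 0.20])
p_g    = np.array([0.25, 0.25, 0.25, 0.25])
m = 0.5 * (p_data + p_g)  # mixture used by the JSD

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return np.sum(p * np.log(p / q))

jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

# Optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)),
# plugged into V(D, G) with expectations written as discrete sums.
d_star = p_data / (p_data + p_g)
V_opt = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

assert np.isclose(V_opt, 2 * jsd - 2 * np.log(2))
print(V_opt, 2 * jsd - 2 * np.log(2))
```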
While elegant, this original formulation faces practical challenges during training, particularly when the distributions $p_{\text{data}}$ and $p_g$ have little overlap early in training. This can lead to vanishing gradients for the generator, hindering learning. We will explore these instabilities and the techniques developed to mitigate them in Chapter 3; understanding this foundational objective function, however, is essential before moving on to those advanced methods.
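As a small preview of that difficulty, the sketch below differentiates the generator's term $\log(1 - \sigma(a))$ with respect to the discriminator's logit a (so $D(G(z)) = \sigma(a)$). The gradient is $-\sigma(a)$, which all but vanishes when D confidently rejects a fake ($a \ll 0$); the logit values here are hypothetical:

```python
import torch

# Gradient of the generator's loss log(1 - sigmoid(a)) w.r.t. the logit a.
# Analytically this is -sigmoid(a): near zero when D confidently rejects
# the fake (a << 0), which is exactly the early-training regime.
for a0 in [-8.0, -2.0, 0.0, 2.0]:
    a = torch.tensor(a0, requires_grad=True)
    loss = torch.log(1 - torch.sigmoid(a))
    loss.backward()
    print(f"logit={a0:+.1f}  D(G(z))={torch.sigmoid(a).item():.4f}  "
          f"grad={a.grad.item():+.4f}")
```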