Ideally, training a Generative Adversarial Network would involve finding a stable Nash equilibrium in the minimax game between the generator (G) and the discriminator (D). At this equilibrium, the generator would capture the true data distribution, and the discriminator would be unable to distinguish real samples from generated ones (outputting 0.5 for all inputs). However, achieving this theoretical ideal in practice is notoriously difficult. Standard GAN training procedures often fail to converge smoothly, exhibiting various instabilities that hinder performance.
This lack of reliable convergence stems from several interconnected factors related to the optimization dynamics of the two competing neural networks.
Training a GAN involves optimizing the parameters of two networks simultaneously, each with its own objective function that depends on the other network's parameters. This setup is fundamentally different from standard supervised learning where we minimize a single loss function over one set of parameters.
The objective function for the original GAN formulation is:
$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
$$

Finding a saddle point $(\theta_G^*, \theta_D^*)$ for this objective in the high-dimensional, non-convex parameter space of deep neural networks is challenging. Standard gradient descent methods are designed for minimization, not for finding saddle points in a zero-sum game. Alternating gradient updates (updating D for $k$ steps, then updating G for one step) is a common heuristic, but it lacks strong convergence guarantees in this non-convex setting.
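To make the alternating heuristic concrete, here is a minimal PyTorch sketch of one training pass. The names netG, netD, dataloader, latent_dim, and device are assumed placeholders (with netD ending in a sigmoid so its output is a probability); this illustrates the update scheme rather than providing a complete training script.

```python
import torch
import torch.nn.functional as F

# Assumed to exist elsewhere: netG, netD (netD ends in a sigmoid),
# dataloader, latent_dim, device.
opt_d = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
k = 1  # discriminator steps per generator step

for real, _ in dataloader:
    real = real.to(device)
    batch = real.size(0)

    # Discriminator: k ascent steps on V(D, G), implemented as descent on -V.
    for _ in range(k):
        z = torch.randn(batch, latent_dim, device=device)
        fake = netG(z).detach()            # block gradients into G
        d_real, d_fake = netD(real), netD(fake)
        loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
               + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()

    # Generator: one descent step on the saturating loss log(1 - D(G(z))).
    z = torch.randn(batch, latent_dim, device=device)
    d_fake = netD(netG(z))
    loss_g = torch.log(1.0 - d_fake + 1e-8).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```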
One significant practical problem arises from the gradients used to update the generator. Consider the generator's loss under the minimax objective, $L_G = \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. If the discriminator becomes very effective at distinguishing real from fake samples, outputting values close to 0 for generated samples $G(z)$, the term $\log(1 - D(G(z)))$ saturates.
Writing the discriminator as $D = \sigma(a)$, where $a$ is its pre-sigmoid logit, the gradient of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a) = -D(G(z))$. When $D(G(z))$ is close to 0, this gradient, and with it the signal $\nabla_{\theta_G} L_G$ flowing back to the generator's parameters, approaches zero. This phenomenon, known as the vanishing gradient problem in the context of GANs, means the generator receives little information on how to improve its samples precisely when they are easily identified as fake by the discriminator. Training stalls because the generator isn't learning effectively.
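A quick numerical check makes the saturation visible. The sketch below (PyTorch autograd, arbitrary logit values) compares the gradient with respect to the logit for the saturating loss $\log(1 - D)$ and for the non-saturating alternative $-\log D$ that is often used in practice: as $D(G(z))$ approaches 0, the former's gradient vanishes while the latter's stays close to 1 in magnitude.

```python
import torch

# Gradient received through the discriminator's logit a, where D = sigmoid(a),
# for the saturating loss log(1 - D) and the non-saturating loss -log(D).
for a_val in (-6.0, -3.0, 0.0, 3.0):
    a = torch.tensor(a_val, requires_grad=True)
    d = torch.sigmoid(a)
    (grad_sat,) = torch.autograd.grad(torch.log(1.0 - d), a)

    a2 = torch.tensor(a_val, requires_grad=True)
    (grad_nonsat,) = torch.autograd.grad(-torch.log(torch.sigmoid(a2)), a2)

    print(f"D(G(z)) = {d.item():.4f}   "
          f"d/da log(1 - D) = {grad_sat.item():+.4f}   "
          f"d/da -log(D) = {grad_nonsat.item():+.4f}")
```

At $a = -6$ the discriminator assigns the fake a probability of about 0.0025, and the saturating loss passes back a gradient of roughly the same tiny magnitude, which is exactly the stalling behavior described above.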
Conversely, the optimization process can also lead to oscillatory behavior. The updates for the generator might shift the generated distribution in a way that increases the discriminator's loss significantly. The subsequent discriminator updates might then over-correct, shifting the decision boundary aggressively. This pushes the generator towards a different part of the data space in the next update. Instead of converging towards an equilibrium, the parameters might oscillate, and the loss values can fluctuate wildly without necessarily improving sample quality.
Plot showing how generator and discriminator losses might oscillate during training instead of steadily converging, indicating instability.
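A stripped-down way to see why simultaneous updates can orbit rather than converge is the bilinear game $\min_x \max_y V(x, y) = xy$, whose unique equilibrium is at $(0, 0)$. The toy sketch below (plain Python, arbitrary learning rate) applies simultaneous gradient descent for $x$ and ascent for $y$; the iterates spiral away from the equilibrium, a simplified caricature of the oscillations seen in GAN training.

```python
import math

# Toy minimax game: min_x max_y V(x, y) = x * y, equilibrium at (0, 0).
x, y = 1.0, 1.0   # stand-ins for generator and discriminator parameters
lr = 0.1

for step in range(201):
    grad_x, grad_y = y, x                      # dV/dx = y, dV/dy = x
    x, y = x - lr * grad_x, y + lr * grad_y    # simultaneous descent / ascent
    if step % 50 == 0:
        print(f"step {step:3d}   x = {x:+.3f}   y = {y:+.3f}   "
              f"distance from (0, 0) = {math.hypot(x, y):.3f}")
```

In this toy problem, each update multiplies the distance to the equilibrium by $\sqrt{1 + \eta^2} > 1$ for learning rate $\eta$, so the oscillation grows rather than damps out.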
The original GAN loss function, when the discriminator is optimal, effectively minimizes the Jensen-Shannon divergence (JSD) between the real data distribution $P_{\text{data}}$ and the generated data distribution $P_g$. While JSD is a valid measure of similarity between distributions, it has a significant drawback in the context of GAN training.
If the two distributions $P_{\text{data}}$ and $P_g$ have negligible overlap or are supported on disjoint manifolds (which is highly likely early in training when the generator produces unrealistic samples), the JSD between them becomes a constant value ($\log 2$). The gradient of a constant is zero. This theoretical perspective explains the vanishing gradient problem: when the distributions are too different, the JSD provides no useful gradient signal to guide the generator on how to make $P_g$ closer to $P_{\text{data}}$.
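The short NumPy sketch below illustrates this with discrete distributions (the bin layout is arbitrary). Once the supports of $P_{\text{data}}$ and $P_g$ are disjoint, moving $P_g$ closer to or further from the data changes nothing: the JSD stays pinned at $\log 2$, so it offers no direction for improvement.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence (natural log) between discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(np.where(p > 0, p * np.log((p + eps) / (q + eps)), 0.0)))

def jsd(p, q):
    """Jensen-Shannon divergence between discrete distributions."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# p_data lives on bins 0-1; p_g lives on two bins somewhere else entirely.
p_data = np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0])
for offset in (2, 5, 8):
    p_g = np.zeros(10)
    p_g[offset] = p_g[offset + 1] = 0.5
    print(f"p_g on bins {offset}-{offset + 1}:  JSD = {jsd(p_data, p_g):.4f}  "
          f"(log 2 = {np.log(2):.4f})")
```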
A further complication is that the generator and discriminator loss values during training are often poor indicators of the actual quality and diversity of the generated samples. You might observe decreasing loss values while the generator is collapsing to produce only a few types of outputs (mode collapse), or conversely, see fluctuating losses even when sample quality appears to be improving. This lack of correlation makes it hard to rely solely on loss curves to monitor training progress or determine when to stop training.
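One practical workaround is to track a sample-based statistic alongside the losses. The sketch below is a hypothetical helper (the names netG and latent_dim are assumptions carried over from the earlier sketch) that counts roughly distinct samples in a generated batch; a sharp drop in this count is a crude but useful mode-collapse warning even when the loss curves look unremarkable. In practice, metrics such as FID are the more standard way to monitor sample quality and diversity.

```python
import torch

@torch.no_grad()
def distinct_sample_count(netG, latent_dim, n_samples=256, decimals=1, device="cpu"):
    """Count approximately distinct generated samples (coarse diversity check)."""
    z = torch.randn(n_samples, latent_dim, device=device)
    samples = netG(z).flatten(start_dim=1)
    # Round to a coarse grid so near-duplicates collapse to the same row.
    scale = 10 ** decimals
    rounded = torch.round(samples * scale) / scale
    return torch.unique(rounded, dim=0).size(0)

# Inside the training loop, e.g. every few hundred iterations:
# print(f"distinct samples: {distinct_sample_count(netG, latent_dim)} / 256")
```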
These challenges (the difficulty of saddle-point optimization, vanishing or unstable gradients, the limitations of JSD for distributions with low overlap, and the unreliable nature of loss values) motivate the development of the stabilization techniques discussed in the subsequent sections. Understanding these root causes is the first step towards applying methods like Wasserstein loss, gradient penalties, and spectral normalization to build more stable and effective GANs.