While the original Generative Adversarial Network concept introduced a groundbreaking approach to generative modeling, the "vanilla" implementation, based directly on the minimax objective function, quickly revealed several significant practical limitations. These challenges spurred much of the research leading to the advanced techniques we will cover in this course. Understanding these shortcomings is essential for appreciating the motivation behind more sophisticated architectures and training strategies.
The core idea of GANs is an adversarial game between the generator (G) and the discriminator (D). The original objective function is theoretically elegant: when the discriminator is optimal, minimizing it with respect to the generator is equivalent to minimizing the Jensen-Shannon (JS) divergence between the real data distribution $p_{\text{data}}$ and the generated data distribution $p_g$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

However, training this minimax game in practice is notoriously difficult. In particular, if the discriminator becomes too accurate, the $\log(1 - D(G(z)))$ term saturates and the generator receives vanishingly small gradients, which stalls learning and contributes to the training instability discussed throughout this chapter.
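To make the objective concrete, here is a minimal PyTorch sketch of one vanilla training step under the literal minimax loss. The MLP architectures, layer sizes, learning rates, and the assumption that `real_batch` is a batch of flattened data vectors are all illustrative choices, not a prescribed implementation:

```python
import torch
import torch.nn as nn

# Illustrative MLP generator and discriminator; architectures are placeholders.
latent_dim, data_dim = 64, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)
    zeros = torch.zeros(batch_size, 1)

    # Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))],
    # implemented as minimizing the equivalent binary cross-entropy.
    z = torch.randn(batch_size, latent_dim)
    fake = G(z).detach()  # stop gradients from flowing into G on this step
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: the literal minimax objective minimizes E[log(1 - D(G(z)))].
    z = torch.randn(batch_size, latent_dim)
    g_loss = torch.log(1.0 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Even this small sketch hints at the difficulty: the generator's loss depends entirely on how the current discriminator responds, so the two optimizers chase a moving target rather than descending a fixed objective.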
Perhaps the most frequently cited limitation is mode collapse. This occurs when the generator discovers a few specific outputs (modes) that are particularly effective at fooling the current discriminator. Instead of learning to represent the full diversity of the training data distribution, the generator collapses to producing only these limited variations.
Imagine training a GAN on a dataset of handwritten digits (0-9). Mode collapse might manifest as the generator producing only convincing images of '1's and '7's, completely ignoring the other digits. While these generated samples might look realistic individually, the generator has failed to capture the underlying data distribution. The vanilla GAN objective doesn't inherently penalize this lack of diversity strongly enough, especially if the discriminator can be easily fooled by these few modes.
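One simple way to make this failure visible is to classify a large batch of generated samples with an existing digit classifier and look at the class histogram. The sketch below assumes a trained generator `G` (as in the earlier snippet) and a hypothetical pretrained MNIST classifier `digit_classifier`; both names are assumptions for illustration:

```python
import torch

@torch.no_grad()
def mode_histogram(G, digit_classifier, latent_dim=64, n_samples=10_000):
    """Fraction of generated samples that the classifier assigns to each digit."""
    z = torch.randn(n_samples, latent_dim)
    fake = G(z)                                   # flattened images, as in the sketch above
    preds = digit_classifier(fake).argmax(dim=1)  # predicted digit per sample
    counts = torch.bincount(preds, minlength=10)
    return counts.float() / n_samples

# A generator that has captured the data distribution spreads its mass roughly
# evenly across the ten digits; under mode collapse, most of the mass piles
# onto one or two classes (for example only '1's and '7's).
```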
A significant practical issue with vanilla GANs is the lack of a reliable metric to track training progress. The generator and discriminator losses, derived from the minimax objective, often oscillate during training and do not consistently correlate with the perceptual quality or diversity of the generated samples.
This makes it difficult to know when to stop training, or to judge from the loss values alone how well different hyperparameters or architectural choices are working. Visual inspection becomes the primary tool, which is subjective and time-consuming. This limitation highlights the need for more robust evaluation metrics, such as the Inception Score (IS) and Fréchet Inception Distance (FID), which we will discuss in Chapter 5.
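A minimal sketch of the monitoring approach described above: the raw losses are logged, but a fixed-noise sample grid is saved periodically, because the losses alone do not indicate sample quality. It reuses `G`, `latent_dim`, and `train_step` from the earlier sketch; `dataloader` and the 28x28 grayscale image shape are assumptions:

```python
import torch
import torchvision

fixed_z = torch.randn(64, latent_dim)  # fixed noise so grids are comparable across steps

for step, real_batch in enumerate(dataloader):   # `dataloader` is assumed to exist
    d_loss, g_loss = train_step(real_batch)
    if step % 500 == 0:
        with torch.no_grad():
            grid = G(fixed_z).view(-1, 1, 28, 28)  # assumes 28x28 grayscale images
        torchvision.utils.save_image(grid, f"samples_{step:06d}.png", normalize=True)
        # The losses often oscillate and tell us little; the saved grids are
        # what we actually inspect for quality and diversity.
        print(f"step {step}: d_loss={d_loss:.3f}, g_loss={g_loss:.3f}")
```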
The standard GAN framework generates samples from a random noise vector z. While different input vectors z produce different outputs, there's no straightforward way to control which specific features or types of output are generated. The mapping from the latent space to the data space is complex and often entangled, meaning changing one dimension in z might affect multiple unrelated features in the output image. This lack of direct control motivates the development of conditional GANs (cGANs) and techniques for learning disentangled representations, covered in Chapter 4.
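The contrast shows up directly in how sampling is invoked. In the unconditional case there is simply no argument through which to request a specific output; a conditional generator (sketched below as the hypothetical `cond_G`, of the kind covered in Chapter 4) would accept the desired class as an extra input:

```python
import torch
import torch.nn.functional as F

# Unconditional sampling: the only input is random noise, so there is no
# way to request a particular digit or attribute.
z = torch.randn(16, latent_dim)
samples = G(z)  # which digits appear is entirely up to the learned, entangled mapping

# A conditional generator (hypothetical `cond_G`, previewed here) would take
# the desired class as an additional input, e.g. a one-hot label:
labels = F.one_hot(torch.full((16,), 3), num_classes=10).float()  # request digit '3'
# samples = cond_G(torch.cat([z, labels], dim=1))
```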
These limitations – training instability, mode collapse, poor progress indicators, and lack of control – demonstrated that while the core GAN idea was powerful, the initial formulation required significant enhancements. The subsequent chapters explore the solutions developed to address these very problems, leading to the more stable, capable, and controllable GANs used today.