The standard GAN min-max game, while elegant, often suffers from training instability. The generator and discriminator might fail to converge, gradients can vanish, or the generator might only learn to produce a limited variety of outputs (mode collapse). A significant contributor to these problems is the nature of the Jensen-Shannon (JS) divergence that the original GAN objective implicitly minimizes. When the distributions of real and generated data are disjoint or have negligible overlap (which happens frequently early in training or with high-dimensional data), the JS divergence saturates, leading to near-zero gradients for the generator.
To address these shortcomings, researchers have proposed alternative loss functions that provide more stable gradient signals and better reflect the distance between the real and generated data distributions. Let's examine three prominent alternatives: Wasserstein GAN (WGAN), WGAN with Gradient Penalty (WGAN-GP), and Least Squares GAN (LSGAN).
The core idea behind WGAN is to replace the JS divergence with the Wasserstein-1 distance, also known as the Earth Mover's distance (W1). Intuitively, if you imagine the real and generated distributions as piles of earth, W1 measures the minimum "cost" (amount of earth multiplied by the distance moved) to transform one pile into the other. Unlike JS divergence, the Wasserstein distance provides a meaningful gradient even when the distributions don't overlap significantly, making it a more suitable metric for GAN training.
Calculating $W_1$ directly is intractable. However, the Kantorovich-Rubinstein duality provides an equivalent formulation that can be approximated in practice:
$$W_1(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})]$$

Here, $P_r$ is the real data distribution, $P_g$ is the generator's distribution (with $\tilde{x} = G(z)$), and the supremum is taken over all 1-Lipschitz functions $f$. A function $f$ is 1-Lipschitz if $|f(x_1) - f(x_2)| \le \|x_1 - x_2\|$ for all $x_1, x_2$.
In the WGAN framework, the discriminator (now often called a "critic," denoted $D_w$ or $f_w$) is trained to approximate this optimal function $f$. The critic outputs a scalar score (not a probability) reflecting the "realness" of the input. The WGAN objective becomes:
$$\min_G \max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P_r}[D_w(x)] - \mathbb{E}_{z \sim p(z)}[D_w(G(z))]$$

Here, $w$ represents the critic's parameters. The constraint $\|f\|_L \le 1$ (the Lipschitz constraint) is crucial. The original WGAN paper proposed enforcing it by clipping the critic's weights $w$ to a small fixed range, such as $[-c, c]$, after each gradient update.
- **Critic update:** maximize $\mathbb{E}_{x \sim P_r}[D_w(x)] - \mathbb{E}_{z \sim p(z)}[D_w(G(z))]$. This pushes the score for real samples up and the score for fake samples down.
- **Generator update:** minimize $-\mathbb{E}_{z \sim p(z)}[D_w(G(z))]$. This is equivalent to maximizing the critic's score for fake samples, encouraging the generator to produce samples that the critic scores higher (i.e., considers more "real").
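Below is a minimal PyTorch sketch of one training iteration under these update rules. The names `critic`, `generator`, `real_batch`, `latent_dim`, `critic_opt`, and `gen_opt` are assumptions standing in for your own models, data, and optimizers; only the loss terms and the clipping step reflect the WGAN recipe itself.

```python
import torch

c = 0.01  # clipping range; the original WGAN paper uses 0.01 as a default
z = torch.randn(real_batch.size(0), latent_dim)

# Critic update: maximize E[D(x)] - E[D(G(z))] by minimizing its negation.
critic_opt.zero_grad()
critic_loss = -(critic(real_batch).mean() - critic(generator(z).detach()).mean())
critic_loss.backward()
critic_opt.step()

# Enforce the Lipschitz constraint by clipping every critic weight to [-c, c].
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)

# Generator update: minimize -E[D(G(z))].
gen_opt.zero_grad()
gen_loss = -critic(generator(z)).mean()
gen_loss.backward()
gen_opt.step()
```

In practice, the critic is updated several times (the original paper uses five) for each generator update, so that it stays close to its optimum.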
While WGAN with weight clipping often leads to more stable training and helps prevent mode collapse compared to the standard GAN, weight clipping is a somewhat crude way to enforce the Lipschitz constraint. It can lead to:

- Capacity underuse: clipped critics tend to learn overly simple functions that fail to capture the full structure of the data distribution.
- Vanishing or exploding gradients: if the clipping range $c$ is too small, gradients shrink as they pass through the critic's layers; if it is too large, they can blow up.
WGAN-GP addresses the issues of weight clipping by proposing a more direct way to enforce the Lipschitz constraint: penalizing the gradient norm of the critic with respect to its input. A differentiable function is 1-Lipschitz if and only if its gradients have a norm of at most 1 everywhere. Instead of enforcing this strictly, WGAN-GP adds a penalty term to the critic's loss that encourages this condition.
The penalty focuses on points sampled between the real and generated distributions. For a real sample $x$ and a generated sample $\tilde{x} = G(z)$, an interpolated sample $\hat{x}$ is created:
$$\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$$

where $\epsilon$ is sampled uniformly from $U[0, 1]$. The gradient penalty term is then:
$$\lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} D_w(\hat{x})\|_2 - 1 \right)^2 \right]$$

Here, $P_{\hat{x}}$ is the distribution of interpolated samples, $\|\cdot\|_2$ is the L2 (Euclidean) norm, and $\lambda$ is a penalty coefficient (typically set to 10). This term penalizes deviations of the gradient norm from 1 at these interpolated points.
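In code, this penalty can be computed with automatic differentiation. The following PyTorch sketch (the function name and arguments are our own) interpolates between real and fake batches and penalizes the critic's gradient norm at those points:

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """Sketch of the WGAN-GP penalty term; `real` and `fake` are batches of
    the same shape, and `critic` returns one scalar score per sample."""
    # One epsilon ~ U[0, 1] per sample, broadcast across the remaining dims.
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake.detach()).requires_grad_(True)

    scores = critic(x_hat)
    # Gradient of the critic's output with respect to the interpolated inputs.
    grads, = torch.autograd.grad(
        outputs=scores,
        inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself is differentiable
    )
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```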
The WGAN-GP critic loss becomes:
$$\mathcal{L}_{\text{Critic}} = \mathbb{E}_{\tilde{x} \sim P_g}[D_w(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D_w(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} D_w(\hat{x})\|_2 - 1 \right)^2 \right]$$

The critic aims to minimize this loss (note the sign change compared to the WGAN maximization formulation, which is common in implementations). The generator loss remains the same as in WGAN, aiming to maximize the critic's score for generated samples:
$$\mathcal{L}_{\text{Generator}} = -\mathbb{E}_{\tilde{x} \sim P_g}[D_w(\tilde{x})]$$

WGAN-GP typically results in significantly more stable training than the original WGAN and standard GANs, often producing higher-quality samples without requiring careful tuning of clipping parameters. It has become a widely adopted baseline for GAN training. One important implementation detail is to remove batch normalization from the critic: batch normalization makes each sample's output depend on the rest of the batch, which conflicts with the per-sample gradient penalty. Use layer normalization or another alternative if the critic needs normalization.
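Putting the pieces together, one loss computation might look like the following sketch, reusing the `gradient_penalty` helper from above (again, `critic`, `generator`, `real_batch`, and `latent_dim` are placeholder names):

```python
z = torch.randn(real_batch.size(0), latent_dim)
fake_batch = generator(z)

# Critic loss (minimized): E[D(fake)] - E[D(real)] + gradient penalty.
# fake_batch is detached so the critic update does not touch the generator;
# gradient_penalty detaches its `fake` argument internally.
critic_loss = (critic(fake_batch.detach()).mean()
               - critic(real_batch).mean()
               + gradient_penalty(critic, real_batch, fake_batch))

# Generator loss (minimized): -E[D(fake)], computed on a non-detached batch
# so that gradients flow back into the generator.
gen_loss = -critic(fake_batch).mean()
```

As with WGAN, the critic is typically optimized for several steps per generator step; the WGAN-GP paper uses five.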
LSGAN tackles training instability from a different angle. It observes that the sigmoid cross-entropy loss function used in the standard GAN discriminator can lead to vanishing gradients. When the discriminator successfully classifies a generated sample as fake (outputting a probability close to 0), the gradient flowing back to the generator becomes very small, slowing down learning.
LSGAN replaces the sigmoid cross-entropy loss with a least squares (mean squared error) objective. The LSGAN objectives are:
Discriminator Loss:
$$\mathcal{L}_D^{\text{LSGAN}} = \frac{1}{2}\mathbb{E}_{x \sim P_r}\left[(D(x) - b)^2\right] + \frac{1}{2}\mathbb{E}_{z \sim p(z)}\left[(D(G(z)) - a)^2\right]$$

Generator Loss:
$$\mathcal{L}_G^{\text{LSGAN}} = \frac{1}{2}\mathbb{E}_{z \sim p(z)}\left[(D(G(z)) - c)^2\right]$$

Here, $a$ and $b$ are the target labels for fake and real data, respectively, and $c$ is the value the generator wants the discriminator to output for fake data. A common choice is $a = 0$, $b = 1$, and $c = 1$.
The key idea is that the least squares loss penalizes samples even when they are classified correctly, as long as they lie far from the decision boundary. By minimizing $\mathcal{L}_G$, the generator tries to fool the discriminator by producing samples $\tilde{x}$ whose scores $D(\tilde{x})$ are close to $c$, the target value for real data. Because the loss grows quadratically with the distance from the target, gradients do not vanish as quickly as with sigmoid cross-entropy, leading to a more stable learning process and potentially higher-quality results. LSGAN is also generally simpler to implement than WGAN-GP, since it requires neither gradient penalties nor weight clipping.
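Here is a sketch of both losses in PyTorch, using the common choice $a = 0$, $b = c = 1$. The names `disc`, `generator`, `real_batch`, and `latent_dim` are placeholders, and `disc` is assumed to output raw (unbounded) scores rather than sigmoid probabilities:

```python
import torch
import torch.nn.functional as F

z = torch.randn(real_batch.size(0), latent_dim)
fake_batch = generator(z)

# Discriminator: push real scores toward b = 1 and fake scores toward a = 0.
real_scores = disc(real_batch)
fake_scores = disc(fake_batch.detach())
d_loss = 0.5 * F.mse_loss(real_scores, torch.ones_like(real_scores)) \
       + 0.5 * F.mse_loss(fake_scores, torch.zeros_like(fake_scores))

# Generator: push the discriminator's output for fakes toward c = 1.
gen_scores = disc(fake_batch)
g_loss = 0.5 * F.mse_loss(gen_scores, torch.ones_like(gen_scores))
```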
Experimentation is often necessary, but WGAN-GP and LSGAN provide powerful tools to overcome the notorious instability of GAN training, allowing you to train more complex models and achieve better results. The hands-on practical later in this chapter will guide you through implementing WGAN-GP.