As we saw in the previous section, using the Wasserstein-1 distance (W1) as the loss function for GANs offers theoretical advantages for training stability. The Kantorovich-Rubinstein duality formulates this distance as:
$$W_1(P_r, P_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})]$$
Here, the supremum is taken over all 1-Lipschitz functions $f$. In the context of WGANs, our discriminator (now often called a "critic") aims to approximate this function $f$. Therefore, a critical requirement for the WGAN objective to accurately approximate the Wasserstein distance is that the critic function $f_w$ (parameterized by weights $w$) must be 1-Lipschitz. This means its gradient norm should be at most 1 everywhere: $\|\nabla_x f_w(x)\|_2 \leq 1$.
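While enforcing this constraint is the hard part, checking it numerically is straightforward with automatic differentiation. Below is a minimal PyTorch sketch (the small `critic` architecture is a hypothetical stand-in) that estimates $\|\nabla_x f_w(x)\|_2$ at a batch of sample points:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in critic: any module mapping inputs to a scalar score works.
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

# Estimate the gradient norm ||grad_x f_w(x)||_2 at a batch of sample points.
x = torch.randn(128, 2, requires_grad=True)
scores = critic(x)

# Summing the scores yields per-sample input gradients in one backward pass,
# since each sample's score depends only on its own input.
(grads,) = torch.autograd.grad(outputs=scores.sum(), inputs=x)
grad_norms = grads.norm(2, dim=1)

# For a 1-Lipschitz critic, every entry should be at most 1.
print(f"max gradient norm: {grad_norms.max().item():.3f}")
```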
How can we enforce this constraint on a neural network during training? The original WGAN paper proposed a straightforward, albeit somewhat crude, method: weight clipping.
Weight clipping is a simple procedure applied after each gradient update to the critic's weights. Every weight $w_i$ in the critic network is constrained to lie within a small, fixed range $[-c, c]$, where $c$ is a small positive constant (e.g., 0.01).
The update step for a weight $w_i$ after a standard gradient update (e.g., using RMSProp or SGD; the original paper advises against momentum-based optimizers like Adam) looks like this:

$$w_i \leftarrow \text{clip}(w_i, -c, c) = \max(-c, \min(c, w_i))$$

This operation effectively "clips" any weight that strays outside the $[-c, c]$ interval back to the boundary.
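In code, this amounts to clamping every parameter in place after each critic update. The sketch below shows one critic step; the critic architecture and the `real`/`fake` batches are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

c = 0.01  # clipping constant from the original WGAN paper
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.RMSprop(critic.parameters(), lr=5e-5)  # RMSProp, per the paper

real = torch.randn(64, 2)  # stand-in batch from P_r
fake = torch.randn(64, 2)  # stand-in batch from P_g (generator output)

# One critic update: maximize E[f(real)] - E[f(fake)] by minimizing its negation.
loss = critic(fake).mean() - critic(real).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Weight clipping: clamp every parameter back into [-c, c] after the update.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)
```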
The intuition is that keeping the weights small indirectly restricts the possible gradients of the function $f_w$. A function $f_w$ is $K$-Lipschitz if it satisfies $|f_w(x_1) - f_w(x_2)| \leq K \|x_1 - x_2\|$. The magnitude of the weights influences how rapidly the output of the network can change with respect to its input, so the hope was that bounding the weights to a small range $[-c, c]$ would bound the Lipschitz constant $K$, ideally keeping it close to 1.
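To make this concrete for a single linear layer $f(x) = Wx$: the exact Lipschitz constant (in the Euclidean norm) is the spectral norm of $W$, and clipping every entry to $[-c, c]$ caps this norm at $c\sqrt{mn}$ for an $m \times n$ matrix, but does not control it precisely. A small sketch illustrating the effect:

```python
import torch

c = 0.01
W = torch.randn(64, 32)  # weights of a hypothetical linear layer f(x) = W x

# The spectral norm (largest singular value) is the layer's exact Lipschitz constant.
lip_before = torch.linalg.matrix_norm(W, ord=2).item()
lip_after = torch.linalg.matrix_norm(W.clamp(-c, c), ord=2).item()

print(f"Lipschitz constant before clipping: {lip_before:.3f}")
print(f"Lipschitz constant after clipping:  {lip_after:.3f}")  # <= c * sqrt(64 * 32) ≈ 0.45
```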
The choice of the clipping constant $c$ is crucial and sensitive. If $c$ is too large, the constraint is loose and it can take many updates for weights to reach their limits, slowing critic training; if $c$ is too small, gradients can vanish as they propagate through the network's layers. Finding a good value for $c$ often requires careful tuning for each specific problem and architecture.
While simple to implement, weight clipping introduces several significant problems:
Pathological Weight Distributions: Empirically, it's often observed that with weight clipping, a large fraction of the critic's weights tend to cluster exactly at the boundary values, $-c$ and $+c$. This suggests the network is not utilizing its full parameter space effectively and is being artificially constrained.
Gradient Issues: The hard clipping operation interacts poorly with gradient flow. If $c$ is small, gradients tend to vanish as they backpropagate through the clipped layers, hindering learning; if $c$ is large, gradients can still explode. Either way, training becomes sensitive to the choice of $c$.
Capacity Reduction: By forcing weights to be small, we limit the expressive power of the critic. The critic might struggle to learn the complex mappings necessary to accurately estimate the Wasserstein distance or provide informative gradients to the generator; in practice, clipping biases the critic toward functions simpler than the task may require.
Consider the effect on the critic's weights. Instead of a potentially smooth distribution, clipping forces many weights to accumulate at the boundaries $-c$ and $+c$.
Histogram showing how weight clipping (pink) can cause weights to pile up at the clipping boundaries (±c), compared to a smoother, more natural distribution (blue) without clipping.
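You can diagnose this clustering in your own critic by measuring how many weights sit at (or within numerical tolerance of) the boundaries. A sketch, with a freshly clipped network standing in for a trained critic:

```python
import torch
import torch.nn as nn

c = 0.01
critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)  # stand-in for the clipping applied over many training updates

# Pool all parameters and count how many ended up at the clipping boundaries.
weights = torch.cat([p.detach().flatten() for p in critic.parameters()])
at_boundary = (weights.abs() >= c - 1e-6).float().mean().item()
print(f"fraction of weights at the boundaries ±{c}: {at_boundary:.1%}")
```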
Because of these drawbacks, particularly the difficulty in tuning c and the potential for poor gradient flow, weight clipping is often avoided in modern WGAN implementations. It served as an initial proof-of-concept but has largely been superseded by more theoretically sound and practically effective methods for enforcing the Lipschitz constraint. The most prominent alternative, the gradient penalty (WGAN-GP), directly addresses the gradient norm requirement and will be explored in the next section.