As we've seen, the interplay between the generator (G) and the discriminator (D) during GAN training is a delicate balancing act. The standard approach uses the same learning rate for both networks. However, empirical observations and theoretical analysis suggest that this might not always be optimal. If the discriminator learns too quickly relative to the generator, it can easily distinguish real from fake samples early on, leading to vanishing gradients for the generator. Conversely, if the discriminator learns too slowly, the generator might not receive meaningful feedback to improve.
The Two Time-Scale Update Rule (TTUR), introduced by Heusel et al. (2017), offers a simple yet effective modification to the training dynamics: use different learning rates for the generator and discriminator optimizers. The idea stems from the analysis of two time-scale stochastic approximation algorithms, which shows that letting the two players learn at different speeds can lead to better convergence properties in saddle-point problems such as GAN training.
The core principle of TTUR is straightforward: use a higher learning rate for the discriminator than for the generator.
Let $\alpha_D$ be the learning rate for the discriminator's optimizer and $\alpha_G$ be the learning rate for the generator's optimizer. TTUR mandates that:

$$\alpha_D > \alpha_G$$

Commonly used ratios set $\alpha_D$ to be, for instance, 2 to 5 times larger than $\alpha_G$; the code example later in this section uses a 4:1 ratio.
Why does this seemingly small change help? A discriminator with a larger step size stays close to its optimal response to the current generator, so the feedback it passes back is accurate and informative. The generator, updating more slowly, shifts the discriminator's target only gradually, which dampens the oscillations that arise when both networks chase each other at full speed. This separation of time scales is precisely the condition under which stochastic approximation theory applies: the original TTUR analysis shows that, under suitable assumptions, GAN training with distinct learning rates converges to a local Nash equilibrium.
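In plain SGD notation, one TTUR iteration can be written as the pair of coupled updates below, where $\theta_D$ and $\theta_G$ are the discriminator and generator parameters and $L_D$, $L_G$ their respective losses. This is a sketch of the rule; the code that follows uses Adam rather than raw SGD:

$$\theta_D \leftarrow \theta_D - \alpha_D \, \nabla_{\theta_D} L_D(\theta_D, \theta_G), \qquad \theta_G \leftarrow \theta_G - \alpha_G \, \nabla_{\theta_G} L_G(\theta_D, \theta_G), \qquad \text{with } \alpha_D > \alpha_G$$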
Implementing TTUR is typically very simple in modern deep learning frameworks. It involves defining two separate optimizer instances, one for the discriminator's parameters and one for the generator's, each configured with its respective learning rate ($\alpha_D$ and $\alpha_G$).
# Example using PyTorch (loss computations left as placeholders)
from torch.optim import Adam

# discriminator and generator are assumed to be nn.Module instances
# defined earlier in the chapter.

# Define learning rates according to TTUR
learning_rate_D = 0.0004
learning_rate_G = 0.0001  # typically smaller than learning_rate_D

# Create separate optimizers, one per network
# Note: beta1 = 0.0 is often used together with TTUR
optimizer_D = Adam(discriminator.parameters(), lr=learning_rate_D, betas=(0.0, 0.9))
optimizer_G = Adam(generator.parameters(), lr=learning_rate_G, betas=(0.0, 0.9))

# --- Inside the training loop ---
# 1. Update the discriminator
optimizer_D.zero_grad()
# Placeholder: compute the discriminator loss (e.g., the WGAN-GP loss)
d_loss = compute_d_loss(real_batch, generator, discriminator)
d_loss.backward()
optimizer_D.step()

# 2. Update the generator
optimizer_G.zero_grad()
# Placeholder: compute the generator loss
g_loss = compute_g_loss(generator, discriminator)
g_loss.backward()
optimizer_G.step()
A simplified training loop structure showing separate optimizers with distinct learning rates for the discriminator and generator, following the TTUR principle.
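To make the placeholders concrete, here is a minimal, self-contained sketch of one TTUR training iteration. It assumes small MLPs and the standard non-saturating GAN loss rather than WGAN-GP; the module definitions, batch shapes, and dummy data are illustrative, not prescriptive:

import torch
from torch import nn
from torch.optim import Adam

latent_dim, data_dim, batch_size = 64, 2, 128

generator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

# TTUR: the discriminator's learning rate is 4x the generator's
optimizer_D = Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
optimizer_G = Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))
bce = nn.BCEWithLogitsLoss()

real_batch = torch.randn(batch_size, data_dim)  # stand-in for real data

# 1. Discriminator step (higher learning rate)
optimizer_D.zero_grad()
z = torch.randn(batch_size, latent_dim)
fake_batch = generator(z).detach()  # detach: no generator gradients here
d_loss = bce(discriminator(real_batch), torch.ones(batch_size, 1)) \
       + bce(discriminator(fake_batch), torch.zeros(batch_size, 1))
d_loss.backward()
optimizer_D.step()

# 2. Generator step (lower learning rate)
optimizer_G.zero_grad()
z = torch.randn(batch_size, latent_dim)
# Non-saturating loss: push the discriminator to label fakes as real
g_loss = bce(discriminator(generator(z)), torch.ones(batch_size, 1))
g_loss.backward()
optimizer_G.step()

A complete toy iteration: the only TTUR-specific ingredient is the pair of optimizers with different learning rates.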
TTUR is not mutually exclusive with other stabilization methods discussed in this chapter, such as Wasserstein loss, gradient penalties (WGAN-GP), or spectral normalization. In fact, it is frequently used in combination with these techniques. While methods like WGAN-GP and spectral normalization modify the loss function or network architecture to improve stability, TTUR adjusts the optimization dynamics directly through the learning rates. Combining these approaches often yields superior results compared to using any single technique in isolation.
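As an illustration of such a combination, the following sketch applies PyTorch's spectral normalization to the discriminator's layers while keeping the TTUR optimizer setup unchanged. The architectures shown are hypothetical examples:

from torch import nn
from torch.nn.utils import spectral_norm
from torch.optim import Adam

# Spectral normalization constrains each discriminator layer's Lipschitz constant
discriminator = nn.Sequential(
    spectral_norm(nn.Linear(2, 128)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(128, 1)),
)
generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))

# TTUR is applied exactly as before, independently of the normalization
optimizer_D = Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.9))
optimizer_G = Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.9))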
While TTUR provides a strong heuristic and theoretical basis for setting $\alpha_D > \alpha_G$, the specific values remain hyperparameters. Finding the optimal learning rates and their ratio might still require some experimentation based on the specific dataset, architecture, and other stabilization techniques being employed. However, TTUR provides a principled starting point, often simplifying the tuning process compared to balancing a single learning rate or ad-hoc update schedules. It represents a valuable tool for enhancing the stability and convergence speed of GAN training.
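One pragmatic way to run that experimentation is a small sweep over the ratio while holding the generator's rate fixed. The base rate and candidate ratios below are illustrative defaults, not recommendations from the TTUR paper:

from torch.optim import Adam

base_lr_G = 1e-4
candidate_ratios = [1, 2, 4, 5]  # alpha_D / alpha_G values to try

for ratio in candidate_ratios:
    lr_D = base_lr_G * ratio
    # discriminator and generator as defined earlier in this section
    optimizer_D = Adam(discriminator.parameters(), lr=lr_D, betas=(0.0, 0.9))
    optimizer_G = Adam(generator.parameters(), lr=base_lr_G, betas=(0.0, 0.9))
    # Train for a fixed budget with the loop shown above (omitted here)
    # and compare a validation metric such as FID across ratios.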