As introduced previously, the core challenge in applying standard GANs to discrete data like text lies in the sampling process. When the generator selects a specific word (a discrete token) from its output distribution, this selection is non-differentiable. Consequently, the gradient signal from the discriminator cannot flow back through the sampling step to update the generator's parameters using standard backpropagation. This breaks the typical GAN training mechanism.
Reinforcement Learning (RL) provides a powerful framework to circumvent this issue. Instead of relying on direct gradient flow through the generated output, we can reframe the generator's task as an RL problem: the state is the partial sequence generated so far, an action is the choice of the next token, the policy is the generator's conditional distribution over the vocabulary, and the reward comes from the discriminator's judgment of the completed sequence.
The goal of the generator (agent) is to learn a policy $\pi_\theta$ that maximizes the expected reward obtained from the discriminator. Crucially, RL algorithms like policy gradients allow updating the agent's policy parameters $\theta$ based on received rewards, even when the actions (token sampling) are discrete.
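To make this framing concrete, here is a minimal PyTorch sketch of autoregressive sampling viewed as a policy. The `generator` interface (token ids in, next-token logits out) is an assumption for illustration; the point is that each discrete draw is recorded together with its log-probability, which is what the policy gradient will later use.

```python
import torch

# Minimal sketch: autoregressive sampling viewed as an RL policy.
# `generator` is an assumed model mapping token ids (batch, t) to
# next-token logits (batch, vocab_size); the interface is illustrative.

def sample_sequence(generator, start_tokens, num_steps):
    """Sample num_steps tokens, keeping log pi_theta(a_t | s_t) for each action."""
    tokens = start_tokens                                  # state: tokens generated so far
    log_probs = []
    for _ in range(num_steps):
        logits = generator(tokens)                         # (batch, vocab_size)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                             # discrete, non-differentiable draw
        log_probs.append(dist.log_prob(action))            # differentiable w.r.t. theta
        tokens = torch.cat([tokens, action.unsqueeze(1)], dim=1)
    return tokens, torch.stack(log_probs, dim=1)           # (batch, 1+T), (batch, T)
```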
SeqGAN was one of the first successful applications of this RL perspective to GAN-based text generation. It directly employs the policy gradient theorem from RL to train the generator.
The discriminator ($D$) is trained conventionally, learning to distinguish between real text sequences from the training data and sequences generated by $G$. Its output, $D(Y)$, represents the probability that a complete sequence $Y = (y_1, \ldots, y_T)$ is real.
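As a sketch, the conventional discriminator update is a standard binary classification step. Here `discriminator` is assumed to map a batch of complete token sequences to realness probabilities in $(0, 1)$; names and interfaces are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of the conventional discriminator update: binary classification
# of real vs. generated sequences. `discriminator` is assumed to output
# D(Y) in (0, 1) for each sequence in the batch.

def discriminator_step(discriminator, d_optimizer, real_seqs, fake_seqs):
    d_optimizer.zero_grad()
    real_scores = discriminator(real_seqs)     # D(Y) for sequences from the data
    fake_scores = discriminator(fake_seqs)     # D(Y) for sequences sampled from G
    loss = (F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
            + F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores)))
    loss.backward()
    d_optimizer.step()
    return loss.item()
```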
For the generator (G), this discriminator output D(Y) serves as the reward signal. However, a reward is only available after generating a complete sequence. This poses a problem for training the generator, as it needs feedback during the sequence generation process to decide which intermediate actions (token choices) were good.
SeqGAN addresses this by using Monte Carlo (MC) search with rollouts. When the generator has produced a partial sequence $Y_{1:t} = (y_1, \ldots, y_t)$, it needs to estimate the expected future reward for taking the next action $y_{t+1}$. To do this, the current generator policy is used to "roll out" or complete the sequence multiple times starting from $Y_{1:t+1}$. Suppose $N$ rollouts are performed, resulting in $N$ complete sequences $\{Y_{1:T}^{\,1}, \ldots, Y_{1:T}^{\,N}\}$. The discriminator evaluates each of these complete sequences, and the average discriminator score provides an estimate of the action-value $Q(s_t, a_t = y_{t+1})$:
$$
Q_{D}^{G_\theta}(s = Y_{1:t},\, a = y_{t+1}) \approx \frac{1}{N} \sum_{n=1}^{N} D\!\left(Y_{1:T}^{\,n}\right)
$$
Here $Y_{1:T}^{\,n}$ is the $n$-th completed sequence starting with $Y_{1:t+1}$, generated using the current policy $G_\theta$. This Q-value represents the expected reward if we choose token $y_{t+1}$ at step $t$ and follow the current policy $G_\theta$ thereafter.
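A minimal sketch of this rollout-based estimate is shown below, reusing the same assumed `generator` and `discriminator` interfaces as above. The prefix passed in already ends with the action $y_{t+1}$ being evaluated.

```python
import torch

# Sketch of the Monte Carlo estimate of Q(s = Y_{1:t}, a = y_{t+1}):
# complete the prefix N times with the current policy and average D(Y).

@torch.no_grad()
def estimate_q(generator, discriminator, prefix, seq_len, n_rollouts=16):
    scores = []
    for _ in range(n_rollouts):
        seq = prefix.clone()                               # (batch, t+1), ends with y_{t+1}
        while seq.size(1) < seq_len:                       # roll out to full length T
            logits = generator(seq)
            next_tok = torch.distributions.Categorical(logits=logits).sample()
            seq = torch.cat([seq, next_tok.unsqueeze(1)], dim=1)
        scores.append(discriminator(seq))                  # D(Y^n) for this rollout
    return torch.stack(scores).mean(dim=0)                 # (batch,) estimated Q-value
```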
With this estimated action-value, the generator's parameters $\theta$ can be updated using the policy gradient:
$$
\nabla_\theta J(\theta) \approx \mathbb{E}_{Y_{1:T} \sim G_\theta}\!\left[ \sum_{t=1}^{T} Q_{D}^{G_\theta}(Y_{1:t-1}, y_t)\, \nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1}) \right]
$$
This update increases the probability of taking actions that lead to higher expected rewards (i.e., sequences the discriminator finds more realistic). The key insight is that the gradient is computed with respect to the log-probability of the action, $\nabla_\theta \log G_\theta(y_t \mid Y_{1:t-1})$, multiplied by the reward signal $Q$. This bypasses the non-differentiable sampling step.
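In code, this becomes a REINFORCE-style step: the per-token log-probabilities recorded during sampling are weighted by the estimated Q-values, which are treated as constants, so no gradient ever has to pass through the sampling operation. The names and shapes follow the sketches above and are assumptions.

```python
import torch

# Sketch of the policy-gradient generator update. `log_probs` holds
# log G_theta(y_t | Y_{1:t-1}) for each sampled token, `q_values` the
# rollout estimates; both are (batch, T) tensors.

def generator_step(g_optimizer, log_probs, q_values):
    g_optimizer.zero_grad()
    # q_values act as fixed weights on the differentiable log-probabilities,
    # so the discrete sampling step is never backpropagated through.
    loss = -(log_probs * q_values.detach()).sum(dim=1).mean()
    loss.backward()
    g_optimizer.step()
    return loss.item()
```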
SeqGAN Training Loop: The generator acts as a policy, sampling actions (tokens). Monte Carlo rollouts estimate the future reward (Q-value) for an action from a given state (partial sequence), using the discriminator's evaluation. This estimated reward guides the generator's update via the policy gradient method.
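Putting the pieces together, one adversarial iteration might look like the following sketch. It wires together the hypothetical helpers defined above; the data, optimizers, hyperparameters, and any MLE pretraining are assumed to be set up elsewhere.

```python
# Sketch of one adversarial iteration using the helpers sketched above.
# start_tokens, real_batch, seq_len, num_adversarial_steps, and the
# optimizers are assumed to be prepared elsewhere (e.g. after pretraining G).

for step in range(num_adversarial_steps):
    # Generator phase: sample sequences, estimate Q per action, policy-gradient update.
    seqs, log_probs = sample_sequence(generator, start_tokens, seq_len)
    q_values = torch.stack(
        [estimate_q(generator, discriminator, seqs[:, : t + 2], seqs.size(1))
         for t in range(seq_len)],
        dim=1)                                             # (batch, seq_len), one Q per action
    generator_step(g_optimizer, log_probs, q_values)

    # Discriminator phase: refresh D on new real and generated sequences.
    fake_seqs, _ = sample_sequence(generator, start_tokens, seq_len)
    discriminator_step(discriminator, d_optimizer, real_batch, fake_seqs)
```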
While effective, SeqGAN can suffer from high variance in the gradient estimates due to the Monte Carlo rollouts, potentially leading to unstable training. The quality of the learned generator also heavily depends on the quality and stability of the discriminator's reward signal.
RankGAN offers an alternative perspective, aiming to provide a potentially more stable learning signal by focusing on relative comparisons rather than absolute classification scores. It reframes the adversarial game: instead of a discriminator trying to assign an absolute probability of realness, it employs a ranker or comparator.
The core idea is that it might be easier and more informative to determine if one sequence is better (e.g., more realistic, higher quality according to some metric) than another, rather than assigning a precise numerical score to each sequence independently.
In RankGAN, the "discriminator" is replaced by a model trained to rank sequences, for example by scoring candidate sentences against a reference set of real data and learning to place human-written sequences above generated ones.
The generator (G) is then trained adversarially against this ranker. Its objective is to produce sequences that the ranker assigns a high rank to, ideally ranking them higher than (or at least comparable to) real sequences from the dataset. The loss function for the generator typically encourages it to "win" the ranking comparison against real samples.
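As an illustration of this idea (not the exact RankGAN objective), the sketch below uses a simple pairwise margin loss: the ranker is pushed to score real sequences above generated ones, and the gap in scores can serve as a reward for the generator in the same policy-gradient setup as before. `ranker` is an assumed model mapping a batch of sequences to scalar scores.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a ranking-based signal: a pairwise margin loss rather
# than the exact RankGAN formulation. `ranker` is an assumed model that
# maps a batch of sequences to quality scores of shape (batch,).

def ranker_loss(ranker, real_seqs, fake_seqs, margin=1.0):
    """Train the ranker to place real sequences above generated ones."""
    real_scores = ranker(real_seqs)
    fake_scores = ranker(fake_seqs)
    target = torch.ones_like(real_scores)              # "first argument should rank higher"
    return F.margin_ranking_loss(real_scores, fake_scores, target, margin=margin)

@torch.no_grad()
def ranking_reward(ranker, fake_seqs, real_seqs):
    """Reward for G: how close its samples come to out-ranking real ones."""
    return ranker(fake_seqs) - ranker(real_seqs)        # usable as Q in the policy gradient
```

Because this reward is defined by a comparison with real samples rather than an absolute classification score, it stays informative even when the generator is far behind, which is the intuition behind the stability argument below.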
By focusing on relative ordering, RankGAN can potentially avoid some instability issues associated with the absolute reward signal in SeqGAN. The learning signal might be smoother, especially early in training when the generator produces poor samples and the discriminator might saturate or provide noisy gradients.
Both SeqGAN and RankGAN represent significant steps in adapting GANs for discrete sequence generation tasks like text. They leverage principles from RL to overcome the non-differentiability problem: SeqGAN treats the discriminator's score as a reward and updates the generator with policy gradients and Monte Carlo rollouts, while RankGAN replaces the absolute classification signal with relative ranking comparisons.
These approaches come with their own set of considerations. The MC rollouts in SeqGAN introduce computational overhead and potential variance. RankGAN's effectiveness depends on designing and training a good ranker. Both methods often require careful tuning and can be more complex to implement than standard GANs for continuous data. Nevertheless, they paved the way for more advanced techniques in adversarial text generation and demonstrated the versatility of combining GANs with RL concepts. Implementing these models typically involves integrating components from both deep learning frameworks (for the generator and discriminator/ranker) and potentially RL libraries or custom policy gradient implementations.