Understanding how the language model's policy, denoted as πRL, evolves during the reinforcement learning phase is fundamental for successful alignment. Simply maximizing the reward signal from the Reward Model (RM) isn't enough; we need to ensure this optimization process leads to genuine improvements in helpfulness, harmlessness, and honesty, without unintended consequences. Analyzing the policy shift provides insights into the training dynamics, helps detect potential issues early, and validates that the alignment process is effective.
The primary motivations for tracking policy changes include noticing when the policy drifts too far from its reference model, confirming that rising reward scores reflect genuinely better outputs, and catching failure modes such as reward hacking before they become entrenched. Several metrics and techniques are used to analyze this shift.
The Kullback-Leibler (KL) divergence measures the difference between the probability distributions of tokens predicted by the current RL policy (πRL) and the reference policy (usually the initial SFT model, πSFT), given the same prompt. In PPO for RLHF, we explicitly penalize high KL divergence to prevent the policy from straying too far from the SFT model. Tracking the per-token KL divergence averaged over batches is a standard practice.
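As a concrete illustration, here is a minimal sketch of that per-token estimate, computed from the two models' logits for a sampled response. The function name and tensor shapes are illustrative rather than drawn from any particular library.

```python
import torch
import torch.nn.functional as F

def per_token_kl(policy_logits: torch.Tensor,
                 ref_logits: torch.Tensor,
                 response_ids: torch.Tensor) -> torch.Tensor:
    """Per-token KL estimate between the RL policy and the SFT reference.

    policy_logits, ref_logits: [batch, seq_len, vocab_size] logits produced by
    the RL policy and the frozen SFT reference for the same response tokens.
    response_ids: [batch, seq_len] token ids actually sampled from the policy.
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)

    # Log-probabilities of the tokens that were actually generated.
    gather_idx = response_ids.unsqueeze(-1)
    policy_token_logp = policy_logp.gather(-1, gather_idx).squeeze(-1)
    ref_token_logp = ref_logp.gather(-1, gather_idx).squeeze(-1)

    # log pi_RL(token) - log pi_SFT(token): a simple per-token KL estimator
    # that is unbiased when tokens are sampled from the RL policy. Its batch
    # mean is the quantity typically plotted and penalized during PPO training.
    return policy_token_logp - ref_token_logp

# Typical usage inside the training loop (names are illustrative):
# mean_kl = per_token_kl(policy_logits, ref_logits, response_ids).mean()
```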
A useful diagnostic is a plot of the mean per-token KL divergence between the RL policy and the SFT policy over training steps. Stable training usually shows a controlled increase, while rapid, unbounded growth might signal instability or excessive deviation from the base model.
A consistently low KL divergence might indicate that the policy isn't learning much or the KL penalty coefficient is too high. Conversely, a rapidly increasing KL divergence suggests the policy is changing significantly, which could be good if rewards are also increasing, but warrants investigation for potential reward hacking or loss of capabilities if rewards stagnate or outputs degrade.
Monitoring the average reward score assigned by the RM to the policy's generations is essential. We expect this score to increase over time as the policy learns to generate responses favored by the RM. Plotting the distribution of reward scores can also be informative.
Histograms of the reward scores assigned by the RM to policy generations at different stages of RL training are also worth plotting: a shift towards higher scores indicates successful optimization according to the RM.
A tightening distribution around high reward values generally indicates convergence. However, if the distribution shifts dramatically or develops unusual shapes, it might point towards the policy exploiting specific aspects of the RM, potentially indicating reward hacking.
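A minimal sketch of how these reward statistics might be logged, assuming Weights & Biases as the tracker; the project name, metric names, and the `rewards` variable are illustrative.

```python
import wandb

wandb.init(project="rlhf-policy-shift")  # project name is a placeholder

def log_reward_stats(rewards, step):
    """Log summary statistics and the full distribution of RM scores.

    `rewards` is assumed to be a list (or array) of scalar reward-model
    scores for the current batch of policy generations.
    """
    wandb.log(
        {
            "reward/mean": sum(rewards) / len(rewards),
            "reward/min": min(rewards),
            "reward/max": max(rewards),
            # Rendered as a histogram in the dashboard, so the shift of the
            # distribution towards higher scores (or unusual shapes hinting
            # at reward hacking) is easy to spot over training.
            "reward/distribution": wandb.Histogram(rewards),
        },
        step=step,
    )
```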
Quantitative metrics alone don't tell the whole story. Regularly sampling and manually inspecting the model's outputs for a fixed set of prompts throughout training is indispensable. Compare generations from the initial SFT model, earlier RL checkpoints, and the current policy on the same prompts.
Look for desired improvements (e.g., increased helpfulness, better instruction following, reduced harmfulness) but also be vigilant for regressions (e.g., decreased coherence, repetition, sycophancy, emergence of new failure modes). This qualitative feedback loop is critical for understanding how the policy is changing.
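One simple way to set up this comparison is to generate from the SFT reference and the latest RL checkpoint on a fixed prompt set and review the outputs side by side. In the sketch below, the model paths and the prompts are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fixed evaluation prompts, reused at every checkpoint so outputs stay comparable.
EVAL_PROMPTS = [
    "Explain why the sky is blue to a ten-year-old.",
    "Write a polite refusal to a request for medical dosage advice.",
]

def sample_responses(model_path, prompts, max_new_tokens=128):
    """Generate one sampled response per prompt from the given checkpoint."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    responses = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        responses.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return responses

# Placeholder paths for the SFT reference and the current RL checkpoint.
sft_samples = sample_responses("path/to/sft-model", EVAL_PROMPTS)
rl_samples = sample_responses("path/to/rl-checkpoint", EVAL_PROMPTS)

for prompt, sft_out, rl_out in zip(EVAL_PROMPTS, sft_samples, rl_samples):
    print(f"PROMPT: {prompt}\n--- SFT ---\n{sft_out}\n--- RL ---\n{rl_out}\n")
```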
Analyzing these metrics together provides a more complete picture than any single signal on its own: rising rewards accompanied by a controlled KL increase and improving samples suggest healthy optimization, whereas rising rewards alongside runaway KL growth or degrading outputs point towards reward hacking or capability loss.
Standard ML experiment tracking tools like Weights & Biases or TensorBoard are invaluable for logging and visualizing these metrics (KL divergence, reward scores, evaluation metrics) over time. Libraries like Hugging Face's TRL often provide built-in utilities for calculating and logging KL divergence and reward statistics during PPO training, simplifying the analysis process.
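If you are not relying on a library's built-in logging, a few lines with TensorBoard's SummaryWriter cover the basics of tracking KL and reward side by side; the log directory and tags below are illustrative.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/rlhf-policy-shift")  # path is a placeholder

def log_policy_shift(mean_kl, mean_reward, step):
    """Log the two central scalars together so their relationship
    (reward rising while KL stays controlled) is visible on one dashboard."""
    writer.add_scalar("policy_shift/mean_kl", mean_kl, step)
    writer.add_scalar("policy_shift/mean_reward", mean_reward, step)
```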
By systematically analyzing how the policy shifts during RL tuning using a combination of quantitative metrics and qualitative inspection, you can gain confidence that the RLHF process is achieving meaningful alignment and catch potential problems before they derail training or lead to poorly behaved models.