The Proximal Policy Optimization (PPO) phase in Reinforcement Learning from Human Feedback (RLHF) adjusts the language model's policy, πθ, to generate responses that maximize the expected reward from a learned reward model (RM). Optimizing solely for the RM score, however, carries significant risks. The policy can quickly drift into regions of policy space that yield high RM rewards but produce outputs that are nonsensical, repetitive, or stylistically inconsistent with the behavior established during the initial Supervised Fine-Tuning (SFT) stage. This is often described as the policy 'overfitting' to the reward model: exploiting its inaccuracies or blind spots (a form of reward hacking), or simply forgetting the fundamental language generation capabilities it had before this phase.
To mitigate this, PPO incorporates a penalty term based on the Kullback-Leibler (KL) divergence. The KL divergence, denoted as DKL(πθ∣∣πref), measures the difference between two probability distributions. In the context of RLHF, it quantifies how much the current policy πθ has deviated from a reference policy, πref. Typically, this reference policy is the model obtained after the SFT phase, let's call it πSFT.
The KL divergence between the current policy and the SFT policy for a given state (prompt) s and action (token) a is calculated as:
$$D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid s)\right) = \sum_{a} \pi_\theta(a \mid s)\, \log \frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{SFT}}(a \mid s)}$$

A low KL divergence indicates that the current policy's output distribution is similar to the SFT policy's distribution, while a high KL divergence indicates a substantial deviation.
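The sketch below shows how this per-token KL could be computed directly from the two models' logits. It is a minimal illustration, assuming both models expose logits of shape (batch, sequence length, vocabulary size); the function and variable names are placeholders, not a specific library API.

```python
import torch
import torch.nn.functional as F

def token_kl(policy_logits: torch.Tensor, sft_logits: torch.Tensor) -> torch.Tensor:
    """KL(pi_theta(.|s) || pi_SFT(.|s)) at each token position.

    Both inputs have shape (batch, seq_len, vocab_size); the result
    has shape (batch, seq_len), one KL value per position.
    """
    log_p_theta = F.log_softmax(policy_logits, dim=-1)  # log pi_theta(a|s)
    log_p_sft = F.log_softmax(sft_logits, dim=-1)       # log pi_SFT(a|s)
    p_theta = log_p_theta.exp()
    # Sum over the vocabulary: sum_a pi_theta(a|s) * (log pi_theta(a|s) - log pi_SFT(a|s))
    return (p_theta * (log_p_theta - log_p_sft)).sum(dim=-1)

# Example with random logits standing in for real model outputs.
kl = token_kl(torch.randn(2, 5, 32000), torch.randn(2, 5, 32000))
print(kl.shape)  # torch.Size([2, 5])
```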
The core idea is to augment the PPO objective function. Instead of purely maximizing the expected advantage (which relates to the reward), we maximize a modified objective that includes a penalty proportional to the KL divergence:
$$\text{Objective} \approx \mathbb{E}_t\!\left[\text{Reward}_t\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_{\mathrm{SFT}}(\cdot \mid s_t)\right)$$

Where Reward_t is the reward signal derived from the reward model at step t, β is a coefficient controlling the strength of the KL penalty, and the D_KL term measures how far the current policy's token distribution at state s_t has moved from the SFT policy's distribution.
This KL penalty acts as a regularization term. It discourages the policy πθ from deviating too drastically from the SFT policy πSFT during optimization. By penalizing large changes in the output probability distribution for each token, it helps ensure that the model retains the general language fluency, knowledge, and stylistic characteristics acquired during the SFT phase, even as it adapts to maximize the reward signal.
In practice, during the PPO training loop, the penalty is typically applied per token: the log-probabilities of the sampled tokens are computed under both πθ and πSFT, their difference serves as a per-token estimate of the KL contribution, and β times this estimate is subtracted from the reward signal before advantages are computed.
The reference policy πSFT remains fixed throughout the PPO training process; its weights are not updated. It serves as a stable anchor point representing the behavior learned from the initial supervised dataset.
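A minimal sketch of this per-token reward shaping is shown below. It assumes the per-response RM score is credited at the final generated token and the per-token KL is approximated by the log-probability difference on the sampled token, which is one common convention; the function and variable names are illustrative, and the log-probabilities under πSFT are assumed to be computed with the reference model frozen (evaluation mode, no gradients).

```python
import torch

def penalized_rewards(rm_scores, logprobs_theta, logprobs_sft, beta):
    """Combine reward model scores with a per-token KL penalty.

    rm_scores:      (batch,) scalar RM score for each full response.
    logprobs_theta: (batch, seq_len) log pi_theta(a_t | s_t) of sampled tokens.
    logprobs_sft:   (batch, seq_len) log pi_SFT(a_t | s_t) of the same tokens
                    (computed with the frozen reference model).
    beta:           KL coefficient.
    """
    # Single-sample estimate of the per-token KL contribution.
    kl_per_token = logprobs_theta - logprobs_sft   # (batch, seq_len)
    rewards = -beta * kl_per_token                 # penalty applied at every token
    rewards[:, -1] += rm_scores                    # RM score credited at the final token
    return rewards

# Toy example: batch of 2 responses, 4 generated tokens each.
rm_scores = torch.tensor([1.3, -0.4])
lp_theta = torch.randn(2, 4).clamp(max=0)  # stand-ins for real log-probs
lp_sft = torch.randn(2, 4).clamp(max=0)
print(penalized_rewards(rm_scores, lp_theta, lp_sft, beta=0.1))
```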
The choice of the KL coefficient β is important for balancing reward optimization against stability. PPO begins with πSFT as the starting policy; a low β then allows larger update steps toward high-reward regions, risking instability, while a high β restricts those steps and keeps the policy close to πSFT, at the cost of slower reward improvement.
Finding an appropriate value for β often requires experimentation. Furthermore, adaptive KL controllers are commonly used. These controllers dynamically adjust β during training based on the observed KL divergence values in each batch. The goal is to keep the actual DKL(πθ∣∣πSFT) within a predefined target range (e.g., maintain an average KL of 6 nats). If the observed KL exceeds the target, β is increased to strengthen the penalty; if it falls below the target, β is decreased to allow more optimization. Libraries like Hugging Face's TRL provide implementations of such adaptive KL controllers.
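The following sketch shows a simple proportional controller in the spirit of such adaptive schemes. The update rule and hyperparameter values here are illustrative choices, not the API or defaults of any particular library.

```python
class AdaptiveKLController:
    """Adjust beta so the observed KL stays near a target value.

    A proportional controller: if the measured KL exceeds the target,
    beta grows; if it falls below, beta shrinks.
    """

    def __init__(self, init_beta: float, target_kl: float, horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl  # e.g. 6.0 nats
        self.horizon = horizon      # controls how quickly beta reacts

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to avoid overly aggressive adjustments.
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

# Usage: after each PPO batch, feed in that batch's mean KL.
controller = AdaptiveKLController(init_beta=0.2, target_kl=6.0)
beta = controller.update(observed_kl=8.5, n_steps=256)  # KL above target -> beta increases
```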
In summary, the KL divergence penalty is a mechanism in PPO for RLHF. It prevents the language model policy from diverging too far from the initial SFT model while optimizing for human preferences encoded in the reward model. This promotes training stability, preserves desirable characteristics of the base model, and provides a tunable balance between reward maximization and policy constraint.