In the PPO phase of RLHF, our primary objective is to adjust the language model's policy, πθ, to generate responses that maximize the expected reward given by the learned reward model (RM). However, optimizing solely for the RM score presents significant risks. The policy might rapidly shift into regions of the policy space that yield high rewards according to the RM but produce outputs that are nonsensical, repetitive, or stylistically inconsistent with the desired behavior learned during Supervised Fine-Tuning (SFT). This phenomenon can be viewed as the policy "overfitting" to the reward model, potentially exploiting its inaccuracies or limitations (a form of reward hacking), or simply forgetting the fundamental language generation capabilities it previously possessed.
To mitigate this, PPO incorporates a penalty term based on the Kullback-Leibler (KL) divergence. The KL divergence, denoted as DKL(πθ∣∣πref), measures the difference between two probability distributions. In the context of RLHF, it quantifies how much the current policy πθ has deviated from a reference policy, πref. Typically, this reference policy is the model obtained after the SFT phase, let's call it πSFT.
The KL divergence between the current policy's and the SFT policy's token distributions at a given state s (the prompt plus the tokens generated so far) is calculated as:
$$
D_{KL}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{SFT}(\cdot \mid s)\big) = \sum_{a} \pi_\theta(a \mid s)\,\log\frac{\pi_\theta(a \mid s)}{\pi_{SFT}(a \mid s)}
$$

A low KL divergence indicates that the current policy's output distribution is similar to the SFT policy's distribution, while a high KL divergence signifies a substantial deviation.
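As a concrete illustration, the short sketch below computes this divergence exactly for a single state, assuming PyTorch; the logits and the tiny vocabulary are hypothetical values chosen only to make the sum easy to follow.

```python
import torch
import torch.nn.functional as F

# Hypothetical next-token logits over a tiny 4-token vocabulary for one state s,
# from the current policy and from the frozen SFT policy.
policy_logits = torch.tensor([2.0, 0.5, -1.0, 0.3])
sft_logits = torch.tensor([1.8, 0.7, -0.9, 0.2])

# Turn logits into probability distributions over the vocabulary.
policy_probs = F.softmax(policy_logits, dim=-1)
sft_probs = F.softmax(sft_logits, dim=-1)

# Exact KL: sum over actions a of pi_theta(a|s) * log(pi_theta(a|s) / pi_SFT(a|s)).
kl = torch.sum(policy_probs * (policy_probs.log() - sft_probs.log()))
print(f"KL(pi_theta || pi_SFT) = {kl.item():.4f} nats")
```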
The core idea is to augment the PPO objective function. Instead of purely maximizing the expected advantage (which relates to the reward), we maximize a modified objective that includes a penalty proportional to the KL divergence:
$$
\text{Objective} \approx \mathbb{E}_t\big[\text{Reward}_t\big] - \beta\, D_{KL}\big(\pi_\theta(\cdot \mid s_t)\,\|\,\pi_{SFT}(\cdot \mid s_t)\big)
$$

Where:

- Reward_t is the score derived from the reward model at timestep t,
- β (the KL coefficient) is a hyperparameter controlling the strength of the penalty,
- DKL(πθ(⋅∣st)∣∣πSFT(⋅∣st)) measures how far the current policy's token distribution at state st has drifted from the SFT policy's.
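To make the penalty concrete, here is a toy numeric check; the reward, KL value, and β values are all illustrative, not recommended settings.

```python
# Suppose the reward model scores the current response at 1.2 and the measured
# KL divergence from the SFT policy at this step is 4.0 nats.
reward = 1.2
kl = 4.0

for beta in (0.01, 0.1, 0.5):
    objective = reward - beta * kl
    print(f"beta={beta}: objective = {objective:+.2f}")

# As beta grows, the same deviation from the SFT policy costs more,
# pushing the optimizer to stay closer to pi_SFT.
```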
This KL penalty acts as a regularization term. It discourages the policy πθ from deviating too drastically from the SFT policy πSFT during optimization. By penalizing large changes in the output probability distribution for each token, it helps ensure that the model retains the general language fluency, knowledge, and stylistic characteristics acquired during the SFT phase, even as it adapts to maximize the reward signal.
In practice, during the PPO training loop, the penalty is applied at the token level (a sketch follows the next paragraph):

- The current policy πθ generates a response to a sampled prompt.
- The log-probabilities of the generated tokens are computed under both πθ and πSFT.
- A per-token KL estimate, typically log πθ(at∣st) − log πSFT(at∣st), is scaled by β and subtracted from the reward before advantages are computed for the PPO update.
The reference policy πSFT remains fixed throughout the PPO training process; its weights are not updated. It serves as a stable anchor point representing the behavior learned from the initial supervised dataset.
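The following sketch shows this per-token reward shaping, assuming PyTorch. The log-probability tensors and the reward-model score are hypothetical; in a real loop they would come from forward passes of the current policy and of the frozen SFT reference model (the latter wrapped in `torch.no_grad()`, since its weights are never updated).

```python
import torch

beta = 0.1  # KL coefficient (illustrative value)

# Hypothetical per-token log-probabilities of the sampled tokens in one response,
# under the current policy and under the frozen SFT reference policy.
logprobs_policy = torch.tensor([-1.2, -0.8, -2.1, -0.5])
logprobs_sft = torch.tensor([-1.3, -1.0, -1.9, -0.6])

# Common per-token KL estimate used in RLHF: log pi_theta(a|s) - log pi_SFT(a|s).
kl_per_token = logprobs_policy - logprobs_sft

# The reward model typically scores the full response; assign it to the final token.
rm_score = 1.5
rewards = torch.zeros_like(kl_per_token)
rewards[-1] = rm_score

# Shape the rewards with the KL penalty before computing advantages for the PPO update.
shaped_rewards = rewards - beta * kl_per_token
print(shaped_rewards)
```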
The choice of the KL coefficient β is important for balancing exploration and stability.
Figure: Policy updates during PPO. πSFT is the starting policy. A low β allows larger steps toward high-reward regions, risking instability; a high β restricts steps, keeping the policy close to πSFT.
Finding an appropriate value for β often requires experimentation. Furthermore, adaptive KL controllers are commonly used. These controllers dynamically adjust β during training based on the observed KL divergence values in each batch. The goal is to keep the actual DKL(πθ∣∣πSFT) within a predefined target range (e.g., maintain an average KL of 6 nats). If the observed KL exceeds the target, β is increased to strengthen the penalty; if it falls below the target, β is decreased to allow more optimization. Libraries like Hugging Face's TRL provide implementations of such adaptive KL controllers.
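Below is a minimal sketch of such a controller, following the proportional scheme from Ziegler et al. (2019) on which TRL's adaptive controller is based; the initial coefficient, target KL, and horizon values are illustrative.

```python
class AdaptiveKLController:
    """Adaptive KL coefficient using the proportional update from Ziegler et al. (2019).
    Default values here are illustrative, not recommended settings."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Proportional error between observed and target KL, clipped for stability.
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        # Increase beta when KL is above target, decrease it when below.
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta


# Example: a batch whose average KL overshoots the 6-nat target raises beta,
# while a batch below the target lowers it.
controller = AdaptiveKLController()
print(controller.update(observed_kl=9.0, n_steps=256))  # beta increases
print(controller.update(observed_kl=3.0, n_steps=256))  # beta decreases
```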
In summary, the KL divergence penalty is a key stabilizing mechanism in the PPO phase of RLHF. It prevents the language model policy from diverging too far from the initial SFT model while optimizing for human preferences encoded in the reward model. This promotes training stability, preserves desirable characteristics of the base model, and provides a tunable balance between reward maximization and policy constraint.