Having trained a reward model, rϕ(x,y), to estimate human preferences for different responses y given a prompt x, the next step in the Reinforcement Learning from Human Feedback (RLHF) process is to use this signal to fine-tune the language model itself. We want to adjust the parameters θ of our language model policy, πθ(y∣x), so that it generates responses that receive high scores from the reward model, effectively aligning the model's behavior with the learned preferences. Proximal Policy Optimization (PPO) is the most commonly used algorithm for this policy optimization phase in RLHF.
PPO is an on-policy reinforcement learning algorithm designed to achieve stable and efficient policy updates. In the context of LLMs, "on-policy" means that the updates are based on data generated by the current version of the policy πθ we are trying to improve. The core idea is to maximize the expected reward predicted by rϕ while ensuring the updated policy πθ does not stray too far from a reference policy, typically the initial supervised fine-tuned (SFT) model, denoted πref. This constraint matters for two reasons: it prevents the model from exploiting imperfections in the reward model to produce degenerate, high-scoring but repetitive or nonsensical text (a failure mode often called "reward hacking"), and it helps preserve the model's general language capabilities.
The RLHF PPO Objective Function
The optimization process in RLHF typically involves maximizing an objective function that combines the reward signal with a penalty term. For a given prompt x sampled from a dataset D, and a response y sampled from the current policy πθ(y∣x), the objective for a single step can be formulated as:

objective(θ) = E_{x∼D, y∼πθ(y∣x)} [ rϕ(x,y) − β · KL(πθ(y∣x) ∥ πref(y∣x)) ]

The two terms inside the expectation are:
Reward Term (rϕ(x,y)): This is the score assigned by the trained reward model to the prompt-response pair (x,y). Maximizing this term directly encourages the policy πθ to generate responses that are predicted to be preferred by humans.
KL Penalty Term: This term measures the Kullback-Leibler (KL) divergence between the current policy πθ and the reference policy πref. The KL divergence quantifies how much the probability distribution of responses generated by πθ differs from that of πref. The coefficient β controls the strength of this penalty:
A high β forces the updated policy to stay very close to the original SFT model, potentially limiting alignment gains.
A low β allows the policy to deviate more significantly, potentially leading to better alignment but risking degradation in language quality or coherence if the reward model is imperfect.
The reference policy πref is usually kept frozen during PPO training. Its role is crucial: it acts as an anchor, preventing the optimized policy πθ from drifting too far into regions of the policy space that might yield high reward signals but correspond to unnatural or repetitive language.
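To make these two terms concrete, here is a minimal sketch of the per-sequence KL-penalized reward, assuming per-token log-probabilities for the sampled response are already available from both the current policy and the frozen reference model. The tensor names and the default β value are illustrative, not taken from any particular library.

```python
import torch

def kl_penalized_reward(
    reward: torch.Tensor,           # (batch,) scores r_phi(x, y) from the reward model
    policy_logprobs: torch.Tensor,  # (batch, seq_len) log pi_theta(y_t | x, y_<t)
    ref_logprobs: torch.Tensor,     # (batch, seq_len) log pi_ref(y_t | x, y_<t)
    mask: torch.Tensor,             # (batch, seq_len) 1 for response tokens, 0 for padding
    beta: float = 0.1,              # KL coefficient beta (illustrative value)
) -> torch.Tensor:
    # Per-token log-ratio; summing over the response tokens gives a single-sample
    # estimate of KL(pi_theta || pi_ref) for the generated sequence y.
    log_ratio = (policy_logprobs - ref_logprobs) * mask
    kl_estimate = log_ratio.sum(dim=-1)  # (batch,)
    # Per-sample objective: r_phi(x, y) - beta * KL(pi_theta || pi_ref)
    return reward - beta * kl_estimate
```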
In practice, PPO includes additional components like value functions and advantage estimation to stabilize training, but the core objective remains balancing the reward maximization against the KL divergence constraint. The full PPO algorithm uses techniques like clipping the policy ratio to prevent excessively large updates in a single step, contributing further to stability.
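The ratio clipping mentioned above can be sketched as follows: the probability ratio between the policy being updated and the policy that generated the rollout is clamped to a small interval, so a single gradient step cannot move the policy arbitrarily far. Advantage estimates are assumed to be given, and the function name, tensor names, and default clip range are illustrative.

```python
import torch

def ppo_clipped_loss(
    new_logprobs: torch.Tensor,  # (batch, seq_len) log-probs under the policy being updated
    old_logprobs: torch.Tensor,  # (batch, seq_len) log-probs recorded at rollout time
    advantages: torch.Tensor,    # (batch, seq_len) advantage estimates
    mask: torch.Tensor,          # (batch, seq_len) 1 for response tokens, 0 for padding
    clip_eps: float = 0.2,       # clip range epsilon (illustrative value)
) -> torch.Tensor:
    # Probability ratio pi_theta(y_t) / pi_theta_old(y_t), computed in log space.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped and clipped surrogate objectives; PPO takes the pessimistic minimum.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)
    # Average over response tokens; negate because optimizers minimize a loss.
    return -(surrogate * mask).sum() / mask.sum()
```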
The PPO Training Loop for LLMs
The PPO training process for LLMs typically proceeds iteratively as follows (a simplified code sketch of one full iteration appears after the list):
Sample Prompts: Draw a batch of prompts x from the prompt dataset D.
Generate Responses: For each prompt x, generate a response y by sampling from the current policy πθ(y∣x).
Calculate Rewards: Compute the reward for each pair (x,y) using the frozen reward model: r=rϕ(x,y).
Compute KL Divergence: Calculate the KL divergence between the current policy's output distribution πθ(y∣x) and the reference policy's output distribution πref(y∣x) for each generated response y.
Estimate Advantages (Simplified View): Determine how much better the obtained reward r is compared to an expected baseline (often estimated using a value function trained alongside the policy).
Compute PPO Objective: Calculate the loss based on the PPO objective function, incorporating the rewards, KL penalty, and potentially clipped advantage terms.
Update Policy: Perform a gradient update on the parameters θ of the policy πθ to maximize the objective (or minimize the negative objective/loss).
Repeat: Continue the loop for a specified number of steps or until convergence.
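Putting these steps together, the skeleton below sketches one PPO iteration using the helper functions sketched earlier. The helpers `sample_responses` and `compute_logprobs`, the `reward_model.score` method, and the `value_head` baseline are hypothetical placeholders for whatever rollout, log-prob, scoring, and value utilities the surrounding codebase provides, and the advantage estimate is the deliberately simplified "reward minus baseline" view described above.

```python
import torch

def ppo_iteration(policy, ref_model, reward_model, value_head, optimizer,
                  prompt_batch, beta=0.1, clip_eps=0.2):
    """One simplified PPO iteration; all helper names are hypothetical placeholders."""
    # Generate responses: sample y from the current policy for each prompt, keeping
    # per-token log-probs and a mask over response tokens (hypothetical helper).
    responses, old_logprobs, mask = sample_responses(policy, prompt_batch)

    with torch.no_grad():
        # Calculate rewards: score each (x, y) pair with the frozen reward model.
        rewards = reward_model.score(prompt_batch, responses)                # (batch,)
        # Compute KL divergence: log-probs under the frozen reference policy,
        # combined with the reward via the KL-penalized reward sketched earlier.
        ref_logprobs = compute_logprobs(ref_model, prompt_batch, responses)
        penalized = kl_penalized_reward(rewards, old_logprobs, ref_logprobs,
                                        mask, beta=beta)                     # (batch,)
        # Estimate advantages (simplified view): penalized reward minus a learned
        # value baseline, broadcast over the response tokens.
        baseline = value_head(prompt_batch)                                  # (batch,)
        advantages = (penalized - baseline).unsqueeze(-1) * mask

    # Compute the PPO objective: recompute log-probs with gradients enabled and
    # form the clipped surrogate loss sketched earlier.
    new_logprobs = compute_logprobs(policy, prompt_batch, responses)
    loss = ppo_clipped_loss(new_logprobs, old_logprobs, advantages, mask,
                            clip_eps=clip_eps)

    # Update policy: one gradient step on the parameters theta.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```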
Figure: Iterative process of policy optimization using PPO in RLHF. The policy network πθ generates responses, which are evaluated by the reward model rϕ. The KL divergence relative to a reference model πref acts as a constraint. These components form the PPO objective, used to update πθ.
Practical Aspects
Implementing PPO for large language models requires careful handling of computational resources. Because it's an on-policy algorithm, new responses need to be generated frequently using the current policy πθ. This involves running inference with the large model within the training loop, which can be computationally intensive.
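As a rough illustration of that rollout cost, the snippet below samples responses for a batch of prompts with a Hugging Face-style causal LM. The checkpoint path, prompts, and generation settings are placeholders; a production setup would typically shard the model and use an optimized inference path rather than a plain `generate` call.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/sft-model"  # hypothetical SFT checkpoint used as the initial policy
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
policy = AutoModelForCausalLM.from_pretrained(checkpoint)

# Decoder-only models need left padding (and a pad token) for batched generation.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["Explain PPO in one sentence.", "Why is a KL penalty used in RLHF?"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = policy.generate(
        **inputs,
        max_new_tokens=128,  # cap response length
        do_sample=True,      # sample so rollouts reflect the current policy's distribution
        top_p=0.9,           # nucleus sampling; illustrative setting
    )

# Strip the prompt tokens so only the generated responses y remain.
responses = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:],
                                   skip_special_tokens=True)
```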
Model Synchronization: Keeping the policy πθ, the reference policy πref, the reward model rϕ, and potentially a value function model synchronized and accessible during training requires significant memory, often necessitating distributed training setups (like DeepSpeed ZeRO or FSDP).
Batching: Prompts are processed in batches, and responses are generated and evaluated together to leverage parallel computation.
Hyperparameter Sensitivity: The performance of PPO in RLHF is quite sensitive to hyperparameters, particularly the KL coefficient β, the learning rate, and parameters related to advantage estimation and clipping in the PPO algorithm itself. Tuning these often requires significant experimentation.
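One common way to keep these interacting knobs explicit is a single configuration object, sketched below. The field names and default values are illustrative starting points for experimentation, not recommendations from any particular library or paper.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # KL coefficient beta: trades off reward maximization against staying near pi_ref.
    kl_coef: float = 0.1
    # Learning rate for the policy update; RLHF runs typically use small values.
    learning_rate: float = 1e-6
    # PPO clip range epsilon for the probability ratio.
    clip_eps: float = 0.2
    # Discount factor and GAE lambda used in advantage estimation.
    gamma: float = 1.0
    gae_lambda: float = 0.95
    # Prompts sampled per rollout batch and PPO epochs run over each batch.
    rollout_batch_size: int = 64
    ppo_epochs: int = 4
```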
By carefully applying PPO, we can steer the LLM's behavior to align better with desired characteristics captured by the human preference data and encoded in the reward model, moving beyond simple supervised fine-tuning towards more nuanced behavioral adjustments. This step is fundamental for creating models that are not only capable but also helpful, harmless, and honest according to specified criteria.