Policy optimization fine-tunes a language model by leveraging a reward model, rϕ(x,y), which estimates human preferences for different responses y given a prompt x. This process involves adjusting the parameters θ of the language model policy, πθ(y∣x), to generate responses that receive high scores from the reward model. This effectively aligns the model's behavior with the learned preferences. Proximal Policy Optimization (PPO) is the most commonly used algorithm for this policy optimization phase in Reinforcement Learning from Human Feedback (RLHF).
PPO is an on-policy reinforcement learning algorithm designed to achieve stable and efficient policy updates. In the context of LLMs, "on-policy" means that the updates are based on data generated by the current version of the policy πθ we are trying to improve. The core idea is to maximize the expected reward predicted by rϕ, while simultaneously ensuring the updated policy πθ doesn't stray too far from a reference policy, typically the initial supervised fine-tuned (SFT) model, denoted as πref. This constraint is important to prevent the model from collapsing into generating repetitive, high-reward but nonsensical text, a phenomenon sometimes called "reward hacking," and to maintain the model's general language capabilities.
The RLHF PPO Objective Function
The optimization process in RLHF typically involves maximizing an objective function that combines the reward signal with a penalty term. For a prompt x sampled from a dataset D and a response y sampled from the current policy πθ(y∣x), the objective for a single step can be formulated as:

objective(θ) = E_{x∼D, y∼πθ(y∣x)} [ rϕ(x,y) − β · ( log πθ(y∣x) − log πref(y∣x) ) ]

where the expectation of the log-ratio term is the KL divergence between πθ and πref. The objective has two components:
Reward Term (rϕ(x,y)): This is the score assigned by the trained reward model to the prompt-response pair (x,y). Maximizing this term directly encourages the policy πθ to generate responses that are predicted to be preferred by humans.
KL Penalty Term: This term measures the Kullback-Leibler (KL) divergence between the current policy πθ and the reference policy πref. The KL divergence quantifies how much the probability distribution of responses generated by πθ differs from that of πref. Scaling this divergence by the coefficient β controls the strength of the penalty:
A high β forces the updated policy to stay very close to the original SFT model, potentially limiting alignment gains.
A low β allows the policy to deviate more significantly, potentially leading to better alignment but risking degradation in language quality or coherence if the reward model is imperfect.
The reference policy πref is usually kept frozen during PPO training. Its role is important: it acts as an anchor, preventing the optimized policy πθ from drifting too far into regions of the policy space that might yield high reward signals but correspond to unnatural or repetitive language.
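To make the balance between these two terms concrete, here is a minimal PyTorch-style sketch that estimates the KL term from per-token log-probabilities of a sampled response and combines it with the reward model score. The function and tensor names (kl_penalized_reward, logprobs_policy, logprobs_ref) are illustrative and not tied to any particular library.

```python
import torch

def kl_penalized_reward(reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Combine the reward model score with a KL penalty toward the reference policy.

    reward:           r_phi(x, y) for one sampled response (float or scalar tensor)
    logprobs_policy:  per-token log pi_theta(y_t | x, y_<t), shape [seq_len]
    logprobs_ref:     per-token log pi_ref(y_t | x, y_<t),   shape [seq_len]
    beta:             KL penalty coefficient
    """
    # Monte Carlo estimate of KL(pi_theta || pi_ref) along the sampled response:
    # its expectation over y ~ pi_theta equals the true KL term in the objective.
    per_token_kl = logprobs_policy - logprobs_ref
    kl_estimate = per_token_kl.sum()

    # Penalized reward used as the optimization target for this response.
    return reward - beta * kl_estimate
```

Note that many implementations spread the KL penalty across tokens (adding it to per-token rewards) rather than applying it once per sequence; the summed form above is used only for brevity.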
In practice, PPO includes additional components like value functions and advantage estimation to stabilize training, but the core objective remains balancing the reward maximization against the KL divergence constraint. The full PPO algorithm uses techniques like clipping the policy ratio to prevent excessively large updates in a single step, contributing further to stability.
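The clipping mentioned above can itself be sketched in a few lines. The following minimal PyTorch sketch assumes advantage estimates and the log-probabilities from the policy that generated the batch are already available; the argument names are illustrative, not a specific library's API.

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_range=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    logprobs_new: log-probs of the sampled tokens under the current policy pi_theta
    logprobs_old: log-probs of the same tokens under the policy that generated them
    advantages:   advantage estimates for each token (or sequence)
    """
    # Probability ratio pi_theta / pi_old, computed in log space for stability.
    ratio = torch.exp(logprobs_new - logprobs_old)

    # Unclipped and clipped surrogate objectives.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages

    # PPO maximizes the pessimistic minimum; return the negative for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

Clamping the ratio to [1 − ε, 1 + ε] prevents any single update from moving the policy too far, which matters most when the same rollout batch is reused for several optimization epochs.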
The PPO Training Loop for LLMs
The PPO training process for LLMs typically proceeds iteratively as follows (a simplified code sketch of the full loop appears after the list):
Sample Prompts: Draw a batch of prompts x from the prompt dataset D.
Generate Responses: For each prompt x, generate a response y by sampling from the current policy πθ(y∣x).
Calculate Rewards: Compute the reward for each pair (x,y) using the frozen reward model: r=rϕ(x,y).
Compute KL Divergence: Calculate the KL divergence between the current policy's output distribution πθ(y∣x) and the reference policy's output distribution πref(y∣x) for each generated response y.
Estimate Advantages (Simplified View): Determine how much better the obtained reward r is than an expected baseline (often estimated using a value function trained alongside the policy).
Compute PPO Objective: Calculate the loss based on the PPO objective function, incorporating the rewards, KL penalty, and potentially clipped advantage terms.
Update Policy: Perform a gradient update on the parameters θ of the policy πθ to maximize the objective (or minimize the negative objective/loss).
Repeat: Continue the loop for a specified number of steps or until convergence.
Iterative process of policy optimization using PPO in RLHF. The policy network πθ generates responses, which are evaluated by the reward model rϕ. The KL divergence relative to a reference model πref acts as a constraint. These components form the PPO objective, used to update πθ.
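The self-contained toy sketch below mirrors these steps end to end. A small linear "policy" over a tiny vocabulary stands in for the LLM, and a synthetic scoring function stands in for the trained reward model rϕ; all helper names, shapes, and hyperparameter values are illustrative, not a production recipe or any library's API.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins: a linear "policy" over a tiny vocabulary and a synthetic reward.
vocab_size, seq_len, batch_size = 16, 8, 4
policy = torch.nn.Linear(vocab_size, vocab_size)      # current policy pi_theta
reference = torch.nn.Linear(vocab_size, vocab_size)   # frozen reference pi_ref
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
beta, clip_range = 0.1, 0.2

def token_logprobs(model, prompts, responses):
    # Log-prob of each sampled token; a real LM would condition on the full prefix.
    logits = model(F.one_hot(prompts, vocab_size).float())
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, responses.unsqueeze(-1)).squeeze(-1)

def reward_model(prompts, responses):
    # Synthetic placeholder for r_phi: prefers responses with larger token ids.
    return responses.float().mean(dim=-1)

for step in range(10):
    # 1-2. Sample prompts and generate responses from the current policy.
    prompts = torch.randint(vocab_size, (batch_size, seq_len))
    with torch.no_grad():
        logits = policy(F.one_hot(prompts, vocab_size).float())
        responses = torch.distributions.Categorical(logits=logits).sample()
        old_logp = token_logprobs(policy, prompts, responses)
        ref_logp = token_logprobs(reference, prompts, responses)

    # 3-4. Reward from the (frozen) reward model, KL estimate vs. the reference.
    rewards = reward_model(prompts, responses)
    kl = (old_logp - ref_logp).sum(dim=-1)
    penalized = rewards - beta * kl

    # 5. Simplified advantage: penalized reward minus a batch-mean baseline.
    advantages = penalized - penalized.mean()

    # 6-7. Clipped PPO objective and gradient update of the policy parameters.
    new_logp = token_logprobs(policy, prompts, responses).sum(dim=-1)
    ratio = torch.exp(new_logp - old_logp.sum(dim=-1))
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A real implementation would replace the linear stand-ins with transformer language models, condition each token on the full prefix, reuse each rollout batch for several PPO epochs (where the clipping becomes important), and typically train a value head for advantage estimation.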
Practical Aspects
Implementing PPO for large language models requires careful handling of computational resources. Because it's an on-policy algorithm, new responses need to be generated frequently using the current policy πθ. This involves running inference with the large model within the training loop, which can be computationally intensive.
Model Synchronization: Keeping the policy πθ, the reference policy πref, the reward model rϕ, and potentially a value function model synchronized and accessible during training requires significant memory, often necessitating distributed training setups (like DeepSpeed ZeRO or FSDP).
Batching: Prompts are processed in batches, and responses are generated and evaluated together to leverage parallel computation.
Hyperparameter Sensitivity: The performance of PPO in RLHF is quite sensitive to hyperparameters, particularly the KL coefficient β, the learning rate, and parameters related to advantage estimation and clipping in the PPO algorithm itself. Tuning these often requires significant experimentation; an illustrative set of starting values is sketched below.
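As a rough orientation, the sketch below collects the knobs that most often need tuning into a single configuration object. The dataclass and its field names are hypothetical, and the default values are common starting points rather than recommendations for any specific model or dataset.

```python
from dataclasses import dataclass

@dataclass
class PPOHyperparams:
    """Illustrative RLHF-PPO knobs; values are ballpark starting points only."""
    kl_coef: float = 0.1           # beta: strength of the KL penalty toward pi_ref
    learning_rate: float = 1e-5    # policy learning rate; often much lower than for SFT
    clip_range: float = 0.2        # epsilon in the clipped surrogate objective
    gamma: float = 1.0             # discount factor used in advantage estimation
    gae_lambda: float = 0.95       # lambda for generalized advantage estimation
    rollout_batch_size: int = 64   # prompts sampled per PPO iteration
    ppo_epochs: int = 4            # optimization passes over each rollout batch
```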
By carefully applying PPO, we can steer the LLM's behavior to align better with the preferences captured in the human feedback data and encoded in the reward model, moving beyond simple supervised fine-tuning toward finer-grained behavioral control. This step is fundamental for creating models that are not only capable but also helpful, harmless, and honest according to the specified criteria.