Okay, let's transition from loading our models to putting the policy model to work. Within the Reinforcement Learning loop, the core task of the policy model, which is the Large Language Model (LLM) we are fine-tuning, is to generate responses based on given prompts. This generation step represents the 'action' phase in the standard RL framework, where the policy πθ takes an action (generates text) given a state (the input prompt).
The process starts with a batch of prompts, often sourced from the same dataset used for reward model training or a curated set designed to elicit diverse behaviors. For each prompt x in the batch, the current policy model πθ is used to generate a corresponding text sequence y.
This generation is typically performed using the standard autoregressive decoding methods common to LLMs. However, a significant difference in the RLHF context compared to simple inference is the need for exploration. We don't just want the single most likely (greedy) response; we need to explore different possible responses to discover ones that might yield higher rewards according to the reward model.
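To make the contrast concrete, here is a minimal sketch in plain PyTorch, using a made-up logits vector that stands in for one decoding step of the policy model. Greedy selection always returns the same token, while temperature-scaled sampling can return different tokens across calls, which is what enables exploration.

```python
import torch

# Hypothetical next-token logits for a tiny 5-token vocabulary,
# standing in for one autoregressive step of the policy model.
logits = torch.tensor([2.0, 1.5, 0.5, 0.2, -1.0])

# Greedy decoding: deterministic, always picks the highest-scoring token.
greedy_token = torch.argmax(logits).item()

# Temperature sampling: stochastic, so repeated calls can explore alternatives.
temperature = 0.8
probs = torch.softmax(logits / temperature, dim=-1)
sampled_tokens = [torch.multinomial(probs, num_samples=1).item() for _ in range(5)]

print(f"greedy: {greedy_token}, sampled: {sampled_tokens}")
```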
Therefore, generation relies heavily on sampling techniques rather than purely greedy decoding. Common sampling strategies include:
- Temperature scaling: dividing the logits by a temperature T before the softmax, where T > 1 flattens the distribution (more exploration) and T < 1 sharpens it (more deterministic output).
- Top-k sampling: restricting sampling to the k most probable tokens at each step.
- Top-p (nucleus) sampling: restricting sampling to the smallest set of tokens whose cumulative probability exceeds a threshold p.
These sampling methods are often used in combination (e.g., temperature scaling followed by top-k or top-p filtering). Additionally, parameters controlling the generation length (max_new_tokens) and, where needed, penalties for repetition (repetition_penalty) are essential for producing coherent and useful responses.
The input to this stage is a prompt, and the output is the generated response from the active policy model (the one currently being updated via PPO).
Diagram illustrating the flow where an input prompt is fed into the active policy model, which uses a sampling strategy to generate a text response.
The result of this stage is a collection of (prompt, generated_response) pairs. These pairs represent the experiences gathered by the policy. The next critical step in the pipeline, detailed in the following section, is to evaluate these generated responses using the trained reward model to determine how "good" each response is according to learned human preferences. This reward signal will then drive the PPO update.
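Putting the pieces together, the experience-gathering step for a small batch might look roughly like the sketch below, which reuses the placeholder policy_model and tokenizer from the earlier snippet and uses made-up prompts. A full PPO implementation would additionally store the per-token log-probabilities needed later for the policy update.

```python
# Collect (prompt, generated_response) pairs for one rollout batch.
# Assumes `policy_model` and `tokenizer` from the previous sketch.
prompts = [
    "Summarize the benefits of unit testing.",
    "Write a short poem about autumn.",
]

rollout = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = policy_model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
        max_new_tokens=128,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens as the response.
    response = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    rollout.append({"prompt": prompt, "response": response})

# Each pair in `rollout` is now ready to be scored by the reward model.
```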