The core task of a policy model within a Reinforcement Learning (RL) loop is to generate responses to given prompts. This policy model is typically a Large Language Model (LLM) undergoing fine-tuning. The generation step corresponds to the 'action' phase of the standard RL framework, where the policy $ \pi_{\theta} $ takes an action (generates text) given a state (the input prompt).

## Generating Text Sequences

The process starts with a batch of prompts, often sourced from the same dataset used for reward model training or from a curated set designed to elicit diverse behaviors. For each prompt $ x $ in the batch, the current policy model $ \pi_{\theta} $ generates a corresponding text sequence $ y $.

Generation typically uses the standard autoregressive decoding methods common to LLMs. A significant difference in the RLHF context compared to plain inference, however, is the need for exploration. We don't just want the single most likely (greedy) response; we need to explore different possible responses to discover ones that might earn higher rewards according to the reward model.

Therefore, generation relies heavily on sampling techniques rather than purely greedy decoding. Common sampling strategies include:

- **Temperature scaling:** A temperature parameter $ T > 0 $ is applied to the logits (pre-softmax scores) from the model's final layer. Higher temperatures ($ T > 1 $) flatten the probability distribution, increasing the likelihood of sampling less probable tokens and promoting diversity. Lower temperatures ($ T < 1 $) sharpen the distribution, making generation closer to greedy decoding. A typical starting point is $ T \in [0.7, 1.0] $.
  $$ P(\text{token}_i \mid \text{context}) = \frac{\exp(\text{logit}_i / T)}{\sum_j \exp(\text{logit}_j / T)} $$
- **Top-k sampling:** At each step, the vocabulary is restricted to the $ k $ tokens with the highest probabilities, and the model samples only from this reduced set. This prevents sampling extremely unlikely tokens while still allowing some variation. Common values for $ k $ range from 20 to 100.
- **Top-p (nucleus) sampling:** Instead of selecting a fixed number of tokens, this method selects the smallest set of tokens whose cumulative probability exceeds a threshold $ p $, and the model samples only from this dynamically sized set. The number of candidates therefore adapts to the model's confidence at each step. Typical values of $ p $ are around 0.9 or 0.95.

These sampling methods are often used in combination (e.g., temperature scaling followed by top-k or top-p), as illustrated in the sketch below.
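To make the combined pipeline concrete, here is a minimal PyTorch sketch that applies temperature scaling, then top-k filtering, then top-p filtering to a single step's next-token logits before sampling. The function name `sample_next_token` and the default values are assumptions for this illustration, not part of any particular library's API.

```python
import torch

def sample_next_token(logits: torch.Tensor,
                      temperature: float = 0.8,
                      top_k: int = 50,
                      top_p: float = 0.95) -> torch.Tensor:
    """Sample one token id from a vector of next-token logits."""
    # Temperature scaling: divide the logits by T before the softmax.
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        top_k = min(top_k, logits.size(-1))
        kth_best = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))

    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability exceeds p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        # Mask tokens beyond the nucleus, always keeping the most likely one.
        cutoff = cum_probs > top_p
        cutoff[..., 1:] = cutoff[..., :-1].clone()
        cutoff[..., 0] = False
        sorted_logits = sorted_logits.masked_fill(cutoff, float("-inf"))
        logits = logits.scatter(-1, sorted_idx, sorted_logits)

    # Sample from the filtered, renormalized distribution.
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example with a toy vocabulary of 1,000 tokens.
next_token_id = sample_next_token(torch.randn(1000))
```

Setting `top_k=0` or `top_p=1.0` disables the corresponding filter, so the same function can emulate plain temperature sampling.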
Beyond the sampling strategy itself, parameters controlling the generation length (`max_new_tokens`) and, optionally, penalties for repetition (`repetition_penalty`) are important for producing coherent and useful responses.

## The Generation Process Flow

The input to this stage is a prompt, and the output is the response generated by the active policy model (the one currently being updated via PPO).

```dot
digraph G {
    rankdir=LR;
    node [shape=box, style="filled, rounded", fontname="Arial", fontsize=10, color="#495057", fillcolor="#e9ecef"];
    edge [fontname="Arial", fontsize=9, color="#495057"];
    prompt [label="Input Prompt (State 's')"];
    policy_model [label="Active Policy Model\n(π_θ)", shape=cylinder, fillcolor="#a5d8ff"];
    sampling [label="Sampling Strategy\n(Temp, Top-k/Top-p)", shape=invhouse, fillcolor="#96f2d7"];
    response [label="Generated Response (Action 'a')"];
    prompt -> policy_model;
    policy_model -> sampling;
    sampling -> response;
}
```

Diagram illustrating the flow: an input prompt is fed into the active policy model, which uses a sampling strategy to generate a text response.

## Putting it into Practice

- **Batching:** To improve computational efficiency, prompts are typically processed in batches. The policy model generates responses for all prompts in the batch in parallel, leveraging GPU acceleration.
- **Model state:** Generation must use the current state of the policy model $ \pi_{\theta} $, which is continuously updated during PPO training. This ensures that the generated actions reflect the latest policy improvements.
- **Computational cost:** Text generation, especially with large models and sampling, is computationally intensive. This step often constitutes a significant portion of the overall RLHF training time and cost, so efficient implementation and capable hardware are necessary.

The result of this stage is a collection of (prompt, generated_response) pairs; a batched generation sketch is shown below. These pairs represent the experiences gathered by the policy. The next critical step in the pipeline, detailed in the following section, is to evaluate the generated responses with the trained reward model to determine how "good" each one is according to learned human preferences. That reward signal then drives the PPO update.
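As a concrete reference, a batched rollout might look like the following sketch, written against the Hugging Face `transformers` generation API. The model name ("gpt2"), the prompts, and the hyperparameter values are placeholders chosen for illustration; in a full RLHF pipeline this step is usually handled by the training framework (e.g., TRL) using the current policy weights.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; use the current policy checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token   # required for batched padding
tokenizer.padding_side = "left"             # keep prompts adjacent to new tokens
policy_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)

prompts = [
    "Explain why the sky is blue to a five-year-old.",
    "Write a short apology email for a missed meeting.",
]

# Tokenize the whole batch at once and move it to the GPU if available.
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device)

with torch.no_grad():
    outputs = policy_model.generate(
        **inputs,
        do_sample=True,            # sample instead of greedy decoding
        temperature=0.8,           # mild sharpening of the distribution
        top_k=50,                  # restrict to the 50 most likely tokens
        top_p=0.95,                # nucleus sampling threshold
        max_new_tokens=256,        # cap the response length
        repetition_penalty=1.1,    # discourage degenerate repetition
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the (left-padded) prompt tokens so only the response remains.
prompt_len = inputs["input_ids"].shape[1]
responses = tokenizer.batch_decode(outputs[:, prompt_len:],
                                   skip_special_tokens=True)

# These (prompt, response) pairs feed the reward model in the next stage.
rollouts = list(zip(prompts, responses))
```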