After generating a batch of responses using the current policy model (as discussed in the previous section), the next critical step in the Reinforcement Learning (RL) loop is to evaluate how "good" these responses are according to the learned human preferences. This evaluation is performed by the Reward Model (RM) trained in Chapter 3. The output of this scoring process provides the essential reward signal that drives the Proximal Policy Optimization (PPO) update.
Recall that the Reward Model was trained on pairs of responses, learning to assign a higher scalar score to the response humans preferred for a given prompt. During the RL phase, we leverage this learned function. For each prompt-response pair generated by the policy model, the RM takes both the prompt and the complete generated response as input and outputs a single scalar value.
$$\text{score} = R_\phi(\text{prompt}, \text{response})$$

Here, $R_\phi$ represents the Reward Model parameterized by $\phi$. This score quantifies the desirability of the response given the prompt, according to the preferences captured during RM training. A higher score indicates a response predicted to be more aligned with human preferences.
Data flow for obtaining a reward score using the Reward Model.
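Before any scoring can happen, the reward model and its tokenizer need to be loaded in inference mode. The sketch below assumes the RM was trained as a single-label sequence classifier (as in Chapter 3); the checkpoint path "path/to/reward_model" is a placeholder, not a real location.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path; substitute the checkpoint produced by your RM training run
rm_path = "path/to/reward_model"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(rm_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_path,
    num_labels=1,  # single scalar score per sequence
)
reward_model.to(device)
reward_model.eval()  # the RM is frozen during PPO; inference only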
The scalar score produced by the RM serves directly as the reward signal for the PPO algorithm. In the standard RLHF setup using PPO, the objective function aims to maximize this reward while penalizing large deviations from the initial Supervised Fine-Tuned (SFT) policy using a KL divergence term. The per-token reward used in the PPO update often incorporates this RM score (applied to the full sequence) combined with the KL penalty calculated between the current policy and the reference SFT policy.
The core idea is:
$$\text{Total Reward} = \text{RM Score} - \beta \times \text{KL Divergence Penalty}$$

where $\beta$ is the coefficient controlling the strength of the KL penalty. The PPO algorithm then uses this total reward (along with value function estimates) to compute advantages and update the policy model's parameters ($\theta$) to generate responses that yield higher scores from the RM in the future, without deviating too drastically from the SFT model's behavior.
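As a concrete illustration, the per-token reward can be assembled from the per-token log-probabilities of the current policy and the frozen SFT reference, with the RM score added at the final token of each response. This is a minimal sketch of that combination under assumed tensor names (`policy_logprobs`, `ref_logprobs`, `rm_scores`); production implementations such as trl's PPO trainer additionally handle padding masks and batching details.

import torch

def compute_per_token_rewards(policy_logprobs, ref_logprobs, rm_scores, beta=0.1):
    """Combine a sequence-level RM score with a per-token KL penalty.

    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the generated tokens
    rm_scores: (batch,) scalar reward-model scores for the full sequences
    beta: KL penalty coefficient
    """
    # Per-token KL estimate between the current policy and the SFT reference
    kl = policy_logprobs - ref_logprobs      # (batch, seq_len)
    rewards = -beta * kl                     # penalize divergence at every token
    rewards[:, -1] += rm_scores              # add the RM score at the final token
    return rewards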
Raw scores from the RM can sometimes have arbitrary ranges or distributions that change during training. Feeding these raw scores directly into PPO can lead to instability or slow convergence. Therefore, it's common practice to normalize or standardize the reward scores within each batch before they are used in the PPO update calculation.
A common technique is whitening the rewards: subtracting the batch mean and dividing by the batch standard deviation.
$$\text{normalized\_score} = \frac{\text{score} - \text{mean}(\text{scores}_{\text{batch}})}{\text{std}(\text{scores}_{\text{batch}}) + \epsilon}$$

where $\epsilon$ is a small constant for numerical stability. This centers the rewards around zero and scales them to unit variance within the batch, making PPO less sensitive to the absolute magnitude of the RM scores and improving training stability. Other scaling functions or clipping might also be applied depending on the specific implementation and observed behavior.
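A small helper illustrating this transformation, including the optional clipping mentioned above; the function name and the clip range of ±5 are illustrative choices rather than fixed conventions.

import torch

def whiten_rewards(raw_scores, eps=1e-8, clip_range=5.0):
    """Standardize a batch of RM scores and optionally clip outliers."""
    whitened = (raw_scores - raw_scores.mean()) / (raw_scores.std() + eps)
    if clip_range is not None:
        whitened = torch.clamp(whitened, -clip_range, clip_range)
    return whitened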
In practice, using libraries like Hugging Face's transformers and trl, scoring involves passing the generated sequences through the loaded reward model. Assuming you have the prompt strings and the responses generated by the policy:
import torch

# Assume:
# - `reward_model`: loaded RM (e.g., AutoModelForSequenceClassification)
# - `tokenizer`: tokenizer used for the RM
# - `prompts`: list of prompt strings
# - `responses`: list of corresponding response strings generated by the policy
# - `device`: computation device (e.g., 'cuda')

# 1. Prepare inputs for the Reward Model
inputs = []
for prompt, response in zip(prompts, responses):
    # Format the input as required by the specific RM architecture.
    # Prompt and response are often concatenated, sometimes separated by an EOS token.
    text = prompt + response  # Simplified example; the precise format depends on RM training
    inputs.append(text)

# 2. Tokenize the combined texts
encoded_inputs = tokenizer(
    inputs, padding=True, truncation=True, return_tensors='pt'
).to(device)

# 3. Get scores from the Reward Model (evaluation mode, no gradients)
reward_model.eval()
with torch.no_grad():
    outputs = reward_model(**encoded_inputs)
    # For a single-output head, the scalar score is the lone logit per sequence
    raw_scores = outputs.logits[:, 0]

# 4. (Optional but recommended) Normalize scores per batch
mean_score = raw_scores.mean()
std_score = raw_scores.std()
normalized_scores = (raw_scores - mean_score) / (std_score + 1e-8)  # Whitening

# `normalized_scores` now holds the rewards for the PPO step
# Shape: (batch_size,)
Simplified Python sketch for scoring responses using a reward model.
This scoring step is executed repeatedly within the RL training loop, providing the feedback signal necessary to guide the policy towards generating outputs that align better with the preferences encoded in the Reward Model. However, be mindful that the RM is not perfect; it's an approximation of human preferences. The policy might learn to exploit weaknesses in the RM, leading to "reward hacking," a challenge addressed further in Chapters 6 and 7.