Now that we understand the goal is to learn a function mapping prompt-response pairs to a scalar score representing human preference, let's examine the typical architectures used for these Reward Models (RMs) in the context of Large Language Models (LLMs).
Given that our task involves understanding and evaluating text generated by an LLM, it's highly effective to leverage the capabilities of pre-trained LLMs themselves as the foundation for the RM. The most common and successful approach is to adapt a pre-trained transformer model, often the same base model used for the SFT phase or even the final policy model, to perform this scoring task.
Leveraging Pre-trained Language Models
The core idea is to take a pre-trained LLM (like GPT, Llama, Mistral, etc.) and modify its final layer. Instead of predicting the next token (as in standard language modeling), we add a regression head, typically a simple linear layer, on top of the final hidden state representation. This head is trained to output a single scalar value, RM(prompt, response), which represents the predicted reward or preference score for the given response in the context of the prompt.
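To make this concrete, the sketch below implements such a reward model in PyTorch on top of a Hugging Face transformers backbone. It is a minimal illustration under that assumption; the class name RewardModel, the reward_head attribute, and the way the final token is selected are illustrative choices rather than the API of any particular RLHF library.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """A pre-trained LLM backbone with a scalar regression head (illustrative sketch)."""

    def __init__(self, backbone_name: str):
        super().__init__()
        # Backbone: any pre-trained transformer available through the transformers library.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        # Regression head: a single linear layer mapping a hidden state to one scalar.
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state              # (batch, seq_len, hidden_size)
        # Read the score from the hidden state of the last non-padding token,
        # which is presumed to summarize the whole prompt + response sequence.
        last_token_idx = attention_mask.sum(dim=1) - 1          # index of final real token per example
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        last_hidden = hidden_states[batch_idx, last_token_idx]  # (batch, hidden_size)
        return self.reward_head(last_hidden).squeeze(-1)        # (batch,) scalar reward scores
```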
Why use an LLM backbone?
- Transfer Learning: These models have already learned rich representations of language, grammar, semantics, and even some world knowledge during their pre-training. Fine-tuning this backbone allows the RM to quickly learn the nuances of human preference regarding helpfulness, harmlessness, style, and factuality, rather than learning these text understanding capabilities from scratch.
- Contextual Understanding: Transformers excel at processing sequences and understanding the relationship between the prompt and the response. This is fundamental for judging the quality and relevance of a response.
- Architectural Consistency: Using a similar architecture (or even the same base model) for the SFT model, the RM, and the eventual RL policy model can simplify the overall pipeline and potentially allow for parameter sharing or more stable training dynamics.
Input Representation and Output
The RM typically takes the concatenated prompt and response as input. For example, the input sequence might look like: [Prompt Tokens] [Separator Token] [Response Tokens].

The LLM processes this combined sequence. The regression head is usually applied to the hidden state corresponding to a specific token, often the final token of the sequence (e.g., the </s> or [EOS] token). This final hidden state is presumed to encode information about the entire input sequence (prompt and response). The linear layer then maps this high-dimensional hidden state vector to the single scalar reward value.
Diagram illustrating a common Reward Model architecture. The prompt and response are concatenated and fed into an LLM backbone. A linear regression head processes the final hidden state to produce a single scalar score representing predicted preference.
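As a usage sketch, and assuming the RewardModel class defined above plus a tokenizer that provides an EOS token, scoring a single prompt-response pair might look like the following. The checkpoint name is a placeholder, not a real model.

```python
import torch
from transformers import AutoTokenizer

# Placeholder checkpoint name; any causal LLM checkpoint with a matching tokenizer works similarly.
model_name = "my-org/sft-base-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = RewardModel(model_name)   # the illustrative class sketched earlier

prompt = "Explain why the sky is blue."
response = "Sunlight scatters off air molecules, and shorter (blue) wavelengths scatter the most."

# Concatenate prompt and response, separated and terminated by the EOS token.
# The final EOS position is where the regression head reads its score.
text = prompt + tokenizer.eos_token + response + tokenizer.eos_token
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    score = reward_model(inputs["input_ids"], inputs["attention_mask"])
print(f"Predicted reward: {score.item():.3f}")
```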
Architectural Considerations and Variations
- Model Size: Should the RM be the same size as the policy model? Not necessarily. A larger RM can capture preferences more accurately, while a smaller one may be sufficient and computationally cheaper. In practice, it is common to use an RM significantly larger than the policy being trained (e.g., a 7B parameter RM to train a 1B parameter policy), on the rationale that a more capable RM provides a higher-quality reward signal. The trade-off is added computational cost during the PPO phase, since every scoring call requires an inference pass through this potentially large RM.
- Shared vs. Separate Backbones: While often initialized from the same pre-trained base, the RM and the policy model are typically trained separately after the SFT phase. The RM's weights are usually kept frozen during the PPO phase. Sharing parameters between the policy and the RM during PPO is less common and adds complexity.
- Value Head for PPO: Although not strictly part of the RM architecture, it's worth noting that during PPO training (covered in the next chapter), a value head is often added to the policy model (or a copy of it). This value head is also a regression head outputting a scalar, but it's trained to predict the expected future reward (the value function V(s)) rather than the immediate reward score assigned by the RM. Architecturally, it's similar to the RM's regression head but serves a different purpose within the RL algorithm.
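For comparison, here is a similarly hedged sketch of a value head attached to a policy backbone. It deliberately omits the language-modeling head used for generation so the contrast with the RM stays clear: the head has the same shape, but it produces a per-token estimate of the expected future reward V(s) rather than a single sequence-level preference score.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PolicyWithValueHead(nn.Module):
    """Policy backbone plus a value head, as used during PPO (LM head omitted for brevity)."""

    def __init__(self, backbone_name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        # Architecturally identical to the RM's regression head, but trained to
        # predict V(s): the expected future reward from each token position onward.
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                   # (batch, seq_len, hidden_size)
        # One value estimate per token position, unlike the RM's single per-sequence score.
        return self.value_head(hidden_states).squeeze(-1)     # (batch, seq_len)
```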
The choice of architecture involves trade-offs between performance (how well the RM captures human preference), computational cost (training and inference time/memory), and complexity within the overall RLHF pipeline. For most applications, fine-tuning a pre-trained LLM with a scalar regression head provides a strong and effective starting point.