Reward Models (RMs) learn a function that maps prompt-response pairs to a scalar score representing human preference. We examine the typical architectures used for these RMs in the context of Large Language Models (LLMs).
Given that our task involves understanding and evaluating text generated by an LLM, it's highly effective to leverage the capabilities of pre-trained LLMs themselves as the foundation for the RM. The most common and successful approach is to adapt a pre-trained transformer model, often the same base model used for the SFT phase or even the final policy model, to perform this scoring task.
The core idea is to take a pre-trained LLM (like GPT, Llama, Mistral, etc.) and modify its final layer. Instead of predicting the next token (as in standard language modeling), we add a regression head, typically a simple linear layer, on top of the final hidden state representation. This head is trained to output a single scalar value, RM(prompt, response), which represents the predicted reward or preference score for the given response in the context of the prompt.
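To make this concrete, here is a minimal sketch of such an architecture in PyTorch, assuming the Hugging Face transformers library. The RewardModel class, its method names, and the last-token pooling shown here are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """Pre-trained LLM backbone plus a scalar regression head (illustrative sketch)."""

    def __init__(self, backbone_name: str):
        super().__init__()
        # Load the pre-trained transformer without its language-modeling head.
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden_size = self.backbone.config.hidden_size
        # Regression head: maps one hidden state vector to a single scalar reward.
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state             # (batch, seq_len, hidden)
        # Pool by taking the hidden state of the last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1               # (batch,)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        final_hidden = hidden_states[batch_idx, last_idx]      # (batch, hidden)
        return self.reward_head(final_hidden).squeeze(-1)      # (batch,) scalar rewards
```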
Why use an LLM backbone?
1. Transfer Learning: These models have already learned rich representations of language, grammar, semantics, and even some general knowledge during their pre-training. Fine-tuning this backbone allows the RM to quickly learn the nuances of human preference regarding helpfulness, harmlessness, style, and factuality, rather than learning these text understanding capabilities from scratch.
2. Contextual Understanding: Transformers excel at processing sequences and understanding the relationship between the prompt and the response. This is fundamental for judging the quality and relevance of a response.
3. Architectural Consistency: Using a similar architecture (or even the same base model) for the SFT model, the RM, and the eventual RL policy model can simplify the overall pipeline and potentially allow for parameter sharing or more stable training dynamics.
The RM typically takes the concatenated prompt and response as input. For example, the input sequence might look like: [Prompt Tokens] [Separator Token] [Response Tokens].
The LLM processes this combined sequence. The regression head is usually applied to the hidden state corresponding to a specific token, often the final token of the sequence (e.g., the </s> or [EOS] token). This final hidden state is presumed to encode information about the entire input sequence (prompt and response). The linear layer then maps this high-dimensional hidden state vector to the single scalar reward value.
Diagram illustrating a common Reward Model architecture. The prompt and response are concatenated and fed into an LLM backbone. A linear regression head processes the final hidden state to produce a single scalar score representing predicted preference.
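Continuing the sketch above, the snippet below shows how a single prompt-response pair might be scored. The gpt2 checkpoint is used only as a small, freely available stand-in for a real backbone, and the EOS-based concatenation is one common convention rather than a fixed rule; the appropriate special tokens depend on the tokenizer of the chosen model.

```python
from transformers import AutoTokenizer

backbone = "gpt2"  # small stand-in; in practice, often the same base model as the SFT model
tokenizer = AutoTokenizer.from_pretrained(backbone)
model = RewardModel(backbone)
model.eval()

prompt = "Explain why the sky is blue."
response = "Sunlight scatters off air molecules, and shorter (blue) wavelengths scatter the most."

# Concatenate prompt and response, ending with EOS so the final hidden state
# summarizes the entire (prompt, response) pair.
text = prompt + tokenizer.eos_token + response + tokenizer.eos_token
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    score = model(inputs["input_ids"], inputs["attention_mask"])

print(score.item())  # a single scalar preference score (arbitrary until the RM is trained)
```

Because the regression head is randomly initialized, the score only becomes meaningful after the model is fine-tuned on human preference data.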
The choice of architecture involves trade-offs between performance (how well the RM captures human preference), computational cost (training and inference time/memory), and complexity within the overall RLHF pipeline. For most applications, fine-tuning a pre-trained LLM with a scalar regression head provides a strong and effective starting point.