Now that we understand the goal is to train a reward model (RM) that reflects human preferences, the immediate question becomes: how do we acquire the data representing these preferences? The RM learns from examples of what humans consider 'better' or 'worse' AI behavior in response to specific prompts. This section details the practical process of collecting this human preference data, which forms the foundation for training effective reward models.
The most common approach relies on pairwise comparisons. Instead of asking annotators to assign an absolute quality score to a single response (which can be highly subjective and inconsistent), we present them with a prompt and two different responses generated by language models. The annotator's task is simply to choose which response they prefer. Relative judgments of this kind are generally easier for humans to make and tend to be more consistent across annotators.
The Pairwise Comparison Workflow
- Prompt Selection: Input prompts are sourced, often from real-world interactions with previous model versions or curated datasets designed to cover diverse topics and styles (e.g., questions, instructions, creative writing prompts). The distribution of prompts significantly impacts the resulting reward model's scope.
- Response Generation: For each prompt, multiple responses are generated. These typically come from:
  - Different versions of the language model being fine-tuned (e.g., the base SFT model vs. an earlier RLHF-tuned checkpoint).
  - The same model using different decoding strategies (e.g., varying temperature or top-p sampling; a sampling sketch follows this list).
  - Outputs from distinct language models.
- Human Annotation: Annotators are presented with the prompt and a pair of generated responses (often anonymized and randomly ordered as Response A and Response B). They select the preferred response based on predefined criteria (e.g., helpfulness, harmlessness, accuracy, adherence to instructions). Options for "equal quality" or "cannot decide" are also typically included.
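As a concrete illustration of the response-generation step, the sketch below samples the same model twice with different decoding settings using the Hugging Face transformers library. The model name, prompt, and decoding values are illustrative assumptions, not recommendations.

```python
# A minimal sketch of the response-generation step: sample the same model
# twice with different decoding settings. "gpt2" is only a stand-in for the
# SFT model being fine-tuned; all settings here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_response(prompt: str, temperature: float, top_p: float) -> str:
    """Sample one completion for the prompt with the given decoding settings."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=128,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the prompt tokens and decode only the newly generated text.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

prompt = "Explain why the sky is blue in one paragraph."
response_a = generate_response(prompt, temperature=0.7, top_p=0.9)
response_b = generate_response(prompt, temperature=1.0, top_p=0.95)
```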
A simplified diagram of the pairwise preference annotation task. An annotator compares two responses to a single prompt and indicates their preference.
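Each completed judgment can be captured as a simple structured record. The sketch below shows one possible format; the field names are illustrative assumptions rather than a standard.

```python
# A minimal sketch of the record an annotation tool might emit for each
# pairwise judgment. Field names and label values are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PairwiseAnnotation:
    prompt: str
    response_a: str
    response_b: str
    preference: str                # "A", "B", "equal", or "cannot_decide"
    annotator_id: str
    comment: Optional[str] = None  # optional rationale for the choice

example = PairwiseAnnotation(
    prompt="Summarize the article in two sentences.",
    response_a="A concise two-sentence summary...",
    response_b="A rambling five-paragraph rewrite...",
    preference="A",
    annotator_id="annotator_017",
)
```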
This pairwise approach directly supports training objectives like the Bradley-Terry model mentioned earlier: the reward model learns to assign scores RM(prompt, response) such that the probability of one response being preferred over the other is modeled as the sigmoid of the difference between their scores.
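As a minimal illustration, the resulting per-pair loss can be written as follows, assuming reward_model is any callable that maps a (prompt, response) pair to a scalar score (the name and interface are assumptions for the sketch).

```python
# Minimal sketch of the Bradley-Terry pairwise loss, assuming `reward_model`
# maps a (prompt, response) pair to a scalar score.
import torch
import torch.nn.functional as F

def pairwise_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # score for the preferred response
    r_rejected = reward_model(prompt, rejected)  # score for the other response
    # P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected);
    # maximizing its log-likelihood is equivalent to minimizing this loss,
    # which pushes the chosen score above the rejected score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```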
Alternative Collection Methods
While pairwise comparison is dominant, other methods exist:
- K-way Ranking: Annotators rank more than two responses (e.g., rank 3 or 4 responses from best to worst). This gathers more information per prompt but increases the cognitive load on the annotator; in practice, such rankings are typically decomposed into pairwise comparisons for training (see the sketch below).
- Absolute Ratings: Annotators assign a score (e.g., 1-5 stars, or a numeric score on a Likert scale) to individual responses. While seemingly direct, achieving inter-annotator calibration and consistency with absolute scores is notoriously difficult. These scores often need post-processing to be converted into relative preferences anyway.
Pairwise comparisons generally offer a good balance between annotation efficiency and data quality for training reward models in RLHF.
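When K-way rankings (or post-processed absolute ratings) are collected, they are typically expanded into pairwise preferences before reward model training. A minimal sketch, assuming the responses are already ordered from best to worst:

```python
# Decompose a K-way ranking into pairwise preferences. The input format
# (responses listed best to worst) is an illustrative assumption.
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (prompt, chosen, rejected) tuples."""
    pairs = []
    for better, worse in combinations(ranked_responses, 2):
        # combinations preserves list order, so `better` is ranked above `worse`.
        pairs.append((prompt, better, worse))
    return pairs

# A ranking of K responses yields K * (K - 1) / 2 pairs, e.g. 4 responses -> 6 pairs.
pairs = ranking_to_pairs("Explain recursion.", ["best", "good", "okay", "worst"])
assert len(pairs) == 6
```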
Annotator Management and Guidelines
The quality of the preference data hinges on the human annotators. Considerations include:
- Annotator Source: Annotators can be in-house experts, dedicated contractor teams, or crowdworkers. The choice depends on budget, scale, required expertise, and quality control needs. Advanced RLHF often benefits from trained annotators familiar with the specific alignment goals (e.g., identifying subtle harmfulness or improving factual accuracy).
- Clear Instructions: Detailed guidelines are essential. These must specify the criteria for preference (e.g., "Is response A more helpful and harmless than response B?") and should include examples of good and bad responses, cover common edge cases, and explain how to handle ambiguity.
- Training and Calibration: Annotators require training on the guidelines and potentially calibration exercises in which their judgments are compared against gold standards or expert consensus. Ongoing monitoring of inter-annotator agreement helps identify inconsistencies or misunderstandings (a simple agreement check is sketched after this list).
- Interface Design: The annotation tool should be clear, efficient, and minimize bias. Randomizing the order of responses (A vs. B) prevents positional bias. The interface should smoothly present prompts and response pairs, record choices, and ideally allow for optional comments explaining the rationale for a preference, which can be valuable for analysis.
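One simple way to monitor inter-annotator agreement is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. A minimal sketch, assuming two annotators have labeled the same set of pairs with "A", "B", or "equal" (the label names are illustrative):

```python
# Cohen's kappa between two annotators who labeled the same items.
from collections import Counter

def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_1) == len(labels_2)
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Expected agreement if each annotator labeled items independently
    # according to their own label frequencies.
    expected = sum(
        (counts_1[c] / n) * (counts_2[c] / n)
        for c in set(labels_1) | set(labels_2)
    )
    return (observed - expected) / (1 - expected)

print(cohens_kappa(["A", "B", "A", "equal", "A"],
                   ["A", "B", "B", "equal", "A"]))  # raw agreement 0.8, kappa ~0.69
```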
Data Quality and Bias
Collecting preference data is subject to several challenges:
- Annotator Disagreement: Humans naturally disagree. High disagreement rates might indicate unclear guidelines, ambiguous prompts, or genuinely subjective preferences. Analyzing disagreement patterns is important.
- Annotator Biases: Individual annotators may have inherent biases reflected in their preferences. Aggregating judgments from a diverse pool of annotators can help mitigate this.
- Prompt Representativeness: If the prompts used for data collection don't reflect the distribution of prompts the final model will encounter, the learned reward model may not generalize well.
- Gaming the System: Annotators might develop heuristics that don't align with the intended preference criteria, especially if paid per task without sufficient quality control.
Careful planning, clear guidelines, rigorous annotator training, and ongoing quality monitoring are necessary to collect high-fidelity preference data that enables the training of a useful reward model. This dataset, typically comprising (prompt, chosen_response, rejected_response) tuples, becomes the input for the next stage: training the reward model itself.
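As a final assembly step, judgments from multiple annotators on the same pair can be aggregated (for example, by majority vote) into exactly such tuples. The sketch below assumes annotation records shaped like the PairwiseAnnotation example above, here represented as plain dicts; the aggregation rule is one reasonable choice, not the only one.

```python
# Turn raw annotations into reward-model training data. Each annotation is
# assumed to be a dict with "prompt", "response_a", "response_b", and
# "preference" ("A", "B", "equal", or "cannot_decide") -- an illustrative
# format, not a standard one.
from collections import Counter, defaultdict

def build_training_tuples(annotations):
    """Aggregate per-pair votes by majority and emit (prompt, chosen, rejected) tuples."""
    votes = defaultdict(list)
    for ann in annotations:
        key = (ann["prompt"], ann["response_a"], ann["response_b"])
        votes[key].append(ann["preference"])

    tuples = []
    for (prompt, resp_a, resp_b), prefs in votes.items():
        counts = Counter(prefs)
        # "equal" and "cannot_decide" votes count toward neither side;
        # skip pairs with no clear majority for A or B.
        if counts["A"] == counts["B"]:
            continue
        chosen, rejected = (resp_a, resp_b) if counts["A"] > counts["B"] else (resp_b, resp_a)
        tuples.append((prompt, chosen, rejected))
    return tuples
```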