To effectively train a reward model using the preference-based approach discussed earlier, we need a consistent way to structure the human feedback data. The goal is to represent the outcome of pairwise comparisons, indicating which response was preferred for a given input prompt.
The fundamental unit of data for reward modeling typically involves a triplet: the input prompt ($x$), the response deemed better or "chosen" ($y_w$), and the response deemed worse or "rejected" ($y_l$). This structure directly feeds into the loss function based on the Bradley-Terry model or similar comparison frameworks.
**Pairwise Comparison Tuples:** This is the most prevalent format. Each data point explicitly represents a single comparison judgment and has the form `(prompt, chosen_response, rejected_response)`, for example:

`("Explain RLHF.", "RLHF uses human feedback to train a reward model...", "RLHF is about reinforcement learning.")`
The implicit judgment is that `chosen_response` is better than `rejected_response` given the `prompt`.

**Ranked Lists (Convertible to Pairwise):** Sometimes, annotators might rank multiple responses generated for the same prompt (e.g., Best > Good > Fair > Bad). Such a data point has the form:
`(prompt, [ranked_response_1, ranked_response_2, ..., ranked_response_n])`

where the order indicates preference (1 is best). A ranked list `[A, B, C]` (where A is best) implies the pairs `(A, B)`, `(A, C)`, and `(B, C)`.

**Grouped Preferences:** Datasets might group all comparisons related to a single prompt, for example as a mapping where the key is the `prompt` and the value is a list of `(chosen_response, rejected_response)` pairs. The sketch below shows all three layouts in code.
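This is a minimal Python sketch of the three formats; the variable names and the `ranked_to_pairs` helper are illustrative, not part of any particular library.

```python
from itertools import combinations

# Pairwise comparison tuple: (prompt, chosen_response, rejected_response)
pairwise_example = (
    "Explain RLHF.",
    "RLHF uses human feedback to train a reward model...",
    "RLHF is about reinforcement learning.",
)

# Ranked list: responses ordered best-to-worst for a single prompt
ranked_example = ("Explain RLHF.", ["Response A", "Response B", "Response C"])

def ranked_to_pairs(prompt, ranked_responses):
    """Expand a best-to-worst ranking into (prompt, chosen, rejected) tuples."""
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]

# [A, B, C] (A best) yields the pairs (A, B), (A, C), and (B, C)
pairs_from_ranking = ranked_to_pairs(*ranked_example)

# Grouped preferences: every comparison for a prompt stored under one key
grouped_example = {
    "Explain RLHF.": [
        ("RLHF uses human feedback to train a reward model...",
         "RLHF is about reinforcement learning."),
    ],
}
```

Note that the ranked-list format is denser: a ranking of n responses expands into n(n-1)/2 pairwise comparisons for training.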
A widely referenced dataset is Anthropic's "Helpful and Harmless Reinforcement Learning from Human Feedback" (HH-RLHF) dataset. It primarily uses the pairwise comparison format. Each entry contains:
- A `prompt` (often the start of a conversation).
- A `chosen` completion (the response preferred by the human labeler).
- A `rejected` completion (the response not preferred).

This clean structure makes it straightforward to apply the standard reward modeling loss function.
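If you want to inspect the data yourself, the dataset is hosted on the Hugging Face Hub and can be loaded with the `datasets` library. The snippet below is a quick sketch; the exact fields and splits depend on the dataset version you pull.

```python
from datasets import load_dataset

# Pull the Anthropic HH-RLHF preference data from the Hugging Face Hub
hh = load_dataset("Anthropic/hh-rlhf", split="train")

example = hh[0]
print(example.keys())             # typically includes 'chosen' and 'rejected'
print(example["chosen"][:200])    # the preferred conversation continuation
print(example["rejected"][:200])  # the non-preferred continuation
```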
The core relationship in a pairwise preference record is simple: a single preference data point links a prompt to two responses, explicitly identifying the preferred (`chosen`) and non-preferred (`rejected`) options.
For processing, these preferences are often organized into tables or structured files (like JSON Lines or CSV).
| Prompt | Chosen Response | Rejected Response |
|---|---|---|
| "Summarize the process of photosynthesis." | "Plants use sunlight, water, and CO2 to create..." | "Photosynthesis is how plants make food." |
| "Write a short poem about a cat." | "Soft paws tread lightly,\nWhiskers twitch..." | "Cats are furry,\nThey like to sleep a lot." |
| "Explain the concept of KL divergence." | "KL divergence measures the difference between..." | "It's a math thing for distributions." |
Beyond the core triplet, practical datasets often include additional metadata about each comparison, such as annotator identifiers, timestamps, or a measure of preference strength.
When training the reward model, the `prompt` and each response (`chosen` and `rejected`) are typically concatenated and tokenized. For instance, the input to the reward model when scoring the chosen response might look like `[tokenizer.bos_token] + tokenize(prompt) + tokenize(chosen_response) + [tokenizer.eos_token]`. The model is trained to output a scalar score for such combined sequences. The difference between the scores for the `(prompt, chosen_response)` pair and the `(prompt, rejected_response)` pair is then used in the loss calculation, as shown in the chapter introduction's formula:
$$\text{loss} = -\log\left(\sigma\left(RM(x, y_w) - RM(x, y_l)\right)\right)$$
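As a sketch of how this formula translates to code, the following PyTorch function computes the pairwise loss from batches of scalar scores. The scores would come from a reward model applied to the tokenized `(prompt, chosen_response)` and `(prompt, rejected_response)` sequences; here they are stand-in tensors.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Mean of -log(sigmoid(RM(x, y_w) - RM(x, y_l))) over a batch.

    chosen_scores and rejected_scores hold the scalar reward-model outputs
    for the (prompt, chosen_response) and (prompt, rejected_response)
    sequences, respectively.
    """
    # logsigmoid is numerically more stable than log(sigmoid(...))
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy scores standing in for reward-model outputs on a batch of 3 pairs
chosen_scores = torch.tensor([1.2, 0.4, 2.0])
rejected_scores = torch.tensor([0.3, 0.9, -0.5])
print(pairwise_reward_loss(chosen_scores, rejected_scores).item())
```

The loss decreases as the model assigns higher scores to chosen responses than to rejected ones for the same prompt.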
Understanding these data formats is essential for correctly preparing preference datasets and implementing the reward modeling stage of the RLHF pipeline. The choice of format impacts data collection interfaces, storage, and the preprocessing steps required before feeding the data into the reward model training loop.