After the initial Supervised Fine-Tuning (SFT) step, the language model learns to follow instructions or specific formats. However, SFT alone often falls short of capturing the subtleties of human preferences regarding helpfulness, honesty, and harmlessness. Simply mimicking demonstration data doesn't guarantee alignment with these complex, often subjective, values. To bridge this gap, Reinforcement Learning from Human Feedback (RLHF) introduces a mechanism to directly incorporate human judgments into the model's optimization process. The foundation of RLHF is the collection of high-quality human preference data, which serves as the training signal for a subsequent Reward Model (RM). This section details the process of acquiring this critical preference data.
The core idea is not to ask humans to write "good" responses (which is closer to SFT data generation) or assign absolute scores (which can be inconsistent across labelers and prompts). Instead, the standard approach involves collecting pairwise comparisons. Given a specific input prompt, the model generates two or more candidate responses. Human labelers are then asked to choose which response they prefer based on predefined criteria (e.g., helpfulness, coherence, safety). This relative judgment tends to be more reliable and easier for humans to provide consistently than absolute scoring.
The process begins with a diverse set of prompts. These prompts should ideally mirror the types of inputs the final aligned model is expected to handle. Common sources include prompts submitted by real users (for example, through a product or API interface), existing instruction-tuning or SFT datasets, and prompts written by labelers to cover specific capabilities, topics, or risk areas.
It's important to ensure diversity in prompt length, topic, complexity, and interaction style (e.g., question answering, summarization, creative writing, conversational turns).
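One practical way to maintain this diversity is to draw prompts from several sources according to target proportions. The sketch below is illustrative only; the source names, example prompts, and mixing weights are all hypothetical.

import random

# Hypothetical prompt pools; in practice these would be loaded from files or
# databases of user queries, curated datasets, and labeler-written prompts.
prompt_sources = {
    "user_queries": [
        "How do I reverse a list in Python?",
        "Summarize the following article in two sentences.",
    ],
    "curated_datasets": [
        "Write a short poem about autumn.",
        "Translate 'good morning' into French.",
    ],
    "labeler_written": [
        "Explain why the sky is blue to a five-year-old.",
    ],
}

# Target mixing proportions (assumed values for illustration only).
source_weights = {"user_queries": 0.5, "curated_datasets": 0.3, "labeler_written": 0.2}

def sample_prompt(rng: random.Random) -> str:
    """Pick a source according to the target proportions, then a prompt from it."""
    source = rng.choices(
        population=list(source_weights.keys()),
        weights=list(source_weights.values()),
        k=1,
    )[0]
    return rng.choice(prompt_sources[source])

rng = random.Random(0)
prompts_for_labeling = [sample_prompt(rng) for _ in range(4)]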
Once a prompt is selected, the SFT model (the model resulting from the previous fine-tuning stage) is used to generate multiple candidate responses for that prompt. Typically, at least two responses, y1 and y2, are generated for a given prompt x. Techniques like varying the decoding temperature or using top-k/top-p sampling can encourage diversity between the generated responses, as sketched in the code example below.
Core loop for generating a single preference data point.
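As a concrete illustration, the following sketch generates two candidate responses for one prompt with the Hugging Face transformers library, varying the temperature and using nucleus (top-p) sampling to diversify the outputs. The model path is a placeholder; substitute your own SFT checkpoint.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: replace with the actual SFT checkpoint from the previous stage.
sft_model_path = "path/to/sft-model"

tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
model = AutoModelForCausalLM.from_pretrained(sft_model_path)
model.eval()

prompt = "Explain the concept of gradient descent in simple terms."
inputs = tokenizer(prompt, return_tensors="pt")

candidates = []
for temperature in (0.7, 1.0):  # vary decoding settings to encourage diverse responses
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            max_new_tokens=256,
        )
    # Strip the prompt tokens so only the generated continuation remains.
    response = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    candidates.append(response)

y1, y2 = candidates  # the two candidate responses shown to labelers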
Human labelers are presented with the prompt x and the two generated responses, y1 and y2. The interface should clearly display the prompt and both responses side by side. To mitigate presentation bias, the order of responses (left vs. right, top vs. bottom) should be randomized for each comparison task.
Labelers are given specific instructions and criteria for making their choice. These criteria often revolve around helpfulness (does the response actually address the prompt?), honesty and factual accuracy, harmlessness and safety, and overall clarity and coherence.
Labelers select the response they judge to be superior according to these criteria; this becomes the chosen response (y_chosen), and the other becomes the rejected response (y_rejected). Interfaces might also allow labelers to indicate that both responses are equally good or bad, or that neither is acceptable. Some setups additionally ask for free-text explanations of the preference, which can be valuable for quality control and for understanding failure modes, although this increases annotation cost.
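Because the display order is randomized, the collection pipeline must map the labeler's left/right selection back to the underlying responses. A minimal sketch of that bookkeeping, with hypothetical field names:

import random

def build_comparison_task(prompt: str, y1: str, y2: str, rng: random.Random) -> dict:
    """Randomize which response appears on the left to mitigate presentation bias."""
    swap = rng.random() < 0.5
    left, right = (y2, y1) if swap else (y1, y2)
    return {"prompt": prompt, "left": left, "right": right, "swapped": swap}

def record_preference(task: dict, labeler_pick: str) -> dict:
    """Convert a labeler's 'left'/'right' selection into a chosen/rejected pair."""
    chosen = task[labeler_pick]
    rejected = task["right"] if labeler_pick == "left" else task["left"]
    return {
        "prompt": task["prompt"],
        "chosen_response": chosen,
        "rejected_response": rejected,
    }

rng = random.Random(42)
task = build_comparison_task("Explain gradient descent simply.", y1="response A", y2="response B", rng=rng)
preference = record_preference(task, labeler_pick="left")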
The outcome of this process is a dataset of preference tuples. Each tuple typically contains the prompt, the preferred (chosen) response, and the unpreferred (rejected) response. This structure forms the direct input for training the reward model.
Here's a simplified example of how a single data point might be represented in Python:
preference_data_point = {
"prompt": "Explain the concept of gradient descent in simple terms.",
"chosen_response": (
"Imagine you're blindfolded on a hill and want to get to the bottom. "
"Gradient descent is like taking small steps in the steepest downhill "
"direction you can feel with your feet. You keep taking steps until "
"you can't go further down."
),
"rejected_response": (
"Gradient descent uses derivatives to find the minimum of a function. "
"It calculates the gradient and updates parameters."
),
"labeler_id": "annotator_123",
"timestamp": "2023-10-27T10:30:00Z",
# Optional fields
"reasoning": (
"Chosen response uses a helpful analogy, making it simpler."
),
"model_version": "sft_model_v1.2"
}
# Preference data is often stored as JSON Lines on disk and loaded into
# libraries such as Hugging Face Datasets for reward model training.
from datasets import Dataset

preference_dataset = Dataset.from_dict({
    "prompt": [preference_data_point["prompt"]],
    "chosen": [preference_data_point["chosen_response"]],
    "rejected": [preference_data_point["rejected_response"]],
})

# Dataset.to_json writes the records in JSON Lines format.
preference_dataset.to_json("preference_data.jsonl")
The performance of the RLHF process depends heavily on both the quality and the quantity of the preference data: noisy or inconsistent labels translate directly into a noisy reward signal, and too few comparisons leave the reward model poorly calibrated.
Example distribution of prompt sources used for collecting preference data.
Collecting human preference data is a significant undertaking with several inherent challenges: it is expensive and slow, since every comparison requires human time; labelers often disagree on subjective criteria, so labels can be noisy; interface and labeler biases (such as favoring longer or more confident-sounding responses) can skew the data; and maintaining consistent annotation quality becomes harder as collection scales.
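One common quality-control step that targets labeler disagreement is to route a fraction of comparisons to multiple annotators and measure how often they agree. The sketch below assumes a simple record format with hypothetical field names and is not tied to any particular annotation platform.

from collections import defaultdict

# Hypothetical records: the same comparison pair labeled by two annotators.
# "pick" is the index (1 or 2) of the response the annotator preferred.
labels = [
    {"pair_id": "p1", "annotator": "a1", "pick": 1},
    {"pair_id": "p1", "annotator": "a2", "pick": 1},
    {"pair_id": "p2", "annotator": "a1", "pick": 2},
    {"pair_id": "p2", "annotator": "a2", "pick": 1},
]

# Group picks by comparison pair.
picks_by_pair = defaultdict(list)
for record in labels:
    picks_by_pair[record["pair_id"]].append(record["pick"])

# Raw agreement rate over pairs annotated by at least two labelers.
multi = [picks for picks in picks_by_pair.values() if len(picks) >= 2]
agreement = sum(len(set(picks)) == 1 for picks in multi) / len(multi)
print(f"Raw inter-annotator agreement: {agreement:.0%}")  # 50% in this toy example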
Despite these challenges, collecting high-quality preference data is a cornerstone of the RLHF process. This dataset directly shapes the reward signal used to steer the language model towards behaviors aligned with human values. The next step involves using this collected data to train the Reward Model itself.