While the training objective for the reward model (RM), often based on the Bradley-Terry framework, effectively learns relative preferences between pairs of responses, the raw output scores $RM(\text{prompt}, \text{response})$ don't automatically possess a meaningful scale. The difference between two scores, $RM(\text{prompt}, \text{response}_1) - RM(\text{prompt}, \text{response}_2)$, determines the predicted probability $P(\text{response}_1 \succ \text{response}_2)$ via the sigmoid function. However, is a score difference of 2 twice as "strong" a preference as a difference of 1? Not necessarily.
Calibration addresses this issue. A well-calibrated reward model produces scores where the predicted probabilities accurately reflect the true probabilities observed in the human preference data. If the model predicts $\sigma(RM_1 - RM_2) = 0.8$, we expect that, for pairs with approximately this score difference, humans indeed preferred response 1 around 80% of the time. Without calibration, the RM might be systematically overconfident (e.g., predicting 0.9 probability when the actual preference rate is only 70%) or underconfident.
A standard technique for evaluating calibration is the reliability diagram (also known as a calibration plot), which bins preference pairs by their predicted probability and plots the observed preference rate in each bin against the average predicted probability.
Perfect calibration corresponds to the y=x line. Deviations indicate miscalibration: points below the line indicate overconfidence (predicted probability > actual accuracy), while points above indicate underconfidence.
Example reliability diagram comparing an uncalibrated model (overconfident at higher probabilities) and a better-calibrated model.
Quantitatively, calibration can be measured using metrics like the Expected Calibration Error (ECE), which computes the weighted average difference between predicted probabilities and observed accuracies across bins.
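To make this concrete, here is a minimal NumPy sketch that bins predicted preference probabilities against observed human choices and computes ECE. The array names (`pred_probs`, `labels`), the ten equal-width bins, and the synthetic data at the end are illustrative assumptions, not outputs of any particular RM.

```python
import numpy as np

def expected_calibration_error(pred_probs, labels, n_bins=10):
    """ECE: weighted average |observed preference rate - predicted probability| over bins.

    pred_probs: predicted P(response_1 preferred), shape (N,)
    labels:     1 if humans preferred response_1, else 0, shape (N,)
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels, dtype=float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Assign each prediction to one bin (include 1.0 in the last bin).
        in_bin = (pred_probs >= lo) & (pred_probs < hi) if hi < 1.0 else (pred_probs >= lo)
        if not in_bin.any():
            continue
        confidence = pred_probs[in_bin].mean()   # average predicted probability in the bin
        accuracy = labels[in_bin].mean()         # observed preference rate in the bin
        ece += in_bin.mean() * abs(accuracy - confidence)
    return ece

# Hypothetical example: an overconfident model whose score differences are too large.
rng = np.random.default_rng(0)
score_diff = rng.normal(size=2000)
pred_probs = 1.0 / (1.0 + np.exp(-2.0 * score_diff))            # predicted probabilities
labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-score_diff)))     # "true" preference process
print(f"ECE: {expected_calibration_error(pred_probs, labels):.3f}")
```

The same per-bin confidences and accuracies can be plotted directly to produce the reliability diagram described above.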
If assessment reveals poor calibration, several techniques can be applied:
Temperature Scaling: This is a simple and often effective post-processing method. It involves rescaling the logits (the inputs to the final activation function, in our case, the raw RM scores or their difference) by a learned temperature parameter $T > 0$. The calibrated probability is calculated as:

$$P_{\text{calibrated}}(\text{response}_1 \succ \text{response}_2) = \sigma\!\left(\frac{RM(\text{prompt}, \text{response}_1) - RM(\text{prompt}, \text{response}_2)}{T}\right)$$

The temperature $T$ is optimized on a held-out validation set of preference pairs, typically by minimizing the negative log-likelihood (NLL) or a calibration metric like ECE on that set; a minimal fitting sketch appears after this list.
Label Smoothing: Applied during the initial RM training phase, label smoothing replaces hard targets (0 or 1) with slightly softened targets (e.g., 0.05 and 0.95). This discourages the model from producing extremely high-confidence predictions (pushing logits towards positive or negative infinity) and can implicitly improve calibration. A short loss-function sketch also follows this list.
Isotonic Regression: Another post-processing technique that fits a non-decreasing function mapping the model's output probabilities to calibrated probabilities. It is more expressive than temperature scaling but requires more validation data and can sometimes be less stable; a scikit-learn sketch is included below.
Data Quality and Diversity: Fundamentally, calibration issues can arise from noisy labels, insufficient data in certain regions of the score space, or a mismatch between the training data distribution and the responses the policy generates during the later RL phase. Improving the quality and diversity of the human preference dataset is always beneficial.
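As a concrete sketch of temperature scaling, the code below fits $T$ by minimizing the Bradley-Terry NLL on a held-out validation set. The inputs (`val_score_diffs`, `val_labels`) and the synthetic "overconfident RM" data are illustrative assumptions; optimizing over $\log T$ is simply one convenient way to keep $T$ positive.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(score_diffs, labels):
    """Find T > 0 minimizing the NLL of sigmoid(score_diff / T) on validation pairs.

    score_diffs: RM(prompt, response_1) - RM(prompt, response_2), shape (N,)
    labels:      1 if humans preferred response_1, else 0, shape (N,)
    """
    score_diffs = np.asarray(score_diffs, dtype=float)
    labels = np.asarray(labels, dtype=float)

    def nll(log_T):
        T = np.exp(log_T)                    # optimize log T so T stays positive
        logits = score_diffs / T
        # Numerically stable binary cross-entropy with logits.
        return np.mean(np.logaddexp(0.0, logits) - labels * logits)

    result = minimize_scalar(nll, bounds=(-5.0, 5.0), method="bounded")
    return float(np.exp(result.x))

def calibrated_probability(score_diff, T):
    """P_calibrated(response_1 > response_2) after temperature scaling."""
    return 1.0 / (1.0 + np.exp(-score_diff / T))

# Hypothetical usage: validation data from an RM whose scores are "too spread out".
rng = np.random.default_rng(0)
true_diff = rng.normal(size=2000)
val_score_diffs = 2.0 * true_diff
val_labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_diff)))
T = fit_temperature(val_score_diffs, val_labels)
print(f"Fitted temperature: {T:.2f}")        # should come out near 2 for this synthetic data
```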
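Label smoothing only changes the training target, so a sketch amounts to a single loss function. The version below assumes a PyTorch training loop with per-pair scores for the chosen and rejected responses; the function name and the 0.05 smoothing value are illustrative choices.

```python
import torch
import torch.nn.functional as F

def smoothed_preference_loss(score_chosen, score_rejected, smoothing=0.05):
    """Bradley-Terry loss with label smoothing on the "chosen wins" target.

    score_chosen, score_rejected: RM scores for the preferred and rejected
    responses in a batch of preference pairs, shape (batch,).
    """
    logits = score_chosen - score_rejected
    # The hard target 1.0 is replaced by 1 - smoothing, so the loss is minimized
    # at a finite score difference instead of pushing logits toward infinity.
    targets = torch.full_like(logits, 1.0 - smoothing)
    return F.binary_cross_entropy_with_logits(logits, targets)

# Hypothetical training step with dummy scores.
score_chosen = torch.randn(8, requires_grad=True)
score_rejected = torch.randn(8, requires_grad=True)
loss = smoothed_preference_loss(score_chosen, score_rejected)
loss.backward()
```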
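For isotonic regression, scikit-learn's `IsotonicRegression` can serve as the post-hoc calibrator, fitted on validation pairs and then applied to new predicted probabilities. The data below is synthetic and mirrors the temperature scaling sketch; in practice you would recompute a calibration metric such as ECE before and after fitting to confirm the improvement.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical validation data: raw RM score differences and human labels.
rng = np.random.default_rng(1)
true_diff = rng.normal(size=2000)
val_score_diffs = 2.0 * true_diff                                # overconfident RM
val_labels = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_diff)))

# Uncalibrated predicted probabilities from the raw score differences.
uncal_probs = 1.0 / (1.0 + np.exp(-val_score_diffs))

# Fit a non-decreasing map from predicted probability to observed preference rate.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(uncal_probs, val_labels)

# At inference time, calibrated probabilities for new pairs come from iso.predict.
print(iso.predict(np.array([0.6, 0.8, 0.95])).round(3))
```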
The subsequent RL phase (typically using PPO) relies heavily on the reward signal provided by the RM. The magnitude of the reward $r = RM(\text{prompt}, \text{response})$, or the advantage calculated using it, directly influences the scale of policy updates.
A well-calibrated RM provides a more reliable and interpretable reward signal. The magnitude of the reward difference between two potential responses better reflects the actual strength of human preference, leading to more stable and effective policy updates during the RL fine-tuning stage. While PPO often involves normalizing advantages, the relative scale of rewards derived from a calibrated model is still beneficial for learning nuanced behaviors aligned with human judgments. Therefore, assessing and improving RM calibration is an important step before proceeding to the RL optimization phase.