Evaluating models fine-tuned with RLHF requires moving beyond traditional NLP metrics. While metrics like perplexity, BLEU, or ROUGE can gauge fluency or semantic similarity, they fall short in measuring alignment with complex human preferences related to helpfulness, honesty, and harmlessness. Assessing an RLHF-tuned model demands a specialized set of metrics tailored to these alignment goals.
As outlined in the chapter introduction, achieving alignment means ensuring the model behaves according to human intent and values. Standard metrics don't capture whether a response, even if fluent and relevant, is actually helpful, truthful, or safe. Therefore, we need metrics specifically designed to probe these dimensions.
Consider a model fine-tuned to be helpful. A response might achieve a high ROUGE score against a reference answer by including similar keywords, yet fail to provide actionable advice or address the user's underlying need. Similarly, perplexity measures fluency under a language model's probability distribution but doesn't assess factual accuracy or safety. Relying solely on these metrics can be misleading when the objective is alignment.
A widely adopted framework for evaluating LLM alignment centers on three pillars: Helpfulness, Honesty, and Harmlessness (HHH). Metrics are often developed to target one or more of these aspects.
Helpfulness relates to how well the model assists users in achieving their goals, answering questions accurately and completely, and providing useful information.
Honesty involves the model providing factually accurate information, avoiding fabrication ("hallucination"), admitting uncertainty when appropriate, and citing sources if applicable.
Benchmarks such as TruthfulQA are designed to measure whether a model avoids generating false statements that mimic common human misconceptions. Performance is typically measured by the percentage of questions answered truthfully, often distinguishing between generating a truthful statement and correctly identifying truthful statements in multiple-choice settings.
Harmlessness focuses on the model's propensity to avoid generating toxic, biased, discriminatory, unsafe, or unethical content. Evaluations often apply automated toxicity classifiers (for example, classifiers built around the ToxiGen dataset) to score the toxicity level of model outputs. Metrics often involve the average toxicity score or the percentage of outputs flagged as toxic above a certain threshold. Bias is commonly probed with benchmarks such as the Bias Benchmark for QA (BBQ), or by analyzing performance disparities across different demographic groups (e.g., using Winogender Schemas) to quantify social biases reflected in model outputs.
Diagram: the relationship between the HHH alignment pillars and the types of metrics used to evaluate them.
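As a concrete illustration of the toxicity metrics described above, here is a minimal sketch that aggregates per-response toxicity scores into an average score and a flagged percentage. The `score_toxicity` callable is a placeholder for whatever classifier you use (for instance, one trained on ToxiGen-style data), and the 0.5 threshold is an arbitrary choice.

```python
from typing import Callable, Dict, List

def harmlessness_metrics(
    responses: List[str],
    score_toxicity: Callable[[str], float],  # placeholder: your toxicity classifier, returning a score in [0, 1]
    threshold: float = 0.5,                  # arbitrary flagging threshold; tune for your deployment context
) -> Dict[str, float]:
    """Aggregate per-response toxicity scores into summary metrics."""
    scores = [score_toxicity(r) for r in responses]
    avg_toxicity = sum(scores) / len(scores)
    pct_flagged = 100.0 * sum(s > threshold for s in scores) / len(scores)
    return {"avg_toxicity": avg_toxicity, "pct_flagged": pct_flagged}
```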
While not direct measures of ground-truth alignment, two other quantities generated during RLHF are informative:
Reward Model Score: As mentioned, the average score assigned by the RM to the policy model's outputs during evaluation provides insight into how well the policy is optimizing the learned preference objective. A high RM score suggests the policy is behaving in ways the RM predicts humans would prefer. Tracking this score during training is essential, but it should be interpreted alongside other metrics due to the risk of the policy exploiting loopholes in the RM (reward hacking).
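A minimal sketch of how this average RM score might be computed on a held-out prompt set is shown below; `policy_generate` and `reward_model` are hypothetical stand-ins for your own generation and reward-scoring functions.

```python
def mean_reward(prompts, policy_generate, reward_model):
    """Average reward model score of policy outputs on held-out prompts.

    `policy_generate(prompt)` and `reward_model(prompt, response)` are
    hypothetical stand-ins for your own generation and scoring functions.
    """
    rewards = []
    for prompt in prompts:
        response = policy_generate(prompt)
        rewards.append(reward_model(prompt, response))
    return sum(rewards) / len(rewards)
```

Logging this value at regular intervals during training, alongside the KL divergence and spot-checked outputs, makes sudden jumps in reward (a common symptom of reward hacking) easier to catch.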
KL Divergence: The Kullback-Leibler (KL) divergence between the RLHF-tuned policy (π_RL) and the initial SFT policy (π_SFT) measures how much the policy has changed during RL fine-tuning. It's typically used as a penalty during PPO training (KL(π_RL || π_SFT)) to prevent the policy from deviating too drastically from the SFT model, preserving language quality and general capabilities. While primarily a constraint, the final KL divergence value can be reported as a metric. A very high KL might indicate significant behavioral changes (potentially good for alignment, but risky for capabilities), while a very low KL might suggest the RL tuning had minimal impact. Interpreting KL divergence requires context about the training setup and other evaluation results.
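In practice this divergence is usually estimated by sampling from the RL policy and averaging the per-token log-probability differences between the two models. The sketch below assumes you already have aligned per-token log-probabilities from both policies for the same sampled tokens; the names are illustrative, not taken from any particular library.

```python
def estimate_kl(logprobs_rl, logprobs_sft):
    """Monte Carlo estimate of KL(pi_RL || pi_SFT).

    Both lists hold per-token log-probabilities for the *same* tokens,
    which were sampled from the RL policy. Under that sampling distribution,
    the mean of (log pi_RL - log pi_SFT) estimates the KL divergence.
    """
    assert len(logprobs_rl) == len(logprobs_sft)
    diffs = [lp_rl - lp_sft for lp_rl, lp_sft in zip(logprobs_rl, logprobs_sft)]
    return sum(diffs) / len(diffs)
```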
Often, a single metric is insufficient. Evaluating alignment typically involves a suite of metrics covering HHH aspects, alongside traditional capability benchmarks. Tools like the Holistic Evaluation of Language Models (HELM) framework or the EleutherAI Language Model Evaluation Harness provide infrastructure for running standardized evaluations across many metrics and datasets, offering a more comprehensive picture of model performance and alignment. These automated suites are discussed further in a later section.
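As one illustration, the snippet below calls the lm-evaluation-harness Python API to run honesty- and harmlessness-related tasks; treat the exact task names and arguments (e.g., `truthfulqa_mc2`, `toxigen`) as assumptions to verify against your installed harness version.

```python
# pip install lm-eval   (the EleutherAI Language Model Evaluation Harness)
import lm_eval

# Task names and argument formats differ across harness versions; the
# identifiers below are assumptions to check against your installation.
results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face model backend
    model_args="pretrained=your-org/your-rlhf-model",  # placeholder model identifier
    tasks=["truthfulqa_mc2", "toxigen"],               # honesty- and harmlessness-related tasks
    batch_size=8,
)
print(results["results"])
```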
Choosing the right set of metrics depends on the specific alignment goals for your model and the potential risks associated with its deployment context. A combination of automated metrics and human evaluation is usually necessary for a thorough assessment.