After completing Supervised Fine-Tuning (SFT), the primary goal shifts from simply continuing pre-training to verifying whether the model has actually become better aligned with desired behaviors. Evaluation in this context is less about raw language modeling ability (like perplexity on a general corpus) and more about assessing the model's capability to follow instructions, provide helpful responses, and adhere to specified constraints. This evaluation step is significant because it determines the success of the SFT phase and often informs subsequent alignment stages like Reinforcement Learning from Human Feedback (RLHF).
Before evaluating, clearly define the specific alignment goals SFT aimed to achieve. These typically include:

- Instruction following: does the model actually carry out the task it was asked to perform, including multi-step or constrained instructions?
- Helpfulness: are responses relevant, complete, and genuinely useful to the user?
- Harmlessness: does the model avoid unsafe, toxic, or otherwise undesirable outputs?
- Constraint and format adherence: does the model respect requested output structures, such as length limits, bullet points, or code formatting?

Measuring these qualities often requires moving beyond standard automated metrics.
For subjective attributes like helpfulness, instruction following fidelity on complex tasks, and harmlessness, human evaluation remains the most reliable method. Setting up effective human evaluation involves several considerations: writing clear rating guidelines and rubrics, deciding between absolute scoring (e.g., a 1-5 Likert scale) and pairwise comparison of responses, training annotators on representative examples, and measuring inter-annotator agreement to detect ambiguous criteria.
While powerful, human evaluation is resource-intensive (time, cost) and can suffer from inter-annotator disagreement. It's often used to validate automated metrics or for periodic deep assessments.
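To quantify whether annotators are applying a rubric consistently, a common check is Cohen's kappa over paired ratings. The sketch below is a minimal example using scikit-learn; the two rating arrays are made-up placeholder data, not real annotations.

from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 helpfulness ratings from two annotators for the
# same ten SFT model responses (placeholder data).
annotator_a = [5, 4, 4, 2, 5, 3, 4, 1, 5, 3]
annotator_b = [5, 3, 4, 2, 4, 3, 4, 2, 5, 3]

# Quadratic weighting penalizes large disagreements (1 vs 5) more
# heavily than small ones (3 vs 4) on an ordinal scale.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa: {kappa:.2f}")

Low agreement usually signals that the rating guidelines are ambiguous and need refinement before the scores can be trusted.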
Figure: Workflow for evaluating SFT model responses using both human and automated methods.
To complement human evaluation and enable faster iteration, several automated approaches are used:
Model-Based Evaluation: Leverage a powerful, pre-existing LLM (often referred to as an "evaluator model," e.g., GPT-4, Claude) to assess the quality of the SFT model's responses. A judge prompt for such an evaluator might look like this:
You are an impartial judge evaluating the quality of an AI assistant's response to a user's instruction.
Instruction: "Summarize the following text into three bullet points:\n[Long text snippet here...]"
Assistant's Response: "[Model's generated summary here...]"
Evaluate the response based on the following criteria:
1. Accuracy: Does the summary accurately reflect the main points of the original text? (1-5)
2. Conciseness: Is the summary brief and to the point? (1-5)
3. Format Compliance: Did the assistant use exactly three bullet points? (Yes/No)
Provide your ratings in JSON format: {"accuracy": <score>, "conciseness": <score>, "format_compliance": "<yes/no>"}
Also provide a brief justification for your scores.
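The snippet below is one hedged way to send such a judge prompt to an external evaluator model and parse its structured verdict. It assumes the openai Python client (v1+) and an OPENAI_API_KEY in the environment; build_judge_prompt is a hypothetical helper that condenses the template above, and the model name and prompt wording are illustrative rather than prescriptive.

import json
from openai import OpenAI  # assumes: pip install openai (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_judge_prompt(instruction: str, response: str) -> str:
    # Hypothetical helper: a condensed version of the judge template above.
    return (
        "You are an impartial judge evaluating the quality of an AI "
        "assistant's response to a user's instruction.\n\n"
        f"Instruction: \"{instruction}\"\n"
        f"Assistant's Response: \"{response}\"\n\n"
        "Rate accuracy (1-5), conciseness (1-5), and format compliance "
        "(Yes/No). Reply only with JSON: "
        '{"accuracy": <score>, "conciseness": <score>, '
        '"format_compliance": "<yes/no>"}'
    )

def judge_response(instruction: str, response: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4",  # any capable evaluator model
        messages=[
            {"role": "user", "content": build_judge_prompt(instruction, response)}
        ],
        temperature=0,  # deterministic scoring reduces judge variance
    )
    # Assumes the judge complied and returned only the requested JSON.
    return json.loads(completion.choices[0].message.content)

Judge outputs are still noisy in practice; averaging several judgments or spot-checking a sample against human ratings is advisable before trusting the scores.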
Benchmark Datasets: Evaluate the SFT model on established benchmarks designed specifically for instruction following or helpfulness. One example is AlpacaEval, which compares a model's outputs against responses from a reference model (originally text-davinci-003) on the Alpaca instruction set. Using benchmarks involves running the SFT model on the benchmark prompts and then applying the benchmark's prescribed evaluation protocol (often model-based or human-based).
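The loop below sketches the first half of that process: running the SFT model over a file of benchmark prompts and saving the outputs for the benchmark's own evaluator. The file names are hypothetical, and generate_response stands in for the actual model call shown later in this section.

import json

def generate_response(prompt: str) -> str:
    # Placeholder: call your SFT model here (see the generation
    # example later in this section).
    return "..."

# Hypothetical benchmark file: one JSON object per line with a "prompt" field.
with open("benchmark_prompts.jsonl") as f_in, \
        open("sft_outputs.jsonl", "w") as f_out:
    for line in f_in:
        prompt = json.loads(line)["prompt"]
        record = {"prompt": prompt, "response": generate_response(prompt)}
        f_out.write(json.dumps(record) + "\n")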
Reference-Based Metrics (Use with Caution): Metrics like ROUGE (for summarization) or BLEU (for translation) can be used if the SFT task involves generating text that should closely match a reference (e.g., fine-tuning for a specific summarization style). However, they are often poor indicators of general instruction following or helpfulness, because they only measure surface n-gram overlap with a single reference: many perfectly good responses share few words with that reference, and high overlap does not guarantee the response is accurate, fluent, or compliant with the instruction's constraints.
Consider this simple example:

Instruction: "Explain gravity in one sentence."
Reference: "Gravity is the force by which a planet or other body draws objects toward its center."
Model A: "Gravity is the fundamental force attracting objects with mass towards each other." (Good, low BLEU/ROUGE)
Model B: "Gravity is a force. Objects get drawn by planets to their centers." (Okay, higher BLEU/ROUGE but less fluent)
Using ROUGE here might misleadingly favor Model B.
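To see this concretely, the short sketch below scores both model responses against the reference using the rouge_score package (an assumption; install with pip install rouge-score). Exact numbers will vary slightly with settings, but Model B's greater word overlap with the reference tends to earn it the higher score despite being the weaker answer.

from rouge_score import rouge_scorer  # assumes: pip install rouge-score

reference = (
    "Gravity is the force by which a planet or other body draws "
    "objects toward its center."
)
model_a = ("Gravity is the fundamental force attracting objects with "
           "mass towards each other.")
model_b = "Gravity is a force. Objects get drawn by planets to their centers."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

for name, candidate in [("Model A", model_a), ("Model B", model_b)]:
    scores = scorer.score(reference, candidate)
    print(
        f"{name}: ROUGE-1 F1={scores['rouge1'].fmeasure:.2f}, "
        f"ROUGE-L F1={scores['rougeL'].fmeasure:.2f}"
    )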
Let's illustrate fetching a response and preparing for evaluation. Assume you have loaded your SFT model and tokenizer using PyTorch and Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Assume model and tokenizer are loaded
# model = AutoModelForCausalLM.from_pretrained(
# "path/to/your/sft_model"
# )
# tokenizer = AutoTokenizer.from_pretrained(
# "path/to/your/sft_model"
# )
# device = torch.device(
# "cuda" if torch.cuda.is_available() else "cpu"
# )
# model.to(device)
# model.eval() # Set model to evaluation mode
# --- Placeholder for actual model loading ---
class DummyModel:  # Simulate a loaded model
    def generate(
        self, input_ids, attention_mask, max_new_tokens, pad_token_id
    ):
        # Simulate generation by appending random "word" token ids
        # (>= 1000 so the dummy decoder maps them back to words).
        new_tokens = torch.randint(
            1000,
            2000,
            (input_ids.shape[0], max_new_tokens),
            device=input_ids.device
        )
        output_ids = torch.cat([input_ids, new_tokens], dim=1)
        return output_ids

class DummyTokenizer:  # Simulate a loaded tokenizer
    def __init__(self):
        self.pad_token_id = 0

    def encode(self, text, return_tensors=None):
        # Very simple simulation: one pseudo-token id per word
        tokens = [101] + [i + 1000 for i in range(len(text.split()))]
        return torch.tensor([tokens], dtype=torch.long)

    def decode(self, ids, skip_special_tokens=False):
        # Very simple simulation: map pseudo-token ids back to words
        words = [
            f"word{i - 1000}" if i >= 1000 else "[CLS]"
            for i in ids[0].tolist()
        ]
        return " ".join(words)

    def __call__(
        self, text, return_tensors=None, padding=False, truncation=False
    ):
        # Simulate the __call__ interface of real tokenizers;
        # returns a plain dict of input tensors.
        encoded = self.encode(text, return_tensors)
        return {
            "input_ids": encoded,
            "attention_mask": torch.ones_like(encoded)
        }

model = DummyModel()
tokenizer = DummyTokenizer()
device = torch.device("cpu")  # Simplified for example
# --- End Placeholder ---
def generate_response(prompt_text, model, tokenizer, max_new_tokens=100):
    """Generates a response from the SFT model."""
    inputs = tokenizer(
        prompt_text, return_tensors="pt", padding=True, truncation=True
    )
    # Move input tensors to the target device (works for a plain dict
    # as well as a Hugging Face BatchEncoding).
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        output_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.pad_token_id
        )
    # Decode only the newly generated tokens
    input_length = inputs["input_ids"].shape[1]
    generated_ids = output_ids[:, input_length:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True)
    return response
# Example evaluation prompt
eval_prompt = (
"Instruction: Write a short python function to calculate factorial."
"\nResponse:"
)
# Note: Good SFT formatting includes clear separators like "\nResponse:"
generated_text = generate_response(eval_prompt, model, tokenizer)
print(f"Evaluation Prompt:\n{eval_prompt}")
print(f"\nGenerated Response:\n{generated_text}")
# --- Next Steps would be: ---
# 1. Send `eval_prompt` and `generated_text` to human evaluators.
# 2. Or, format them for a model-based evaluator (like GPT-4).
# 3. Or, if part of a benchmark, use the benchmark's specific evaluation
# script.
# 4. Or, apply simpler checks (e.g., check if 'def' and 'return' are
# in generated_text) - limited but fast.
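As a quick illustration of option 4 above, the sketch below applies a few cheap, rule-based checks to the generated text. The specific checks are only examples; useful ones should mirror the constraints present in your SFT data.

def simple_code_checks(text: str) -> dict:
    """Fast, shallow checks for a 'write a Python function' instruction."""
    return {
        "defines_function": "def " in text,
        "has_return": "return" in text,
        "mentions_factorial": "factorial" in text.lower(),
    }

checks = simple_code_checks(generated_text)
print(f"\nRule-based checks: {checks}")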
Different evaluation methods provide different signals. It's often useful to combine them. For instance, use automated metrics/benchmarks for broad coverage and frequent checks, and use human evaluation periodically to validate the automated results and probe for subtle issues.
Chart: Hypothetical comparison of evaluation scores across different methods for assessing SFT model alignment. Note how automated checks might differ from nuanced human or model-based ratings.
Ultimately, evaluating SFT models effectively requires a clear understanding of the alignment goals and a thoughtful combination of human insight and scalable automated techniques. The results guide further fine-tuning efforts, helping to create LLMs that are not just capable, but genuinely helpful and reliable.