To effectively evaluate a pre-trained Large Language Model (LLM) on downstream tasks, we often need to adapt it slightly to the specific format and objective of the target task. This process is known as fine-tuning. While zero-shot or few-shot evaluations (discussed later) test the model's inherent generalization capabilities, fine-tuning allows the model to learn task-specific patterns from labeled data, often yielding higher performance and providing a clearer picture of the pre-trained model's potential utility for that task.
Fine-tuning leverages the powerful representations learned during pre-training and adjusts them using a relatively small amount of task-specific labeled data. The core idea is that the pre-trained model already understands language structure, semantics, and context; fine-tuning simply teaches it how to apply this knowledge to a new problem formulation.
The standard procedure for fine-tuning an LLM for evaluation on a downstream task involves several steps:

1. Select the downstream task and a labeled dataset for it, with train/validation/test splits.
2. Adapt the model architecture, usually by adding a small task-specific head.
3. Format the task data to match the model's expected input structure.
4. Train the combined model on the labeled training data with a task-specific loss.
5. Evaluate the fine-tuned model on held-out data using task-specific metrics.
At a high level, the flow looks like this:
A pre-trained LLM's core layers provide input representations to a newly added task-specific head. The head makes predictions, which are compared against labels from the task dataset using a task-specific loss function. Gradients from this loss update the weights of the head and typically some or all of the pre-trained layers.
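To make that last point concrete, here is a minimal sketch of controlling which layers receive gradient updates in PyTorch. The model choice (`bert-base-uncased`) and the attribute path (`encoder.layer`) are illustrative assumptions specific to BERT-style backbones, not a prescribed recipe:

```python
from transformers import AutoModel

# Minimal sketch, assuming a BERT-style backbone; the attribute path
# (encoder.layer) is specific to this architecture.
backbone = AutoModel.from_pretrained("bert-base-uncased")

# Freeze the entire pre-trained backbone
for param in backbone.parameters():
    param.requires_grad = False

# Optionally unfreeze the top encoder layer so it can adapt to the task
for param in backbone.encoder.layer[-1].parameters():
    param.requires_grad = True

# A newly added head (e.g., an nn.Linear) keeps requires_grad=True by
# default, so gradients update it plus any unfrozen backbone layers.
```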
The adaptation step primarily involves adding the correct head and formatting the data appropriately. Let's look at common examples:
**Text Classification:** The head is typically a single linear layer that takes a pooled sequence representation (the `[CLS]` token in BERT-style models, or the last token in causal models) and projects it to the number of output classes. A dropout layer is commonly added before the linear layer for regularization. Inputs are formatted as `[CLS] text_sequence [SEP]` for BERT-style models. For causal models (like GPT), the input might just be `text_sequence`, and the representation of the final token is used.

```python
import torch
import torch.nn as nn
from transformers import (  # Example using Hugging Face Transformers
    AutoModel, AutoTokenizer
)
# Load pre-trained model (e.g., BERT)
model_name = "bert-base-uncased"
pretrained_model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Define classification head
num_labels = 3 # Example: Positive, Negative, Neutral sentiment
hidden_size = pretrained_model.config.hidden_size
classification_head = nn.Sequential(
    nn.Dropout(0.1),
    nn.Linear(hidden_size, num_labels)
)
# Example input processing
text = "This is an example sentence."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Forward pass (simplified)
# Get hidden states from the pre-trained model
outputs = pretrained_model(**inputs)
# Use the representation of the [CLS] token (first token)
cls_representation = outputs.last_hidden_state[:, 0, :]
# Pass through the classification head
logits = classification_head(cls_representation)
# 'logits' now contains raw scores for each class
# Apply CrossEntropyLoss using these logits and target labels during training
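# For illustration only: computing the training loss against a dummy label.
# 'dummy_labels' is hypothetical; real labels come from the task dataset.
dummy_labels = torch.tensor([0])  # e.g., index 0 might map to "Positive"
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, dummy_labels)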
```

**Extractive Question Answering:** The head is typically a linear layer producing two logits per token: one scoring the token as the start of the answer span and one as the end. Inputs pair the question with the context: `[CLS] question_text [SEP] context_text [SEP]`.

```python
# (Continuing from previous imports)
# Define QA head
qa_head = nn.Linear(hidden_size, 2)
# Output: start_logit, end_logit for each token
# Example input processing
question = "What is the capital of France?"
context = "France is a country in Europe. Paris is its capital and largest city."
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    padding=True,
    truncation=True
)
# Forward pass (simplified)
outputs = pretrained_model(**inputs)
sequence_output = outputs.last_hidden_state
# Shape: (batch_size, seq_len, hidden_size)
# Pass sequence output through the QA head
logits = qa_head(sequence_output) # Shape: (batch_size, seq_len, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1) # Shape: (batch_size, seq_len)
end_logits = end_logits.squeeze(-1) # Shape: (batch_size, seq_len)
# 'start_logits' and 'end_logits' contain scores for start/end positions
# Apply CrossEntropyLoss using these and target start/end indices
# during training
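# For illustration only: greedily decode the most likely answer span.
# A real pipeline would also enforce start_idx <= end_idx and cap span length.
start_idx = torch.argmax(start_logits, dim=-1).item()
end_idx = torch.argmax(end_logits, dim=-1).item()
answer_ids = inputs["input_ids"][0][start_idx:end_idx + 1]
print(tokenizer.decode(answer_ids))  # untrained head, so the span is arbitrary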
```

**Sequence-to-Sequence Tasks (Summarization, Translation):** Encoder-decoder models (like T5) typically need no new head; the existing language modeling head generates the target sequence, and the task is encoded in the input format. For summarization, the input might be `summarize: input_article`. For translation: `translate English to French: input_english_sentence`. The target sequence is used as labels during training.
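To make this concrete, here is a minimal sketch of preparing a summarization example for a T5-style model. The model name (`t5-small`) and the example texts are assumptions for illustration:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed model for illustration; its built-in LM head acts as the task head
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")

article = "France is a country in Europe. Paris is its capital and largest city."
source = t5_tokenizer("summarize: " + article, return_tensors="pt")
target = t5_tokenizer("Paris is the capital of France.", return_tensors="pt")

# Passing the target token IDs as labels makes the model compute the
# sequence-to-sequence cross-entropy loss internally
outputs = t5_model(**source, labels=target["input_ids"])
print(outputs.loss)
```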
Fine-tuning involves training, but with some differences compared to pre-training:

- A much smaller learning rate (commonly 1e-5 to 5e-5), to adjust the pre-trained weights without destroying them.
- Far fewer epochs (often 2 to 4), since the task dataset is small and overfitting is a risk.
- A learning rate schedule with warmup followed by decay (often linear).
- A task-specific loss computed against labeled data, rather than a language modeling objective.

A representative training loop:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset
from transformers import get_linear_schedule_with_warmup
# Assume 'model' includes the pre-trained backbone and the task-specific head
# Assume 'train_dataset' is a PyTorch Dataset yielding formatted,
# tokenized inputs and labels
# Assume 'loss_fn' is the appropriate loss function (e.g., nn.CrossEntropyLoss)
learning_rate = 3e-5
num_epochs = 3
batch_size = 16
warmup_steps = 100
total_training_steps = len(train_dataset) * num_epochs // batch_size # Approximation
optimizer = AdamW(model.parameters(), lr=learning_rate)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_training_steps
)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move batch to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)  # Assuming labels are part of the batch

        # Clear previous gradients
        optimizer.zero_grad()

        # Forward pass - get model predictions (logits).
        # The exact way to get logits depends on the model wrapper and task:
        #   classification: logits = model(input_ids=input_ids,
        #                                  attention_mask=attention_mask).logits
        #   QA: outputs = model(...)
        #       start_logits, end_logits = outputs.start_logits, outputs.end_logits
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits  # Adjust based on the actual model output structure

        # Calculate loss; the call depends on the task:
        #   classification: loss = loss_fn(logits.view(-1, num_labels),
        #                                  labels.view(-1))
        #   QA: loss = (loss_fn(start_logits, start_positions)
        #               + loss_fn(end_logits, end_positions)) / 2
        loss = loss_fn(logits, labels)  # Adjust to the specific task's loss needs

        # Backward pass and parameter update
        loss.backward()
        optimizer.step()
        scheduler.step()

    print(f"Epoch {epoch + 1} completed. Last batch loss: {loss.item()}")
# After training, evaluate on the test set using task-specific metrics
```
Crucially, after fine-tuning, the model is evaluated using metrics specific to the downstream task. For classification, this might be accuracy or F1-score. For QA, Exact Match (EM) and F1-score over the predicted answer tokens are common. For summarization, ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L) are standard, measuring overlap with reference summaries. For translation, BLEU score is often used. These extrinsic metrics provide a direct measure of the model's performance on the task it was adapted for, complementing the insights from intrinsic metrics like perplexity.
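As a minimal sketch of this evaluation step for a classification task, assuming an `eval_dataloader` built like the training loader and a fine-tuned `model` that returns logits:

```python
model.eval()  # disable dropout for deterministic predictions
correct, total = 0, 0
with torch.no_grad():  # no gradients needed at evaluation time
    for batch in eval_dataloader:  # assumed: a DataLoader over the test split
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        predictions = torch.argmax(logits, dim=-1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
print(f"Accuracy: {correct / total:.4f}")
```

For F1, EM, ROUGE, or BLEU, libraries such as scikit-learn or Hugging Face's evaluate package provide standard implementations rather than hand-rolled ones.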
Fine-tuning is a powerful technique for evaluating and adapting LLMs. While it requires labeled data and computational resources (though far less than pre-training), it allows us to assess how effectively a pre-trained model's learned knowledge can be transferred to solve specific, practical problems. The results obtained through fine-tuning often represent a strong performance baseline for the underlying pre-trained model on that particular task.