While fine-tuning offers a way to deeply specialize a pre-trained model for a particular task, it requires labeled data and computational resources for training. Furthermore, it primarily evaluates the model's adaptability. Often, we are interested in assessing the inherent capabilities learned during the extensive pre-training phase, specifically the model's ability to understand and follow instructions or perform tasks without task-specific gradient updates. This is where zero-shot and few-shot evaluation methods become particularly insightful. These techniques measure how well a model generalizes its pre-trained knowledge to new tasks, guided only by natural language prompts and, potentially, a handful of examples provided directly within the input context.
Zero-shot evaluation assesses an LLM's ability to perform a task for which it has never been explicitly trained on task-specific examples. Instead of fine-tuning, we rely entirely on the model's pre-trained knowledge and its capacity to understand task descriptions or instructions provided within the prompt. The model sees zero examples (k=0) of the specific downstream task format during the evaluation setup.
Consider evaluating a pre-trained LLM on sentiment analysis. In a zero-shot setting, you would not fine-tune the model on a sentiment dataset. Instead, you might provide a prompt like this:
Text: "This movie was absolutely fantastic, a masterpiece!"
Sentiment (positive/negative):
Or perhaps more explicitly:
Classify the sentiment of the following text as positive or negative.
Text: "The flight was delayed, and the service was terrible."
Sentiment:
The model is expected to leverage its understanding of language, including the semantic meaning of words like "fantastic" or "terrible," to generate the correct classification ("positive" or "negative").
Implementation Sketch:
Using a pre-trained model interface from a library like transformers, the process might look like this in PyTorch:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a pre-trained model and tokenizer
# Note: Replace "gpt2" with a suitable instruction-tuned or large base model
model_name = "gpt2"  # Example model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ensure a padding token is set if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def classify_sentiment_zero_shot(text):
    """Classifies sentiment using a zero-shot prompt."""
    prompt = f"""Classify the sentiment of the following text as 'positive' or 'negative'.
Text: "{text}"
Sentiment:"""
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512
    )

    # Generate a completion - for classification we only care about the
    # very next token(s). For generative classification, careful output
    # parsing is needed; we might constrain generation or inspect the
    # logits for 'positive'/'negative'. This is a simplified example;
    # robust generation requires more care.
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=3,  # Limit generation length
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the tokens generated after the prompt
    generated_ids = outputs[0, inputs.input_ids.shape[1]:]
    result = tokenizer.decode(
        generated_ids, skip_special_tokens=True
    ).strip().lower()

    # Simple parsing (can be more sophisticated)
    if "positive" in result:
        return "positive"
    elif "negative" in result:
        return "negative"
    else:
        return "unknown"  # Model might not follow instructions

# Example usage
text_to_classify = "The product broke after just one week."
sentiment = classify_sentiment_zero_shot(text_to_classify)
print(f"Text: '{text_to_classify}'")
print(f"Predicted Sentiment (Zero-Shot): {sentiment}")
# Expected output (might vary based on model):
# Text: 'The product broke after just one week.'
# Predicted Sentiment (Zero-Shot): negative
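As the comments above suggest, parsing free-form generations is brittle. A more direct alternative is to compare the model's log-probabilities for the candidate labels at the position immediately after the prompt. The sketch below assumes the model and tokenizer loaded earlier; score_labels_zero_shot is a hypothetical helper that scores only the first token of each label, which is usually sufficient for single-word labels:

import torch
import torch.nn.functional as F

def score_labels_zero_shot(text, labels=("positive", "negative")):
    """Hypothetical helper: picks the label whose first token receives
    the highest log-probability immediately after the prompt."""
    prompt = (
        "Classify the sentiment of the following text as "
        "'positive' or 'negative'.\n"
        f'Text: "{text}"\n'
        "Sentiment:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Distribution over the vocabulary for the token following the prompt
    next_token_logprobs = F.log_softmax(logits[0, -1], dim=-1)
    scores = {}
    for label in labels:
        # A leading space often maps to a different token id than the
        # bare word; check your tokenizer's behavior.
        label_ids = tokenizer(" " + label, add_special_tokens=False).input_ids
        scores[label] = next_token_logprobs[label_ids[0]].item()
    return max(scores, key=scores.get), scores

# Example usage (model and tokenizer loaded as above)
label, label_scores = score_labels_zero_shot("The battery died within a day.")
print(label, label_scores)

Because this approach never relies on free-form generation, it avoids the "unknown" fallback entirely; a fuller version would sum log-probabilities over all tokens of multi-token labels.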
Advantages:
- Requires no labeled examples and no task-specific training, so evaluation is cheap to set up.
- Directly probes how well the pre-trained model generalizes from instructions alone.
Disadvantages:
- Performance is typically lower than few-shot or fine-tuned approaches, especially for smaller base models.
- Results are highly sensitive to prompt wording, and free-form outputs can be difficult to parse reliably.
Few-shot evaluation, often referred to as in-context learning for LLMs, provides the model with a small number (k, usually between 1 and 32) of examples of the task directly within the prompt. Importantly, the model's weights are not updated based on these examples; they simply serve as context or demonstrations to guide the model's prediction for the actual query instance that follows.
Continuing the sentiment analysis example, a 1-shot (k=1) prompt might look like this:
Classify the sentiment of the text.
Text: "I loved the concert, the band was amazing!"
Sentiment: positive
Text: "This book was quite boring and predictable."
Sentiment:
A 2-shot (k=2) prompt:
Classify the sentiment of the text.
Text: "I loved the concert, the band was amazing!"
Sentiment: positive
Text: "The customer support was unhelpful and slow."
Sentiment: negative
Text: "This new coffee shop has a great atmosphere and delicious drinks."
Sentiment:
The model observes the pattern (input text followed by the desired output label) from the provided examples and applies it to the final, unlabeled instance.
Implementation Sketch:
The code structure is similar to the zero-shot case; only the prompt construction changes to include the demonstration examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assume model and tokenizer are loaded as before
# model_name = "gpt2"  # Example model name
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)
# if tokenizer.pad_token is None:
#     tokenizer.pad_token = tokenizer.eos_token

def classify_sentiment_few_shot(text, examples, k):
    """Classifies sentiment using a k-shot prompt."""
    prompt = (
        "Classify the sentiment of the text as 'positive' or 'negative'.\n\n"
    )

    # Add k examples to the prompt
    for i in range(min(k, len(examples))):
        example_text, example_sentiment = examples[i]
        prompt += f"Text: \"{example_text}\"\n"
        prompt += f"Sentiment: {example_sentiment}\n\n"

    # Add the query text
    prompt += f"Text: \"{text}\"\nSentiment:"

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024  # Longer prompts need a larger max_length
    )

    # Generate a short completion
    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # Decode only the tokens generated after the prompt
    generated_ids = outputs[0, inputs.input_ids.shape[1]:]
    decoded_text = tokenizer.decode(
        generated_ids, skip_special_tokens=True
    )
    result = decoded_text.strip().lower()

    # Simple parsing
    if "positive" in result:
        return "positive"
    elif "negative" in result:
        return "negative"
    else:
        return "unknown"

# Example usage
few_shot_examples = [
    ("The weather today is beautiful and sunny.", "positive"),
    ("My order arrived damaged and incomplete.", "negative")
]
text_to_classify = "The movie had stunning visuals but a weak plot."
k_shots = 2
sentiment = classify_sentiment_few_shot(
    text_to_classify, few_shot_examples, k_shots
)
print(f"Text: '{text_to_classify}'")
print(f"Predicted Sentiment ({k_shots}-Shot): {sentiment}")
# Expected output (might vary):
# Text: 'The movie had stunning visuals but a weak plot.'
# Predicted Sentiment (2-Shot): negative
# Or potentially positive, depending on model bias
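To turn either classifier into a quantitative evaluation, you would run it over a labeled test set and compute a metric such as accuracy. The sketch below reuses the two functions defined above; evaluate_accuracy and the tiny test_set are illustrative assumptions, not part of any library:

def evaluate_accuracy(classify_fn, test_set):
    """Computes accuracy of a prompt-based classifier over
    (text, gold_label) pairs; 'unknown' predictions count as errors."""
    correct = 0
    for text, gold in test_set:
        if classify_fn(text) == gold:
            correct += 1
    return correct / len(test_set)

# A tiny, hypothetical labeled test set for illustration only
test_set = [
    ("The staff were friendly and the room was spotless.", "positive"),
    ("I waited an hour and the food arrived cold.", "negative"),
    ("Absolutely worth every penny.", "positive"),
    ("The app crashes every time I open it.", "negative"),
]

zero_shot_acc = evaluate_accuracy(classify_sentiment_zero_shot, test_set)
few_shot_acc = evaluate_accuracy(
    lambda t: classify_sentiment_few_shot(t, few_shot_examples, k=2),
    test_set
)
print(f"Zero-shot accuracy: {zero_shot_acc:.2f}")
print(f"2-shot accuracy:    {few_shot_acc:.2f}")

In practice, benchmark suites run this loop over much larger test sets and report the metric per task; the structure, however, is the same.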
Advantages:
- Usually improves accuracy over zero-shot by demonstrating the expected input-output format.
- Still requires no gradient updates or training infrastructure; the examples live entirely in the prompt.
Disadvantages:
- Results can vary with the choice, order, and formatting of the in-context examples.
- The examples consume context window, which limits k and increases inference cost.
Both zero-shot and few-shot evaluations underscore the significance of prompt engineering. The way a task is described or demonstrated can drastically alter the model's output, as the sketch below illustrates. Important aspects include:
- Instruction clarity: how precisely the task and the allowed labels are described.
- Example selection and ordering: which demonstrations are included and in what sequence.
- Formatting: the template, delimiters, and label wording used for inputs and outputs.
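Because of this sensitivity, reported zero-shot or few-shot scores are more trustworthy when they are checked (or averaged) across several prompt templates. The sketch below varies only the instruction wording and reuses the hypothetical evaluate_accuracy helper and test_set from the previous sketch, along with the model and tokenizer loaded earlier:

templates = [
    "Classify the sentiment of the text as 'positive' or 'negative'.",
    "Is the sentiment of this text positive or negative?",
    "Decide whether the following review is positive or negative.",
]

def make_classifier(instruction):
    """Returns a zero-shot classifier that uses the given instruction."""
    def classify(text):
        prompt = f"{instruction}\nText: \"{text}\"\nSentiment:"
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(
                inputs.input_ids,
                attention_mask=inputs.attention_mask,
                max_new_tokens=3,
                pad_token_id=tokenizer.pad_token_id,
            )
        result = tokenizer.decode(
            outputs[0, inputs.input_ids.shape[1]:],
            skip_special_tokens=True
        ).strip().lower()
        if "positive" in result:
            return "positive"
        if "negative" in result:
            return "negative"
        return "unknown"
    return classify

# Accuracy can shift noticeably between templates on the same test set
for instruction in templates:
    acc = evaluate_accuracy(make_classifier(instruction), test_set)
    print(f"{acc:.2f}  {instruction}")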
Zero-shot and few-shot evaluations are essential tools for understanding LLMs. They complement fine-tuning by providing insights into the model's general knowledge and its ability to apply that knowledge to new tasks based solely on context and instructions, reflecting a different dimension of model capability than task-specific adaptation through gradient descent. When analyzing benchmarks like GLUE or SuperGLUE, results are often reported for all three paradigms (zero-shot, few-shot, fine-tuned) to give a comprehensive picture of model performance.