While intrinsic metrics like perplexity provide a valuable signal about a language model's core capabilities, they often fall short of predicting how well the model will perform on specific, practical tasks. To get a more comprehensive picture of an LLM's utility, we turn to extrinsic evaluation, measuring its performance on a suite of established downstream Natural Language Processing (NLP) tasks. This approach grounds the evaluation in real-world applications and helps us understand the model's strengths and weaknesses in different contexts.
Evaluating on downstream tasks typically involves adapting the pre-trained LLM, often through fine-tuning or prompting techniques, and then measuring its performance using task-specific metrics. Let's examine some of the most common tasks used for this purpose.
Text classification is a fundamental NLP task where the goal is to assign a predefined category or label to a given text input. Common examples include sentiment analysis (e.g., positive, negative, or neutral), spam detection, and topic labeling.
Relevance: This task tests the model's ability to understand the overall meaning, tone, and subject matter of a text passage.
Evaluation: Performance is typically measured using accuracy, along with precision, recall, and F1-score, which are especially informative when classes are imbalanced (as in spam detection).
LLM Application: For fine-tuning, a common approach is to add a linear classification layer on top of the final hidden state of a special token (like [CLS] in BERT-style models) or the pooled output of the sequence. The model is then fine-tuned on a labeled dataset specific to the classification task.
import torch.nn as nn
from transformers import AutoModel

# Example: Adding a classification head to a pre-trained model
class SimpleClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        self.num_labels = num_labels
        # Use the model's configuration to get the hidden size
        self.classifier_head = nn.Linear(
            self.transformer.config.hidden_size, num_labels
        )

    def forward(self, input_ids, attention_mask):
        # Get outputs from the base transformer model
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Typically use the hidden state of the first token ([CLS])
        # for classification
        pooled_output = outputs.last_hidden_state[:, 0]
        # Pass through the classification layer
        logits = self.classifier_head(pooled_output)
        return logits

# Usage (e.g., for positive/negative/neutral sentiment)
# model = SimpleClassifier("bert-base-uncased", num_labels=3)
# Assume input_ids, attention_mask, and labels are prepared tensors
# logits = model(input_ids, attention_mask)
# loss = nn.CrossEntropyLoss()(
#     logits.view(-1, model.num_labels),
#     labels.view(-1)
# )
Alternatively, in few-shot or zero-shot settings, LLMs can be prompted with instructions and examples to perform classification without explicit fine-tuning.
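As a rough sketch of this prompting approach, the snippet below only constructs an instruction-style prompt; the generate call mentioned in the usage comment is a placeholder for whatever text-generation interface your model exposes, not a specific library function.

# Illustrative zero-shot classification prompt (no fine-tuning).
# `generate` is a placeholder for whatever text-generation call your
# model or API provides; it is not a specific library function.
def build_sentiment_prompt(review: str) -> str:
    return (
        "Classify the sentiment of the following review as "
        "positive, negative, or neutral.\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

# Usage sketch:
# prompt = build_sentiment_prompt("The battery dies after an hour.")
# label = generate(prompt).strip().lower()  # expected: "negative"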
Question Answering (QA) systems aim to provide answers to questions posed in natural language. Common variants include extractive QA, where the answer is a span selected from a provided context; abstractive QA, where the answer is generated in the model's own words; and open-domain QA, where no context is supplied and relevant information must first be retrieved.
Relevance: QA tasks rigorously test reading comprehension, information retrieval, and sometimes, the ability to synthesize and generate coherent text.
Evaluation: Extractive QA is often evaluated using Exact Match (EM) and F1-score over the predicted answer span compared to the ground truth. Abstractive and open-domain QA often use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) or BLEU (Bilingual Evaluation Understudy), although human evaluation is also important.
LLM Application: For extractive QA, models are often fine-tuned to predict the start and end token indices of the answer span within the context. For abstractive QA, sequence-to-sequence models or decoder-only models are fine-tuned or prompted to generate the answer directly.
Input components (Context, Question) and the resulting Answer for an Extractive QA task.
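To make the extractive setup concrete, the sketch below runs a question-answering head over a context and question and decodes the highest-scoring span. The checkpoint name is just one publicly available SQuAD-fine-tuned model used for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Extractive QA: predict start/end token indices of the answer span.
# The checkpoint is illustrative; any extractive QA checkpoint works the same way.
model_name = "distilbert-base-uncased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

context = "The Eiffel Tower was completed in 1889 and is located in Paris."
question = "When was the Eiffel Tower completed?"

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Choose the most likely start and end token positions.
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)

answer_tokens = inputs["input_ids"][0][start_idx : end_idx + 1]
print(tokenizer.decode(answer_tokens))  # expected answer span: "1889"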
The goal of text summarization is to produce a shorter version of a source document while retaining its most important information.
Relevance: Summarization tests the model's ability to identify important information, understand context, and generate fluent, concise text.
Evaluation: ROUGE scores (specifically ROUGE-1, ROUGE-2, ROUGE-L) are standard metrics, comparing the generated summary to human-written reference summaries based on n-gram overlap. Human evaluation of coherence, fluency, and information coverage is also common.
LLM Application: Sequence-to-sequence architectures were traditionally used, but large decoder-only models are now very effective. They are typically fine-tuned on document-summary pairs or prompted with instructions like "Summarize the following text:".
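As a simplified illustration of how ROUGE-N works, the following function computes n-gram overlap F1 between a generated summary and a reference using plain whitespace tokenization; real evaluations typically rely on an established ROUGE implementation that adds stemming and other normalizations.

from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1 using whitespace tokenization only."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Usage
# print(rouge_n(generated_summary, reference_summary, n=1))  # ROUGE-1
# print(rouge_n(generated_summary, reference_summary, n=2))  # ROUGE-2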
Machine Translation involves translating text from a source language to a target language.
Relevance: Tests the model's understanding of grammar, syntax, semantics, and cultural context in multiple languages, as well as its generation capabilities.
Evaluation: BLEU score is a widely used metric, measuring the precision of n-grams in the generated translation compared to reference translations. Other metrics like METEOR and chrF are also used. Human evaluation remains important for assessing translation quality and fluency.
LLM Application: Sequence-to-sequence models were the standard. Large multilingual LLMs can often perform translation in zero-shot or few-shot settings by including examples or instructions in the prompt (e.g., "Translate the following English text to French: ..."). Fine-tuning on parallel corpora remains a common practice for achieving high performance on specific language pairs.
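For example, corpus-level BLEU can be computed with the sacrebleu package (assuming it is installed); the hypothesis and reference sentences below are purely illustrative.

import sacrebleu

# Model outputs and reference translations (illustrative examples).
hypotheses = [
    "The cat sits on the mat.",
    "He did not go to school today.",
]
references = [[
    "The cat is sitting on the mat.",
    "He didn't go to school today.",
]]

# corpus_bleu expects a list of hypotheses and a list of reference lists.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")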
Natural Language Inference (NLI) tasks require the model to determine the logical relationship between a pair of sentences: a "premise" and a "hypothesis". The relationship is typically one of entailment (the hypothesis follows from the premise), contradiction (the hypothesis conflicts with the premise), or neutral (neither holds).
Relevance: NLI is considered a good proxy for general language understanding and reasoning ability, as it requires grasping the meaning and implications of the sentences.
Evaluation: Accuracy is the primary metric, measuring the percentage of correctly classified relationships.
LLM Application: Similar to text classification, models are often fine-tuned by concatenating the premise and hypothesis (with a separator token), feeding this to the LLM, and adding a classification head over the pooled output to predict one of the three labels.
Input components (Premise, Hypothesis) and the resulting Label for an NLI task.
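A minimal sketch of that setup is shown below; the checkpoint name and label order are illustrative assumptions, and the classification head would still need fine-tuning on an NLI dataset before its predictions are meaningful.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint and label mapping (assumptions, not fixed conventions).
model_name = "bert-base-uncased"
labels = ["entailment", "neutral", "contradiction"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(labels)
)

premise = "A soccer game with multiple males playing."
hypothesis = "Some men are playing a sport."

# Passing a text pair inserts the separator token ([SEP]) automatically.
inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted = labels[torch.argmax(logits).item()]
print(predicted)  # meaningless until the head is fine-tuned on NLI data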
These are just some of the common downstream tasks used for extrinsic LLM evaluation. Others include named entity recognition (NER), coreference resolution, sentiment analysis across different domains, and many more specialized tasks found in benchmarks like GLUE and SuperGLUE, which we will discuss next. Evaluating across a diverse set of these tasks provides a robust assessment of an LLM's overall capabilities and its suitability for various applications.