While generating text is straightforward, measuring its quality can be a significant challenge. How do you know if your new prompt is better than the old one? How can you prove that switching to a more expensive model is worth the cost? Subjective human evaluation is slow and expensive. This is where automated evaluation and benchmarking come into play.
By establishing a standardized set of tests and using quantitative metrics, you can systematically measure the performance of your LLM applications. This allows you to compare different prompts, models, or configurations, track improvements over time, and catch regressions before they reach production.
The foundation of any good benchmark is a set of reliable metrics that compare a model's generated output to a "ground-truth" or reference answer. The evaluation module provides several standard, reference-based metrics, each suited for different tasks.
Lexical metrics operate by comparing the word overlap between the generated output and the reference text.
BLEU (Bilingual Evaluation Understudy) originated in machine translation and measures precision. It counts how many n-grams (sequences of words) in the generated text appear in the reference text. It also includes a brevity penalty to discourage outputs that are too short. Scores range from 0 to 1, where 1 is a perfect match.
It is most useful for tasks where word choice and order are important, such as translation or code generation.
from kerb.evaluation import calculate_bleu
reference = "The quick brown fox jumps over the lazy dog"
candidate = "The quick brown fox jumped over the lazy dog"
# The candidate has good n-gram overlap but is not a perfect match
bleu_score = calculate_bleu(candidate, reference)
print(f"BLEU Score: {bleu_score:.3f}")
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is commonly used for evaluating summaries. It measures recall by checking how many n-grams from the reference text appear in the generated output. A widely used variant, ROUGE-L, scores outputs by the longest common subsequence (LCS), which rewards words that appear in the same order without requiring them to be contiguous.
from kerb.evaluation import calculate_rouge
reference_summary = "AI and machine learning are transforming technology with deep learning."
generated_summary = "AI is revolutionizing technology through machine learning and deep learning advances."
# ROUGE-L is good for summaries as it captures sentence-level structure
rouge_l_scores = calculate_rouge(generated_summary, reference_summary, rouge_type="rouge-l")
print(f"ROUGE-L F1-Score: {rouge_l_scores['fmeasure']:.3f}")
F1-Score and Exact Match are common in question-answering tasks. An exact match requires the generated output to be identical to the reference. The F1-score provides a more forgiving alternative by calculating the harmonic mean of token-level precision and recall, effectively measuring the overlap of words without being sensitive to their order.
from kerb.evaluation import calculate_f1_score, calculate_exact_match
reference = "William Shakespeare"
candidate = "Shakespeare"
f1 = calculate_f1_score(candidate, reference)
exact_match = calculate_exact_match(candidate, reference)
print(f"F1-Score: {f1:.3f}")
print(f"Exact Match: {exact_match}")
While lexical metrics are fast and useful, they fail when a generated output is semantically correct but uses different words. Semantic similarity addresses this by converting both the candidate and reference texts into vector embeddings and measuring the cosine similarity between them. This captures the similarity in meaning, not just words.
A score close to 1.0 indicates that the two texts are very similar in meaning.
from kerb.evaluation import calculate_semantic_similarity
reference = "The new feature improves system performance."
candidate = "The update enhances the application's speed."
# The words are different, but the meaning is the same
similarity = calculate_semantic_similarity(candidate, reference)
print(f"Semantic Similarity: {similarity:.3f}")
A benchmark is a structured evaluation run on a standardized dataset. This process involves creating a set of test cases, defining how to run your system against them, and specifying how to score the results.
First, you define your evaluation dataset using the TestCase class. Each test case pairs an input with an expected output that serves as the ground truth.
from kerb.evaluation import TestCase
test_cases = [
    TestCase(
        id="qa_python_creator",
        input="Who created Python?",
        expected_output="Guido van Rossum"
    ),
    TestCase(
        id="qa_ml_definition",
        input="What is machine learning?",
        expected_output="Machine learning is a subset of AI that enables systems to learn from data."
    ),
]
Next, you need a function that takes an input and generates an output. This would typically be your LLM generation logic. For testing purposes, we can simulate it.
def simple_qa_generator(question: str) -> str:
    """A simple generator function for demonstration."""
    if "who created python" in question.lower():
        return "Python was created by Guido van Rossum."
    if "machine learning" in question.lower():
        return "Machine learning is a field of AI where systems learn from data."
    return "I don't know."
Finally, you combine these components using the run_benchmark function. You provide the test cases, the generator function, and an evaluation function that scores each output against its expected value.
from kerb.evaluation import run_benchmark, calculate_f1_score
# The evaluation function defines how to score each test case
def evaluate_answer(output: str, expected: str) -> float:
    return calculate_f1_score(output, expected)
# Run the benchmark
benchmark_result = run_benchmark(
    test_cases=test_cases,
    generator_fn=simple_qa_generator,
    evaluator_fn=evaluate_answer,
    threshold=0.5,  # A score of 0.5 or higher is a "pass"
    name="Q&A System Benchmark"
)
# Print the results
print(f"Pass Rate: {benchmark_result.pass_rate:.1f}%")
print(f"Average Score: {benchmark_result.average_score:.3f}")
print(f"Passed: {benchmark_result.passed_tests} / {benchmark_result.total_tests}")
The output gives you a high-level summary of your system's performance. The pass_rate tells you what percentage of test cases met your quality threshold, while the average_score gives an overall sense of performance across the entire dataset. Analyzing individual scores can help you identify specific areas where your system struggles.
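To drill into failures, you can iterate over the per-test results. The attribute names in the sketch below (results, test_case_id, score, passed) are assumptions made for illustration; check the fields actually exposed by the benchmark result object in your version of the library.
# Hypothetical per-test inspection -- the attribute names are assumptions,
# not confirmed parts of the kerb.evaluation API.
for test_result in getattr(benchmark_result, "results", []):
    if not test_result.passed:
        print(f"Failed {test_result.test_case_id}: score {test_result.score:.3f}")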
Benchmarking is an excellent tool for making data-driven decisions during development. For instance, you can use it to determine which of several prompt templates performs best.
The benchmark_prompts function streamlines this process. It runs a benchmark for each prompt template against a list of test inputs and aggregates the results.
Let's compare three different prompts for a summarization task.
from kerb.evaluation import benchmark_prompts, calculate_rouge
# Three different prompts to test
prompts = [
    ("simple", "Summarize: {input}"),
    ("instructive", "Create a concise one-sentence summary of the following text: {input}"),
    ("detailed", "Analyze the following text and generate a detailed summary covering the main points: {input}"),
]
# Test inputs and their expected outputs (ground truth)
test_data = [
    {
        "input": "Python is a high-level, interpreted programming language known for its clear syntax...",
        "expected": "Python is a versatile and readable programming language."
    },
    {
        "input": "Machine learning is a subset of AI that enables systems to learn from data...",
        "expected": "Machine learning is a field of AI where systems learn from data."
    }
]
test_inputs = [item["input"] for item in test_data]
expected_outputs = {item["input"]: item["expected"] for item in test_data}
# A mock generator function that uses the prompt template
def generator_with_template(template: str, input_text: str) -> str:
    # In a real application, this would call an LLM with the formatted prompt.
    # For this example, we return a fixed response based on keywords in the template.
    if "concise" in template:
        return "This is a concise summary."
    elif "detailed" in template:
        return "This is a very detailed and comprehensive summary of the text provided."
    else:
        return "This is a summary."
# An evaluation function using ROUGE-L
def evaluate_summary(output: str, input_text: str) -> float:
    expected = expected_outputs[input_text]
    rouge_scores = calculate_rouge(output, expected, rouge_type="rouge-l")
    return rouge_scores['fmeasure']
# Run the prompt comparison
results = benchmark_prompts(
    prompts,
    test_inputs,
    generator_with_template,
    evaluate_summary
)
# Analyze the results
for name, result in results.items():
    print(f"\nTemplate '{name}':")
    print(f"  Average Score: {result.average_score:.3f}")
best_prompt = max(results.items(), key=lambda item: item[1].average_score)
print(f"\nBest performing prompt: '{best_prompt[0]}'")
By running this comparison, you can quantitatively determine which prompt structure yields the highest-quality outputs according to your chosen metric. This systematic approach replaces guesswork with data, allowing you to iteratively refine and improve your LLM applications with confidence.
Because the mock generator above returns canned responses, the scores here are only illustrative. With a real LLM behind the generator, the template with the highest average score on your chosen metric becomes the starting point for further refinement.