While standard machine learning metrics like accuracy, precision, and recall provide a baseline, they often fall short when assessing the multifaceted performance of applications built with Large Language Models (LLMs). LLM outputs possess qualities like coherence, relevance, helpfulness, safety, and adherence to specific formats or tones, which are difficult to capture with simple quantitative measures. Evaluating a production LangChain application effectively requires defining custom metrics tailored to its specific function and desired behavior.
Moving beyond simple correctness checks allows you to measure what truly matters for your application's success. For instance, is your RAG system retrieving relevant context and grounding its answers faithfully? Is your customer support agent maintaining a helpful and safe tone? Is your summarization tool capturing the essence without distortion? Answering these requires bespoke evaluation strategies.
Traditional metrics typically rely on exact matches or predefined classifications. LLM outputs, however, are often generative and nuanced: many distinct phrasings can be equally valid, and qualities such as coherence, faithfulness, and tone resist binary right-or-wrong scoring.
Therefore, developing custom evaluation metrics becomes a significant step in building reliable LLM applications.
Custom metrics can be broadly categorized based on how they assess the output:
Programmatic & Rule-Based Metrics: These involve writing code to check specific, objective criteria.
For example, you can verify that an output parses as valid JSON or XML (using Python's standard json or xml.etree.ElementTree modules) within your evaluation function; a minimal sketch of such a check appears after this list of categories.
Semantic Similarity Metrics: These use embeddings to measure the semantic closeness between the generated output and a reference answer or the input query/context.
Model-Based Evaluation (LLM-as-Judge): This powerful technique uses another LLM (often a capable one like GPT-4) to evaluate the output based on specific criteria defined in a prompt.
# Simplified example of an LLM-as-judge prompt structure
EVALUATION_PROMPT = """
You are an impartial evaluator. Assess the quality of the submitted 'RESPONSE'
based on the provided 'CONTEXT' and 'QUERY'.
QUERY: {query}
CONTEXT: {context}
RESPONSE: {response}
Evaluate the RESPONSE based on the following criteria on a scale of 1 to 5 (1=Poor, 5=Excellent):
1. Faithfulness: Does the RESPONSE accurately reflect the information in the CONTEXT without adding unsupported claims?
2. Relevance: Is the RESPONSE directly relevant to the QUERY?
Provide your scores in JSON format: {{"faithfulness": score, "relevance": score}}
"""
Human-in-the-Loop (HITL): Collecting feedback directly from humans remains the gold standard for subjective qualities.
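To make the programmatic, rule-based category above concrete, here is a minimal check that scores whether a piece of text parses as valid JSON. It is written as a plain function; the next section shows how checks like this are wrapped into LangSmith evaluators. The function name is an illustrative choice.
# Sketch: rule-based check for JSON validity
import json

def is_valid_json(text: str) -> int:
    """Return 1 if `text` parses as JSON, 0 otherwise."""
    try:
        json.loads(text)
        return 1
    except (TypeError, json.JSONDecodeError):
        return 0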
LangChain provides abstractions, often integrated with LangSmith, to streamline the implementation of custom evaluators. Typically, you define an evaluation function or class that takes the run information (inputs, outputs, etc.) and returns an EvaluationResult.
from langsmith.evaluation import EvaluationResult, run_evaluator

# Example: Programmatic metric to check if output contains a specific warning phrase
@run_evaluator
def must_contain_warning(run, example) -> EvaluationResult:
    """Checks if the output contains 'Warning:'."""
    output = (run.outputs or {}).get("output") or ""
    if isinstance(output, str) and "Warning:" in output:
        score = 1  # Contains the warning
    else:
        score = 0  # Does not contain the warning
    return EvaluationResult(key="contains_warning", score=score)
# Example: Simplified semantic similarity check (conceptual)
# Assume `get_embedding` and `cosine_similarity` are defined elsewhere
# (one possible implementation is sketched below)
@run_evaluator
def check_semantic_similarity(run, example) -> EvaluationResult:
    """Compares output embedding to reference answer embedding."""
    output = (run.outputs or {}).get("output") or ""
    reference = (example.outputs or {}).get("reference_answer") or ""
    if not output or not reference:
        return EvaluationResult(
            key="semantic_similarity", score=0, comment="Missing output or reference"
        )
    output_embedding = get_embedding(output)
    reference_embedding = get_embedding(reference)
    similarity = cosine_similarity(output_embedding, reference_embedding)  # Value between -1 and 1
    # Normalize score to 0-1 for consistency if desired
    normalized_score = (similarity + 1) / 2
    return EvaluationResult(key="semantic_similarity", score=normalized_score)
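One possible way to supply the helpers assumed above is sketched here, using OpenAIEmbeddings from langchain_openai and NumPy. The embedding model name is an illustrative choice; any embedding provider with a similar interface would do.
# Sketch of the assumed helpers (embedding model choice is illustrative)
import numpy as np
from langchain_openai import OpenAIEmbeddings

_embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def get_embedding(text: str) -> list[float]:
    """Embed a single piece of text."""
    return _embedder.embed_query(text)

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine similarity between two vectors, in the range [-1, 1]."""
    a, b = np.asarray(vec_a), np.asarray(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))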
These evaluator functions can then be applied to datasets within LangSmith or used in custom evaluation scripts.
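For example, the LangSmith SDK's evaluate helper can run the evaluators defined above against a named dataset. A minimal sketch, assuming a dataset called "my-eval-dataset" already exists in LangSmith and that my_chain stands in for the application under test (both names are placeholders):
# Sketch: applying custom evaluators to a LangSmith dataset
from langsmith.evaluation import evaluate

def run_target(inputs: dict) -> dict:
    """Adapter that invokes the application under test for one dataset example."""
    return {"output": my_chain.invoke(inputs)}  # `my_chain` is a placeholder for your chain

results = evaluate(
    run_target,
    data="my-eval-dataset",  # placeholder name of an existing LangSmith dataset
    evaluators=[must_contain_warning, check_semantic_similarity],
    experiment_prefix="custom-metrics",
)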
(Figure: different evaluation methods assess the generated output, producing scores and qualitative feedback.)
By thoughtfully defining custom metrics, you gain deeper insights into your LangChain application's performance, enabling targeted improvements and ensuring it meets the specific requirements of its production environment. These metrics form the bedrock for the automated evaluation pipelines and monitoring strategies discussed later in this chapter.