While quantitative metrics provide a high-level view of model performance, they often mask specific failure modes and nuances. Qualitative analysis, the systematic review of individual model outputs, is essential for understanding how and why a fine-tuned model succeeds or fails on specific inputs. This practice section guides you through setting up and conducting such an analysis to identify recurring error patterns. This process directly informs iterative improvements to your data, fine-tuning strategy, or post-processing steps.
Before examining outputs, define the scope and objective of your analysis. Are you primarily interested in how well the model follows instructions, its factual accuracy, the relevance and fluency of its responses, its safety, or its adherence to a required output format? Your objective will guide the selection of data and the types of errors you prioritize.
Next, select a representative sample of evaluation data, including the input prompts and the corresponding model generations. Random sampling alone might not be sufficient; consider stratifying the sample across prompt types or oversampling examples that score poorly on automatic metrics, as sketched below.
Aim for a sample size large enough to reveal patterns, but manageable for manual review. Often, reviewing 50-200 examples provides significant insights.
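One way to assemble such a sample is shown in this minimal sketch. It assumes your evaluation results sit in a CSV with hypothetical columns `prompt_type` and `rouge_l`; adapt the names and thresholds to your own pipeline.

```python
import pandas as pd

# Hypothetical evaluation results file: one row per prompt/output pair.
# Column names ("prompt_type", "rouge_l") are illustrative assumptions.
results = pd.read_csv("eval_results.csv")

# Stratified sample: up to 20 examples from each prompt type.
stratified = results.groupby("prompt_type", group_keys=False).apply(
    lambda g: g.sample(min(len(g), 20), random_state=42)
)

# Targeted sample: the 50 lowest-scoring examples by an automatic metric,
# which tend to surface the failures most worth reviewing by hand.
worst_scoring = results.nsmallest(50, "rouge_l")

review_set = pd.concat([stratified, worst_scoring]).drop_duplicates()
review_set.to_csv("manual_review_sample.csv", index=False)
print(f"Selected {len(review_set)} examples for manual review")
```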
A structured classification system, or taxonomy, is necessary for consistent error identification. Without it, analysis becomes subjective and difficult to aggregate. Your taxonomy should be tailored to your model's task and the objectives defined earlier. Start with broad categories and refine them as you encounter specific error types.
Here's a sample starting taxonomy, drawing from the concepts discussed earlier in this chapter:
Common LLM Error Categories:

- Instruction Following Errors
  - Ignored Constraint: Failed to adhere to a specific instruction (e.g., length limit, format).
  - Misinterpreted Instruction: Understood the instruction incorrectly.
  - Partial Adherence: Followed some instructions but missed others.
- Factual Errors
  - Factual Inaccuracy: Stated incorrect information verifiable against external knowledge.
  - Confabulation: Generated plausible but entirely fabricated information.
  - Outdated Information: Provided information that was once correct but is no longer current.
- Relevance and Coherence Errors
  - Off-Topic: Response is unrelated to the prompt.
  - Generic/Uninformative: Output is too vague or fails to provide specifics.
  - Repetitive: Unnecessarily repeats phrases or sentences.
  - Incoherent: Text is nonsensical or logically flawed.
- Fluency Errors
  - Grammatical Errors: Issues with syntax, spelling, or punctuation.
  - Awkward Phrasing: Text is grammatically correct but unnatural.
- Safety and Bias Errors
  - Harmful Content: Promotes illegal acts, hate speech, etc.
  - Stereotyping: Perpetuates harmful stereotypes.
  - Unfair Bias: Shows prejudice against a group.
- Format Errors
  - Incorrect Format: Failed to produce output in the specified format (e.g., JSON, Markdown table).

Adapt this list based on your specific needs. For instance, a code generation model might need categories like Syntax Error, Logical Error in Code, or Inefficient Code.
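Encoding the taxonomy in code makes it easy to validate labels as reviewers enter them. The sketch below is one possible representation; the grouping mirrors the labels used in the annotation examples later in this section, and the helper function name is an illustrative choice.

```python
# The taxonomy as a simple dict: parent category -> subcategories.
# Adjust both levels to fit your task.
ERROR_TAXONOMY = {
    "Instruction Following Errors": [
        "Ignored Constraint", "Misinterpreted Instruction", "Partial Adherence",
    ],
    "Factual Errors": [
        "Factual Inaccuracy", "Confabulation", "Outdated Information",
    ],
    "Relevance and Coherence Errors": [
        "Off-Topic", "Generic/Uninformative", "Repetitive", "Incoherent",
    ],
    "Fluency Errors": [
        "Grammatical Errors", "Awkward Phrasing",
    ],
    "Safety and Bias Errors": [
        "Harmful Content", "Stereotyping", "Unfair Bias",
    ],
    "Format Errors": [
        "Incorrect Format",
    ],
}


def is_valid_label(parent: str, subcategory: str) -> bool:
    """Check that an annotation uses a (parent, subcategory) pair from the taxonomy."""
    return subcategory in ERROR_TAXONOMY.get(parent, [])
```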
With your sample data and error taxonomy ready, begin the review process. For each input-output pair, read the prompt and the model's response carefully, assign one or more error categories from your taxonomy (or mark the example error-free), note a severity, and record a brief explanation of the problem.
You can perform this annotation using simple tools like spreadsheets (CSV/Excel) or leverage dedicated annotation platforms if available. Consistency is important, especially if multiple reviewers are involved. Consider calculating inter-annotator agreement (IAA) scores (like Cohen's Kappa) on a subset of the data to ensure the taxonomy is applied uniformly.
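If two reviewers label the same subset independently, Cohen's Kappa can be computed in a few lines with scikit-learn. The labels below are made up for illustration and assume a single primary category per example.

```python
from sklearn.metrics import cohen_kappa_score

# Primary error category assigned independently by two reviewers to the
# same eight examples (single-label case for simplicity).
annotator_a = ["Ignored Constraint", "None", "Factual Inaccuracy", "None",
               "Generic/Uninformative", "Partial Adherence", "None", "Incorrect Format"]
annotator_b = ["Ignored Constraint", "None", "Confabulation", "None",
               "Generic/Uninformative", "Partial Adherence", "Off-Topic", "Incorrect Format"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Values above roughly 0.6 are conventionally read as substantial agreement; much lower values usually mean the category definitions need tightening before you scale up annotation.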
Example Annotation Snippet (in a Spreadsheet):
| Input Prompt | Model Output | Error Category(ies) | Notes | Severity |
|---|---|---|---|---|
| Summarize the following text in exactly 50 words | The provided document discusses advancements in battery technology... (75 words) | Instruction Following Errors (Ignored Constraint) | Exceeded word count limit. | Moderate |
| What is the capital of Australia? | The capital of Australia is Sydney. | Factual Errors (Factual Inaccuracy) | Incorrect capital city stated. Correct is Canberra. | Severe |
| Explain quantum entanglement simply. | Quantum entanglement is when two particles link up in a spooky way... | None | Simple, accurate explanation. | N/A |
| Generate Python code to sort a list `x`. | `sorted_list = x.sort()` | Fluency Errors (Grammatical Errors) | `sort()` modifies in-place and returns None; needs `sorted(x)`. | Moderate |
| Translate to French: "Hello world" | Salut monde | Fluency Errors (Awkward Phrasing) | Correct is "Bonjour le monde". "Salut" is too informal here. | Minor |
Once annotation is complete, aggregate the results. Calculate the frequency of each error category. This quantitative summary of your qualitative findings helps pinpoint the most common problems.
For example, you might find that 30% of errors are Factual Inaccuracies, 25% are Ignored Constraint, 15% are Generic/Uninformative, and the rest are distributed among other categories.
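With the annotations in a CSV (the layout from the earlier sketch is assumed), a few lines of pandas produce this breakdown:

```python
import pandas as pd

annotations = pd.read_csv("annotations.csv")

# Count each error category, excluding error-free examples.
labeled = annotations[annotations["error_categories"] != "None"]
error_counts = labeled["error_categories"].value_counts()
error_pct = (error_counts / error_counts.sum() * 100).round(1)

print(pd.DataFrame({"count": error_counts, "percent": error_pct}))
```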
Visualizations can make these patterns clearer. A simple bar chart is often effective for showing the distribution of error types.
Error frequency analysis highlights the most prevalent issues in the evaluated sample.
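One way to produce such a chart with matplotlib, again assuming the `annotations.csv` layout used above:

```python
import pandas as pd
import matplotlib.pyplot as plt

annotations = pd.read_csv("annotations.csv")
error_counts = (
    annotations.loc[annotations["error_categories"] != "None", "error_categories"]
    .value_counts()
)

fig, ax = plt.subplots(figsize=(8, 4))
error_counts.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Number of annotated errors")
ax.set_title("Error category distribution in the reviewed sample")
fig.tight_layout()
fig.savefig("error_distribution.png", dpi=150)
```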
Look for correlations. Do certain types of prompts consistently lead to specific errors? For example, do complex instructions with multiple constraints frequently result in Partial Adherence errors? Do open-ended questions often trigger Generic/Uninformative responses? These insights are invaluable.
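A cross-tabulation can surface these correlations. The sketch below assumes you also record a coarse `prompt_type` label for each example (an added column, not part of the original sheet).

```python
import pandas as pd

# Assumes each row also records a coarse "prompt_type" label
# (e.g., "multi_constraint", "open_ended", "extraction") added during annotation.
annotations = pd.read_csv("annotations.csv")

# Share of each error category within each prompt type (rows sum to 1).
crosstab = pd.crosstab(
    annotations["prompt_type"],
    annotations["error_categories"],
    normalize="index",
)
print((crosstab * 100).round(1))
```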
Imagine you fine-tuned an LLM to summarize patient discharge notes, requiring extraction of diagnosis, medications, and follow-up instructions into a structured format.
Input 1 (Prompt):
Summarize the key information from this discharge note into JSON:
Patient admitted with pneumonia. Treated with levofloxacin 750mg IV daily for 5 days. Responded well. Discharged on oral levofloxacin 500mg daily for 7 days. Follow up with PCP Dr. Smith in 1 week. Advised rest and hydration.
Model Output 1:
{
"diagnosis": "Pneumonia",
"medications": ["levofloxacin 750mg IV", "levofloxacin 500mg Oral"],
"follow_up": "PCP Dr. Smith in 1 week",
"advice": "Rest and hydration"
}
Analysis 1: Excellent. Followed instructions, extracted information correctly, used the specified format. Error: None.
Input 2 (Prompt):
Create a JSON summary:
Patient presented with chest pain, ruled out for MI. Diagnosed with costochondritis. Prescribed ibuprofen 600mg TID PRN. No follow-up needed unless symptoms worsen. Patient allergic to penicillin.
Model Output 2:
{
"diagnosis": "Costochondritis",
"treatment": "Ibuprofen 600mg TID",
"follow_up": "None specified"
}
Analysis 2: Good, but missed key information: the penicillin allergy was not captured. Error: Instruction Following Errors (Partial Adherence).

Input 3 (Prompt):
Summarize using JSON:
68yo male with history of hypertension, DM2. Admitted for COPD exacerbation. Treated with steroids, nebulizers, BiPAP. Discharged on prednisone taper, Symbicort inhaler. Follow up with pulmonology clinic in 2 weeks. Continue home meds (lisinopril, metformin).
Model Output 3:
{
"patient_profile": "68yo male, Hx HTN, DM2",
"admission_reason": "COPD Flare",
"discharge_meds": ["prednisone", "Symbicort", "Lisinopril", "Metformin"],
"plan": "Pulmonology F/U 2wk"
}
Analysis 3: Mostly correct and captures the key information, but it paraphrases "COPD exacerbation" as "COPD Flare" and drops the prednisone taper detail, losing some clinical nuance. Error: Factual Errors (Minor inaccuracy/imprecision).

This qualitative analysis, informed by a structured taxonomy and careful review, provides actionable insights that purely metric-based evaluation might miss. In our medical summary example, we identified issues with capturing negative constraints (allergies) and potential loss of clinical nuance.
These findings suggest specific next steps: augment the fine-tuning data with examples that include allergies and other negative constraints, tighten the output schema so fields such as allergies are always present, and add automated checks that flag incomplete summaries, as sketched below.
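A lightweight completeness check might look like the following. The required keys, including the `allergies` field, are assumptions made for illustration based on the missed-allergy finding above, not part of the original task specification.

```python
import json

# Keys the structured summary is expected to contain (assumed schema).
REQUIRED_KEYS = {"diagnosis", "medications", "follow_up", "allergies"}


def check_summary(raw_output: str) -> list[str]:
    """Flag format and completeness problems in a generated JSON summary."""
    try:
        summary = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["Incorrect Format: output is not valid JSON"]
    if not isinstance(summary, dict):
        return ["Incorrect Format: expected a JSON object"]
    missing = sorted(REQUIRED_KEYS - summary.keys())
    return [f"Partial Adherence: missing keys {missing}"] if missing else []


# Output 2 from above would be flagged: it lacks an allergies field
# and uses "treatment" instead of "medications".
output_2 = ('{"diagnosis": "Costochondritis", '
            '"treatment": "Ibuprofen 600mg TID", '
            '"follow_up": "None specified"}')
print(check_summary(output_2))
```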
Analyzing model outputs manually is an iterative process. As you refine your model and data, repeat the analysis to track improvements and identify new potential issues. It's a fundamental part of building reliable and effective fine-tuned models.