Quantitative metrics provide a high-level view of model performance, but they often mask specific failure modes and details. Qualitative analysis, the systematic review of individual model outputs, is essential for understanding how and why a fine-tuned model succeeds or fails on specific inputs. This section provides guidance on setting up and conducting such an analysis to identify recurring error patterns. The process directly informs iterative improvements to your data, fine-tuning strategy, or post-processing steps.

## Establishing the Analysis Framework

Before examining outputs, define the scope and objective of your analysis. Are you primarily interested in:

- **Instruction Adherence:** How well does the model follow specific constraints or formats requested in the prompt?
- **Factual Correctness:** Is the model generating accurate information, especially in domain-specific applications?
- **Safety and Bias:** Does the model produce harmful, biased, or inappropriate content?
- **Task-Specific Quality:** For tasks like summarization or translation, does the output meet the desired quality criteria (e.g., coverage, fidelity, fluency)?

Your objective will guide the selection of data and the types of errors you prioritize.

Next, select a representative sample of evaluation data, including the input prompts and the corresponding model generations. Random sampling alone might not be sufficient. Consider strategies like the following (a brief code sketch after this list illustrates combining two of them):

- **Random Sampling:** A baseline approach to get a general overview.
- **Stratified Sampling:** Ensure representation across different types of inputs (e.g., varying instruction complexity, different topics).
- **Failure Case Sampling:** If preliminary metrics indicate poor performance on certain subsets, intentionally oversample those cases.
- **Adversarial Sampling:** Include inputs designed to specifically challenge the model (e.g., prompts known to elicit hallucinations or refusals).

Aim for a sample size large enough to reveal patterns, but manageable for manual review. Reviewing 50-200 examples often provides significant insights.
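The sketch below shows one way to combine failure-case and stratified sampling with pandas. The file name, the `topic` and `metric_score` columns, and the thresholds are illustrative assumptions about how your evaluation outputs might be stored, not a fixed recipe:

```python
import pandas as pd

# Hypothetical evaluation file: one row per (prompt, model output) pair,
# with a "topic" column for stratification and a preliminary automatic
# "metric_score" used to surface likely failure cases.
df = pd.read_csv("eval_outputs.csv")

SAMPLE_SIZE = 150        # target for manual review (roughly 50-200)
FAILURE_FRACTION = 0.4   # share of the sample reserved for suspected failures

# Failure-case sampling: take the lowest-scoring examples first.
n_failures = int(SAMPLE_SIZE * FAILURE_FRACTION)
failures = df.nsmallest(n_failures, "metric_score")

# Stratified sampling for the remainder: an equal share per topic.
remainder = df.drop(failures.index)
n_per_topic = (SAMPLE_SIZE - n_failures) // remainder["topic"].nunique()
stratified = (
    remainder.groupby("topic", group_keys=False)
    .apply(lambda g: g.sample(min(len(g), n_per_topic), random_state=0))
)

# Shuffle so reviewers do not see all suspected failures in one block.
review_sample = pd.concat([failures, stratified]).sample(frac=1, random_state=0)
review_sample.to_csv("review_sample.csv", index=False)
```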
## Developing an Error Taxonomy

A structured classification system, or taxonomy, is necessary for consistent error identification. Without one, analysis becomes subjective and difficult to aggregate. Your taxonomy should be tailored to your model's task and the objectives defined earlier. Start with broad categories and refine them as you encounter specific error types.

Here is a sample starting taxonomy, drawing on the concepts discussed earlier in this chapter:

**Common LLM Error Categories**

- **Instruction Following Errors**
  - *Ignored Constraint:* Failed to adhere to a specific instruction (e.g., length limit, format).
  - *Misinterpreted Instruction:* Understood the instruction incorrectly.
  - *Partial Adherence:* Followed some instructions but missed others.
- **Factual Errors / Hallucinations**
  - *Factual Inaccuracy:* Stated incorrect information verifiable against external knowledge.
  - *Confabulation:* Generated plausible but entirely fabricated information.
  - *Outdated Information:* Provided information that was once correct but is no longer current.
- **Relevance & Coherence Errors**
  - *Off-Topic:* Response is unrelated to the prompt.
  - *Generic/Uninformative:* Output is too vague or fails to provide specifics.
  - *Repetitive:* Unnecessarily repeats phrases or sentences.
  - *Incoherent:* Text is nonsensical or logically flawed.
- **Fluency Errors**
  - *Grammatical Errors:* Issues with syntax, spelling, or punctuation.
  - *Awkward Phrasing:* Text is grammatically correct but unnatural.
- **Bias & Safety Errors**
  - *Harmful Content:* Promotes illegal acts, hate speech, etc.
  - *Stereotyping:* Perpetuates harmful stereotypes.
  - *Unfair Bias:* Shows prejudice against a group.
- **Formatting Errors**
  - *Incorrect Format:* Failed to produce output in the specified format (e.g., JSON, Markdown table).

Adapt this list to your specific needs. For instance, a code generation model might need categories like Syntax Error, Logical Error in Code, or Inefficient Code.

## The Annotation Process

With your sample data and error taxonomy ready, begin the review. For each input-output pair:

1. **Read the input prompt carefully.** Understand the intended task and any constraints.
2. **Evaluate the model output.** Compare the output against the input and your knowledge of the expected or ideal response.
3. **Identify errors.** Classify any observed errors using your taxonomy. An output can have multiple errors.
4. **Add notes (optional but recommended).** Briefly describe the specific error or provide context; this helps later analysis.
5. **Assign severity (optional).** Severity levels (e.g., Minor, Moderate, Severe) help prioritize fixes.

You can perform this annotation with simple tools like spreadsheets (CSV/Excel) or with dedicated annotation platforms if available. Consistency is important, especially when multiple reviewers are involved. Consider calculating inter-annotator agreement (IAA) scores, such as Cohen's Kappa, on a subset of the data to ensure the taxonomy is applied uniformly.

**Example Annotation Snippet (in a Spreadsheet):**

| Input Prompt | Model Output | Error Category(ies) | Notes | Severity |
| --- | --- | --- | --- | --- |
| Summarize the following text in exactly 50 words | The provided document discusses advancements in battery technology... (75 words) | Instruction Following Errors (Ignored Constraint) | Exceeded word count limit. | Moderate |
| What is the capital of Australia? | The capital of Australia is Sydney. | Factual Errors (Factual Inaccuracy) | Incorrect capital city stated; the correct answer is Canberra. | Severe |
| Explain quantum entanglement simply. | Quantum entanglement is when two particles link up in a spooky way... | None | Simple, accurate explanation. | N/A |
| Generate Python code to sort a list x. | `sorted_list = x.sort()` | Task-Specific Errors (Logical Error in Code) | `sort()` modifies the list in place and returns `None`; `sorted(x)` is needed. | Moderate |
| Translate to French: "Hello world" | Salut monde | Fluency Errors (Awkward Phrasing) | Correct is "Bonjour le monde"; "Salut" is too informal here. | Minor |

## Analyzing and Visualizing Errors

Once annotation is complete, aggregate the results. Calculate the frequency of each error category; this quantitative summary of your qualitative findings helps pinpoint the most common problems. For example, you might find that 30% of errors are Factual Inaccuracies, 25% are Ignored Constraints, 15% are Generic/Uninformative outputs, and the rest are distributed among other categories.

Visualizations can make these patterns clearer. A simple bar chart is often effective for showing the distribution of error types.

*Figure: bar chart of the distribution of error types in sampled outputs (Factual Inaccuracy 30%, Ignored Constraint 25%, Generic Output 15%, Repetitive 10%, Awkward Phrasing 8%, Other 12%). Error frequency analysis highlights the most prevalent issues in the evaluated sample.*

Also look for correlations. Do certain types of prompts consistently lead to specific errors? For example, do complex instructions with multiple constraints frequently result in Partial Adherence errors? Do open-ended questions often trigger Generic/Uninformative responses? These insights are invaluable.

## Example Walkthrough: Medical Report Summarizer

Imagine you fine-tuned an LLM to summarize patient discharge notes, extracting the diagnosis, medications, and follow-up instructions into a structured format.

**Input 1 (Prompt):** Summarize the important information from this discharge note into JSON: Patient admitted with pneumonia. Treated with levofloxacin 750mg IV daily for 5 days. Responded well. Discharged on oral levofloxacin 500mg daily for 7 days. Follow up with PCP Dr. Smith in 1 week. Advised rest and hydration.

**Model Output 1:** `{ "diagnosis": "Pneumonia", "medications": ["levofloxacin 750mg IV", "levofloxacin 500mg Oral"], "follow_up": "PCP Dr. Smith in 1 week", "advice": "Rest and hydration" }`

**Analysis 1:** Excellent. Followed the instructions, extracted the information correctly, and used the specified format. Error: None.

**Input 2 (Prompt):** Create a JSON summary: Patient presented with chest pain, ruled out for MI. Diagnosed with costochondritis. Prescribed ibuprofen 600mg TID PRN. No follow-up needed unless symptoms worsen. Patient allergic to penicillin.

**Model Output 2:** `{ "diagnosis": "Costochondritis", "treatment": "Ibuprofen 600mg TID", "follow_up": "None specified" }`

**Analysis 2:** Good, but missed important information.
- Error Category: Instruction Following Errors (Partial Adherence).
- Note: Failed to extract the penicillin allergy, which could be critical. The model needs refinement to capture allergies or contraindications when they are mentioned.
- Severity: Severe.

**Input 3 (Prompt):** Summarize using JSON: 68yo male with history of hypertension, DM2. Admitted for COPD exacerbation. Treated with steroids, nebulizers, BiPAP. Discharged on prednisone taper, Symbicort inhaler. Follow up with pulmonology clinic in 2 weeks. Continue home meds (lisinopril, metformin).

**Model Output 3:** `{ "patient_profile": "68yo male, Hx HTN, DM2", "admission_reason": "COPD Flare", "discharge_meds": ["prednisone", "Symbicort", "Lisinopril", "Metformin"], "plan": "Pulmonology F/U 2wk" }`

**Analysis 3:** Mostly correct; captures the important information.
- Error Category: Factual Errors (minor inaccuracy/imprecision).
- Note: Simplified "prednisone taper" to "prednisone". While not strictly wrong, this loses detail and could be improved for clinical precision.
- Severity: Minor.
## Synthesizing Findings and Driving Improvements

This qualitative analysis, informed by a structured taxonomy and careful review, provides actionable insights that purely metric-based evaluation might miss. In the medical summary example, we identified problems with capturing negative constraints (allergies) and a potential loss of clinical detail.

These findings suggest specific next steps:

- **Data Augmentation:** Add examples to the fine-tuning dataset that explicitly include allergies or require capturing treatment details such as tapers.
- **Prompt Engineering:** Modify the instruction template to specifically ask for allergies or contraindications when they are mentioned.
- **Model Re-tuning:** Fine-tune the model further with the augmented dataset.
- **Post-processing Rules:** Implement checks that ensure specific fields (like allergies) are populated whenever the corresponding keywords appear in the source text.

Analyzing model outputs manually is an iterative process. As you refine your model and data, repeat the analysis to track improvements and identify new issues. It is a fundamental part of building reliable and effective fine-tuned models.
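One lightweight way to track such improvements is to compare error-category frequencies between annotation rounds. The sketch below assumes two annotation exports, each with a hypothetical `error_category` column; the file names are placeholders:

```python
import pandas as pd

# Hypothetical annotation exports from two review rounds.
before = pd.read_csv("annotations_round1.csv")
after = pd.read_csv("annotations_round2.csv")

def category_rates(df: pd.DataFrame) -> pd.Series:
    """Share of annotated outputs falling into each error category."""
    return df["error_category"].value_counts(normalize=True)

comparison = pd.DataFrame({
    "round_1": category_rates(before),
    "round_2": category_rates(after),
}).fillna(0.0)
comparison["change"] = comparison["round_2"] - comparison["round_1"]

# Categories with the largest decrease appear first.
print(comparison.sort_values("change").to_string(float_format="{:.1%}".format))
```

Categories whose share drops between rounds suggest the interventions are working; new or growing categories point to where the next round of qualitative analysis should focus.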