Understanding the theory behind evaluation metrics is one thing; applying that understanding to actual RAG system outputs is where practical learning happens. This section provides hands-on practice in analyzing the quality of responses generated by a RAG system, helping you connect the concepts of retrieval relevance, generation faithfulness, and overall answer quality to tangible examples.
Let's assume we have a basic RAG system configured as follows (a minimal code sketch of this setup follows the list):
- Data Source: A collection of internal company documents, including HR policies, project reports, and technical documentation.
- Chunking Strategy: Fixed-size chunking with 256 tokens per chunk and 32 tokens of overlap.
- Embedding Model: A standard Sentence-BERT variant.
- Retriever: Basic cosine similarity search over embeddings stored in a vector index, retrieving the top 3 most similar chunks.
- Generator: A general-purpose instruction-following LLM.
- Prompt Template: "Answer the following question based only on the provided context. If the context does not contain the answer, say 'I cannot answer based on the provided information'.\n\nContext:\n{retrieved_context}\n\nQuestion: {user_query}\n\nAnswer:"
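To make the setup concrete, here is a minimal sketch of how such a pipeline might be wired together. The embedding model name, the in-memory index, and the `generate()` stub are illustrative assumptions, not the exact components behind the examples below.

```python
# Minimal sketch of the assumed pipeline: embed chunks, retrieve the top-3 by
# cosine similarity, and fill the prompt template. Model name and generate()
# are placeholders/assumptions, not the exact components used in this section.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # a standard Sentence-BERT variant

chunks = ["...chunked document text...", "...another 256-token chunk..."]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

PROMPT_TEMPLATE = (
    "Answer the following question based only on the provided context. "
    "If the context does not contain the answer, say "
    "'I cannot answer based on the provided information'.\n\n"
    "Context:\n{retrieved_context}\n\nQuestion: {user_query}\n\nAnswer:"
)

def generate(prompt: str) -> str:
    """Placeholder for a call to your instruction-following LLM."""
    raise NotImplementedError("Wire this to your LLM client of choice.")

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector  # cosine similarity on normalized vectors
    top_indices = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_indices]

def answer(query: str) -> str:
    """Assemble the prompt from the retrieved chunks and ask the LLM."""
    context = "\n\n".join(retrieve(query))
    prompt = PROMPT_TEMPLATE.format(retrieved_context=context, user_query=query)
    return generate(prompt)
```

Keeping retrieval and generation as separate steps makes it easy to inspect the retrieved chunks on their own, which is exactly what the analyses below do.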
We will now examine a few query-response pairs generated by this system and analyze their quality.
Example 1: Query about a Specific Policy
User Query: "What is the process for requesting vacation time?"
Retrieved Context:
- Chunk 1: "... All employees accrue Paid Time Off (PTO) based on their years of service. Full-time employees receive 15 days annually for the first 5 years. PTO can be used for vacation, personal days, or sick leave. Unused PTO..."
- Chunk 2: "... To request time off, employees must submit a request through the 'HR-Portal' system at least two weeks in advance for requests longer than 3 days. Shorter requests should be submitted at least 48 hours prior. Manager approval is required..."
- Chunk 3: "... Project 'Phoenix' quarterly review meeting is scheduled for next Friday. All team leads must attend and present their progress updates. Please finalize your slides by Wednesday EOD. Meeting link will be sent via calendar invite..."
Generated Response: "Employees accrue PTO based on service time, starting at 15 days per year. To request time off, use the 'HR-Portal' system with advance notice (2 weeks for >3 days, 48 hours for shorter). Manager approval is needed. Project Phoenix review is next Friday."
Analysis:
- Retrieval Quality:
  - Chunk 1 provides relevant background information on PTO accrual.
  - Chunk 2 directly addresses the process for requesting time off, which is central to the query.
  - Chunk 3 is completely irrelevant (about a project meeting).
  - Overall: Retrieval is decent but not perfect. It found the most important chunk (Chunk 2) but included noise (Chunk 3). This might happen if keywords like "request" or "time" appeared coincidentally in the project update document.
- Generation Quality:
  - Faithfulness: The response accurately reflects the information present in Chunks 1 and 2. It even includes the irrelevant detail from Chunk 3. It does not appear to add information that wasn't in the context (no hallucination).
  - Relevance: The first part of the response directly answers the user's query about the process. However, including the detail about Project Phoenix makes the overall answer less relevant and concise.
  - Coherence: The response is grammatically correct but slightly disjointed due to the inclusion of the unrelated project meeting.
- Overall Assessment & Potential Improvements:
  - The system provided the core answer but included irrelevant information due to imperfect retrieval.
  - Possible improvements could involve:
    - Improving retrieval: Trying a different embedding model, adjusting the chunking strategy, or adding a re-ranking step after initial retrieval might filter out Chunk 3 (see the re-ranking sketch after this list).
    - Improving generation/prompting: A stricter prompt that explicitly tells the LLM to ignore context passages unrelated to the specific question might help. Our current prompt restricts the answer to the provided context, but the LLM may still be inclined to summarize everything it is given.
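To illustrate the re-ranking idea mentioned above, here is a hedged sketch that builds on the earlier pipeline sketch: over-retrieve with the vector index, then let a cross-encoder score each (query, chunk) pair and keep only the top results. The cross-encoder model name is an illustrative choice, and `retrieve()` is the function defined in the first sketch.

```python
# Sketch of a re-ranking step: cast a wider net with the vector index, then
# re-score candidates with a cross-encoder and keep only the best chunks.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

def retrieve_and_rerank(query: str, k: int = 3, candidates: int = 10) -> list[str]:
    initial = retrieve(query, k=candidates)  # over-retrieve from the vector index
    scores = reranker.predict([(query, chunk) for chunk in initial])
    ranked = sorted(zip(scores, initial), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Adding a minimum-score cutoff on the cross-encoder scores could additionally drop off-topic candidates such as Chunk 3 outright, rather than merely ranking them lower.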
Example 2: Query Requiring Synthesis
User Query: "Summarize the main goals of Project Atlas based on the Q3 report."
Retrieved Context:
- Chunk 1: "...Project Atlas: Q3 Progress Report. Key focus this quarter was stabilizing the deployment pipeline. We successfully reduced deployment failures by 30% compared to Q2, exceeding our target of 25%. Resource allocation remained within budget..."
- Chunk 2: "...primary objective for Atlas remains the migration of legacy user database to the new cloud infrastructure by EOY. Secondary goal: achieve >99.9% uptime for the core service post-migration. Q3 work primarily supported the migration objective through infrastructure prep..."
- Chunk 3: "...challenges encountered included unexpected compatibility issues with the older CRM integration module. This required an additional 2 sprints of engineering effort, delaying the start of performance testing. Mitigation plans are in place..."
Generated Response: "Based on the Q3 report, Project Atlas focused on stabilizing the deployment pipeline, reducing failures by 30%. The main goal is migrating the legacy database to the cloud by EOY, with a secondary goal of achieving >99.9% uptime. Some challenges arose with CRM integration."
Analysis:
- Retrieval Quality:
  - All three retrieved chunks are highly relevant to Project Atlas and its Q3 status. Chunk 1 covers Q3 activity, Chunk 2 details the overarching goals, and Chunk 3 discusses challenges.
  - Overall: Excellent retrieval. The system successfully identified the most pertinent sections of the documentation.
- Generation Quality:
  - Faithfulness: The response accurately summarizes information present across the three chunks without introducing outside facts.
  - Relevance: The response directly addresses the user's query, identifying the main goals (migration, uptime) and mentioning the Q3 focus (pipeline stabilization) and challenges.
  - Coherence: The generated text flows well and presents a concise summary synthesized from the different pieces of context.
- Overall Assessment & Potential Improvements:
  - This is an example of the RAG system working well. Relevant context was retrieved, and the generator successfully synthesized it into a coherent and accurate answer.
  - While this specific output is good, ongoing monitoring is still necessary: for a different query about the same project, retrieval might fail, or the LLM might misinterpret the context. A lightweight faithfulness spot-check is sketched below.
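One lightweight way to keep an eye on faithfulness over time (an assumed monitoring practice, not a feature of the system above) is to periodically ask a judge LLM whether the answer stays within the retrieved context, reusing the `generate()` placeholder from the first sketch:

```python
# Sketch of an LLM-as-judge faithfulness spot-check for periodic monitoring.
# The prompt wording and the generate() placeholder are illustrative assumptions.
FAITHFULNESS_PROMPT = (
    "You are grading a RAG answer.\n"
    "Context:\n{context}\n\nAnswer:\n{answer}\n\n"
    "Does the answer contain any claim that is NOT supported by the context? "
    "Reply with 'FAITHFUL' or 'UNFAITHFUL', followed by a one-sentence reason."
)

def spot_check_faithfulness(context: str, answer_text: str) -> str:
    """Ask a judge LLM whether the answer stays within the retrieved context."""
    prompt = FAITHFULNESS_PROMPT.format(context=context, answer=answer_text)
    return generate(prompt)  # reuses the placeholder from the first sketch
```

Logging these verdicts alongside the queries gives you a running signal of whether faithfulness degrades as the document collection or prompts change.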
Example 3: Query with No Answer in Context
User Query: "What is the company's policy on bringing pets to the office?"
Retrieved Context:
- Chunk 1: "... The office kitchen is stocked with complimentary snacks and beverages. Please clean up after yourself and report any issues with appliances to the facilities team. Use the designated bins for recycling..."
- Chunk 2: "... Security protocols require all visitors to sign in at the front desk and wear a visitor badge. Employees must use their access cards for entry. Doors should not be propped open. Report suspicious activity immediately..."
- Chunk 3: "... Our company values collaboration and innovation. We encourage cross-functional team projects and provide various tools for communication, including Slack and video conferencing suites. Regular town halls keep everyone informed..."
Generated Response: "I cannot answer based on the provided information."
Analysis:
- Retrieval Quality:
  - None of the retrieved chunks mention pets or any office policy related to animals. They cover unrelated topics: the kitchen, security, and company values.
  - Overall: In one sense retrieval failed, since the answer is not present in what was returned. But because top-k search always returns k chunks, the retriever simply surfaced the closest matches available; if no pet policy document exists in the knowledge base (as we assume for this example), this is expected behavior rather than a retrieval error.
- Generation Quality:
  - Faithfulness: The LLM correctly followed the instruction in the prompt ("If the context does not contain the answer, say 'I cannot answer based on the provided information'"). It did not attempt to hallucinate an answer.
  - Relevance: The response is perfectly relevant to the situation where the answer cannot be found in the provided context.
  - Coherence: The response is clear and direct.
- Overall Assessment & Potential Improvements:
  - The system behaved correctly given the information available (or rather, the lack of it). It recognized that the retrieved context did not contain the answer and informed the user accordingly.
  - This highlights the importance of the prompt instructing the LLM on how to handle cases where the context is insufficient. It also underscores that a RAG system's knowledge is limited to the documents it has access to: if a pet policy document exists but wasn't indexed, the failure lies in the data preparation stage, not in retrieval or generation. A retrieval-score safeguard that flags such queries before the LLM is even called is sketched below.
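Building on the earlier pipeline sketch, one possible safeguard for this kind of query is to flag low-similarity retrievals before generation. Because top-k search always returns k chunks, a minimum-similarity cutoff (the 0.3 value below is an illustrative assumption that would need tuning on your own data) can catch queries for which nothing relevant was indexed:

```python
# Sketch of flagging "nothing relevant was found" before generation.
# Reuses embedder, chunks, chunk_vectors, and np from the first sketch.
def retrieve_with_threshold(query: str, k: int = 3, min_score: float = 0.3):
    """Return (chunk, score) pairs above min_score; an empty list signals
    that the canned 'I cannot answer...' response can be sent without an LLM call."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ query_vector
    top_indices = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top_indices if scores[i] >= min_score]
```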
Practice Analyzing Outputs
Reviewing outputs like these is a critical part of understanding your RAG system's behavior. Look at the responses your own system generates for various queries. Ask yourself:
- Was the retrieved context relevant? Did it contain the necessary information? Was irrelevant noise included?
- Was the generated answer faithful to the context? Did it invent information? Did it ignore important parts of the context?
- Did the answer directly address the user's query? Was it concise and clear?
- If the answer was poor, was the primary issue retrieval or generation?
This qualitative analysis, performed regularly, complements quantitative evaluation metrics and provides valuable insights into where your system excels and where it needs refinement. It helps you diagnose problems and guides your efforts in adjusting chunking, selecting embedding models, refining prompts, or even curating the underlying knowledge base.
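If you want to make this qualitative review systematic, one purely illustrative option is to record each judgment in a small structured log that you can revisit and aggregate later; the field names below are assumptions, not a standard schema.

```python
# Sketch of a structured review log for qualitative RAG output analysis.
from dataclasses import dataclass, asdict
import json

@dataclass
class OutputReview:
    query: str
    retrieval_relevant: bool   # did the context contain the needed information?
    retrieval_noise: bool      # was irrelevant material included?
    faithful: bool             # did the answer stick to the context?
    answered_query: bool       # did it directly address the question?
    primary_issue: str         # "retrieval", "generation", or "none"
    notes: str = ""

def log_review(review: OutputReview, path: str = "rag_reviews.jsonl") -> None:
    """Append one review as a JSON line for later aggregation."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(review)) + "\n")
```

Over time, counting how often `primary_issue` is "retrieval" versus "generation" gives a rough, data-driven sense of where to focus improvement efforts.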