You've seen how important the quality of retrieved documents is for the overall RAG system. While techniques like domain-specific embedding models and hybrid search broaden the net and improve initial candidate selection, re-ranking acts as a fine-toothed comb, meticulously sifting through these candidates to bring the absolute best to the forefront. This hands-on section will walk you through implementing an advanced re-ranking stage using a cross-encoder model and evaluating its impact.
We'll simulate a common scenario: a user asks a question, our initial retrieval stage (often implemented as a bi-encoder based system) fetches a list of potentially relevant documents, and then a re-ranker (a cross-encoder) re-evaluates these top candidates to produce a more precise final list.
First, ensure you have the necessary libraries. We'll primarily use sentence-transformers for both our initial retriever and the re-ranker, as it provides convenient interfaces for various pre-trained models.
# Ensure you have these libraries installed:
# pip install sentence-transformers torch
Let's define a small corpus of documents and a few sample queries with their known relevant documents. In a real-world scenario, this corpus would be much larger, and the ground truth for evaluation would be more extensive.
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import torch
# Sample documents (our knowledge base)
documents = [
{"id": "doc1", "text": "Our software supports Windows 10, Windows 11, and macOS Monterey or newer."},
{"id": "doc2", "text": "To install, download the installer from our website and run it. Follow the on-screen prompts."},
{"id": "doc3", "text": "The license key can be found in your purchase confirmation email. Enter it in the 'Activation' window."}
{"id": "doc4", "text": "For troubleshooting, please check our online knowledge base or contact support via support@example.com."},
{"id": "doc5", "text": "System requirements include at least 4GB of RAM and 10GB of free disk space. A modern CPU is recommended for optimal performance."},
{"id": "doc6", "text": "Updates are automatically downloaded and installed. You can check for updates manually via the 'Help' menu."}
]
doc_texts = [doc['text'] for doc in documents]
# Sample queries with ground truth for evaluation
queries_with_ground_truth = [
{"query": "How do I install the software?", "relevant_doc_id": "doc2", "relevant_doc_text": documents[1]["text"]},
{"query": "What operating systems are supported?", "relevant_doc_id": "doc1", "relevant_doc_text": documents[0]["text"]},
{"query": "Where is my license key?", "relevant_doc_id": "doc3", "relevant_doc_text": documents[2]["text"]},
{"query": "What are the RAM requirements?", "relevant_doc_id": "doc5", "relevant_doc_text": documents[4]["text"]}
]
A bi-encoder model, like those commonly used for semantic search, independently computes embeddings for the query and all documents. The relevance is then determined by the similarity (e.g., cosine similarity) between these embeddings.
# Load a bi-encoder model for initial retrieval
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode our document corpus
doc_embeddings = bi_encoder.encode(doc_texts, convert_to_tensor=True)
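# Quick sanity check: one embedding per document. all-MiniLM-L6-v2 produces
# 384-dimensional vectors, so for our six documents this prints torch.Size([6, 384]).
print(doc_embeddings.shape)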
# Function to perform initial retrieval
def retrieve_initial_documents(query_text, top_k=3):
    query_embedding = bi_encoder.encode(query_text, convert_to_tensor=True)
    # We use cosine similarity and torch.topk to find the highest scores
    cos_scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top_results = torch.topk(cos_scores, k=top_k)
    retrieved_docs = []
    print(f"\nQuery: {query_text}")
    print("Top initial results (Bi-Encoder):")
    for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
        retrieved_docs.append({"id": documents[idx.item()]["id"], "text": documents[idx.item()]["text"], "score": score.item()})
        print(f"{i+1}. ID: {documents[idx.item()]['id']}, Score: {score.item():.4f}, Text: {documents[idx.item()]['text'][:100]}...")
    return retrieved_docs
# Let's test initial retrieval for one query
sample_query = queries_with_ground_truth[0]["query"] # "How do I install the software?"
initial_candidates = retrieve_initial_documents(sample_query, top_k=3)
You'll notice that the initial retrieval is fast. However, the top results might not always place the most relevant document at the very top, or they might include documents that are only tangentially related. For "How do I install the software?", doc2 is ideal. Let's see if it's ranked first. Sometimes doc6 ("Updates are automatically downloaded and installed...") might rank highly due to shared terms like "install", even though it's not about the initial setup.
Cross-encoder models work differently. Instead of comparing independent embeddings, they take a query and a document pair as input and output a single score representing their relevance. This allows the model to perform a much deeper, more fine-grained comparison, often leading to superior relevance ranking but at a higher computational cost. That's why they are typically used to re-rank a smaller set of candidates from an initial, faster retrieval stage.
# Load a cross-encoder model for re-ranking
# Common choices include models fine-tuned on MS MARCO or similar passage ranking datasets.
# 'cross-encoder/ms-marco-MiniLM-L-6-v2' is a good, relatively small model.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
# Function to re-rank documents using the cross-encoder
def rerank_documents(query_text, candidate_docs):
    # Prepare pairs for the cross-encoder: [(query, doc_text1), (query, doc_text2), ...]
    pairs = []
    for doc in candidate_docs:
        pairs.append((query_text, doc['text']))
    # Get scores from the cross-encoder.
    # The cross_encoder.predict() method takes a list of pairs and returns a list of scores.
    scores = cross_encoder.predict(pairs)
    # Combine candidates with their new scores
    for i in range(len(candidate_docs)):
        candidate_docs[i]['cross_score'] = scores[i]
    # Sort by the new cross-encoder score in descending order
    reranked_docs = sorted(candidate_docs, key=lambda x: x['cross_score'], reverse=True)
    print("\nRe-ranked results (Cross-Encoder):")
    for i, doc in enumerate(reranked_docs):
        print(f"{i+1}. ID: {doc['id']}, Cross-Score: {doc['cross_score']:.4f}, Text: {doc['text'][:100]}...")
    return reranked_docs
# Re-rank the candidates from our previous example
reranked_candidates = rerank_documents(sample_query, initial_candidates)
Observe the output. You should see that the cross-encoder potentially re-orders the initial_candidates. Ideally, the most relevant document (e.g., doc2 for "How do I install the software?") now has the highest cross_score and is ranked first. The scores themselves are different from the bi-encoder's cosine similarity; cross-encoder scores are often logits that are not bounded between 0 and 1 but directly reflect relevance.
To objectively measure the improvement, we need evaluation metrics. For ranking tasks, common metrics include:
- Mean Reciprocal Rank (MRR): the average, over all queries, of the reciprocal rank of the first relevant document. A relevant document at rank 1 contributes 1.0, at rank 3 it contributes 1/3, and if it is not retrieved at all it contributes 0.
- Precision@1: the fraction of queries for which the top-ranked document is relevant.
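Written out, for a query set $Q$ where $\mathrm{rank}_i$ is the position of the first relevant document for query $i$ (the term is taken as 0 when no relevant document is retrieved):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}, \qquad \mathrm{Precision@1} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbf{1}[\mathrm{rank}_i = 1]$$

These are exactly the quantities the helper function below computes.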
Let's implement a simple evaluation.
def calculate_mrr_and_precision_at_1(ranked_results_list, ground_truth_list):
    reciprocal_ranks = []
    precision_at_1_scores = []
    for i, ranked_docs in enumerate(ranked_results_list):
        query_info = ground_truth_list[i]
        relevant_id = query_info["relevant_doc_id"]
        found_rank = -1
        for rank, doc in enumerate(ranked_docs):
            if doc["id"] == relevant_id:
                found_rank = rank + 1
                break
        if found_rank != -1:
            reciprocal_ranks.append(1.0 / found_rank)
            if found_rank == 1:
                precision_at_1_scores.append(1.0)
            else:
                precision_at_1_scores.append(0.0)
        else:
            reciprocal_ranks.append(0.0)  # Relevant document not found in top_k
            precision_at_1_scores.append(0.0)
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0
    p_at_1 = sum(precision_at_1_scores) / len(precision_at_1_scores) if precision_at_1_scores else 0
    return mrr, p_at_1
# --- Evaluation ---
print("\n--- Evaluating Performance ---")
initial_retrieval_results_all_queries = []
reranked_results_all_queries = []
K_INITIAL = 3 # Number of documents from initial retrieval to consider for re-ranking
for item in queries_with_ground_truth:
    query = item["query"]
    print(f"\nProcessing query: {query}")
    # Initial Retrieval
    initial_docs = retrieve_initial_documents(query, top_k=K_INITIAL)
    initial_retrieval_results_all_queries.append(initial_docs)
    # Re-ranking (re-rank the same initial set)
    reranked_docs = rerank_documents(query, initial_docs)
    reranked_results_all_queries.append(reranked_docs)
# Calculate metrics
mrr_initial, p1_initial = calculate_mrr_and_precision_at_1(initial_retrieval_results_all_queries, queries_with_ground_truth)
mrr_reranked, p1_reranked = calculate_mrr_and_precision_at_1(reranked_results_all_queries, queries_with_ground_truth)
print("\n--- Evaluation Summary ---")
print(f"Initial Retrieval (Bi-Encoder) -> MRR: {mrr_initial:.4f}, Precision@1: {p1_initial:.4f}")
print(f"After Re-ranking (Cross-Encoder) -> MRR: {mrr_reranked:.4f}, Precision@1: {p1_reranked:.4f}")
Performance comparison before and after applying a re-ranking stage. Actual values depend on the dataset and models, but an upward trend is typical. (Note: The values 0.625, 0.9375, 0.50, 0.75 are illustrative based on a good run on the sample data; your exact results may vary.)
You should typically see an improvement in both MRR and Precision@1 after applying the re-ranker. This demonstrates that the re-ranking step is effectively promoting more relevant documents to higher positions.
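This precision comes at a latency cost, since the cross-encoder runs once per candidate document. Below is a quick timing sketch, assuming the functions and K_INITIAL defined above are still in scope; absolute numbers depend entirely on your hardware and whether a GPU is available.
import time

timing_query = queries_with_ground_truth[0]["query"]

# Bi-encoder stage: corpus embeddings are precomputed, so this measures query
# encoding plus the cosine-similarity search.
start = time.perf_counter()
timing_candidates = retrieve_initial_documents(timing_query, top_k=K_INITIAL)
bi_encoder_ms = (time.perf_counter() - start) * 1000

# Cross-encoder stage: scores each of the K_INITIAL (query, document) pairs.
start = time.perf_counter()
rerank_documents(timing_query, timing_candidates)
cross_encoder_ms = (time.perf_counter() - start) * 1000

print(f"\nBi-encoder retrieval: {bi_encoder_ms:.1f} ms | Cross-encoder re-ranking: {cross_encoder_ms:.1f} ms")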
There are two practical considerations to keep in mind:
- Latency: the cross-encoder must score every (query, document) pair, meaning all K_INITIAL documents pass through a typically larger model for each query. This is why it's a re-ranking step applied to a subset of candidates, not a primary retrieval method for large corpora.
- Choosing K_INITIAL: the number of documents passed from the initial retriever to the re-ranker (K_INITIAL in our code, often referred to as k' or top_n_for_reranking) is an important hyperparameter.
  - If K_INITIAL is too small, the truly relevant document might not even make it to the re-ranking stage.
  - If K_INITIAL is too large, the latency penalty of re-ranking increases significantly.
  - Typical values range from 20 to 100, depending on the application's latency budget and the quality of the initial retriever.

This hands-on exercise demonstrates a powerful technique for significantly enhancing the precision of your RAG system's retrieval component. By carefully selecting an initial candidate set and then applying a more sophisticated re-ranking model, you can ensure that the context provided to your generator is of the highest possible relevance, directly impacting the quality and factual accuracy of the final generated output. Remember to always evaluate the impact on both relevance metrics and system latency to find the right balance for your production environment.
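As a closing illustration of that last point, here is a minimal, hypothetical sketch of how the re-ranked results might be assembled into context for the generator. The prompt wording and the top_n cutoff are assumptions for illustration, not a prescribed approach; it reuses reranked_candidates and sample_query from earlier.
def build_context(reranked_docs, top_n=2):
    # Keep only the highest-scoring documents after re-ranking and join them
    # into a single context block for the generator prompt.
    selected = reranked_docs[:top_n]
    return "\n\n".join(f"[{doc['id']}] {doc['text']}" for doc in selected)

context = build_context(reranked_candidates)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {sample_query}\nAnswer:"
)
print(prompt)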