Let's put the theory of Learning to Rank (LTR) with gradient boosting into practice using XGBoost. The goal is to train a model that, given a query, predicts scores for documents (or items) such that sorting by these scores approximates the true relevance ranking. We will use the rank:pairwise
objective, which focuses on minimizing the number of pairs of documents within the same query group that are ordered incorrectly.
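To make the pairwise idea concrete, here is a tiny standalone sketch (with made-up labels and scores, independent of the pipeline built below) that counts how many within-query document pairs a set of model scores orders incorrectly:
# Toy illustration: count incorrectly ordered pairs within a single query.
# The relevance labels and scores below are hypothetical, chosen only for illustration.
import itertools

true_relevance = [2, 1, 0, 0]            # graded relevance labels for one query's documents
predicted_scores = [0.3, 0.9, 0.1, 0.5]  # model scores for the same documents

bad_pairs = sum(
    1
    for i, j in itertools.combinations(range(len(true_relevance)), 2)
    # A pair is inverted when the relevance difference and the score difference disagree in sign
    if (true_relevance[i] - true_relevance[j]) * (predicted_scores[i] - predicted_scores[j]) < 0
)
print(bad_pairs)  # 2 of the 5 comparable (unequal-relevance) pairs are inverted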
First, ensure you have the necessary libraries installed:
pip install xgboost numpy pandas scikit-learn
For LTR, our data needs a specific structure: alongside the feature matrix and graded relevance labels, every row carries a query ID (qid) grouping documents that were retrieved for the same query. The ranking objective operates within these groups. Let's simulate a small dataset representing search results for different queries.
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore", category=UserWarning) # Suppress XGBoost warnings for illustration
# Simulate LTR data
np.random.seed(42)
n_queries = 10
n_docs_per_query = 15
n_features = 5
# Generate features
X = np.random.rand(n_queries * n_docs_per_query, n_features)
# Generate query IDs
qids = np.repeat(np.arange(n_queries), n_docs_per_query)
# Generate relevance scores (higher relevance correlated with first feature)
# Simulate imperfect correlation and add noise
base_relevance = X[:, 0] * 2 + np.random.randn(X.shape[0]) * 0.5
# Assign discrete relevance levels (e.g., 0, 1, 2) based on quantiles within each query
y = np.zeros_like(base_relevance, dtype=int)
for qid in range(n_queries):
    query_mask = (qids == qid)
    query_relevance = base_relevance[query_mask]
    # Assign labels based on relevance quantiles within the query
    q_75 = np.percentile(query_relevance, 75)
    q_25 = np.percentile(query_relevance, 25)
    # Masks must stay full-length, so compare base_relevance (not the sliced array) against the quantiles
    y[query_mask & (base_relevance >= q_75)] = 2                            # Highly relevant
    y[query_mask & (base_relevance >= q_25) & (base_relevance < q_75)] = 1  # Somewhat relevant
    # Others remain 0 (irrelevant)
# Create a DataFrame
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)])
df['qid'] = qids
df['relevance'] = y
print("Simulated Dataset Head:")
print(df.head())
print("\nDataset Info:")
df.info()
print("\nRelevance Distribution:")
print(df['relevance'].value_counts())
This gives us a DataFrame df with features, query IDs (qid), and graded relevance labels (relevance).
XGBoost's ranking objectives require knowing the size of each query group. We also need to split the data so that all documents from the same query land in the same split (either training or testing). GroupKFold from scikit-learn is suitable for this.
# --- Data Splitting respecting groups ---
# We'll use GroupKFold to get indices for one split
gkf = GroupKFold(n_splits=5) # Use 5 splits, take the first one for train/test
train_idx, test_idx = next(gkf.split(df, groups=df['qid']))
X_train, X_test = df.iloc[train_idx].drop(['qid', 'relevance'], axis=1), df.iloc[test_idx].drop(['qid', 'relevance'], axis=1)
y_train, y_test = df.iloc[train_idx]['relevance'], df.iloc[test_idx]['relevance']
qids_train, qids_test = df.iloc[train_idx]['qid'], df.iloc[test_idx]['qid']
# --- Feature Scaling (optional: tree ensembles don't require it, but it keeps the pipeline reusable with other models) ---
# Scale features based on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# --- Calculate Group Sizes ---
# XGBoost needs the size of each group (number of documents per query)
# Sort data by qid first to ensure groups are contiguous
train_order = np.argsort(qids_train.values)
X_train_scaled = X_train_scaled[train_order]
y_train = y_train.iloc[train_order]
qids_train = qids_train.iloc[train_order]
group_train = qids_train.value_counts().sort_index().values
test_order = np.argsort(qids_test.values)
X_test_scaled = X_test_scaled[test_order]
y_test = y_test.iloc[test_order]
qids_test = qids_test.iloc[test_order]
group_test = qids_test.value_counts().sort_index().values
# --- Create DMatrix ---
# The special DMatrix is used to pass data and group info to XGBoost
dtrain = xgb.DMatrix(X_train_scaled, label=y_train)
dtrain.set_group(group_train)
dtest = xgb.DMatrix(X_test_scaled, label=y_test)
dtest.set_group(group_test)
print(f"\nTraining set: {len(X_train)} samples, {len(group_train)} queries")
print(f"Test set: {len(X_test)} samples, {len(group_test)} queries")
print(f"Train group sizes (first 5): {group_train[:5]}")
print(f"Test group sizes (first 5): {group_test[:5]}")
Key steps here:
- We used GroupKFold to ensure all documents for a given qid end up entirely in either the training set or the test set, preventing leakage between splits.
- We fit the StandardScaler on the training data only and applied it to the test data.
- We calculated the group size arrays (group_train, group_test) after sorting the data by qid. These arrays tell XGBoost how many documents belong to the first query, the second query, and so on.
- We created xgb.DMatrix objects, passing the feature matrices and labels. The essential step for LTR is calling dtrain.set_group(group_train) and dtest.set_group(group_test).
Now, we configure and train the XGBoost model. We set the objective to rank:pairwise and use ndcg@k (Normalized Discounted Cumulative Gain at cutoff k) as the evaluation metric. NDCG measures the quality of a predicted ranking by comparing it to the ideal ranking implied by the true relevance labels.
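As a quick worked example, suppose a query has three documents whose true labels, read in the model's predicted order, are [1, 2, 0]. Using the exponential gain 2^rel - 1 and a log2 position discount, DCG@3 = (2^1 - 1)/log2(2) + (2^2 - 1)/log2(3) + (2^0 - 1)/log2(4) ≈ 1 + 1.89 + 0 = 2.89. The ideal order [2, 1, 0] gives IDCG@3 ≈ 3 + 0.63 + 0 = 3.63, so NDCG@3 ≈ 2.89 / 3.63 ≈ 0.80.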
# --- XGBoost Parameters for LTR ---
params = {
    'objective': 'rank:pairwise',          # Pairwise ranking objective
    'eval_metric': ['ndcg@5', 'ndcg@10'],  # Evaluate using NDCG at cutoffs 5 and 10
    'eta': 0.1,                            # Learning rate
    'gamma': 1.0,                          # Minimum loss reduction required for a split
    'min_child_weight': 1,                 # Minimum sum of instance weight needed in a child
    'max_depth': 4,                        # Maximum tree depth
    'seed': 42
}
num_boost_round = 100 # Number of boosting rounds
evals = [(dtrain, 'train'), (dtest, 'test')] # Datasets for evaluation during training
# --- Train the Model ---
print("\nTraining XGBoost LTR model...")
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=evals,
    verbose_eval=20  # Print evaluation results every 20 rounds
)
The training output shows the NDCG scores on both the training and test sets improving over the boosting rounds. We monitor the test set metrics (test-ndcg@5, test-ndcg@10) to gauge generalization.
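If you prefer to stop training automatically once the test metric plateaus, xgb.train accepts an early_stopping_rounds argument; by default it watches the last metric of the last dataset in evals (here test-ndcg@10). A minimal sketch, reusing the objects defined above:
# Sketch: halt training if test NDCG@10 has not improved for 10 consecutive rounds
bst_es = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=evals,
    early_stopping_rounds=10,
    verbose_eval=False
)
print(f"Best iteration: {bst_es.best_iteration}")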
The trained model (bst) predicts a score for each document; higher scores imply higher predicted relevance. To evaluate performance, we predict scores for the test set, compute NDCG@k within each query group, and average the per-query values:
# --- Make Predictions ---
# Predictions are relevance scores, higher means more relevant
y_pred_scores = bst.predict(dtest)
# --- Evaluate Performance (NDCG Calculation) ---
# We need a function to calculate NDCG@k for the entire test set
def calculate_ndcg_at_k(y_true, y_pred_scores, groups, k):
    """Calculates mean NDCG@k over all query groups."""
    ndcg_scores = []
    start_idx = 0
    for group_size in groups:
        end_idx = start_idx + group_size
        # Get true relevance and predicted scores for the current group
        group_y_true = y_true[start_idx:end_idx]
        group_y_pred = y_pred_scores[start_idx:end_idx]
        # Sort documents by predicted score in descending order
        sorted_indices = np.argsort(group_y_pred)[::-1]
        sorted_y_true = group_y_true[sorted_indices]
        # Calculate DCG@k (Discounted Cumulative Gain)
        actual_k = min(k, group_size)
        dcg = np.sum((2**sorted_y_true[:actual_k] - 1) / np.log2(np.arange(2, actual_k + 2)))
        # Calculate IDCG@k (Ideal DCG)
        ideal_sorted_y_true = np.sort(group_y_true)[::-1]  # Sort true relevance descending
        idcg = np.sum((2**ideal_sorted_y_true[:actual_k] - 1) / np.log2(np.arange(2, actual_k + 2)))
        # Calculate NDCG@k for the group
        ndcg = dcg / idcg if idcg > 0 else 0.0
        ndcg_scores.append(ndcg)
        start_idx = end_idx
    return np.mean(ndcg_scores)
# Get the true relevance labels corresponding to the test set order
y_test_ordered = y_test.values # Ensure it's a NumPy array
# Calculate NDCG@5 and NDCG@10 on the test set
ndcg_at_5 = calculate_ndcg_at_k(y_test_ordered, y_pred_scores, group_test, k=5)
ndcg_at_10 = calculate_ndcg_at_k(y_test_ordered, y_pred_scores, group_test, k=10)
print(f"\nEvaluation on Test Set:")
print(f"Calculated Mean NDCG@5: {ndcg_at_5:.4f}")
print(f"Calculated Mean NDCG@10: {ndcg_at_10:.4f}")
# Compare with XGBoost's internal evaluation during the last round
print("\nComparison with final round metrics from training:")
print(f"XGBoost reported NDCG@5-test: {bst.eval(dtest).split()[1].split(':')[1]}")
print(f"XGBoost reported NDCG@10-test: {bst.eval(dtest).split()[2].split(':')[1]}")
Our manual calculation should closely match the final NDCG reported by XGBoost during training, confirming our understanding and implementation of the metric.
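To see how the scores translate into an actual ranking, the short sketch below (using the group_test, y_pred_scores, and y_test_ordered arrays already defined) sorts the first test query's documents by predicted score and prints their true labels in that order:
# Sketch: reconstruct the predicted ranking for the first test query
first_group_size = group_test[0]
first_scores = y_pred_scores[:first_group_size]
first_labels = y_test_ordered[:first_group_size]
ranking = np.argsort(first_scores)[::-1]  # document positions, best first
print("True labels in predicted order:", first_labels[ranking])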
We can still analyze feature importance for ranking models to understand which features contribute most to the predicted relevance scores.
# --- Feature Importance ---
importance = bst.get_score(importance_type='gain')  # 'gain', 'weight', 'cover'
# Note: features appear as f0, f1, ... because the DMatrix was built from a NumPy array without feature names
sorted_importance = sorted(importance.items(), key=lambda item: item[1], reverse=True)
print("\nFeature Importance (Gain):")
for feature, score in sorted_importance:
    print(f"{feature}: {score:.4f}")
# Optional: Visualize feature importance
try:
    import matplotlib.pyplot as plt
    xgb.plot_importance(bst, importance_type='gain', max_num_features=10, height=0.8, title='Feature Importance (Gain)')
    plt.tight_layout()
    plt.show()
except ImportError:
    print("\nInstall matplotlib to visualize feature importance: pip install matplotlib")
This practical exercise demonstrated the end-to-end process of using XGBoost for a learning-to-rank task. Key takeaways include the specific data preparation required (query groups), the use of ranking objectives like rank:pairwise, and evaluation with metrics like NDCG. This approach lets you leverage the power of gradient boosting for optimizing the order of items, a common requirement in search engines, recommendation systems, and question answering. Remember that other objectives such as rank:ndcg and rank:map are also available and might yield better results depending on the specific dataset and evaluation criteria.
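Switching objectives only requires a parameter change; a minimal sketch, assuming the params, dtrain, dtest, and evals objects from above:
# Sketch: retrain with the rank:ndcg objective and compare test NDCG
params_ndcg = {**params, 'objective': 'rank:ndcg'}
bst_ndcg = xgb.train(params_ndcg, dtrain, num_boost_round=num_boost_round,
                     evals=evals, verbose_eval=False)
print(bst_ndcg.eval(dtest))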