The goal is to train a model that, given a query, predicts scores for documents (or items) such that sorting by these scores approximates the true relevance ranking. We will use the `rank:pairwise` objective, which focuses on minimizing the number of pairs of documents within the same query group that are ordered incorrectly.

## Setting Up the Environment and Data

First, ensure you have the necessary libraries installed:

```bash
pip install xgboost numpy pandas scikit-learn
```

For LTR, our data needs a specific structure:

- Features for each document/item.
- A relevance score (label) for each document/item. Often these are graded relevance levels (e.g., 0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant).
- A query ID (`qid`) grouping documents that were retrieved for the same query. The ranking objective operates within these groups.

Let's simulate a small dataset representing search results for different queries.

```python
import warnings

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=UserWarning)  # Suppress XGBoost warnings for illustration

# Simulate LTR data
np.random.seed(42)
n_queries = 10
n_docs_per_query = 15
n_features = 5

# Generate features
X = np.random.rand(n_queries * n_docs_per_query, n_features)

# Generate query IDs
qids = np.repeat(np.arange(n_queries), n_docs_per_query)

# Generate relevance scores (higher relevance correlated with the first feature),
# with added noise so the correlation is imperfect
base_relevance = X[:, 0] * 2 + np.random.randn(X.shape[0]) * 0.5

# Assign discrete relevance levels (0, 1, 2) based on quantiles within each query
y = np.zeros_like(base_relevance, dtype=int)
for qid in range(n_queries):
    query_mask = (qids == qid)
    query_relevance = base_relevance[query_mask]
    # Quantile thresholds computed within the query
    q_75 = np.percentile(query_relevance, 75)
    q_25 = np.percentile(query_relevance, 25)
    # Compare against the full-length base_relevance array so the boolean
    # masks have the same shape as query_mask
    y[query_mask & (base_relevance >= q_75)] = 2                            # Highly relevant
    y[query_mask & (base_relevance >= q_25) & (base_relevance < q_75)] = 1  # Somewhat relevant
    # Others remain 0 (irrelevant)

# Create a DataFrame
df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)])
df['qid'] = qids
df['relevance'] = y

print("Simulated Dataset Head:")
print(df.head())
print("\nDataset Info:")
df.info()
print("\nRelevance Distribution:")
print(df['relevance'].value_counts())
```

This gives us a DataFrame `df` with features, query IDs (`qid`), and relevance labels (`relevance`).

## Data Preparation for XGBoost LTR

XGBoost's ranking objectives require knowing the size of each query group. We also need to split the data while keeping all documents from the same query within the same split (either training or testing).
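Before writing the full pipeline, it helps to see what the group bookkeeping looks like. The values below are a hypothetical toy illustration, not part of the dataset generated above:

```python
# Illustrative toy values only: once rows are sorted by qid, the "group" array
# simply records how many consecutive rows belong to each query.
toy_qids = [0, 0, 0, 1, 1, 2, 2, 2, 2]  # 3 docs for query 0, 2 for query 1, 4 for query 2
toy_group_sizes = [3, 2, 4]             # this is what we will later pass to DMatrix.set_group()
```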
For the split itself, `GroupKFold` from scikit-learn is suitable.

```python
# --- Data splitting respecting groups ---
# We'll use GroupKFold and take the first of its 5 splits for train/test
gkf = GroupKFold(n_splits=5)
train_idx, test_idx = next(gkf.split(df, groups=df['qid']))

X_train = df.iloc[train_idx].drop(['qid', 'relevance'], axis=1)
X_test = df.iloc[test_idx].drop(['qid', 'relevance'], axis=1)
y_train, y_test = df.iloc[train_idx]['relevance'], df.iloc[test_idx]['relevance']
qids_train, qids_test = df.iloc[train_idx]['qid'], df.iloc[test_idx]['qid']

# --- Feature scaling (optional, but often good practice) ---
# Scale features based on the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Calculate group sizes ---
# XGBoost needs the size of each group (number of documents per query).
# Sort the data by qid first to ensure groups are contiguous.
train_order = np.argsort(qids_train.values)
X_train_scaled = X_train_scaled[train_order]
y_train = y_train.iloc[train_order]
qids_train = qids_train.iloc[train_order]
group_train = qids_train.value_counts().sort_index().values

test_order = np.argsort(qids_test.values)
X_test_scaled = X_test_scaled[test_order]
y_test = y_test.iloc[test_order]
qids_test = qids_test.iloc[test_order]
group_test = qids_test.value_counts().sort_index().values

# --- Create DMatrix objects ---
# The DMatrix carries the data, labels, and group information to XGBoost
dtrain = xgb.DMatrix(X_train_scaled, label=y_train)
dtrain.set_group(group_train)

dtest = xgb.DMatrix(X_test_scaled, label=y_test)
dtest.set_group(group_test)

print(f"\nTraining set: {len(X_train)} samples, {len(group_train)} queries")
print(f"Test set: {len(X_test)} samples, {len(group_test)} queries")
print(f"Train group sizes (first 5): {group_train[:5]}")
print(f"Test group sizes (first 5): {group_test[:5]}")
```

Important steps here:

- We used `GroupKFold` to ensure all documents for a given `qid` end up in either the training set or the test set, preventing data leakage across queries.
- Features were scaled with `StandardScaler` fitted only on the training data.
- Crucially, we calculated the size of each query group (`group_train`, `group_test`) after sorting the data by `qid`. This array tells XGBoost how many documents belong to the first query, the second query, and so on.
- We created `xgb.DMatrix` objects from the feature matrices and labels. The essential step for LTR is calling `dtrain.set_group(group_train)` and `dtest.set_group(group_test)`.

## Training the XGBoost Ranking Model

Now we configure and train the XGBoost model. We set the objective to `rank:pairwise` and use `ndcg@k` (Normalized Discounted Cumulative Gain at cutoff k) as the evaluation metric. NDCG measures the quality of the ranking by comparing it to the ideal ranking based on the true relevance labels.
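Concretely, with $rel_i$ denoting the graded relevance of the document at ranked position $i$, the gain-based form used by the manual evaluation code later in this post is:

$$
\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}},
$$

where $\mathrm{IDCG@k}$ is the DCG@k of the ideal ordering (documents sorted by true relevance), so NDCG@k lies in $[0, 1]$ and equals 1 when the top $k$ is ranked perfectly.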
```python
# --- XGBoost parameters for LTR ---
params = {
    'objective': 'rank:pairwise',          # Pairwise ranking objective
    'eval_metric': ['ndcg@5', 'ndcg@10'],  # Evaluate using NDCG at cutoffs 5 and 10
    'eta': 0.1,                            # Learning rate
    'gamma': 1.0,                          # Minimum loss reduction required for a split
    'min_child_weight': 1,                 # Minimum sum of instance weight needed in a child
    'max_depth': 4,                        # Maximum tree depth
    'seed': 42
}

num_boost_round = 100                         # Number of boosting rounds
evals = [(dtrain, 'train'), (dtest, 'test')]  # Datasets evaluated during training

# --- Train the model ---
print("\nTraining XGBoost LTR model...")
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=evals,
    verbose_eval=20  # Print evaluation results every 20 rounds
)
```

The training output shows the NDCG scores on both the training and test sets improving over the boosting rounds. We monitor the test-set performance (`ndcg@5-test`, `ndcg@10-test`) to gauge generalization.

## Prediction and Evaluation

The trained model (`bst`) predicts a score for each document; higher scores imply higher predicted relevance. To evaluate performance, we need to:

1. Get the predicted scores for the test set.
2. For each query in the test set, sort the documents by these predicted scores.
3. Calculate NDCG@k from the sorted list and the true relevance labels.

```python
# --- Make predictions ---
# Predictions are relevance scores; higher means more relevant
y_pred_scores = bst.predict(dtest)

# --- Evaluate performance (NDCG calculation) ---
def calculate_ndcg_at_k(y_true, y_pred_scores, groups, k):
    """Calculate mean NDCG@k over all query groups."""
    ndcg_scores = []
    start_idx = 0
    for group_size in groups:
        end_idx = start_idx + group_size
        # True relevance and predicted scores for the current group
        group_y_true = y_true[start_idx:end_idx]
        group_y_pred = y_pred_scores[start_idx:end_idx]

        # Sort documents by predicted score in descending order
        sorted_indices = np.argsort(group_y_pred)[::-1]
        sorted_y_true = group_y_true[sorted_indices]

        # DCG@k (Discounted Cumulative Gain)
        actual_k = min(k, group_size)
        dcg = np.sum((2**sorted_y_true[:actual_k] - 1) / np.log2(np.arange(2, actual_k + 2)))

        # IDCG@k (Ideal DCG): DCG of the true relevance labels sorted descending
        ideal_sorted_y_true = np.sort(group_y_true)[::-1]
        idcg = np.sum((2**ideal_sorted_y_true[:actual_k] - 1) / np.log2(np.arange(2, actual_k + 2)))

        # NDCG@k for the group
        ndcg = dcg / idcg if idcg > 0 else 0.0
        ndcg_scores.append(ndcg)
        start_idx = end_idx

    return np.mean(ndcg_scores)

# True relevance labels in the same order as the test DMatrix rows
y_test_ordered = y_test.values  # Ensure it's a NumPy array

# Calculate NDCG@5 and NDCG@10 on the test set
ndcg_at_5 = calculate_ndcg_at_k(y_test_ordered, y_pred_scores, group_test, k=5)
ndcg_at_10 = calculate_ndcg_at_k(y_test_ordered, y_pred_scores, group_test, k=10)

print("\nEvaluation on Test Set:")
print(f"Calculated Mean NDCG@5: {ndcg_at_5:.4f}")
print(f"Calculated Mean NDCG@10: {ndcg_at_10:.4f}")

# Compare with XGBoost's internal evaluation of the final model
print("\nComparison with final round metrics from training:")
print(f"XGBoost reported NDCG@5-test: {bst.eval(dtest).split()[1].split(':')[1]}")
print(f"XGBoost reported NDCG@10-test: {bst.eval(dtest).split()[2].split(':')[1]}")
```

Our manual calculation should closely match the final NDCG values reported by XGBoost during training, confirming our understanding and implementation of the metric.
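As a side note, the same model can be expressed through XGBoost's scikit-learn wrapper, `XGBRanker`, which accepts the group sizes directly in `fit()` instead of requiring a `DMatrix` with `set_group()`. The sketch below mirrors the hyperparameters used above and assumes the wrapper is available in your installed XGBoost version; exact constructor arguments can vary slightly across releases.

```python
# A minimal sketch using the scikit-learn style wrapper (assumes a reasonably
# recent XGBoost release where XGBRanker accepts these constructor arguments).
from xgboost import XGBRanker

ranker = XGBRanker(
    objective='rank:pairwise',
    learning_rate=0.1,
    gamma=1.0,
    min_child_weight=1,
    max_depth=4,
    n_estimators=100,   # plays the role of num_boost_round
    random_state=42,
)
# group_train has the same meaning as in dtrain.set_group(group_train)
ranker.fit(X_train_scaled, y_train, group=group_train)
sklearn_scores = ranker.predict(X_test_scaled)  # comparable to bst.predict(dtest)
```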
## Feature Importance

We can still analyze feature importance for ranking models to understand which features contribute most to the predicted relevance scores.

```python
# --- Feature importance ---
importance = bst.get_score(importance_type='gain')  # alternatives: 'weight', 'cover'
sorted_importance = sorted(importance.items(), key=lambda item: item[1], reverse=True)

print("\nFeature Importance (Gain):")
for feature, score in sorted_importance:
    print(f"{feature}: {score:.4f}")

# Optional: visualize feature importance
try:
    import matplotlib.pyplot as plt

    xgb.plot_importance(bst, importance_type='gain', max_num_features=10,
                        height=0.8, title='Feature Importance (Gain)')
    plt.tight_layout()
    plt.show()
except ImportError:
    print("\nInstall matplotlib to visualize feature importance: pip install matplotlib")
```

This practical exercise demonstrated the end-to-end process of using XGBoost for a learning-to-rank task. The key takeaways are the specific data preparation required (query groups), the use of ranking objectives such as `rank:pairwise`, and evaluation with metrics such as NDCG. This approach lets you leverage the power of gradient boosting to optimize the order of items, a common requirement in search engines, recommendation systems, and question answering. Remember that other objectives such as `rank:ndcg` and `rank:map` are also available and may yield better results depending on the dataset and evaluation criteria.
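For example, trying `rank:ndcg` is a one-line change to the configuration used above. Treat this as a template rather than a recommendation; on this toy dataset the comparison is not meaningful, and the better objective depends on your data and target metric.

```python
# Sketch: retrain with a different ranking objective and compare test metrics.
params_ndcg = dict(params, objective='rank:ndcg')
bst_ndcg = xgb.train(params_ndcg, dtrain, num_boost_round=num_boost_round,
                     evals=evals, verbose_eval=False)

print("rank:ndcg    :", bst_ndcg.eval(dtest))
print("rank:pairwise:", bst.eval(dtest))
```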