You will train your first XGBoost model using the core Python API. The process covers loading the data, monitoring training, making predictions, and interpreting the results. This exercise applies concepts covered earlier, such as the DMatrix data structure and the role of the objective function.
We will use the California Housing dataset, a classic regression problem where the goal is to predict the median house value in a California district based on several features.
First, ensure you have the necessary libraries installed: xgboost, scikit-learn (for the dataset and evaluation metrics), and pandas (for data manipulation). You can install them with pip install xgboost scikit-learn pandas.
Let's begin by loading the dataset and splitting it into training and testing sets. This is a standard procedure that separates the data used for model training from the data used for evaluating its performance on unseen examples.
import xgboost as xgb
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Load the dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target)
# Create a training and testing split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
As we covered previously, XGBoost has its own optimized data structure called DMatrix. It is highly efficient for both memory usage and training speed. We need to convert our pandas DataFrames and Series into this format before we can proceed with training.
# Convert the dataset into DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
print("DMatrix object created for training and testing sets.")
With our data prepared, the next step is to define the model's hyperparameters. These are set in a Python dictionary. For this initial model, we will define an objective function, an evaluation metric, and a few parameters to control the learning process and tree structure.
- objective: 'reg:squarederror' specifies that we are performing a regression task and want to minimize the squared difference between the actual and predicted values.
- eval_metric: 'rmse' (Root Mean Squared Error) will be used to monitor the model's performance on the evaluation set during training.
- eta: This is the learning rate, which scales the contribution of each tree. A lower value makes the boosting process more conservative.
- max_depth: This controls the maximum depth of each decision tree, which helps prevent overfitting.

We will also create a watchlist, which is a list of DMatrix objects. The model will report its performance on these datasets after each boosting round, giving us a live view of how it is learning.
# Specify model parameters
params = {
'objective': 'reg:squarederror',
'eval_metric': 'rmse',
'eta': 0.1,
'max_depth': 4,
'seed': 42
}
# Specify the number of boosting rounds
num_boost_round = 100
# Create a watchlist to monitor performance
watchlist = [(dtrain, 'train'), (dtest, 'test')]
# Train the model
bst = xgb.train(
params,
dtrain,
num_boost_round=num_boost_round,
evals=watchlist,
verbose_eval=10 # Print evaluation results every 10 rounds
)
When you run this code, XGBoost will print the RMSE for both the training and testing sets every 10 rounds. You should observe the RMSE decreasing for both sets. If the test RMSE starts to increase while the train RMSE continues to decrease, it's a sign of overfitting. Using a watchlist is an effective way to identify the optimal number of boosting rounds.
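If you want training to stop automatically once the test RMSE stops improving, you can pass early_stopping_rounds to xgb.train. The following is a minimal sketch that reuses the params, dtrain, and watchlist defined above; the name bst_early, the larger num_boost_round of 500, and the patience of 10 rounds are illustrative choices, not part of the original example:

# Stop training when the last dataset in evals ('test') shows no RMSE
# improvement for 10 consecutive rounds
bst_early = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=watchlist,
    early_stopping_rounds=10,
    verbose_eval=50
)
print("Best iteration:", bst_early.best_iteration)
print("Best test RMSE:", bst_early.best_score)

Early stopping monitors the last entry in the evals list, so the test set determines when training halts, and the best_iteration attribute tells you how many rounds were actually useful.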
Once the model is trained, we can use it to make predictions on our test data. The predict method takes a DMatrix as input and returns an array of predictions. We can then compare these predictions to the actual values (y_test) using the same metric we monitored during training, RMSE.
# Make predictions on the test set
preds = bst.predict(dtest)
# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"Final RMSE on the test set: {rmse:.4f}")
The resulting RMSE gives you a single number to quantify your model's prediction error. An RMSE of 0.5, for instance, means the model's predictions are off by about $50,000 on average, since the target variable is expressed in units of $100,000.
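To make that interpretation concrete, you can convert the RMSE back into dollars. This short sketch assumes the standard scaling of the California Housing target, which is expressed in units of $100,000; the variable name error_in_dollars is illustrative:

# Convert RMSE from target units ($100,000s) into dollars
error_in_dollars = rmse * 100_000
print(f"Typical prediction error: about ${error_in_dollars:,.0f}")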
A significant advantage of tree-based models like XGBoost is their interpretability. We can easily inspect which features the model considered most important when making its predictions. XGBoost provides a simple way to get these importance scores and even plot them.
The default importance type, 'weight', counts the number of times a feature is used to split the data across all trees in the model.
# Get feature importance scores from the trained booster
importance_scores = bst.get_score(importance_type='weight')
print(importance_scores)
# Plot feature importance with XGBoost's built-in helper
# (requires matplotlib to be installed)
import matplotlib.pyplot as plt
xgb.plot_importance(bst)
plt.show()
Visualizing the feature importance makes it much easier to understand the model's behavior. The chart below shows the importance scores for our trained model, indicating which housing features had the most influence on its predictions.
The F-score indicates how many times each feature was used to create a split in the decision trees. Higher values suggest greater importance. In this case, Median Income (MedInc) was the most frequently used feature.
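Note that 'weight' is only one way to measure importance. The get_score method also accepts other importance types, such as 'gain' (the average improvement in the loss from splits on a feature) and 'cover' (the average number of samples affected by those splits), and they can rank features differently. The loop below is a small sketch for comparing them on the trained booster:

# Compare alternative importance measures; they often rank features differently
for imp_type in ('weight', 'gain', 'cover'):
    scores = bst.get_score(importance_type=imp_type)
    top_feature = max(scores, key=scores.get)
    print(f"{imp_type}: top feature is {top_feature}")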
You have now successfully trained, evaluated, and interpreted an XGBoost model. This workflow forms the basis for nearly all gradient boosting tasks. In the next chapters, we will explore other powerful libraries and learn how to systematically tune the model's hyperparameters to extract even better performance.