Applying XGBoost effectively requires a working knowledge of its Python interface. The library offers two primary methods for model construction: its native Python API and a wrapper class compatible with the Scikit-Learn API. The native API will be examined first, as it reveals the main components of the library and provides the greatest flexibility.
Unlike Scikit-Learn estimators that work directly with NumPy arrays or Pandas DataFrames, XGBoost's native API uses an internal data structure called a DMatrix. This is a memory-efficient and performance-optimized data container designed specifically for the library's algorithms. Converting your data into a DMatrix is the first step in the native workflow.
You can create a DMatrix from several data types, including NumPy arrays, SciPy sparse matrices, and Pandas DataFrames. When creating a DMatrix for training, you provide both the feature matrix (your X data) and the target vector (your y data) using the label argument.
import xgboost as xgb
import numpy as np
import pandas as pd
# Generate sample training data
X_train_data = np.random.rand(100, 5)
y_train_data = np.random.rand(100)
# Create a DMatrix from a NumPy array
dtrain = xgb.DMatrix(X_train_data, label=y_train_data)
print(f"Type of the created object: {type(dtrain)}")
The dtrain object now holds our data in a format ready for high-speed training. For prediction data, you create a DMatrix in the same way but omit the label argument, as the target is unknown.
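For instance, the same constructor accepts a Pandas DataFrame, and a prediction DMatrix simply omits the label argument. The snippet below is a small sketch built on the synthetic arrays from above; the column names are arbitrary placeholders.
# Sketch: building a DMatrix from a Pandas DataFrame (column names are placeholders)
feature_names = [f"feature_{i}" for i in range(5)]
df_train = pd.DataFrame(X_train_data, columns=feature_names)
dtrain_from_df = xgb.DMatrix(df_train, label=y_train_data)
# For prediction data, omit the label argument
X_new = np.random.rand(10, 5)
dnew = xgb.DMatrix(X_new)
print(f"Rows in prediction DMatrix: {dnew.num_row()}")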
In the native API, hyperparameters are not passed as arguments to a model constructor. Instead, they are defined in a Python dictionary. This dictionary contains key-value pairs where the key is the parameter name (e.g., 'max_depth') and the value is its setting.
This approach makes it easy to manage, save, and modify parameter sets. Let's define a basic parameter dictionary for a regression task.
# Define the model's hyperparameters in a dictionary
params = {
    'objective': 'reg:squarederror',  # The loss function to be minimized
    'max_depth': 3,                   # Maximum depth of each decision tree
    'eta': 0.1,                       # Learning rate, also known as 'learning_rate'
    'eval_metric': 'rmse'             # The metric used for evaluation on a validation set
}
The objective parameter is one of the most important, as it specifies the learning task. Common objectives include reg:squarederror for regression, binary:logistic for binary classification, and multi:softmax for multi-class classification. The eta parameter, a synonym for learning rate, controls the step size at each boosting iteration.
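For comparison, a parameter dictionary for a hypothetical three-class classification problem could look like the sketch below; note that multi-class objectives also require the num_class parameter.
# Sketch: parameters for a hypothetical 3-class classification task
params_clf = {
    'objective': 'multi:softmax',  # Outputs the predicted class label directly
    'num_class': 3,                # Required by multi-class objectives
    'max_depth': 4,
    'eta': 0.1,
    'eval_metric': 'mlogloss'      # Multi-class log loss
}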
With the training data in a DMatrix and the parameters in a dictionary, you can train a model using the xgb.train() function. This function takes three main arguments:
params: The dictionary of hyperparameters.
dtrain: The DMatrix containing the training data.
num_boost_round: The total number of trees to build, equivalent to n_estimators in Scikit-Learn.
# Set the number of boosting rounds
num_boost_round = 50
# Train the model
bst = xgb.train(params, dtrain, num_boost_round)
The function returns a trained model object, which we've named bst. This object can now be used to make predictions on new data.
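Before moving on to prediction, note that xgb.train can also report the eval_metric on one or more datasets during training through its evals argument, and can stop early when a validation metric stops improving. The snippet below is a sketch that carves a makeshift validation set out of the synthetic training data purely for illustration.
# Sketch: monitoring a validation set and stopping early (the split here is
# an arbitrary slice of the synthetic data, used only for illustration)
dtr = xgb.DMatrix(X_train_data[:80], label=y_train_data[:80])
dval = xgb.DMatrix(X_train_data[80:], label=y_train_data[80:])
bst_monitored = xgb.train(
    params,
    dtr,
    num_boost_round=num_boost_round,
    evals=[(dtr, 'train'), (dval, 'validation')],  # The metric is reported for each named set
    early_stopping_rounds=10  # Stop if the validation metric does not improve for 10 rounds
)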
To make predictions, you first convert your test dataset into a DMatrix (without the label). Then, you call the .predict() method on the trained model object.
# Generate sample test data
X_test_data = np.random.rand(20, 5)
# Convert the test data into a DMatrix
dtest = xgb.DMatrix(X_test_data)
# Generate predictions
predictions = bst.predict(dtest)
print("Sample predictions:")
print(predictions[:5])
The output is a NumPy array containing the model's predictions for each sample in the test set. The diagram below summarizes the native API workflow.
A summary of the native XGBoost API workflow, from data preparation to prediction.
While the native API provides full control, XGBoost also includes a Scikit-Learn compatible wrapper. This is extremely convenient if you are already familiar with Scikit-Learn's .fit() and .predict() syntax or if you want to integrate XGBoost into a Scikit-Learn Pipeline or hyperparameter search tool like GridSearchCV.
The primary classes are XGBRegressor for regression and XGBClassifier for classification. Hyperparameters are passed directly to the model's constructor, just like any other Scikit-Learn estimator.
Let's replicate our regression task using the XGBRegressor.
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Use the same data as before
X, y = X_train_data, y_train_data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the model with Scikit-Learn syntax
# Parameters are passed as arguments to the constructor
xgb_reg = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=50,
    learning_rate=0.1,  # Note the use of 'learning_rate' instead of 'eta'
    max_depth=3,
    random_state=42
)
# Fit the model using the familiar .fit() method
xgb_reg.fit(X_train, y_train)
# Make predictions using the .predict() method
predictions_sklearn = xgb_reg.predict(X_test)
# Evaluate the model
rmse = np.sqrt(mean_squared_error(y_test, predictions_sklearn))
print(f"RMSE with Scikit-Learn wrapper: {rmse:.4f}")
As you can see, this approach requires less code and aligns perfectly with the standard Scikit-Learn workflow. You don't need to manually create DMatrix objects; the wrapper handles the data conversion internally. For many applications, especially those involving cross-validation and automated tuning, the Scikit-Learn wrapper is the more practical choice, as the sketch below illustrates.
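Because XGBRegressor behaves like any other estimator, it drops straight into tools such as GridSearchCV. The following sketch searches over a small, purely illustrative parameter grid on the synthetic data from above.
from sklearn.model_selection import GridSearchCV
# Sketch: hyperparameter search with GridSearchCV (grid values are illustrative only)
param_grid = {
    'max_depth': [2, 3, 4],
    'learning_rate': [0.05, 0.1],
    'n_estimators': [50, 100]
}
grid_search = GridSearchCV(
    estimator=XGBRegressor(objective='reg:squarederror', random_state=42),
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validated RMSE: {-grid_search.best_score_:.4f}")
The next section provides a hands-on opportunity to apply these APIs to solve a complete modeling problem.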