This practical application demonstrates the optimization of an XGBoost model for a regression task using systematic tuning methods. It begins by establishing a baseline performance with a default model. Then, RandomizedSearchCV is used to explore a wide range of hyperparameter values efficiently. Finally, the tuned model is evaluated to measure the improvement.
Our goal is to improve the predictive accuracy of a model on the California Housing dataset, a classic regression problem where the objective is to predict the median house value for California districts.
First, let's import the necessary libraries and load our dataset. We will use XGBoost for our model and Scikit-Learn for data handling and the tuning utilities.
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
# Load the dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target)
# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)
Before we begin tuning, it's important to know our starting point. We will train an XGBRegressor with its default parameters and evaluate its performance on the test set using Mean Squared Error (MSE). A lower MSE indicates a better fit.
# Initialize the XGBRegressor with default parameters
xgb_baseline = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
# Train the model
xgb_baseline.fit(X_train, y_train)
# Make predictions on the test set
y_pred_baseline = xgb_baseline.predict(X_test)
# Calculate and print the baseline MSE
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
print(f"Baseline Model MSE: {mse_baseline:.4f}")
This baseline score is the number we aim to beat. Any model with an MSE lower than this value represents an improvement.
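To make the metric concrete, here is a minimal sketch of the calculation that `mean_squared_error` performs. The helper name and the toy values are invented for illustration and are not taken from the housing data.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy values, invented for illustration (not housing data)
y_true_toy = [2.0, 1.5, 3.0, 4.0]
y_pred_toy = [2.5, 1.5, 2.0, 4.5]

# Squared errors are 0.25, 0.0, 1.0, 0.25, so the mean is 1.5 / 4
mse_toy = mean_squared_error_manual(y_true_toy, y_pred_toy)
print(f"Toy MSE: {mse_toy:.4f}")
```

Because the errors are squared, a single large miss (the third prediction here) dominates the score.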
Grid Search can be computationally expensive when the hyperparameter space is large. RandomizedSearchCV is a more efficient alternative that samples a fixed number of parameter combinations from specified distributions. This approach allows us to test a wide range of values without trying every single combination.
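A quick count shows how large the saving is. This sketch uses the number of candidate values per hyperparameter from the search space defined in this section, together with the `n_iter=50` and `cv=5` settings used for the search. It counts cross-validation fits only and ignores the final refit.

```python
from math import prod

# Number of candidate values per hyperparameter in this section's search space
values_per_param = {
    "n_estimators": 5,
    "learning_rate": 4,
    "max_depth": 6,
    "subsample": 5,
    "colsample_bytree": 5,
    "gamma": 4,
}
cv_folds = 5
n_iter = 50

# An exhaustive grid search fits every combination once per fold
grid_combinations = prod(values_per_param.values())
grid_fits = grid_combinations * cv_folds

# Randomized search fits only the sampled combinations once per fold
random_fits = n_iter * cv_folds

print(f"Full grid: {grid_combinations} combinations, {grid_fits} model fits")
print(f"Randomized search: {n_iter} combinations, {random_fits} model fits")
```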
Let's define a search space for the hyperparameters we discussed earlier:
- n_estimators: The number of boosting rounds.
- learning_rate: The step size shrinkage.
- max_depth: The maximum depth of a tree.
- subsample: The fraction of observations to be randomly sampled for each tree.
- colsample_bytree: The fraction of columns to be randomly sampled for each tree.
- gamma: Minimum loss reduction required to make a further partition.
# Define the hyperparameter grid for Randomized Search
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'gamma': [0, 0.1, 0.2, 0.3]
}
# Initialize the XGBRegressor
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
# Initialize RandomizedSearchCV
# n_iter controls how many different combinations to try.
# cv is the number of cross-validation folds.
# n_jobs=-1 uses all available CPU cores to speed up the process.
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=50,
    scoring='neg_mean_squared_error',
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)
# Fit RandomizedSearchCV to the training data
random_search.fit(X_train, y_train)
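The search above samples from lists of discrete values. As an alternative, `param_distributions` also accepts SciPy distributions, which lets the search draw values between the listed points. The sketch below is one possible setup, not the one used in this section: the name `param_dist_continuous` and the chosen ranges are assumptions that roughly mirror the lists above. It uses scikit-learn's `ParameterSampler`, the same sampler `RandomizedSearchCV` relies on, to preview a few candidates without fitting any model.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space built from distributions instead of fixed lists
param_dist_continuous = {
    "n_estimators": randint(100, 501),       # integers from 100 to 500 (upper bound is exclusive)
    "learning_rate": loguniform(0.01, 0.2),  # log-uniform over [0.01, 0.2]
    "max_depth": randint(3, 9),              # integers from 3 to 8
    "subsample": uniform(0.6, 0.4),          # uniform over [0.6, 1.0]: loc=0.6, scale=0.4
}

# Draw five candidate settings; no model is trained here
sampled_candidates = list(
    ParameterSampler(param_dist_continuous, n_iter=5, random_state=42)
)
for candidate in sampled_candidates:
    print(candidate)
```

A log-uniform distribution is a common choice for `learning_rate` because useful values tend to span orders of magnitude rather than a linear range.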
After the search completes, RandomizedSearchCV stores the best combination of hyperparameters it found.
# Print the best parameters found
print("Best Hyperparameters found by Randomized Search:")
print(random_search.best_params_)
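Besides `best_params_`, the fitted search object exposes `best_score_` and `cv_results_`. Because the scoring is `neg_mean_squared_error`, `best_score_` is a negative number, and flipping its sign gives the cross-validated MSE. The sketch below shows this on a small stand-in problem so that it runs in seconds: the synthetic data, the `DecisionTreeRegressor`, and all `demo_` names are assumptions for illustration, not part of the housing workflow.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and a small tree stand in for the housing data and XGBoost
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

demo_search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_leaf": [1, 5, 10, 20],
    },
    n_iter=8,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best candidate.
# It is a negated MSE, so flip the sign to read it as an error.
best_cv_mse = -demo_search.best_score_
print(f"Best cross-validated MSE: {best_cv_mse:.2f}")

# cv_results_ holds one entry per sampled candidate
mean_scores = demo_search.cv_results_["mean_test_score"]
print(f"Candidates evaluated: {len(mean_scores)}")
```

The same two attributes are available on `random_search` after it has been fitted.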
Now we evaluate the best model found by the search. Because RandomizedSearchCV uses refit=True by default, it retrains a model with the best hyperparameters on the entire training set once the search finishes and exposes it as best_estimator_, so no separate training step is needed. We evaluate this model on the same held-out test set we used for the baseline.
# Get the best estimator from the search
best_xgb_model = random_search.best_estimator_
# Make predictions on the test set
y_pred_tuned = best_xgb_model.predict(X_test)
# Calculate and print the tuned model's MSE
mse_tuned = mean_squared_error(y_test, y_pred_tuned)
print(f"Tuned Model MSE: {mse_tuned:.4f}")
print(f"Improvement over Baseline: {mse_baseline - mse_tuned:.4f}")
If tuning helped, the tuned model's MSE will be lower than the baseline model's MSE. That outcome is likely here but not guaranteed: the search samples only 50 of the possible combinations, and an improvement in cross-validation score does not always carry over to the test set.
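An absolute difference in MSE can be hard to interpret, so it may help to report the relative reduction and the RMSE as well. The California Housing target is expressed in units of $100,000, which means an RMSE of 0.5 corresponds to roughly $50,000. The helper below is a hypothetical convenience function, and the inputs are placeholders chosen for easy arithmetic, not results from this tutorial.

```python
from math import sqrt


def summarize_improvement(mse_before, mse_after):
    """Return both RMSE values and the relative MSE reduction in percent."""
    return {
        "rmse_before": sqrt(mse_before),
        "rmse_after": sqrt(mse_after),
        "mse_reduction_pct": 100.0 * (mse_before - mse_after) / mse_before,
    }


# Placeholder inputs for illustration only (not real results);
# in the tutorial, pass mse_baseline and mse_tuned instead.
summary = summarize_improvement(mse_before=0.25, mse_after=0.16)

print(f"RMSE before: {summary['rmse_before']:.3f}")
print(f"RMSE after:  {summary['rmse_after']:.3f}")
print(f"MSE reduction: {summary['mse_reduction_pct']:.1f}%")
```

A negative reduction means the tuned model did worse than the baseline on the test set.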
A simple chart can effectively illustrate the impact of our optimization efforts. Let's compare the Mean Squared Error of the baseline model against the tuned model.
[Figure: bar chart of Mean Squared Error for the baseline model and the tuned model]
The reduction in Mean Squared Error demonstrates the value of systematic hyperparameter tuning.
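One way to draw such a comparison is a two-bar chart with Matplotlib. This is a minimal sketch, assuming Matplotlib is installed; the function name, colors, and output file name are illustrative choices rather than anything prescribed by the tutorial.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt


def plot_mse_comparison(mse_baseline, mse_tuned, output_path="mse_comparison.png"):
    """Save a bar chart comparing baseline and tuned MSE; return the file path."""
    labels = ["Baseline", "Tuned"]
    values = [mse_baseline, mse_tuned]

    fig, ax = plt.subplots(figsize=(5, 4))
    bars = ax.bar(labels, values, color=["#9e9e9e", "#1f77b4"])

    # Write the numeric value above each bar
    for bar, value in zip(bars, values):
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            value,
            f"{value:.4f}",
            ha="center",
            va="bottom",
        )

    ax.set_ylabel("Mean Squared Error")
    ax.set_title("Baseline vs. tuned model")
    fig.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)
    return Path(output_path)


# Usage with the variables computed earlier in this section:
# plot_mse_comparison(mse_baseline, mse_tuned)
```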
In this hands-on exercise, you followed a structured process to optimize a gradient boosting model. You began by establishing a performance baseline, used RandomizedSearchCV to efficiently explore the hyperparameter space, and concluded by training and evaluating a final model with the optimized settings.
This iterative process of establishing a baseline, searching for better parameters, and evaluating the result is a fundamental workflow in applied machine learning. While we used RandomizedSearchCV, you could refine the search further by using the results to inform a more focused GridSearchCV around the most promising parameter values.
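As a rough sketch of that refinement step, the helper below builds a small grid of neighbouring values around each best parameter. The function `focused_grid`, the step sizes, and the example best parameters are all hypothetical; in practice you would start from `random_search.best_params_` and choose steps that suit each parameter.

```python
def focused_grid(best_params, steps):
    """Build a small grid of neighbours around each best value.

    steps maps a parameter name to (step_size, lower_bound, upper_bound).
    Parameters not listed in steps are kept fixed at their best value.
    """
    grid = {}
    for name, best_value in best_params.items():
        if name not in steps:
            grid[name] = [best_value]
            continue
        step, low, high = steps[name]
        candidates = {best_value - step, best_value, best_value + step}
        # Drop values outside the valid range; round to hide float noise
        grid[name] = sorted(round(v, 6) for v in candidates if low <= v <= high)
    return grid


# Hypothetical best parameters; in practice use random_search.best_params_
example_best = {
    "max_depth": 5,
    "learning_rate": 0.05,
    "subsample": 1.0,
    "n_estimators": 300,
}
example_steps = {
    "max_depth": (1, 1, 12),
    "learning_rate": (0.02, 0.001, 0.5),
    "subsample": (0.1, 0.1, 1.0),
}

narrow_grid = focused_grid(example_best, example_steps)
print(narrow_grid)
```

The resulting dictionary has the shape GridSearchCV expects for its `param_grid` argument. Keeping the grid to two or three values per parameter keeps the exhaustive search affordable.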
For even more complex optimization tasks, you might explore advanced techniques like Bayesian Optimization, which can often find better hyperparameters in fewer iterations. Libraries like Hyperopt and Optuna provide powerful tools for implementing these advanced strategies.