LightGBM and CatBoost each bring their own optimizations to gradient boosting. In this hands-on section, we train regression models with both libraries on a dataset containing a mix of numerical and categorical features, then compare their performance, training time, and ease of use, with particular attention to how each library manages categorical data.
First, ensure you have LightGBM and CatBoost installed in your Python environment. You can install them using pip:
pip install lightgbm catboost scikit-learn pandas
We will use the Boston Housing dataset, a classic regression benchmark with 13 numerical features describing Boston-area neighborhoods and a target of median home value (MEDV). Because it contains no categorical columns out of the box, we will discretize two of its numerical features into bins to simulate categorical variables, which lets us exercise each library's categorical handling.
Let's begin by loading and preparing the data. The following code loads the dataset, creates the categorical features, selects a mix of numerical and categorical columns, and splits the data into training and testing sets. Notice that we intentionally keep the categorical columns CRIM_cat and AGE_cat as pandas category columns rather than one-hot encoding them.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import time
# Load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = pd.read_csv(url, header=None, sep=r'\s+')  # the file is whitespace-delimited, not comma-separated
# Assign column names based on dataset description
column_names = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV'
]
df.columns = column_names
# Define features and target
X = df.drop('MEDV', axis=1)
y = df['MEDV']
# For demonstration, let's create some categorical features
# We'll discretize 'CRIM' and 'AGE' to simulate categorical variables
X['CRIM_cat'] = pd.cut(X['CRIM'], bins=[0, 1, 5, 20, 100], labels=['Very Low', 'Low', 'Medium', 'High'])
X['AGE_cat'] = pd.cut(X['AGE'], bins=[0, 25, 50, 75, 100], labels=['New', 'Modern', 'Old', 'Very Old'])
# Select a mix of numerical and categorical features
features_to_use = ['RM', 'LSTAT', 'PTRATIO', 'TAX', 'CRIM_cat', 'AGE_cat']
X = X[features_to_use].copy()  # copy to avoid chained-assignment warnings
# pd.cut already returns the category dtype; the explicit cast below documents the intent for LightGBM
X['CRIM_cat'] = X['CRIM_cat'].astype('category')
X['AGE_cat'] = X['AGE_cat'].astype('category')
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Data prepared. Training set shape:", X_train.shape)
LightGBM handles categorical features natively, but they must be identified as such. When using the scikit-learn API with a pandas DataFrame, the standard approach is the pandas category dtype, which LightGBM detects automatically by default. Our preprocessing step already handled this conversion.
Now, let's instantiate, train, and evaluate an LGBMRegressor. We will also measure the time it takes to train the model.
import lightgbm as lgb
# Initialize the LightGBM Regressor
lgbm = lgb.LGBMRegressor(random_state=42)
# Train the model
start_time = time.time()
lgbm.fit(X_train, y_train)
lgbm_training_time = time.time() - start_time
# Make predictions
y_pred_lgbm = lgbm.predict(X_test)
# Evaluate the model
rmse_lgbm = np.sqrt(mean_squared_error(y_test, y_pred_lgbm))
print(f"LightGBM Training Time: {lgbm_training_time:.4f} seconds")
print(f"LightGBM RMSE: {rmse_lgbm:.4f}")
This process should feel familiar: LGBMRegressor follows the scikit-learn estimator API. The main LightGBM-specific step is ensuring your categorical columns have the category dtype before fitting the model.
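If you prefer not to rely on automatic dtype detection, the scikit-learn wrapper also accepts an explicit list of categorical columns at fit time. The snippet below is a minimal sketch of this alternative; it reuses the training variables defined above, and you should confirm that your installed LightGBM version supports the categorical_feature argument of fit.
# Alternative: name the categorical columns explicitly at fit time
lgbm_explicit = lgb.LGBMRegressor(random_state=42)
lgbm_explicit.fit(X_train, y_train, categorical_feature=['CRIM_cat', 'AGE_cat'])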
One of CatBoost's most celebrated features is its seamless handling of categorical data. You do not need to perform any special encoding; you simply tell the model which columns are categorical.
Let's identify the categorical features by name and pass this information directly to the CatBoostRegressor.
import catboost as cb
# Identify categorical features by name
categorical_features = ['CRIM_cat', 'AGE_cat']
# Initialize the CatBoost Regressor
cat = cb.CatBoostRegressor(random_state=42,
                           cat_features=categorical_features,
                           verbose=0)  # verbose=0 suppresses per-iteration training output
# Train the model
start_time = time.time()
cat.fit(X_train, y_train)
catboost_training_time = time.time() - start_time
# Make predictions
y_pred_cat = cat.predict(X_test)
# Evaluate the model
rmse_cat = np.sqrt(mean_squared_error(y_test, y_pred_cat))
print(f"CatBoost Training Time: {catboost_training_time:.4f} seconds")
print(f"CatBoost RMSE: {rmse_cat:.4f}")
The setup for CatBoost is straightforward. By passing the list of categorical feature names to the cat_features parameter, we delegate the encoding work to the library, which applies its ordered target statistics and ordered boosting strategy internally.
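CatBoost also provides a Pool object that bundles the feature matrix, target, and categorical feature list together, which is convenient when the same data is reused for training and evaluation. The following is a minimal sketch assuming the variables defined above:
# Optional: wrap the data in Pool objects so the categorical metadata travels with the data
train_pool = cb.Pool(X_train, label=y_train, cat_features=categorical_features)
test_pool = cb.Pool(X_test, label=y_test, cat_features=categorical_features)
cat_pool_model = cb.CatBoostRegressor(random_state=42, verbose=0)
cat_pool_model.fit(train_pool)
y_pred_pool = cat_pool_model.predict(test_pool)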
With both models trained, we can now compare their root mean squared error (RMSE) and training duration. A lower RMSE indicates better predictive accuracy.
Comparison of Root Mean Squared Error (RMSE) and training time for LightGBM and CatBoost models with default parameters.
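To summarize the comparison in code, a small snippet like the one below tabulates the same quantities from the variables computed earlier:
# Summarize the results in a small comparison table
comparison = pd.DataFrame({
    'Model': ['LightGBM', 'CatBoost'],
    'RMSE': [rmse_lgbm, rmse_cat],
    'Training Time (s)': [lgbm_training_time, catboost_training_time]
})
print(comparison.to_string(index=False))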
The results highlight the typical trade-offs between these libraries. LightGBM is exceptionally fast, completing its training in a fraction of a second. CatBoost, while taking longer to train due to its more complex ordered boosting procedure, achieved a slightly lower RMSE in this case. The convenience of its automated categorical feature handling combined with strong default performance makes it a very compelling choice, especially when dealing with datasets that have many categorical variables.
In this practical, you have successfully built, trained, and evaluated models using both LightGBM and CatBoost. You've seen firsthand how LightGBM's speed and CatBoost's intelligent handling of categorical features translate from theory into practice.
This exercise provides a solid baseline for what these powerful libraries can do out of the box. However, their true potential is often realized through careful tuning. The next chapter will guide you through the process of hyperparameter optimization to further enhance the performance of your gradient boosting models.