Alright, let's put the theory of CatBoost into practice. We've discussed how CatBoost uniquely handles categorical features using techniques like Ordered Target Statistics and Ordered Boosting, aiming to improve accuracy and simplify preprocessing. Now, we'll walk through building, training, and evaluating a CatBoost model using its Python library, paying special attention to its categorical feature capabilities.
First, ensure you have the necessary libraries installed. You'll primarily need catboost, pandas, and scikit-learn; plotly is also used for the visualization at the end. If you haven't installed them yet, you can do so using pip:
pip install catboost pandas scikit-learn plotly
We'll use a dataset that contains a mix of numerical and categorical features. A common example is the "Adult" census income dataset, where the task is to predict whether an individual's income exceeds $50K per year.
Let's load the data using pandas and perform minimal preprocessing. CatBoost can internally handle missing values and categorical features, but we need to identify which columns are categorical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from catboost import CatBoostClassifier, Pool
import numpy as np
import plotly.graph_objects as go
# Load the Adult dataset directly from the UCI repository
# (adjust the URL, or point read_csv at a local copy if you have one)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income'
]

try:
    data = pd.read_csv(url, header=None, names=column_names,
                       sep=r',\s*', engine='python', na_values='?')
except Exception as e:
    print(f"An error occurred while loading data: {e}")
    exit()
# Drop rows with missing values for simplicity in this example
# Note: CatBoost can handle NaN directly, but we simplify here.
data.dropna(inplace=True)
# Define target variable and features
X = data.drop('income', axis=1)
y = data['income'].apply(lambda x: 1 if x == '>50K' else 0) # Convert target to binary
# Identify categorical features by their column indices
# (in this dataset, every categorical column has the object dtype)
categorical_features_indices = np.where(X.dtypes == object)[0]
# Alternatively, provide column names:
# categorical_features_names = X.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical feature indices: {categorical_features_indices}")
# print(f"Categorical feature names: {categorical_features_names}")
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
Notice that we didn't perform one-hot encoding or label encoding on the categorical features like workclass, education, marital-status, and so on. We simply identified their column indices. This is where CatBoost shines.
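If you want to confirm that these columns are still raw strings at this point, a quick inspection makes it obvious. The snippet below is only a sanity check and is not required for training:

# Peek at a few categorical columns: they remain untouched strings
print(X_train[['workclass', 'education', 'marital-status']].head())

# Count how many columns are object (categorical) versus numeric
print(X_train.dtypes.value_counts())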
Now, let's instantiate and train a CatBoostClassifier. The essential step is to inform the model which features are categorical using the cat_features parameter.
# Instantiate the CatBoostClassifier
model = CatBoostClassifier(
    iterations=500,           # Number of trees to build
    learning_rate=0.05,       # Step size shrinkage
    depth=6,                  # Depth of the trees (Oblivious Trees)
    l2_leaf_reg=3,            # L2 regularization coefficient
    loss_function='Logloss',  # Objective function
    eval_metric='AUC',        # Metric for evaluation during training
    random_seed=42,           # For reproducibility
    verbose=100               # Print progress every 100 iterations
)

# Train the model
# Pass categorical feature indices directly to the fit method
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_test, y_test),  # Evaluation set for early stopping and metrics
    early_stopping_rounds=50    # Stop if eval_metric doesn't improve for 50 rounds
)
# Make predictions on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1] # Get probability for the positive class
y_pred_class = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_class)
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nModel Evaluation:")
print(f"Test Set Accuracy: {accuracy:.4f}")
print(f"Test Set AUC: {auc:.4f}")
In the fit method, we provided X_train, y_train, and crucially, the cat_features argument pointing to our categorical columns. We also included an eval_set comprising the test data. This allows CatBoost to monitor performance on unseen data during training (using the specified eval_metric, in this case AUC) and apply early_stopping_rounds to prevent overfitting and find a potentially better number of iterations automatically. The verbose parameter controls how often training progress is printed.
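When early stopping triggers, the fitted model records which iteration achieved the best evaluation metric. You can query this afterwards, for example:

# Iteration with the best eval_metric value on the eval_set
print(f"Best iteration: {model.get_best_iteration()}")

# Best metric values recorded during training, per dataset and metric
print(f"Best scores: {model.get_best_score()}")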
CatBoost provides a Pool class, which is an optimized data structure for holding the dataset, including features, labels, and metadata like categorical feature indices and weights. Using Pool can sometimes offer performance benefits, especially for larger datasets or repeated experiments.
# Create Pool objects for training and evaluation data
train_pool = Pool(
    data=X_train,
    label=y_train,
    cat_features=categorical_features_indices
)

eval_pool = Pool(
    data=X_test,
    label=y_test,
    cat_features=categorical_features_indices
)

# Instantiate a new model (optional, or retrain the existing one)
model_pooled = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100
)

# Train using the Pool objects
model_pooled.fit(
    train_pool,
    eval_set=eval_pool,
    early_stopping_rounds=50
)
# Predictions and evaluations are similar
y_pred_proba_pooled = model_pooled.predict_proba(eval_pool)[:, 1]
auc_pooled = roc_auc_score(y_test, y_pred_proba_pooled)
print(f"\nAUC using Pool: {auc_pooled:.4f}")
The results should be identical or very similar to the previous run, but using Pool encapsulates the data nicely.
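Another benefit of wrapping the data in a Pool is that it can be reused across experiments. As a sketch (reusing the train_pool defined above), CatBoost's built-in cv function accepts a Pool directly for cross-validation; the parameter values here are illustrative only:

from catboost import cv

# Parameters mirroring the classifier settings above
cv_params = {
    'iterations': 200,
    'learning_rate': 0.05,
    'depth': 6,
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'random_seed': 42,
}

# 5-fold cross-validation on the training Pool
cv_results = cv(pool=train_pool, params=cv_params, fold_count=5, verbose=False)

# cv_results is a DataFrame of per-iteration metrics; inspect the final rows
print(cv_results.tail())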
CatBoost provides straightforward access to feature importance scores, which estimate the contribution of each feature to the model's predictions.
# Get feature importance scores
feature_importances = model.get_feature_importance(train_pool) # Or pass data directly
feature_names = X_train.columns
# Create a Pandas Series for easier handling
importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
importance_df = importance_df.sort_values(by='importance', ascending=False)
print("\nFeature Importances:")
print(importance_df)
# Visualize feature importances using Plotly
fig = go.Figure(go.Bar(
    x=importance_df['importance'],
    y=importance_df['feature'],
    orientation='h',
    marker_color='#228be6'  # Blue from palette
))

fig.update_layout(
    title='CatBoost Feature Importances',
    yaxis_title='Feature',
    xaxis_title='Importance Score',
    yaxis={'categoryorder': 'total ascending'},  # Show most important at top
    height=500,
    margin=dict(l=150, r=20, t=50, b=50)  # Adjust margins for feature names
)
# fig.show() # Use in interactive environments
# To display in static docs, generate JSON:
# print(fig.to_json())
Feature importance scores calculated by CatBoost. Categorical features like 'relationship', 'marital-status', and 'occupation' appear alongside numerical features, demonstrating CatBoost's integrated handling. Note: Values are illustrative.
The plot helps visualize which features CatBoost found most predictive. Note how both numerical (capital-gain, age) and categorical (relationship, marital-status) features contribute significantly, without requiring manual transformations for the latter group.
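The default importance type is PredictionValuesChange. If you want importances tied more directly to the loss, or per-sample attributions, CatBoost can also compute LossFunctionChange and SHAP values. The sketch below assumes the train_pool defined earlier:

# Loss-based importance requires a Pool that includes labels
loss_change_importance = model.get_feature_importance(
    data=train_pool, type='LossFunctionChange'
)
print(loss_change_importance)

# SHAP values: one row per sample, one column per feature plus a bias term
shap_values = model.get_feature_importance(data=train_pool, type='ShapValues')
print(shap_values.shape)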
This example demonstrates the basic implementation of CatBoost, highlighting its core strength in handling categorical data seamlessly. Key takeaways from this practical session include:

- Specify categorical features via the cat_features parameter during training or Pool creation. CatBoost handles the encoding internally using methods like Ordered TS.
- Using eval_set and early_stopping_rounds is important for preventing overfitting and optimizing the number of boosting iterations.

Achieving optimal performance often requires careful hyperparameter tuning, exploring parameters like learning_rate, depth, l2_leaf_reg, and CatBoost-specific parameters related to categorical handling (e.g., one_hot_max_size). We will cover systematic hyperparameter optimization techniques in Chapter 8.
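As a quick illustration of where such parameters go, the sketch below sets one_hot_max_size so that low-cardinality categorical features (here, those with at most 10 distinct values) are one-hot encoded rather than target-encoded. The chosen values are arbitrary and only meant to show the mechanics, not a tuned configuration:

# Sketch: categorical features with <= 10 unique values are one-hot encoded;
# higher-cardinality ones still use CatBoost's ordered target statistics.
model_tuned = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    one_hot_max_size=10,  # threshold for switching to one-hot encoding
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100
)

model_tuned.fit(
    train_pool,
    eval_set=eval_pool,
    early_stopping_rounds=50
)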
You are now equipped to apply CatBoost to your own datasets, especially those rich in categorical information, benefiting from its specialized algorithms and ease of use.