Alright, let's put the theory of CatBoost into practice. We've discussed how CatBoost uniquely handles categorical features using techniques like Ordered Target Statistics and Ordered Boosting, aiming to improve accuracy and simplify preprocessing. Now, we'll walk through building, training, and evaluating a CatBoost model using its Python library, paying special attention to its categorical feature capabilities.
First, ensure you have the necessary libraries installed. You'll primarily need catboost, pandas, and scikit-learn; plotly is also used for the visualization at the end. If you haven't installed them yet, you can do so using pip:
pip install catboost pandas scikit-learn plotly
We'll use a dataset that contains a mix of numerical and categorical features. A common example is the "Adult" census income dataset, where the task is to predict whether an individual's income exceeds $50K per year.
Let's load the data using pandas and perform minimal preprocessing. CatBoost can internally handle missing values and categorical features, but we need to identify which columns are categorical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from catboost import CatBoostClassifier, Pool
import numpy as np
import plotly.graph_objects as go
# Load the Adult dataset directly from the UCI repository
# (adjust the URL, or point read_csv at a local copy if you have one)
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income'
]

try:
    data = pd.read_csv(url, header=None, names=column_names,
                       sep=r',\s*', engine='python', na_values='?')
except Exception as e:
    print(f"An error occurred while loading data: {e}")
    exit()
# Drop rows with missing values for simplicity in this example
# Note: CatBoost can handle NaN directly, but we simplify here.
data.dropna(inplace=True)
# Define target variable and features
X = data.drop('income', axis=1)
y = data['income'].apply(lambda x: 1 if x == '>50K' else 0) # Convert target to binary
# Identify categorical features by their column indices
# (in this dataset, every categorical column has the object dtype)
categorical_features_indices = np.where(X.dtypes == object)[0]
# Alternatively, provide column names:
# categorical_features_names = X.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical feature indices: {categorical_features_indices}")
# print(f"Categorical feature names: {categorical_features_names}")
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
Notice that we didn't perform one-hot encoding or label encoding on the categorical features like workclass, education, marital-status, and so on. We simply identified their column indices. This is where CatBoost shines.
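If you want to confirm that these columns are still raw strings at this point, a quick inspection makes it obvious. The snippet below is only a sanity check and is not required for training:

# Peek at a few categorical columns: they remain untouched strings
print(X_train[['workclass', 'education', 'marital-status']].head())

# Count how many columns are object (categorical) versus numeric
print(X_train.dtypes.value_counts())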
Now, let's instantiate and train a CatBoostClassifier. The essential step is to inform the model which features are categorical using the cat_features parameter.
# Instantiate the CatBoostClassifier
model = CatBoostClassifier(
    iterations=500,           # Number of trees to build
    learning_rate=0.05,       # Step size shrinkage
    depth=6,                  # Depth of the trees (Oblivious Trees)
    l2_leaf_reg=3,            # L2 regularization coefficient
    loss_function='Logloss',  # Objective function
    eval_metric='AUC',        # Metric for evaluation during training
    random_seed=42,           # For reproducibility
    verbose=100               # Print progress every 100 iterations
)

# Train the model
# Pass categorical feature indices directly to the fit method
model.fit(
    X_train, y_train,
    cat_features=categorical_features_indices,
    eval_set=(X_test, y_test),  # Evaluation set for early stopping and metrics
    early_stopping_rounds=50    # Stop if eval_metric doesn't improve for 50 rounds
)
# Make predictions on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1] # Get probability for the positive class
y_pred_class = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_class)
auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nModel Evaluation:")
print(f"Test Set Accuracy: {accuracy:.4f}")
print(f"Test Set AUC: {auc:.4f}")
In the fit method, we provided X_train, y_train, and crucially, the cat_features argument pointing to our categorical columns. We also included an eval_set comprising the test data. This allows CatBoost to monitor performance on unseen data during training (using the specified eval_metric, in this case AUC) and apply early_stopping_rounds to prevent overfitting and find a potentially better number of iterations automatically. The verbose parameter controls how often training progress is printed.
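When early stopping triggers, the fitted model records which iteration achieved the best evaluation metric. You can query this afterwards, for example:

# Iteration with the best eval_metric value on the eval_set
print(f"Best iteration: {model.get_best_iteration()}")

# Best metric values recorded during training, per dataset and metric
print(f"Best scores: {model.get_best_score()}")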
CatBoost provides a Pool class, which is an optimized data structure for holding the dataset, including features, labels, and metadata like categorical feature indices and weights. Using Pool can sometimes offer performance benefits, especially for larger datasets or repeated experiments.
# Create Pool objects for training and evaluation data
train_pool = Pool(
    data=X_train,
    label=y_train,
    cat_features=categorical_features_indices
)

eval_pool = Pool(
    data=X_test,
    label=y_test,
    cat_features=categorical_features_indices
)

# Instantiate a new model (optional, or retrain the existing one)
model_pooled = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100
)

# Train using the Pool objects
model_pooled.fit(
    train_pool,
    eval_set=eval_pool,
    early_stopping_rounds=50
)
# Predictions and evaluations are similar
y_pred_proba_pooled = model_pooled.predict_proba(eval_pool)[:, 1]
auc_pooled = roc_auc_score(y_test, y_pred_proba_pooled)
print(f"\nAUC using Pool: {auc_pooled:.4f}")
The results should be identical or very similar to the previous run, but using Pool encapsulates the data nicely.
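Another benefit of wrapping the data in a Pool is that it can be reused across experiments. As a sketch (reusing the train_pool defined above), CatBoost's built-in cv function accepts a Pool directly for cross-validation; the parameter values here are illustrative only:

from catboost import cv

# Parameters mirroring the classifier settings above
cv_params = {
    'iterations': 200,
    'learning_rate': 0.05,
    'depth': 6,
    'loss_function': 'Logloss',
    'eval_metric': 'AUC',
    'random_seed': 42,
}

# 5-fold cross-validation on the training Pool
cv_results = cv(pool=train_pool, params=cv_params, fold_count=5, verbose=False)

# cv_results is a DataFrame of per-iteration metrics; inspect the final rows
print(cv_results.tail())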
CatBoost provides straightforward access to feature importance scores, which estimate the contribution of each feature to the model's predictions.
# Get feature importance scores
feature_importances = model.get_feature_importance(train_pool) # Or pass data directly
feature_names = X_train.columns
# Create a Pandas Series for easier handling
importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
importance_df = importance_df.sort_values(by='importance', ascending=False)
print("\nFeature Importances:")
print(importance_df)
# Visualize feature importances using Plotly
fig = go.Figure(go.Bar(
    x=importance_df['importance'],
    y=importance_df['feature'],
    orientation='h',
    marker_color='#228be6'  # Blue from palette
))

fig.update_layout(
    title='CatBoost Feature Importances',
    yaxis_title='Feature',
    xaxis_title='Importance Score',
    yaxis={'categoryorder': 'total ascending'},  # Show most important at top
    height=500,
    margin=dict(l=150, r=20, t=50, b=50)  # Adjust margins for feature names
)
# fig.show() # Use in interactive environments
# To display in static docs, generate JSON:
# print(fig.to_json())
Feature importance scores calculated by CatBoost. Categorical features like 'relationship', 'marital-status', and 'occupation' appear alongside numerical features, demonstrating CatBoost's integrated handling. Note: Values are illustrative.
The plot helps visualize which features CatBoost found most predictive. Note how both numerical (capital-gain, age) and categorical (relationship, marital-status) features contribute significantly, without requiring manual transformations for the latter group.
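The default importance type is PredictionValuesChange. If you want importances tied more directly to the loss, or per-sample attributions, CatBoost can also compute LossFunctionChange and SHAP values. The sketch below assumes the train_pool defined earlier:

# Loss-based importance requires a Pool that includes labels
loss_change_importance = model.get_feature_importance(
    data=train_pool, type='LossFunctionChange'
)
print(loss_change_importance)

# SHAP values: one row per sample, one column per feature plus a bias term
shap_values = model.get_feature_importance(data=train_pool, type='ShapValues')
print(shap_values.shape)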
This example demonstrates the basic implementation of CatBoost, highlighting its core strength in handling categorical data seamlessly. Key takeaways from this practical session include:

- Specify categorical features via the cat_features parameter during training or Pool creation. CatBoost handles the encoding internally using methods like Ordered TS.
- Using eval_set and early_stopping_rounds is important for preventing overfitting and optimizing the number of boosting iterations.

Achieving optimal performance often requires careful hyperparameter tuning, exploring parameters like learning_rate, depth, l2_leaf_reg, and CatBoost-specific parameters related to categorical handling (e.g., one_hot_max_size). We will cover systematic hyperparameter optimization techniques in Chapter 8.
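As a quick illustration of where such parameters go, the sketch below sets one_hot_max_size so that low-cardinality categorical features (here, those with at most 10 distinct values) are one-hot encoded rather than target-encoded. The chosen values are arbitrary and only meant to show the mechanics, not a tuned configuration:

# Sketch: categorical features with <= 10 unique values are one-hot encoded;
# higher-cardinality ones still use CatBoost's ordered target statistics.
model_tuned = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    one_hot_max_size=10,  # threshold for switching to one-hot encoding
    loss_function='Logloss',
    eval_metric='AUC',
    random_seed=42,
    verbose=100
)

model_tuned.fit(
    train_pool,
    eval_set=eval_pool,
    early_stopping_rounds=50
)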
You are now equipped to apply CatBoost to your own datasets, especially those rich in categorical information, benefiting from its specialized algorithms and ease of use.