Gradient boosting algorithms like LightGBM prioritize speed through techniques such as data sampling and feature bundling. In contrast, CatBoost addresses a distinct but equally significant challenge: the effective and native handling of categorical data. Developed by Yandex, CatBoost (Categorical Boosting) was designed from the ground up to address the shortcomings of traditional methods for incorporating non-numeric features into gradient boosting models.
Most machine learning algorithms, including gradient boosting, operate on numerical data. This presents a problem when your dataset contains categorical features like user_country, product_category, or browser_type. The standard practice is to preprocess these columns into a numeric format, but the most common techniques have notable drawbacks for tree-based models:
One-Hot Encoding (OHE): This method creates a new binary column for each unique category. While effective for features with low cardinality (few unique values), it becomes problematic for features with hundreds or thousands of categories. This "curse of dimensionality" creates extremely wide and sparse datasets, making it difficult for tree algorithms to find optimal splits and significantly increasing memory usage and training time.
Label Encoding: This method assigns a unique integer to each category (e.g., {'Chrome': 0, 'Firefox': 1, 'Safari': 2}). This is computationally efficient but introduces an artificial and misleading ordinal relationship. The model might incorrectly learn that Safari > Firefox, a comparison that is meaningless and can harm model performance.
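To make these drawbacks concrete, here is a minimal sketch using pandas and scikit-learn on a hypothetical browser column (the column name and values are illustrative, not from any particular dataset):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'browser': ['Chrome', 'Firefox', 'Safari', 'Chrome']})

# One-hot encoding: one binary column per unique category.
# With thousands of categories this becomes a very wide, sparse matrix.
one_hot = pd.get_dummies(df['browser'], prefix='browser')
print(one_hot.shape)  # (4, 3) here; (n_rows, n_categories) in general

# Label encoding: compact, but the integers imply an ordering
# (Safari=2 > Firefox=1 > Chrome=0) that has no real meaning.
labels = LabelEncoder().fit_transform(df['browser'])
print(labels)  # [0 1 2 0]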
CatBoost provides a sophisticated, built-in solution that avoids these issues.
A more advanced technique is target encoding (or mean encoding). Here, each category is replaced by a number representing the average value of the target variable for that category. For example, in a click-through prediction task, the category 'Firefox' might be replaced by the average click-through rate of all users with Firefox.
This approach is powerful because it directly encodes information about the relationship between the feature and the target. However, it comes with a major risk: target leakage. If you calculate the encoding for a data point using its own target value, you are leaking information from the target into the feature. The model can easily overfit by learning to associate the encoded value directly with the outcome, failing to generalize to new, unseen data.
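As an illustration of the risk (this is not how CatBoost encodes features), a naive mean target encoding computed over the whole training set might look like the sketch below; note how each row's own label flows into its encoded value:

import pandas as pd

df = pd.DataFrame({
    'browser': ['Chrome', 'Firefox', 'Chrome', 'Firefox', 'Safari'],
    'clicked': [1, 0, 1, 1, 0],  # binary target
})

# Naive target encoding: replace each category with the mean target
# computed over ALL rows, including the row being encoded.
category_means = df.groupby('browser')['clicked'].mean()
df['browser_encoded'] = df['browser'].map(category_means)
print(df)  # each row's own 'clicked' value leaks into 'browser_encoded'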
CatBoost implements a clever variation of target encoding called Ordered Target Statistics to prevent leakage. Instead of calculating the statistic over the entire dataset at once, it relies on a principle of ordering. The core idea is that for any given data point, its categorical features are encoded using target statistics calculated only from the data points that came before it in a randomly shuffled version of the dataset.
Let's break down the process:
1. Generate a random permutation of the training rows.
2. For each row, compute the target statistic for its categorical value using only the target values of rows that appear earlier in the permutation and share the same category.
3. Substitute this statistic for the categorical value and continue training as usual.
This sequential dependency ensures that a row's target value is never used to calculate its own feature encoding, thus preventing direct leakage. The diagram below illustrates this flow for a single permutation.
For each row in a randomly ordered dataset, the categorical feature encoding is computed using only the target values from previous rows with the same category. This prevents the model from using a sample's own target to define its features.
To make this process more stable, especially at the beginning of the dataset where there is little history, CatBoost introduces a prior value. The formula for the encoding looks something like this:
$$\text{EncodedValue} = \frac{\text{countInClass} + \text{prior}}{\text{totalCount} + 1}$$

A more general form with a weighting parameter $a$ and a prior $P$ is:

$$\hat{x}_i^k = \frac{\sum_{j=1}^{i-1} [x_j^k = x_i^k] \cdot y_j + a \cdot P}{\sum_{j=1}^{i-1} [x_j^k = x_i^k] + a}$$

Here, $\hat{x}_i^k$ is the new numerical feature for the $i$-th sample and $k$-th feature, and $[x_j^k = x_i^k]$ equals 1 when the $j$-th sample shares the $i$-th sample's categorical value and 0 otherwise. The formula uses the target values ($y_j$) of previous samples that have the same categorical value. To further improve robustness, CatBoost generates several independent random permutations of the data and averages the results.
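The following sketch implements this formula directly for a single categorical column, with averaging over a few random permutations. It is a simplified illustration of the idea, not CatBoost's internal implementation; the function name and parameters are hypothetical.

import numpy as np

def ordered_target_stats(categories, targets, prior, a=1.0, n_permutations=4, seed=0):
    """Encode one categorical column with ordered target statistics.

    For each row in a random permutation, the encoding uses only the
    targets of earlier rows with the same category, smoothed toward a
    prior P with weight a. Results are averaged over permutations.
    """
    rng = np.random.default_rng(seed)
    n = len(categories)
    encoded = np.zeros(n)

    for _ in range(n_permutations):
        perm = rng.permutation(n)
        count = {}   # earlier rows seen with this category
        total = {}   # sum of their target values
        for i in perm:
            c = categories[i]
            # Only "history" (rows earlier in this permutation) is used,
            # so row i's own target never influences its own encoding.
            encoded[i] += (total.get(c, 0.0) + a * prior) / (count.get(c, 0) + a)
            count[c] = count.get(c, 0) + 1
            total[c] = total.get(c, 0.0) + targets[i]

    return encoded / n_permutations

# The prior is commonly set to the global mean of the target.
cats = ['Chrome', 'Firefox', 'Chrome', 'Safari', 'Firefox']
y = [1, 0, 1, 0, 1]
print(ordered_target_stats(cats, y, prior=np.mean(y)))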
Another powerful capability of CatBoost is its ability to automatically generate combinations of categorical features. For instance, it might discover that the combination of browser='Chrome' and device_type='Mobile' is a strong predictor. Instead of requiring you to perform this feature engineering manually, CatBoost greedily combines categorical features at each tree split, capturing complex interactions that might otherwise be missed.
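If you want to control this behavior, CatBoost exposes a max_ctr_complexity training parameter that limits how many categorical features may be combined. The snippet below is a sketch with hypothetical browser and device_type columns:

from catboost import CatBoostClassifier

# Hypothetical data: column 0 is browser, column 1 is device_type.
X = [['Chrome', 'Mobile'], ['Firefox', 'Desktop'],
     ['Chrome', 'Desktop'], ['Firefox', 'Mobile']]
y = [1, 0, 0, 1]

# max_ctr_complexity limits how many categorical features CatBoost
# is allowed to combine into a single interaction feature.
model = CatBoostClassifier(iterations=10,
                           max_ctr_complexity=2,
                           verbose=False)
model.fit(X, y, cat_features=[0, 1])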
The primary advantage of CatBoost's approach is a significant simplification of the data preprocessing pipeline. You can often pass columns with string or object data types directly to the model by specifying them in the cat_features parameter.
from catboost import CatBoostClassifier

# Sample data: column 0 is a categorical feature (country),
# column 1 is an ordinary numeric feature
train_data = [['Germany', 1], ['USA', 1], ['Germany', 0]]
train_labels = [1, 1, 0]

# Index of the categorical column in the training data
categorical_features_indices = [0]

# Initialize and train the model
model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(train_data,
          train_labels,
          cat_features=categorical_features_indices)

# Make a prediction for a single new sample
print(model.predict(['USA', 0]))
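In practice you will often hold your data in a pandas DataFrame, in which case cat_features can also be given as a list of column names. A small sketch with a hypothetical country column:

import pandas as pd
from catboost import CatBoostClassifier

train_df = pd.DataFrame({'country': ['Germany', 'USA', 'Germany'],
                         'visits': [1, 1, 0]})
train_labels = [1, 1, 0]

model = CatBoostClassifier(iterations=10, verbose=False)
# With a DataFrame, categorical columns can be named directly.
model.fit(train_df, train_labels, cat_features=['country'])

print(model.predict(pd.DataFrame({'country': ['USA'], 'visits': [0]})))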
By automating one of the most tedious and error-prone parts of model building for tabular data, CatBoost allows you to focus more on model tuning and evaluation. This built-in intelligence for handling categorical features is what sets it apart from other gradient boosting libraries and makes it an extremely effective tool for a wide range of classification and regression problems.