Having explored CatBoost's innovative approaches like Ordered Target Statistics and Oblivious Trees, we now turn to controlling its behavior through its API. Understanding the available parameters is essential for configuring CatBoost models effectively, particularly when dealing with datasets rich in categorical information or requiring performance optimization.
The primary interfaces are the CatBoostClassifier and CatBoostRegressor classes from the catboost Python library. While many parameters overlap with other gradient boosting implementations, CatBoost offers unique options specifically designed for its algorithms.
These parameters control the fundamental aspects of the boosting process, similar to those found in libraries like Scikit-learn, XGBoost, and LightGBM. A short configuration sketch follows the list.

- iterations (or n_estimators): Specifies the maximum number of boosting rounds or trees to build. This is one of the most significant parameters affecting model complexity and training time. Its optimal value is often determined using early stopping. Default: 1000.
- learning_rate: Controls the step size at each iteration, shrinking the contribution of each new tree. Smaller values (e.g., 0.01 to 0.1) generally require more iterations but can lead to better generalization by preventing overfitting. Default: usually auto-detected based on dataset size and iteration count, but often around 0.03.
- depth: Defines the depth of the base learners (trees). CatBoost uses oblivious trees, meaning the same splitting criterion (feature and threshold) is applied to all nodes at the same level. This structure contributes to faster prediction times and some inherent regularization. Typical values range from 4 to 10. Default: 6.
- l2_leaf_reg: Specifies the coefficient for the L2 regularization term on leaf values. This penalty discourages overly large weights in the leaves, helping to prevent overfitting. Corresponds to lambda in XGBoost's objective function. Default: 3.0.
- loss_function: Determines the objective function to be optimized during training. Common choices include:
  - Regression: 'RMSE' (Root Mean Squared Error, the default), 'MAE' (Mean Absolute Error), 'Quantile', 'LogLinQuantile', 'Poisson', 'MAPE' (Mean Absolute Percentage Error).
  - Classification: 'Logloss' (binary classification, the default), 'MultiClass' (multi-class classification), 'CrossEntropy'.
- eval_metric: Specifies the metric(s) used for evaluating model performance during training, particularly for early stopping. Examples include 'RMSE', 'MAE', 'Logloss', 'AUC', 'Accuracy', 'F1', 'Precision', 'Recall', 'MultiClass', 'NDCG', 'MAP'. The default often matches the loss_function.
- random_seed (or random_state): Sets the seed for random number generation, ensuring reproducibility for operations like data shuffling, feature selection (if applicable), and bootstrapping.
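To make the core options concrete, here is a minimal sketch that configures a CatBoostRegressor using only the parameters above. The toy arrays and the quantile objective are illustrative assumptions, not tuned recommendations.

from catboost import CatBoostRegressor

# Placeholder data purely for illustration
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y = [10.0, 20.0, 30.0, 40.0]

reg = CatBoostRegressor(
    iterations=500,                       # maximum number of trees
    learning_rate=0.05,                   # shrink each tree's contribution
    depth=6,                              # depth of the oblivious trees
    l2_leaf_reg=3.0,                      # L2 penalty on leaf values
    loss_function='Quantile:alpha=0.9',   # optimize the 90th percentile
    random_seed=42,                       # reproducibility
    verbose=False
)
reg.fit(X, y)
print(reg.predict([[2.0, 3.0]]))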
These are CatBoost's signature parameters, enabling its specialized handling of categorical data. A short sketch combining them follows the list.

- cat_features: This is arguably the most important parameter for leveraging CatBoost's strengths. It accepts a list of indices or names of the columns that should be treated as categorical. Crucially, you should not preprocess these features (e.g., via one-hot encoding) yourself if you list them here. CatBoost applies its internal methods (like Ordered TS) to these designated features. If left unspecified (None), CatBoost may attempt to auto-detect categorical features, but explicitly providing them is recommended for clarity and control. Default: None.
- one_hot_max_size: Sets a threshold for the number of unique values in a categorical feature. If a feature's cardinality is less than or equal to this value, CatBoost applies one-hot encoding instead of its default target-based statistics. This can sometimes be faster for low-cardinality features but might not capture complex relationships as effectively as Ordered TS. Setting it to a small value (e.g., 2 or 10) forces most categorical features to be handled by CatBoost's specialized methods. Default: 2.
- max_ctr_complexity: Controls the maximum number of categorical features that can be combined simultaneously when generating feature combinations. Increasing this allows for higher-order interactions but increases computational cost and memory usage. Default: 4.
- has_time: Set to True if the input data has a temporal order (e.g., time series). This influences how Ordered TS is calculated to avoid look-ahead bias. Default: False.
- simple_ctr, combinations_ctr: These parameters allow fine-grained control over the specific types of counter statistics (CTRs) calculated for categorical features and their combinations. Modifying these is typically reserved for advanced users aiming to optimize specific scenarios.
These parameters help manage computational resources and training speed. A brief hardware configuration sketch follows the list.

- task_type: Specifies the hardware for training. Set to 'GPU' to leverage NVIDIA GPUs (if the library is installed with GPU support), significantly accelerating training on large datasets. Otherwise, use 'CPU'. Default: 'CPU'.
- devices: If task_type='GPU', this parameter specifies which GPU device(s) to use (e.g., '0' for the first GPU, '0:1' for the first two).
- thread_count: Controls the number of CPU threads used for computation when task_type='CPU'. Setting it to -1 typically uses all available cores. Default: -1.
- border_count: Determines the number of splits (bins) used for discretizing numerical features before building trees. Higher values allow for more precise splits but increase memory usage and computation time during the histogram construction phase. Values typically range from 32 to 255. Default: 254 (CPU), 128 (GPU).
- leaf_estimation_method: Method used to calculate values in the tree leaves. 'Newton' uses second-order derivatives (faster convergence), while 'Gradient' uses first-order derivatives (can be more stable for some objectives). Default: 'Newton'.
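As an illustration, the sketch below contrasts a CPU-oriented and a GPU-oriented configuration. The GPU variant assumes a CUDA-enabled CatBoost installation; the specific values are examples, not recommendations.

from catboost import CatBoostClassifier

# CPU-oriented configuration
cpu_model = CatBoostClassifier(
    task_type='CPU',
    thread_count=-1,                  # use all available CPU cores
    border_count=254,                 # finer numerical feature binning (CPU default)
    leaf_estimation_method='Newton',  # second-order leaf value estimation
    verbose=False
)

# GPU-oriented configuration (requires GPU support in the installed package)
gpu_model = CatBoostClassifier(
    task_type='GPU',
    devices='0',                      # train on the first GPU
    border_count=128,                 # coarser binning (GPU default)
    verbose=False
)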
These parameters provide additional mechanisms to prevent overfitting and manage the training process. A short sampling sketch and a fuller worked example follow the list.

- early_stopping_rounds: Activates early stopping. Training halts if the eval_metric on a specified validation set (eval_set) does not improve for this number of consecutive rounds. This helps find the optimal number of iterations automatically and prevents overfitting. Requires eval_set to be provided during the fit call. Default: None (disabled).
- use_best_model: If True (and early_stopping_rounds is active), the model state corresponding to the best score on the validation set is restored after training finishes. If False, the model from the final iteration is kept. Default: True when eval_set is provided, False otherwise.
- subsample: Controls the fraction of the training data sampled for building each tree (row subsampling). Values less than 1.0 introduce randomness and act as a regularizer, similar to Stochastic Gradient Boosting. Default: 1.0 (CPU), 0.8 (GPU).
- bootstrap_type: Defines the method for sampling observation weights:
  - 'Bayesian': Default. Uses Bayesian bootstrapping, assigning random weights drawn from an exponential distribution. This is intrinsically linked to the Ordered Boosting mechanism.
  - 'Bernoulli': Includes each observation independently with probability subsample (standard row subsampling). Used when subsample < 1.0.
  - 'MVS' (Minimum Variance Sampling): A more advanced sampling technique.
  - 'No': No sampling.
- colsample_bylevel: Controls the fraction of features randomly selected at each tree level when searching for the best split. Less commonly tuned in CatBoost compared to XGBoost/LightGBM due to oblivious trees. Default: 1.0.
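A minimal sketch of the sampling-related options; the values here are illustrative assumptions.

from catboost import CatBoostClassifier

# Row subsampling with the Bernoulli bootstrap acts as a regularizer
sampled_model = CatBoostClassifier(
    iterations=300,
    bootstrap_type='Bernoulli',  # include each row independently per tree
    subsample=0.8,               # use roughly 80% of the rows for each tree
    colsample_bylevel=0.9,       # consider 90% of the features at each level
    random_seed=7,
    verbose=False
)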
Here's how you might instantiate and configure CatBoostClassifier with some common parameters, including specifying categorical features and setting up early stopping:
import pandas as pd
from catboost import CatBoostClassifier, Pool
# Sample Data (replace with your actual data)
train_data = pd.DataFrame({
    'num_feature1': [1.2, 3.4, 0.5, 2.1, 4.5, 1.8],
    'num_feature2': [5, 2, 8, 6, 3, 7],
    'cat_feature1': ['A', 'B', 'A', 'C', 'B', 'A'],
    'cat_feature2': ['X', 'Y', 'Y', 'X', 'X', 'Y'],
    'target': [1, 0, 1, 0, 1, 0]
})
eval_data = pd.DataFrame({
    'num_feature1': [2.5, 0.8, 3.1],
    'num_feature2': [4, 9, 1],
    'cat_feature1': ['B', 'A', 'C'],
    'cat_feature2': ['Y', 'X', 'Y'],
    'target': [0, 1, 0]
})

# Identify categorical feature indices or names
categorical_features_indices = [2, 3]  # Indices of 'cat_feature1', 'cat_feature2'
# Or by names: categorical_features_names = ['cat_feature1', 'cat_feature2']

# Prepare data using CatBoost Pool for efficiency and explicit feature declaration
train_pool = Pool(data=train_data.drop('target', axis=1),
                  label=train_data['target'],
                  cat_features=categorical_features_indices)
eval_pool = Pool(data=eval_data.drop('target', axis=1),
                 label=eval_data['target'],
                 cat_features=categorical_features_indices)

# Configure the model
model = CatBoostClassifier(
    iterations=1000,            # Max trees
    learning_rate=0.05,         # Step size
    depth=6,                    # Tree depth (oblivious)
    l2_leaf_reg=3,              # L2 regularization
    loss_function='Logloss',    # Objective for binary classification
    eval_metric='AUC',          # Metric for evaluation and early stopping
    cat_features=categorical_features_indices,  # Explicitly pass categorical features
    early_stopping_rounds=50,   # Stop if AUC doesn't improve for 50 rounds
    random_seed=42,             # For reproducibility
    verbose=100,                # Print evaluation metric every 100 iterations
    # task_type='GPU',          # Uncomment to use GPU if available
    # devices='0'               # Specify GPU device if using GPU
)

# Train the model
model.fit(train_pool,
          eval_set=eval_pool,
          # verbose=False,      # Suppress iteration output if preferred
          plot=False            # Set to True to visualize training in Jupyter
          )
# Make predictions
# preds_proba = model.predict_proba(eval_pool)
# preds_class = model.predict(eval_pool)
print(f"Best score achieved: {model.get_best_score()['validation']['AUC']:.4f}")
print(f"Best iteration: {model.get_best_iteration()}")
Example configuration for CatBoostClassifier, highlighting essential parameters like iterations, learning_rate, depth, l2_leaf_reg, loss_function, eval_metric, cat_features, and early_stopping_rounds. Using Pool objects is recommended for optimal performance and clear feature specification.
When working with the CatBoost API:

- cat_features: This is fundamental to enabling CatBoost's specialized handling. Ensure these features are passed in their raw categorical format.
- early_stopping_rounds: This is the standard way to find a good number of iterations and prevent overfitting; it requires an eval_set.
- learning_rate, depth, and l2_leaf_reg are usually the first parameters to tune after setting up early stopping.
- task_type='GPU': If you have compatible hardware and large datasets, GPU training offers significant speed advantages.
- Adjusting one_hot_max_size or exploring CTR configurations (max_ctr_complexity, etc.) might yield improvements in specific cases, especially with very high cardinality features or complex interactions.

Mastering these parameters allows you to fine-tune CatBoost models, harnessing their power for robust performance, especially on datasets dominated by categorical variables. The next chapters will build upon this foundation, covering hyperparameter optimization strategies and advanced applications.