While optimizing a model using a standard loss function like LogLoss or Mean Squared Error is common practice, the ultimate measure of a model's success often hinges on domain-specific criteria or competition rules that aren't directly represented by these built-in objectives. The evaluation metrics used during training and validation serve as vital signposts, guiding hyperparameter tuning and signaling when to stop training via early stopping. When standard metrics like Accuracy, F1-score, AUC, or RMSE don't fully capture the desired performance characteristics, implementing custom evaluation metrics becomes necessary.
Unlike custom loss functions, which must provide gradient information to drive the boosting process, custom evaluation metrics are simpler. Their primary role is to quantify performance based on true labels and predictions at specific intervals (e.g., each boosting round). They don't influence the gradient calculations directly but are essential for monitoring progress and making informed decisions about the model's adequacy.
You might need a custom evaluation metric in several situations: a competition scores submissions with a rule the library does not ship (MAPE or MCC, for instance), a business KPI weights errors differently than a statistical default does, or you need to apply a specific threshold or transformation to predictions before scoring.
Most popular gradient boosting libraries (XGBoost, LightGBM, CatBoost) provide interfaces for incorporating custom evaluation logic. While the exact parameter names differ slightly, the core structure is generally consistent. A typical custom metric function in Python needs to:

1. Accept the model's predictions (often called y_pred) and the true labels, which are usually wrapped in a library-specific data structure such as XGBoost's DMatrix or LightGBM's Dataset, from which y_true can be extracted.
2. Compute the metric from y_true and y_pred.
3. Return the result, typically consisting of:
   - metric_name: a string identifying the metric (e.g., 'custom_rmse').
   - metric_value: the computed numerical score.
   - is_higher_better: a boolean indicating whether a higher score signifies better performance (e.g., True for AUC, False for RMSE). This flag is crucial for early stopping.

Let's look at how to implement this in practice.
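Before turning to each library, here is a minimal, library-agnostic sketch of that skeleton. The function name and the choice of mean absolute error are purely illustrative, and the three-element return tuple most closely matches LightGBM's convention; the sections below show each library's exact signature.

import numpy as np

def custom_mae(y_true: np.ndarray, y_pred: np.ndarray):
    """Illustrative skeleton: compute a scalar score and report how to read it."""
    metric_value = float(np.mean(np.abs(y_true - y_pred)))
    # (metric_name, metric_value, is_higher_better)
    return 'custom_mae', metric_value, False  # lower MAE is better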
XGBoost allows custom evaluation functions via the feval parameter in its xgb.train function, or via the eval_metric parameter when using the scikit-learn API (which can accept a callable). For xgb.train, the function signature should accept preds (predictions) and dtrain (an xgb.DMatrix object containing the true labels).
import numpy as np
import xgboost as xgb

# Define a custom evaluation metric: Mean Absolute Percentage Error (MAPE)
# Note: depending on the objective, XGBoost may pass raw scores before any
# transformation; adjust accordingly (e.g., apply a sigmoid for binary classification)
def xg_mape(preds: np.ndarray, dtrain: xgb.DMatrix):
    """Custom MAPE metric for XGBoost"""
    labels = dtrain.get_label()
    # Guard against division by zero or near-zero labels
    epsilon = 1e-6
    safe_labels = np.maximum(np.abs(labels), epsilon)
    mape = np.mean(np.abs((labels - preds) / safe_labels))
    return 'MAPE', mape

# --- Sample data (replace with your actual data) ---
X_train = np.random.rand(100, 5)
y_train = np.random.rand(100) * 10
X_eval = np.random.rand(50, 5)
y_eval = np.random.rand(50) * 10

dtrain = xgb.DMatrix(X_train, label=y_train)
deval = xgb.DMatrix(X_eval, label=y_eval)

# --- Training with the custom metric ---
params = {
    'objective': 'reg:squarederror',
    'eta': 0.1,
    'max_depth': 3,
    'disable_default_eval_metric': 1  # optional: report only the custom metric
}
evals = [(dtrain, 'train'), (deval, 'eval')]

# Pass the custom function to feval. With early stopping, XGBoost monitors the
# last metric on the last dataset in `evals`; set maximize=False because a
# lower MAPE is better.
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=evals,
    feval=xg_mape,
    maximize=False,
    early_stopping_rounds=10,
    verbose_eval=10  # Print evaluation results every 10 rounds
)

print(f"\nBest MAPE on validation set: {bst.best_score}")
Important note: when using feval with xgb.train, XGBoost assumes the metric should be minimized unless maximize=True is passed to xgb.train. Check the documentation for your specific version, especially regarding how early stopping behaves when standard and custom metrics are evaluated together; newer releases also provide a custom_metric argument intended to replace feval. For the scikit-learn interface (XGBRegressor/XGBClassifier), you pass the callable to eval_metric, supply eval_set when calling fit, and control early stopping via early_stopping_rounds or a dedicated early-stopping callback.
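As a quick sketch of that scikit-learn route, reusing the sample arrays from above and assuming XGBoost 1.6 or newer (where eval_metric and early_stopping_rounds are constructor arguments): the callable follows the scikit-learn convention metric(y_true, y_pred), and XGBoost typically treats it as a score to minimize; consult your version's docs or use xgboost.callback.EarlyStopping(maximize=...) to be explicit about direction.

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_percentage_error

# The sklearn wrapper accepts a metric(y_true, y_pred) callable directly
sk_model = XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    eval_metric=mean_absolute_percentage_error,  # scikit-learn style callable
    early_stopping_rounds=10
)
sk_model.fit(X_train, y_train, eval_set=[(X_eval, y_eval)], verbose=10)
print(f"Best validation score (MAPE): {sk_model.best_score}")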
LightGBM uses a very similar mechanism with the feval parameter in lgb.train. The function signature expects preds and the lgb.Dataset being evaluated (conventionally named train_data), from which the labels can be retrieved.
import numpy as np
import lightgbm as lgb
from sklearn.metrics import matthews_corrcoef  # Example metric

# Define a custom evaluation metric: Matthews Correlation Coefficient (MCC)
# Assumes a built-in binary objective, so `preds` are probabilities
def lgbm_mcc(preds: np.ndarray, train_data: lgb.Dataset):
    """Custom MCC metric for LightGBM"""
    labels = train_data.get_label()
    # Convert probabilities to binary predictions (threshold at 0.5)
    pred_labels = (preds > 0.5).astype(int)
    mcc = matthews_corrcoef(labels, pred_labels)
    # Return format: (metric_name, value, is_higher_better)
    return 'MCC', mcc, True

# --- Sample data (replace with your actual data) ---
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, size=100)
X_eval = np.random.rand(50, 5)
y_eval = np.random.randint(0, 2, size=50)

lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_eval = lgb.Dataset(X_eval, label=y_eval, reference=lgb_train)

# --- Training with the custom metric ---
params = {
    'objective': 'binary',
    'metric': 'binary_logloss',  # Standard metrics can still be tracked
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

evals_result = {}  # Evaluation history is captured via the record_evaluation callback

# Pass the custom function(s) to feval as a list. Early stopping monitors the
# metrics from params['metric'] as well as those returned by feval, using each
# metric's is_higher_better flag; set first_metric_only=True to stop on the
# first metric alone.
bst = lgb.train(
    params,
    lgb_train,
    num_boost_round=100,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=['train', 'eval'],
    feval=[lgbm_mcc],  # Pass as a list
    callbacks=[
        lgb.early_stopping(stopping_rounds=10, first_metric_only=False, verbose=True),
        lgb.log_evaluation(period=10),
        lgb.record_evaluation(evals_result)
    ]
)

print("\nEvaluation results including custom MCC:")
# print(evals_result)  # Full history
print(f"Best MCC on validation set: {max(evals_result['eval']['MCC'])}")
LightGBM requires the custom metric function to explicitly return the is_higher_better boolean, which makes its interaction with early stopping straightforward. You can also pass multiple custom metric functions in the feval list, as sketched below.
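For example, a second metric function (the lgbm_f1 helper here is illustrative, with the same signature as lgbm_mcc) can be tracked alongside the MCC metric; both appear in the evaluation log and in evals_result. This reuses the params, datasets, and lgbm_mcc defined above.

from sklearn.metrics import f1_score

def lgbm_f1(preds: np.ndarray, train_data: lgb.Dataset):
    """Second custom metric: F1 score at a 0.5 threshold."""
    labels = train_data.get_label()
    pred_labels = (preds > 0.5).astype(int)
    return 'F1', f1_score(labels, pred_labels), True

bst_multi = lgb.train(
    params,
    lgb_train,
    num_boost_round=50,
    valid_sets=[lgb_eval],
    valid_names=['eval'],
    feval=[lgbm_mcc, lgbm_f1]  # both custom metrics are reported each round
)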
CatBoost also supports custom metrics through the eval_metric parameter of the CatBoostClassifier/CatBoostRegressor constructors (or the params dictionary of its train function). You can pass either a string naming a built-in metric or a Python object, typically an instance of a class that implements a few specific methods.
For a custom metric, you usually define a class with at least these methods:

- __init__(self): constructor (optional).
- is_max_optimal(self): returns True if higher values of the metric are better, False otherwise.
- evaluate(self, approxes, target, weight): calculates the metric.
  - approxes: a list of lists containing the predicted raw scores for each document; for multi-class problems there is one inner list per class.
  - target: the true labels.
  - weight: sample weights (can be None).
  - Returns (sum_of_metric_values, sum_of_weights); CatBoost averages this internally.
- get_final_error(self, error, weight): calculates the final metric value from the sums returned by evaluate.
import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import f1_score  # Example metric

# Define a custom metric class for F1 score (binary classification)
class CatBoostF1Metric:
    def is_max_optimal(self):
        # True because a higher F1 score is better
        return True

    def evaluate(self, approxes, target, weight):
        # approxes is a list of lists; for binary classification there is a
        # single inner list holding the raw scores (log-odds)
        assert len(approxes) == 1
        preds_raw = np.array(approxes[0])
        # Convert raw scores to probabilities using a sigmoid
        preds_prob = 1.0 / (1.0 + np.exp(-preds_raw))
        # Convert probabilities to binary predictions
        pred_labels = (preds_prob > 0.5).astype(int)
        labels = np.array(target).astype(int)
        count = len(labels)
        if count == 0:
            return 0.0, 0.0  # Avoid division by zero on an empty batch
        # F1 is computed over the whole set, so return (f1 * count, count);
        # sample weights are ignored here for simplicity
        f1 = f1_score(labels, pred_labels)
        return f1 * count, count

    def get_final_error(self, error_sum, weight_sum):
        # Average the sums returned by evaluate
        if weight_sum == 0:
            return 0.0  # Avoid division by zero
        return error_sum / weight_sum

# --- Sample data (replace with your actual data) ---
X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, size=100)
X_eval = np.random.rand(50, 5)
y_eval = np.random.randint(0, 2, size=50)

train_pool = Pool(X_train, label=y_train)
eval_pool = Pool(X_eval, label=y_eval)

# --- Training with the custom metric ---
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    loss_function='Logloss',         # Optimized objective
    eval_metric=CatBoostF1Metric(),  # Custom metric instance drives early stopping
    early_stopping_rounds=10,
    verbose=10
)

model.fit(
    train_pool,
    eval_set=eval_pool
    # CatBoost optimizes loss_function, while eval_metric (the custom F1 class)
    # is used for overfitting detection, respecting its is_max_optimal() flag.
)

print("\nCustom F1 metric values during training:")
eval_metrics = model.get_evals_result()
if 'learn' in eval_metrics and 'CatBoostF1Metric' in eval_metrics['learn']:
    print(f"Train F1: {eval_metrics['learn']['CatBoostF1Metric'][-1]:.4f}")
if 'validation' in eval_metrics and 'CatBoostF1Metric' in eval_metrics['validation']:
    print(f"Eval F1: {eval_metrics['validation']['CatBoostF1Metric'][-1]:.4f}")
CatBoost's class-based approach for custom metrics provides a structured way to encapsulate the metric's logic and properties. Remember to handle the specific format of approxes based on your problem type (binary, multi-class, regression).
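For instance, in a multi-class setting approxes contains one list of raw scores per class; a hedged sketch of converting them into a probability matrix before scoring (the helper name is illustrative) might look like this:

import numpy as np

def approxes_to_probs(approxes):
    """Stack per-class raw scores into shape (n_samples, n_classes) and apply softmax."""
    raw = np.vstack([np.array(a) for a in approxes]).T
    raw = raw - raw.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(raw)
    return exp / exp.sum(axis=1, keepdims=True)

# Inside evaluate(), the class predictions could then be:
# pred_labels = approxes_to_probs(approxes).argmax(axis=1)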
A couple of points apply regardless of the library:

- Direction of improvement: the is_higher_better flag (or the is_max_optimal method in CatBoost) is critical for early stopping to work correctly. Ensure it accurately reflects whether the metric should be maximized or minimized; the boosting library uses this information to decide whether performance on the validation set is still improving.
- Prediction format: understand what the library passes into your function as preds or approxes. They might be raw scores (e.g., log-odds for logistic regression), probabilities, or final predicted values, depending on the library, objective function, and specific API call. Adjust your metric calculation accordingly (e.g., apply a sigmoid or softmax if needed).

By implementing custom evaluation metrics, you gain finer control over how model performance is tracked and assessed, allowing you to align the evaluation process more closely with the specific requirements of your machine learning task or business objectives. This capability, combined with custom loss functions and interpretability tools, provides a powerful toolkit for building highly tailored and effective gradient boosting models.