In the preceding sections, we examined the principles and mechanics behind creating custom components that conform to the Scikit-learn API. Now, let's consolidate this knowledge by building a practical, non-trivial example: a custom stacking ensemble estimator. Ensemble methods often improve predictive performance by combining the outputs of multiple models. While Scikit-learn provides `StackingClassifier` and `StackingRegressor`, building one ourselves offers valuable insight into estimator composition and the API requirements.
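
For reference, here is a minimal sketch of the built-in `StackingClassifier` that our custom class will imitate. The dataset, the choice of base models, and `cv=3` are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic problem (sizes chosen only to keep the sketch fast)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base estimators are passed as (name, estimator) tuples
builtin_stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,  # meta-features come from cross-validated predictions
)
builtin_stack.fit(X_train, y_train)

builtin_accuracy = builtin_stack.score(X_test, y_test)
print(f"Built-in StackingClassifier accuracy: {builtin_accuracy:.3f}")
```

A baseline like this is handy later on: the custom estimator should land in the same accuracy range on the same data.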
Our goal is to create a `StackingEstimator` class that takes a list of base estimators and a final meta-learner. During `fit`, it trains the base estimators on the input data and then trains the meta-learner on the predictions generated by the base estimators. During `predict`, it collects the predictions from the base estimators and feeds them into the meta-learner to produce the final output.
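
Before wrapping this in a class, the data flow can be sketched by hand. This is a rough illustration under assumptions: a binary problem, two arbitrary base models, and the positive-class probability as the only meta-feature per model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Step 1: fit each base model on the training data
base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    LogisticRegression(max_iter=1000),
]
for model in base_models:
    model.fit(X, y)

# Step 2: build one meta-feature column per base model
# (here: the predicted probability of the positive class)
X_meta = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])

# Step 3: fit the meta-learner on the meta-features
meta = LogisticRegression(max_iter=1000).fit(X_meta, y)

# Prediction repeats step 2 on new data, then calls the meta-learner
final_pred = meta.predict(X_meta)
print(X_meta.shape, final_pred.shape)
```

The class built in this section packages exactly these three steps behind `fit` and `predict`.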
To integrate smoothly with Scikit-learn tools like `Pipeline` and `GridSearchCV`, our `StackingEstimator` must adhere to the established conventions:
- Inheritance: Inherit from `BaseEstimator` and an appropriate mixin (e.g., `ClassifierMixin` or `RegressorMixin`). This provides essential methods like `get_params` and `set_params`.
- Constructor (`__init__`): All parameters must be explicit keyword arguments in `__init__`, and these arguments should not be validated or mutated there. Store the unmodified arguments directly as public attributes (e.g., `self.base_estimators = base_estimators`).
- Fitted attributes: Attributes learned during `fit` (like the trained base models and meta-learner) should be stored with a trailing underscore (e.g., `self.fitted_base_estimators_`).
- `fit` method: Accepts `X`, `y` and returns `self`. It performs the core training logic.
- `predict` method: Accepts `X` and returns predictions based on the fitted models. If building a classifier, implementing `predict_proba` is often desirable.
For our stacking estimator, the key parameters will be `base_estimators` (a list of `(name, estimator)` tuples) and `meta_learner` (a single estimator instance).
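
To see why the constructor convention matters, consider a minimal sketch. `WrapperSketch` and its `threshold` parameter are hypothetical names used only for illustration. Because `__init__` stores its arguments untouched, `get_params`, `set_params`, and `clone` work without any extra code.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class WrapperSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that only illustrates the parameter conventions."""

    def __init__(self, meta_learner=None, threshold=0.5):
        # Store arguments as-is: no validation, no copying, no mutation
        self.meta_learner = meta_learner
        self.threshold = threshold


wrapper = WrapperSketch(meta_learner=LogisticRegression(C=1.0), threshold=0.3)

# deep=True also exposes the nested estimator's parameters as "<name>__<param>"
params = wrapper.get_params(deep=True)
print(sorted(k for k in params if k.startswith("meta_learner"))[:3])

# Nested parameters can be set with the same double-underscore syntax
wrapper.set_params(meta_learner__C=10.0, threshold=0.7)

# clone builds a fresh, unfitted copy with the same parameter values
wrapper_copy = clone(wrapper)
print(wrapper_copy.threshold, wrapper_copy.meta_learner.C)
```

This double-underscore naming is what `GridSearchCV` relies on when it tunes parameters of nested estimators.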
Let's start building the `StackingEstimator`. We'll focus on a classifier version for illustration, inheriting from `ClassifierMixin`.
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels


# Helper function to generate meta-features
def _generate_meta_features(estimators, X):
    """Generates predictions from fitted estimators for meta-learner input."""
    # Check if estimators is a list and not empty
    if not isinstance(estimators, list) or len(estimators) == 0:
        raise ValueError("Expected a list of fitted estimators.")

    # Collect predictions. Use predict_proba if available, else predict.
    predictions = []
    for name, estimator in estimators:
        if hasattr(estimator, "predict_proba"):
            # Prefer probabilities for classification tasks if possible
            pred = estimator.predict_proba(X)
            # Handle case where predict_proba returns multiple columns
            if pred.ndim > 1 and pred.shape[1] > 1:
                # Binary case: keep only the positive-class probability
                # (common convention; the two columns sum to one anyway).
                # A more robust implementation might make this configurable.
                if pred.shape[1] == 2:
                    predictions.append(pred[:, 1].reshape(-1, 1))
                else:  # Multi-class probabilities
                    predictions.append(pred)  # Add all probability columns
            else:  # Single probability vector
                predictions.append(pred.reshape(-1, 1))
        else:
            # Fall back to predict if predict_proba is not available.
            # This assumes numeric class labels.
            pred = estimator.predict(X)
            predictions.append(pred.reshape(-1, 1))

    # Stack predictions horizontally
    if not predictions:
        raise ValueError("No predictions generated from base estimators.")
    return np.hstack(predictions)


# Mixins go to the left of BaseEstimator, as the Scikit-learn
# developer guide recommends.
class StackingEstimator(ClassifierMixin, BaseEstimator):
    """
    A basic Stacking ensemble classifier.

    Trains base estimators and uses their predictions
    as input for a final meta-learner.

    Parameters
    ----------
    base_estimators : list of (str, estimator) tuples
        The base estimators to be fitted on the data. Each estimator
        is cloned before fitting.
    meta_learner : estimator object
        The meta-learner to be fitted on the predictions of the
        base estimators. Cloned before fitting.

    Attributes
    ----------
    fitted_base_estimators_ : list of (str, estimator) tuples
        The fitted base estimators.
    fitted_meta_learner_ : estimator object
        The fitted meta-learner.
    classes_ : ndarray of shape (n_classes,)
        The class labels observed during fit.
    """

    def __init__(self, base_estimators, meta_learner):
        self.base_estimators = base_estimators
        self.meta_learner = meta_learner

    def fit(self, X, y):
        """
        Fit the stacking estimator.

        Trains the base estimators on X, y, then trains the meta-learner
        on the predictions of the base estimators.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vector.
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        self : object
            Returns the instance itself.
        """
        # Validate input data
        X, y = check_X_y(X, y)

        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        # Input validation for estimators (basic checks)
        if not isinstance(self.base_estimators, list) or len(self.base_estimators) == 0:
            raise ValueError(
                "`base_estimators` must be a non-empty list of (name, estimator) tuples."
            )
        if self.meta_learner is None:
            raise ValueError("`meta_learner` cannot be None.")

        # Clone estimators to avoid modifying originals
        self.fitted_base_estimators_ = []
        for name, estimator in self.base_estimators:
            fitted_estimator = clone(estimator).fit(X, y)
            self.fitted_base_estimators_.append((name, fitted_estimator))

        # Generate meta-features from base estimator predictions
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Clone and fit the meta-learner
        self.fitted_meta_learner_ = clone(self.meta_learner).fit(X_meta, y)
        return self

    def predict(self, X):
        """
        Predict class labels for samples in X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            Predicted class labels.
        """
        # Check if fit has been called
        check_is_fitted(self, ['fitted_base_estimators_', 'fitted_meta_learner_'])
        # Validate input
        X = check_array(X)
        # Generate meta-features from base estimators
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)
        # Predict using the fitted meta-learner
        return self.fitted_meta_learner_.predict(X_meta)

    def predict_proba(self, X):
        """
        Predict class probabilities for samples in X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        p : ndarray of shape (n_samples, n_classes)
            The class probabilities of the input samples.
        """
        # Check if fit has been called
        check_is_fitted(self, ['fitted_base_estimators_', 'fitted_meta_learner_'])
        # Validate input
        X = check_array(X)
        # Generate meta-features
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)
        # Check if meta-learner supports predict_proba
        if not hasattr(self.fitted_meta_learner_, "predict_proba"):
            raise AttributeError(
                f"The meta-learner {self.fitted_meta_learner_.__class__.__name__} "
                f"does not support predict_proba."
            )
        # Predict probabilities using the fitted meta-learner
        return self.fitted_meta_learner_.predict_proba(X_meta)

    # get_params and set_params are inherited from BaseEstimator.
    # They rely on the __init__ signature and on public attributes
    # whose names match the __init__ arguments.

    # No custom estimator tags are needed here: ClassifierMixin already
    # declares that fit requires y, so that tag must not be overridden
    # with False. If tags are needed, older releases customize them via
    # _more_tags, while Scikit-learn 1.6+ uses __sklearn_tags__.
```
This implementation provides a basic stacking classifier. Note the use of `clone` to ensure that the original estimators passed by the user are not modified. The helper function `_generate_meta_features` handles the collection of predictions from base models, attempting to use `predict_proba` where available, which is often beneficial for the meta-learner. We've included basic checks using Scikit-learn's validation utilities like `check_X_y`, `check_array`, and `check_is_fitted`.
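
The column-selection rule in the helper determines how wide the meta-feature matrix is. The sketch below reproduces that rule in a standalone function (`meta_feature_shape` is a hypothetical name) to show the resulting shapes for a binary and a three-class problem, assuming two base models that both expose `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def meta_feature_shape(n_classes):
    """Fit two base models and return the shape of the stacked meta-features."""
    X, y = make_classification(
        n_samples=150, n_features=8, n_informative=5,
        n_classes=n_classes, random_state=0,
    )
    models = [
        DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y),
        LogisticRegression(max_iter=1000).fit(X, y),
    ]
    columns = []
    for model in models:
        proba = model.predict_proba(X)
        if proba.shape[1] == 2:
            # Binary: keep only the positive-class column
            columns.append(proba[:, 1].reshape(-1, 1))
        else:
            # Multi-class: keep every class-probability column
            columns.append(proba)
    return np.hstack(columns).shape


binary_shape = meta_feature_shape(n_classes=2)
multiclass_shape = meta_feature_shape(n_classes=3)
print(binary_shape, multiclass_shape)
```

With two base models, the binary case yields one column per model, while the three-class case yields three columns per model.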
Now, let's see how to use our `StackingEstimator`. We'll define some base models and a meta-learner, then integrate it into a typical Scikit-learn workflow.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('svc', Pipeline([('scaler', StandardScaler()),  # SVC sensitive to scaling
                      ('svc', SVC(probability=True, random_state=42))]))
]

# Define meta-learner
meta_learner = LogisticRegression(solver='liblinear', random_state=42)

# Instantiate our custom StackingEstimator
stacking_clf = StackingEstimator(base_estimators=base_estimators,
                                 meta_learner=meta_learner)

# --- Option 1: Direct Fit and Predict ---
print("Fitting StackingEstimator directly...")
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"StackingEstimator Test Accuracy: {accuracy:.4f}")

# Check predict_proba
try:
    y_proba = stacking_clf.predict_proba(X_test)
    print(f"Predict probabilities shape: {y_proba.shape}")
    # print("Sample probabilities:\n", y_proba[:5])  # Uncomment to view
except AttributeError as e:
    print(f"Could not get probabilities: {e}")

# --- Option 2: Using Cross-Validation ---
print("\nEvaluating StackingEstimator with cross-validation...")
# Note: CV might be slow as it refits the entire stack multiple times
cv_scores = cross_val_score(stacking_clf, X, y, cv=3, scoring='accuracy')
print(f"Cross-validation Accuracy Scores: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f}")

# --- Option 3: Integration in a Pipeline (Example) ---
# Although our base 'svc' already includes scaling, this demonstrates the principle.
# Maybe we want overall scaling *before* any estimator sees the data.
print("\nUsing StackingEstimator within a Pipeline...")
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale data before feeding to the stacker
    ('stacker', StackingEstimator(base_estimators=base_estimators, meta_learner=meta_learner))
])
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)
print(f"Pipeline with StackingEstimator Test Accuracy: {accuracy_pipeline:.4f}")
```
This example demonstrates how the custom `StackingEstimator` can be instantiated, trained, used for prediction, evaluated with cross-validation, and even included as a step within a larger Scikit-learn `Pipeline`. Because it adheres to the API, it works seamlessly with these standard tools.
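
The same API compliance is what makes hyperparameter search possible. The sketch below tunes the meta-learner's `C` through `GridSearchCV`. To stay self-contained it uses `MiniStacker`, a hypothetical, stripped-down stand-in for `StackingEstimator` that assumes a binary problem and base models with `predict_proba`. The grid values are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


class MiniStacker(ClassifierMixin, BaseEstimator):
    """Hypothetical minimal stacker (binary only), for illustration."""

    def __init__(self, base_estimators, meta_learner):
        self.base_estimators = base_estimators
        self.meta_learner = meta_learner

    def _meta_features(self, X):
        # One positive-class probability column per fitted base model
        return np.column_stack(
            [est.predict_proba(X)[:, 1] for _, est in self.fitted_base_estimators_]
        )

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.fitted_base_estimators_ = [
            (name, clone(est).fit(X, y)) for name, est in self.base_estimators
        ]
        self.fitted_meta_learner_ = clone(self.meta_learner).fit(
            self._meta_features(X), y
        )
        return self

    def predict(self, X):
        return self.fitted_meta_learner_.predict(self._meta_features(X))


X, y = make_classification(n_samples=200, n_features=8, random_state=0)
stacker = MiniStacker(
    base_estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    meta_learner=LogisticRegression(max_iter=1000),
)

# "meta_learner__C" reaches into the nested estimator via set_params
search = GridSearchCV(stacker, param_grid={"meta_learner__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Note that only `meta_learner` is reachable this way. Because `base_estimators` is a plain list of tuples, the inherited `get_params` does not expose names such as `tree__max_depth`; supporting them would require overriding `get_params` and `set_params`.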
Verifying compliance with `check_estimator`
A significant step in developing robust Scikit-learn components is using the `check_estimator` utility. This function runs a comprehensive suite of tests to verify API compliance, invariant properties, and expected behaviors.
```python
from sklearn.utils.estimator_checks import check_estimator

print("\nRunning check_estimator (this can take a while and be verbose)...")
try:
    # Need to instantiate with simple base estimators for some checks to pass
    simple_base = [
        ('lr', LogisticRegression(solver='liblinear')),
        ('rf', RandomForestClassifier(n_estimators=5)),
    ]
    simple_meta = LogisticRegression(solver='liblinear')
    check_estimator(StackingEstimator(base_estimators=simple_base, meta_learner=simple_meta))
    print("check_estimator passed (or showed non-critical warnings).")
except Exception as e:
    print(f"check_estimator failed: {e}")
```
Running `check_estimator` is invaluable but can sometimes be challenging to pass completely, especially for complex estimators like ensembles. Failures often point to subtle API violations or edge cases that need addressing. For instance, our basic implementation might fail checks related to handling sparse matrices or specific metadata routing, depending on the base estimators used. Addressing all `check_estimator` failures often requires deeper interaction with Scikit-learn's internal mechanisms.
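
The sparse-matrix case is easy to reproduce in isolation. As a small sketch: `check_array`, which our `predict` calls with default arguments, rejects sparse input unless `accept_sparse` is set. Whether accepting sparse data is safe still depends on every base estimator supporting it, which is an assumption this snippet does not verify.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.validation import check_array

X_sparse = sparse.csr_matrix(np.eye(3))

# Default behavior: sparse input is rejected with a TypeError
try:
    check_array(X_sparse)
    dense_only_rejected = False
except TypeError as exc:
    dense_only_rejected = True
    print(f"Rejected: {exc}")

# Opting in: the listed sparse formats pass validation unchanged
X_checked = check_array(X_sparse, accept_sparse=["csr", "csc"])
print(type(X_checked).__name__, X_checked.shape)
```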
This hands-on example provides a foundation. Several enhancements could make `StackingEstimator` more robust and flexible:
- Cross-validated meta-features: Use cross-validation within the `fit` method. Our `fit` builds meta-features from predictions on the same data the base models were trained on, which leaks label information into the meta-learner. Instead, train base models on k−1 folds and predict on the held-out fold to generate meta-features for the entire training set without leakage. This is how Scikit-learn's official `StackingClassifier`/`StackingRegressor` operate by default.
- Feature passthrough: Allow the original features `X` to be passed through to the meta-learner alongside the base model predictions.
- Configurable prediction method: Let the user choose between `predict`, `predict_proba`, or `decision_function` to generate meta-features.
- Parallel fitting: Use `joblib` or `concurrent.futures` (as discussed in Chapter 5) to fit base estimators in parallel, potentially speeding up the `fit` process.
- Stricter validation: Add more checks in `fit` (or a dedicated private validation method) to ensure estimators are compatible.
Building custom estimators like this `StackingEstimator` solidifies understanding of the Scikit-learn API and empowers you to create highly specialized components for your machine learning pipelines, moving beyond off-the-shelf solutions when necessary.
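
As a starting point for the cross-validation enhancement listed above, here is one possible sketch using `cross_val_predict`. It assumes a binary problem and base models with `predict_proba`. The variable names and `cv=5` are illustrative choices, not the design of the built-in classes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
base_estimators = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

oof_columns = []   # out-of-fold meta-features, one column per base model
fitted_base = []   # base models refit on all data, used at predict time
for name, est in base_estimators:
    # Each row's probability comes from a model that never saw that row
    proba = cross_val_predict(clone(est), X, y, cv=5, method="predict_proba")
    oof_columns.append(proba[:, 1])
    fitted_base.append((name, clone(est).fit(X, y)))

X_meta_oof = np.column_stack(oof_columns)

# The meta-learner trains on leakage-free meta-features
meta_learner = LogisticRegression(max_iter=1000).fit(X_meta_oof, y)
print(X_meta_oof.shape)
```

The trade-off is training cost: each base model is fit once per fold plus once on the full data.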
© 2025 ApX Machine Learning