Practice: Building a Custom Ensemble Estimator

In the preceding sections, we examined the principles and mechanics behind creating custom components that conform to the Scikit-learn API. Now, let's consolidate this knowledge by building a practical, non-trivial example: a custom stacking ensemble estimator. Ensemble methods often improve predictive performance by combining the outputs of multiple models. While Scikit-learn provides StackingClassifier and StackingRegressor, building one ourselves offers valuable insight into estimator composition and the API requirements.

Our goal is to create a StackingEstimator class that takes a list of base estimators and a final meta-learner. During fit, it trains the base estimators on the input data and then trains the meta-learner on the predictions generated by the base estimators. During predict, it combines the predictions from the base estimators and feeds them into the meta-learner to produce the final output.

Design and Scikit-learn Compliance

To integrate smoothly with Scikit-learn tools like Pipeline and GridSearchCV, our StackingEstimator must adhere to the established conventions:

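Inheritance: It should inherit from an appropriate mixin (e.g., ClassifierMixin or RegressorMixin) and from BaseEstimator, with the mixin listed first so that the method resolution order is correct. BaseEstimator provides get_params and set_params; the mixin contributes a default score method.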
Constructor (__init__): All parameters must be explicit keyword arguments in __init__, and these arguments should not be validated or mutated there. Store the unmodified arguments directly as public attributes (e.g., self.base_estimators = base_estimators).
Fitted Attributes: Attributes learned during fit (like the trained base models and meta-learner) should be stored with a trailing underscore (e.g., self.fitted_base_estimators_).
fit Method: Accepts X, y and returns self. It performs the core training logic.
predict Method: Accepts X and returns predictions based on the fitted models. If building a classifier, implementing predict_proba is often desirable.

For our stacking estimator, the key parameters will be base_estimators (a list of estimator instances) and meta_learner (a single estimator instance).
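The constructor and inheritance conventions above are what make get_params, set_params, and clone work. The following is a minimal sketch, separate from the StackingEstimator itself; TinyWrapper and its inner, candidates, and threshold parameters are hypothetical names used only for illustration.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class TinyWrapper(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator used only to illustrate the conventions."""

    def __init__(self, inner=None, candidates=None, threshold=0.5):
        # Store the arguments unchanged: no validation, no copies
        self.inner = inner
        self.candidates = candidates
        self.threshold = threshold


wrapper = TinyWrapper(
    inner=LogisticRegression(C=2.0),
    candidates=[("lr", LogisticRegression(C=5.0))],
    threshold=0.3,
)

# get_params reads the attributes named after the __init__ arguments
params = wrapper.get_params(deep=True)
print(sorted(k for k in params if "__" not in k))

# Nested parameters are exposed for attributes that are estimators themselves
print(params["inner__C"])

# BaseEstimator does not look inside a plain list of (name, estimator) tuples
print(any(k.startswith("candidates__") for k in params))

# set_params accepts the same double-underscore syntax that GridSearchCV uses
wrapper.set_params(inner__C=0.1)

# clone builds a new, unfitted object from get_params, copying nested estimators
copy = clone(wrapper)
print(copy.inner is wrapper.inner)
```

One consequence for the design in this section: because base_estimators is a plain list of tuples, the parameters of individual base models are not reachable through set_params, whereas those of meta_learner are. Scikit-learn's built-in StackingClassifier does expose them, at the cost of extra internal machinery.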

Implementation Steps

Let's start building the StackingEstimator. We'll focus on a classifier version for illustration, inheriting from ClassifierMixin.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels

# Helper function to generate meta-features
def _generate_meta_features(estimators, X):
    """Generates predictions from fitted estimators for meta-learner input."""
    # Check if estimators is a list and not empty
    if not isinstance(estimators, list) or len(estimators) == 0:
        raise ValueError("Expected a list of fitted estimators.")

    # Collect one block of meta-features per base estimator.
    predictions = []
    for name, estimator in estimators:
        # hasattr is used instead of try/except so that an AttributeError
        # raised *inside* predict_proba is not silently swallowed.
        if hasattr(estimator, "predict_proba"):
            # Prefer probabilities: they carry more information than labels
            pred = estimator.predict_proba(X)
            if pred.ndim == 2 and pred.shape[1] == 2:
                # Binary case: the two columns sum to 1, so keep only the
                # probability of the positive class (second column).
                pred = pred[:, 1]
            # Multi-class case: keep all probability columns.
        else:
            # Fall back to hard predictions if predict_proba is unavailable
            pred = estimator.predict(X)

        # Ensure a 2D block: (n_samples, 1) for vectors, unchanged for
        # multi-class probability matrices.
        predictions.append(pred.reshape(len(pred), -1))

    # Stack the blocks horizontally: one row per sample
    return np.hstack(predictions)


class StackingEstimator(ClassifierMixin, BaseEstimator):
    """
    A basic Stacking ensemble classifier.

    Trains base estimators and uses their predictions
    as input for a final meta-learner.

    Parameters
    ----------
    base_estimators : list of (str, estimator) tuples
        The base estimators to be fitted on the data. Each estimator
        is cloned before fitting.

    meta_learner : estimator object
        The meta-learner to be fitted on the predictions of the
        base estimators. Cloned before fitting.

    Attributes
    ----------
    fitted_base_estimators_ : list of (str, estimator) tuples
        The fitted base estimators.

    fitted_meta_learner_ : estimator object
        The fitted meta-learner.

    classes_ : ndarray of shape (n_classes,)
        The classes labels observed during fit.
    """
    def __init__(self, base_estimators, meta_learner):
        self.base_estimators = base_estimators
        self.meta_learner = meta_learner

    def fit(self, X, y):
        """
        Fit the stacking estimator.

        Trains the base estimators on X, y, then trains the meta-learner
        on the predictions of the base estimators.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vector.
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        self : object
            Returns the instance itself.
        """
        # Validate input data
        X, y = check_X_y(X, y)

        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        # Input validation for estimators (basic checks)
        if not isinstance(self.base_estimators, list) or len(self.base_estimators) == 0:
            raise ValueError("`base_estimators` must be a non-empty list of (name, estimator) tuples.")
        if self.meta_learner is None:
             raise ValueError("`meta_learner` cannot be None.")

        # Clone estimators to avoid modifying originals
        self.fitted_base_estimators_ = []
        for name, estimator in self.base_estimators:
            fitted_estimator = clone(estimator).fit(X, y)
            self.fitted_base_estimators_.append((name, fitted_estimator))

        # Generate meta-features from base estimator predictions
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Clone and fit the meta-learner
        self.fitted_meta_learner_ = clone(self.meta_learner).fit(X_meta, y)

        return self

    def predict(self, X):
        """
        Predict class labels for samples in X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            Predicted class labels.
        """
        # Check if fit has been called
        check_is_fitted(self, ['fitted_base_estimators_', 'fitted_meta_learner_'])

        # Validate input
        X = check_array(X)

        # Generate meta-features from base estimators
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Predict using the fitted meta-learner
        return self.fitted_meta_learner_.predict(X_meta)

    def predict_proba(self, X):
        """
        Predict class probabilities for samples in X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        p : ndarray of shape (n_samples, n_classes)
            The class probabilities of the input samples.
        """
        # Check if fit has been called
        check_is_fitted(self, ['fitted_base_estimators_', 'fitted_meta_learner_'])

        # Validate input
        X = check_array(X)

        # Generate meta-features
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Check if meta-learner supports predict_proba
        if not hasattr(self.fitted_meta_learner_, "predict_proba"):
            raise AttributeError(
                f"The meta-learner {self.fitted_meta_learner_.__class__.__name__} "
                f"does not support predict_proba."
            )

        # Predict probabilities using the fitted meta-learner
        return self.fitted_meta_learner_.predict_proba(X_meta)

    # get_params and set_params are inherited from BaseEstimator
    # Needs proper __init__ signature and public attributes matching __init__ args.

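    # Optional: estimator tags describe the estimator's capabilities to
    # Scikit-learn's tooling (for example, which checks check_estimator runs).
    # 'requires_y' states that fit needs a target. ClassifierMixin already
    # sets it to True; it is repeated here only to show the mechanism.
    # Note: in recent Scikit-learn releases (1.6 onward) tags are declared
    # through __sklearn_tags__ instead of _more_tags.
    def _more_tags(self):
        return {'requires_y': True}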

This implementation provides a basic stacking classifier. Note the use of clone to ensure that the original estimators passed by the user are not modified. The helper function _generate_meta_features handles the collection of predictions from base models, attempting to use predict_proba where available, which is often beneficial for the meta-learner. We've included basic checks using Scikit-learn's validation utilities like check_X_y, check_array, and check_is_fitted.
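To make the shape of the meta-learner's input concrete, the sketch below builds the meta-feature matrix by hand for a binary problem, following the same convention as the helper: one positive-class probability column per base model. The dataset size and the two base models are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)

# Two already-fitted base models (arbitrary choices for this sketch)
fitted = [
    DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y),
    LogisticRegression(max_iter=1000).fit(X, y),
]

# Binary problem: predict_proba returns two columns that sum to 1,
# so only the positive-class column (index 1) is kept per model.
columns = [est.predict_proba(X)[:, 1].reshape(-1, 1) for est in fitted]
X_meta = np.hstack(columns)

# The meta-learner sees one feature per base model instead of the 6 originals
print(X.shape, "->", X_meta.shape)
```

With more than two classes, each base model would contribute one column per class, so the meta-learner would see n_models × n_classes features.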

Using the Custom Estimator

Now, let's see how to use our StackingEstimator. We'll define some base models and a meta-learner, then integrate it into a typical Scikit-learn workflow.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('svc', Pipeline([('scaler', StandardScaler()), # SVC sensitive to scaling
                      ('svc', SVC(probability=True, random_state=42))]))
]

# Define meta-learner
meta_learner = LogisticRegression(solver='liblinear', random_state=42)

# Instantiate our custom StackingEstimator
stacking_clf = StackingEstimator(base_estimators=base_estimators,
                                 meta_learner=meta_learner)

# --- Option 1: Direct Fit and Predict ---
print("Fitting StackingEstimator directly...")
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"StackingEstimator Test Accuracy: {accuracy:.4f}")

# Check predict_proba
try:
    y_proba = stacking_clf.predict_proba(X_test)
    print(f"Predict probabilities shape: {y_proba.shape}")
    # print("Sample probabilities:\n", y_proba[:5]) # Uncomment to view
except AttributeError as e:
    print(f"Could not get probabilities: {e}")


# --- Option 2: Using Cross-Validation ---
print("\nEvaluating StackingEstimator with cross-validation...")
# Note: CV might be slow as it refits the entire stack multiple times
cv_scores = cross_val_score(stacking_clf, X, y, cv=3, scoring='accuracy')
print(f"Cross-validation Accuracy Scores: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f}")


# --- Option 3: Integration in a Pipeline (Example) ---
# The 'svc' base estimator already scales its own input, so the outer scaler
# is redundant here; it is included only to show the stacker as a pipeline step.
print("\nUsing StackingEstimator within a Pipeline...")
pipeline = Pipeline([
    ('scaler', StandardScaler()), # Scale data before feeding to the stacker
    ('stacker', StackingEstimator(base_estimators=base_estimators, meta_learner=meta_learner))
])

pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)
print(f"Pipeline with StackingEstimator Test Accuracy: {accuracy_pipeline:.4f}")

This example demonstrates how the custom StackingEstimator can be instantiated, trained, used for prediction, evaluated with cross-validation, and even included as a step within a larger Scikit-learn Pipeline. Because it adheres to the API, it works seamlessly with these standard tools.
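GridSearchCV was mentioned earlier as one of the tools the estimator should work with, so here is a hedged sketch of tuning the meta-learner through the nested-parameter syntax. To keep the block self-contained it uses MiniStack, a hypothetical, trimmed-down stand-in for StackingEstimator (binary classification only, no input validation); the grid values are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


class MiniStack(ClassifierMixin, BaseEstimator):
    """Hypothetical, minimal stand-in for StackingEstimator (binary only)."""

    def __init__(self, base_estimators=None, meta_learner=None):
        self.base_estimators = base_estimators
        self.meta_learner = meta_learner

    def _meta_features(self, X):
        # One positive-class probability column per fitted base model
        return np.column_stack(
            [est.predict_proba(X)[:, 1] for _, est in self.fitted_base_estimators_]
        )

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.fitted_base_estimators_ = [
            (name, clone(est).fit(X, y)) for name, est in self.base_estimators
        ]
        self.fitted_meta_learner_ = clone(self.meta_learner).fit(
            self._meta_features(X), y
        )
        return self

    def predict(self, X):
        return self.fitted_meta_learner_.predict(self._meta_features(X))


X, y = make_classification(n_samples=200, n_features=8, random_state=0)
stack = MiniStack(
    base_estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    meta_learner=LogisticRegression(),
)

# "meta_learner__C" reaches into the nested estimator via set_params
search = GridSearchCV(stack, param_grid={"meta_learner__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The search works because get_params and set_params are inherited from BaseEstimator and because fit clones its inputs, so each candidate starts from unfitted copies.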

Testing with check_estimator

A significant step in developing robust Scikit-learn components is using the check_estimator utility. This function runs a comprehensive suite of tests to verify API compliance, invariant properties, and expected behaviors.

from sklearn.utils.estimator_checks import check_estimator

print("\nRunning check_estimator (this can take a while and be verbose)...")
try:
    # Use small, seeded estimators so the checks run quickly and deterministically
    simple_base = [('lr', LogisticRegression(solver='liblinear', random_state=0)),
                   ('rf', RandomForestClassifier(n_estimators=5, random_state=0))]
    simple_meta = LogisticRegression(solver='liblinear', random_state=0)
    check_estimator(StackingEstimator(base_estimators=simple_base, meta_learner=simple_meta))
    print("check_estimator passed (or showed non-critical warnings).")
except Exception as e:
    print(f"check_estimator failed: {e}")

Running check_estimator is invaluable but can sometimes be challenging to pass completely, especially for complex estimators like ensembles. Failures often point to subtle API violations or edge cases that need addressing. For instance, our basic implementation might fail checks related to handling sparse matrices or specific metadata routing, depending on the base estimators used. Addressing all check_estimator failures often requires deeper interaction with Scikit-learn's internal mechanisms.

Potential Enhancements

This hands-on example provides a foundation. Several enhancements could make StackingEstimator more robust and flexible:

Cross-Validated Meta-Features: Instead of training the meta-learner on predictions made on the same data the base models were trained on (which risks overfitting), use cross-validation within the fit method. Train base models on $k-1$ folds and predict on the held-out fold to generate meta-features for the entire training set without leakage. This is how Scikit-learn's official StackingClassifier/StackingRegressor operate by default.
Feature Passthrough: Allow the original features X to be passed through to the meta-learner alongside the base model predictions.
Handling Different Prediction Methods: Allow configuration of whether base models should use predict, predict_proba, or decision_function to generate meta-features.
Parallel Fitting: Utilize joblib or concurrent.futures (as discussed in Chapter 5) to fit base estimators in parallel, potentially speeding up the fit process.
Improved Parameter Validation: Add more rigorous checks in fit (or a dedicated private validation method) to ensure estimators are compatible.
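
The first two enhancements can be sketched outside the class with cross_val_predict. This is an illustration under stated assumptions (binary target, five folds, arbitrary base models), not the implementation used by Scikit-learn's own stacking classes.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Out-of-fold probabilities: every row is predicted by a model
# that was trained on the other folds only.
oof_columns = [
    cross_val_predict(clone(est), X, y, cv=5, method="predict_proba")[:, 1]
    for _, est in base_estimators
]
X_meta = np.column_stack(oof_columns)

# Optional passthrough: append the original features to the meta-features
X_meta_passthrough = np.hstack([X_meta, X])

# The meta-learner is trained on the out-of-fold meta-features...
meta_learner = LogisticRegression(max_iter=1000).fit(X_meta, y)

# ...while the base models used at prediction time are refit on all the data
fitted_base = [(name, clone(est).fit(X, y)) for name, est in base_estimators]

print(X_meta.shape, X_meta_passthrough.shape)
```

Inside fit, the same two steps would replace the single call that generates meta-features from models trained on the full training set.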

Building custom estimators like this StackingEstimator solidifies understanding of the Scikit-learn API and empowers you to create highly specialized components for your machine learning pipelines, moving beyond off-the-shelf solutions when necessary.
