Building a custom stacking ensemble estimator provides a practical, non-trivial example of creating components that conform to the Scikit-learn API. Ensemble methods often improve predictive performance by combining the outputs of multiple models. While Scikit-learn provides StackingClassifier and StackingRegressor, constructing a custom estimator offers valuable insight into estimator composition and the API requirements.

Our goal is to create a StackingEstimator class that takes a list of base estimators and a final meta-learner. During fit, it trains the base estimators on the input data and then trains the meta-learner on the predictions generated by the base estimators. During predict, it combines the predictions from the base estimators and feeds them into the meta-learner to produce the final output.

Design and Scikit-learn Compliance

To integrate smoothly with Scikit-learn tools like Pipeline and GridSearchCV, our StackingEstimator must adhere to the established conventions:

- Inheritance: It should inherit from BaseEstimator and an appropriate mixin (e.g., ClassifierMixin or RegressorMixin). BaseEstimator provides get_params and set_params; the mixin adds a default score method.
- Constructor (__init__): All parameters must be explicit keyword arguments in __init__, and these arguments should not be validated or mutated there. Store the unmodified arguments directly as public attributes (e.g., self.base_estimators = base_estimators).
- Fitted Attributes: Attributes learned during fit (like the trained base models and meta-learner) should be stored with a trailing underscore (e.g., self.fitted_base_estimators_).
- fit Method: Accepts X, y and returns self. It performs the core training logic.
- predict Method: Accepts X and returns predictions based on the fitted models.
If building a classifier, implementing predict_proba is often desirable.

For our stacking estimator, the main parameters will be base_estimators (a list of (name, estimator) tuples) and meta_learner (a single estimator instance).

Implementation Steps

Let's start building the StackingEstimator. We'll focus on a classifier version for illustration, inheriting from ClassifierMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels


# Helper function to generate meta-features
def _generate_meta_features(estimators, X):
    """Generates predictions from fitted estimators for meta-learner input."""
    # Check that estimators is a non-empty list of (name, estimator) tuples
    if not isinstance(estimators, list) or len(estimators) == 0:
        raise ValueError("Expected a non-empty list of fitted (name, estimator) tuples.")

    # Collect predictions. Use predict_proba if available, else predict.
    predictions = []
    for name, estimator in estimators:
        # hasattr avoids a broad try/except that could hide unrelated AttributeErrors
        if hasattr(estimator, "predict_proba"):
            # Prefer probabilities for classification tasks
            pred = estimator.predict_proba(X)
            if pred.ndim > 1 and pred.shape[1] == 2:
                # Binary case: the two columns sum to 1, so keep only the
                # probability of the positive class (common convention).
                # An implementation might need to make this configurable.
                predictions.append(pred[:, 1].reshape(-1, 1))
            elif pred.ndim > 1:
                # Multi-class probabilities: add all probability columns
                predictions.append(pred)
            else:
                # Single probability vector
                predictions.append(pred.reshape(-1, 1))
        else:
            # Fall back to predict; this assumes numeric class labels
            pred = estimator.predict(X)
            predictions.append(pred.reshape(-1, 1))

    # Stack predictions horizontally
    return np.hstack(predictions)


# The mixin is listed before BaseEstimator, as Scikit-learn recommends,
# so that the mixin's methods and estimator tags take precedence.
class StackingEstimator(ClassifierMixin, BaseEstimator):
    """
    A basic Stacking ensemble classifier.

    Trains base estimators and uses their predictions as input
    for a final meta-learner.

    Parameters
    ----------
    base_estimators : list of (str, estimator) tuples
        The base estimators to be fitted on the data. Each estimator is
        cloned before fitting.
    meta_learner : estimator object
        The meta-learner to be fitted on the predictions of the base
        estimators. Cloned before fitting.

    Attributes
    ----------
    fitted_base_estimators_ : list of (str, estimator) tuples
        The fitted base estimators.
    fitted_meta_learner_ : estimator object
        The fitted meta-learner.
    classes_ : ndarray of shape (n_classes,)
        The class labels observed during fit.
    """

    def __init__(self, base_estimators, meta_learner):
        # Store arguments unmodified; validation happens in fit
        self.base_estimators = base_estimators
        self.meta_learner = meta_learner

    def fit(self, X, y):
        """
        Fit the stacking estimator.

        Trains the base estimators on X, y, then trains the meta-learner
        on the predictions of the base estimators.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vector.
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        self : object
            Returns the instance itself.
        """
        # Validate input data
        X, y = check_X_y(X, y)

        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        # Input validation for estimators (basic checks)
        if not isinstance(self.base_estimators, list) or len(self.base_estimators) == 0:
            raise ValueError("`base_estimators` must be a non-empty list of (name, estimator) tuples.")
        if self.meta_learner is None:
            raise ValueError("`meta_learner` cannot be None.")

        # Clone estimators to avoid modifying originals
        self.fitted_base_estimators_ = []
        for name, estimator in self.base_estimators:
            fitted_estimator = clone(estimator).fit(X, y)
            self.fitted_base_estimators_.append((name, fitted_estimator))

        # Generate meta-features from base estimator predictions
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Clone and fit the meta-learner
        self.fitted_meta_learner_ = clone(self.meta_learner).fit(X_meta, y)

        return self

    def predict(self, X):
        """
        Predict class labels for samples in X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        y_pred : ndarray of shape (n_samples,)
            Predicted class labels.
        """
        # Check if fit has been called
        check_is_fitted(self, ['fitted_base_estimators_', 'fitted_meta_learner_'])

        # Validate input
        X = check_array(X)

        # Generate meta-features from base estimators
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Predict using the fitted meta-learner
        return self.fitted_meta_learner_.predict(X_meta)

    def predict_proba(self, X):
        """
        Predict class probabilities for samples in X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        p : ndarray of shape (n_samples, n_classes)
            The class probabilities of the input samples.
        """
        # Check if fit has been called
        check_is_fitted(self, ['fitted_base_estimators_', 'fitted_meta_learner_'])

        # Validate input
        X = check_array(X)

        # Generate meta-features
        X_meta = _generate_meta_features(self.fitted_base_estimators_, X)

        # Check if meta-learner supports predict_proba
        if not hasattr(self.fitted_meta_learner_, "predict_proba"):
            raise AttributeError(
                f"The meta-learner {self.fitted_meta_learner_.__class__.__name__} "
                f"does not support predict_proba."
            )

        # Predict probabilities using the fitted meta-learner
        return self.fitted_meta_learner_.predict_proba(X_meta)

    # get_params and set_params are inherited from BaseEstimator. They rely on
    # an explicit __init__ signature and public attributes matching its arguments.

    # Estimator tags: ClassifierMixin already marks this estimator as a
    # classifier whose fit requires y, so no override is needed here. Custom
    # tags go in _more_tags (Scikit-learn < 1.6) or __sklearn_tags__ (>= 1.6).
```

This implementation provides a basic stacking classifier. Note the use of clone to ensure that the original estimators passed by the user are not modified. The helper function _generate_meta_features handles the collection of predictions from base models, using predict_proba where available, which is often beneficial for the meta-learner. We've included basic checks using Scikit-learn's validation utilities like check_X_y, check_array, and check_is_fitted.

Using the Custom Estimator

Now, let's see how to use our StackingEstimator.
We'll define some base models and a meta-learner, then integrate it into a typical Scikit-learn workflow.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Generate synthetic classification data
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base estimators
base_estimators = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('svc', Pipeline([('scaler', StandardScaler()),  # SVC sensitive to scaling
                      ('svc', SVC(probability=True, random_state=42))]))
]

# Define meta-learner
meta_learner = LogisticRegression(solver='liblinear', random_state=42)

# Instantiate our custom StackingEstimator
stacking_clf = StackingEstimator(base_estimators=base_estimators, meta_learner=meta_learner)

# --- Option 1: Direct Fit and Predict ---
print("Fitting StackingEstimator directly...")
stacking_clf.fit(X_train, y_train)
y_pred = stacking_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"StackingEstimator Test Accuracy: {accuracy:.4f}")

# Check predict_proba
try:
    y_proba = stacking_clf.predict_proba(X_test)
    print(f"Predict probabilities shape: {y_proba.shape}")
    # print("Sample probabilities:\n", y_proba[:5])  # Uncomment to view
except AttributeError as e:
    print(f"Could not get probabilities: {e}")

# --- Option 2: Using Cross-Validation ---
print("\nEvaluating StackingEstimator with cross-validation...")
# Note: CV might be slow as it refits the entire stack multiple times
cv_scores = cross_val_score(stacking_clf, X, y, cv=3, scoring='accuracy')
print(f"Cross-validation Accuracy Scores: {cv_scores}")
print(f"Mean CV Accuracy: {np.mean(cv_scores):.4f}")

# --- Option 3: Integration in a Pipeline (Example) ---
# Although our base 'svc' already includes scaling, this demonstrates the principle.
# Maybe we want overall scaling *before* any estimator sees the data.
print("\nUsing StackingEstimator within a Pipeline...")
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Scale data before feeding to the stacker
    ('stacker', StackingEstimator(base_estimators=base_estimators, meta_learner=meta_learner))
])
pipeline.fit(X_train, y_train)
y_pred_pipeline = pipeline.predict(X_test)
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)
print(f"Pipeline with StackingEstimator Test Accuracy: {accuracy_pipeline:.4f}")
```

This example demonstrates how the custom StackingEstimator can be instantiated, trained, used for prediction, evaluated with cross-validation, and even included as a step within a larger Scikit-learn Pipeline. Because it adheres to the API, it works with these standard tools.

Testing with check_estimator

A significant step in developing Scikit-learn components is using the check_estimator utility.
This function runs a comprehensive suite of tests to verify API compliance, invariant properties, and expected behaviors.

```python
from sklearn.utils.estimator_checks import check_estimator

print("\nRunning check_estimator (this can take a while and be verbose)...")
try:
    # Need to instantiate with simple base estimators for some checks to pass
    simple_base = [('lr', LogisticRegression(solver='liblinear')),
                   ('rf', RandomForestClassifier(n_estimators=5))]
    simple_meta = LogisticRegression(solver='liblinear')
    check_estimator(StackingEstimator(base_estimators=simple_base, meta_learner=simple_meta))
    print("check_estimator passed (or showed non-critical warnings).")
except Exception as e:
    print(f"check_estimator failed: {e}")
```

Running check_estimator is invaluable but can sometimes be challenging to pass completely, especially for complex estimators like ensembles. Failures often point to subtle API violations or edge cases that need addressing. For instance, our basic implementation might fail checks related to handling sparse matrices or specific metadata routing, depending on the base estimators used. Addressing all check_estimator failures often requires deeper interaction with Scikit-learn's internal mechanisms.

Potential Enhancements

This hands-on example provides a foundation. Several enhancements could make StackingEstimator more effective and flexible:

- Cross-Validated Meta-Features: Instead of training the meta-learner on predictions made on the same data the base models were trained on (which risks overfitting), use cross-validation within the fit method. Train base models on $k-1$ folds and predict on the held-out fold to generate meta-features for the entire training set without leakage.
This is how Scikit-learn's official StackingClassifier/StackingRegressor operate by default.
- Feature Passthrough: Allow the original features X to be passed through to the meta-learner alongside the base model predictions.
- Handling Different Prediction Methods: Allow configuration of whether base models should use predict, predict_proba, or decision_function to generate meta-features.
- Parallel Fitting: Utilize joblib or concurrent.futures (as discussed in Chapter 5) to fit base estimators in parallel, potentially speeding up the fit process.
- Improved Parameter Validation: Add more rigorous checks in fit (or a dedicated private validation method) to ensure estimators are compatible.

Building custom estimators like this StackingEstimator solidifies understanding of the Scikit-learn API and helps you create highly specialized components for your machine learning pipelines when off-the-shelf solutions are insufficient.
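
As a starting point for the cross-validated meta-features enhancement, the following is a minimal sketch, not a drop-in replacement for the fit method above. It assumes every base estimator supports predict_proba, and the helper name out_of_fold_meta_features is hypothetical. It relies on Scikit-learn's cross_val_predict so that each training row is scored by a clone that never saw that row.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict


def out_of_fold_meta_features(base_estimators, X, y, cv=5):
    """Hypothetical helper: stack out-of-fold probabilities column-wise."""
    columns = []
    for _name, estimator in base_estimators:
        # Each row is predicted by a model trained on the other folds only
        proba = cross_val_predict(clone(estimator), X, y, cv=cv,
                                  method="predict_proba")
        columns.append(proba)
    return np.hstack(columns)


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
base_estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
]

# Train the meta-learner on leakage-free meta-features
X_meta = out_of_fold_meta_features(base_estimators, X, y, cv=5)
meta_learner = LogisticRegression().fit(X_meta, y)

# Refit the base estimators on all the data for use at prediction time
fitted_base = [(name, clone(est).fit(X, y)) for name, est in base_estimators]
```

Unlike _generate_meta_features, this sketch keeps both probability columns in the binary case for simplicity. Whichever layout is chosen, the meta-features built at prediction time must use the same columns in the same order as those used during fit.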