Unlike filter methods that assess features independently or wrapper methods that repeatedly train a model on different subsets, embedded methods perform feature selection as an integral part of the model training process. Think of it as building a model that inherently learns which features are important and which ones can be disregarded, all within a single training run.
These methods are often computationally less expensive than wrapper methods because they don't require training numerous models. They combine the advantages of both filter and wrapper methods: they consider feature interactions (like wrapper methods) but are generally faster. The selection process is 'embedded' within the model's learning algorithm itself.
The core idea is that the model algorithm has a built-in mechanism that penalizes complexity or assigns importance scores to features during fitting. Features that are deemed less influential according to the model's criteria are either given very small weights (effectively ignored) or assigned zero weight altogether, removing them from the final model.
We will look closely at two popular categories of embedded methods in the upcoming sections:
Regularization Methods: Techniques like L1 regularization (used in Lasso regression) add a penalty to the model's loss function based on the magnitude of the coefficients. The L1 penalty encourages sparsity, meaning it tends to shrink the coefficients of less important features exactly to zero. Features with zero coefficients are effectively selected out. The penalized loss function often looks something like:
$$\text{Loss} = \text{Original Loss} + \lambda \sum_{j=1}^{p} |\beta_j|$$

Here, $\beta_j$ represents the coefficient for the j-th feature, and $\lambda$ is the regularization strength. A larger $\lambda$ leads to more coefficients being shrunk to zero.
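To make this concrete, here is a minimal sketch of L1-based selection with scikit-learn's Lasso. The synthetic dataset, the alpha value (scikit-learn's name for the regularization strength), and the variable names are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which carry signal
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

# alpha corresponds to the regularization strength lambda in the penalty above
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# Features whose coefficients were driven exactly to zero are selected out
print("Coefficients:", np.round(lasso.coef_, 3))
print("Selected feature indices:", np.flatnonzero(lasso.coef_))
```

Increasing alpha pushes more coefficients to zero, mirroring the effect of a larger $\lambda$ in the penalized loss.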
Tree-Based Importance: Ensemble methods like Random Forests and Gradient Boosting naturally compute feature importance scores during their training. These scores typically measure how much a feature contributes to reducing impurity (e.g., Gini impurity or entropy in classification) or variance (in regression) across all the trees in the ensemble. Features with low importance scores can be considered less relevant and potentially removed.
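As a rough sketch of the tree-based route, the snippet below fits a Random Forest on synthetic data and keeps only the features whose importance exceeds the mean importance. The dataset, the number of trees, and the mean-importance threshold are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, 4 informative, 2 redundant
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_redundant=2, random_state=42)

forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# Impurity-based importance scores accumulated across all trees in the ensemble
importances = forest.feature_importances_
print("Importances:", importances.round(3))

# Keep features scoring above the mean importance (an illustrative threshold)
keep = importances > importances.mean()
X_reduced = X[:, keep]
print("Kept feature indices:", np.flatnonzero(keep))
print("Reduced shape:", X_reduced.shape)
```

Scikit-learn's SelectFromModel utility wraps this fit-then-threshold pattern for both coefficient-based and importance-based selection.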
While powerful, embedded methods are model-dependent; the selected features are optimized for the specific algorithm used (e.g., Lasso or Random Forest). The effectiveness can also depend on hyperparameter tuning, such as the regularization strength λ in Lasso.
The following sections will provide practical details on implementing L1 regularization and tree-based feature importance for feature selection with Scikit-learn.