Embedded methods integrate the feature selection process directly within the model training phase. Unlike filter methods that assess features independently or wrapper methods that repeatedly train a model on different subsets, embedded techniques perform feature selection as part of learning the model parameters. This approach can be computationally more efficient than wrapper methods and often captures feature interactions better than filter methods.
One of the most prominent examples of embedded feature selection is L1 regularization, commonly known as Lasso (Least Absolute Shrinkage and Selection Operator). Lasso is typically applied to linear models (like Linear Regression or Logistic Regression).
Before looking specifically at Lasso, let's briefly touch upon regularization. In machine learning, especially with linear models, we aim to find model parameters (coefficients) that minimize a loss function (like Mean Squared Error for regression). However, simply minimizing the training error can lead to complex models that overfit the training data and perform poorly on unseen data.
Regularization adds a penalty term to the loss function. This penalty discourages overly complex models, typically by constraining the magnitude of the coefficients. The model now tries to minimize:
$$\text{Loss Function} + \text{Penalty Term}$$

Lasso uses the L1 norm of the coefficient vector as its penalty term. For a linear model with coefficients $\beta = (\beta_1, \beta_2, ..., \beta_p)$ for $p$ features, the L1 norm is the sum of the absolute values of the coefficients:

$$||\beta||_1 = \sum_{j=1}^{p} |\beta_j|$$

The objective function for Lasso Regression (minimizing Mean Squared Error with an L1 penalty) becomes:

$$\hat{\beta} = \underset{\beta}{\arg\min} \left( \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \alpha \sum_{j=1}^{p} |\beta_j| \right)$$

Here:
- $y_i$ is the target value for the $i$-th of $n$ samples.
- $x_{ij}$ is the value of feature $j$ for sample $i$.
- $\beta_j$ is the coefficient for feature $j$.
- $\alpha \ge 0$ is the regularization strength, controlling the trade-off between fitting the data and shrinking the coefficients.
The crucial aspect of the L1 penalty is its tendency to shrink some coefficients to exactly zero. This happens because of the shape of the L1 constraint (a diamond in two dimensions, a multi-dimensional equivalent in higher dimensions) when optimizing the loss function. Features whose coefficients are shrunk to zero are effectively removed from the model.
This property makes Lasso an embedded feature selection technique: the process of fitting the model itself selects the relevant features by assigning non-zero coefficients only to them.
Illustration of how Lasso coefficients change as the regularization strength (alpha) increases. Notice how some coefficients (Feature 2, Feature 3, Feature 4) are shrunk to exactly zero at different alpha values, effectively removing them from the model. Feature 1 persists longer but also shrinks.
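The same pattern can be reproduced in a few lines. The sketch below fits Lasso at several alpha values on a small synthetic dataset (the dataset and names here are illustrative, not part of the worked example that follows) and prints the coefficients as they shrink toward zero:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
# Small illustrative dataset: 4 features, only 2 of them informative
X_demo, y_demo = make_regression(n_samples=100, n_features=4, n_informative=2,
                                 noise=10, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
# Increase the regularization strength and watch the coefficients shrink,
# with some reaching exactly zero at larger alpha values
for alpha in [0.1, 1.0, 10.0, 50.0, 100.0]:
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    print(f"alpha={alpha:>6}: coefficients = {np.round(coef, 2)}")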
Scikit-learn provides the Lasso class for regression tasks. You can fit it like any other Scikit-learn model and then inspect the coef_ attribute to see which features were assigned non-zero coefficients.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
# Generate synthetic data
X, y = make_regression(n_samples=200, n_features=20, n_informative=10, noise=15, random_state=42)
# Make the last 5 features redundant (noisy, highly correlated copies of the first 5)
X[:,-5:] = X[:,:5] * np.random.uniform(0.9, 1.1, size=(X.shape[0], 5))
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features (Important for Lasso!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize and fit Lasso model
# Adjust alpha to control sparsity (higher alpha = more zeros)
alpha_value = 1.0
lasso = Lasso(alpha=alpha_value, random_state=42)
lasso.fit(X_train_scaled, y_train)
# Inspect coefficients
print(f"Lasso Coefficients (alpha={alpha_value}):")
print(lasso.coef_)
# Identify selected features (non-zero coefficients)
selected_features_mask = lasso.coef_ != 0
n_selected_features = np.sum(selected_features_mask)
print(f"\nNumber of features selected: {n_selected_features} out of {X_train_scaled.shape[1]}")
# Get indices of selected features
selected_indices = np.where(selected_features_mask)[0]
print(f"Indices of selected features: {selected_indices}")
# You can now use these selected features for subsequent modeling
# X_train_selected = X_train_scaled[:, selected_features_mask]
# X_test_selected = X_test_scaled[:, selected_features_mask]
Important Note: Lasso, like many regularized models and distance-based algorithms, is sensitive to the scale of the input features. It's standard practice to scale your data (e.g., using StandardScaler or MinMaxScaler) before applying Lasso.
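One quick way to see the effect is to compare how many coefficients Lasso keeps with and without scaling. This is only a sketch reusing X_train, X_train_scaled, and y_train from the example above; with features that already sit on similar scales the difference may be small, but with real-world features on very different scales it can be dramatic.
# Sketch: compare Lasso selection on unscaled vs. scaled training data
# (reuses X_train, X_train_scaled, y_train and alpha_value defined earlier)
lasso_raw = Lasso(alpha=alpha_value, random_state=42).fit(X_train, y_train)
lasso_std = Lasso(alpha=alpha_value, random_state=42).fit(X_train_scaled, y_train)
print("Non-zero coefficients without scaling:", np.sum(lasso_raw.coef_ != 0))
print("Non-zero coefficients with scaling:   ", np.sum(lasso_std.coef_ != 0))
# With unscaled features, a feature measured on a small scale needs a larger
# coefficient to have the same effect, so it is penalized more heavily and the
# selected feature set can differ from the scaled fit.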
You don't have to use Lasso solely for building the final prediction model. It can be used as a feature selection step within a larger pipeline. Scikit-learn's SelectFromModel meta-transformer makes this easy. It selects features based on importance weights (like coefficients from Lasso or feature importances from trees).
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
# --- Assuming X_train, y_train from the regression example above ---
# Create a pipeline: scaling -> Lasso (as selector) -> final model
# Use LassoCV (or LogisticRegressionCV for classification) to find a good alpha internally, or set it manually
lasso_selector_model = Lasso(alpha=1.0, random_state=42)
# For a classification task, you would instead use an L1-penalized classifier as the selector,
# e.g. LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
selector = SelectFromModel(estimator=lasso_selector_model, threshold="median")  # Keep features whose absolute coefficient is at least the median importance
pipeline = Pipeline([
    ('scaler', StandardScaler()),        # Ensure scaling happens within the pipeline
    ('feature_selection', selector),     # L1-based feature selection
    ('regression', LinearRegression())   # Final model; for classification use a classifier such as LogisticRegression
])
# Fit the pipeline (selector and final model are trained together)
pipeline.fit(X_train, y_train)
# Now 'pipeline' represents the full process including L1 feature selection
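Rather than fixing alpha by hand, you can let cross-validation choose it. Below is a minimal sketch using LassoCV on the scaled training data from the earlier regression example:
from sklearn.linear_model import LassoCV
# Sketch: choose alpha by cross-validation instead of setting it manually
# (reuses X_train_scaled and y_train from the regression example above)
lasso_cv = LassoCV(cv=5, random_state=42)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha found by cross-validation: {lasso_cv.alpha_:.4f}")
print(f"Features selected: {np.sum(lasso_cv.coef_ != 0)} out of {lasso_cv.coef_.size}")
# A LassoCV estimator can also be passed to SelectFromModel, combining
# alpha tuning and feature selection in a single step.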
Pros:
- Feature selection happens as part of model training, making it more efficient than wrapper methods.
- Produces sparse, interpretable models and works well on high-dimensional data where many features are irrelevant.

Cons:
- Sensitive to feature scaling, so the data must be standardized first.
- The regularization strength alpha must be tuned; cross-validation (e.g., LassoCV) is typically needed to find a good value.
- As a linear model, it captures only linear relationships and tends to keep just one feature from a group of highly correlated features.

Lasso provides a powerful and widely used embedded method for feature selection, particularly effective when dealing with high-dimensional datasets where many features might be irrelevant. Remember to scale your data and carefully tune the alpha parameter for optimal results.