After implementing and evaluating initial versions of supervised learning models, the next logical step is optimization. Most machine learning algorithms have settings, known as hyperparameters, that are not learned directly from the data during training but are set beforehand. Examples include the regularization strength C in Logistic Regression or Support Vector Machines, the number of trees (n_estimators) in a Random Forest, or the learning rate in Gradient Boosting. The choice of hyperparameters can significantly influence model performance, generalization ability, and training time. Finding a good combination of these settings is often essential for building high-performing models.
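To make the distinction concrete, here is a minimal sketch using LogisticRegression (one of the models mentioned above): the hyperparameter C is chosen by us when the estimator is constructed, while the coefficients are learned from the data during fit.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Small synthetic dataset for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
# Hyperparameter: C (regularization strength) is set before training
clf = LogisticRegression(C=0.1)
# Parameters: the coefficients are learned from the data during fit
clf.fit(X, y)
print(clf.coef_)  # learned weights, not set by hand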
Manually adjusting hyperparameters through trial and error is possible but quickly becomes inefficient and unreliable as the number of hyperparameters or the range of possible values grows. It's easy to miss optimal combinations or spend excessive time exploring suboptimal regions of the hyperparameter space. Therefore, systematic approaches are preferred. Two widely used techniques for automated hyperparameter tuning are Grid Search and Randomized Search, often used in conjunction with cross-validation to ensure the chosen hyperparameters generalize well.
Grid Search is perhaps the most straightforward automated tuning method. It performs an exhaustive search over a specified subset of the hyperparameter space. You define a "grid" of possible values for each hyperparameter you want to tune, and Grid Search trains and evaluates a model for every possible combination of these values.
For example, if you are tuning a Random Forest and want to explore:
- n_estimators: [100, 200, 300]
- max_depth: [5, 10, None]
- min_samples_split: [2, 4]
Grid Search would train and evaluate 3 × 3 × 2 = 18 different models.
Typically, the evaluation for each combination is done using cross-validation. This provides a more stable estimate of the performance for that hyperparameter set, reducing the risk of overfitting to a specific train-test split. The combination yielding the best average cross-validation score is then selected as the optimal set of hyperparameters.
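The loop that Grid Search performs can be written out by hand. Below is a minimal sketch, assuming the small Random Forest grid above and using scikit-learn's ParameterGrid helper, that cross-validates every combination and keeps the one with the best mean score:
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Synthetic data for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
# The small example grid from above: 3 x 3 x 2 = 18 combinations
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 4]
}
results = []
for params in ParameterGrid(param_grid):
    model = RandomForestClassifier(random_state=42, **params)
    # 5-fold cross-validation gives a stable estimate for this combination
    scores = cross_val_score(model, X, y, cv=5)
    results.append((scores.mean(), params))
# Select the combination with the highest mean cross-validation accuracy
best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)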
Scikit-learn provides the GridSearchCV class for this purpose. It takes an estimator (like a classifier or regressor), a parameter grid (defined as a dictionary), and cross-validation settings.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Define the model
rf = RandomForestClassifier(random_state=42)
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
# Instantiate GridSearchCV
# cv=5 means 5-fold cross-validation
# n_jobs=-1 uses all available CPU cores
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1, scoring='accuracy')
# Fit the grid search to the data
grid_search.fit(X, y)
# Print the best parameters and the best score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# The best estimator is already refitted on the entire dataset
best_rf_model = grid_search.best_estimator_
In this example, GridSearchCV will evaluate 3 × 3 × 3 = 27 combinations. The verbose parameter controls how much information is printed during the search. After fitting, grid_search.best_params_ holds the dictionary of the best hyperparameter combination found, and grid_search.best_score_ contains the corresponding mean cross-validation score. The grid_search.best_estimator_ attribute provides the model refitted on the entire dataset using these best parameters.
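Beyond the single best combination, the fitted search object also exposes cv_results_, which stores the mean and standard deviation of the cross-validation score for every combination tried. A short sketch of inspecting it, assuming the grid_search object fitted above and that pandas is available:
import pandas as pd
# Collect the per-combination results into a DataFrame for inspection
results_df = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
# Show the top-ranked combinations first
print(results_df[cols].sort_values('rank_test_score').head())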
Randomized Search offers a more efficient alternative to the exhaustive approach of Grid Search. Instead of trying all combinations, it samples a fixed number (n_iter) of hyperparameter combinations from specified statistical distributions or lists.
For each hyperparameter, you can provide either a list of values (like in Grid Search) or, more powerfully, a distribution from which to sample (e.g., a continuous uniform distribution for a parameter like the learning rate, or a discrete uniform distribution such as scipy.stats.randint for an integer parameter like n_estimators).
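As a quick illustration of what sampling from a distribution means here, the scipy.stats objects used in the example below expose an rvs method that draws random values. A minimal sketch:
from scipy.stats import randint, uniform, loguniform
# Discrete uniform over the integers 100..499 (upper bound exclusive)
print(randint(100, 500).rvs(5, random_state=42))
# Continuous uniform over [0.0, 0.3], e.g. for a learning rate
print(uniform(0.0, 0.3).rvs(5, random_state=42))
# Log-uniform over [1e-4, 1e-1], useful for values spanning orders of magnitude
print(loguniform(1e-4, 1e-1).rvs(5, random_state=42))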
The core idea, supported by research (Bergstra & Bengio, 2012), is that for many problems, only a few hyperparameters significantly impact performance. Randomized Search spends more time exploring potentially important values across different hyperparameters rather than exhaustively checking all combinations of less important ones. With the same computational budget, Randomized Search can often explore a wider range of values and find better or equally good models compared to Grid Search.
(Figure: Grid Search's systematic point evaluation versus Randomized Search's stochastic sampling over a hypothetical two-dimensional hyperparameter space. With the same number of evaluations, 9 in this case, Randomized Search covers a more diverse set of value combinations.)
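Since the figure is essentially a scatter of evaluation points, the comparison can be reproduced with a short matplotlib sketch; the two-dimensional parameter space and the budget of 9 evaluations are the hypothetical setup assumed in the caption above:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
# Grid Search: a fixed 3 x 3 grid of evaluation points
gx, gy = np.meshgrid([0.25, 0.5, 0.75], [0.25, 0.5, 0.75])
# Randomized Search: 9 points sampled uniformly at random
rx, ry = rng.uniform(0, 1, 9), rng.uniform(0, 1, 9)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
ax1.scatter(gx, gy)
ax1.set_title('Grid Search (9 evaluations)')
ax2.scatter(rx, ry)
ax2.set_title('Randomized Search (9 evaluations)')
for ax in (ax1, ax2):
    ax.set_xlabel('hyperparameter 1')
ax1.set_ylabel('hyperparameter 2')
plt.tight_layout()
plt.show()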
Scikit-learn provides RandomizedSearchCV, which works similarly to GridSearchCV but requires defining parameter distributions and the number of iterations (n_iter).
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import randint, uniform
# Generate synthetic data (same as before)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Define the model
rf = RandomForestClassifier(random_state=42)
# Define the parameter distributions or lists to sample from
# Use distributions for potentially continuous or wide-ranging parameters
param_dist = {
    'n_estimators': randint(100, 500),   # Sample integers between 100 and 499
    'max_depth': [5, 10, 15, 20, None],  # Sample from this list
    'min_samples_split': randint(2, 11), # Sample integers between 2 and 10
    'min_samples_leaf': randint(1, 11),  # Sample integers between 1 and 10
    'bootstrap': [True, False]           # Sample from this list
}
# Instantiate RandomizedSearchCV
# n_iter controls the number of parameter settings sampled
# Increase n_iter for more thorough search, decrease for speed
random_search = RandomizedSearchCV(estimator=rf,
                                   param_distributions=param_dist,
                                   n_iter=50,  # Number of parameter settings that are sampled
                                   cv=5,
                                   n_jobs=-1,
                                   verbose=1,
                                   scoring='accuracy',
                                   random_state=42)  # for reproducible results
# Fit the randomized search to the data
random_search.fit(X, y)
# Print the best parameters and the best score
print(f"Best parameters found: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")
# The best estimator
best_rf_model_random = random_search.best_estimator_
Here, param_dist uses scipy.stats.randint to define uniform sampling over a range of integers for n_estimators, min_samples_split, and min_samples_leaf. max_depth and bootstrap use lists, from which values are sampled uniformly. n_iter=50 means 50 different combinations will be sampled and evaluated using 5-fold cross-validation (50 × 5 = 250 model fits in total).
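Note that the best cross-validation score is computed on the same data used to choose the hyperparameters, so it can be slightly optimistic. A common follow-up, sketched here under the assumption that a test set is held out before the search, is to evaluate the refitted best estimator on data the search never saw:
from sklearn.model_selection import train_test_split
# Hold out a test set before running the search (assumed workflow)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Run the search on the training portion only
random_search.fit(X_train, y_train)
# Evaluate the refitted best estimator on the untouched test set
test_accuracy = random_search.best_estimator_.score(X_test, y_test)
print(f"Held-out test accuracy: {test_accuracy:.4f}")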
A few practical considerations when using these tools:
- The outcome of Randomized Search depends on n_iter and the random sampling process; setting random_state, as in the example above, makes the results reproducible.
- Log-uniform distributions (scipy.stats.loguniform) are often suitable for parameters like learning rates or regularization strengths that span several orders of magnitude.
- Adjust the size of the grid (Grid Search), n_iter (Randomized Search), and the number of cross-validation folds (cv) to fit your time constraints. Using n_jobs=-1 parallelizes the process across available CPU cores, significantly speeding up the search.
- Tune within a Pipeline that includes preprocessing steps (like scaling or encoding). This prevents data leakage from the validation folds into the hyperparameter tuning process for preprocessing steps (e.g., fitting a scaler on the whole dataset before CV). You can define hyperparameters for steps within the pipeline using the stepname__parameter syntax (e.g., randomforestclassifier__n_estimators), as shown in the sketch below.
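Here is a minimal sketch of that pipeline-based tuning, reusing the synthetic data and Random Forest from earlier; the scaler is included purely to illustrate the step-name syntax (make_pipeline derives step names from the lowercased class names), since tree models do not actually require scaling:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# The scaler is fitted only on the training folds inside each CV split
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
# Step names are the lowercased class names, hence the prefix below
param_grid = {
    'randomforestclassifier__n_estimators': [100, 200],
    'randomforestclassifier__max_depth': [5, 10, None]
}
grid_search_pipe = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
grid_search_pipe.fit(X, y)
print(grid_search_pipe.best_params_)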
By employing Grid Search or Randomized Search, you move from manual, potentially biased tuning to a more systematic and reproducible method for optimizing your models. While more advanced techniques like Bayesian Optimization exist, Grid Search and Randomized Search are robust, widely used, and readily available tools that significantly improve the process of finding effective hyperparameter configurations for your supervised learning models.