While deterministic algorithms follow a predictable path given the same input, randomized algorithms incorporate an element of chance into their logic. This might seem counterintuitive for tasks requiring precision, but introducing randomness is a powerful technique in machine learning, often used to improve robustness, escape local optima, enhance generalization, and handle large-scale problems more efficiently. Instead of guaranteeing the exact same result every time, these algorithms often provide results that are correct or optimal with high probability, or they use randomness to explore the solution space more effectively.
Randomness isn't just about unpredictability; it serves specific purposes in ML algorithms:
- Reducing variance and improving robustness, for example by averaging many models trained on different random samples of the data.
- Regularizing models so they generalize better, for example by randomly perturbing a network during training.
- Making training tractable on large datasets and helping optimization escape poor local optima, for example by estimating gradients from random subsets of the data.
Let's look at how randomization is applied in some significant machine learning techniques.
One of the most prominent uses of randomization in ML is bootstrapping, which forms the basis of Bagging (Bootstrap Aggregating) and popular ensemble models like Random Forests.
Bootstrapping involves creating multiple new datasets from an original dataset by sampling with replacement. Each new dataset has the same size as the original, but because sampling is done with replacement, some data points from the original set might appear multiple times in a bootstrap sample, while others might not appear at all.
Creating multiple bootstrap samples from an original dataset via sampling with replacement. Each sample is used to train a separate model in an ensemble.
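To make sampling with replacement concrete, here is a minimal NumPy sketch using a hypothetical toy dataset of ten points. It draws a few bootstrap samples and shows that each one contains duplicates while missing some of the original points.

import numpy as np

rng = np.random.default_rng(0)   # seeded generator so the sketch is reproducible
data = np.arange(10)             # toy "dataset": the points 0 through 9

for i in range(3):
    # Draw len(data) indices with replacement: duplicates are expected,
    # and some original points will not appear at all in this sample.
    idx = rng.integers(0, len(data), size=len(data))
    bootstrap_sample = data[idx]
    missing = np.setdiff1d(data, bootstrap_sample)
    print(f"Bootstrap sample {i}: {bootstrap_sample}, missing originals: {missing}")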
In Random Forests, this process is taken a step further:
- Each decision tree in the forest is trained on its own bootstrap sample of the training data.
- At each split within a tree, only a random subset of the features is considered as candidate split variables, rather than all of them.
The combination of these two sources of randomness makes Random Forests robust against overfitting and generally yields high predictive accuracy. The final prediction is typically made by averaging the predictions (for regression) or taking a majority vote (for classification) across all trees in the forest.
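As a sketch, scikit-learn's RandomForestClassifier exposes both sources of randomness directly: bootstrap controls per-tree sampling with replacement and max_features controls how many features are considered at each split. The data and parameter values below are illustrative, not recommendations.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic data: 100 examples, 10 features, binary labels
rng = np.random.default_rng(0)
X = rng.random((100, 10))
y = rng.integers(0, 2, 100)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    bootstrap=True,       # source 1: each tree sees a bootstrap sample of the data
    max_features="sqrt",  # source 2: a random subset of features at each split
    random_state=0,       # seed so the randomness is reproducible
)
forest.fit(X, y)

# For classification, the forest aggregates its trees by majority vote
print(forest.predict(X[:5]))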
Dropout is a regularization technique specifically designed for neural networks. During training, for each training example (or mini-batch), dropout randomly "drops" (sets to zero) a fraction of the neuron activations in a layer. The probability p of dropping a neuron is a hyperparameter (often around 0.5).
A neural network layer during training with dropout applied. Neurons N2 and N4 (right side) are randomly deactivated for this particular training step.
This random deactivation prevents neurons from becoming overly co-dependent on each other. It forces the network to learn more redundant representations, making it less sensitive to the presence or absence of specific neurons and improving its ability to generalize to unseen data. At test time, dropout is typically turned off, and the activations are scaled down to account for the fact that more neurons are active than during training.
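The following is a minimal NumPy sketch of the idea described above, not a framework implementation: during training a random binary mask zeroes a fraction p of the activations, and at test time all units stay active but are scaled by the keep probability 1 - p.

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                          # dropout probability: fraction of units to drop
activations = rng.random(8)      # hypothetical activations of one layer

# Training step: zero out each unit independently with probability p
mask = rng.random(activations.shape) >= p
train_output = activations * mask

# Test time: keep every unit, but scale down by the keep probability (1 - p)
# so downstream layers see the same expected magnitude as during training
test_output = activations * (1 - p)

print("mask:        ", mask.astype(int))
print("train output:", train_output)
print("test output: ", test_output)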
While iterative optimization methods like Gradient Descent are primarily deterministic given a starting point and a dataset, the Stochastic Gradient Descent (SGD) variant introduces randomness. Instead of computing the gradient using the entire dataset (which is computationally expensive), SGD computes the gradient based on a single randomly selected data point or a small random subset (mini-batch) at each iteration.
This stochasticity adds noise to the gradient updates. While this noise can make the convergence path less smooth compared to batch gradient descent, it has a beneficial side effect: it can help the optimization process escape shallow local minima and potentially find better, deeper minima in the loss landscape. The inherent randomness in selecting mini-batches makes SGD a randomized algorithm in practice.
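Below is a minimal sketch of mini-batch SGD on a hypothetical one-dimensional least-squares problem. The stochastic part is the random mini-batch drawn at each step: the gradient is estimated from a small subset of the data instead of the full dataset.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)
y = 3.0 * X + rng.normal(scale=0.5, size=200)   # synthetic data with true slope 3.0

w, lr, batch_size = 0.0, 0.1, 16
for step in range(200):
    # Randomly select a mini-batch at each iteration
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    grad = 2.0 * np.mean((w * xb - yb) * xb)    # gradient of mean squared error w.r.t. w
    w -= lr * grad                              # noisy but cheap update

print(f"Estimated slope after SGD: {w:.3f}")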
A consequence of using randomized algorithms is that running the same algorithm twice on the same data might produce slightly different results (e.g., slightly different model weights, different trees in a forest). For development, debugging, and comparing experiments, it's often necessary to ensure reproducibility.
This is achieved by setting the seed of the pseudo-random number generator (PRNG) used by the algorithm. Libraries like NumPy, Scikit-learn, TensorFlow, and PyTorch provide functions to set the random seed. Setting the seed to a specific integer ensures that the sequence of "random" numbers generated will be the same every time the code is run, making the results of the randomized algorithm reproducible.
import numpy as np
import sklearn.ensemble
import sklearn.model_selection
# Set a seed for NumPy's random number generator
np.random.seed(42)
# Operations using np.random will now be deterministic
random_indices = np.random.choice(10, 5, replace=False)
print(f"Random indices with seed 42: {random_indices}")
# Many ML libraries use NumPy's RNG or have their own seed parameter
# Example with RandomForestClassifier
# Setting 'random_state' ensures reproducibility of bootstrapping and feature selection
X = np.random.rand(100, 10)
y = np.random.randint(0, 2, 100)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
X, y, test_size=0.3, random_state=42 # Seed for the split
)
model = sklearn.ensemble.RandomForestClassifier(n_estimators=10, random_state=42) # Seed for the forest
model.fit(X_train, y_train)
print(f"Model score: {model.score(X_test, y_test)}")
# Running this code multiple times will produce the exact same output.
# Changing the seed (or not setting it) will lead to different results.
In summary, randomized algorithms are not a compromise on correctness but a strategic choice in machine learning. They leverage randomness to build models that are often more robust, generalize better, and can be trained more efficiently, especially on large datasets. Techniques like bootstrapping, dropout, and the stochasticity in SGD are fundamental tools for building effective modern machine learning systems. Understanding when and why randomness is used helps in selecting appropriate algorithms and interpreting their behavior.