As we discussed in the previous section, overfitting occurs when a model learns the training data too well, including its noise and specific patterns, leading to poor performance on new, unseen data. Regularization techniques are designed to combat this by constraining the complexity of the model, encouraging it to learn more general patterns. Let's look at three common methods: L1, L2, and Dropout.
L1 and L2 regularization work by adding a penalty term to the model's loss function. This penalty is based on the magnitude of the network's weights (w). The idea is that models with excessively large weights are often more complex and prone to overfitting, as large weights can cause sharp changes in output for small changes in input. By penalizing large weights, we encourage the model to find simpler solutions that generalize better.
The modified loss function looks like this:
$$\text{New Loss} = \text{Original Loss} + \lambda \times \text{Regularization Term}$$
Here, λ (lambda) is the regularization strength hyperparameter. A larger λ imposes a stronger penalty.
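As a rough illustration of how λ scales the penalty, the sketch below evaluates the modified loss for a few values of λ. The function names (`regularized_loss`, `sum_abs`), the weight vector, and the original loss value of 0.40 are all made up for this example, and the sum of absolute weights is used only as a stand-in regularization term.

```python
import numpy as np


def sum_abs(weights):
    """Stand-in regularization term: the sum of absolute weight values."""
    return float(np.sum(np.abs(weights)))


def regularized_loss(original_loss, weights, lam, term_fn):
    """Return original_loss + lam * term_fn(weights)."""
    return original_loss + lam * term_fn(weights)


# Hypothetical values, chosen only to make the arithmetic easy to follow
weights = np.array([0.5, -1.5, 2.0])
original_loss = 0.40

for lam in (0.0, 0.01, 0.1):
    total = regularized_loss(original_loss, weights, lam, sum_abs)
    print(f"lambda={lam}: new loss = {total:.4f}")
```

With λ = 0 the loss is unchanged; as λ grows, the same weights cost more, so the optimizer has a stronger incentive to shrink them.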
L1 regularization adds a penalty proportional to the absolute value of the weights:
$$\text{L1 Penalty} = \lambda \sum_i |w_i|$$
A notable effect of L1 regularization is that it encourages sparsity. It tends to push some weights to become exactly zero, effectively performing a form of automatic feature selection by removing the influence of less important inputs.
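To make the sparsity effect concrete, here is a minimal NumPy sketch. It computes the L1 penalty for a made-up weight vector and then applies a soft-thresholding step, which is the update used by proximal-gradient methods for L1 penalties. This is one way to see why small weights land on exactly zero; it is not a description of how Keras applies the penalty internally, and the helper names and values are assumptions for illustration.

```python
import numpy as np


def l1_penalty(weights, lam):
    """L1 penalty: lam * sum_i |w_i|."""
    return lam * float(np.sum(np.abs(weights)))


def soft_threshold(weights, threshold):
    """Move every weight toward zero by `threshold`, stopping at exactly zero."""
    return np.sign(weights) * np.maximum(np.abs(weights) - threshold, 0.0)


# Hypothetical weights: two clearly useful ones and three near zero
weights = np.array([0.03, -0.8, 0.002, 1.2, -0.04])
lam = 0.05
learning_rate = 1.0  # assumed step size, so the threshold equals lam

print("L1 penalty:", l1_penalty(weights, lam))

shrunk = soft_threshold(weights, learning_rate * lam)
print("after one shrinkage step:", shrunk)
print("non-zero weights remaining:", np.count_nonzero(shrunk))
```

Every weight is pulled toward zero by the same amount regardless of its size, so the three small weights are removed entirely while the two large ones are only slightly reduced.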
L2 regularization adds a penalty proportional to the square of the weights:
$$\text{L2 Penalty} = \lambda \sum_i w_i^2$$
L2 regularization is generally more common in deep learning than L1. It encourages weights to be small and distributed, preventing any single weight from becoming excessively large, but it doesn't typically force weights to become exactly zero. This is often referred to as "weight decay" because, during gradient descent, it adds a term that pushes weights towards zero.
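The "weight decay" behaviour can be sketched with a single plain gradient descent update. The gradient of λ Σ w_i² with respect to w_i is 2λw_i, so each step subtracts a fraction of the weight itself. In the example below the gradient of the original loss is set to zero on purpose, so only the decay is visible. The function name, learning rate, and values are assumptions for illustration.

```python
import numpy as np


def sgd_step_with_l2(weights, data_grad, lam, learning_rate):
    """One plain SGD step on: original loss + lam * sum(w**2)."""
    penalty_grad = 2.0 * lam * weights  # gradient of lam * sum(w**2)
    return weights - learning_rate * (data_grad + penalty_grad)


initial_weights = np.array([2.0, -1.0, 0.5])
weights = initial_weights.copy()
zero_grad = np.zeros_like(weights)  # pretend the original loss is flat here
lam, learning_rate = 0.1, 0.5

for step in range(3):
    weights = sgd_step_with_l2(weights, zero_grad, lam, learning_rate)
    print(f"step {step + 1}: {weights}")
```

Each step multiplies every weight by (1 - 2 × learning_rate × λ), which is 0.9 with these values. The weights shrink geometrically but never reach exactly zero, in contrast to the sparsity that L1 produces.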
You can easily add L1 or L2 regularization penalties to Keras layers using the kernel_regularizer, bias_regularizer, and activity_regularizer arguments. The kernel_regularizer penalizes the layer's main weights (the kernel), while bias_regularizer penalizes the bias terms. activity_regularizer penalizes the layer's output (activation), which is less common.
Here's how you might add L2 regularization to a Dense layer:
import os

# Select the PyTorch backend; this must happen before Keras is imported
os.environ["KERAS_BACKEND"] = "torch"

import numpy as np
import torch  # Not used directly; importing it confirms PyTorch is installed
import keras
from keras import layers
from keras import regularizers

# Example L2 regularization
model = keras.Sequential([
    keras.Input(shape=(784,)),  # Explicit input instead of input_shape on the layer
    layers.Dense(
        64,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.001),  # Apply L2 penalty to weights
    ),
    layers.Dense(
        10,
        activation='softmax',
        kernel_regularizer=regularizers.l2(0.001),  # Apply L2 penalty here too
    ),
])

# You can also use L1 or combined L1/L2
# regularizers.l1(0.01)
# regularizers.l1_l2(l1=0.01, l2=0.001)

# Compile the model as usual
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()

# --- Optional: Verify backend and run a quick test ---
print(f"\nKeras backend confirmed: {keras.backend.backend()}")
print("Running a quick test step with dummy data...")
x_test = np.random.rand(10, 784).astype(np.float32)
y_test = keras.utils.to_categorical(np.random.randint(10, size=10), num_classes=10)
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test evaluation complete (Loss: {loss:.4f}, Accuracy: {acc:.4f})")
The value passed to regularizers.l1(), regularizers.l2(), or regularizers.l1_l2() is the regularization factor λ. Choosing the right value often requires experimentation and is typically determined through hyperparameter tuning. Values often range from 0.1 down to 0.0001.
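The tuning loop itself is simple: fit once per candidate λ and keep the value with the best validation score. The sketch below shows that pattern on a deliberately small stand-in problem, ridge regression (linear regression with an L2 penalty) solved in closed form with NumPy, rather than on a neural network. The synthetic data, the helper names, and the candidate grid are all assumptions for illustration. The grid reuses the range mentioned above, but a suitable range depends on the model and on how the loss is scaled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 30 features, only the first 5 actually matter
n_features = 30
true_w = np.zeros(n_features)
true_w[:5] = [2.0, -1.0, 0.5, 1.5, -2.0]


def make_data(n_samples):
    """Draw random inputs and noisy targets from the hypothetical linear model."""
    X = rng.normal(size=(n_samples, n_features))
    y = X @ true_w + rng.normal(scale=1.0, size=n_samples)
    return X, y


X_train, y_train = make_data(40)   # small training set, easy to overfit
X_val, y_val = make_data(200)      # held-out validation set


def fit_ridge(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


candidates = [0.1, 0.01, 0.001, 0.0001]
val_mse = {}
for lam in candidates:
    w = fit_ridge(X_train, y_train, lam)
    val_mse[lam] = float(np.mean((X_val @ w - y_val) ** 2))
    print(f"lambda={lam:<7} validation MSE={val_mse[lam]:.3f}")

best_lam = min(val_mse, key=val_mse.get)
print("best lambda on the validation set:", best_lam)
```

With a Keras model the loop has the same shape: rebuild the model with each candidate λ, train it, and compare the validation metric.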
Dropout is a conceptually different but highly effective regularization technique specifically developed for neural networks. Instead of modifying the loss function, Dropout modifies the network itself during training.
At each training step, for a given layer where Dropout is applied, a random fraction of its output units (neurons) are temporarily "dropped out" – meaning their outputs are set to zero. The fraction of units to drop is determined by the dropout rate, a hyperparameter usually set between 0.1 and 0.5.
Illustration of dropout. During different training steps, different neurons (shown grayed out and dashed) are randomly deactivated.
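Why does this work? Because any unit may be missing on a given training step, the network cannot rely on a particular neuron always being present. This discourages fragile co-adaptations between neurons and pushes the network toward redundant, more robust features. Training with dropout can also be viewed as training a large ensemble of thinned sub-networks that share weights.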
During inference (when evaluating or making predictions), dropout is turned off and all neurons are used. Because more neurons are active than during training, activations must be rescaled so that their expected magnitude matches. In the original formulation, outputs are scaled down at inference by the keep probability (1 - dropout rate). The common implementation, called "inverted dropout", instead scales the surviving activations up by 1 / (1 - dropout rate) during training, so nothing needs to change at inference.
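The following NumPy sketch shows inverted dropout as a standalone function, assuming a hypothetical helper named `inverted_dropout` and an all-ones activation matrix chosen so the effect is easy to read. It is a simplified illustration of the idea, not the code Keras runs internally.

```python
import numpy as np


def inverted_dropout(activations, rate, training, rng):
    """Zero a random fraction `rate` of units and rescale the survivors.

    At inference (training=False) the input is returned unchanged.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit is kept
    return activations * mask / keep_prob             # scale survivors up


rng = np.random.default_rng(42)
activations = np.ones((1000, 64))  # hypothetical layer outputs

train_out = inverted_dropout(activations, rate=0.3, training=True, rng=rng)
infer_out = inverted_dropout(activations, rate=0.3, training=False, rng=rng)

print("fraction dropped in training:", float((train_out == 0).mean()))
print("mean activation in training: ", float(train_out.mean()))
print("mean activation at inference:", float(infer_out.mean()))
```

Roughly 30% of the units are zeroed during training, yet the mean activation stays close to its inference value because each surviving unit is multiplied by 1 / 0.7.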
Dropout is implemented in Keras using the Dropout layer. You insert it between the layers whose outputs you want to regularize.
# Standalone Keras import, matching the L2 example above
# (the backend is selected through the KERAS_BACKEND environment variable)
import keras
from keras import layers

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),  # Apply dropout with a rate of 30%
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),  # Apply dropout again
    layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
The argument to the Dropout layer is the dropout rate (the fraction of units to drop). A common practice is to apply dropout after activation functions, typically in the fully connected (dense) layers of the network. The optimal rate often needs tuning.
It's also common to combine techniques, for example, using both L2 weight regularization and Dropout in the same network. The regularization strengths (λ for L1/L2, and the dropout rate) are important hyperparameters that usually need to be tuned based on validation set performance.
By applying these techniques, you can significantly reduce overfitting and build models that generalize better to new data, which is a fundamental goal in machine learning.
© 2025 ApX Machine Learning