Alright, let's get our hands dirty and see how to actually apply regularization techniques to a neural network model. In the previous sections, we discussed why we need regularization, primarily to combat overfitting and improve how well our model performs on unseen data. Now, we'll modify the core components of our network's training process, specifically the loss calculation and the forward/backward passes, to incorporate L2 regularization and dropout.
We'll assume you have a basic neural network structure and training loop set up, perhaps similar to the one built in the previous chapter. Our focus here will be on the modifications needed to add regularization.
Implementing L2 Regularization

L2 regularization adds a penalty to the loss function based on the squared magnitude of the weights. The goal is to keep the weights small, preventing the model from relying too heavily on any single input feature.
1. Modifying the Loss Function
The original loss function (e.g., Cross-Entropy or Mean Squared Error) measures the prediction error. For L2 regularization, we add a penalty term. The new L2-regularized loss $J_{reg}$ is:

$$J_{reg} = J_{original} + \frac{\lambda}{2m} \sum_{l} \|W^{[l]}\|_F^2$$

Where:

- $J_{original}$ is the unregularized loss.
- $\lambda$ is the regularization hyperparameter controlling the strength of the penalty.
- $m$ is the batch size.
- $\|W^{[l]}\|_F^2$ is the squared Frobenius norm of the weight matrix of layer $l$, i.e., the sum of the squares of its entries.
In your code, when calculating the total loss for a batch, you would compute the original loss and then add this penalty term. You'd need access to all the weight matrices in your network.
# Example: Calculating L2 regularization cost
# Assuming 'parameters' is a dictionary containing weight matrices W1, W2, ...
# and 'lambd' is the regularization hyperparameter
# and 'm' is the batch size
import numpy as np
l2_cost = 0
num_layers = len(parameters) // 2 # Assuming parameters are W1, b1, W2, b2, ...
for l in range(1, num_layers + 1):
    W = parameters['W' + str(l)]
    l2_cost += np.sum(np.square(W))  # Sum of squared weights for layer l
l2_cost = (lambd / (2 * m)) * l2_cost
total_cost = original_cost + l2_cost
2. Modifying the Gradient Calculation (Backpropagation)
Since the loss function has changed, the gradients with respect to the weights also change. The derivative of the L2 penalty term with respect to a weight matrix $W^{[l]}$ is $\frac{\lambda}{m} W^{[l]}$. This term needs to be added to the original gradient calculation for each weight matrix during backpropagation.
The gradient used in the weight update becomes:

$$\frac{\partial J_{reg}}{\partial W^{[l]}} = \frac{\partial J_{original}}{\partial W^{[l]}} + \frac{\lambda}{m} W^{[l]}$$

So, in your backpropagation implementation, after calculating the original gradient $dW^{[l]}$ (which is $\frac{\partial J_{original}}{\partial W^{[l]}}$), you simply add the regularization term:
# Example: Modifying gradient calculation for W[l] during backprop
# Assuming dW_original is the gradient calculated without regularization
dW_regularized = dW_original + (lambd / m) * parameters['W' + str(l)]
# Use dW_regularized in the parameter update step
# parameters['W' + str(l)] = parameters['W' + str(l)] - learning_rate * dW_regularized
Remember, the bias gradients $db^{[l]}$ are typically not regularized, so their calculation remains unchanged.
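To see how this fits together across the whole network, here is a minimal sketch of a parameter update step that folds in the L2 term for every weight matrix. The `grads` dictionary (keyed as dW1, db1, dW2, ...) and the function name are illustrative assumptions, not part of the code above:

# A minimal sketch of a full update step with L2 regularization.
# Assumes 'grads' holds the unregularized gradients keyed as dW1, db1, dW2, ...
# mirroring the 'parameters' dictionary used earlier (an assumption, not a fixed API).
def update_parameters_with_l2(parameters, grads, lambd, m, learning_rate):
    num_layers = len(parameters) // 2
    for l in range(1, num_layers + 1):
        # Add the L2 term (lambd / m) * W to the weight gradient only
        dW = grads['dW' + str(l)] + (lambd / m) * parameters['W' + str(l)]
        db = grads['db' + str(l)]  # Bias gradients are left unregularized
        parameters['W' + str(l)] -= learning_rate * dW
        parameters['b' + str(l)] -= learning_rate * db
    return parameters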
Implementing Dropout

Dropout works differently. It doesn't change the loss function directly but modifies the network structure itself during training. In each forward pass during training, dropout randomly "drops" (sets to zero) a fraction of the neurons' outputs in a layer.
1. Modifying the Forward Pass
During the forward pass for a specific layer (usually a hidden layer), after calculating the activations (e.g., after applying ReLU or Tanh), you perform the following steps:

1. Create a random binary mask D of the same shape as the layer's activation output A, where each element is 1 (with probability keep_prob) or 0 (with probability 1 - keep_prob). keep_prob is the probability of keeping a neuron active.
2. Multiply A element-wise by the mask D. This effectively sets some activations to zero (A_dropped = A * D).
3. Divide the result by keep_prob (A_scaled = A_dropped / keep_prob). This scaling ensures that the expected output of the layer remains the same during training as it would be during testing (when dropout is turned off). This technique, called inverted dropout, is the standard practice.

# Example: Applying dropout during forward propagation for layer l
# A_prev is the activation from the previous layer
# W, b are parameters for the current layer
# activation_func is the activation function (e.g., relu)
# keep_prob is the probability of keeping a neuron
Z = np.dot(W, A_prev) + b
A = activation_func(Z) # Activation output
# Apply Dropout
D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob # Create mask
A = A * D # Apply mask
A = A / keep_prob # Scale (Inverted Dropout)
# Store the mask D along with Z and A in the cache for backpropagation
# cache = (..., D, ...)
2. Modifying the Backward Pass
During backpropagation, the gradient calculation for the layer where dropout was applied needs to account for the mask D. The gradient flowing back must be shut off for the neurons that were dropped. You simply re-apply the same mask D (which you stored in the cache during the forward pass) to the gradient of the activation dA before calculating the gradients dW, db, and dA_prev. Remember to also scale by keep_prob.
# Example: Applying dropout mask during backpropagation for layer l
# dA is the gradient of the cost with respect to the activation A of the current layer
# cache contains the mask D used in the forward pass for this layer
# keep_prob is the same probability used in forward pass
# Retrieve mask from cache
# D = cache[some_index] # Get the mask used for this layer
dA = dA * D # Apply the mask used during forward pass
dA = dA / keep_prob # Apply scaling
# Continue with the rest of the backpropagation steps using the modified dA
# dZ = dA * activation_gradient(Z)
# dW = ...
# db = ...
# dA_prev = ...
Important: Dropout should only be active during training. When evaluating your model or making predictions on new data, you must turn dropout off. This means you don't apply the mask or the scaling during the forward pass in evaluation/prediction mode. Using inverted dropout makes this straightforward, as you don't need to modify weights post-training.
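One common way to handle this is a training flag in the forward pass, so the mask and scaling are skipped at evaluation time. The sketch below reuses the W, b, activation_func, and keep_prob names from the earlier snippets; the function name and the training argument are illustrative assumptions:

# A minimal sketch: inverted dropout applied only when training=True.
# The function name and 'training' flag are assumptions for illustration.
def dense_forward_with_dropout(A_prev, W, b, activation_func, keep_prob, training=True):
    Z = np.dot(W, A_prev) + b
    A = activation_func(Z)
    D = None
    if training:
        # Mask and rescale only during training (inverted dropout)
        D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob
        A = (A * D) / keep_prob
    # At evaluation time (training=False), A is returned untouched:
    # no mask, no scaling, and no post-training weight adjustment needed
    cache = (Z, D)
    return A, cache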
How do you know if regularization is working? Monitor your training and validation loss (and accuracy) over epochs. Without regularization, you might see the training loss continuously decrease while the validation loss starts to increase or stagnate after some point. This gap indicates overfitting.
Applying L2 regularization or dropout should ideally:

- Increase the training loss slightly, since the model can no longer fit the training data as aggressively.
- Decrease the validation loss, or at least stop it from rising.
- Reduce the gap between training and validation performance, indicating better generalization.

Figure: Hypothetical comparison of training and validation loss with and without regularization. Regularization often increases training loss slightly but decreases validation loss, reducing the gap and indicating better generalization.
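If you want to check this in your own training loop, a simple pattern is to record both losses each epoch and watch the gap. This is a generic sketch; train_one_epoch and compute_loss are hypothetical helpers standing in for whatever your implementation provides:

# A generic sketch for tracking the train/validation gap per epoch.
# 'train_one_epoch' and 'compute_loss' are hypothetical placeholder helpers.
train_losses, val_losses = [], []
for epoch in range(num_epochs):
    parameters = train_one_epoch(parameters, X_train, Y_train)
    train_losses.append(compute_loss(parameters, X_train, Y_train))
    val_losses.append(compute_loss(parameters, X_val, Y_val))
    gap = val_losses[-1] - train_losses[-1]
    print(f"epoch {epoch}: train={train_losses[-1]:.4f} "
          f"val={val_losses[-1]:.4f} gap={gap:.4f}")
# A persistently growing gap suggests overfitting; regularization should shrink it.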
This practice involves modifying specific parts of your network's implementation. Try adding L2 regularization first by adjusting the cost and gradient calculations. Then, experiment with adding dropout during the forward pass, make sure you handle it correctly during backpropagation, and turn it off during evaluation. Observe the impact on your training and validation metrics, and try different values for the regularization strength $\lambda$ and the dropout keep_prob.