To combat overfitting and improve neural network model performance on unseen data, regularization techniques are applied by modifying the core components of the network's training process. Specifically, this involves adjusting the loss calculation and the forward/backward passes to incorporate L2 regularization and dropout.

We'll assume you have a basic neural network structure and training loop set up, perhaps similar to the one built in the previous chapter. Our focus here will be on the modifications needed to add regularization.

## Implementing L2 Regularization

L2 regularization adds a penalty to the loss function based on the squared magnitude of the weights. The goal is to keep the weights small, preventing the model from relying too heavily on any single input feature.

### 1. Modifying the Loss Function

The original loss function (e.g., cross-entropy or mean squared error) measures the prediction error. For L2 regularization, we add a penalty term. The new L2-regularized loss $J_{reg}$ is:

$$ J_{reg} = J_{original} + \frac{\lambda}{2m} \sum_{l} ||W^{[l]}||_F^2 $$

Where:

- $J_{original}$ is the original loss (e.g., cross-entropy).
- $\lambda$ (lambda) is the regularization hyperparameter. It controls the strength of the penalty. A higher $\lambda$ means stronger regularization (smaller weights).
- $m$ is the number of examples in the batch.
- $W^{[l]}$ represents the weight matrix for layer $l$.
- $||W^{[l]}||_F^2$ is the squared Frobenius norm of the weight matrix for layer $l$, which is simply the sum of the squares of all the weights in that layer. We usually don't regularize bias terms.

In your code, when calculating the total loss for a batch, you would compute the original loss and then add this penalty term. You'd need access to all the weight matrices in your network.

```python
# Example: Calculating the L2 regularization cost
# Assuming 'parameters' is a dictionary containing weight matrices W1, W2, ...,
# 'lambd' is the regularization hyperparameter, and 'm' is the batch size
import numpy as np

l2_cost = 0
num_layers = len(parameters) // 2  # Parameters are stored as W1, b1, W2, b2, ...
for l in range(1, num_layers + 1):
    W = parameters['W' + str(l)]
    l2_cost += np.sum(np.square(W))  # Squared Frobenius norm of W[l]
l2_cost = (lambd / (2 * m)) * l2_cost

total_cost = original_cost + l2_cost
```

### 2. Modifying the Gradient Calculation (Backpropagation)

Since the loss function has changed, the gradients with respect to the weights also change. The derivative of the L2 penalty term with respect to a weight matrix $W^{[l]}$ is $\frac{\lambda}{m} W^{[l]}$. This term needs to be added to the original gradient calculation for each weight matrix during backpropagation. The gradient of the regularized loss with respect to the weights becomes:

$$ \frac{\partial J_{reg}}{\partial W^{[l]}} = \frac{\partial J_{original}}{\partial W^{[l]}} + \frac{\lambda}{m} W^{[l]} $$

So, in your backpropagation implementation, after calculating the original gradient $dW^{[l]}$ (which is $\frac{\partial J_{original}}{\partial W^{[l]}}$), you simply add the regularization term:

```python
# Example: Modifying the gradient for W[l] during backprop
# Assuming dW_original is the gradient calculated without regularization
dW_regularized = dW_original + (lambd / m) * parameters['W' + str(l)]

# Use dW_regularized in the parameter update step
# parameters['W' + str(l)] = parameters['W' + str(l)] - learning_rate * dW_regularized
```

Remember, the bias gradients $db^{[l]}$ are typically not regularized, so their calculation remains unchanged.
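To keep these two changes organized, you might wrap them in small helper functions. The sketch below is a minimal illustration under the same assumptions as the snippets above (a `parameters` dictionary holding W1, b1, W2, b2, ... and a `grads` dictionary holding dW1, dW2, ...); the helper names are made up for this example.

```python
import numpy as np

def compute_cost_with_l2(original_cost, parameters, lambd, m):
    """Return the cost with the L2 penalty (lambda / 2m) * sum of squared weights added."""
    num_layers = len(parameters) // 2
    l2_cost = sum(np.sum(np.square(parameters['W' + str(l)]))
                  for l in range(1, num_layers + 1))
    return original_cost + (lambd / (2 * m)) * l2_cost

def add_l2_to_gradients(grads, parameters, lambd, m):
    """Add (lambda / m) * W[l] to each weight gradient; bias gradients stay unchanged."""
    num_layers = len(parameters) // 2
    for l in range(1, num_layers + 1):
        grads['dW' + str(l)] = grads['dW' + str(l)] + (lambd / m) * parameters['W' + str(l)]
    return grads
```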
## Implementing Dropout

Dropout works differently. It doesn't change the loss function directly but modifies the network structure itself during training. In each forward pass during training, dropout randomly "drops" (sets to zero) a fraction of the neurons' outputs in a layer.

### 1. Modifying the Forward Pass

During the forward pass for a specific layer (usually applied to hidden layers), after calculating the activations (e.g., after applying ReLU or Tanh), you perform the following steps:

1. **Create a dropout mask:** Generate a matrix `D` of the same shape as the layer's activation output `A`, where each element is randomly 1 (with probability `keep_prob`) or 0 (with probability `1 - keep_prob`). `keep_prob` is the probability of keeping a neuron active.
2. **Apply the mask:** Multiply the activations `A` element-wise by the mask `D`. This effectively sets some activations to zero (`A_dropped = A * D`).
3. **Scale the remaining activations (inverted dropout):** Divide the result by `keep_prob` (`A_scaled = A_dropped / keep_prob`). This scaling ensures that the expected output of the layer remains the same during training as it would be during testing (when dropout is turned off). This technique, called inverted dropout, is the standard practice.

```python
# Example: Applying dropout during forward propagation for layer l
# A_prev is the activation from the previous layer
# W, b are parameters for the current layer
# activation_func is the activation function (e.g., relu)
# keep_prob is the probability of keeping a neuron active
Z = np.dot(W, A_prev) + b
A = activation_func(Z)  # Activation output

# Apply dropout
D = np.random.rand(A.shape[0], A.shape[1]) < keep_prob  # Create mask
A = A * D                                               # Apply mask
A = A / keep_prob                                       # Scale (inverted dropout)

# Store the mask D along with Z and A in the cache for backpropagation
# cache = (..., D, ...)
```

### 2. Modifying the Backward Pass

During backpropagation, the gradient calculation for the layer where dropout was applied needs to account for the mask `D`. The gradient flowing back must be shut off for the neurons that were dropped. You simply re-apply the same mask `D` (stored in the cache during the forward pass) to the gradient of the activation `dA` before calculating the gradients `dW`, `db`, and `dA_prev`. Remember to also scale by `keep_prob`.

```python
# Example: Applying the dropout mask during backpropagation for layer l
# dA is the gradient of the cost with respect to the activation A of the current layer
# cache contains the mask D used in the forward pass for this layer
# keep_prob is the same probability used in the forward pass

# Retrieve the mask from the cache
# D = cache[some_index]  # Get the mask used for this layer
dA = dA * D          # Shut off gradients for dropped neurons
dA = dA / keep_prob  # Apply the same scaling as in the forward pass

# Continue with the rest of the backpropagation steps using the modified dA
# dZ = dA * activation_gradient(Z)
# dW = ...
# db = ...
# dA_prev = ...
```

**Important:** Dropout should only be active during training. When evaluating your model or making predictions on new data, you must turn dropout off. This means you don't apply the mask or the scaling during the forward pass in evaluation/prediction mode. Using inverted dropout makes this straightforward, as you don't need to modify the weights post-training.
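One convenient way to enforce this train/evaluation distinction is to thread a flag through the forward pass. The following is a minimal sketch for a single hidden layer with a ReLU activation; the `training` flag, the `relu` helper, and the function name are assumptions made for this illustration, not part of the earlier implementation.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def hidden_layer_forward(A_prev, W, b, keep_prob, training=True):
    """Forward pass for one hidden layer; dropout is applied only when training=True."""
    Z = np.dot(W, A_prev) + b
    A = relu(Z)
    D = None
    if training:
        # Inverted dropout: mask, then rescale so the expected activation is unchanged
        D = np.random.rand(*A.shape) < keep_prob
        A = (A * D) / keep_prob
    # In evaluation/prediction mode (training=False) no mask or scaling is applied
    return A, (Z, D)
```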
## Observing the Effects

How do you know if regularization is working? Monitor your training and validation loss (and accuracy) over epochs. Without regularization, you might see the training loss continuously decrease while the validation loss starts to increase or stagnate after some point. This gap indicates overfitting.

Applying L2 regularization or dropout should ideally:

- Slow down the decrease in training loss (or even increase it slightly compared to no regularization).
- Keep the validation loss lower or decrease it for longer, reducing the gap between training and validation performance.

*Figure: "Effect of Regularization on Loss". Training and validation loss plotted over epochs, with and without L2/dropout.*

Comparison of training and validation loss with and without regularization. Regularization often increases training loss slightly but decreases validation loss, reducing the gap and indicating better generalization.

This practice involves modifying specific parts of your network's implementation. Try adding L2 regularization first by adjusting the cost and gradient calculations. Then experiment with adding dropout during the forward pass, make sure you handle it correctly during backpropagation, and turn it off during evaluation. Observe the impact on your training and validation metrics, and try different values for the regularization strength $\lambda$ or the dropout `keep_prob`, as in the sketch below.
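If you want to compare settings side by side, a simple loop over candidate values is enough. The sketch below only illustrates the bookkeeping; `train_model` and `evaluate` are hypothetical functions standing in for your own training loop and loss evaluation (with dropout turned off during evaluation).

```python
# Hypothetical helpers: train_model(...) trains a network with the given
# regularization settings, evaluate(...) returns the loss with dropout disabled.
lambd_values = [0.0, 0.1, 0.3, 0.7]   # 0.0 means no L2 penalty
keep_prob_values = [1.0, 0.9, 0.8]    # 1.0 means no dropout

results = []
for lambd in lambd_values:
    for keep_prob in keep_prob_values:
        parameters = train_model(X_train, Y_train, lambd=lambd, keep_prob=keep_prob)
        train_loss = evaluate(parameters, X_train, Y_train)
        val_loss = evaluate(parameters, X_val, Y_val)
        results.append((lambd, keep_prob, train_loss, val_loss))

# Choose the setting with the lowest validation loss, then check that the
# gap between training and validation loss has narrowed.
best = min(results, key=lambda r: r[3])
print(best)
```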