Now that we've discussed the concept of internal covariate shift and how Batch Normalization (BN) aims to mitigate it, let's put theory into practice. In this hands-on section, we'll build and train two simple neural networks: one without BN and one with BN integrated. Our goal is to observe firsthand the effects of BN on training stability and convergence speed.

We'll use PyTorch for this exercise, so make sure you have it installed. We'll also use a simple synthetic dataset to keep the focus squarely on the impact of the normalization technique itself.

## Setting up the Scenario

First, let's import the necessary libraries and generate some synthetic classification data.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt  # Or use Plotly for interactive plots

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)  # Target needs shape (N, 1) for BCEWithLogitsLoss

# Split data
X_train, X_val, y_train, y_val = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create DataLoaders (optional but good practice)
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
val_dataset = torch.utils.data.TensorDataset(X_val, y_val)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, shuffle=False)

print(f"Training samples: {len(train_loader.dataset)}")
print(f"Validation samples: {len(val_loader.dataset)}")
```

## Model 1: Simple MLP without Batch Normalization

Let's define a basic Multi-Layer Perceptron (MLP) with two hidden layers.

```python
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(SimpleMLP, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.relu2(x)
        x = self.output_layer(x)
        return x

# Instantiate the model
input_dim = X_train.shape[1]
hidden_dim1 = 128
hidden_dim2 = 64
output_dim = 1  # Binary classification

model_no_bn = SimpleMLP(input_dim, hidden_dim1, hidden_dim2, output_dim)
print("Model without Batch Normalization:")
print(model_no_bn)
```

## Model 2: MLP with Batch Normalization

Now, let's create a similar MLP but add `BatchNorm1d` layers after each linear transformation, just before the activation function. This is a common placement strategy.
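Before wiring BN into the network, it can help to verify what an `nn.BatchNorm1d` layer actually computes on a mini-batch. The short standalone sketch below (not part of the main exercise; the batch shape and layer width are arbitrary) compares the layer's training-mode output against a manual per-feature normalization using the batch mean and biased variance:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 8)            # a toy mini-batch: (batch_size, num_features)

bn = nn.BatchNorm1d(8)            # gamma (weight) starts at 1, beta (bias) at 0
bn.train()                        # training mode: normalize with this batch's statistics
out_layer = bn(x)

# Manual equivalent: per-feature standardization with the batch mean and
# biased variance, then scale and shift with the learnable parameters.
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
out_manual = (x - mean) / torch.sqrt(var + bn.eps) * bn.weight + bn.bias

print(torch.allclose(out_layer, out_manual, atol=1e-5))  # expected: True
```

Because the learnable scale ($\gamma$) and shift ($\beta$) are initialized to 1 and 0, the layer's output is initially just the standardized activations.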
```python
class MLPWithBN(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(MLPWithBN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.bn1 = nn.BatchNorm1d(hidden_size1)  # BN layer for layer1 output
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.bn2 = nn.BatchNorm1d(hidden_size2)  # BN layer for layer2 output
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.bn1(x)    # Apply BN before activation
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.bn2(x)    # Apply BN before activation
        x = self.relu2(x)
        x = self.output_layer(x)
        return x

# Instantiate the model
model_with_bn = MLPWithBN(input_dim, hidden_dim1, hidden_dim2, output_dim)
print("\nModel with Batch Normalization:")
print(model_with_bn)
```

Notice the added `nn.BatchNorm1d` layers. The `1d` signifies that the layer expects inputs of shape `(batch_size, num_features)`, which is typical for fully connected layers processing non-spatial data.

## Training Loop

We'll define a standard training loop function. Pay close attention to the use of `model.train()` and `model.eval()`. This is particularly important for models with BN (and Dropout), as these layers behave differently during training and evaluation.

- `model.train()`: Puts the model in training mode. BN layers use the current mini-batch statistics ($\mu$, $\sigma^2$) for normalization and update their running estimates of the population statistics.
- `model.eval()`: Puts the model in evaluation mode. BN layers use the previously learned running estimates for normalization and do not update them.

```python
def train_model(model, train_loader, val_loader, epochs=20, lr=0.01):
    criterion = nn.BCEWithLogitsLoss()  # Combines Sigmoid and Binary Cross Entropy
    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses = []
    val_losses = []

    for epoch in range(epochs):
        model.train()  # Set model to training mode
        running_train_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_train_loss += loss.item() * inputs.size(0)

        epoch_train_loss = running_train_loss / len(train_loader.dataset)
        train_losses.append(epoch_train_loss)

        model.eval()  # Set model to evaluation mode
        running_val_loss = 0.0
        with torch.no_grad():  # Disable gradient calculations for validation
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_val_loss += loss.item() * inputs.size(0)

        epoch_val_loss = running_val_loss / len(val_loader.dataset)
        val_losses.append(epoch_val_loss)

        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(f"Epoch {epoch+1}/{epochs} - Train Loss: {epoch_train_loss:.4f}, Val Loss: {epoch_val_loss:.4f}")

    return train_losses, val_losses

# --- Train the models ---
print("\nTraining Model without Batch Normalization...")
# Re-initialize model to ensure a fair comparison
model_no_bn = SimpleMLP(input_dim, hidden_dim1, hidden_dim2, output_dim)
train_losses_no_bn, val_losses_no_bn = train_model(model_no_bn, train_loader, val_loader, epochs=25, lr=0.01)

print("\nTraining Model with Batch Normalization...")
# Re-initialize model
model_with_bn = MLPWithBN(input_dim, hidden_dim1, hidden_dim2, output_dim)
train_losses_bn, val_losses_bn = train_model(model_with_bn, train_loader, val_loader, epochs=25, lr=0.01)
```

## Analyzing the Results

Now, let's visualize the training and validation loss curves for both models.
*Figure: Comparison of training and validation loss curves over 25 epochs for the models with and without Batch Normalization (x-axis: epoch; y-axis: BCEWithLogitsLoss). Note: actual results may vary slightly due to random initialization and data shuffling.*

Observing the plot (based on typical expected results):

- **Faster Convergence:** The model with Batch Normalization (blue lines) likely shows a much steeper initial drop in training loss compared to the model without BN (red lines). It often reaches a lower training loss value much faster.
- **Stability:** The training loss curve for the BN model might appear smoother, although noise is expected in both due to mini-batch updates. More significantly, BN often makes training less sensitive to the choice of learning rate and initialization. While we used the same learning rate here, BN frequently allows for higher learning rates, further accelerating training.
- **Validation Performance:** Compare the validation loss curves (dashed lines). Does the BN model achieve a lower validation loss initially or overall? Sometimes BN offers a slight regularization effect, potentially leading to better generalization (a lower validation loss, or a smaller gap between training and validation loss), although its primary role is stabilization. In this example plot, both models eventually overfit (validation loss increases while training loss keeps decreasing), but the BN model starts overfitting later and potentially from a lower loss value. Early stopping (covered later) would be beneficial here.

## Experiment Further

Consider trying the following:

- **Increase the Learning Rate:** Rerun the training for both models with a significantly higher learning rate (e.g., `lr=0.1`). Observe whether the non-BN model struggles to converge or becomes unstable, while the BN model might handle it better.
- **Change BN Placement:** Try placing the BN layer after the ReLU activation function. Does this change the training dynamics? (Note: placement before the activation is generally more common and often recommended.)
- **Deeper Network:** Add more layers to both models and see how BN affects training in a deeper architecture.

This practical exercise demonstrates how integrating `BatchNorm1d` layers can lead to faster, more stable training for feedforward networks. By normalizing activations, BN addresses the internal covariate shift problem, making the optimization process smoother and often allowing for more aggressive learning rates. Remember the importance of `model.train()` and `model.eval()` to ensure BN behaves correctly during the different phases, as the short check below illustrates.
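As a closing sanity check, here is a minimal sketch (reusing the trained `model_with_bn` and `val_loader` from above) that runs the same batch through the BN model in both modes:

```python
# Grab one validation batch and compare training-mode vs evaluation-mode outputs.
sample_inputs, _ = next(iter(val_loader))

model_with_bn.train()              # training mode: BN uses this batch's statistics
with torch.no_grad():              # note: a forward pass in train mode still updates the running estimates
    out_train_mode = model_with_bn(sample_inputs)

model_with_bn.eval()               # evaluation mode: BN uses the stored running estimates
with torch.no_grad():
    out_eval_mode = model_with_bn(sample_inputs)

# The outputs usually differ because different statistics normalized the activations.
print(torch.allclose(out_train_mode, out_eval_mode))   # typically False
print(model_with_bn.bn1.running_mean[:5])              # a few of the learned running means
```

Forgetting `model.eval()` at inference time means predictions depend on whatever batch happens to be passed in, which is exactly the behavior this check makes visible.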