Batch Normalization (BN) is designed to mitigate a common problem in deep learning: internal covariate shift. To see it in practice, we will build and train two simple neural networks, one without BN and one with it, with the main goal of observing how BN affects training stability and convergence speed. We will use PyTorch for this exercise, so make sure you have it installed. We will also work with a simple synthetic dataset so that the focus stays entirely on the effect of the normalization technique itself.

## Scenario Setup

First, let's import the required libraries and generate some synthetic classification data.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt  # or use Plotly for interactive charts

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)  # targets need shape (N, 1) for BCEWithLogitsLoss

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Create data loaders (optional but recommended)
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
val_dataset = torch.utils.data.TensorDataset(X_val, y_val)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=64, shuffle=False)

print(f"Training samples: {len(train_loader.dataset)}")
print(f"Validation samples: {len(val_loader.dataset)}")
```

## Model 1: A Simple MLP Without Batch Normalization

Let's define a basic multi-layer perceptron (MLP) with two hidden layers.

```python
class SimpleMLP(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(SimpleMLP, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.relu2(x)
        x = self.output_layer(x)
        return x

# Instantiate the model
input_dim = X_train.shape[1]
hidden_dim1 = 128
hidden_dim2 = 64
output_dim = 1  # binary classification

model_no_bn = SimpleMLP(input_dim, hidden_dim1, hidden_dim2, output_dim)
print("Model without Batch Normalization:")
print(model_no_bn)
```

## Model 2: An MLP With Batch Normalization

Now let's create a similar MLP, but add a BatchNorm1d layer after each linear transformation and before the activation function. This is a common placement strategy.

```python
class MLPWithBN(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super(MLPWithBN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.bn1 = nn.BatchNorm1d(hidden_size1)  # BN layer for the first layer's output
        self.relu1 = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.bn2 = nn.BatchNorm1d(hidden_size2)  # BN layer for the second layer's output
        self.relu2 = nn.ReLU()
        self.output_layer = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        x = self.layer1(x)
        x = self.bn1(x)   # apply BN before the activation
        x = self.relu1(x)
        x = self.layer2(x)
        x = self.bn2(x)   # apply BN before the activation
        x = self.relu2(x)
        x = self.output_layer(x)
        return x

# Instantiate the model
model_with_bn = MLPWithBN(input_dim, hidden_dim1, hidden_dim2, output_dim)
print("Model with Batch Normalization:")
print(model_with_bn)
```

Note the added nn.BatchNorm1d layers. The 1d indicates that we expect inputs of shape (batch_size, features), which is typical for fully connected layers operating on non-spatial data.

## The Training Loop

We will define a standard training loop function. Pay close attention to the use of model.train() and model.eval(). This is especially important for models containing Batch Normalization (and Dropout), because these layers behave differently during training and evaluation.

- model.train(): puts the model in training mode. BN layers normalize using the statistics of the current mini-batch ($\mu$, $\sigma^2$) and update their running estimates of the population statistics.
- model.eval(): puts the model in evaluation mode. BN layers normalize using the previously learned running estimates and do not update them.
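To make this difference concrete, here is a minimal, self-contained sketch (separate from the models above, with purely illustrative variable names) that pushes the same batch through an nn.BatchNorm1d layer in both modes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm1d(4)           # 4 features, freshly initialized running statistics
x = torch.randn(8, 4) * 3 + 5    # a batch whose statistics are far from N(0, 1)

bn.train()                       # training mode: normalize with batch mean/var, update running stats
out_train = bn(x)

bn.eval()                        # evaluation mode: normalize with the stored running estimates
out_eval = bn(x)

print(out_train.mean(dim=0))     # per-feature means near 0 (batch statistics were used)
print(out_eval.mean(dim=0))      # far from 0: the running estimates have only been updated once
print(bn.running_mean)           # the running mean has moved slightly toward the batch mean
```

Forgetting to call model.eval() before validation or inference is a common source of silently degraded results, because the statistics of whatever batch happens to be passed in are used instead of the stable running estimates.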
```python
def train_model(model, train_loader, val_loader, epochs=20, lr=0.01):
    criterion = nn.BCEWithLogitsLoss()  # combines a sigmoid with binary cross-entropy
    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses = []
    val_losses = []

    for epoch in range(epochs):
        model.train()  # set the model to training mode
        running_train_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_train_loss += loss.item() * inputs.size(0)

        epoch_train_loss = running_train_loss / len(train_loader.dataset)
        train_losses.append(epoch_train_loss)

        model.eval()  # set the model to evaluation mode
        running_val_loss = 0.0
        with torch.no_grad():  # disable gradient computation during validation
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_val_loss += loss.item() * inputs.size(0)

        epoch_val_loss = running_val_loss / len(val_loader.dataset)
        val_losses.append(epoch_val_loss)

        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(f"Epoch {epoch+1}/{epochs} - Train Loss: {epoch_train_loss:.4f}, Val Loss: {epoch_val_loss:.4f}")

    return train_losses, val_losses

# --- Train both models ---
print("Training the model without Batch Normalization...")
# Re-initialize the model to ensure a fair comparison
model_no_bn = SimpleMLP(input_dim, hidden_dim1, hidden_dim2, output_dim)
train_losses_no_bn, val_losses_no_bn = train_model(model_no_bn, train_loader, val_loader, epochs=25, lr=0.01)

print("Training the model with Batch Normalization...")
# Re-initialize the model
model_with_bn = MLPWithBN(input_dim, hidden_dim1, hidden_dim2, output_dim)
train_losses_bn, val_losses_bn = train_model(model_with_bn, train_loader, val_loader, epochs=25, lr=0.01)
```

## Analyzing the Results

Now let's visualize the training and validation loss curves of both models (a short matplotlib sketch for producing such a plot follows the observations below).

*Figure: "Training loss comparison: Batch Normalization vs. no BN". Training and validation BCEWithLogitsLoss per epoch over 25 epochs; the model with BN is drawn in blue, the model without BN in red, and validation curves are dashed. Exact results may vary slightly due to random initialization and data shuffling.*

Looking at the chart (based on typically expected results):

- Faster convergence: the model with Batch Normalization (blue lines) is likely to show a much steeper initial drop in training loss than the model without BN (red lines). It usually reaches a lower training loss sooner.
- Stability: the BN model's training loss curve may look smoother, although both can be noisy because of mini-batch updates. More importantly, BN typically makes training less sensitive to the choice of learning rate and initialization. Although we used the same learning rate here, BN often allows higher learning rates, which can accelerate training further.
- Validation performance: compare the validation loss curves (dashed lines). Does the BN model reach a lower validation loss initially, or overall? BN can sometimes provide a mild regularization effect and thus better generalization (lower validation loss, or a smaller gap between training and validation loss), although its primary role is to stabilize training. In this example, both models eventually overfit (validation loss rises while training loss keeps falling), but the BN model starts overfitting later and possibly from a lower loss value. Early stopping (introduced later) would be useful here.
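To reproduce a comparison plot like the one described above from the four loss lists returned by train_model, here is a minimal matplotlib sketch (the styling choices are illustrative; Plotly works just as well if you prefer an interactive chart):

```python
import matplotlib.pyplot as plt

epochs_range = range(1, len(train_losses_no_bn) + 1)

plt.figure(figsize=(8, 5))
plt.plot(epochs_range, train_losses_no_bn, color="tab:red", label="No BN - train")
plt.plot(epochs_range, val_losses_no_bn, color="tab:red", linestyle="--", label="No BN - validation")
plt.plot(epochs_range, train_losses_bn, color="tab:blue", label="With BN - train")
plt.plot(epochs_range, val_losses_bn, color="tab:blue", linestyle="--", label="With BN - validation")
plt.xlabel("Epoch")
plt.ylabel("BCEWithLogitsLoss")
plt.title("Training loss comparison: Batch Normalization vs. no BN")
plt.legend()
plt.tight_layout()
plt.show()
```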
## Further Experiments

Consider trying the following:

- Increase the learning rate: re-run training for both models with a significantly higher learning rate (e.g., lr=0.1). Observe whether the model without BN struggles to converge or becomes unstable, while the BN model may handle it better.
- Change the BN placement: try placing the BN layers after the ReLU activations (a sketch of this variant appears after the summary below). Does this change the training dynamics? (Note: placing BN before the activation is more common and often recommended.)
- Deeper networks: add more layers to both models and observe how BN affects training in deeper architectures.

This hands-on exercise shows how adding BatchNorm1d layers can make training a feed-forward network faster and more stable. By normalizing activations, BN counteracts internal covariate shift, smooths the optimization process, and often allows more aggressive learning rates. Remember the importance of model.train() and model.eval() to ensure that BN behaves correctly in each phase.
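For the placement experiment suggested above, here is a minimal sketch of one possible post-activation variant (same layer sizes as before; the class name MLPWithBNAfterReLU is purely illustrative). It can be trained with the same train_model function and compared against the two models above:

```python
class MLPWithBNAfterReLU(nn.Module):
    """Variant of MLPWithBN that normalizes the post-activation outputs."""
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size1)
        self.relu1 = nn.ReLU()
        self.bn1 = nn.BatchNorm1d(hidden_size1)   # BN now follows the activation
        self.layer2 = nn.Linear(hidden_size1, hidden_size2)
        self.relu2 = nn.ReLU()
        self.bn2 = nn.BatchNorm1d(hidden_size2)
        self.output_layer = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        x = self.bn1(self.relu1(self.layer1(x)))
        x = self.bn2(self.relu2(self.layer2(x)))
        return self.output_layer(x)

# Train it exactly like the other two models and compare the loss curves
model_bn_after = MLPWithBNAfterReLU(input_dim, hidden_dim1, hidden_dim2, output_dim)
train_losses_after, val_losses_after = train_model(model_bn_after, train_loader, val_loader, epochs=25, lr=0.01)
```

Whether this placement helps is data- and architecture-dependent; pre-activation placement, as used in MLPWithBN, remains the more common default.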