Understanding the theory of gradient descent is one thing; seeing it in action gives a much more direct sense of how it finds good parameters on a loss surface. In this hands-on section, we will use simple visualizations to demonstrate the behavior of the gradient descent algorithm.

We start with a very simple one-dimensional convex function, $f(x) = x^2$. Think of it as a simplified loss function with a single parameter $x$ to optimize. Our goal is to find the value of $x$ that minimizes $f(x)$. From calculus we know the minimum is at $x = 0$; let's see how gradient descent finds it. The gradient of $f(x) = x^2$ (in this one-dimensional case, simply the derivative) is $f'(x) = 2x$, so the gradient descent update rule is:
$$x_{new} = x_{old} - \alpha \cdot f'(x_{old})$$
$$x_{new} = x_{old} - \alpha \cdot (2x_{old})$$
where $\alpha$ is the learning rate. Let's simulate this process, starting from an arbitrary point, say $x_0 = 4$, with a learning rate of $\alpha = 0.1$.

- Step 0: $x_0 = 4.0$, $f(x_0) = 16.0$, gradient $f'(x_0) = 2 \times 4.0 = 8.0$.
- Step 1: $x_1 = x_0 - \alpha \cdot f'(x_0) = 4.0 - 0.1 \times 8.0 = 3.2$, $f(x_1) = 10.24$, gradient $f'(x_1) = 6.4$.
- Step 2: $x_2 = x_1 - \alpha \cdot f'(x_1) = 3.2 - 0.1 \times 6.4 = 2.56$, $f(x_2) \approx 6.55$, gradient $f'(x_2) = 5.12$.
- Step 3: $x_3 = x_2 - \alpha \cdot f'(x_2) = 2.56 - 0.1 \times 5.12 = 2.048$, $f(x_3) \approx 4.19$.
- ... and so on.

Notice how the value of $x$ steadily approaches 0, and how the step size (which is proportional to the gradient) shrinks as we near the minimum. Let's visualize this path.

[Figure: Gradient descent on f(x) = x² (α = 0.1)] Starting from $x = 4$ with learning rate $\alpha = 0.1$, gradient descent moves iteratively toward the minimum of the quadratic function $f(x) = x^2$.

The effect of the learning rate

The learning rate $\alpha$ is an important hyperparameter. Let's compare several values, again starting from $x_0 = 4$:

- Small learning rate ($\alpha = 0.01$): the steps are tiny, so progress toward the minimum is very slow.
- Large learning rate ($\alpha = 0.95$): the steps can be too large, so the algorithm overshoots the minimum and oscillates from side to side. It may still converge, but inefficiently.
- Too-large learning rate ($\alpha = 1.05$): each step is so large that every update lands farther from the minimum than the previous point. The algorithm diverges.

[Figure: Effect of the learning rate on gradient descent, f(x) = x²] Comparison of gradient descent paths on $f(x) = x^2$ for different learning rates $\alpha$: a small value converges slowly, a well-chosen value converges efficiently, and too large a value overshoots or even diverges. This underscores how important choosing an appropriate learning rate is for successful training.
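If you want to reproduce these trajectories yourself, the short Python sketch below (function names are illustrative; nothing beyond the standard library is needed) runs the same update $x \leftarrow x - \alpha \cdot 2x$ for each of the learning rates discussed above. With $\alpha = 0.1$ it prints the sequence 4.0, 3.2, 2.56, 2.048, ... traced earlier, while $\alpha = 1.05$ moves farther from 0 at every step.

```python
# Minimal sketch (illustrative names, standard library only): gradient descent
# on f(x) = x^2 with the update x <- x - alpha * f'(x), where f'(x) = 2x.

def f(x):
    return x ** 2

def grad_f(x):
    return 2.0 * x

def gradient_descent_1d(x0, alpha, num_steps):
    """Return the sequence of points visited by gradient descent."""
    xs = [x0]
    for _ in range(num_steps):
        xs.append(xs[-1] - alpha * grad_f(xs[-1]))
    return xs

for alpha in (0.01, 0.1, 0.95, 1.05):
    path = gradient_descent_1d(x0=4.0, alpha=alpha, num_steps=5)
    points = ", ".join(f"x={x:.3f} (f={f(x):.3f})" for x in path)
    print(f"alpha={alpha}: {points}")

# alpha=0.1 reproduces the hand-computed steps: 4.0 -> 3.2 -> 2.56 -> 2.048 -> ...
# alpha=0.95 overshoots and oscillates around 0; alpha=1.05 diverges.
```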
Two-dimensional visualization

Neural network loss functions typically depend on millions of parameters, which makes their surfaces impossible to visualize directly. We can still build better intuition, however, by visualizing gradient descent on a function of two variables, for example $f(x, y) = x^2 + y^2$. The minimum is clearly at $(0, 0)$. The gradient is now a vector:
$$\nabla f(x, y) = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right] = [2x, 2y]$$
and the update rule becomes:
$$x_{new} = x_{old} - \alpha \cdot (2x_{old})$$
$$y_{new} = y_{old} - \alpha \cdot (2y_{old})$$
We start from $(x_0, y_0) = (3, 4)$ with learning rate $\alpha = 0.1$.

- Step 0: $(x_0, y_0) = (3.0, 4.0)$, $f(x_0, y_0) = 9 + 16 = 25$, gradient $\nabla f(3, 4) = [6, 8]$.
- Step 1: $x_1 = 3.0 - 0.1 \times 6 = 2.4$ and $y_1 = 4.0 - 0.1 \times 8 = 3.2$, so $(x_1, y_1) = (2.4, 3.2)$, $f(x_1, y_1) = 5.76 + 10.24 = 16.0$, gradient $\nabla f(2.4, 3.2) = [4.8, 6.4]$.
- Step 2: $x_2 = 2.4 - 0.1 \times 4.8 = 1.92$ and $y_2 = 3.2 - 0.1 \times 6.4 = 2.56$, so $(x_2, y_2) = (1.92, 2.56)$, $f(x_2, y_2) = 3.6864 + 6.5536 = 10.24$.

We can visualize this as a path on a contour plot of $f(x, y)$. At any point, the gradient is perpendicular to the contour line through that point and points in the direction of steepest ascent; gradient descent steps in the opposite direction, the direction of steepest descent.

[Figure: Gradient descent on f(x, y) = x² + y² (α = 0.1)] Path of gradient descent on the contour plot of $f(x, y) = x^2 + y^2$. Starting from (3, 4), the algorithm moves toward the minimum at (0, 0), stepping perpendicular to the contour lines.

These visualizations are simple, but they demonstrate the core mechanics of gradient descent. In deep learning, loss surfaces are far more complex and high-dimensional, with many local minima and saddle points (as discussed in gradient-descent-challenges). The basic idea, however, stays the same: iteratively reduce the loss by moving in the direction of the negative gradient. Keeping this visual intuition in mind helps when diagnosing training problems or tuning hyperparameters such as the learning rate.
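For completeness, here is a companion sketch for the two-dimensional case (again with illustrative names; only NumPy is assumed). It applies the same update to both coordinates and reproduces the path from (3, 4) toward the minimum at (0, 0) traced above.

```python
import numpy as np

# Minimal sketch (illustrative names; only NumPy assumed): gradient descent on
# f(x, y) = x^2 + y^2, whose gradient is [2x, 2y].

def f(p):
    return float(np.sum(p ** 2))

def grad_f(p):
    return 2.0 * p

def gradient_descent_2d(p0, alpha=0.1, num_steps=5):
    """Return the list of points visited by p <- p - alpha * grad_f(p)."""
    path = [np.asarray(p0, dtype=float)]
    for _ in range(num_steps):
        path.append(path[-1] - alpha * grad_f(path[-1]))
    return path

for step, p in enumerate(gradient_descent_2d([3.0, 4.0])):
    print(f"step {step}: (x, y) = ({p[0]:.4f}, {p[1]:.4f}), f = {f(p):.4f}")

# Expected output mirrors the walkthrough above:
# step 0: (3.0, 4.0), f = 25.0; step 1: (2.4, 3.2), f = 16.0; step 2: (1.92, 2.56), f = 10.24; ...
```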