Manually calculating the gradients for a very small neural network helps solidify understanding of loss functions, gradient descent, and the backpropagation algorithm. The exercise also makes visible how the chain rule determines how much each parameter contributes to the overall error.

Consider a simple network with one input, one hidden neuron (using the Sigmoid activation function), and one output neuron (also using Sigmoid). Our goal is to calculate how changes in the weights ($w_1$, $w_2$) and biases ($b_1$, $b_2$) affect the final loss for a single training example.

```dot
digraph G {
  rankdir=TB;
  node [shape=circle, style=filled, color="#a5d8ff", fontname="helvetica"];
  edge [fontname="helvetica"];

  x  [label="x"];
  h  [label="h", color="#96f2d7"];
  o  [label="o", color="#ffc9c9"];
  b1 [label="b1", shape=plaintext];
  b2 [label="b2", shape=plaintext];

  x  -> h [label="w1"];
  b1 -> h [style=dashed];
  h  -> o [label="w2"];
  b2 -> o [style=dashed];

  subgraph cluster_0 {
    style=filled; color="#e9ecef";
    node [style=filled, color=white];
    x;
    label = "Input Layer";
  }
  subgraph cluster_1 {
    style=filled; color="#e9ecef";
    node [style=filled, color=white];
    h; b1;
    label = "Hidden Layer (Sigmoid)";
  }
  subgraph cluster_2 {
    style=filled; color="#e9ecef";
    node [style=filled, color=white];
    o; b2;
    label = "Output Layer (Sigmoid)";
  }
}
```

*A simple feedforward network with one input, one hidden neuron, and one output neuron.*

## Network Setup

Let's define the components and initial values:

- **Input:** $x = 0.5$
- **Target Output:** $y = 0.8$
- **Weights:** $w_1 = 0.2$, $w_2 = 0.9$
- **Biases:** $b_1 = 0.1$, $b_2 = -0.3$
- **Activation Function:** Sigmoid, $\sigma(z) = \frac{1}{1 + e^{-z}}$. Its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.
- **Loss Function:** Mean Squared Error (MSE), $L = \frac{1}{2}(y - o)^2$. Its derivative with respect to the output $o$ is $\frac{\partial L}{\partial o} = o - y$.

## 1. Forward Propagation

First, we compute the network's output ($o$) for the given input ($x$) and parameters.

**Hidden Layer Pre-activation ($z_1$):**
$$ z_1 = w_1 x + b_1 = (0.2 \times 0.5) + 0.1 = 0.1 + 0.1 = 0.2 $$

**Hidden Layer Activation ($h$):**
$$ h = \sigma(z_1) = \sigma(0.2) = \frac{1}{1 + e^{-0.2}} \approx \frac{1}{1 + 0.8187} \approx 0.5498 $$

**Output Layer Pre-activation ($z_2$):**
$$ z_2 = w_2 h + b_2 = (0.9 \times 0.5498) + (-0.3) \approx 0.4948 - 0.3 = 0.1948 $$

**Output Layer Activation ($o$):**
$$ o = \sigma(z_2) = \sigma(0.1948) = \frac{1}{1 + e^{-0.1948}} \approx \frac{1}{1 + 0.8230} \approx 0.5486 $$

So, the network's prediction is $o \approx 0.5486$.

## 2. Loss Calculation

Now, calculate the error using the MSE loss function:

$$ L = \frac{1}{2}(y - o)^2 = \frac{1}{2}(0.8 - 0.5486)^2 = \frac{1}{2}(0.2514)^2 \approx \frac{1}{2}(0.0632) \approx 0.0316 $$

The loss for this example is approximately $0.0316$.
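To check the arithmetic, the forward pass and loss can be reproduced in a few lines of Python. This is a minimal sketch for illustration; the variable names simply mirror the symbols above and are not taken from any particular framework:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Values from the worked example above
x, y = 0.5, 0.8        # input and target output
w1, b1 = 0.2, 0.1      # hidden-layer weight and bias
w2, b2 = 0.9, -0.3     # output-layer weight and bias

# Forward pass
z1 = w1 * x + b1       # hidden pre-activation  -> 0.2
h = sigmoid(z1)        # hidden activation      -> ~0.5498
z2 = w2 * h + b2       # output pre-activation  -> ~0.1948
o = sigmoid(z2)        # network output         -> ~0.5486

# Mean squared error with the 1/2 factor used above
loss = 0.5 * (y - o) ** 2   # -> ~0.0316

print(f"z1={z1:.4f}  h={h:.4f}  z2={z2:.4f}  o={o:.4f}  loss={loss:.4f}")
```

Running this should print values that match the hand calculation up to rounding.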
## 3. Backward Propagation (Gradient Calculation)

Our goal is to find the gradients $\frac{\partial L}{\partial w_2}$, $\frac{\partial L}{\partial b_2}$, $\frac{\partial L}{\partial w_1}$, and $\frac{\partial L}{\partial b_1}$. We use the chain rule, working backward from the loss.

**Derivative of Loss w.r.t. Network Output ($o$):**
$$ \frac{\partial L}{\partial o} = o - y \approx 0.5486 - 0.8 = -0.2514 $$

**Gradients for Output Layer ($w_2, b_2$):**

We need the derivative of the output activation $o$ w.r.t. its pre-activation $z_2$:
$$ \frac{\partial o}{\partial z_2} = \sigma'(z_2) = o (1 - o) \approx 0.5486 \times (1 - 0.5486) \approx 0.5486 \times 0.4514 \approx 0.2476 $$

Now, apply the chain rule to find the gradient of the loss w.r.t. $z_2$:
$$ \frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial o} \frac{\partial o}{\partial z_2} \approx (-0.2514) \times (0.2476) \approx -0.0622 $$

The gradients for $w_2$ and $b_2$ depend on how $z_2$ changes with respect to them:
$$ \frac{\partial z_2}{\partial w_2} = h \approx 0.5498 \qquad \frac{\partial z_2}{\partial b_2} = 1 $$

Using the chain rule again:
$$ \frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial w_2} \approx (-0.0622) \times (0.5498) \approx -0.0342 $$
$$ \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial b_2} \approx (-0.0622) \times 1 = -0.0622 $$

**Gradients for Hidden Layer ($w_1, b_1$):**

We need to propagate the gradient further back. First, find the gradient of the loss w.r.t. the hidden activation $h$:
$$ \frac{\partial L}{\partial h} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial h} $$

Since $z_2 = w_2 h + b_2$, we have $\frac{\partial z_2}{\partial h} = w_2 = 0.9$, so
$$ \frac{\partial L}{\partial h} \approx (-0.0622) \times 0.9 = -0.0560 $$

Next, we need the derivative of the hidden activation $h$ w.r.t. its pre-activation $z_1$:
$$ \frac{\partial h}{\partial z_1} = \sigma'(z_1) = h (1 - h) \approx 0.5498 \times (1 - 0.5498) \approx 0.5498 \times 0.4502 \approx 0.2475 $$

Now, apply the chain rule to find the gradient of the loss w.r.t. $z_1$:
$$ \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h} \frac{\partial h}{\partial z_1} \approx (-0.0560) \times (0.2475) \approx -0.0139 $$

Finally, the gradients for $w_1$ and $b_1$ depend on how $z_1$ changes with respect to them:
$$ \frac{\partial z_1}{\partial w_1} = x = 0.5 \qquad \frac{\partial z_1}{\partial b_1} = 1 $$

Using the chain rule one last time:
$$ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z_1} \frac{\partial z_1}{\partial w_1} \approx (-0.0139) \times 0.5 \approx -0.0070 $$
$$ \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial z_1} \frac{\partial z_1}{\partial b_1} \approx (-0.0139) \times 1 = -0.0139 $$

## Summary of Gradients

We have manually calculated the gradients of the loss function with respect to each parameter:

- $\frac{\partial L}{\partial w_2} \approx -0.0342$
- $\frac{\partial L}{\partial b_2} \approx -0.0622$
- $\frac{\partial L}{\partial w_1} \approx -0.0070$
- $\frac{\partial L}{\partial b_1} \approx -0.0139$

These gradients tell us the direction and magnitude of the change needed in each parameter to reduce the loss. For instance, a negative gradient like $\frac{\partial L}{\partial w_2} \approx -0.0342$ indicates that increasing $w_2$ slightly would decrease the loss, because the update rule subtracts the gradient.

## Next Step

In a real training scenario, these gradients would be combined with a chosen learning rate ($\eta$) to update the parameters using gradient descent:

$$ w_{new} = w_{old} - \eta \frac{\partial L}{\partial w_{old}} \qquad b_{new} = b_{old} - \eta \frac{\partial L}{\partial b_{old}} $$

This manual calculation, while tedious for larger networks, clearly demonstrates the mechanics of backpropagation and how error signals flow backward through the network to inform parameter updates. Frameworks like TensorFlow and PyTorch automate this process, but understanding the underlying calculations is essential for effective model building and debugging.
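As a final cross-check, the backward pass and one gradient-descent step can also be scripted by hand in plain Python. This is a minimal sketch for illustration only; the learning rate of $0.1$ is an assumed value not given in the worked example, and the variable names again mirror the symbols above:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Setup from the worked example
x, y = 0.5, 0.8
w1, b1 = 0.2, 0.1
w2, b2 = 0.9, -0.3

# Forward pass (repeated so the backward pass can reuse the activations)
z1 = w1 * x + b1
h = sigmoid(z1)
z2 = w2 * h + b2
o = sigmoid(z2)

# Backward pass: chain rule from the loss back to each parameter
dL_do = o - y                  # dL/do   -> ~-0.2514
dL_dz2 = dL_do * o * (1 - o)   # dL/dz2  -> ~-0.0622
dL_dw2 = dL_dz2 * h            # dL/dw2  -> ~-0.0342
dL_db2 = dL_dz2                # dL/db2  -> ~-0.0622
dL_dh = dL_dz2 * w2            # dL/dh   -> ~-0.0560
dL_dz1 = dL_dh * h * (1 - h)   # dL/dz1  -> ~-0.0139
dL_dw1 = dL_dz1 * x            # dL/dw1  -> ~-0.0069
dL_db1 = dL_dz1                # dL/db1  -> ~-0.0139

# One gradient-descent step (learning rate is an illustrative choice)
eta = 0.1
w1 -= eta * dL_dw1
b1 -= eta * dL_db1
w2 -= eta * dL_dw2
b2 -= eta * dL_db2

print(f"grads:   dw2={dL_dw2:.4f}  db2={dL_db2:.4f}  dw1={dL_dw1:.4f}  db1={dL_db1:.4f}")
print(f"updated: w1={w1:.4f}  b1={b1:.4f}  w2={w2:.4f}  b2={b2:.4f}")
```

Repeating the forward pass with the updated parameters should yield a slightly lower loss, which is exactly what the negative gradients predicted.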