After calculating the loss and performing backpropagation using loss.backward(), your model's parameters (the weights and biases with requires_grad=True) now have their gradients computed and stored in their .grad attribute. These gradients, written ∇_θ L, indicate the direction in parameter space that would most steeply increase the loss. To minimize the loss, we need to move the parameters in the opposite direction. This is precisely the job of the optimizer.
In Chapter 4, you learned how to instantiate optimizers like torch.optim.SGD or torch.optim.Adam, passing in the model's parameters (model.parameters()) and configuration details such as the learning rate. Now, within the training loop, you use the optimizer's step() method to perform the parameter update.
# Assume model, criterion, and optimizer are already defined
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# ... Inside the training loop for one batch ...
# Forward pass
outputs = model(inputs)
# Calculate loss
loss = criterion(outputs, labels)
# Backpropagation - compute gradients
loss.backward()
# Update weights using the optimizer
optimizer.step()
Calling optimizer.step() iterates through all the parameters registered with the optimizer during its initialization. For each parameter p, it uses the gradient stored in p.grad to update the parameter's value p.data.
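To make this concrete, here is a small sketch, continuing from the snippet above (so optimizer already exists and loss.backward() has just been called). The optimizer keeps the registered parameters in its param_groups attribute, and after backpropagation each of them carries a populated .grad tensor.
# Inspect the parameters registered with the optimizer.
# After loss.backward(), each of them holds a gradient in .grad.
for group in optimizer.param_groups:
    for p in group['params']:
        print(p.shape, p.grad is not None)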
The most basic update rule, used by Stochastic Gradient Descent (SGD), is:
parameter = parameter − learning_rate × gradient
Or more formally, for a parameter θ:
θ_new = θ_old − η ∇_θ L
where η is the learning rate (the lr argument passed when creating the optimizer) and ∇_θ L is the gradient computed by loss.backward() and accessed internally by the optimizer via parameter.grad.
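The following self-contained sketch reproduces this rule by hand; the tiny linear model, random data, and learning rate are made up purely for illustration. The loop inside torch.no_grad() performs the same update that optimizer.step() would apply for plain SGD (without momentum or weight decay).
import torch

# Hypothetical setup, used only for this illustration
model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
lr = 0.01

inputs = torch.randn(4, 3)
targets = torch.randn(4, 1)

loss = criterion(model(inputs), targets)
loss.backward()

# Manual equivalent of SGD's step(): theta_new = theta_old - lr * gradient
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p -= lr * p.grad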
Different optimizers implement more sophisticated update rules. For instance, Adam uses adaptive learning rates for each parameter and incorporates momentum, but the core idea remains the same: use the computed gradients to adjust the parameters so that the loss decreases. The optimizer.step() call abstracts away the specific update logic defined by the chosen optimization algorithm.
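Reusing the toy model and data from the sketch above, swapping in Adam changes only the construction line; the training-loop calls stay identical. The lr and betas shown here are simply Adam's usual defaults.
# Same loop structure, different update rule under the hood
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = criterion(model(inputs), targets)
loss.backward()
optimizer.step()  # Adam applies its adaptive, momentum-based update here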
It is essential to call loss.backward() before calling optimizer.step(). The backward() call computes the gradients, and the step() call uses these gradients to update the weights. Without calling backward() first, the .grad attributes would not be populated (or would contain stale values from a previous iteration), and the optimizer would not know how to adjust the parameters effectively.
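A quick, self-contained sketch of this ordering, again with a made-up model and random data: on a freshly constructed model the .grad attributes are None, they are only populated once backward() runs, and only then does step() have something to work with.
import torch

model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

param = next(model.parameters())
print(param.grad)              # None: backward() has not run yet

loss = criterion(model(torch.randn(4, 3)), torch.randn(4, 1))
loss.backward()
print(param.grad.shape)        # gradient tensor is now populated

optimizer.step()               # safe to update only after backward()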
The next crucial step after updating the weights is to clear the gradients before processing the next batch. We will cover that in the following section, "Zeroing Gradients".