As we discussed in the chapter introduction, the foundation of training neural networks lies in calculating the gradient of the loss function with respect to the model's parameters. PyTorch's Autograd engine handles this complex task automatically. But how does Autograd know which calculations need to be tracked for differentiation? The answer lies in a specific attribute of PyTorch tensors: requires_grad.
The requires_grad Attribute
Every PyTorch tensor possesses a boolean attribute called requires_grad. This attribute acts as a flag, signaling to Autograd whether operations involving this tensor should be recorded for potential gradient computation later.
By default, when you create a tensor, its requires_grad attribute is set to False.
import torch
# Default behavior: requires_grad is False
x = torch.tensor([1.0, 2.0, 3.0])
print(f"Tensor x: {x}")
print(f"x.requires_grad: {x.requires_grad}")
# Create another tensor explicitly setting requires_grad to False
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=False)
print(f"\nTensor y: {y}")
print(f"y.requires_grad: {y.requires_grad}")
This default behavior is sensible for efficiency. Many tensors in a typical workflow don't need gradients. For instance, input data or target labels are usually fixed and don't require gradient computation with respect to themselves. Tracking operations unnecessarily would consume extra memory and computation.
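As a small illustration of this default, the sketch below builds a made-up batch of inputs and labels (the names features, labels, and feature_means are placeholders, not from the text) and checks that neither the data nor a value computed purely from it is tracked.

```python
import torch

# Hypothetical input batch and integer class labels (illustrative values only)
features = torch.rand(4, 3)
labels = torch.tensor([0, 1, 1, 0])

# Plain data tensors are not tracked by Autograd
print(features.requires_grad)  # False
print(labels.requires_grad)    # False

# A result computed only from untracked tensors is untracked as well
feature_means = features.mean(dim=0)
print(feature_means.requires_grad)  # False
```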
To instruct PyTorch to track operations and prepare for gradient computation for a specific tensor, you set its requires_grad attribute to True. There are two primary ways to do this:
During Tensor Creation: Pass requires_grad=True as an argument to the tensor creation function.
# Enable gradient tracking at creation time
w = torch.tensor([0.5, -1.0], requires_grad=True)
print(f"Tensor w: {w}")
print(f"w.requires_grad: {w.requires_grad}")
After Tensor Creation (In-place): Use the in-place method .requires_grad_(True) on an existing tensor.
b = torch.tensor([0.1])
print(f"Tensor b (before): {b}")
print(f"b.requires_grad (before): {b.requires_grad}")
# Enable gradient tracking after creation
b.requires_grad_(True)
print(f"\nTensor b (after): {b}")
print(f"b.requires_grad (after): {b.requires_grad}")
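Two further details of the in-place route may be useful. The argument of .requires_grad_() defaults to True, and the method returns the tensor it was called on. The flag can also be assigned as a plain attribute on a tensor you created yourself. The following is a minimal sketch; the names same and v are illustrative only.

```python
import torch

b = torch.tensor([0.1])

# requires_grad_() defaults to True and returns the same tensor object,
# so the call can be chained or used inline
same = b.requires_grad_()
print(same is b)        # True
print(b.requires_grad)  # True

# On a tensor you created yourself, the flag can also be assigned directly
v = torch.zeros(3)
v.requires_grad = True
print(v.requires_grad)  # True

# The flag can be switched off again, for example to freeze a parameter
v.requires_grad_(False)
print(v.requires_grad)  # False
```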
Important Note: Only floating-point tensors (like torch.float32 or torch.float64) and complex tensors can require gradients. Derivatives describe continuous change, which is not defined for discrete integer values. Setting requires_grad=True on an integer tensor raises a RuntimeError immediately, at creation time, rather than later during the backward pass.
# Attempting requires_grad on an integer tensor
try:
    # This line raises a RuntimeError: integer dtypes cannot require gradients
    int_tensor = torch.tensor([1, 2], dtype=torch.int64, requires_grad=True)
    # Not reached for integer dtypes
    print(f"Integer tensor created with requires_grad=True: {int_tensor.requires_grad}")
except RuntimeError as e:
    print(f"\nError setting requires_grad on integer tensor: {e}")
# Best practice: Use float tensors for parameters/computations needing gradients
float_tensor = torch.tensor([1.0, 2.0], requires_grad=True)
print(f"\nFloat tensor created with requires_grad=True: {float_tensor.requires_grad}")
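When values start out as integers, one common approach is to convert them to a floating-point dtype before enabling tracking. The sketch below assumes illustrative tensors named counts and weights, plus a rejected flag used only to record the outcome; none of these come from the text above.

```python
import torch

counts = torch.tensor([1, 2, 3])  # integer literals give dtype torch.int64
print(counts.dtype)

# Enabling tracking on the integer tensor itself is rejected
try:
    counts.requires_grad_(True)
    rejected = False
except RuntimeError as e:
    rejected = True
    print(f"RuntimeError: {e}")

# Converting to a floating-point dtype first gives a tensor that can be tracked
weights = counts.to(torch.float32).requires_grad_(True)
print(weights.dtype, weights.requires_grad)
```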
Propagation of requires_grad
Crucially, the requires_grad status propagates through operations. If any input tensor participating in an operation has requires_grad=True, the output tensor resulting from that operation will automatically have requires_grad=True. This ensures that the entire chain of calculations involving parameters (which typically have requires_grad=True) is tracked.
Let's illustrate this:
# Define tensors: x (input), w (weight), b (bias)
x = torch.tensor([1.0, 2.0]) # Input data, gradients not needed
w = torch.tensor([0.5, -1.0], requires_grad=True) # Weight parameter, track gradients
b = torch.tensor([0.1], requires_grad=True) # Bias parameter, track gradients
print(f"x requires_grad: {x.requires_grad}")
print(f"w requires_grad: {w.requires_grad}")
print(f"b requires_grad: {b.requires_grad}")
# Perform an operation: y = w * x + b
# Note: PyTorch handles broadcasting for b
intermediate = w * x
print(f"\nintermediate (w * x) requires_grad: {intermediate.requires_grad}")
y = intermediate + b
print(f"y requires_grad: {y.requires_grad}")
Notice that even though x did not require gradients, because w required gradients, the result of w * x (intermediate) also requires gradients. Subsequently, since intermediate required gradients (and b also did), the final output y also has requires_grad=True.
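The rule works like a logical "any" over the inputs. As a hedged sketch with made-up tensors a, c, and p (names chosen here for illustration), compare a result built only from untracked inputs with one that mixes in a single tracked tensor:

```python
import torch

a = torch.tensor([1.0, 2.0])                      # not tracked
c = torch.tensor([3.0, 4.0])                      # not tracked
p = torch.tensor([0.5, 0.5], requires_grad=True)  # tracked

untracked = a + c      # no input requires gradients
mixed = (a + c) * p    # one input requires gradients
reduced = mixed.sum()  # the flag carries through later operations too

print(untracked.requires_grad)  # False
print(mixed.requires_grad)      # True
print(reduced.requires_grad)    # True
```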
The .grad_fn Attribute
This propagation is intrinsically linked to how PyTorch builds the computation graph. When a new tensor is created by an operation, and its requires_grad is True, PyTorch attaches a .grad_fn attribute to this new tensor. This attribute references the function (like AddBackward0 or MulBackward0) that performed the operation and knows how to compute the corresponding gradients during the backward pass.
Tensors created directly by the user (like our x, w, and b examples above) are considered "leaf" tensors in the graph. If they have requires_grad=True, their .grad_fn is None because they weren't created by a tracked operation within the graph. Tensors resulting from operations on tensors requiring gradients are "non-leaf" tensors and will have a .grad_fn.
Let's inspect the .grad_fn from our previous example:
print(f"\nx.grad_fn: {x.grad_fn}")
print(f"w.grad_fn: {w.grad_fn}")
print(f"b.grad_fn: {b.grad_fn}")
print(f"intermediate.grad_fn: {intermediate.grad_fn}") # Result of multiplication
print(f"y.grad_fn: {y.grad_fn}") # Result of addition
You can see that x, w, and b (our leaf tensors) have grad_fn=None. In contrast, intermediate has a MulBackward0 function, and y has an AddBackward0 function, indicating the operations that created them. This chain of grad_fn references is the dynamic computation graph that Autograd uses.
Figure: A simplified view of the computation graph for y = w * x + b. Tensors requiring gradients are highlighted in blue. Notice how operations (*, +) create new tensors (intermediate, y) which reference the operation via grad_fn if gradient tracking is enabled through their inputs.
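To make the chain of grad_fn references concrete, the following sketch walks backward from y through grad_fn.next_functions, an attribute of Autograd's graph nodes that lists the nodes feeding into each operation. It rebuilds the tensors from the example above. The helper collect_node_names is a hypothetical name introduced here, and the exact node names printed may vary between PyTorch versions.

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)
y = w * x + b

def collect_node_names(node, names=None):
    """Depth-first walk over the backward graph, collecting node type names."""
    if names is None:
        names = []
    if node is None:  # inputs that do not require gradients appear as None
        return names
    names.append(type(node).__name__)
    for next_node, _ in node.next_functions:
        collect_node_names(next_node, names)
    return names

names = collect_node_names(y.grad_fn)
print(names)

# Leaf status is also exposed directly on each tensor
print(x.is_leaf, w.is_leaf, y.is_leaf)
```

The walk starts at the node for the addition and reaches the node for the multiplication, mirroring the graph described above.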
By setting requires_grad=True on the tensors we want to optimize (typically model parameters like weights w and biases b), we enable Autograd to build this graph and trace the computations back from the final output (usually the loss) to these parameters, preparing everything for the gradient calculation step using .backward(), which we will cover next.
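As a brief preview of that step (a sketch only, with the details left to the next section), the example below reduces y to a scalar with sum(), which stands in for a loss here, and calls .backward(). Gradients land in the .grad attribute of the tracked leaves w and b, while the untracked input x receives none.

```python
import torch

x = torch.tensor([1.0, 2.0])
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

y = w * x + b
loss = y.sum()   # stand-in scalar "loss" for illustration
loss.backward()

print(w.grad)  # d(loss)/dw equals x: tensor([1., 2.])
print(b.grad)  # b is broadcast to both elements: tensor([2.])
print(x.grad)  # None, since x does not require gradients
```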
© 2025 ApX Machine Learning