Now that we've covered the concepts behind PyTorch's Autograd system, let's solidify our understanding with some practical examples. These exercises will guide you through setting gradient requirements, performing backpropagation, inspecting gradients, observing accumulation, and disabling gradient tracking. Ensure you have PyTorch installed and can import the torch library.
First, let's import PyTorch:
import torch
Let's start with a very simple computation and track gradients. We'll define two tensors, x and w, where w represents a weight we want to optimize. We'll compute a simple output y and then a scalar loss L.
Create Tensors: Define x as a tensor with some data and w as a tensor we want to compute gradients for (using requires_grad=True).
# Input data
x = torch.tensor([2.0, 4.0, 6.0])
# Weight tensor - requires gradient computation
w = torch.tensor([0.5], requires_grad=True)
print(f"x: {x}")
print(f"w: {w}")
print(f"x.requires_grad: {x.requires_grad}")
print(f"w.requires_grad: {w.requires_grad}")
Notice that x does not require gradients by default, while we explicitly set it for w.
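If a tensor already exists, you can also enable tracking after creation with the in-place requires_grad_() method:

z = torch.tensor([1.0, 2.0])
print(z.requires_grad)  # False
z.requires_grad_(True)  # note the trailing underscore: modifies z in-place
print(z.requires_grad)  # True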
Define Computation: Perform a simple operation. Any tensor resulting from an operation involving a tensor with requires_grad=True will also have requires_grad=True.
# Forward pass: y = w * x
y = w * x
# Define a simple scalar loss L (e.g., mean of y)
L = y.mean()
print(f"y: {y}")
print(f"L: {L}")
print(f"y.requires_grad: {y.requires_grad}")
print(f"L.requires_grad: {L.requires_grad}")
You'll see that both y and L now require gradients because they depend on w.
Compute Gradients: Use the .backward() method on the final scalar output (L) to compute gradients throughout the graph.
# Perform backpropagation
L.backward()
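As a side note, .backward() is not the only way to obtain gradients. torch.autograd.grad computes and returns them directly instead of storing them in .grad. A minimal sketch using fresh tensors so the example above is unaffected:

x2 = torch.tensor([2.0, 4.0, 6.0])
w2 = torch.tensor([0.5], requires_grad=True)
L2 = (w2 * x2).mean()

# Returns a tuple of gradients, one per input tensor
(grad_w2,) = torch.autograd.grad(L2, w2)
print(grad_w2)  # tensor([4.])
print(w2.grad)  # None - .grad was never populated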
Inspect Gradients: Check the .grad attribute of the tensor w.
# Gradient is stored in w.grad
print(f"Gradient dL/dw: {w.grad}")
# x did not require gradients, so its gradient is None
print(f"Gradient dL/dx: {x.grad}")
Let's analyze the result for w.grad. The computation was:

$$y_i = w \cdot x_i$$

$$L = \frac{1}{3}\sum_i y_i = \frac{1}{3}\left(w x_1 + w x_2 + w x_3\right)$$

The gradient $\frac{\partial L}{\partial w}$ is:

$$\frac{\partial L}{\partial w} = \frac{1}{3}\left(x_1 + x_2 + x_3\right)$$

With $x = [2.0, 4.0, 6.0]$, the gradient is $\frac{1}{3}(2.0 + 4.0 + 6.0) = \frac{12.0}{3} = 4.0$. This matches the output tensor([4.]). Because x was created without requires_grad=True, its gradient is not computed and remains None.
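If you ever want to sanity-check an analytic gradient like this, a quick finite-difference comparison works well. A minimal sketch, recreating the tensors so it runs standalone:

x = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor([0.5], requires_grad=True)
L = (w * x).mean()
L.backward()

# Central finite-difference approximation of dL/dw
eps = 1e-4
L_plus = ((w.detach() + eps) * x).mean()
L_minus = ((w.detach() - eps) * x).mean()
numeric = (L_plus - L_minus) / (2 * eps)

print(w.grad)   # tensor([4.])
print(numeric)  # approximately 4.0, matching Autograd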
Autograd builds a graph dynamically. Let's trace a slightly more complex example.
Create Tensors:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(4.0, requires_grad=False) # Does not require grad
print(f"a: {a}, requires_grad={a.requires_grad}")
print(f"b: {b}, requires_grad={b.requires_grad}")
print(f"c: {c}, requires_grad={c.requires_grad}")
Define Computation:
d = a * b
e = d + c
f = e * 2
print(f"d: {d}, requires_grad={d.requires_grad}") # True (depends on a, b)
print(f"e: {e}, requires_grad={e.requires_grad}") # True (depends on d)
print(f"f: {f}, requires_grad={f.requires_grad}") # True (depends on e)
Compute and Inspect Gradients:
# Backpropagate from the final scalar output f
f.backward()
# Check gradients
print(f"Gradient df/da: {a.grad}")
print(f"Gradient df/db: {b.grad}")
print(f"Gradient df/dc: {c.grad}") # Expected: None
Let's calculate manually:

$$d = a \times b, \qquad e = d + c = a \times b + c, \qquad f = 2e = 2(a \times b + c)$$

$$\frac{\partial f}{\partial a} = 2b = 2 \times 3.0 = 6.0, \qquad \frac{\partial f}{\partial b} = 2a = 2 \times 2.0 = 4.0, \qquad \frac{\partial f}{\partial c} = 2$$
The computed gradients for a and b match. Analytically, $\frac{\partial f}{\partial c}$ would be 2, but since c was defined with requires_grad=False, Autograd did not track operations involving it for gradient computation relative to c itself, so c.grad is None.
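You can actually peek at the graph Autograd recorded by inspecting each result's grad_fn attribute, which references the backward operation that produced it. Continuing with the tensors above:

print(d.grad_fn)  # <MulBackward0 ...> from a * b
print(e.grad_fn)  # <AddBackward0 ...> from d + c
print(f.grad_fn)  # <MulBackward0 ...> from e * 2
print(a.grad_fn)  # None - 'a' is a leaf tensor created directly by the user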
By default, gradients are accumulated in the .grad attribute every time .backward() is called. This is useful for scenarios like calculating gradients for multiple losses or simulating larger batch sizes, but it requires explicit zeroing of gradients during standard training loops.
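To preview how this is used deliberately, here is a minimal sketch of simulating a larger batch by accumulating gradients over several small backward passes before one optimizer step. The model, data, and step count are hypothetical placeholders:

model = torch.nn.Linear(10, 1)  # hypothetical tiny model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4  # micro-batches accumulated per update

optimizer.zero_grad()
for step in range(accum_steps):
    inputs = torch.randn(8, 10)  # hypothetical micro-batch of 8 samples
    targets = torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    (loss / accum_steps).backward()  # gradients sum across the passes
optimizer.step()  # one update, as if the batch size were 8 * 4 = 32
optimizer.zero_grad()  # reset before the next accumulation cycle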
Setup: Let's start fresh with a simple example.
x = torch.tensor(5.0, requires_grad=True)
y = x * x
print(f"Initial x.grad: {x.grad}") # Should be None initially
First Backward Pass:
# y = x * x is a scalar here because x is a 0-dim tensor, so backward()
# needs no arguments. If y were non-scalar, you would either reduce it to a
# scalar first (e.g., L = y.mean()) or pass an explicit gradient argument,
# e.g. y.backward(gradient=torch.ones_like(y)).
y.backward(retain_graph=True)  # retain_graph=True is needed for multiple backward passes
print(f"x.grad after 1st backward: {x.grad}")  # Expected: 2*x = 10.0
Second Backward Pass (Accumulation): Call backward again without zeroing the gradient.
y.backward(retain_graph=True) # Call backward again
print(f"x.grad after 2nd backward: {x.grad}") # Expected: 10.0 + 10.0 = 20.0
The gradient is accumulated (added) to the previous value.
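This additive behavior is exactly what you want when a parameter contributes to several loss terms: calling backward() on each term produces the same total gradient as summing the losses first. A small sketch:

w2 = torch.tensor(1.0, requires_grad=True)
loss1 = (w2 - 2.0) ** 2
loss2 = (w2 + 1.0) ** 2

loss1.backward()  # contributes 2*(w2 - 2) = -2.0
loss2.backward()  # adds 2*(w2 + 1) = 4.0
print(w2.grad)    # tensor(2.) - same as (loss1 + loss2).backward()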
Zeroing Gradients: Manually zero the gradient. In a typical training loop, this is done using optimizer.zero_grad().
if x.grad is not None:
    x.grad.zero_()  # In-place zeroing
print(f"x.grad after zeroing: {x.grad}")  # Expected: 0.0
Third Backward Pass (After Zeroing):
y.backward() # No need for retain_graph on the final backward pass
print(f"x.grad after 3rd backward: {x.grad}") # Expected: 10.0
The gradient is computed fresh after being zeroed. Forgetting to zero gradients is a common source of errors in training loops.
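Putting the pieces together, a standard training step always pairs backward() with zeroing. A minimal sketch that fits a single weight to the synthetic target y = 3x:

w_fit = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([w_fit], lr=0.1)

for step in range(50):
    data = torch.randn(16)
    loss = ((w_fit * data - 3.0 * data) ** 2).mean()
    optimizer.zero_grad()  # clear the gradient left over from the previous step
    loss.backward()        # compute this step's gradient
    optimizer.step()       # update w_fit using the fresh gradient

print(w_fit.item())  # approaches 3.0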
Sometimes, you need to perform operations without tracking them for gradient computation, most commonly during model evaluation (inference) or when adjusting parameters outside the optimization step.
Using torch.no_grad(): This context manager is the standard way to disable gradient tracking for a block of code.
a = torch.tensor(2.0, requires_grad=True)
print(f"Outside context: a.requires_grad = {a.requires_grad}")
with torch.no_grad():
    print(f"Inside context: a.requires_grad = {a.requires_grad}")  # Still True
    b = a * 2
    print(f"Inside context: b = {b}, b.requires_grad = {b.requires_grad}")  # False!

# Outside the context, computations resume tracking if inputs require grad
c = a * 3
print(f"Outside context: c = {c}, c.requires_grad = {c.requires_grad}")  # True
Inside the torch.no_grad() block, even though a requires gradients, the resulting tensor b does not. This makes operations within the block more memory-efficient and faster, as the history for backpropagation isn't saved.
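In practice, the most common home for torch.no_grad() is an evaluation or inference pass. A minimal sketch with a hypothetical stand-in model:

model = torch.nn.Linear(4, 2)  # hypothetical model
model.eval()                   # evaluation mode (affects dropout/batch norm layers)

batch = torch.randn(8, 4)
with torch.no_grad():          # no graph is built, saving memory and time
    predictions = model(batch)

print(predictions.requires_grad)  # False - safe to log, plot, or convert to NumPy

Recent PyTorch versions also provide torch.inference_mode(), a stricter variant of no_grad() with slightly lower overhead for pure inference.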
Using .detach(): This method creates a new tensor that shares the same data but is detached from the computation history. It doesn't require gradients.
a = torch.tensor(5.0, requires_grad=True)
b = a * a # b requires grad and is part of the graph connected to a
# Detach a to create a new tensor c that doesn't require gradients
c = a.detach()
print(f"a.requires_grad: {a.requires_grad}") # True
print(f"c.requires_grad: {c.requires_grad}") # False
# Operations with c won't be tracked back to a
d = c * 3 # d does not require grad
print(f"d.requires_grad: {d.requires_grad}") # False
# If you perform backward on a computation involving 'b',
# it flows back to 'a'. If you use 'd', it doesn't.
L1 = b.mean() # Depends on 'a'
L1.backward()
print(f"Gradient dL1/da: {a.grad}") # Expected: 2*a = 10.0
# Zero gradients before next backward call
if a.grad is not None:
    a.grad.zero_()
# Try backpropagating through 'd' - it won't affect 'a's gradient
try:
    # L2 depends on 'a' directly; the detached path through 'd' carries no gradient
    L2 = (a + d).mean()  # L2 = (a + a.detach() * 3).mean()
    L2.backward()
    print(f"Gradient dL2/da: {a.grad}")  # Expected: 1.0; the 'd' path contributes nothing
except RuntimeError as e:
    # backward() raises if the final scalar depends on no tensor requiring grad,
    # which can happen when every input path has been detached
    print(f"Error demonstrating backward with detached: {e}")
# Modify c (the detached tensor) in-place - it affects a because they share data!
with torch.no_grad():
    c.fill_(100.0)  # In-place fill; c is 0-dim, so indexing like c[0] would fail
print(f"After modifying c, a = {a}")  # 'a' also changes!
print(f"After modifying c, c = {c}")
detach() is useful when you want to use a tensor's value in a calculation but prevent gradients from flowing back through that specific path, or when you need a tensor without gradient history (e.g., for plotting or logging). Be mindful that it shares data storage, so in-place modifications affect the original tensor unless you .clone() it first (c = a.detach().clone()).
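To make the storage-sharing point concrete, here are the two patterns side by side:

a = torch.tensor([1.0, 2.0], requires_grad=True)

shared = a.detach()               # shares storage with a
independent = a.detach().clone()  # owns its own copy of the data

with torch.no_grad():
    shared[0] = 99.0       # also changes a
    independent[1] = -1.0  # leaves a untouched

print(a)            # tensor([99.,  2.], requires_grad=True)
print(independent)  # tensor([ 1., -1.])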
These exercises demonstrate the core mechanics of Autograd. You've practiced enabling gradient tracking, performing backpropagation, inspecting the computed gradients, understanding accumulation, and disabling tracking when necessary. Mastering these operations is fundamental for building and training neural networks in PyTorch.