Automatic differentiation is the engine that powers modern neural network training, allowing us to efficiently compute gradients of a loss function with respect to model parameters. Both TensorFlow and PyTorch provide sophisticated systems for this, but they approach it with slightly different philosophies and APIs. If you're familiar with TensorFlow's tf.GradientTape, understanding PyTorch's autograd system will be a smooth transition.
At its heart, automatic differentiation tracks operations performed on tensors to build a computation graph. When gradients are needed, these systems traverse this graph backward from the output (typically a scalar loss) to the inputs (typically model weights), applying the chain rule at each step.
tf.GradientTape
In TensorFlow, gradient calculation is primarily managed using the tf.GradientTape context manager. Operations involving tf.Variable objects (or tensors explicitly "watched" by the tape) are recorded onto this "tape" within its scope. Once the forward pass is complete within the with block, you can call tape.gradient(target, sources) to compute the gradients of a target tensor (e.g., loss) with respect to one or more source tensors (e.g., model weights).
Let's look at a simple example. Suppose we have a function y = w^2 + b, and we want to find ∂y/∂w and ∂y/∂b.
import tensorflow as tf
# Define trainable variables
w_tf = tf.Variable(2.0, name='weight')
b_tf = tf.Variable(1.0, name='bias')
with tf.GradientTape() as tape:
    # Define the computation. Operations involving w_tf and b_tf are recorded.
    y_tf = w_tf * w_tf + b_tf  # y = w^2 + b
# Calculate gradients
# dy/dw = 2w = 2*2 = 4
# dy/db = 1
gradients_tf = tape.gradient(y_tf, {'w': w_tf, 'b': b_tf})
print(f"TensorFlow y: {y_tf.numpy()}")
print(f"TensorFlow d(y)/d(w): {gradients_tf['w'].numpy()}")
print(f"TensorFlow d(y)/d(b): {gradients_tf['b'].numpy()}")
# Output:
# TensorFlow y: 5.0
# TensorFlow d(y)/d(w): 4.0
# TensorFlow d(y)/d(b): 1.0
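As noted above, the tape records operations on tf.Variable objects automatically; plain tensors are only recorded if you watch them explicitly. Here is a minimal sketch using tape.watch (the values are made up for illustration):
import tensorflow as tf
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)  # constants are not watched automatically
    y = x * x  # y = x^2
# dy/dx = 2x = 6.0
print(f"d(y)/d(x): {tape.gradient(y, x).numpy()}")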
By default, a tf.GradientTape is consumed after one call to tape.gradient(). If you need to compute multiple gradients from the same recorded operations (e.g., for different losses or for higher-order derivatives), you must create a persistent tape by setting persistent=True when initializing it: with tf.GradientTape(persistent=True) as tape:.
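Here is a minimal sketch of a persistent tape, computing two different gradients from the same recorded computation (the function choices are arbitrary):
import tensorflow as tf
x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * x  # y = x^2
    z = y * y  # z = x^4
# The persistent tape can be queried more than once.
dy_dx = tape.gradient(y, x)  # 2x = 6.0
dz_dx = tape.gradient(z, x)  # 4x^3 = 108.0
print(f"dy/dx: {dy_dx.numpy()}, dz/dx: {dz_dx.numpy()}")
del tape  # release the tape's resources once you are done with it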
autograd
PyTorch's automatic differentiation system, torch.autograd, is more deeply integrated with the torch.Tensor object itself. Tensors have a requires_grad attribute (defaults to False for tensors you create, but True for model parameters like nn.Linear weights). If a tensor has requires_grad=True, PyTorch automatically tracks all operations involving it, building a dynamic computation graph.
After the forward pass, where you compute your final output (usually a scalar loss), you call the .backward() method on this output tensor. This triggers the gradient computation for all tensors in the graph that have requires_grad=True and were ancestors of the output tensor. The computed gradients are then accumulated in the .grad attribute of those respective leaf tensors.
Let's replicate the previous example in PyTorch: y = w^2 + b.
import torch
# Define tensors that require gradient tracking
w_pt = torch.tensor(2.0, requires_grad=True)
b_pt = torch.tensor(1.0, requires_grad=True)
# Define the computation. PyTorch tracks operations on w_pt and b_pt.
y_pt = w_pt * w_pt + b_pt # y = w^2 + b
# y_pt is a scalar, so we can call backward() directly.
# This computes gradients for all tensors with requires_grad=True
# that contributed to y_pt.
y_pt.backward()
# Gradients are stored in the .grad attribute of the tensors.
# dy/dw = 2w = 2*2 = 4
# dy/db = 1
print(f"PyTorch y: {y_pt.item()}")
print(f"PyTorch d(y)/d(w): {w_pt.grad.item()}")
print(f"PyTorch d(y)/d(b): {b_pt.grad.item()}")
# Output:
# PyTorch y: 5.0
# PyTorch d(y)/d(w): 4.0
# PyTorch d(y)/d(b): 1.0
One important detail in PyTorch is that gradients accumulate. This means if you call .backward() multiple times (e.g., in a training loop), the new gradients will be added to the existing ones in the .grad attributes. Therefore, before each backward pass in a typical training iteration, you must explicitly zero out the gradients, usually by calling optimizer.zero_grad() or manually iterating through parameters and calling param.grad.zero_() if param.grad is not None.
# Example of gradient accumulation and zeroing
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1) # An optimizer needs parameters
# First pass
y1 = w * w
y1.backward()
print(f"After first backward(): w.grad = {w.grad}") # w.grad = 4.0
# optimizer.step() # would update w based on this gradient
# If we don't zero gradients:
# optimizer.zero_grad() # <--- FORGOT THIS STEP!
y2 = w * 3
y2.backward() # This will add to the existing gradient
# New gradient for y2 is 3.0. Accumulated: 4.0 + 3.0 = 7.0
print(f"After second backward() (no zero_grad): w.grad = {w.grad}")
# Correct way: zero gradients before next computation
optimizer.zero_grad() # or w.grad.zero_()
y3 = w * 4
y3.backward()
print(f"After third backward() (with zero_grad): w.grad = {w.grad}") # w.grad = 4.0
The backward() method is usually called on a scalar tensor (like a loss value). If y_pt were a non-scalar tensor, y_pt.backward() would require a gradient argument of the same shape as y_pt, representing the "gradient of the final scalar loss with respect to y_pt." This is often torch.ones_like(y_pt), which is equivalent to taking the gradient of y_pt.sum().
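For instance, a minimal sketch of calling backward() on a vector-valued tensor (the values are arbitrary):
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * x  # non-scalar output: y_i = x_i^2
# Passing ones is equivalent to differentiating y.sum():
# d(sum(y))/dx_i = 2 * x_i
y.backward(gradient=torch.ones_like(y))
print(x.grad)  # tensor([2., 4., 6.])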
Comparing GradientTape and autograd
Here's a summary of the differences and similarities:
[Diagram: the operational flow for gradient computation in TensorFlow using GradientTape and in PyTorch using autograd. TensorFlow relies on an explicit recording context, while PyTorch's autograd tracks operations based on the requires_grad attribute of tensors.]
Main Distinctions:
Recording Mechanism:
- TensorFlow: Operations are recorded within an explicit with tf.GradientTape() as tape: block. Only operations within this block on watched tensors are recorded.
- PyTorch: Whenever an operation involves a tensor with requires_grad=True, the operation is tracked and added to the computation graph associated with its inputs.
Initiating Gradient Computation:
- TensorFlow: Call tape.gradient(target, sources). This returns the gradients.
- PyTorch: Call target.backward() (where target is usually a scalar loss). Gradients are not returned directly but are accumulated in the .grad attribute of the leaf tensors that had requires_grad=True.
Gradient Accumulation:
- TensorFlow: tape.gradient() computes fresh gradients each time.
- PyTorch: tensor.grad accumulates gradients. You must manually call optimizer.zero_grad() or tensor.grad.zero_() before each new backward pass in a training loop if accumulation is not desired. This accumulation can be useful for scenarios like gradient accumulation over multiple mini-batches.
Persistence of Computation Graph:
- TensorFlow (GradientTape): By default, the resources held by the tape are released after tape.gradient() is called once. To call it multiple times (e.g., for higher-order derivatives or multiple distinct gradients from the same computation), you need persistent=True.
- PyTorch (autograd): By default, the graph used to compute the gradients is freed after .backward() is called (this is controlled by the retain_graph=False default in backward()). If you need to call .backward() again on the same part of the graph (e.g., for multiple loss components contributing to the same parameters, or for higher-order gradients), you must specify loss.backward(retain_graph=True); see the sketch after this list.
Higher-Order Gradients:
- TensorFlow: Computed by nesting GradientTape contexts or using a persistent tape:
x_tf = tf.Variable(2.0)
with tf.GradientTape() as tape2:
    with tf.GradientTape() as tape1:
        y_tf = x_tf * x_tf * x_tf  # x^3
    dy_dx_tf = tape1.gradient(y_tf, x_tf)  # dy/dx = 3x^2 = 12
d2y_dx2_tf = tape2.gradient(dy_dx_tf, x_tf)  # d2y/dx2 = 6x = 12
print(f"TensorFlow dy/dx: {dy_dx_tf.numpy()}, d2y/dx2: {d2y_dx2_tf.numpy()}")
- PyTorch: Computed by passing create_graph=True to torch.autograd.grad (or calling .backward(create_graph=True)) on intermediate gradients, which allows further differentiation through the gradient computation itself:
x_pt = torch.tensor(2.0, requires_grad=True)
y_pt = x_pt * x_pt * x_pt # x^3
# First derivative
dy_dx_pt = torch.autograd.grad(y_pt, x_pt, create_graph=True)[0] # dy/dx = 3x^2 = 12
# Second derivative
d2y_dx2_pt = torch.autograd.grad(dy_dx_pt, x_pt)[0] # d2y/dx2 = 6x = 12
print(f"PyTorch dy/dx: {dy_dx_pt.item()}, d2y/dx2: {d2y_dx2_pt.item()}")
Note the use of torch.autograd.grad here, which is a more general function for computing gradients and is useful for higher-order derivatives. When y_pt.backward(create_graph=True) is called, the .grad attributes of tensors involved in computing y_pt will also have graphs attached, allowing further backward() or torch.autograd.grad calls.
Stopping Gradient Flow:
- TensorFlow: Achieved by ensuring a tf.Variable is not watched by the tape, or by using tf.stop_gradient() on a tensor.
- PyTorch: Achieved by setting a tensor's requires_grad attribute to False, using tensor.detach() to create a new tensor that doesn't require gradients and is detached from the current graph, or by wrapping code in a with torch.no_grad(): block, as shown below.
# PyTorch: Stopping gradient flow
x = torch.randn(2, 2, requires_grad=True)
y = x * 2
print(f"y.requires_grad: {y.requires_grad}") # True
# Using torch.no_grad()
with torch.no_grad():
    z = x * 2
print(f"z.requires_grad: {z.requires_grad}") # False, z doesn't track history
# Using .detach()
w = x * 2
w_detached = w.detach() # Creates a new tensor, shares data, but no grad history
print(f"w.requires_grad: {w.requires_grad}") # True
print(f"w_detached.requires_grad: {w_detached.requires_grad}") # False
For developers coming from TensorFlow, the most significant adjustments will be:
- requires_grad being the primary switch for gradient tracking, rather than an explicit GradientTape scope for every gradient computation.
- Remembering to call optimizer.zero_grad() in your training loop to prevent unintended gradient accumulation.
- Understanding that loss.backward() populates the .grad fields of your model parameters, which are then used by the optimizer.
Both systems are powerful and flexible, designed to handle the complex differentiation needs of deep learning models. PyTorch's autograd often feels more integrated into the tensor system, reflecting its define-by-run nature, where the graph is built as operations occur. TensorFlow's GradientTape provides a more explicit way to control when and what operations are recorded for differentiation. As you work more with PyTorch, its autograd system will become second nature, especially when constructing custom training loops.
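To tie these adjustments together, here is a minimal sketch of a custom PyTorch training loop; the model, data shapes, and learning rate are arbitrary choices for illustration:
import torch
import torch.nn as nn
model = nn.Linear(4, 1)  # parameters are created with requires_grad=True
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
X = torch.randn(32, 4)  # hypothetical inputs
t = torch.randn(32, 1)  # hypothetical targets
for epoch in range(5):
    optimizer.zero_grad()        # clear gradients accumulated by the previous pass
    loss = loss_fn(model(X), t)  # forward pass builds the graph
    loss.backward()              # populates .grad on the model's parameters
    optimizer.step()             # updates parameters using those .grad values
    print(f"epoch {epoch}: loss = {loss.item():.4f}")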