Automatic differentiation is the engine that powers modern neural network training, allowing us to efficiently compute gradients of a loss function with respect to model parameters. Both TensorFlow and PyTorch provide sophisticated systems for this, but they approach it with slightly different philosophies and APIs. Comparing TensorFlow's tf.GradientTape with PyTorch's autograd system makes these differences concrete and eases the transition between the two frameworks.
Fundamentally, automatic differentiation tracks operations performed on tensors to build a computation graph. When gradients are needed, these systems traverse this graph backward from the output (typically a scalar loss) to the inputs (typically model weights), applying the chain rule at each step.
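To make the chain rule step concrete, here is a small sketch (using PyTorch syntax, which is covered in detail below) that compares an automatically computed gradient against the chain rule applied by hand:
import torch

# For y = (w * x)^2, the chain rule gives dy/dw = 2 * (w * x) * x.
x = torch.tensor(3.0)
w = torch.tensor(2.0, requires_grad=True)
y = (w * x) ** 2
y.backward()               # traverse the recorded graph backward from y
manual = 2 * (w * x) * x   # the same derivative, applied by hand
print(w.grad.item(), manual.item())  # both print 36.0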
tf.GradientTape

In TensorFlow, gradient calculation is primarily managed using the tf.GradientTape context manager. Operations involving tf.Variable objects (or tensors explicitly "watched" by the tape) are recorded onto this "tape" within its scope. Once the forward pass is complete within the with block, you can call tape.gradient(target, sources) to compute the gradients of a target tensor (e.g., loss) with respect to one or more source tensors (e.g., model weights).
Let's look at a simple example. Suppose we have a function y = w^2 + b, and we want to find ∂y/∂w and ∂y/∂b.
import tensorflow as tf
# Define trainable variables
w_tf = tf.Variable(2.0, name='weight')
b_tf = tf.Variable(1.0, name='bias')
with tf.GradientTape() as tape:
    # Define the computation. Operations involving w_tf and b_tf are recorded.
    y_tf = w_tf * w_tf + b_tf  # y = w^2 + b
# Calculate gradients
# dy/dw = 2w = 2*2 = 4
# dy/db = 1
gradients_tf = tape.gradient(y_tf, {'w': w_tf, 'b': b_tf})
print(f"TensorFlow y: {y_tf.numpy()}")
print(f"TensorFlow d(y)/d(w): {gradients_tf['w'].numpy()}")
print(f"TensorFlow d(y)/d(b): {gradients_tf['b'].numpy()}")
# Output:
# TensorFlow y: 5.0
# TensorFlow d(y)/d(w): 4.0
# TensorFlow d(y)/d(b): 1.0
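Note that the tape records operations on trainable tf.Variable objects automatically; to differentiate with respect to a plain tensor, you must ask the tape to watch it. A minimal sketch:
x = tf.constant(3.0)
with tf.GradientTape() as tape:
    tape.watch(x)  # constants are not tracked unless explicitly watched
    y = x * x
print(tape.gradient(y, x).numpy())  # 6.0, i.e., dy/dx = 2x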
By default, a tf.GradientTape is consumed after one call to tape.gradient(). If you need to compute multiple gradients from the same recorded operations (e.g., for different losses or for higher-order derivatives), you must create a persistent tape by setting persistent=True when initializing it: with tf.GradientTape(persistent=True) as tape:.
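For example, a persistent tape lets you pull two different gradients from one recorded forward pass; a small sketch (deleting the tape afterward releases its resources):
x = tf.Variable(3.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * x  # y = x^2
    z = y * y  # z = x^4
print(tape.gradient(z, x).numpy())  # dz/dx = 4x^3 = 108.0
print(tape.gradient(y, x).numpy())  # dy/dx = 2x = 6.0; a second call works
del tape  # drop the reference so the tape's resources are freed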
autograd

PyTorch's automatic differentiation system, torch.autograd, is more deeply integrated with the torch.Tensor object itself. Tensors have a requires_grad attribute (defaults to False for tensors you create, but True for model parameters like nn.Linear weights). If a tensor has requires_grad=True, PyTorch automatically tracks all operations involving it, building a dynamic computation graph.
After the forward pass, where you compute your final output (usually a scalar loss), you call the .backward() method on this output tensor. This triggers the gradient computation for all tensors in the graph that have requires_grad=True and were ancestors of the output tensor. The computed gradients are then accumulated in the .grad attribute of those respective leaf tensors.
Let's replicate the previous example in PyTorch: y = w^2 + b.
import torch
# Define tensors that require gradient tracking
w_pt = torch.tensor(2.0, requires_grad=True)
b_pt = torch.tensor(1.0, requires_grad=True)
# Define the computation. PyTorch tracks operations on w_pt and b_pt.
y_pt = w_pt * w_pt + b_pt # y = w^2 + b
# y_pt is a scalar, so we can call backward() directly.
# This computes gradients for all tensors with requires_grad=True
# that contributed to y_pt.
y_pt.backward()
# Gradients are stored in the .grad attribute of the tensors.
# dy/dw = 2w = 2*2 = 4
# dy/db = 1
print(f"PyTorch y: {y_pt.item()}")
print(f"PyTorch d(y)/d(w): {w_pt.grad.item()}")
print(f"PyTorch d(y)/d(b): {b_pt.grad.item()}")
# Output:
# PyTorch y: 5.0
# PyTorch d(y)/d(w): 4.0
# PyTorch d(y)/d(b): 1.0
One important detail in PyTorch is that gradients accumulate. This means if you call .backward() multiple times (e.g., in a training loop), the new gradients will be added to the existing ones in the .grad attributes. Therefore, before each backward pass in a typical training iteration, you must explicitly zero out the gradients, usually by calling optimizer.zero_grad() or manually iterating through parameters and calling param.grad.zero_() if param.grad is not None.
# Example of gradient accumulation and zeroing
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1) # An optimizer needs parameters
# First pass
y1 = w * w
y1.backward()
print(f"After first backward(): w.grad = {w.grad}") # w.grad = 4.0
# optimizer.step() # would update w based on this gradient
# If we don't zero gradients:
# optimizer.zero_grad() # <--- FORGOT THIS STEP!
y2 = w * 3
y2.backward() # This will add to the existing gradient
# New gradient for y2 is 3.0. Accumulated: 4.0 + 3.0 = 7.0
print(f"After second backward() (no zero_grad): w.grad = {w.grad}")
# Correct way: zero gradients before next computation
optimizer.zero_grad() # or w.grad.zero_()
y3 = w * 4
y3.backward()
print(f"After third backward() (with zero_grad): w.grad = {w.grad}") # w.grad = 4.0
The backward() method is usually called on a scalar tensor (like a loss value). If y_pt were a non-scalar tensor, y_pt.backward() would require a gradient argument of the same shape as y_pt, representing the "gradient of the final scalar loss with respect to y_pt." Passing torch.ones_like(y_pt) is equivalent to calling backward() on y_pt.sum().
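A quick sketch of the non-scalar case:
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * x  # a vector output; y.backward() alone would raise an error
y.backward(gradient=torch.ones_like(y))  # equivalent to y.sum().backward()
print(x.grad)  # tensor([2., 4., 6.]), i.e., 2x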
GradientTape and autograd

Here's a summary of the differences and similarities:
Diagram illustrating the operational flow for gradient computation in TensorFlow using GradientTape and in PyTorch using autograd. TensorFlow relies on an explicit recording context, while PyTorch's autograd tracks operations based on the requires_grad attribute of tensors.
Main Distinctions:
Recording Mechanism:
GradientTape: Recording is explicit, inside a with tf.GradientTape() as tape: block. Only operations within this block on watched tensors are recorded.
autograd: Recording is implicit. If any input to an operation has requires_grad=True, the operation is tracked and added to the computation graph associated with its inputs.
Initiating Gradient Computation:
GradientTape: You call tape.gradient(target, sources). This returns the gradients.
autograd: You call target.backward() (where target is usually a scalar loss). Gradients are not returned directly but are accumulated in the .grad attribute of the leaf tensors that had requires_grad=True.
Gradient Accumulation:
GradientTape: tape.gradient() computes fresh gradients each time.
autograd: tensor.grad accumulates gradients. You must manually call optimizer.zero_grad() or tensor.grad.zero_() before each new backward pass in a training loop if accumulation is not desired. This accumulation can be useful for scenarios like gradient accumulation over multiple mini-batches, as sketched below.
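For instance, a minimal sketch of that mini-batch accumulation pattern; the model, data, and accumulation count here are hypothetical stand-ins:
import torch

model = torch.nn.Linear(4, 1)  # stand-in model
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accum_steps = 4  # hypothetical number of mini-batches per update

optimizer.zero_grad()
for _ in range(accum_steps):
    inputs = torch.randn(8, 4)  # stand-in mini-batch
    targets = torch.randn(8, 1)
    loss = loss_fn(model(inputs), targets) / accum_steps  # scale to average
    loss.backward()  # gradients add up in each param.grad
optimizer.step()  # one update from the accumulated gradients
optimizer.zero_grad()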
Persistence of Computation Graph:
GradientTape: By default, the resources held by the tape are released after tape.gradient() is called once. To call it multiple times (e.g., for higher-order derivatives or multiple distinct gradients from the same computation), you need persistent=True.
autograd: By default, the graph used to compute the gradients is freed after .backward() is called (this is controlled by the retain_graph=False default in backward()). If you need to call .backward() again on the same part of the graph (e.g., for multiple loss components contributing to the same parameters, or for higher-order gradients), you must specify loss.backward(retain_graph=True), as sketched below.
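A minimal sketch of retain_graph with two losses that share one graph:
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * x  # shared subgraph
loss1 = 3 * y
loss2 = y + 1
loss1.backward(retain_graph=True)  # keep the graph alive for the next pass
loss2.backward()  # would raise an error without retain_graph above
print(x.grad.item())  # 6x + 2x = 16.0 (gradients accumulated)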
Higher-Order Gradients:
GradientTape: Computed by nesting GradientTape contexts or using a persistent tape.
x_tf = tf.Variable(2.0)
with tf.GradientTape() as tape2:
    with tf.GradientTape() as tape1:
        y_tf = x_tf * x_tf * x_tf  # y = x^3
    # Compute the first derivative inside tape2 so tape2 records it.
    dy_dx_tf = tape1.gradient(y_tf, x_tf)  # dy/dx = 3x^2 = 12
d2y_dx2_tf = tape2.gradient(dy_dx_tf, x_tf)  # d2y/dx2 = 6x = 12
print(f"TensorFlow dy/dx: {dy_dx_tf.numpy()}, d2y/dx2: {d2y_dx2_tf.numpy()}")
autograd: Achieved with create_graph=True, either via torch.autograd.grad or by calling .backward(create_graph=True) on intermediate gradients, which allows further differentiation through the gradient computation itself.
x_pt = torch.tensor(2.0, requires_grad=True)
y_pt = x_pt * x_pt * x_pt # x^3
# First derivative
dy_dx_pt = torch.autograd.grad(y_pt, x_pt, create_graph=True)[0] # dy/dx = 3x^2 = 12
# Second derivative
d2y_dx2_pt = torch.autograd.grad(dy_dx_pt, x_pt)[0] # d2y/dx2 = 6x = 12
print(f"PyTorch dy/dx: {dy_dx_pt.item()}, d2y/dx2: {d2y_dx2_pt.item()}")
Note the use of torch.autograd.grad here, which is a more general function for computing gradients and is useful for higher-order derivatives. When y_pt.backward(create_graph=True) is called, the .grad attributes of tensors involved in computing y_pt will also have graphs attached, allowing further backward() calls or torch.autograd.grad calls.
Stopping Gradient Flow:
GradientTape: Achieved by ensuring a tf.Variable is not watched by the tape, or by using tf.stop_gradient() on a tensor (a sketch follows the PyTorch example below).
autograd: Achieved by setting a tensor's requires_grad attribute to False, using tensor.detach() to create a new tensor that doesn't require gradients and is detached from the current graph, or by wrapping code in a with torch.no_grad(): block.
# PyTorch: Stopping gradient flow
x = torch.randn(2, 2, requires_grad=True)
y = x * 2
print(f"y.requires_grad: {y.requires_grad}") # True
# Using torch.no_grad()
with torch.no_grad():
    z = x * 2
print(f"z.requires_grad: {z.requires_grad}") # False, z doesn't track history
# Using .detach()
w = x * 2
w_detached = w.detach() # Creates a new tensor, shares data, but no grad history
print(f"w.requires_grad: {w.requires_grad}") # True
print(f"w_detached.requires_grad: {w_detached.requires_grad}") # False
For developers coming from TensorFlow, the most significant adjustments will be:
Getting used to requires_grad being the primary switch for gradient tracking, rather than an explicit GradientTape scope for every gradient computation.
Remembering to call optimizer.zero_grad() in your training loop to prevent gradient accumulation.
Understanding that loss.backward() populates the .grad fields of your model parameters, which are then used by the optimizer.
Both systems are powerful and flexible, designed to handle the complex differentiation needs of deep learning models. PyTorch's autograd often feels more integrated into the tensor system, reflecting its define-by-run nature, where the graph is built as operations occur. TensorFlow's GradientTape provides a more explicit way to control when and what operations are recorded for differentiation. As you work more with PyTorch, its autograd system will become second nature, especially when constructing custom training loops.
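To tie these adjustments together, here is a minimal sketch of the canonical PyTorch training step; the model, loss, optimizer, and data below are hypothetical stand-ins:
import torch

model = torch.nn.Linear(10, 1)  # stand-in model
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(16, 10), torch.randn(16, 1)) for _ in range(5)]

for inputs, targets in batches:
    optimizer.zero_grad()  # clear accumulated gradients
    loss = loss_fn(model(inputs), targets)  # forward pass builds the graph
    loss.backward()  # populate the .grad fields of the parameters
    optimizer.step()  # the optimizer consumes .grad to update weights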