All Courses

Disabling Gradient Tracking

While Autograd's ability to automatically track operations and compute gradients is fundamental for training models, there are important scenarios where this tracking is unnecessary or even undesirable. Specifically, during model evaluation (inference) or when you are performing operations that should not influence gradient calculations, tracking history consumes memory and computational resources without providing any benefit. PyTorch offers mechanisms to selectively disable gradient tracking.

Why Disable Gradient Tracking?

Model Evaluation (Inference): When you use a trained model to make predictions on new data, you are not updating its weights. Calculating gradients is pointless in this phase. Disabling gradient tracking significantly reduces memory usage (as no computation graph needs storing) and speeds up the forward pass.
Freezing Model Parameters: During fine-tuning, you might want to freeze parts of a pre-trained model (e.g., early convolutional layers) and only train the later layers. You need to signal PyTorch not to compute or store gradients for the frozen parameters.
Memory Efficiency: The computation graph stored by Autograd can consume substantial memory, especially with complex models or long sequences. Turning off tracking when not needed prevents this overhead.

Using the `torch.no_grad()` Context Manager

The most common and recommended way to disable gradient tracking for a block of code is by using the torch.no_grad() context manager. Any PyTorch operation performed inside this with block will behave as if none of the input tensors require gradients, even if they originally had requires_grad=True.

import torch

# Example tensors
x = torch.randn(2, 2, requires_grad=True)
w = torch.randn(2, 2, requires_grad=True)
b = torch.randn(2, 2, requires_grad=True)

# Operation outside the no_grad context
y = x * w + b
print(f"y.requires_grad: {y.requires_grad}") # Output: y.requires_grad: True
print(f"y.grad_fn: {y.grad_fn}")           # Output: y.grad_fn: <AddBackward0 object at ...>

# Perform operations within the no_grad context
print("\nEntering torch.no_grad() context:")
with torch.no_grad():
    z = x * w + b
    print(f"  z.requires_grad: {z.requires_grad}") # Output: z.requires_grad: False
    print(f"  z.grad_fn: {z.grad_fn}")           # Output: z.grad_fn: None

    # Even if an input requires grad, the output won't
    k = x * 5
    print(f"  k.requires_grad: {k.requires_grad}") # Output: k.requires_grad: False

# Outside the context, tracking resumes if inputs require grad
print("\nExiting torch.no_grad() context:")
p = x * w
print(f"p.requires_grad: {p.requires_grad}") # Output: p.requires_grad: True
print(f"p.grad_fn: {p.grad_fn}")           # Output: p.grad_fn: <MulBackward0 object at ...>

As seen in the example, operations within the with torch.no_grad(): block produce outputs (z, k) with requires_grad=False and no associated grad_fn, indicating they are detached from the computation graph history. This is precisely what you want during an evaluation loop:

# Evaluation loop snippet
model.eval() # Set model to evaluation mode (important for layers like dropout, batchnorm)
total_loss = 0
correct_predictions = 0

with torch.no_grad(): # Disable gradient calculations for evaluation
    for inputs, labels in validation_dataloader:
        inputs, labels = inputs.to(device), labels.to(device) # Move data to appropriate device

        outputs = model(inputs) # Forward pass
        loss = criterion(outputs, labels) # Calculate loss

        total_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        correct_predictions += (predicted == labels).sum().item()

# Calculate average loss and accuracy...

Using the `.detach()` Method

Another way to prevent gradient tracking for a specific tensor is to use the .detach() method. This method creates a new tensor that shares the same underlying data storage as the original tensor but is explicitly detached from the current computation graph. It will have requires_grad=False.

import torch

# Original tensor requiring gradients
a = torch.randn(3, 3, requires_grad=True)
b = a * 2
print(f"b.requires_grad: {b.requires_grad}") # Output: b.requires_grad: True
print(f"b.grad_fn: {b.grad_fn}")           # Output: b.grad_fn: <MulBackward0 object at ...>

# Detach the tensor 'b'
c = b.detach()
print(f"\nAfter detaching b to create c:")
print(f"c.requires_grad: {c.requires_grad}") # Output: c.requires_grad: False
print(f"c.grad_fn: {c.grad_fn}")           # Output: c.grad_fn: None

# Importantly, the original tensor 'b' is unchanged
print(f"\nOriginal tensor b remains attached:")
print(f"b.requires_grad: {b.requires_grad}") # Output: b.requires_grad: True
print(f"b.grad_fn: {b.grad_fn}")           # Output: b.grad_fn: <MulBackward0 object at ...>

# Operations using the detached tensor 'c' won't be tracked
d = c + 1
print(f"\nOperation on detached tensor c:")
print(f"d.requires_grad: {d.requires_grad}") # Output: d.requires_grad: False

When to use .detach() vs torch.no_grad()?

Use torch.no_grad() when you want to perform a block of operations without tracking gradients, typically for inference or evaluation code sections. It's generally more efficient for this purpose.
Use .detach() when you need a specific tensor removed from the computation graph, perhaps for logging its value, using it in an operation that shouldn't affect gradients (like updating a metric), or passing it to a function expecting a non-gradient-tracked tensor, while potentially still needing the original tensor's gradient history elsewhere. Since .detach() shares data, modifying the detached tensor in-place will affect the original tensor, which can have implications for gradient calculation if not handled carefully.

Modifying `requires_grad` In-Place

You can also directly modify the requires_grad attribute of a tensor in-place, but this is generally less common for temporary disabling than the context manager or .detach(). It's often used when defining parameters you explicitly never want to train.

my_tensor = torch.randn(5, requires_grad=True)
print(f"Initial requires_grad: {my_tensor.requires_grad}") # Output: Initial requires_grad: True

# Disable gradient tracking in-place
my_tensor.requires_grad_(False) # Note the underscore for in-place operation
print(f"After requires_grad_(False): {my_tensor.requires_grad}") # Output: After requires_grad_(False): False

Using torch.no_grad() is the standard practice for efficient inference and evaluation, while .detach() offers finer-grained control when you need to isolate specific tensors from the gradient history. Understanding when and how to disable gradient tracking is essential for writing efficient and correct PyTorch code, especially when moving past basic training loops.

Was this section helpful?

Disabling Gradient Tracking

Why Disable Gradient Tracking?

Using the torch.no_grad() Context Manager

Using the .detach() Method

Modifying requires_grad In-Place

Using the `torch.no_grad()` Context Manager

Using the `.detach()` Method

Modifying `requires_grad` In-Place