As you start building more complex PyTorch models and training loops, you'll inevitably encounter errors or situations where the model doesn't behave as expected. While some errors produce clear messages, others can be more subtle, leading to poor performance without obvious crashes. Recognizing common patterns is the first step in efficient debugging. Let's look at some frequent issues that arise during PyTorch development.
Perhaps the most common runtime error in PyTorch involves tensor shape incompatibilities. This typically happens when the output shape of one layer doesn't match the expected input shape of the next layer, or when the input data's shape doesn't align with the model's first layer.
Consider a simple sequence: a convolutional layer followed by a fully connected (linear) layer. The nn.Conv2d layer expects input tensors of shape (Batch Size, Input Channels, Height, Width), often abbreviated as (N, C_in, H_in, W_in). It produces an output of shape (N, C_out, H_out, W_out). However, an nn.Linear layer expects a 2D input of shape (Batch Size, Input Features), or (N, features_in). Connecting these directly without reshaping will cause an error.
import torch
import torch.nn as nn
# Example layers
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
# linear_layer = nn.Linear(in_features=???, out_features=10)  # Problem: what value should in_features be?
# Simulate input data
input_data = torch.randn(64, 3, 32, 32) # (N, C_in, H_in, W_in)
# Forward pass through convolution
conv_output = conv_layer(input_data)
print(f"Conv output shape: {conv_output.shape}")
# Output: Conv output shape: torch.Size([64, 16, 32, 32])
# Attempting to pass directly to linear layer (will fail)
# output = linear_layer(conv_output) # This would raise a RuntimeError
# Correct approach requires flattening
flattened_output = conv_output.view(conv_output.size(0), -1) # Flatten all dims except batch
print(f"Flattened output shape: {flattened_output.shape}")
# Output: Flattened output shape: torch.Size([64, 16384]) # 16 * 32 * 32 = 16384
# Now we know the required in_features for the linear layer
correct_linear_layer = nn.Linear(in_features=16384, out_features=10)
output = correct_linear_layer(flattened_output)
print(f"Final output shape: {output.shape}")
# Output: Final output shape: torch.Size([64, 10])
Errors often look like: RuntimeError: size mismatch, m1: [64 x 16384], m2: [? x 10]. The m1 usually refers to the input tensor being passed to the layer, and m2 refers to the layer's weight matrix. The error message indicates the shapes PyTorch tried to multiply. Debugging these involves:
- Printing the .shape of tensors before and after the layers involved.
- Reshaping where necessary: for nn.Linear after convolutions, this often involves flattening the (N, C, H, W) output to (N, C*H*W) using tensor.view(batch_size, -1).
- Ensuring the in_features argument of the nn.Linear layer matches the flattened size (a small shape-printing sketch follows this list).
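If the flattened size isn't obvious, a common trick is to temporarily print shapes inside the model's forward method. Below is a minimal sketch of that idea; the ShapeDebugModel class is hypothetical, simply reuses the layer configuration from the example above, and assumes the torch and torch.nn imports from the earlier snippet.
# Hypothetical model used only to illustrate temporary shape-printing during debugging
class ShapeDebugModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.fc = nn.Linear(in_features=16 * 32 * 32, out_features=10)  # 16384 for 32x32 inputs

    def forward(self, x):
        print(f"input:      {x.shape}")   # temporary debug print; remove once shapes are confirmed
        x = self.conv(x)
        print(f"after conv: {x.shape}")
        x = x.view(x.size(0), -1)         # flatten all dimensions except the batch dimension
        print(f"flattened:  {x.shape}")   # the second dimension is the required in_features
        return self.fc(x)

debug_model = ShapeDebugModel()
out = debug_model(torch.randn(64, 3, 32, 32))
print(f"output:     {out.shape}")  # torch.Size([64, 10])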
PyTorch allows computation on different devices, primarily the CPU and NVIDIA GPUs (using CUDA). A frequent runtime error occurs when you try to perform an operation involving tensors located on different devices. For instance, if your model is moved to the GPU (model.to('cuda')) but your input data remains on the CPU, the forward pass will fail.
# Select a device: CUDA if available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")

model = nn.Linear(10, 5)
input_cpu = torch.randn(1, 10)  # Tensor lives on the CPU by default

# Move model to GPU (if available)
model.to(device)
print(f"Model device: {next(model.parameters()).device}")

# Attempt forward pass with CPU tensor and GPU model (will fail if device is cuda)
try:
    output = model(input_cpu)
except RuntimeError as e:
    print(f"Error: {e}")
    # Output might be: Error: Expected all tensors to be on the same device,
    # but found at least two devices, cuda:0 and cpu!

# Correct approach: move the input tensor to the same device as the model
input_gpu = input_cpu.to(device)
print(f"Input tensor device: {input_gpu.device}")
output = model(input_gpu)  # This works
print(f"Output tensor device: {output.device}")
print("Forward pass successful!")
This is a common scenario leading to device mismatch errors: input tensors must be moved to the same device as the model (e.g., the GPU) before the forward pass.
The error message RuntimeError: Expected all tensors to be on the same device... is quite explicit. Debugging involves:
- Defining a device variable early in your script (checking torch.cuda.is_available()).
- Moving the model once with model.to(device).
- Moving every batch inside the training loop, e.g., data = data.to(device) and targets = targets.to(device) (see the sketch after this list).
- Printing the .device attributes of tensors and model parameters (next(model.parameters()).device) if unsure.
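As a sketch of this pattern inside a training loop, the snippet below uses a hypothetical toy DataLoader, loss, and optimizer just so it runs end to end; in practice you would substitute your own. It assumes device and the nn.Linear(10, 5) model from the previous example.
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data so the loop runs end to end; replace with your own DataLoader
train_loader = DataLoader(TensorDataset(torch.randn(32, 10), torch.randn(32, 5)), batch_size=8)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.to(device)  # move the model once, before the loop
for data, targets in train_loader:
    data = data.to(device)        # move inputs to the model's device every iteration
    targets = targets.to(device)  # move labels as well
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, targets)
    loss.backward()
    optimizer.step()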
Choosing the right loss function is important, but you also need to ensure your model's output and your target labels have the shape and data type expected by that loss function. Using the wrong combination might not always crash, but it can lead to "silent" failures where the loss decreases even though the model isn't learning the actual task correctly.
- nn.CrossEntropyLoss: Commonly used for multi-class classification. It expects raw logits of shape (N, num_classes) from the model and class-index targets of shape (N) with an integer (long) dtype. Applying a softmax function before this loss is usually incorrect, as CrossEntropyLoss combines LogSoftmax and NLLLoss (see the sketch after these lists).
- nn.MSELoss (Mean Squared Error): Often used for regression tasks. The model output and the target should have the same shape and a floating-point dtype.
- nn.BCEWithLogitsLoss: Used for binary classification or multi-label classification. It expects raw logits and floating-point targets of the same shape; applying sigmoid before this loss is incorrect, as it includes the sigmoid calculation.

A mismatch here might result in:
- A RuntimeError if shapes are fundamentally incompatible.
- Misleading loss values when an extra activation is applied (e.g., a softmax before CrossEntropyLoss).
- Training on the wrong objective entirely (e.g., using MSELoss for classification indices).

Debugging involves carefully reading the PyTorch documentation for your chosen loss function and verifying:
- The expected shape of the model's output.
- The expected shape and data type (dtype) of your target tensor.
- Whether an activation (softmax or sigmoid) should be applied before the loss.
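As one concrete illustration, here is a small sketch of the shapes and dtypes nn.CrossEntropyLoss works with; the numbers are made up purely for demonstration, and the block reuses the earlier torch imports.
# Sketch: what nn.CrossEntropyLoss expects (values are made up for illustration)
criterion = nn.CrossEntropyLoss()

logits = torch.randn(4, 3)            # model output: raw logits, shape (N=4, num_classes=3)
targets = torch.tensor([0, 2, 1, 2])  # targets: class indices, shape (N,), dtype torch.int64
loss = criterion(logits, targets)
print(f"Loss: {loss.item():.4f}")

# A common silent mistake: applying softmax before the loss.
# This still runs and the loss still decreases during training,
# but CrossEntropyLoss already applies LogSoftmax internally,
# so the gradients no longer correspond to the intended objective.
probs = torch.softmax(logits, dim=1)
misleading_loss = criterion(probs, targets)
print(f"Misleading loss: {misleading_loss.item():.4f}")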
Sometimes, gradients don't propagate back through the network as expected, leading to parameters not being updated. This can happen silently if not monitored.
- Missing requires_grad=True: While nn.Module parameters automatically have requires_grad=True, if you create intermediate tensors that should be part of the computation graph, ensure they have this flag set correctly. Usually, this is handled automatically if they result from operations on tensors that already require gradients.
- In-place operations: In-place operations (e.g., tensor.add_()) can interfere with gradient tracking in older PyTorch versions or complex graphs. While PyTorch has improved its handling, it's generally safer to use out-of-place versions (y = x + 1 instead of x += 1) within network computations where gradients are needed.
- Converting to NumPy: Converting a tensor to a NumPy array (.numpy()) detaches it from the computation graph. Any subsequent operations using that NumPy array, even if converted back to a tensor, will not have gradients flowing back to the original parts of the graph.
- Using .detach(): Calling .detach() on a tensor explicitly removes it from the computation graph. This is sometimes necessary (e.g., during evaluation), but accidentally using it during training will stop gradients.

A symptom of these problems is finding that the .grad attributes of some parameters remain None after loss.backward(), or that the model's performance doesn't improve despite the training loop running. You'll learn how to inspect gradients more formally later in this chapter.
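As a quick, informal check in the meantime, you can run a single backward pass and look at which parameters actually received gradients. The tiny model below is hypothetical and reuses the earlier imports; the detach example at the end shows how a broken graph surfaces.
# Sketch: checking whether gradients reached every parameter after backward()
small_model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1))  # hypothetical tiny model
x = torch.randn(4, 10)

loss = small_model(x).mean()
loss.backward()

for name, param in small_model.named_parameters():
    status = "NO GRADIENT" if param.grad is None else f"grad norm {param.grad.norm().item():.4f}"
    print(f"{name}: {status}")

# Example of a broken graph: detaching (or converting to NumPy) cuts gradient flow
w = torch.randn(3, requires_grad=True)
y = (w * 2).detach()    # y is no longer connected to w
print(y.requires_grad)  # False: calling y.sum().backward() here would raise an error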
Bugs can also originate in your Dataset implementation or data transformations.
- Incorrect __getitem__ output: Returning data of the wrong type (e.g., images as NumPy arrays instead of Tensors if the model expects tensors) or incorrect shape.
- Inconsistent sample sizes: If __getitem__ returns tensors of varying sizes (e.g., variable-length sequences or images of different dimensions) and the DataLoader's collate_fn doesn't handle the padding or stacking correctly, it can cause errors during batch creation.
- Faulty transformations: A bug in a transformation (e.g., inside a transforms.Compose pipeline) can silently corrupt the data fed into the model.

Debugging these often involves:
- Testing your Dataset separately.
- Inspecting individual samples with dataset[i].
- Examining a single batch drawn from the DataLoader (see the sketch after this list).
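A sketch of that kind of spot check is shown below; SimpleDataset is a hypothetical stand-in for your own Dataset class, and the block reuses the earlier torch imports.
from torch.utils.data import Dataset, DataLoader

# Hypothetical Dataset used only to illustrate the spot checks
class SimpleDataset(Dataset):
    def __init__(self):
        self.images = torch.randn(100, 3, 32, 32)
        self.labels = torch.randint(0, 10, (100,))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

dataset = SimpleDataset()

# 1. Inspect a single sample: type, shape, dtype, and value range
image, label = dataset[0]
print(type(image), image.shape, image.dtype, image.min().item(), image.max().item())
print(type(label), label.dtype)

# 2. Inspect one batch from the DataLoader before training on it
loader = DataLoader(dataset, batch_size=16, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # expect torch.Size([16, 3, 32, 32]) and torch.Size([16])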
Being aware of these common pitfalls helps you anticipate potential issues and provides a starting point for diagnosing problems when they occur. The following sections will provide more structured techniques for debugging and monitoring your models.