As you start building more complex PyTorch models and training loops, you'll inevitably encounter errors or situations where the model doesn't behave as expected. While some errors produce clear messages, others can be more subtle, leading to poor performance without obvious crashes. Recognizing common patterns is the first step in efficient debugging. Here are some frequent issues that arise during PyTorch development.
Perhaps the most common runtime error in PyTorch involves tensor shape incompatibilities. This typically happens when the output shape of one layer doesn't match the expected input shape of the next layer, or when the input data's shape doesn't align with the model's first layer.
Consider a simple sequence: a convolutional layer followed by a fully connected (linear) layer. The nn.Conv2d layer expects input tensors of shape (Batch Size, Input Channels, Height, Width), often abbreviated as (N, C_in, H_in, W_in). It produces an output of shape (N, C_out, H_out, W_out). However, an nn.Linear layer expects a 2D input of shape (Batch Size, Input Features), or (N, features_in). Connecting these directly without reshaping will cause an error.
import torch
import torch.nn as nn
# Example layers
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
# linear_layer = nn.Linear(in_features=???, out_features=10)  # Problem: what should in_features be?
# Simulate input data
input_data = torch.randn(64, 3, 32, 32) # (N, C_in, H_in, W_in)
# Forward pass through convolution
conv_output = conv_layer(input_data)
print(f"Conv output shape: {conv_output.shape}")
# Output: Conv output shape: torch.Size([64, 16, 32, 32])
# Attempting to pass directly to linear layer (will fail)
# output = linear_layer(conv_output) # This would raise a RuntimeError
# Correct approach requires flattening
flattened_output = conv_output.view(conv_output.size(0), -1) # Flatten all dims except batch
print(f"Flattened output shape: {flattened_output.shape}")
# Output: Flattened output shape: torch.Size([64, 16384]) # 16 * 32 * 32 = 16384
# Now we know the required in_features for the linear layer
correct_linear_layer = nn.Linear(in_features=16384, out_features=10)
output = correct_linear_layer(flattened_output)
print(f"Final output shape: {output.shape}")
# Output: Final output shape: torch.Size([64, 10])
Errors from this mistake often look like RuntimeError: size mismatch, m1: [64 x 16384], m2: [? x 10] (newer PyTorch versions phrase this as mat1 and mat2 shapes cannot be multiplied). Here m1 (or mat1) refers to the input tensor being passed to the layer, and m2 (or mat2) refers to the layer's weight matrix; the message shows the shapes PyTorch tried to multiply. Debugging these errors involves:
- Printing the .shape of tensors at different points in the forward pass to see where they diverge from what you expect.
- Reshaping where needed; for nn.Linear after convolutions, this often involves flattening the (N, C, H, W) output to (N, C*H*W) using tensor.view(batch_size, -1) or an nn.Flatten layer (see the sketch below this list).
- Ensuring the in_features argument of the nn.Linear layer matches the flattened size.
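One way to keep the reshape from going wrong is to make flattening part of the model itself. The following is a minimal sketch, not the only way to structure this; the layer sizes are assumptions chosen to match the 3x32x32 inputs used in the example above.
import torch
import torch.nn as nn

# nn.Flatten handles the reshape between the convolutional and linear
# parts, so the forward pass never needs a manual .view() call.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.Flatten(),                 # (N, 16, 32, 32) -> (N, 16 * 32 * 32)
    nn.Linear(16 * 32 * 32, 10),  # in_features must still match the flattened size
)

dummy_input = torch.randn(64, 3, 32, 32)
print(model(dummy_input).shape)  # torch.Size([64, 10])
Keeping the flatten step inside the model also makes the required in_features visible in one place.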
PyTorch allows computation on different devices, primarily the CPU and NVIDIA GPUs (using CUDA). A frequent runtime error occurs when you try to perform an operation involving tensors located on different devices. For instance, if your model is moved to the GPU (model.to('cuda')) but your input data remains on the CPU, the forward pass will fail.
# Select the GPU if CUDA is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")
model = nn.Linear(10, 5)
input_cpu = torch.randn(1, 10) # Tensor on CPU by default
# Move model to GPU (if available)
model.to(device)
print(f"Model device: {next(model.parameters()).device}")
# Attempt forward pass with CPU tensor and GPU model (will fail if device is cuda)
try:
    output = model(input_cpu)
except RuntimeError as e:
    print(f"Error: {e}")
# Output might be: Error: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
# Correct approach: Move input tensor to the same device as the model
input_gpu = input_cpu.to(device)
print(f"Input tensor device: {input_gpu.device}")
output = model(input_gpu) # This works
print(f"Output tensor device: {output.device}")
print("Forward pass successful!")
This is a common scenario leading to device mismatch errors: input tensors must be moved to the same device as the model (e.g., the GPU) before the forward pass.
The error message RuntimeError: Expected all tensors to be on the same device... is quite explicit. Debugging involves:
- Defining a device variable early in your script (checking torch.cuda.is_available()).
- Moving the model once with model.to(device).
- Moving every batch of inputs and targets inside the training loop, e.g. data = data.to(device) and targets = targets.to(device) (a loop sketch follows this list).
- Printing the .device attributes of tensors and model parameters (next(model.parameters()).device) if unsure.
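As a minimal sketch of where those .to(device) calls belong, here is a tiny training loop; the synthetic dataset, model, and hyperparameters are arbitrary placeholders, and only the device handling is the point.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-ins for real data and a real model
dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 5))
train_loader = DataLoader(dataset, batch_size=32)

model = nn.Linear(10, 5).to(device)          # move the model once
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for data, targets in train_loader:
    data = data.to(device)        # move each batch of inputs
    targets = targets.to(device)  # and the matching targets

    optimizer.zero_grad()
    outputs = model(data)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()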
Choosing the right loss function is important, but you also need to ensure your model's output and your target labels have the shape and data type expected by that loss function. Using the wrong combination might not always crash but can lead to "silent" failures where the loss decreases but the model isn't learning the actual task correctly.
- nn.CrossEntropyLoss: Commonly used for multi-class classification. Applying a softmax function before this loss is usually incorrect, as CrossEntropyLoss combines LogSoftmax and NLLLoss and expects raw logits.
- nn.MSELoss (Mean Squared Error): Often used for regression tasks.
- nn.BCEWithLogitsLoss: Used for binary classification or multi-label classification. Applying a sigmoid before this loss is incorrect, as it includes the sigmoid calculation.
A mismatch here might result in:
- A RuntimeError if the shapes are fundamentally incompatible.
- Subtly wrong loss values if the inputs are not what the loss expects (e.g., passing softmax probabilities instead of raw logits to CrossEntropyLoss).
- A model that trains but optimizes the wrong objective (e.g., using MSELoss for classification indices).
Debugging involves carefully reading the PyTorch documentation for your chosen loss function and verifying:
- The shape your model's output should have.
- The required shape and data type (.dtype) of your target tensor.
- Whether an activation (like softmax or sigmoid) should be applied before the loss, or whether the loss applies it internally.
A minimal check for the common CrossEntropyLoss case is sketched below.
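The following sketch shows the shape and dtype conventions nn.CrossEntropyLoss expects for multi-class classification; the batch size and number of classes are arbitrary.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Raw logits straight from the model: shape (N, num_classes), no softmax applied
logits = torch.randn(8, 5)
# Class indices, not one-hot vectors: shape (N,), dtype torch.long
targets = torch.randint(low=0, high=5, size=(8,))

print(logits.shape, logits.dtype)    # torch.Size([8, 5]) torch.float32
print(targets.shape, targets.dtype)  # torch.Size([8]) torch.int64
loss = criterion(logits, targets)
print(loss.item())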
Sometimes, gradients don't propagate back through the network as expected, leading to parameters not being updated. This can happen silently if not monitored. Common causes include:
- Missing requires_grad=True: While nn.Module parameters automatically have requires_grad=True, if you create intermediate tensors that should be part of the computation graph, ensure they have this flag set correctly. Usually, this is handled automatically if they result from operations on tensors that already require gradients.
- In-place operations: Modifying a tensor in place (e.g., tensor.add_()) can interfere with gradient tracking in older PyTorch versions or complex graphs. While PyTorch has improved its handling, it's generally safer to use out-of-place versions (y = x + 1 instead of x += 1) within network computations where gradients are needed.
- Converting to NumPy: Converting a tensor to a NumPy array (.numpy()) detaches it from the computation graph. Any subsequent operations using that NumPy array, even if converted back to a tensor, will not have gradients flowing back to the original parts of the graph.
- Accidental .detach(): Calling .detach() on a tensor explicitly removes it from the computation graph. This is sometimes necessary (e.g., during evaluation), but accidentally using it during training will stop gradients.
A symptom of these problems is finding that the .grad attributes of some parameters remain None after loss.backward(), or that the model's performance doesn't improve despite the training loop running. A quick check is sketched below; you'll learn how to inspect gradients more formally later in this chapter.
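As a quick sanity check, and only a minimal sketch rather than the formal inspection covered later, you can look at each parameter's .grad after calling backward(); the tiny model here is a placeholder.
import torch
import torch.nn as nn

# A small model and a single backward pass, just to show how to spot
# parameters whose gradients never arrive
model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(4, 10)
loss = model(x).sum()
loss.backward()

for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: no gradient reached this parameter")
    else:
        print(f"{name}: grad norm = {param.grad.norm().item():.4f}")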
Bugs can also originate in your Dataset implementation or data transformations.
- Incorrect __getitem__: Returning data of the wrong type (e.g., images as NumPy arrays instead of Tensors if the model expects tensors) or incorrect shape.
- Inconsistent sample sizes: If __getitem__ returns tensors of varying sizes (e.g., variable-length sequences or images of different dimensions) and the DataLoader's collate_fn doesn't handle the padding or stacking correctly, it can cause errors during batch creation.
- Faulty transforms: A bug in a transformation (e.g., within a transforms.Compose pipeline) can silently corrupt the data fed into the model.
Debugging these often involves:
- Testing your Dataset implementation separately from the rest of the training code.
- Inspecting individual samples directly, e.g. dataset[i], and checking their types and shapes.
- Pulling one batch from the DataLoader and verifying its shapes and dtypes, as in the sketch below.
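Here is a minimal sketch of that kind of spot check; the synthetic TensorDataset is a stand-in for your own Dataset implementation.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a custom Dataset; replace with your own implementation
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# Inspect a single sample before involving the DataLoader
sample_x, sample_y = dataset[0]
print(type(sample_x), sample_x.shape, sample_x.dtype)  # Tensor, (3, 32, 32), float32
print(type(sample_y), sample_y.shape, sample_y.dtype)  # Tensor, (), int64

# Then inspect one batch produced by the DataLoader
loader = DataLoader(dataset, batch_size=16, shuffle=True)
batch_x, batch_y = next(iter(loader))
print(batch_x.shape, batch_y.shape)  # torch.Size([16, 3, 32, 32]) torch.Size([16])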
Being aware of these common challenges helps you anticipate potential issues and provides a starting point for diagnosing problems when they occur. The following sections will provide more structured techniques for debugging and monitoring your models.