Now that we've discussed common debugging challenges and visualization tools, it's time to apply these concepts. This practice section provides hands-on exercises to solidify your skills in identifying errors, inspecting model behavior, and monitoring training progress using TensorBoard and standard debugging techniques. We will work through scenarios involving shape mismatches, device placement errors, and setting up visualization.
Shape mismatches are frequent errors when building or modifying neural networks. Consider the following simple model designed for processing 28x28 grayscale images (like MNIST) flattened into a 784-element vector:
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)  # Input 784, Output 128
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)   # Input 128, Output 64
        # Incorrect input size for layer3 - should be 64
        self.layer3 = nn.Linear(100, 10)   # Input 100 (Error!), Output 10

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        x = self.activation(x)
        # This line will cause an error
        x = self.layer3(x)
        return x

# Create a dummy input batch (Batch size 4, Features 784)
dummy_input = torch.randn(4, 784)
model = SimpleMLP()

# Attempt the forward pass
try:
    output = model(dummy_input)
    print("Model ran successfully!")
except RuntimeError as e:
    print(f"Caught an error: {e}")
Run the Code: Execute the code snippet above. You will encounter a RuntimeError. Observe the error message carefully. It typically points to a size mismatch, mentioning the tensor shapes involved in the failing operation (in this case, mat1 and mat2 shapes cannot be multiplied).
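Before tracing the forward pass line by line, a quick way to spot inconsistent dimensions is to list each Linear layer's input and output sizes and check that consecutive layers line up. The following is a minimal sketch that assumes the model object created above:

# List the in/out feature sizes of every Linear layer in registration order
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(f"{name}: in_features={module.in_features}, out_features={module.out_features}")

The output shows layer2 producing 64 features while layer3 expects 100, which is exactly the inconsistency the error message hints at.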
Diagnose: The error occurs when the input x reaches self.layer3. The previous layer, self.layer2, outputs a tensor of shape (batch_size, 64). However, self.layer3 is defined with nn.Linear(100, 10), expecting an input with 100 features. This mismatch causes the error. You can confirm the shape by adding a print statement inside forward(), just before the failing line:
# Inside the forward method, before self.layer3(x)
print("Shape before layer3:", x.shape)
x = self.layer3(x)
Fix the Code: Modify the definition of self.layer3 in the __init__ method so that it accepts the correct number of input features (64, the output size of self.layer2).
# Corrected layer definition
self.layer3 = nn.Linear(64, 10) # Input 64, Output 10
Verify: Rerun the script with the corrected layer definition. The forward pass should now complete without errors.
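For larger models, scattering print statements through forward() quickly becomes tedious. A reusable alternative is to register forward hooks that report each layer's output shape as data flows through the network. This is a minimal sketch under that idea; the helper name and structure are illustrative, not part of the exercise code:

import torch
import torch.nn as nn

def trace_shapes(model, sample_input):
    """Run one forward pass and print the output shape of every leaf module."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) == 0:  # leaf modules only
            def hook(mod, inputs, output, name=name):
                print(f"{name}: output shape {tuple(output.shape)}")
            handles.append(module.register_forward_hook(hook))
    try:
        model(sample_input)
    except RuntimeError as e:
        print(f"Stopped at error: {e}")
    finally:
        for h in handles:
            h.remove()  # always clean up hooks

# Example usage with the SimpleMLP from above (broken or fixed)
trace_shapes(SimpleMLP(), torch.randn(4, 784))

On the broken model, the last reported shape is (4, 64) right before layer3 fails, which points directly at the offending layer.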
When using GPUs, it's important that both the model and the data are on the same device. Let's simulate an error where the model is moved to the GPU, but the input tensor remains on the CPU.
import torch
import torch.nn as nn

# Use a CUDA-enabled GPU if one is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("Using CPU")

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

# Create the model and move it to the target device (e.g., GPU)
model = SimpleNet().to(device)
print(f"Model parameters are on: {next(model.parameters()).device}")

# Create input data - intentionally left on the CPU
input_data = torch.randn(8, 10)
print(f"Input data is on: {input_data.device}")

# Attempt forward pass - this will likely cause an error if device is 'cuda'
try:
    output = model(input_data)
    print("Forward pass successful!")
except RuntimeError as e:
    print(f"\nCaught an error: {e}")
    print("\nHint: Check if the model and input data are on the same device.")
Run the Code: If you have a CUDA-enabled GPU, running this code will produce a RuntimeError. The error message will likely state something like Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Diagnose: The print statements confirm that the model is on the cuda device (if available), while input_data is on the cpu device. PyTorch operations generally require all operands to reside on the same device.
Fix the Code: Before passing input_data to the model, move it to the same device the model is on.
# Move input data to the correct device
input_data = input_data.to(device)
print(f"Input data moved to: {input_data.device}")
# Now, attempt the forward pass again
output = model(input_data)
print("Forward pass successful after moving data!")
Verify: Rerun the corrected script. The forward pass should execute without device mismatch errors. Remember that this principle applies within the training loop too; each batch fetched from the DataLoader needs to be moved to the appropriate device.
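As a reference, here is a minimal sketch of how that per-batch transfer typically looks inside a training loop. The dataset, model, loss, and optimizer below are placeholders, not tied to the exercise code:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder data, model, loss, and optimizer
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 5))
loader = DataLoader(dataset, batch_size=16, shuffle=True)
model = nn.Linear(10, 5).to(device)  # model is moved once, up front
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:
    # Each batch starts on the CPU and must be moved every iteration
    inputs = inputs.to(device)
    targets = targets.to(device)

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()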
TensorBoard provides invaluable insights into the training process. Let's integrate it into a simplified training loop. We'll simulate training data and track a dummy loss value.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import time

# 1. Setup TensorBoard Writer
# Log files will be saved in the 'runs/simple_experiment' directory
writer = SummaryWriter('runs/simple_experiment')

# 2. Define a simple model, loss, and optimizer
model = nn.Linear(10, 2)  # Simple linear model
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Simulate a simple dataset
inputs = torch.randn(100, 10)   # 100 samples, 10 features
targets = torch.randn(100, 2)   # 100 samples, 2 output values

# 3. Simple Training Loop
print("Starting simulated training...")
num_epochs = 50
for epoch in range(num_epochs):
    optimizer.zero_grad()               # Zero gradients
    outputs = model(inputs)             # Forward pass
    loss = criterion(outputs, targets)  # Calculate loss

    # Simulate changing loss (replace with actual loss in real training)
    # Making loss decrease over epochs for demonstration
    simulated_loss = loss + torch.randn(1) * 0.1 + (num_epochs - epoch) / num_epochs

    simulated_loss.backward()  # Backward pass (using simulated loss for demo)
    optimizer.step()           # Update weights

    # 4. Log metrics to TensorBoard
    if (epoch + 1) % 5 == 0:  # Log every 5 epochs
        # Log the scalar 'loss' value
        writer.add_scalar('Training/Loss', simulated_loss.item(), epoch)
        # Log the distribution of model weights (example for the linear layer)
        writer.add_histogram('Model/Weights', model.weight, epoch)
        writer.add_histogram('Model/Bias', model.bias, epoch)
        print(f'Epoch [{epoch+1}/{num_epochs}], Simulated Loss: {simulated_loss.item():.4f}')

    time.sleep(0.1)  # Simulate training time

# 5. Add model graph (optional)
# Ensure input shape matches what the model expects
# writer.add_graph(model, inputs[0].unsqueeze(0))  # Provide a sample input batch

# 6. Close the writer
writer.close()
print("Finished simulated training. TensorBoard logs saved to 'runs/simple_experiment'.")
print("Run 'tensorboard --logdir=runs' in your terminal to view.")
Launch TensorBoard: Open a terminal in the directory that contains the runs folder (not inside runs itself), and run the command:
tensorboard --logdir=runs
TensorBoard will start and print a local URL (usually http://localhost:6006/). Open this URL in your web browser.
Explore: Select the simple_experiment run. Under the "Scalars" tab, you'll see the "Training/Loss" plot showing the decreasing trend of our simulated loss. Under the "Histograms" or "Distributions" tab, you can observe how the distributions of the model's weights and biases change (or don't change much in this simple simulation) over epochs. If you uncommented the add_graph line, you'd also find a visualization of the model architecture under the "Graphs" tab.
The following plot shows an example of how the loss might decrease over epochs when viewed in TensorBoard.
A line chart depicting a simulated decreasing training loss across 50 epochs, logged every 5 epochs.
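Once the basics work, the same SummaryWriter can track additional diagnostics. One addition that often helps when debugging training is logging per-parameter gradient norms after the backward pass. The sketch below assumes the writer, model, and epoch variables from the script above; the tag names are arbitrary:

# After the backward pass, log the gradient norm of every parameter
for name, param in model.named_parameters():
    if param.grad is not None:
        writer.add_scalar(f'Gradients/{name}_norm', param.grad.norm().item(), epoch)

Vanishing or exploding gradient norms show up clearly in the Scalars tab, often before the loss curve itself reveals a problem.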
Sometimes, print statements aren't enough, and you need to interactively inspect the state of your program. The Python Debugger (pdb) is a powerful tool for this. Let's revisit the shape mismatch scenario from Exercise 1 and use pdb.
Modify the original failing code from Exercise 1 by adding import pdb at the top and pdb.set_trace() right before the line that causes the error:
import torch
import torch.nn as nn
import pdb  # Import the debugger

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)
        # Incorrect input size for layer3
        self.layer3 = nn.Linear(100, 10)  # Error here!

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        x = self.activation(x)
        print("About to enter pdb...")
        pdb.set_trace()  # Set a breakpoint here
        # Execution will pause here
        print("Shape before layer3:", x.shape)  # We can inspect x in pdb
        x = self.layer3(x)  # This line will cause an error
        return x

dummy_input = torch.randn(4, 784)
model = SimpleMLP()
output = model(dummy_input)  # Run will now pause inside forward()
Run the script. When execution reaches pdb.set_trace(), it will pause, and you'll see a (Pdb) prompt in your terminal.
Type p x.shape (print x.shape) and press Enter. You'll see torch.Size([4, 64]).
Type p self.layer3 and press Enter. You'll see the definition Linear(in_features=100, out_features=10, bias=True).
Type n (next) and press Enter. This attempts to execute the next line (x = self.layer3(x)), which will cause the RuntimeError and likely exit the debugger or show the traceback within it.
You can also type c (continue) to let the program continue until the next breakpoint or the error, or q (quit) to exit the debugger and terminate the script immediately.
Finally, type q to quit, then fix the code as in Exercise 1 and remove the import pdb and pdb.set_trace() lines.
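Setting a breakpoint ahead of time is not the only option. If an exception has already been raised, pdb's post-mortem mode drops you into the debugger at the frame where the error occurred. Here is a minimal sketch of that pattern applied to the broken model from Exercise 1 (without the set_trace() call):

import pdb
import torch

model = SimpleMLP()  # the broken version from Exercise 1, no set_trace() inside
dummy_input = torch.randn(4, 784)

try:
    output = model(dummy_input)
except RuntimeError:
    # Open the debugger at the frame where the exception was raised;
    # from the (Pdb) prompt you can run 'p x.shape' or move up and down the stack.
    pdb.post_mortem()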
This practice provides a foundation for tackling debugging and monitoring tasks in your PyTorch projects. Remember to use print statements for quick checks, pdb for interactive inspection, and TensorBoard for visualizing training dynamics and model structure. These tools are essential for building, training, and refining effective deep learning models.