Hands-on exercises solidify the skills needed to tackle common debugging challenges and apply visualization tools. The exercises below focus on identifying errors, inspecting model behavior, and monitoring training progress using TensorBoard and standard debugging techniques. Scenarios covered include shape mismatches, device placement errors, and setting up training visualization.

## Exercise 1: Fixing a Shape Mismatch

Shape mismatches are frequent errors when building or modifying neural networks. Consider the following simple model designed for processing 28x28 grayscale images (like MNIST) flattened into a 784-element vector:

```python
import torch
import torch.nn as nn

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)  # Input 784, Output 128
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)   # Input 128, Output 64
        # Incorrect input size for layer3 - should be 64
        self.layer3 = nn.Linear(100, 10)   # Input 100 (Error!), Output 10

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        x = self.activation(x)
        # This line will cause an error
        x = self.layer3(x)
        return x

# Create a dummy input batch (Batch size 4, Features 784)
dummy_input = torch.randn(4, 784)
model = SimpleMLP()

# Attempt the forward pass
try:
    output = model(dummy_input)
    print("Model ran successfully!")
except RuntimeError as e:
    print(f"Caught an error: {e}")
```

**Run the Code:** Execute the code snippet above. You will encounter a `RuntimeError`. Read the error message carefully: it points to a size mismatch, often mentioning expected and actual input sizes for a specific layer (in this case, `mat1 and mat2 shapes cannot be multiplied`).

**Diagnose:** The error occurs when the input `x` reaches `self.layer3`. The previous layer, `self.layer2`, outputs a tensor of shape `(batch_size, 64)`. However, `self.layer3` is defined as `nn.Linear(100, 10)` and therefore expects an input with 100 features. This mismatch causes the error. You can insert a print statement before the failing line to confirm the shape:

```python
# Inside the forward method, before self.layer3(x)
print("Shape before layer3:", x.shape)
x = self.layer3(x)
```

**Fix the Code:** Modify the definition of `self.layer3` in the `__init__` method to accept the correct number of input features (64, the output size of `self.layer2`):

```python
# Corrected layer definition
self.layer3 = nn.Linear(64, 10)  # Input 64, Output 10
```

**Verify:** Rerun the script with the corrected layer definition. The forward pass should now complete without errors.
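The print-statement approach in Exercise 1 works, but it means editing `forward` every time you want to check a shape. An alternative is to register forward hooks that log each layer's output shape without touching the model code. The snippet below is a minimal sketch of this idea; the helper name `print_shapes` and the toy `nn.Sequential` model are purely illustrative.

```python
import torch
import torch.nn as nn

def print_shapes(module, inputs, output):
    # Called automatically after each hooked module's forward pass
    print(f"{module.__class__.__name__}: output shape {tuple(output.shape)}")

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
)

# Attach the hook to every leaf submodule (modules with no children)
handles = [
    m.register_forward_hook(print_shapes)
    for m in model.modules()
    if len(list(m.children())) == 0
]

x = torch.randn(4, 784)
model(x)  # Prints the output shape of each layer in order

# Remove the hooks once you are done inspecting
for h in handles:
    h.remove()
```

Because the hooks live outside the model definition, they can be added and removed freely while you track down where a shape stops matching your expectations.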
## Exercise 2: Correcting Device Placement

When using GPUs, it is important that both the model and the data are on the same device. Let's simulate an error where the model is moved to the GPU, but the input tensor remains on the CPU.

```python
import torch
import torch.nn as nn

# Assume a CUDA-enabled GPU is available
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("Using CPU")

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)

    def forward(self, x):
        return self.linear(x)

# Create the model and move it to the target device (e.g., GPU)
model = SimpleNet().to(device)
print(f"Model parameters are on: {next(model.parameters()).device}")

# Create input data - intentionally left on the CPU
input_data = torch.randn(8, 10)
print(f"Input data is on: {input_data.device}")

# Attempt forward pass - this will likely cause an error if device is 'cuda'
try:
    output = model(input_data)
    print("Forward pass successful!")
except RuntimeError as e:
    print(f"\nCaught an error: {e}")
    print("\nHint: Check if the model and input data are on the same device.")
```

**Run the Code:** If you have a CUDA-enabled GPU, running this code will produce a `RuntimeError`. The error message will likely state something like `Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!`.

**Diagnose:** The print statements confirm the model is on the `cuda` device (if available), while `input_data` is on the `cpu` device. PyTorch operations generally require all operands to reside on the same device.

**Fix the Code:** Before passing `input_data` to the model, move it to the same device the model is on:

```python
# Move input data to the correct device
input_data = input_data.to(device)
print(f"Input data moved to: {input_data.device}")

# Now, attempt the forward pass again
output = model(input_data)
print("Forward pass successful after moving data!")
```

**Verify:** Rerun the corrected script. The forward pass should execute without device mismatch errors. Remember that this principle applies within the training loop too: each batch fetched from the `DataLoader` needs to be moved to the appropriate device, as sketched below.
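To make that last point concrete, here is a minimal sketch of a training loop that moves every batch onto the model's device before the forward pass. The random `TensorDataset` and the layer sizes are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy dataset: 64 samples with 10 features and 5-dimensional targets
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 5))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Linear(10, 5).to(device)  # Model lives on the target device
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

for features, targets in loader:
    # Move each batch to the same device as the model
    features = features.to(device)
    targets = targets.to(device)

    optimizer.zero_grad()
    loss = criterion(model(features), targets)
    loss.backward()
    optimizer.step()
```

Forgetting either of the two `.to(device)` calls inside the loop reproduces exactly the mismatch error from this exercise, so it is a useful pattern to internalize.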
## Exercise 3: Visualizing Training with TensorBoard

TensorBoard provides invaluable insights into the training process. Let's integrate it into a simplified training loop. We'll simulate training data and track a dummy loss value.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import time

# 1. Setup TensorBoard Writer
# Log files will be saved in the 'runs/simple_experiment' directory
writer = SummaryWriter('runs/simple_experiment')

# 2. Define a simple model, loss, and optimizer
model = nn.Linear(10, 2)  # Simple linear model
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Simulate a simple dataset
inputs = torch.randn(100, 10)   # 100 samples, 10 features
targets = torch.randn(100, 2)   # 100 samples, 2 output values

# 3. Simple Training Loop
print("Starting simulated training...")
num_epochs = 50
for epoch in range(num_epochs):
    optimizer.zero_grad()               # Zero gradients
    outputs = model(inputs)             # Forward pass
    loss = criterion(outputs, targets)  # Calculate loss

    # Simulate changing loss (replace with actual loss in real training)
    # Making loss decrease over epochs for demonstration
    simulated_loss = loss + torch.randn(1) * 0.1 + (num_epochs - epoch) / num_epochs

    simulated_loss.backward()  # Backward pass (using simulated loss for demo)
    optimizer.step()           # Update weights

    # 4. Log metrics to TensorBoard
    if (epoch + 1) % 5 == 0:  # Log every 5 epochs
        # Log the scalar 'loss' value
        writer.add_scalar('Training/Loss', simulated_loss.item(), epoch)
        # Log the distribution of model weights (example for the linear layer)
        writer.add_histogram('Model/Weights', model.weight, epoch)
        writer.add_histogram('Model/Bias', model.bias, epoch)
        print(f'Epoch [{epoch+1}/{num_epochs}], Simulated Loss: {simulated_loss.item():.4f}')

    time.sleep(0.1)  # Simulate training time

# 5. Add model graph (optional)
# Ensure input shape matches what the model expects
# writer.add_graph(model, inputs[0].unsqueeze(0))  # Provide a sample input batch

# 6. Close the writer
writer.close()
print("Finished simulated training. TensorBoard logs saved to 'runs/simple_experiment'.")
print("Run 'tensorboard --logdir=runs' in your terminal to view.")
```

**Run the Code:** Execute the Python script. It will print epoch progress and report where the logs are saved.

**Launch TensorBoard:** Open your terminal or command prompt, navigate to the directory containing the `runs` folder (not inside `runs` itself), and run the command:

```bash
tensorboard --logdir=runs
```

**View in Browser:** TensorBoard will output a URL (usually `http://localhost:6006/`). Open this URL in your web browser.

**Explore:** Navigate the TensorBoard interface. You should find the `simple_experiment` run. Under the "Scalars" tab, you'll see the "Training/Loss" plot showing the decreasing trend of our simulated loss. Under the "Histograms" or "Distributions" tabs, you can observe how the distributions of the model's weights and biases change (or don't change much in this simple simulation) over epochs. If you uncommented the `add_graph` line, you'd also find a visualization of the model architecture under the "Graphs" tab.

The following figure shows an example of how the loss might decrease over epochs when viewed in TensorBoard.

*Figure: Simulated training loss over epochs — a line chart of the loss falling from roughly 1.8 to 0.25 across 50 epochs, logged every 5 epochs.*
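A common next step beyond Exercise 3 is comparing several experiments side by side. If each run writes to its own subdirectory under `runs/`, TensorBoard overlays the curves automatically when pointed at the parent directory. The sketch below illustrates the idea with two learning rates; the directory names (`runs/lr_0.01`, `runs/lr_0.1`) and the tiny training loop are assumptions for illustration, not part of the exercise above.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter

inputs = torch.randn(100, 10)
targets = torch.randn(100, 2)

for lr in [0.01, 0.1]:
    # Each experiment gets its own log directory under runs/
    writer = SummaryWriter(f'runs/lr_{lr}')
    model = nn.Linear(10, 2)
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    for epoch in range(50):
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Same tag in every run, so TensorBoard plots the curves together
        writer.add_scalar('Training/Loss', loss.item(), epoch)

    writer.close()

# Running 'tensorboard --logdir=runs' now shows both loss curves on one plot
```

Keeping one writer (and one directory) per experiment is what makes this comparison work; logging several experiments into the same directory interleaves their events and produces confusing plots.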
## Exercise 4: Using the Python Debugger (pdb)

Sometimes print statements aren't enough, and you need to interactively inspect the state of your program. The Python Debugger (`pdb`) is a powerful tool for this. Let's revisit the shape mismatch scenario from Exercise 1 and use `pdb`.

Modify the original failing code from Exercise 1 by adding `import pdb` at the top and `pdb.set_trace()` right before the line that causes the error:

```python
import torch
import torch.nn as nn
import pdb  # Import the debugger

class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 128)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(128, 64)
        # Incorrect input size for layer3
        self.layer3 = nn.Linear(100, 10)  # Error here!

    def forward(self, x):
        x = self.layer1(x)
        x = self.activation(x)
        x = self.layer2(x)
        x = self.activation(x)
        print("About to enter pdb...")
        pdb.set_trace()  # Set a breakpoint here - execution will pause
        print("Shape before layer3:", x.shape)  # We can inspect x in pdb
        x = self.layer3(x)  # This line will cause an error
        return x

dummy_input = torch.randn(4, 784)
model = SimpleMLP()
output = model(dummy_input)  # Run will now pause inside forward()
```

**Run the Modified Code:** Execute the script. When the program reaches `pdb.set_trace()`, it will pause, and you'll see a `(Pdb)` prompt in your terminal.

**Interact with pdb:**

- Type `p x.shape` (print `x.shape`) and press Enter. You'll see `torch.Size([4, 64])`.
- Type `p self.layer3` and press Enter. You'll see the definition `Linear(in_features=100, out_features=10, bias=True)`.
- Comparing the input shape (64 features) with the layer's expected input (100 features) makes the mismatch clear.
- Type `n` (next) and press Enter. This attempts to execute the next line (`x = self.layer3(x)`), which will cause the `RuntimeError` and likely exit the debugger or show the traceback within it.
- Alternatively, type `c` (continue) to let the program run until the next breakpoint or the error.
- Type `q` (quit) to exit the debugger and terminate the script immediately.

**Fix and Remove:** Once you understand the issue, type `q` to quit, fix the code as in Exercise 1, and remove the `import pdb` and `pdb.set_trace()` lines.

This practice provides a foundation for tackling debugging and monitoring tasks in your PyTorch projects. Use print statements for quick checks, `pdb` for interactive inspection, and TensorBoard for visualizing training dynamics and model structure. These tools are essential for building, training, and refining effective deep learning models. (A brief variation of Exercise 4 using Python's built-in `breakpoint()` function is sketched below.)
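As a small variation on Exercise 4, Python 3.7+ provides the built-in `breakpoint()` function, which drops into `pdb` by default and needs no import; setting the `PYTHONBREAKPOINT=0` environment variable disables every such breakpoint without editing the code. The stripped-down snippet below is only a sketch of the idea, reusing the same deliberate shape mismatch as the exercise.

```python
import torch
import torch.nn as nn

layer2 = nn.Linear(128, 64)
layer3 = nn.Linear(100, 10)  # Same deliberate mismatch as Exercise 4

x = torch.randn(4, 128)
x = layer2(x)

breakpoint()    # Equivalent to pdb.set_trace() by default; no import needed
x = layer3(x)   # Inspect x.shape at the (Pdb) prompt before this line fails
```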