While TensorBoard provides rich visualizations, the foundation of effective monitoring is systematic logging within your training and evaluation code. Simply running the loops isn't enough; you need to record key performance indicators to understand how the training is progressing and to diagnose issues when they arise. This section details how to implement basic metric logging directly within your PyTorch training and evaluation routines.
Logging metrics serves several important purposes: it lets you track progress over time, spot problems such as overfitting or divergence early, and compare results across experiments. The specific metrics depend on your task, but common choices include the training and validation loss and, for classification problems, accuracy.
During training, you typically want to track the average loss and accuracy over each epoch. Logging metrics for every single batch can be noisy and less informative about overall trends, although it can sometimes be useful for debugging instability.
Here’s how you might modify a standard training epoch function to include logging:
import torch

# Assume model, train_dataloader, loss_fn, optimizer are defined

def train_one_epoch(model, train_dataloader, loss_fn, optimizer, device):
    model.train()  # Set model to training mode
    running_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    for batch_idx, (inputs, labels) in enumerate(train_dataloader):
        inputs, labels = inputs.to(device), labels.to(device)

        # 1. Zero gradients
        optimizer.zero_grad()

        # 2. Forward pass
        outputs = model(inputs)

        # 3. Calculate loss
        loss = loss_fn(outputs, labels)

        # 4. Backward pass
        loss.backward()

        # 5. Optimizer step
        optimizer.step()

        # --- Logging steps ---
        # Accumulate loss (use .item() to get Python number)
        running_loss += loss.item() * inputs.size(0)  # Weighted by batch size

        # Accumulate accuracy (example for classification)
        _, predicted = torch.max(outputs.data, 1)
        total_samples += labels.size(0)
        correct_predictions += (predicted == labels).sum().item()
        # --- End Logging steps ---

    # Calculate average loss and accuracy for the epoch
    epoch_loss = running_loss / total_samples
    epoch_acc = correct_predictions / total_samples

    print(f"Training Epoch: Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}")

    # Return metrics for potential further logging or analysis
    return epoch_loss, epoch_acc
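The function above reports only per-epoch averages. As noted earlier, logging every batch is noisier but can occasionally help when debugging instability. A minimal sketch of how the inner loop could additionally print progress every few batches is shown here; log_interval and the print format are illustrative additions, not part of the function above:

    # Sketch: optional batch-level progress printing inside train_one_epoch.
    # log_interval is an illustrative addition, not part of the function above.
    log_interval = 100  # print every 100 batches

    for batch_idx, (inputs, labels) in enumerate(train_dataloader):
        # ... forward/backward/step and logging steps exactly as above ...
        if (batch_idx + 1) % log_interval == 0:
            avg_loss_so_far = running_loss / total_samples
            print(f"  Batch {batch_idx + 1}: average loss so far {avg_loss_so_far:.4f}")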
Key points in the logging implementation:

- Initialize the accumulators (running_loss, correct_predictions, total_samples) at the start of the epoch.
- Use loss.item() to get the scalar value of the loss tensor for the current batch, preventing graph retention. Multiply by inputs.size(0) (the batch size) if you want the total loss before averaging at the end; otherwise you can average the batch losses directly, but weighting by batch size is more accurate if the last batch is smaller (a small numeric sketch of this follows the list).
- For accuracy, take the predicted class with torch.max, compare it to the labels, and accumulate the number of correct predictions, calling .item() here as well.
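To see why weighting by batch size matters, here is a small arithmetic sketch; the batch losses and sizes are made-up numbers for illustration only:

    # Illustrative numbers: two batches of 64 samples and a final batch of 8
    batch_losses = [0.90, 0.80, 0.20]   # mean loss reported for each batch
    batch_sizes = [64, 64, 8]

    # Naive mean of batch means over-weights the small final batch
    unweighted = sum(batch_losses) / len(batch_losses)   # approx. 0.633

    # Sample-weighted mean equals the true average loss per sample
    weighted = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / sum(batch_sizes)
    # (0.90*64 + 0.80*64 + 0.20*8) / 136, approx. 0.812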
Logging in the evaluation loop is similar, but crucial differences exist:

- Wrap the loop in torch.no_grad() to disable gradient calculations.
- Set the model to evaluation mode (model.eval()) to disable dropout and use the batch normalization statistics learned during training.

import torch
# Assume model, val_dataloader, loss_fn are defined

def evaluate_model(model, val_dataloader, loss_fn, device):
    model.eval()  # Set model to evaluation mode
    running_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():  # Disable gradient computations
        for inputs, labels in val_dataloader:
            inputs, labels = inputs.to(device), labels.to(device)

            # Forward pass
            outputs = model(inputs)

            # Calculate loss
            loss = loss_fn(outputs, labels)

            # --- Logging steps ---
            running_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total_samples += labels.size(0)
            correct_predictions += (predicted == labels).sum().item()
            # --- End Logging steps ---

    epoch_loss = running_loss / total_samples
    epoch_acc = correct_predictions / total_samples

    print(f"Validation: Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.4f}")
    return epoch_loss, epoch_acc
Printing metrics to the console is useful for immediate feedback. For more systematic analysis or visualization, you'll want to store them. Simple Python lists or dictionaries work well:
# --- Inside your main training script ---
num_epochs = 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ... initialize model, data, loss, optimizer ...

train_losses, train_accuracies = [], []
val_losses, val_accuracies = [], []

for epoch in range(num_epochs):
    print(f"--- Epoch {epoch+1}/{num_epochs} ---")

    train_loss, train_acc = train_one_epoch(model, train_dataloader, loss_fn, optimizer, device)
    val_loss, val_acc = evaluate_model(model, val_dataloader, loss_fn, device)

    # Store metrics
    train_losses.append(train_loss)
    train_accuracies.append(train_acc)
    val_losses.append(val_loss)
    val_accuracies.append(val_acc)

    # Optional: Save model checkpoint here based on validation performance
    # Optional: Log metrics to TensorBoard (using values stored above)
    # writer.add_scalar('Loss/train', train_loss, epoch)
    # writer.add_scalar('Loss/validation', val_loss, epoch)
    # ... etc.

print("Training finished.")
# Now you can analyze the lists: train_losses, val_losses, etc.
# For example, save them to a file or plot them.
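For instance, one simple way to persist these lists at the end of training is to dump them to a JSON file. The file name metrics.json and the dictionary keys below are illustrative choices, not part of the script above:

import json

# Collect the lists recorded above into one dictionary
history = {
    "train_loss": train_losses,
    "train_accuracy": train_accuracies,
    "val_loss": val_losses,
    "val_accuracy": val_accuracies,
}

# Write the training history to disk for later analysis
with open("metrics.json", "w") as f:
    json.dump(history, f, indent=2)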
This structure allows you to collect performance data over the entire training process. These stored lists (train_losses, val_losses, etc.) are exactly what you would feed into plotting libraries like Matplotlib or pass to a TensorBoard SummaryWriter (as discussed in the previous section) to create visualizations like the one below.
Training and validation loss curves plotted over 10 epochs. Observing these trends helps diagnose overfitting (validation loss increasing while training loss decreases) or underfitting (both losses remain high).
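A minimal Matplotlib sketch for producing such a plot from the stored lists might look like this; the figure size and labels are illustrative choices:

import matplotlib.pyplot as plt

# One x-axis point per completed epoch
epochs = range(1, len(train_losses) + 1)

plt.figure(figsize=(8, 5))
plt.plot(epochs, train_losses, label="Training loss")
plt.plot(epochs, val_losses, label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and validation loss")
plt.legend()
plt.show()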
By consistently logging metrics during both training and evaluation, you gain essential visibility into your model's behavior, enabling informed decisions about hyperparameter tuning, model adjustments, and debugging strategies.