Once your PyTorch model is trained, or even during training to monitor progress, you'll need to evaluate its performance on a dataset it hasn't seen before, such as a validation or test set. If you're coming from TensorFlow Keras, you're familiar with the convenient model.evaluate() method, which handles the evaluation process with a single call after you've compiled your model.
In PyTorch, similar to training, evaluating a model involves writing an explicit loop. This gives you full control over the process, allowing for custom metric calculations and detailed insights into your model's behavior. The structure of an evaluation loop is quite similar to a training loop, but with a few critical differences: you won't be calculating gradients or updating model weights.
Constructing an evaluation loop typically involves these steps:

1. Set the model to evaluation mode with model.eval().
2. Disable gradient tracking by wrapping the loop in torch.no_grad().
3. Iterate over the evaluation DataLoader, moving each batch to the correct device.
4. Run a forward pass to obtain predictions.
5. Compute the loss and any metrics, accumulating them across batches.
6. Aggregate the results and report them.

Let's look at the important components in more detail.
model.eval()
Before you begin evaluating, you must switch your model to evaluation mode by calling model.eval(). This is important because some layers, most notably torch.nn.Dropout and torch.nn.BatchNorm1d/torch.nn.BatchNorm2d/torch.nn.BatchNorm3d, have different behaviors during training and evaluation.

Calling model.eval() sets the mode for all modules in your model recursively. Conversely, when you switch back to training, you'll call model.train() to revert these layers to their training behavior.
# Assuming 'model' is your PyTorch nn.Module instance
model.eval()
print("Model is in evaluation mode.")
# ... perform evaluation ...
# If you need to go back to training later
# model.train()
# print("Model is back in training mode.")
Forgetting model.eval() can lead to inconsistent and misleading evaluation results because dropout would still be active, and batch normalization layers would use batch statistics instead of the learned population statistics.
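To see this effect concretely, here is a minimal sketch using a standalone torch.nn.Dropout layer (the layer and input values are illustrative, not part of the evaluation loop itself):

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

dropout.train()   # training mode: roughly half the values are zeroed, the rest are scaled by 1/(1-p)
print(dropout(x))

dropout.eval()    # evaluation mode: dropout is a no-op and the input passes through unchanged
print(dropout(x))

The same mode switch is what model.eval() applies to every dropout and batch normalization layer inside your model.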
torch.no_grad()
During evaluation, you are only interested in the model's output, not in updating its weights. Therefore, calculating gradients is unnecessary and computationally expensive. PyTorch provides a context manager, torch.no_grad(), that disables gradient calculation within its scope.
Using torch.no_grad() offers two main benefits: it reduces memory consumption, because the intermediate activations needed for backpropagation are not stored, and it speeds up computation, since PyTorch skips the bookkeeping required to build the computation graph.

Here's how you use it:
Here’s how you use it:
import torch

# Assume model, data_batch, and target_batch are defined and on the correct device
# model.eval() should have been called before this part of the loop

with torch.no_grad():
    # Forward pass
    predictions = model(data_batch)

    # Loss calculation (optional during evaluation, but often useful)
    # loss = criterion(predictions, target_batch)

    # Other metric calculations
    # accuracy = calculate_accuracy(predictions, target_batch)
Any tensor operation performed inside the with torch.no_grad(): block produces tensors with requires_grad=False, even if the inputs had requires_grad=True outside the block.
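A quick way to confirm this behavior is to compare the same operation inside and outside the context manager, as in this small sketch:

import torch

w = torch.randn(3, 3, requires_grad=True)
x = torch.randn(3, 3)

y = w @ x                  # outside no_grad: the result is tracked by autograd
print(y.requires_grad)     # True

with torch.no_grad():
    z = w @ x              # inside no_grad: no computation graph is built
    print(z.requires_grad) # False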
Now, let's combine these elements into a typical evaluation loop. This function takes your model, the DataLoader for your evaluation set, the loss function (criterion), and the target device as inputs.
import torch
import torch.nn as nn

# Example: Define a simple model, criterion, and a dummy dataloader for illustration
# In a real scenario, these would be your actual trained model and data
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 2)  # 10 input features, 2 output classes

    def forward(self, x):
        return self.fc(x)

# Dummy data for illustration
dummy_eval_data = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(5)]  # 5 batches of 32 samples
eval_loader = torch.utils.data.DataLoader(dummy_eval_data, batch_size=None)  # batch_size=None because data is already batched

model = SimpleModel()  # Assume this model has been trained
criterion = nn.CrossEntropyLoss()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def evaluate_model(model, dataloader, criterion, device):
    model.eval()  # Set the model to evaluation mode
    total_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():  # Disable gradient calculations
        for inputs, labels in dataloader:
            inputs = inputs.to(device)
            labels = labels.to(device)

            # Forward pass
            outputs = model(inputs)

            # Calculate loss
            loss = criterion(outputs, labels)
            total_loss += loss.item() * inputs.size(0)  # Accumulate loss, weighted by batch size

            # Calculate accuracy
            _, predicted_labels = torch.max(outputs, 1)
            correct_predictions += (predicted_labels == labels).sum().item()
            total_samples += labels.size(0)

    avg_loss = total_loss / total_samples
    accuracy = correct_predictions / total_samples

    print(f'Evaluation: Average Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f} ({correct_predictions}/{total_samples})')
    return avg_loss, accuracy

# Perform evaluation
eval_loss, eval_accuracy = evaluate_model(model, eval_loader, criterion, device)
In this evaluate_model function:

- model.eval() is called first.
- torch.no_grad() wraps the entire loop over the data, ensuring no gradients are computed.
- Each batch of inputs and labels is moved to the correct device.
- The forward pass outputs = model(inputs) is performed.
- The loss is calculated using the criterion. We multiply loss.item() by inputs.size(0) (the batch size) before summing because loss functions typically return the mean loss over the batch; summing these per-batch totals and then dividing by the total number of samples gives the average loss over the whole dataset.
- torch.max(outputs, 1) finds the class with the highest score for each sample, and the counts of correct_predictions and total_samples are accumulated.
- Finally, the avg_loss and overall accuracy are computed and reported.

This structure is highly adaptable. You can easily add other metrics from libraries like torchmetrics or implement custom ones within the loop.
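For example, here is a sketch of how the manual accuracy bookkeeping could be replaced with torchmetrics (this assumes the torchmetrics package is installed; the task and num_classes arguments match the two-class dummy model above):

import torch
import torchmetrics

# Metric object that accumulates results across batches
accuracy_metric = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(device)

model.eval()
with torch.no_grad():
    for inputs, labels in eval_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        accuracy_metric.update(outputs, labels)  # accepts logits; the argmax is handled internally

final_accuracy = accuracy_metric.compute().item()  # aggregate over all batches
print(f"torchmetrics accuracy: {final_accuracy:.4f}")
accuracy_metric.reset()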
tf.keras Model.evaluate()
If you're used to TensorFlow, the Keras model.evaluate(eval_dataset) method performs all these steps implicitly. It takes your evaluation dataset, iterates through it, computes the configured loss and metrics, and returns the results.
The PyTorch approach, while requiring more explicit code, offers several advantages:

- Transparency: you see exactly what happens to every batch at each step of the evaluation.
- Flexibility: custom metrics and additional diagnostics can be computed directly inside the loop.
- Easier debugging: intermediate outputs can be inspected or logged at any point.
While Keras provides convenience, the PyTorch way gives you finer-grained control, which can be particularly useful for research or when dealing with complex evaluation scenarios.
After evaluating your model, you'll use these metrics (like average loss and accuracy) to compare different models, perform hyperparameter tuning, or decide if your model is ready for deployment. If you're evaluating during training (e.g., on a validation set after each epoch), these metrics can also inform early stopping decisions, helping you prevent overfitting and save training time.
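For instance, a minimal early-stopping sketch built around the evaluate_model function might look like this (train_one_epoch, train_loader, optimizer, num_epochs, and the patience value are hypothetical placeholders for your own training setup):

best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 3  # hypothetical: stop after 3 epochs with no improvement on the validation set

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, criterion, optimizer, device)  # hypothetical training helper
    val_loss, val_accuracy = evaluate_model(model, eval_loader, criterion, device)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch + 1}")
            break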