Okay, you've defined your network's architecture using nn.Module, stacking layers and incorporating activation functions. But how does the network actually learn? It needs a way to measure how far its predictions are from the actual target values. This measurement is the job of the loss function, also known as the criterion or objective function.
The torch.nn package provides a collection of standard loss functions commonly used in deep learning. The core idea is simple: a loss function takes the model's output (predictions) and the ground truth (targets) as input, and computes a single scalar value representing the "error" or "loss". This scalar loss value is then used by PyTorch's Autograd system during backpropagation to calculate gradients, which in turn guide the optimizer (like SGD or Adam from torch.optim) on how to adjust the model's parameters (weights and biases) to minimize this loss.
Choosing the right loss function is important because it directly defines the objective your model is trying to achieve. Let's look at some of the most frequently used loss functions available in torch.nn.
The loss function compares model predictions with target data to produce a scalar loss value, which guides parameter updates via backpropagation.
PyTorch implements loss functions as classes that inherit from nn.Module. You first instantiate the loss function class and then call the instance with the model's predictions and the target values.
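As a quick, minimal sketch of this instantiate-then-call pattern (using nn.MSELoss purely as an example; the variable names are illustrative, and the equivalent functional form from torch.nn.functional is shown as a side note):

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.MSELoss()            # 1. Instantiate the loss class
output = torch.randn(2, 1)          # Stand-in for model predictions
target = torch.randn(2, 1)          # Stand-in for ground-truth values
loss = criterion(output, target)    # 2. Call the instance like a function
print(loss)                         # A single scalar tensor

# Side note: the same computation is also available in functional form
loss_functional = F.mse_loss(output, target)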
Regression loss functions are typically used when the goal is to predict a continuous value.
Mean Squared Error (MSELoss): Perhaps the most common loss function for regression tasks. It measures the average of the squares of the differences between the predicted values and the actual values.
The formula is:
$$\mathrm{Loss}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

where $N$ is the number of samples in the batch, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value. Squaring the difference penalizes larger errors more heavily.
Use torch.nn.MSELoss:
import torch
import torch.nn as nn
# Instantiate the loss function
loss_fn = nn.MSELoss()
# Example predictions and targets (batch size 3, 1 output feature)
predictions = torch.randn(3, 1, requires_grad=True)
targets = torch.randn(3, 1)
# Calculate the loss
loss = loss_fn(predictions, targets)
print(f"MSE Loss: {loss.item()}")
# Gradients can now be computed via loss.backward()
# loss.backward()
# print(predictions.grad)
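To confirm that nn.MSELoss implements the formula above, a small check can be appended to the previous snippet (it reuses predictions, targets, and loss from that code):

# Continuing from the snippet above: compute the same quantity by hand
manual_mse = torch.mean((predictions - targets) ** 2)
print(f"Manual MSE: {manual_mse.item()}")  # matches loss.item() up to floating-point precision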
Mean Absolute Error (L1Loss): Another popular regression loss. It measures the average of the absolute differences between predicted and actual values.
The formula is:
$$\mathrm{Loss}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

Compared to MSE, L1 loss is generally considered less sensitive to outliers because it doesn't square the errors.
Use torch.nn.L1Loss:
import torch
import torch.nn as nn
loss_fn_l1 = nn.L1Loss()
predictions = torch.tensor([[1.0], [2.5], [0.0]], requires_grad=True)
targets = torch.tensor([[1.2], [2.2], [0.5]])
loss_l1 = loss_fn_l1(predictions, targets)
print(f"L1 Loss: {loss_l1.item()}") # Average of |1-1.2|, |2.5-2.2|, |0-0.5|
# (0.2 + 0.3 + 0.5) / 3 = 1.0 / 3 = 0.333...
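To make the outlier point concrete, here is a small comparison with made-up values: a single large error dominates the squared loss far more than the absolute one.

import torch
import torch.nn as nn

# Illustrative values: the last prediction is far off (an "outlier" error of 8.0)
preds = torch.tensor([[1.0], [2.0], [3.0], [10.0]])
targs = torch.tensor([[1.1], [1.9], [3.2], [2.0]])

print(f"MSE loss: {nn.MSELoss()(preds, targs).item():.3f}")  # ~16.015, dominated by 8.0**2
print(f"L1 loss:  {nn.L1Loss()(preds, targs).item():.3f}")   # ~2.100, the outlier only adds 8.0/4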
Classification loss functions are used when the goal is to predict a discrete class label.
Cross-Entropy Loss (CrossEntropyLoss): This is the standard loss function for multi-class classification problems. It's particularly effective when your model outputs raw scores (logits) for each class.
torch.nn.CrossEntropyLoss conveniently combines two steps into one:

1. It applies the LogSoftmax function to the model's raw output scores (logits). Softmax converts logits into probabilities that sum to 1, and LogSoftmax takes the logarithm of these probabilities.
2. It computes the negative log-likelihood loss (NLLLoss) between the LogSoftmax outputs and the target class indices.

It expects:

- Input (predictions): raw logits of shape (N, C), where N is the batch size and C is the number of classes.
- Target: class indices of shape (N).
.import torch
import torch.nn as nn
loss_fn_ce = nn.CrossEntropyLoss()
# Example: Batch of 3 samples, 5 classes
# Raw scores (logits) from the model
predictions_logits = torch.randn(3, 5, requires_grad=True)
# True class indices (must be LongTensor)
targets_classes = torch.tensor([1, 0, 4]) # Class indices for the 3 samples
loss_ce = loss_fn_ce(predictions_logits, targets_classes)
print(f"Cross-Entropy Loss: {loss_ce.item()}")
# loss_ce.backward()
# print(predictions_logits.grad)
Using nn.CrossEntropyLoss is generally recommended over manually applying LogSoftmax and then NLLLoss due to better numerical stability.
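If you want to verify this equivalence, a short sketch reusing predictions_logits and targets_classes from the snippet above computes the same value the two-step way:

# Equivalent (but less numerically stable) two-step version, reusing the tensors above
log_probs = nn.LogSoftmax(dim=1)(predictions_logits)
loss_manual = nn.NLLLoss()(log_probs, targets_classes)
print(f"LogSoftmax + NLLLoss: {loss_manual.item()}")  # matches loss_ce above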
Binary Cross-Entropy Loss (BCELoss and BCEWithLogitsLoss): Used for binary (two-class) classification problems or multi-label classification (where each sample can belong to multiple classes).
torch.nn.BCELoss: Calculates the binary cross-entropy between the target and the output. It expects the model's output to already be probabilities (e.g., after applying a Sigmoid activation function), typically in the range [0, 1]. Inputs and targets have shape (N, *).

torch.nn.BCEWithLogitsLoss: This version is numerically more stable and convenient than using a Sigmoid layer followed by BCELoss. It combines the Sigmoid activation and the BCE calculation in one step. It expects raw logits as input. Inputs and targets have shape (N, *).

For most binary classification tasks, BCEWithLogitsLoss is preferred:
import torch
import torch.nn as nn
loss_fn_bce_logits = nn.BCEWithLogitsLoss()
# Example: Batch of 4 samples, 1 output node (binary classification)
predictions_logits_bin = torch.randn(4, 1, requires_grad=True) # Raw logits
# Targets should be floats (0.0 or 1.0)
targets_bin = torch.tensor([[1.0], [0.0], [0.0], [1.0]])
loss_bce = loss_fn_bce_logits(predictions_logits_bin, targets_bin)
print(f"BCE With Logits Loss: {loss_bce.item()}")
The choice depends heavily on your specific task:

- Regression (continuous targets): nn.MSELoss is the usual starting point. If you suspect outliers are heavily influencing training, consider nn.L1Loss.
- Binary classification: nn.BCEWithLogitsLoss. Ensure your model has one output node producing logits.
- Multi-class classification (one label per sample): nn.CrossEntropyLoss. Ensure your model has C output nodes producing logits, where C is the number of classes.
- Multi-label classification (multiple labels per sample): nn.BCEWithLogitsLoss. Ensure your model has C output nodes producing logits, and your targets are multi-hot encoded (e.g., [1.0, 0.0, 1.0, 0.0] if classes 0 and 2 are present); see the sketch after this list.
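Since the multi-label case is the only one not shown in code above, here is a minimal sketch; the class count, tensor values, and variable names are made up for illustration:

import torch
import torch.nn as nn

num_classes = 4
loss_fn_multilabel = nn.BCEWithLogitsLoss()

# Batch of 2 samples, one logit per class
logits = torch.randn(2, num_classes, requires_grad=True)

# Multi-hot targets: sample 0 belongs to classes 0 and 2, sample 1 to class 3
targets_multihot = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                                 [0.0, 0.0, 0.0, 1.0]])

loss_ml = loss_fn_multilabel(logits, targets_multihot)
print(f"Multi-label BCE With Logits Loss: {loss_ml.item()}")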
In a typical training loop, you'll instantiate the chosen loss function once outside the loop. Inside the loop, after getting the model's predictions for a batch of data, you pass the predictions and the corresponding target labels to the loss function instance to compute the loss for that batch.
# Assume model, optimizer, dataloader are already defined
# --- Outside the training loop ---
# Example: Multi-class classification
num_classes = 10
model = nn.Linear(784, num_classes) # Simple linear model example
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
# Dummy data loader (replace with actual DataLoader)
dummy_dataloader = [(torch.randn(64, 784), torch.randint(0, num_classes, (64,))) for _ in range(5)]
# --- Inside the training loop ---
model.train() # Set model to training mode
for batch_idx, (data, target) in enumerate(dummy_dataloader):
    # 1. Zero gradients from the previous iteration
    optimizer.zero_grad()
    # 2. Forward pass: Get predictions (logits)
    predictions = model(data)
    # 3. Calculate loss
    loss = loss_fn(predictions, target)
    # 4. Backward pass: Compute gradients
    loss.backward()
    # 5. Optimizer step: Update weights
    optimizer.step()
    if batch_idx % 2 == 0:  # Print loss periodically
        print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")
By choosing an appropriate loss function from torch.nn and correctly integrating it into your training process, you provide your model with a clear objective to learn towards, forming a fundamental part of the model training mechanism alongside the optimizer and backpropagation.