After defining the architecture of your neural network, specifying the layers, their sizes, and activation functions, the next step is to prepare the model for the training process. This configuration stage involves selecting two critical components: a loss function and an optimizer. Think of this as giving your model instructions on what goal it should aim for (minimizing the loss) and how it should adjust itself to reach that goal (using the optimizer). Frameworks like PyTorch provide convenient ways to specify these choices.
As you learned in Chapter 3, the loss function (also known as the cost function or criterion) measures how far the model's predictions are from the actual target values during training. The goal of training is to minimize this value. The choice of loss function depends heavily on the type of problem you are solving: regression or classification.
Regression Tasks: When predicting continuous values (like house prices or temperature), common choices include:
Mean Squared Error (torch.nn.MSELoss): Computes the average of the squared differences between the predicted and target values.
import torch.nn as nn
# Instantiate MSE Loss
loss_fn_mse = nn.MSELoss()
MSE is sensitive to outliers due to the squaring operation. If your dataset contains significant outliers, you might consider MAE instead; a short comparison follows after this list.
Mean Absolute Error (torch.nn.L1Loss): Computes the average of the absolute differences between predictions and targets (L1 distance is equivalent to MAE).
# Instantiate MAE Loss (L1 Loss)
loss_fn_mae = nn.L1Loss()
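To make the outlier point above concrete, here is a small sketch (the prediction and target values are made up for illustration) that evaluates the same predictions with both losses; the single outlier target inflates the MSE far more than the MAE.
import torch
import torch.nn as nn
# Made-up predictions and targets; the last target (30.0) is an outlier
preds = torch.tensor([2.5, 0.0, 2.0, 8.0])
targets = torch.tensor([3.0, -0.5, 2.0, 30.0])
mse = nn.MSELoss()(preds, targets)  # squaring lets the outlier dominate
mae = nn.L1Loss()(preds, targets)   # absolute error grows only linearly
print(f"MSE: {mse.item():.2f}")  # much larger, driven mostly by the outlier
print(f"MAE: {mae.item():.2f}")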
Binary Classification Tasks: When classifying inputs into one of two categories (e.g., spam/not spam, cat/dog), Binary Cross-Entropy is the standard choice.
Binary Cross-Entropy (torch.nn.BCELoss): Measures the difference between two probability distributions (the predicted probabilities and the true binary labels). It expects the model's output to be probabilities, usually obtained by applying a Sigmoid activation function to the final layer.
# Instantiate BCELoss (requires final Sigmoid activation on model output)
loss_fn_bce = nn.BCELoss()
torch.nn.BCEWithLogitsLoss: This is often preferred over using a Sigmoid layer followed by BCELoss. It combines the Sigmoid activation and the BCE calculation in a single, numerically more stable function. It expects the raw output scores (logits) from the final layer.
# Instantiate BCEWithLogitsLoss (numerically stable, takes raw logits)
loss_fn_bce_logits = nn.BCEWithLogitsLoss()
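To illustrate the relationship between the two, the quick check below (using made-up logits and labels) computes the loss both ways: applying Sigmoid manually and then BCELoss, versus handing the raw logits to BCEWithLogitsLoss. The values agree, but the logits version is the numerically safer choice.
import torch
import torch.nn as nn
# Made-up raw scores (logits) and binary targets
logits = torch.tensor([0.8, -1.2, 2.5])
labels = torch.tensor([1.0, 0.0, 1.0])
# Option 1: Sigmoid first, then BCELoss on probabilities
probs = torch.sigmoid(logits)
loss_a = nn.BCELoss()(probs, labels)
# Option 2: BCEWithLogitsLoss operates on the raw logits directly
loss_b = nn.BCEWithLogitsLoss()(logits, labels)
print(loss_a.item(), loss_b.item())  # the two values match (up to floating-point precision)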
Multi-class Classification Tasks: When classifying inputs into one of three or more categories (e.g., MNIST digit classification, object recognition among multiple classes), Categorical Cross-Entropy is used.
Categorical Cross-Entropy (torch.nn.CrossEntropyLoss): In PyTorch, nn.CrossEntropyLoss conveniently combines a LogSoftmax activation (which converts raw scores into log-probabilities) and Negative Log Likelihood Loss (NLLLoss). It expects the raw, unnormalized scores (logits) directly from the final layer of your network and the target labels as class indices (integers).
# Instantiate CrossEntropyLoss (combines LogSoftmax and NLLLoss)
loss_fn_ce = nn.CrossEntropyLoss()
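As a short usage sketch (the logits and labels below are made up), note that the model output passed to nn.CrossEntropyLoss has shape (batch_size, num_classes) and contains raw scores, while the targets are plain integer class indices:
import torch
import torch.nn as nn
# Made-up raw logits for a batch of 2 examples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
# Targets are class indices (integers), one per example
targets = torch.tensor([0, 2])
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())  # no Softmax is applied to the model output beforehand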
Choosing the correct loss function is fundamental. Using a classification loss for a regression problem, or vice versa, will prevent your model from learning effectively because the error signal being minimized won't match the task's objective.
Once you have a way to measure error (the loss function), you need a mechanism to update the model's weights and biases to reduce that error. This is the role of the optimizer. As discussed in Chapters 3 and 4, optimizers implement variations of the gradient descent algorithm, using the gradients calculated during backpropagation to iteratively adjust parameters.
Stochastic Gradient Descent (torch.optim.SGD): Key parameters include lr (learning rate) and momentum. Momentum helps accelerate SGD in relevant directions and dampens oscillations.
import torch.optim as optim
# Assume 'model' is your defined neural network
# Optimizer: SGD with momentum
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
Adam (torch.optim.Adam): Key parameters include lr, betas (decay rates for the moment estimates), and eps (for numerical stability). Adam often works well with default settings, but tuning might still be needed.
# Optimizer: Adam
# Commonly used betas=(0.9, 0.999), eps=1e-8
optimizer_adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
RMSprop (torch.optim.RMSprop): Key parameters include lr, alpha (smoothing constant, similar to a beta in Adam), and eps.
# Optimizer: RMSprop
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)
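Whichever optimizer you choose, using it during training follows the same pattern: clear the old gradients, backpropagate the current loss, and apply the update. The minimal sketch below shows a single update step; it assumes a model, a loss function loss_fn, a batch of inputs and targets, and one of the optimizers above (referred to generically as optimizer) are already defined. The full training loop is covered in the next section.
# One parameter update (assumes 'model', 'loss_fn', 'inputs', 'targets', 'optimizer' exist)
optimizer.zero_grad()             # clear gradients from the previous step
outputs = model(inputs)           # forward pass
loss = loss_fn(outputs, targets)  # measure the error
loss.backward()                   # backpropagation: compute gradients
optimizer.step()                  # update parameters using the gradients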
The learning rate (lr) is arguably the most important hyperparameter for any optimizer. It controls the step size during parameter updates. Setting it too high can cause the optimization process to diverge, while setting it too low can make training prohibitively slow or get stuck in suboptimal minima. Finding a good learning rate often involves experimentation, which we'll touch upon more in Chapter 6 regarding hyperparameter tuning.
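As a toy illustration of this sensitivity (values chosen purely for demonstration), the sketch below minimizes the simple function x² with plain SGD. A learning rate of 0.1 steadily moves x toward the minimum at 0, while a learning rate of 1.5 overshoots and makes the iterates diverge.
import torch
# Minimize f(x) = x^2 starting from x = 2.0 with two different learning rates
for lr in (0.1, 1.5):
    x = torch.tensor([2.0], requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(5):
        opt.zero_grad()
        loss = (x ** 2).sum()
        loss.backward()
        opt.step()
    print(f"lr={lr}: x after 5 steps = {x.item():.3f}")
# lr=0.1 shrinks x toward 0; lr=1.5 overshoots and |x| grows at every step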
While the loss function guides the training process by providing the gradient signal, it might not always be the most intuitive measure of performance from a human perspective. For instance, cross-entropy values aren't easily interpretable, whereas classification accuracy (the percentage of correctly classified examples) is straightforward.
Therefore, alongside the loss function, you typically track one or more evaluation metrics during training and evaluation. These metrics provide a clearer picture of how well the model is performing on the actual task. Common examples include accuracy, precision, recall, and F1-score for classification, and mean absolute error for regression.
In frameworks like Keras, metrics are often specified directly during the compile step along with the loss and optimizer. In PyTorch, calculating metrics is typically done manually within the training and validation loops by passing model predictions and true labels to appropriate metric functions (often available in libraries like torchmetrics or scikit-learn). This separation reinforces that the loss function is for optimization, while metrics are for evaluation and monitoring.
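As an example of the manual approach, the sketch below computes classification accuracy directly from a batch of logits and integer labels (the tensors are made up for illustration); libraries such as torchmetrics wrap this kind of calculation in reusable metric objects.
import torch
# Made-up model outputs (logits) for 4 examples and 3 classes, plus true labels
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.2, 1.8, 0.4],
                       [0.5, 0.3, 0.1],
                       [-0.3, 0.2, 1.1]])
labels = torch.tensor([0, 1, 2, 2])
preds = logits.argmax(dim=1)                        # predicted class per example
accuracy = (preds == labels).float().mean().item()  # fraction of correct predictions
print(f"Accuracy: {accuracy:.2f}")                  # 3 of 4 correct -> 0.75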
Here's how you might combine these elements in PyTorch after defining your model architecture (MySimpleNet):
import torch
import torch.nn as nn
import torch.optim as optim
# Assume MySimpleNet is a defined nn.Module class for multi-class classification
model = MySimpleNet(input_size=784, hidden_size=128, output_size=10)
# 1. Select the Loss Function (Criterion)
# For multi-class classification with logits output
criterion = nn.CrossEntropyLoss()
# 2. Choose the Optimizer
# Using Adam with a learning rate of 0.001
learning_rate = 0.001
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# --- Ready for Training ---
# The 'model', 'criterion', and 'optimizer' are now configured.
# The next step (covered in the following section) involves feeding data
# through the model, calculating loss, performing backpropagation,
# and updating weights using the optimizer within a training loop.
# Evaluation metrics like accuracy would be calculated within that loop as well.
print("Model configured successfully!")
print(f"Loss Function: {criterion}")
print(f"Optimizer: {optimizer}")
This configuration step essentially sets the rules for learning. By choosing an appropriate loss function and a suitable optimizer, you provide the necessary components for the framework to effectively train your neural network based on the data you provide. The selections made here have a direct and significant impact on the training dynamics and the final performance of your model.