After constructing your neural network's architecture with layers and activation functions, the next step is to define how the model learns. This learning process is guided by two components: loss functions, which measure how far your model's predictions are from the actual target values, and optimizers, which adjust the model's parameters (weights and biases) to minimize this error. This section covers how to define and use both within the Flux.jl ecosystem.
A loss function, also known as a cost function or objective function, quantifies the difference between the predicted output of your model ($\hat{y}$) and the true target value ($y$). The goal of training a neural network is to find a set of parameters that minimizes this loss. A smaller loss value indicates that the model's predictions are closer to the actual values.
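The choice of loss function depends heavily on the type of problem you're solving: regression tasks typically use Mean Squared Error, while classification tasks use a form of Cross-Entropy.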
Flux.jl provides a range of pre-defined loss functions in its Flux.Losses module, making it straightforward to incorporate them into your training routine.
Let's look at some of the most frequently used loss functions and how to use them in Flux.
Mean Squared Error (MSE) measures the average of the squares of the errors. It's particularly sensitive to outliers due to the squaring term. The formula is:
$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
where $N$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
In Flux.jl, you can use Flux.mse:
using Flux
# Example:
y_hat = [0.5, 1.8, 3.3] # Model predictions
y_true = [0.6, 2.0, 3.0] # True values
loss = Flux.mse(y_hat, y_true)
println("MSE Loss: ", loss)
# Output: MSE Loss: approx. 0.0467 (squared errors 0.01, 0.04, and 0.09 sum to 0.14; 0.14 / 3 ≈ 0.0467)
MSE is a natural fit for regression problems, where your model's final layer outputs continuous values directly, often without an activation function or with a linear activation.
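To make the outlier sensitivity mentioned above concrete, here is a minimal sketch in plain Julia. The helper names `mse_by_hand` and `mae_by_hand` are hypothetical and written out only to mirror the formulas; they are not Flux's implementations, and the data is made up for illustration.

```julia
# Hand-rolled losses that mirror the formulas (illustrative helpers, not Flux's code).
mse_by_hand(y_hat, y) = sum(abs2, y_hat .- y) / length(y)
mae_by_hand(y_hat, y) = sum(abs, y_hat .- y) / length(y)

y_true    = [1.0, 2.0, 3.0, 4.0]
y_close   = [1.1, 1.9, 3.1, 3.9]    # every prediction is off by 0.1
y_outlier = [1.1, 1.9, 3.1, 14.0]   # the last prediction is off by 10.0

println("MSE, small errors: ", mse_by_hand(y_close, y_true))
println("MSE, one outlier:  ", mse_by_hand(y_outlier, y_true))
println("MAE, small errors: ", mae_by_hand(y_close, y_true))
println("MAE, one outlier:  ", mae_by_hand(y_outlier, y_true))
```

With this toy data, a single bad prediction raises the MSE from about 0.01 to about 25, while the mean absolute error only rises from about 0.1 to about 2.6. That gap is the effect of the squaring term.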
Cross-Entropy loss is the standard for classification problems. It measures the performance of a classification model whose output is a probability value between 0 and 1.
Binary Cross-Entropy
For binary classification (two classes, e.g., 0 or 1), the Binary Cross-Entropy (BCE) loss is used. If $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted probability for class 1, the BCE loss for a single sample is:
$$L_{BCE} = -\left[y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right]$$
Flux provides Flux.binarycrossentropy for inputs that are probabilities (typically after a sigmoid activation) and Flux.logitbinarycrossentropy for inputs that are logits (raw scores before the sigmoid activation). Using the logit version is often more numerically stable.
using Flux
# Example with logitbinarycrossentropy (expects raw scores/logits)
logits = [0.5, -1.0, 2.0] # Raw model outputs (before sigmoid)
y_true_binary = [1.0, 0.0, 1.0] # True labels (0 or 1)
# Note: Flux.logitbinarycrossentropy expects y_true to be 0 or 1.
loss_bce_logits = Flux.logitbinarycrossentropy(logits, y_true_binary)
println("Logit Binary Cross-Entropy Loss: ", loss_bce_logits)
# Example with binarycrossentropy (expects probabilities)
probabilities = Flux.sigmoid.([0.5, -1.0, 2.0]) # Outputs after sigmoid
loss_bce_probs = Flux.binarycrossentropy(probabilities, y_true_binary)
println("Binary Cross-Entropy Loss (with probabilities): ", loss_bce_probs)
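The stability claim above can be illustrated with a small plain-Julia sketch. Both functions below are hypothetical, hand-written helpers and not Flux's actual implementations (Flux's probability-based loss, for instance, guards against log(0) internally). The sketch only shows why computing the loss directly from the logit avoids the problem.

```julia
sigmoid(x) = 1 / (1 + exp(-x))

# Naive route: squash the logit to a probability, then apply the BCE formula.
function naive_bce(logit, y)
    p = sigmoid(logit)
    return -(y * log(p) + (1 - y) * log(1 - p))
end

# Stable route: algebraically equivalent, but never forms log(0).
function stable_bce(logit, y)
    return max(logit, 0) - logit * y + log1p(exp(-abs(logit)))
end

# A confident but wrong prediction: large positive logit, true label 0.
logit, y = 40.0, 0.0
println("naive:  ", naive_bce(logit, y))   # sigmoid rounds to exactly 1.0, so log(1 - p) = -Inf
println("stable: ", stable_bce(logit, y))  # stays finite, roughly equal to the logit
```

For moderate logits the two functions agree. They only diverge once the sigmoid saturates in floating point.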
Categorical Cross-Entropy
For multi-class classification (more than two classes), Categorical Cross-Entropy is used. If $C$ is the number of classes, $y_{ij}$ is 1 if sample $i$ belongs to class $j$ (and 0 otherwise, often one-hot encoded), and $\hat{y}_{ij}$ is the predicted probability of sample $i$ belonging to class $j$, the formula for $N$ samples is:
$$L_{CCE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log(\hat{y}_{ij})$$
Flux provides Flux.crossentropy for inputs that are probabilities (typically after a softmax activation) and Flux.logitcrossentropy for inputs that are logits (raw scores before softmax). Again, the logit version is generally preferred for numerical stability.
using Flux
using Flux: onehotbatch, onecold # For one-hot encoding
# Example with logitcrossentropy (expects raw scores/logits for multiple classes)
# Suppose 3 classes, 2 samples
logits_multiclass = Float32[
    0.1 0.8; # Class 1 scores for sample 1, sample 2
    0.5 0.1; # Class 2 scores for sample 1, sample 2
    0.4 0.1  # Class 3 scores for sample 1, sample 2
] # 3×2 matrix: rows are classes, columns are samples
# True labels (e.g., sample 1 is class 2, sample 2 is class 1)
y_true_multiclass_indices = [2, 1]
# Convert to one-hot encoding
y_true_onehot = Flux.onehotbatch(y_true_multiclass_indices, 1:3)
loss_cce_logits = Flux.logitcrossentropy(logits_multiclass, y_true_onehot)
println("Logit Categorical Cross-Entropy Loss: ", loss_cce_logits)
# Example with crossentropy (expects probabilities after softmax)
probabilities_multiclass = Flux.softmax(logits_multiclass, dims=1)
loss_cce_probs = Flux.crossentropy(probabilities_multiclass, y_true_onehot)
println("Categorical Cross-Entropy Loss (with probabilities): ", loss_cce_probs)
Always ensure your target data y_true is in the format expected by the chosen loss function (e.g., raw labels, one-hot encoded vectors).
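As a cross-check on the Categorical Cross-Entropy formula, the following plain-Julia sketch computes the loss by hand for the same toy logits and labels used above. The function `cce_from_logits` is a hypothetical helper written for this illustration. It takes integer class indices instead of one-hot vectors, which is equivalent because the one-hot vector simply selects the true class's term in the inner sum.

```julia
# Toy logits as a 3×2 matrix: rows are classes, columns are samples.
logits = [0.1 0.8;
          0.5 0.1;
          0.4 0.1]
labels = [2, 1]   # true class index for each sample (column)

function cce_from_logits(logits, labels)
    n = length(labels)
    total = 0.0
    for i in 1:n
        col = logits[:, i]
        # log-sum-exp, with the maximum subtracted for numerical stability
        m = maximum(col)
        logsumexp = m + log(sum(exp.(col .- m)))
        # -log(softmax probability of the true class)
        total += logsumexp - col[labels[i]]
    end
    return total / n   # average over the N samples
end

println("CCE by hand: ", cce_from_logits(logits, labels))
```

For these inputs the result is about 0.818, which should agree with the value printed by the Flux.logitcrossentropy example up to Float32 rounding.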
Once the loss function calculates the error, an optimizer's job is to update the model's parameters (weights W and biases b) in a way that reduces this error. Optimizers use the gradients of the loss function with respect to the parameters. These gradients, which you'll learn are computed by Zygote.jl, indicate the direction of the steepest ascent of the loss function. The optimizer takes a step in the opposite direction (steepest descent) to minimize the loss.
The general update rule for a parameter $\theta$ (which could be a weight or a bias) using gradient descent is:
$$\theta_{new} = \theta_{old} - \eta \nabla L$$
where $\eta$ (eta) is the learning rate, a hyperparameter that controls the step size, and $\nabla L$ is the gradient of the loss $L$ with respect to the parameter $\theta$.
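The update rule can be sketched on a one-parameter toy problem. The loss, its hand-written derivative, and the function name `gradient_descent` are all assumptions made for this illustration. In practice the gradient comes from automatic differentiation, not a hand-derived formula.

```julia
# Toy loss with its minimum at θ = 3, and its derivative written by hand.
loss(θ) = (θ - 3)^2
grad(θ) = 2 * (θ - 3)

function gradient_descent(θ, η, steps)
    for _ in 1:steps
        θ = θ - η * grad(θ)   # θ_new = θ_old - η * ∇L
    end
    return θ
end

θ_final = gradient_descent(0.0, 0.1, 50)   # start at θ = 0, learning rate 0.1
println("θ after 50 steps: ", θ_final)
println("loss at that θ:   ", loss(θ_final))
```

Each step moves θ against the gradient, so it approaches the minimizer at 3 and the loss shrinks toward zero.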
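Flux.jl provides several optimizers. Older releases define them in the Flux.Optimise module, while recent releases re-export them from the Optimisers.jl package.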
The diagram shows the relationship between model outputs, true values, the loss function, and the optimizer in generating parameter updates. Gradients derived from the loss signal are used by the optimizer to determine how to adjust the model's weights.
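Stochastic Gradient Descent (SGD) is the foundational optimization algorithm. In its simplest form, it updates parameters using a fixed learning rate. Flux names this optimizer Descent.
using Flux
# Flux's plain gradient descent optimizer is called Descent
opt = Flux.Descent(0.01) # Learning rate of 0.01
# For a model 'model', the optimizer state is then created with:
# opt_state = Flux.setup(opt, model)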
While simple, SGD can be slow to converge and sensitive to the choice of learning rate. Variations like SGD with momentum (Flux.Momentum) are often more effective.
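To give a rough sense of why momentum can help, the sketch below compares plain gradient descent with a textbook momentum update on a one-parameter toy loss. The function names, the loss, and the hyperparameter values (learning rate 0.01, momentum coefficient 0.9) are assumptions for illustration, and the update shown is the standard formulation, not necessarily Flux's internals line for line.

```julia
grad(θ) = 2 * (θ - 3)   # gradient of the toy loss (θ - 3)^2

function plain_descent(θ, η, steps)
    for _ in 1:steps
        θ -= η * grad(θ)
    end
    return θ
end

function momentum_descent(θ, η, ρ, steps)
    v = 0.0                        # velocity: decaying sum of past gradient steps
    for _ in 1:steps
        v = ρ * v + η * grad(θ)    # blend the previous velocity with the new step
        θ -= v
    end
    return θ
end

θ_plain    = plain_descent(0.0, 0.01, 200)
θ_momentum = momentum_descent(0.0, 0.01, 0.9, 200)
println("distance from minimum, plain:    ", abs(θ_plain - 3))
println("distance from minimum, momentum: ", abs(θ_momentum - 3))
```

With the same small learning rate and the same number of steps, the momentum variant ends up noticeably closer to the minimum on this toy problem.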
Adam is a popular and often effective optimizer that adapts the learning rate for each parameter individually. It combines ideas from Momentum and RMSProp. It often works well with default hyperparameter settings, making it a good starting point.
using Flux
# Recent Flux releases spell the optimizer Adam (older releases used ADAM).
# opt_state = Flux.setup(Flux.Adam(0.001), model) # 0.001 is a common default learning rate for Adam
opt = Flux.Adam(0.001) # Learning rate of 0.001
# Adam has other parameters like β1, β2, ϵ with default values.
# opt = Flux.Adam(0.001, (0.9, 0.999), 1e-8)
The arguments to Adam are (η, β::Tuple, ϵ). η is the learning rate, β is a tuple (β1, β2) representing the decay rates for the moment estimates, and ϵ is a small constant for numerical stability.
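The roles of η, β1, β2, and ϵ are easier to see in a single update written out by hand. The function `adam_step` below is a hypothetical helper following the textbook Adam update for one scalar parameter. It is a sketch for illustration and is not taken from Flux's source.

```julia
function adam_step(θ, g, m, v, t; η = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    m = β1 * m + (1 - β1) * g      # decaying average of gradients (first moment)
    v = β2 * v + (1 - β2) * g^2    # decaying average of squared gradients (second moment)
    m_hat = m / (1 - β1^t)         # bias correction for the zero initialization
    v_hat = v / (1 - β2^t)
    θ = θ - η * m_hat / (sqrt(v_hat) + ϵ)
    return θ, m, v
end

θ, m, v = 1.0, 0.0, 0.0
θ, m, v = adam_step(θ, 4.0, m, v, 1)   # first step (t = 1) with gradient 4.0
println("θ after one Adam step: ", θ)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so θ moves by roughly η (here from 1.0 to about 0.999) regardless of how large the gradient is.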
RMSProp (Root Mean Square Propagation) also adapts the learning rate per parameter, dividing the learning rate by an exponentially decaying average of squared gradients. It's known to work well in recurrent neural networks.
using Flux
# opt_state = Flux.setup(Flux.RMSProp(0.001), model)
opt = Flux.RMSProp(0.001) # Learning rate of 0.001
# RMSProp also has a decay-rate parameter (default 0.9), named ρ in Flux's documentation
# opt = Flux.RMSProp(0.001, 0.9)
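The per-parameter adaptation described above can be sketched with a hand-written RMSProp step. The helper `rmsprop_step` and the two-parameter toy gradient are assumptions for illustration. This follows the textbook update and is not a copy of Flux's implementation.

```julia
function rmsprop_step(θ, g, v; η = 0.001, ρ = 0.9, ϵ = 1e-8)
    v = ρ .* v .+ (1 - ρ) .* g .^ 2        # decaying average of squared gradients
    θ = θ .- η .* g ./ (sqrt.(v) .+ ϵ)     # divide each step by its own gradient scale
    return θ, v
end

θ = [1.0, 1.0]
g = [100.0, 0.01]                          # two gradients with very different magnitudes
θ_new, v = rmsprop_step(θ, g, zeros(2))
step_sizes = θ .- θ_new
println("step sizes: ", step_sizes)
```

Although the two gradients differ by a factor of 10,000, both parameters move by nearly the same amount, because each step is normalized by the running scale of its own gradient.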
Flux.jl offers a comprehensive suite of optimizers, including AdaGrad, AdaDelta, AMSGrad, and more. You can find the full list in Flux's optimizer documentation.
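The optimizers above only take effect once they are attached to a model and fed gradients. The following is a minimal sketch of a single update step, assuming a recent Flux release that provides the explicit `Flux.setup` and `Flux.update!` API. The toy model and the random data are hypothetical and chosen only for illustration.

```julia
using Flux

# Hypothetical toy model and random data, for illustration only.
model = Dense(3 => 1)
x = rand(Float32, 3, 8)   # 8 samples with 3 features each
y = rand(Float32, 1, 8)   # 8 target values

# Attach an optimizer to the model; this creates per-parameter optimizer state.
opt_state = Flux.setup(Flux.Adam(0.001), model)

w_before = copy(model.weight)

# Compute gradients of the loss with respect to the model's parameters.
grads = Flux.gradient(m -> Flux.mse(m(x), y), model)

# Let the optimizer apply one update to the model in place.
Flux.update!(opt_state, model, grads[1])

println("weights changed: ", model.weight != w_before)
```

Swapping `Flux.Adam(0.001)` for another optimizer, such as `Flux.Descent(0.01)`, leaves the rest of the step unchanged, which makes it cheap to experiment with different choices.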
Selecting the appropriate loss function is usually straightforward and dictated by your problem type:
Regression: Flux.mse is a good default. Others include Flux.mae (Mean Absolute Error).
Binary classification: Flux.logitbinarycrossentropy (if your model outputs logits) or Flux.binarycrossentropy (if outputs are probabilities via sigmoid).
Multi-class classification: Flux.logitcrossentropy (if your model outputs logits) or Flux.crossentropy (if outputs are probabilities via softmax).
Choosing an optimizer and its hyperparameters (like the learning rate) often requires some experimentation. Adam is a good starting point, typically with a learning rate of 0.001.
The learning rate is a particularly sensitive hyperparameter. A learning rate that is too small can lead to very slow convergence, while one that is too large can cause the loss to oscillate or diverge. Techniques like learning rate scheduling (decaying the learning rate over time) can also be beneficial. Flux lets you adjust an optimizer's learning rate during training, and the companion package ParameterSchedulers.jl provides ready-made schedules.
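The effect of a learning rate that is too small or too large can be seen on a toy problem. The sketch below runs plain gradient descent on the loss θ^2 with three learning rates. The function name `run_descent` and the specific values are assumptions chosen only to make the three regimes visible.

```julia
grad(θ) = 2 * θ   # gradient of the toy loss θ^2, whose minimum is at θ = 0

function run_descent(η; θ = 1.0, steps = 50)
    for _ in 1:steps
        θ -= η * grad(θ)
    end
    return θ
end

for η in (0.001, 0.1, 1.1)
    println("η = ", η, " -> θ after 50 steps: ", run_descent(η))
end
```

With η = 0.001 the parameter has barely moved from its starting value of 1.0, with η = 0.1 it is very close to the minimum, and with η = 1.1 each step overshoots so that θ flips sign and grows in magnitude, which is the divergence described above.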
With your loss function and optimizer defined, you have the core components ready to tell your model how to learn from data. The next step is to implement the training loop, which repeatedly feeds data to the model, calculates the loss, and uses the optimizer to update the model.
© 2025 ApX Machine Learning