At the output layer, a neural network produces its final prediction, y^. This is the last stage of processing for the input data, which has passed through a series of linear transformations and non-linear activations. The operations in the output layer are similar to those in the hidden layers, but the result is specifically tailored to the problem being solved.
Let's denote the activations from the last hidden layer (or the input layer, if there are no hidden layers) as A[L−1]. To get the output prediction A[L], we perform two final steps:
- Calculate the Weighted Sum for the Output Layer: Just like in the hidden layers, we compute the linear combination of the inputs to this layer (A[L−1]), using the output layer's weights (W[L]) and biases (b[L]).
Z[L]=W[L]A[L−1]+b[L]
Here, Z[L] represents the pre-activation values for the output layer neurons. The dimensions of W[L], b[L], and consequently Z[L] depend on the number of neurons in the last hidden layer and the number of neurons in the output layer. The number of output neurons is determined by the task: one neuron for binary classification or regression, and C neurons for multi-class classification with C classes.
- Apply the Output Layer Activation Function: The choice of activation function g[L] for the output layer is significant because it shapes the final output A[L] into the form required by the prediction task.
A[L]=g[L](Z[L])
This final activation A[L] is the network's prediction, which we often denote as y^.
In short, the final step of forward propagation is this output-layer pass: one last linear transformation followed by a task-specific activation function, yielding the network's prediction y^.
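As a concrete illustration, here is a minimal NumPy sketch of these two steps, assuming a column-per-example layout (A_prev holds A[L−1] with one column per training example) and a sigmoid output activation; the variable names and sizes are illustrative rather than prescribed by the text.

```python
import numpy as np

# Minimal sketch of the final forward-propagation step, assuming a
# column-per-example convention: A_prev has shape (n_prev, m) for m examples,
# W has shape (n_out, n_prev), and b has shape (n_out, 1).
rng = np.random.default_rng(0)

n_prev, n_out, m = 4, 1, 5                  # illustrative sizes (assumed)
A_prev = rng.standard_normal((n_prev, m))   # A[L-1] from the last hidden layer
W = rng.standard_normal((n_out, n_prev))    # W[L]
b = np.zeros((n_out, 1))                    # b[L]

# Step 1: weighted sum (pre-activation) for the output layer
Z = W @ A_prev + b                          # Z[L] = W[L] A[L-1] + b[L]

# Step 2: output activation g[L]; sigmoid shown here (binary classification)
Y_hat = 1.0 / (1.0 + np.exp(-Z))            # A[L] = y^, one probability per example

print(Y_hat.shape)                          # (1, 5)
```

Only the final activation line changes between task types, which is exactly what the next section covers.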
Choosing the Right Output Activation
Unlike the hidden layers, where ReLU is a common choice, the output layer's activation function g[L] depends directly on the nature of the prediction task (each option is sketched in code after this list):
- Regression: If the network needs to predict a continuous numerical value (like predicting house prices), a linear activation function is typically used. This means the output is simply the weighted sum: A[L]=Z[L]. The output can range from −∞ to +∞.
- Binary Classification: For tasks where the output is one of two classes (e.g., spam or not spam, 0 or 1), the Sigmoid function is used. It squashes the output Z[L] into the range (0, 1), which can be interpreted as the probability of the positive class.
y^ = A[L] = σ(Z[L]) = 1 / (1 + e^(−Z[L]))
- Multi-class Classification: When classifying among more than two mutually exclusive classes (e.g., identifying digits 0-9), the Softmax function is the standard choice. It takes the vector Z[L] (where each element corresponds to a class) and converts it into a probability distribution. Each element in the output vector A[L] is between 0 and 1, and the sum of all elements is 1.
y^i = Ai[L] = softmax(Z[L])i = e^(Zi[L]) / ∑_{j=1}^{C} e^(Zj[L]),  for i = 1, ..., C
Here, C is the number of classes, and y^i represents the predicted probability that the input belongs to class i.
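All three choices can be written in a few lines. The sketch below assumes the same column-per-example layout as above; the max-subtraction inside softmax is a standard numerical-stability trick that leaves the resulting probabilities unchanged.

```python
import numpy as np

def linear(Z):
    """Regression: identity activation, A[L] = Z[L]."""
    return Z

def sigmoid(Z):
    """Binary classification: squashes Z[L] into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-Z))

def softmax(Z):
    """Multi-class classification: converts each column of Z[L] (one example)
    into a probability distribution over C classes."""
    Z_shifted = Z - Z.max(axis=0, keepdims=True)  # stability trick, result unchanged
    expZ = np.exp(Z_shifted)
    return expZ / expZ.sum(axis=0, keepdims=True)

# Example: C = 3 classes, 2 examples (columns), arbitrary scores
Z = np.array([[2.0, -1.0],
              [1.0,  0.5],
              [0.1,  0.3]])
probs = softmax(Z)
print(probs)               # each column is a probability distribution
print(probs.sum(axis=0))   # [1. 1.]
```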
Understanding the Prediction (y^)
The final output vector A[L], which we call y^, contains the network's prediction(s) based on the input X.
- For regression, y^ is a single continuous value (or multiple values if predicting multiple targets).
- For binary classification, y^ is a single value between 0 and 1, representing P(class=1∣X). You might apply a threshold (e.g., 0.5) to convert this probability into a definite class label (0 or 1).
- For multi-class classification, y^ is a vector of probabilities, one for each class. The class with the highest probability is typically chosen as the final predicted label, as shown in the sketch below.
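Here is a small sketch of both conversions, using made-up probability values, the conventional 0.5 threshold for the binary case, and an argmax for the multi-class case.

```python
import numpy as np

# Binary classification: threshold the sigmoid output
y_hat_binary = np.array([[0.92, 0.13, 0.55]])      # P(class = 1 | X) per example
labels_binary = (y_hat_binary > 0.5).astype(int)   # -> [[1 0 1]]

# Multi-class classification: pick the class with the highest probability
y_hat_multi = np.array([[0.70, 0.10],
                        [0.20, 0.25],
                        [0.10, 0.65]])             # softmax output, C = 3, m = 2
labels_multi = np.argmax(y_hat_multi, axis=0)      # -> [0 2]

print(labels_binary, labels_multi)
```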
With the calculation of y^, the forward propagation process is complete. The network has taken an input X and produced a prediction y^ by passing information through its layers, applying linear transformations and activation functions along the way. This prediction can now be compared to the true label y using a loss function, which is the first step in the training process (covered in the next chapter).
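Putting it all together, here is a minimal end-to-end sketch of forward propagation for a tiny two-layer network (a ReLU hidden layer followed by a softmax output layer); the layer sizes and random weights are purely illustrative.

```python
import numpy as np

# End-to-end sketch of forward propagation: X -> hidden layer (ReLU) -> output layer (softmax).
rng = np.random.default_rng(1)

n_x, n_h, C, m = 3, 4, 2, 6               # input size, hidden size, classes, examples (assumed)
X = rng.standard_normal((n_x, m))

W1, b1 = rng.standard_normal((n_h, n_x)), np.zeros((n_h, 1))
W2, b2 = rng.standard_normal((C, n_h)),   np.zeros((C, 1))

Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)                    # hidden layer: ReLU

Z2 = W2 @ A1 + b2                         # output layer pre-activation Z[L]
expZ = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
Y_hat = expZ / expZ.sum(axis=0, keepdims=True)   # A[L] = y^ (softmax probabilities)

print(Y_hat.sum(axis=0))                  # each column sums to 1
```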