At the output layer, a neural network produces its final prediction, y^. This is the last stage of processing for the input data, which up to this point has involved linear transformations and non-linear activations. The operations in the output layer mirror those in the hidden layers, but the resulting output is tailored specifically to the problem being addressed.
Let's denote the activations from the last hidden layer (or the input layer, if there are no hidden layers) as A[L−1]. To get the output prediction A[L], we perform two final steps:
Calculate the Weighted Sum for the Output Layer: Just like in the hidden layers, we compute the linear combination of the inputs to this layer (A[L−1]), using the output layer's weights (W[L]) and biases (b[L]).
Z[L] = W[L] A[L−1] + b[L]

Here, Z[L] represents the pre-activation values for the output layer neurons. The dimensions of W[L], b[L], and consequently Z[L] depend on the number of neurons in the last hidden layer and the number of neurons in the output layer. The number of output neurons is determined by the task: one neuron for binary classification or regression, and C neurons for multi-class classification with C classes.
Apply the Output Layer Activation Function: The choice of activation function g[L] for the output layer is significant because it formats the final output A[L] into the desired prediction format.
A[L] = g[L](Z[L])

This final activation A[L] is the network's prediction, which we often denote as y^.
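To make these two steps concrete, here is a minimal NumPy sketch of the output-layer computation for a multi-class problem. The layer sizes, the softmax choice, and the variable names (A_prev, W_out, b_out) are illustrative assumptions, not values taken from the text.

```python
import numpy as np

# Illustrative sizes (assumptions): 4 neurons in the last hidden layer,
# 3 output neurons (C = 3 classes), a batch of 5 examples.
n_prev, n_out, m = 4, 3, 5

A_prev = np.random.randn(n_prev, m)             # A[L-1]: activations from the last hidden layer
W_out = np.random.randn(n_out, n_prev) * 0.01   # W[L]: shape (n_out, n_prev)
b_out = np.zeros((n_out, 1))                    # b[L]: shape (n_out, 1), broadcast across the batch

# Step 1: weighted sum for the output layer, Z[L] = W[L] A[L-1] + b[L]
Z_out = W_out @ A_prev + b_out                  # shape (n_out, m)

# Step 2: output activation, here softmax so each column is a probability distribution
expZ = np.exp(Z_out - Z_out.max(axis=0, keepdims=True))  # subtract max for numerical stability
A_out = expZ / expZ.sum(axis=0, keepdims=True)

print(A_out.shape)        # (3, 5): one probability vector per example
print(A_out.sum(axis=0))  # each column sums to 1
```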
The final step in forward propagation involves applying a linear transformation and an activation function specific to the output layer to produce the network's prediction, y^.
Unlike hidden layers, where ReLU is a common choice, the output layer's activation function g[L] depends directly on the nature of the prediction task:

- Binary classification: a sigmoid activation squashes Z[L] into a single value between 0 and 1, interpreted as the probability of the positive class.
- Multi-class classification: a softmax activation converts the C values in Z[L] into a probability distribution over the C classes, with entries that are non-negative and sum to 1.
- Regression: a linear (identity) activation is used, so A[L] = Z[L], allowing the network to output any real value.
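As a sketch of how each choice formats the prediction, the functions below implement sigmoid, softmax, and identity output activations. The function names and the column-per-example layout are assumptions made for illustration.

```python
import numpy as np

def sigmoid_output(Z):
    """Binary classification: one output neuron, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-Z))

def softmax_output(Z):
    """Multi-class classification: each column (one example) sums to 1."""
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))  # shift for numerical stability
    return expZ / expZ.sum(axis=0, keepdims=True)

def linear_output(Z):
    """Regression: pass Z[L] through unchanged, any real value allowed."""
    return Z

Z = np.array([[2.0, -1.0],
              [0.5,  0.0],
              [-1.5, 3.0]])                 # C = 3 classes, 2 examples
print(softmax_output(Z).sum(axis=0))        # [1. 1.]
```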
The final output vector A[L], which we call y^, contains the network's prediction(s) based on the input X.
With the calculation of y^, the forward propagation process is complete. The network has taken an input X and produced a prediction y^ by passing information through its layers, applying linear transformations and activation functions along the way. This prediction can now be compared to the true label y using a loss function, which is the first step in the training process (covered in the next chapter).
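Putting the whole forward pass together, the sketch below runs an input X through one ReLU hidden layer and a sigmoid output layer to produce y^. The layer sizes, initialization scheme, and binary-classification setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: 3 input features, 4 hidden neurons, 1 output neuron, batch of 2 examples
X = rng.normal(size=(3, 2))

W1, b1 = rng.normal(size=(4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)) * 0.01, np.zeros((1, 1))

# Hidden layer: linear transformation followed by ReLU
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)

# Output layer: linear transformation followed by sigmoid (binary classification)
Z2 = W2 @ A1 + b2
y_hat = 1.0 / (1.0 + np.exp(-Z2))   # A[L] = y^, probabilities in (0, 1)

print(y_hat)  # one prediction per example, ready to compare with the true labels y
```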