After the input data has passed through the input and hidden layers, undergoing a linear transformation and a non-linear activation at each step, it finally reaches the output layer. This is where the network produces its final result, the prediction y^. The process mirrors the steps in the hidden layers but culminates in an output tailored to the specific problem you're trying to solve.
Let's denote the activations from the last hidden layer (or the input layer, if there are no hidden layers) as A[L−1]. To get the output prediction A[L], we perform two final steps:
Calculate the Weighted Sum for the Output Layer: Just like in the hidden layers, we compute the linear combination of the inputs to this layer (A[L−1]), using the output layer's weights (W[L]) and biases (b[L]).
Z[L] = W[L] A[L−1] + b[L]

Here, Z[L] represents the pre-activation values for the output layer neurons. The dimensions of W[L], b[L], and consequently Z[L] depend on the number of neurons in the last hidden layer and the number of neurons in the output layer. The number of output neurons is determined by the task: one neuron for binary classification or regression, and C neurons for multi-class classification with C classes.
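To make the shapes concrete, here is a minimal NumPy sketch of this linear step. The layer sizes (3 neurons in the last hidden layer, 1 output neuron) and the batch size of 4 are arbitrary values chosen for illustration, not part of the text above.

```python
import numpy as np

# Assumed example sizes: 3 neurons in the last hidden layer,
# 1 output neuron (binary classification or regression), batch of 4 examples.
n_prev, n_out, m = 3, 1, 4

A_prev = np.random.randn(n_prev, m)      # A[L-1], shape (3, 4)
W_out = np.random.randn(n_out, n_prev)   # W[L],   shape (1, 3)
b_out = np.zeros((n_out, 1))             # b[L],   shape (1, 1), broadcast across the batch

Z_out = W_out @ A_prev + b_out           # Z[L] = W[L] A[L-1] + b[L]
print(Z_out.shape)                       # (1, 4): one pre-activation value per example
```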
Apply the Output Layer Activation Function: The choice of activation function g[L] for the output layer matters because it shapes the final output A[L] into the form the prediction requires.
A[L] = g[L](Z[L])

This final activation A[L] is the network's prediction, which we often denote as y^.
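For example, in a binary classification problem g[L] would be the sigmoid function, turning each pre-activation into a probability. A minimal sketch continuing the shapes above; the sigmoid helper and the sample values are illustrative only:

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into the (0, 1) range so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

# Pre-activations Z[L] for a batch of 4 examples, as computed in the previous snippet.
Z_out = np.array([[0.8, -1.2, 2.5, 0.0]])

A_out = sigmoid(Z_out)   # A[L] = g[L](Z[L]), i.e. the prediction y^
print(A_out)             # roughly [[0.69 0.23 0.92 0.5 ]]
```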
In short, the final step of forward propagation applies one more linear transformation followed by an output-specific activation function to produce the network's prediction, y^.
Unlike the hidden layers, where ReLU is a common choice, the output layer's activation function g[L] depends directly on the nature of the prediction task: sigmoid for binary classification, where the single output must be a probability between 0 and 1; softmax for multi-class classification, where the C outputs must form a probability distribution over the classes; and a linear (identity) activation for regression, where the output is an unbounded real value.
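Alongside the sigmoid shown earlier, the other two common choices can be sketched as plain NumPy functions. The function names are my own and the sample values are arbitrary; a deep learning framework would provide its own implementations.

```python
import numpy as np

def softmax(z):
    # Multi-class classification: turns each column of Z[L] into a probability
    # distribution over the C classes (each column sums to 1). Subtracting the
    # column-wise max improves numerical stability without changing the result.
    shifted = z - np.max(z, axis=0, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=0, keepdims=True)

def identity(z):
    # Regression: the raw pre-activation already is the predicted value.
    return z

# Example with C = 3 classes and a batch of 2 examples (columns are examples).
Z_out = np.array([[2.0, 0.1],
                  [1.0, 0.2],
                  [0.1, 3.0]])
print(softmax(Z_out).sum(axis=0))   # [1. 1.]: each column is a valid distribution
```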
The final output vector A[L], which we call y^, contains the network's prediction(s) based on the input X.
With the calculation of y^, the forward propagation process is complete. The network has taken an input X and produced a prediction y^ by passing information through its layers, applying linear transformations and activation functions along the way. This prediction can now be compared to the true label y using a loss function, which is the first step in the training process (covered in the next chapter).