After calculating the linear combination Z=WX+b for a layer, representing the weighted sum of inputs plus the bias for each neuron, the network performs a critical step: applying an activation function. If we were to omit this step and simply pass the linear combination Z directly to the next layer, the entire network, no matter how many layers deep, would behave like a single, large linear transformation. Stacking linear functions results in another linear function. Such a network wouldn't be able to model complex, non-linear relationships often present in real-world data like images, text, or intricate tabular datasets.
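To see this collapse concretely, the short NumPy sketch below (with arbitrary, made-up weight shapes) composes two linear layers without any activation and shows the result equals one single linear layer whose weights are W2 @ W1 and whose bias is W2 @ b1 + b2:

import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (shapes are arbitrary examples)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

X = rng.normal(size=(3, 5))              # 3 features, batch of 5 examples

# Layer-by-layer computation without activations
out_stacked = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
out_single = W_combined @ X + b_combined

print(np.allclose(out_stacked, out_single))  # True: the stack is just one linear map

Inserting a non-linear function between the two layers is exactly what breaks this equivalence.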
Activation functions introduce non-linearity into the network, enabling it to learn much more complex patterns and functions. This non-linear transformation is applied element-wise to the output of the linear step (Z). This means the activation function operates independently on each element within the matrix Z.
If Z is the matrix containing the linear combinations for all neurons in a layer (where each column might represent an example in a batch, and each row a neuron), and g represents the chosen activation function, the output of the activation step, denoted by A, is calculated as:
A = g(Z)

Each element A_ij in the matrix A is obtained by applying the function g to the corresponding element Z_ij in matrix Z:

A_ij = g(Z_ij)

This operation happens within each hidden layer and potentially in the output layer.
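For a tiny, concrete example (values chosen arbitrarily), applying Tanh to a 2x2 matrix Z produces a matrix A of the same shape, where each entry A_ij is simply tanh(Z_ij):

import numpy as np

Z = np.array([[ 0.5, -1.0],
              [ 2.0,  0.0]])                 # example linear outputs: 2 neurons, 2 examples

A = np.tanh(Z)                                # g applied independently to every element
print(A.shape)                                # (2, 2), the same shape as Z
print(np.isclose(A[0, 1], np.tanh(Z[0, 1])))  # True: A_ij = g(Z_ij)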
Recall from Chapter 1 the common activation functions: Sigmoid, Hyperbolic Tangent (Tanh), and the Rectified Linear Unit (ReLU). Each applies a specific non-linear transformation: Sigmoid squashes its input into the range (0, 1), Tanh squashes it into (−1, 1), and ReLU outputs max(0, z), setting negative values to zero while leaving positive values unchanged.
The choice of activation function impacts how the network learns and performs. ReLU is a common default for hidden layers due to its simplicity and ability to mitigate vanishing gradient issues compared to Sigmoid or Tanh in deep networks.
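As a sketch (the helper names below are our own, not from any particular library), the three functions can be written in a few lines of NumPy and compared on the same sample values to highlight their different output ranges:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0, z)           # zeroes out negatives, keeps positives

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))   # approx. [0.119 0.378 0.5   0.622 0.881]
print(tanh(z))      # approx. [-0.964 -0.462 0.    0.462 0.964]
print(relu(z))      # [0.  0.  0.  0.5 2. ]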
Let's visualize the transformation within a layer: The input X (or the activation A from the previous layer) goes through a linear transformation to produce Z, which is then passed through the element-wise activation function g to produce the layer's output activation A.
Data flows from the input, through the linear combination calculation, and then through the element-wise activation function to produce the layer's output.
Consider the effect of the ReLU activation function, g(z) = max(0, z). Applied element-wise to the output Z of the linear transformation, it clips every negative value to zero and lets positive values pass through unchanged.
This two-step process (linear combination followed by non-linear activation) defines the computation within a single layer of the network. During forward propagation, the output activations A[l] of layer l become the input X[l+1] for the next layer, l+1.
Z[l] = W[l] A[l−1] + b[l]
A[l] = g[l](Z[l])

Here, A[0] represents the initial input data X. This sequence repeats for all hidden layers until the final output layer is reached.
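A minimal forward-propagation loop following these equations might look like the sketch below, assuming a hypothetical list of (W, b) pairs called parameters and ReLU in every layer shown:

import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(X, parameters):
    # parameters: list of (W, b) tuples, one per layer (hypothetical structure)
    A = X                          # A[0] is the input data
    for W, b in parameters:
        Z = np.dot(W, A) + b       # linear combination for this layer
        A = relu(Z)                # element-wise activation
    return A                       # activation of the last layer in the list

# Example with made-up shapes: 3 inputs -> 4 hidden units -> 2 hidden units
rng = np.random.default_rng(1)
parameters = [(rng.normal(size=(4, 3)), np.zeros((4, 1))),
              (rng.normal(size=(2, 4)), np.zeros((2, 1)))]
X = rng.normal(size=(3, 5))          # batch of 5 examples
print(forward(X, parameters).shape)  # (2, 5)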
In practice, using libraries like NumPy allows for efficient element-wise application of activation functions to the entire matrix Z at once. For example, applying ReLU conceptually:
import numpy as np

# Example output of the linear step for a layer with 3 neurons and a batch of 4 examples
# (in practice Z would come from the linear step: Z = np.dot(W, A_prev) + b)
Z = np.array([[ 1.5, -0.3,  2.0, -1.2],
              [-0.7,  0.9, -0.1,  0.4],
              [ 0.2, -2.5,  1.1,  0.0]])

# Apply ReLU activation element-wise
A = np.maximum(0, Z)
# A now holds this layer's activations; every negative entry of Z has been clipped to 0
This vectorized operation is significantly faster than iterating through each element individually.
While activation functions like ReLU, Sigmoid, and Tanh are common in hidden layers, the activation function used in the output layer is often chosen based on the specific task (e.g., linear for regression, Sigmoid for binary classification, Softmax for multi-class classification). This ensures the network's output is in the appropriate format for calculating the loss and making predictions. We'll explore this further when discussing the final prediction calculation.
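As an illustrative sketch (the helper names are our own), a binary classifier would typically apply Sigmoid to the output layer's Z so each value can be read as a probability, while a multi-class classifier applies Softmax so the scores for each example sum to 1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # subtract the column-wise max for numerical stability
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

Z_binary = np.array([[1.2, -0.4, 0.3]])   # 1 output neuron, 3 examples
print(sigmoid(Z_binary))                  # probabilities in (0, 1)

Z_multi = np.array([[2.0, 0.1],
                    [1.0, 0.2],
                    [0.1, 3.0]])          # 3 classes, 2 examples
probs = softmax(Z_multi)
print(probs.sum(axis=0))                  # each column sums to 1.0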