In the previous sections, we saw how to calculate the output of a single neuron by taking a weighted sum of its inputs, adding a bias, and then applying an activation function. While this gives us the fundamental building block, imagine performing this calculation one neuron at a time for a layer with hundreds or thousands of neurons, and then repeating it across multiple layers for millions of input samples. Done with explicit loops, this quickly becomes impractically slow.
Fortunately, we can leverage the power of linear algebra and optimized numerical libraries (like NumPy in Python) to perform these calculations much more efficiently using matrix operations. This process is often referred to as vectorization. Instead of iterating through each neuron or each input connection individually, we process entire layers or batches of data simultaneously.
Let's reconsider the calculations for a single layer in the network. Suppose this layer has $n_{out}$ neurons and receives input from a previous layer (or the input features) with $n_{in}$ values.
Recall the weighted sum for a single neuron $j$ in the layer:
$$z_j = \left(\sum_{i=1}^{n_{in}} w_{ji} x_i\right) + b_j$$
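As a point of reference, here is a minimal sketch in plain Python of this per-neuron loop (the function name and example numbers are purely illustrative); the rest of this section replaces exactly this kind of looping with matrix operations.

```python
def neuron_weighted_sum(w_j, x, b_j):
    """Compute z_j = sum_i(w_ji * x_i) + b_j for a single neuron."""
    z_j = 0.0
    for w_ji, x_i in zip(w_j, x):          # one multiply-add per input connection
        z_j += w_ji * x_i
    return z_j + b_j

# Illustrative values: 3 inputs feeding one neuron
print(neuron_weighted_sum([0.2, -0.5, 0.1], [1.0, 2.0, 3.0], 0.4))   # approx. -0.1
```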
Using matrix notation, we can compute the weighted sums for all $n_{out}$ neurons in the layer in a single operation:
$$Z = Wx + b$$
Let's break down the dimensions. Here, $W$ is the weight matrix of size $(n_{out} \times n_{in})$, where row $j$ holds the weights $w_{ji}$ of neuron $j$; $x$ is the input vector of size $(n_{in} \times 1)$; and $b$ is the bias vector of size $(n_{out} \times 1)$.
The matrix multiplication $Wx$ results in a vector of size $(n_{out} \times 1)$. Each element of this vector is the dot product of a row of $W$ (the weights for one neuron) with the input vector $x$, which computes the $\sum_i w_{ji} x_i$ part for every neuron simultaneously.
We then add the bias vector $b$ (also of size $(n_{out} \times 1)$) to the result of $Wx$. This addition is performed element-wise, adding the corresponding bias $b_j$ to the weighted sum calculated for neuron $j$. The final result, $Z$, is a column vector of size $(n_{out} \times 1)$, where $Z_j$ contains the weighted sum $z_j$ for the $j$-th neuron in the layer.
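A minimal NumPy sketch of this single-example computation, assuming small illustrative layer sizes ($n_{in} = 4$, $n_{out} = 3$) and random values:

```python
import numpy as np

n_in, n_out = 4, 3                        # illustrative layer sizes
rng = np.random.default_rng(0)

W = rng.standard_normal((n_out, n_in))    # weight matrix: one row per neuron
x = rng.standard_normal((n_in, 1))        # single input as a column vector
b = rng.standard_normal((n_out, 1))       # bias column vector

Z = W @ x + b                             # all n_out weighted sums in one step
print(Z.shape)                            # (3, 1)
```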
Diagram: forward propagation for a single layer using matrix operations. The input vector $x$ is transformed by the layer's weight matrix $W$ and bias vector $b$ to produce the vector of weighted sums $Z$; an activation function $g$ is then applied element-wise to $Z$ to yield the activation vector $A$.
Once we have the vector Z containing the weighted sums for all neurons in the layer, we apply the layer's activation function g (like Sigmoid, Tanh, or ReLU). This operation is performed element-wise on the vector Z:
$$A = g(Z)$$
If $Z = [z_1, z_2, \dots, z_{n_{out}}]^T$, then the resulting activation vector is $A = [g(z_1), g(z_2), \dots, g(z_{n_{out}})]^T$. This vector $A$, also of size $(n_{out} \times 1)$, represents the output of the current layer and serves as the input to the next layer in the network.
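As a brief sketch, here is how an element-wise activation (sigmoid chosen here as one possible $g$) might be applied to a vector of weighted sums in NumPy; the values of `Z` are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid; works on scalars, vectors, and matrices alike."""
    return 1.0 / (1.0 + np.exp(-z))

Z = np.array([[-1.0], [0.0], [2.0]])     # made-up weighted sums, shape (3, 1)
A = sigmoid(Z)                           # activation applied to every entry
print(A.round(3))                        # [[0.269] [0.5] [0.881]]
```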
Neural networks are typically trained using batches of multiple training examples at once, rather than one example at a time. Matrix operations handle this extension gracefully.
Instead of a single input vector $x$ of size $(n_{in} \times 1)$, we now have an input matrix $X$ in which each column is a different training example. For a batch of $m$ examples, $X$ has dimensions $(n_{in} \times m)$.
The forward propagation calculation for the entire batch becomes:
$$Z = WX + b$$
Let's check the dimensions again:
The matrix multiplication $WX$ now results in a matrix of size $(n_{out} \times m)$. Column $k$ of this result contains the weighted sums for all neurons in the layer, calculated from the $k$-th input example in the batch $X$.
When adding the bias vector $b$ of size $(n_{out} \times 1)$ to the matrix $WX$ of size $(n_{out} \times m)$, most numerical libraries use a feature called broadcasting. The bias vector $b$ is effectively duplicated or "broadcast" across all $m$ columns of $WX$, so the bias $b_j$ is added to the weighted sum of the $j$-th neuron for every example in the batch. The resulting matrix $Z$ has dimensions $(n_{out} \times m)$.
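A quick sketch to see broadcasting in action, with made-up shapes ($n_{out} = 3$, $m = 5$) standing in for a real layer and batch:

```python
import numpy as np

n_out, m = 3, 5                          # made-up sizes: 3 neurons, batch of 5
WX = np.ones((n_out, m))                 # stand-in for the product W @ X
b = np.array([[0.1], [0.2], [0.3]])      # bias column vector, shape (3, 1)

Z = WX + b                               # b is broadcast across all 5 columns
print(Z.shape)                           # (3, 5)
print(Z[:, 0])                           # [1.1 1.2 1.3] -- same biases added in every column
```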
Finally, the activation function g is applied element-wise to the entire matrix Z:
$$A = g(Z)$$
The resulting activation matrix $A$ also has dimensions $(n_{out} \times m)$. Each column of $A$ holds the activations of the layer's neurons for one input example from the batch. This matrix $A$ then becomes the input to the next layer.
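Putting the batch computation together, one possible way to sketch a single layer's forward pass in NumPy is shown below; the layer sizes, batch size, and choice of ReLU are illustrative assumptions, not values from the text:

```python
import numpy as np

def layer_forward(X, W, b, g):
    """Forward pass for one layer over a whole batch.

    X: (n_in, m) inputs, one example per column
    W: (n_out, n_in) weights; b: (n_out, 1) biases
    g: element-wise activation function
    Returns the activation matrix A with shape (n_out, m).
    """
    Z = W @ X + b          # (n_out, m); b is broadcast across the batch
    return g(Z)

def relu(z):
    """Element-wise ReLU activation."""
    return np.maximum(z, 0.0)

# Illustrative sizes: 4 inputs, 3 neurons, batch of 8 examples
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = rng.standard_normal((3, 4))
b = np.zeros((3, 1))

A = layer_forward(X, W, b, relu)
print(A.shape)             # (3, 8) -- becomes the input to the next layer
```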
Using matrix operations transforms the computationally intensive process of looping through neurons and connections into highly optimized, single-line commands (conceptually, `Z = W @ X + b` in NumPy). These operations are handled by underlying libraries (like BLAS/LAPACK), often written in C or Fortran, making the calculations significantly faster than explicit Python loops. This speedup is essential for training modern neural networks, which can have millions or even billions of parameters, on large datasets.
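If you want to see the difference on your own machine, the sketch below (with arbitrarily chosen sizes) compares an explicit triple loop against the single vectorized expression; exact timings will vary by hardware:

```python
import time
import numpy as np

# Arbitrary sizes: 50 neurons, 100 inputs each, batch of 200 examples
n_in, n_out, m = 100, 50, 200
rng = np.random.default_rng(0)
W = rng.standard_normal((n_out, n_in))
X = rng.standard_normal((n_in, m))
b = rng.standard_normal((n_out, 1))

# Explicit Python loops over neurons, examples, and connections
start = time.perf_counter()
Z_loop = np.zeros((n_out, m))
for j in range(n_out):
    for k in range(m):
        total = 0.0
        for i in range(n_in):
            total += W[j, i] * X[i, k]
        Z_loop[j, k] = total + b[j, 0]
loop_seconds = time.perf_counter() - start

# Vectorized version: one matrix multiplication plus a broadcast add
start = time.perf_counter()
Z_vec = W @ X + b
vec_seconds = time.perf_counter() - start

print(np.allclose(Z_loop, Z_vec))                      # True: same result
print(f"loops: {loop_seconds:.3f}s  vectorized: {vec_seconds:.5f}s")
```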
This efficient calculation of the forward pass using matrices forms the backbone of how neural networks generate predictions and is a prerequisite for understanding the subsequent backpropagation algorithm used during training.