In the previous chapter, we saw that Multi-Layer Perceptrons (MLPs) gain their power from stacking layers of artificial neurons. Each neuron, as we discussed, computes a weighted sum of its inputs and adds a bias. If we stopped there, simply passing this sum directly to the next layer, our deep network wouldn't be much more capable than a single-layer model.
Why is that? Consider a two-layer network where each layer performs only a linear transformation (weighted sum + bias). The output of the first layer would be a linear function of the input. The output of the second layer would be a linear function of the first layer's output. Composing two linear functions results in another linear function. No matter how many linear layers we stack, the entire network would behave like a single linear transformation. It couldn't model complex, non-linear patterns in data, like the XOR problem we encountered earlier.
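A quick numerical check makes this concrete. The sketch below (a minimal NumPy illustration; the matrix shapes and values are arbitrary assumptions, not taken from any particular model) stacks two purely linear layers and shows that they are equivalent to a single linear layer with a combined weight matrix and bias.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" that only compute weighted sums plus biases (no activation).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 units -> 2 outputs

x = rng.normal(size=3)                                  # an arbitrary input vector

# Forward pass through the two stacked linear layers.
two_layer_output = W2 @ (W1 @ x + b1) + b2

# The same computation collapsed into one linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
single_layer_output = W_combined @ x + b_combined

# The two results match (up to floating-point error): stacking linear
# layers buys no extra expressive power.
print(np.allclose(two_layer_output, single_layer_output))  # True
```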
This is where activation functions come into play. An activation function is a fixed, non-linear function applied to the output of the weighted sum (plus bias) calculation within a neuron, before the result is passed to the next layer.
Mathematically, if $z$ is the weighted sum plus bias ($z = \sum_i w_i x_i + b$), the neuron's output $a$ is given by $a = f(z)$, where $f$ is the activation function.
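The same computation in code: a single neuron forms the weighted sum $z$ and then applies an activation $f$. This is a minimal sketch assuming NumPy and a ReLU activation; the weights, bias, and input values are illustrative placeholders, not values from any trained network.

```python
import numpy as np

def relu(z):
    """A common non-linear activation: f(z) = max(0, z)."""
    return np.maximum(0.0, z)

w = np.array([0.5, -1.2, 0.8])   # illustrative weights
b = 0.1                          # illustrative bias
x = np.array([1.0, 2.0, -0.5])   # illustrative input

z = np.dot(w, x) + b   # weighted sum plus bias: z = sum_i w_i x_i + b
a = relu(z)            # neuron output: a = f(z)

print(z, a)            # z is negative here, so ReLU outputs 0.0
```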
The critical property of $f$ is its non-linearity. By introducing this non-linear step after each layer's linear computation, the network breaks the chain of linearity. Stacking layers now allows the MLP to approximate arbitrarily complex functions, enabling it to learn intricate relationships and patterns in data that linear models cannot capture.
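To see this with the XOR problem from earlier: no linear model can separate XOR, but a two-layer network with a ReLU non-linearity can represent it exactly. The weights below are hand-picked for illustration (an assumption for clarity, not values produced by training), but they show that the non-linear step is what makes the mapping possible.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# All four XOR inputs and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Hand-picked weights for a 2-unit hidden layer with ReLU.
W1 = np.array([[1.0, 1.0],    # hidden unit 1: relu(x1 + x2)
               [1.0, 1.0]])   # hidden unit 2: relu(x1 + x2 - 1), via the -1 bias below
b1 = np.array([0.0, -1.0])
w2 = np.array([1.0, -2.0])    # output = h1 - 2*h2

hidden = relu(X @ W1.T + b1)
output = hidden @ w2

print(output)                  # [0. 1. 1. 0.]
print(np.allclose(output, y))  # True
```

Replacing `relu` with the identity function in this sketch collapses the network back to a linear function of the inputs, and no choice of weights would then reproduce the XOR targets.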
Think of it like bending or warping the space in which the data exists. A linear transformation can only stretch, rotate, or shear the space. A non-linear activation function allows the network to perform more complex transformations at each step, ultimately enabling it to separate data points that are not linearly separable.
Figure: Comparison between a simple linear 'activation' (the identity function) and a common non-linear activation function (ReLU). The non-linearity allows the network to model more complex relationships.
In essence, the primary role of activation functions is to introduce the non-linearity required for deep neural networks to learn complex mappings from inputs to outputs. Without them, an MLP would collapse to a single linear transformation and lose most of its representational power. While some activation functions have additional useful properties (such as bounding the output range, which we'll discuss), their fundamental contribution is enabling non-linear computation.
The choice of activation function can significantly impact a network's training dynamics and performance. In the following sections, we will look at several popular activation functions, examining their characteristics, advantages, and disadvantages.