In the previous section, we saw how an artificial neuron combines its inputs using weights and adds a bias to produce a value, often denoted as z:
$$z = \left(\sum_i w_i x_i\right) + b$$

This calculation, $z = \mathbf{w} \cdot \mathbf{x} + b$ in vector notation, is a linear transformation. If we simply stacked layers of neurons that only perform this linear calculation, the entire network would still only be capable of learning linear relationships. Why? Because a sequence of linear transformations can always be mathematically reduced to a single, equivalent linear transformation. If our network function is $f(x) = W_2(W_1 x + b_1) + b_2$, this simplifies to $f(x) = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is just another linear function $f(x) = W'x + b'$.
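To make this concrete, here is a small NumPy sketch (illustrative only; the shapes and random values are made up) that checks numerically that two stacked linear layers collapse into a single equivalent linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two purely linear "layers" with arbitrary weights and biases.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Stacked linear layers: f(x) = W2 (W1 x + b1) + b2
stacked = W2 @ (W1 @ x + b1) + b2

# Single equivalent linear layer: W' = W2 W1 and b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
collapsed = W_prime @ x + b_prime

print(np.allclose(stacked, collapsed))  # True: the two linear layers act as one linear map
```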
To learn complex, non-linear patterns in data (like recognizing images, understanding language, or predicting intricate trends), neural networks need a way to introduce non-linearity. This is precisely the role of activation functions.
An activation function, typically denoted $f(z)$, $\sigma(z)$, or $g(z)$, is applied to the output $z$ of the linear transformation within a neuron. The result, $a = f(z)$, is called the neuron's activation or output.
$$a = f\left(\left(\sum_i w_i x_i\right) + b\right)$$

This function takes the summed, weighted input plus bias ($z$) and transforms it, typically in a non-linear way. By applying this non-linear function after each layer's linear calculations, the network gains the ability to model much more complex relationships between inputs and outputs. It's this introduction of non-linearity that allows deep networks to approximate highly complex functions.
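As a quick illustration, the following sketch (illustrative values and names, not from the text) computes a single neuron's activation by following exactly this recipe: linear step first, then a non-linear function:

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Return a = f(w . x + b) for a single neuron."""
    z = np.dot(w, x) + b       # linear step: weighted sum of inputs plus bias
    return activation(z)       # non-linear step: apply the activation function

# Made-up inputs, weights, and bias for illustration.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2

a = neuron_output(x, w, b, activation=lambda z: max(0.0, z))  # ReLU-style activation
print(a)
```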
Several activation functions have been developed, each with its own characteristics, advantages, and disadvantages. Let's look at some of the most common ones used in hidden layers and output layers.
The Sigmoid function, also known as the logistic function, was historically very popular. It squashes its input value into a range between 0 and 1.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Characteristics:

- Output range is (0, 1), which makes it convenient when the output should be interpreted as a probability, for example in binary classification output layers.
- Smooth and differentiable everywhere.

Drawbacks:

- Saturates for large positive or negative inputs, where the gradient approaches zero. This contributes to the vanishing gradient problem and can slow or stall learning in deep networks.
- Outputs are not zero-centered, which can make gradient updates less efficient.
- The exponential is relatively expensive to compute.
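For reference, a minimal NumPy implementation of Sigmoid and its derivative is sketched below (illustrative only); the small gradient values at the extremes show the saturation behavior described above:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z)); close to 0 when |z| is large."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))        # outputs approach 0 on the left and 1 on the right
print(sigmoid_grad(z))   # gradients are largest near z = 0 and tiny where the curve saturates
```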
The Hyperbolic Tangent function, or Tanh, is mathematically related to Sigmoid but squashes the input to a range between -1 and 1.
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1$$

Characteristics:

- Output range is (-1, 1) and is zero-centered, which often makes optimization easier than with Sigmoid.
- Smooth and differentiable everywhere.

Drawbacks:

- Still saturates for large positive or negative inputs, so it suffers from the same vanishing gradient problem as Sigmoid, though usually less severely in practice.
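A similar sketch for Tanh (again illustrative; NumPy already provides np.tanh) also verifies the identity tanh(z) = 2σ(2z) − 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

print(np.tanh(z))                     # zero-centered outputs in (-1, 1)
print(2.0 * sigmoid(2.0 * z) - 1.0)   # identical values via tanh(z) = 2*sigmoid(2z) - 1
```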
The Rectified Linear Unit (ReLU) is currently one of the most widely used activation functions in hidden layers, particularly in deep learning. It's computationally simple and effective.
$$\text{ReLU}(z) = \max(0, z)$$

Characteristics:

- Extremely cheap to compute: just a threshold at zero.
- Does not saturate for positive inputs, so gradients flow well and training often converges faster than with Sigmoid or Tanh.
- Produces sparse activations, since negative inputs are mapped exactly to zero.

Drawbacks:

- Not zero-centered.
- The "dying ReLU" problem: if a neuron's input stays negative, its output and gradient are both zero, so the neuron can stop learning entirely.
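A minimal ReLU sketch (illustrative) together with its gradient makes the dying ReLU issue easy to see: inputs below zero receive zero gradient:

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient is 1 for z > 0 and 0 for z < 0 (0 is used at z = 0 by convention here)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # negative inputs are clipped to 0, positive inputs pass through unchanged
print(relu_grad(z))  # a neuron whose z stays negative gets zero gradient: the "dying ReLU" issue
```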
The following chart shows the shapes and output ranges of Sigmoid, Tanh, and ReLU functions.
Comparison of Sigmoid (blue, 0 to 1), Tanh (purple, -1 to 1), and ReLU (orange, 0 to ∞) activation functions. Notice the saturation points for Sigmoid and Tanh where the gradient approaches zero, versus the constant gradient of ReLU for positive inputs.
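If you want to reproduce a chart like this yourself, a short matplotlib sketch along these lines would do it (assuming matplotlib is installed; colors and styling are up to you):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5.0, 5.0, 200)

plt.plot(z, 1.0 / (1.0 + np.exp(-z)), label="Sigmoid")
plt.plot(z, np.tanh(z), label="Tanh")
plt.plot(z, np.maximum(0.0, z), label="ReLU")
plt.axhline(0.0, color="gray", linewidth=0.5)
plt.xlabel("z")
plt.ylabel("activation")
plt.legend()
plt.show()
```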
Researchers have proposed variants to address the drawbacks of these main functions. For example, Leaky ReLU introduces a small, non-zero slope for negative inputs (max(0.01z,z)) to combat the dying ReLU problem. Other popular choices include ELU (Exponential Linear Unit) and Swish. However, ReLU often remains a strong default choice for hidden layers due to its simplicity and effectiveness in many scenarios.
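A Leaky ReLU takes only a couple of lines; the sketch below (illustrative, using the commonly cited slope of 0.01) shows that negative inputs keep a small non-zero output:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z > 0, alpha * z otherwise (equals max(alpha*z, z) for 0 < alpha < 1)."""
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(z))   # negative inputs retain a small non-zero output, so their gradient never vanishes entirely
```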
In a typical neural network layer, the activation function is applied element-wise to the vector or matrix resulting from the linear transformation ($z = Wx + b$). If a layer has 10 neurons, the linear transformation produces 10 values ($z_1, z_2, \ldots, z_{10}$). The activation function is then applied individually to each of these values to produce the layer's final output activations ($a_1 = f(z_1), a_2 = f(z_2), \ldots, a_{10} = f(z_{10})$). These activations then serve as the inputs to the next layer.
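The following sketch (illustrative shapes and random weights) shows this element-wise application for a 10-neuron layer:

```python
import numpy as np

def dense_layer(x, W, b, activation):
    """One layer: linear transform z = W x + b, then the activation applied element-wise."""
    z = W @ x + b            # shape (10,) for a 10-neuron layer
    return activation(z)     # f is applied independently to each z_i

rng = np.random.default_rng(1)
W = rng.normal(size=(10, 4))   # 10 neurons, each receiving 4 inputs
b = np.zeros(10)
x = rng.normal(size=4)

a = dense_layer(x, W, b, activation=lambda z: np.maximum(0.0, z))  # element-wise ReLU
print(a.shape)   # (10,): these activations become the inputs to the next layer
```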
The choice of activation function can significantly impact network performance. Here are some general guidelines, though experimentation is often necessary:

- For hidden layers, start with ReLU. If you observe many dead neurons, try Leaky ReLU or ELU.
- For binary classification output layers, use Sigmoid to produce a probability between 0 and 1.
- For multi-class classification output layers, a Softmax output is the standard choice.
- For regression output layers, a linear (identity) output is usually appropriate.
- Tanh can be a reasonable hidden-layer choice when zero-centered outputs matter, but it is less common than ReLU in deep feedforward networks.
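Putting a few of these guidelines together, here is an illustrative forward pass (random weights and made-up layer sizes, not a trained model) that uses ReLU in the hidden layers and Sigmoid at the output for a binary classification setup:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Two ReLU hidden layers followed by a Sigmoid output in (0, 1)."""
    h1 = relu(params["W1"] @ x + params["b1"])
    h2 = relu(params["W2"] @ h1 + params["b2"])
    return sigmoid(params["W3"] @ h2 + params["b3"])

rng = np.random.default_rng(2)
params = {
    "W1": rng.normal(size=(16, 8)), "b1": np.zeros(16),
    "W2": rng.normal(size=(8, 16)), "b2": np.zeros(8),
    "W3": rng.normal(size=(1, 8)),  "b3": np.zeros(1),
}
print(forward(rng.normal(size=8), params))   # a single probability-like value between 0 and 1
```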
In summary, activation functions are a fundamental component of neural networks. By introducing non-linearity after the linear calculations in each neuron, they equip the network with the ability to learn complex patterns and functions from data, moving far beyond the capabilities of simple linear models.