After establishing the necessity of non-linear activation functions in the previous section, let's examine one of the earliest and most historically significant activation functions: the Sigmoid function, also known as the logistic function.
The Sigmoid function is defined mathematically as:
σ(x) = 1 / (1 + e^(-x))
where x is the input to the function (typically the weighted sum of inputs plus bias for a neuron). The function takes any real-valued number and "squashes" it into a range between 0 and 1.
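To make the squashing effect concrete, here is a small plain-Python sketch (the helper name sigmoid and the sample inputs are purely illustrative) that evaluates the formula at a few points:

import math

# Evaluate sigma(x) = 1 / (1 + e^(-x)) directly at a few inputs
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"sigmoid({x}) = {sigmoid(x):.4f}")

# sigmoid(-5.0) = 0.0067
# sigmoid(-1.0) = 0.2689
# sigmoid(0.0) = 0.5000
# sigmoid(1.0) = 0.7311
# sigmoid(5.0) = 0.9933

Even a moderately large input like 5 already lands very close to 1, which previews the saturation behavior discussed below.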
The defining characteristic of the Sigmoid function is its "S"-shaped curve.
The Sigmoid function maps inputs to the range (0, 1). It exhibits a smooth transition around x=0 and saturates towards 0 for large negative inputs and towards 1 for large positive inputs.
Key properties include:
Output range: every output lies strictly between 0 and 1, which is why Sigmoid outputs are often interpreted as probabilities.
Smoothness: the function is differentiable everywhere, with derivative σ'(x) = σ(x)(1 − σ(x)).
Monotonicity: the function is strictly increasing, so larger inputs always map to larger outputs.
Saturation: for inputs far from 0 in either direction, the curve flattens out near 0 or 1.
Despite its historical role, the Sigmoid function has fallen out of favor for use in the hidden layers of deep networks due to significant drawbacks:
Vanishing Gradients: This is the most critical issue. Look at the shape of the Sigmoid function again. For large positive or large negative inputs (when the neuron "saturates"), the function becomes very flat. A flat function means the derivative (gradient) is close to zero.
The derivative of the Sigmoid function, σ'(x) = σ(x)(1 − σ(x)), is largest at x = 0 (where it equals 0.25) and quickly approaches zero as the input moves away from 0.
During backpropagation, gradients are multiplied layer by layer. If multiple layers have Sigmoid activations and their neurons operate in saturated regions, the gradients flowing backward will be repeatedly multiplied by small numbers (the derivatives, which are less than or equal to 0.25). This can cause the gradients reaching the earlier layers to become extremely small ("vanish"), making it very difficult for the weights in those layers to update effectively. The network essentially stops learning in its deeper layers.
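This effect is easy to observe with PyTorch's autograd. The sketch below (the input values are arbitrary, chosen to cover both saturated regions and the center) backpropagates through a single Sigmoid and prints the resulting gradients; they peak at 0.25 at x = 0 and are nearly zero where the function saturates.

import torch

# Inputs ranging from strongly negative (saturated) to strongly positive (saturated)
x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)

y = torch.sigmoid(x)
y.sum().backward()   # upstream gradient of 1 for every element

# x.grad holds dy/dx = sigmoid(x) * (1 - sigmoid(x))
print(x.grad)
# Roughly 4.5e-05 at x = ±10, about 0.105 at x = ±2, and 0.25 at x = 0

If several layers contribute factors like these, the product reaching the earliest layers shrinks rapidly, which is exactly the vanishing gradient problem described above.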
Outputs are Not Zero-Centered: The output of the Sigmoid function is always positive (between 0 and 1). This can be problematic. If the input to a neuron in the next layer is always positive, the gradients of the weights for that neuron during backpropagation will all have the same sign (either all positive or all negative, depending on the gradient of the loss function with respect to the neuron's output). This can lead to inefficient, zig-zagging updates during gradient descent, slowing down convergence compared to using activation functions with zero-centered outputs.
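A small sketch of this effect, using a single neuron with zero-initialized weights purely for illustration: because the Sigmoid outputs feeding the neuron are all positive, every weight gradient ends up with the same sign as the single upstream gradient.

import torch

# Sigmoid outputs are strictly positive, so they act as all-positive
# inputs to the next layer's neuron.
inputs = torch.sigmoid(torch.randn(5))

w = torch.zeros(5, requires_grad=True)
neuron_output = (w * inputs).sum()   # a single neuron, no bias for simplicity

neuron_output.backward()             # upstream gradient dL/d(output) = 1

print(inputs)   # all entries lie in (0, 1)
print(w.grad)   # equals the inputs, so every weight gradient shares the same sign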
Due to the vanishing gradient and non-zero-centered output issues, Sigmoid is generally not recommended for hidden layers in modern deep learning models. Functions like ReLU and its variants (which we will discuss next) typically lead to faster and more effective training.
However, Sigmoid still finds use in specific scenarios:
Output layers for binary classification: because the output lies in (0, 1), it can be interpreted as the probability of the positive class; a minimal sketch of this use appears after this list. In practice, PyTorch's nn.BCEWithLogitsLoss applies the Sigmoid internally for better numerical stability.
Gating mechanisms: recurrent architectures such as LSTMs and GRUs use Sigmoid to produce gate values between 0 and 1 that control how much information is kept or discarded.
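For the first scenario, here is a minimal sketch of a binary classifier whose head ends in a Sigmoid. The layer sizes and input shapes are arbitrary choices for illustration; in practice you would often keep the raw logits and pair them with nn.BCEWithLogitsLoss instead.

import torch
import torch.nn as nn

# A tiny binary classifier: features -> hidden layer -> single probability in (0, 1)
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),   # squashes the final logit into a probability
)

features = torch.randn(3, 4)      # batch of 3 samples with 4 features each
probabilities = model(features)
print(probabilities)              # shape (3, 1), each value between 0 and 1

During training, these predicted probabilities would typically be compared to 0/1 labels with binary cross-entropy.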
Applying the Sigmoid function in PyTorch is straightforward using torch.sigmoid or the nn.Sigmoid module.
import torch
import torch.nn as nn
# Sample input tensor (e.g., output of a linear layer)
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
# Apply Sigmoid using the functional API
sigmoid_output_functional = torch.sigmoid(x)
print("Output using torch.sigmoid:", sigmoid_output_functional)
# Apply Sigmoid using the module API
sigmoid_module = nn.Sigmoid()
sigmoid_output_module = sigmoid_module(x)
print("Output using nn.Sigmoid:", sigmoid_output_module)
# Verify the output range
print(f"Min output: {sigmoid_output_module.min()}, Max output: {sigmoid_output_module.max()}")
# Example output:
# Output using torch.sigmoid: tensor([0.1192, 0.2689, 0.5000, 0.7311, 0.8808])
# Output using nn.Sigmoid: tensor([0.1192, 0.2689, 0.5000, 0.7311, 0.8808])
# Min output: 0.11920292109251022, Max output: 0.8807970285415649
While Sigmoid played an important role in the history of neural networks, its limitations, particularly the vanishing gradient problem, led researchers to explore alternatives. In the next sections, we will look at other activation functions like Tanh and ReLU that address some of these issues.