After establishing the necessity of non-linear activation functions in the previous section, let's examine one of the earliest and most historically significant activation functions: the Sigmoid function, also known as the logistic function.
The Sigmoid function is defined mathematically as:
σ(x) = 1 / (1 + e^(-x))
where x is the input to the function (typically the weighted sum of inputs plus bias for a neuron). The function takes any real-valued number and "squashes" it into a range between 0 and 1.
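To make the squashing effect concrete, here is a small plain-Python sketch (the helper name sigmoid and the sample inputs are purely illustrative) that evaluates the formula at a few points:

import math

# Evaluate sigma(x) = 1 / (1 + e^(-x)) directly at a few inputs
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"sigmoid({x}) = {sigmoid(x):.4f}")

# sigmoid(-5.0) = 0.0067
# sigmoid(-1.0) = 0.2689
# sigmoid(0.0) = 0.5000
# sigmoid(1.0) = 0.7311
# sigmoid(5.0) = 0.9933

Even a moderately large input like 5 already lands very close to 1, which previews the saturation behavior discussed below.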
The defining characteristic of the Sigmoid function is its "S"-shaped curve.
The Sigmoid function maps inputs to the range (0, 1). It exhibits a smooth transition around x=0 and saturates towards 0 for large negative inputs and towards 1 for large positive inputs.
Key properties include:
Output range: every output lies strictly between 0 and 1, which is why Sigmoid outputs are often interpreted as probabilities.
Smoothness: the function is differentiable everywhere, with derivative σ'(x) = σ(x)(1 − σ(x)).
Monotonicity: the function is strictly increasing, so larger inputs always map to larger outputs.
Saturation: for inputs far from 0 in either direction, the curve flattens out near 0 or 1.
Despite its historical role, the Sigmoid function has fallen out of favor for use in the hidden layers of deep networks due to significant drawbacks:
Vanishing Gradients: This is the most critical issue. Look at the shape of the Sigmoid function again. For large positive or large negative inputs (when the neuron "saturates"), the function becomes very flat. A flat function means the derivative (gradient) is close to zero.
The derivative of the Sigmoid function, σ'(x) = σ(x)(1 − σ(x)), is largest at x = 0 (where it equals 0.25) and quickly approaches zero as the input moves away from 0.
During backpropagation, gradients are multiplied layer by layer. If multiple layers have Sigmoid activations and their neurons operate in saturated regions, the gradients flowing backward will be repeatedly multiplied by small numbers (the derivatives, which are less than or equal to 0.25). This can cause the gradients reaching the earlier layers to become extremely small ("vanish"), making it very difficult for the weights in those layers to update effectively. The network essentially stops learning in its deeper layers.
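This effect is easy to observe with PyTorch's autograd. The sketch below (the input values are arbitrary, chosen to cover both saturated regions and the center) backpropagates through a single Sigmoid and prints the resulting gradients; they peak at 0.25 at x = 0 and are nearly zero where the function saturates.

import torch

# Inputs ranging from strongly negative (saturated) to strongly positive (saturated)
x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)

y = torch.sigmoid(x)
y.sum().backward()   # upstream gradient of 1 for every element

# x.grad holds dy/dx = sigmoid(x) * (1 - sigmoid(x))
print(x.grad)
# Roughly 4.5e-05 at x = ±10, about 0.105 at x = ±2, and 0.25 at x = 0

If several layers contribute factors like these, the product reaching the earliest layers shrinks rapidly, which is exactly the vanishing gradient problem described above.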
Outputs are Not Zero-Centered: The output of the Sigmoid function is always positive (between 0 and 1). This can be problematic. If the input to a neuron in the next layer is always positive, the gradients of the weights for that neuron during backpropagation will all have the same sign (either all positive or all negative, depending on the gradient of the loss function with respect to the neuron's output). This can lead to inefficient, zig-zagging updates during gradient descent, slowing down convergence compared to using activation functions with zero-centered outputs.
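A small sketch of this effect, using a single neuron with zero-initialized weights purely for illustration: because the Sigmoid outputs feeding the neuron are all positive, every weight gradient ends up with the same sign as the single upstream gradient.

import torch

# Sigmoid outputs are strictly positive, so they act as all-positive
# inputs to the next layer's neuron.
inputs = torch.sigmoid(torch.randn(5))

w = torch.zeros(5, requires_grad=True)
neuron_output = (w * inputs).sum()   # a single neuron, no bias for simplicity

neuron_output.backward()             # upstream gradient dL/d(output) = 1

print(inputs)   # all entries lie in (0, 1)
print(w.grad)   # equals the inputs, so every weight gradient shares the same sign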
Due to the vanishing gradient and non-zero-centered output issues, Sigmoid is generally not recommended for hidden layers in modern deep learning models. Functions like ReLU and its variants (which we will discuss next) typically lead to faster and more effective training.
However, Sigmoid still finds use in specific scenarios:
Output layers for binary classification: because the output lies in (0, 1), it can be interpreted as the probability of the positive class; a minimal sketch of this use appears after this list. In practice, PyTorch's nn.BCEWithLogitsLoss applies the Sigmoid internally for better numerical stability.
Gating mechanisms: recurrent architectures such as LSTMs and GRUs use Sigmoid to produce gate values between 0 and 1 that control how much information is kept or discarded.
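For the first scenario, here is a minimal sketch of a binary classifier whose head ends in a Sigmoid. The layer sizes and input shapes are arbitrary choices for illustration; in practice you would often keep the raw logits and pair them with nn.BCEWithLogitsLoss instead.

import torch
import torch.nn as nn

# A tiny binary classifier: features -> hidden layer -> single probability in (0, 1)
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),   # squashes the final logit into a probability
)

features = torch.randn(3, 4)      # batch of 3 samples with 4 features each
probabilities = model(features)
print(probabilities)              # shape (3, 1), each value between 0 and 1

During training, these predicted probabilities would typically be compared to 0/1 labels with binary cross-entropy.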
Applying the Sigmoid function in PyTorch is straightforward using torch.sigmoid or the nn.Sigmoid module.
import torch
import torch.nn as nn
# Sample input tensor (e.g., output of a linear layer)
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
# Apply Sigmoid using the functional API
sigmoid_output_functional = torch.sigmoid(x)
print("Output using torch.sigmoid:", sigmoid_output_functional)
# Apply Sigmoid using the module API
sigmoid_module = nn.Sigmoid()
sigmoid_output_module = sigmoid_module(x)
print("Output using nn.Sigmoid:", sigmoid_output_module)
# Verify the output range
print(f"Min output: {sigmoid_output_module.min()}, Max output: {sigmoid_output_module.max()}")
# Example output:
# Output using torch.sigmoid: tensor([0.1192, 0.2689, 0.5000, 0.7311, 0.8808])
# Output using nn.Sigmoid: tensor([0.1192, 0.2689, 0.5000, 0.7311, 0.8808])
# Min output: 0.11920292109251022, Max output: 0.8807970285415649
While Sigmoid played an important role in the history of neural networks, its limitations, particularly the vanishing gradient problem, led researchers to explore alternatives. In the next sections, we will look at other activation functions like Tanh and ReLU that address some of these issues.