After exploring functions like Sigmoid and Tanh, which introduce non-linearity but come with potential drawbacks like saturation and vanishing gradients, we turn our attention to a simpler, yet highly effective alternative: the Rectified Linear Unit, commonly known as ReLU. It has become a staple in deep learning, particularly for hidden layers, due to its computational efficiency and its ability to reduce the vanishing gradient problem for positive activations.
The ReLU function is defined mathematically as:
f(x) = max(0, x)

In simple terms, if the input x is positive, the function outputs x itself. If the input is zero or negative, the function outputs zero. This creates a "rectified" behavior in which negative values are clipped to zero.
The ReLU function f(x) = max(0, x) is linear for positive inputs and zero for negative inputs.
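As a quick illustration, here is a minimal sketch of ReLU implemented directly from this definition with NumPy (the input values are arbitrary examples):

import numpy as np

def relu(x):
    # Element-wise max(0, x): positive values pass through, negatives become zero
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]

For backpropagation, the derivative of ReLU is 1 for x > 0 and 0 for x < 0; at x = 0 it is undefined, and frameworks such as PyTorch conventionally use 0 there.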
Despite its advantages, ReLU is not without issues of its own. The most significant is the "dying ReLU" problem. If a neuron's pre-activation (its weighted sum plus bias) is consistently negative during training, the neuron always outputs zero. Consequently, the gradient flowing through that neuron is also zero (since f′(x) = 0 for x < 0).
When this happens, the weights associated with that neuron will no longer be updated via gradient descent. The neuron essentially becomes inactive and stops participating in the learning process. This can happen if the learning rate is set too high or if there's a large negative bias term that pushes the neuron's weighted sum into the negative range. Once a ReLU unit "dies," it's unlikely to recover.
Consider this simple PyTorch example demonstrating how to use ReLU:
import torch
import torch.nn as nn
# Sample input tensor
input_tensor = torch.randn(1, 5) # Batch size 1, 5 features
print(f"Input: {input_tensor}")
# Apply ReLU using torch.nn.ReLU
relu_activation = nn.ReLU()
output_tensor = relu_activation(input_tensor)
print(f"Output after ReLU: {output_tensor}")
# Apply ReLU using torch.relu functional form
output_functional = torch.relu(input_tensor)
print(f"Output using functional form: {output_functional}")
# Demonstrate the gradient (requires_grad=True)
input_tensor.requires_grad_(True)
output_relu = torch.relu(input_tensor)
# Assume some upstream gradient for demonstration
output_relu.backward(torch.ones_like(output_relu))
print(f"Gradient of input: {input_tensor.grad}")
# Notice gradients are 1 where input was > 0, and 0 where input was <= 0
The output demonstrates the zeroing-out effect for negative inputs and shows that the gradient is 1 where the input was positive and 0 where it was zero or negative.
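To make the dying-ReLU problem concrete, here is a short sketch (a hypothetical setup, not part of the example above) that forces a single linear neuron's bias to a large negative value so its pre-activation is negative for every input. The ReLU output and the resulting gradients are all zero, so gradient descent would never update this neuron:

import torch
import torch.nn as nn

torch.manual_seed(0)
dead_neuron = nn.Linear(5, 1)  # one neuron with 5 inputs
with torch.no_grad():
    dead_neuron.bias.fill_(-100.0)  # push the pre-activation far below zero

x = torch.randn(8, 5)                # a batch of 8 random samples
pre_activation = dead_neuron(x)      # all values are strongly negative
output = torch.relu(pre_activation)  # all zeros

loss = output.sum()
loss.backward()
print(f"Output: {output.flatten()}")
print(f"Weight gradients: {dead_neuron.weight.grad}")  # all zeros: no learning signal
print(f"Bias gradient: {dead_neuron.bias.grad}")       # zero as well

Because every gradient reaching the neuron's parameters is zero, an optimizer would leave them unchanged on every step, which is exactly what it means for the unit to have "died".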
ReLU's simplicity, speed, and ability to mitigate vanishing gradients have made it a default choice for hidden layers in many deep learning models. However, the potential for dying neurons means careful initialization and learning rate selection are important. In the next section, we'll look at variations of ReLU designed specifically to address this "dying" problem.