Activation functions introduce non-linearity into neural networks, allowing them to model complex patterns. In Recurrent Neural Networks, this role is particularly significant because the activation function is applied repeatedly at each time step as the network processes a sequence. This repeated application directly influences the dynamics of the hidden state and, crucially, the flow of gradients during Backpropagation Through Time (BPTT). The choice of activation function can either alleviate or exacerbate the vanishing and exploding gradient problems discussed earlier in this chapter.
Recall the basic RNN state update equation:
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
Here, $h_t$ is the hidden state at time step $t$, $x_t$ is the input, $W_{hh}$ and $W_{xh}$ are weight matrices, $b_h$ is the bias, and $f$ is the activation function. During BPTT, the gradient of the loss with respect to an earlier state $h_k$ (where $k < t$) is obtained by chaining the gradients through each intervening step, which means repeatedly multiplying by the derivative of the activation function, $f'$, evaluated at the pre-activation values at each step.
If $f'$ is consistently small (less than 1), the gradients can shrink exponentially as they propagate backward, leading to vanishing gradients. Conversely, if $f'$ can be large, combined with large weight values, the gradients might grow exponentially, leading to exploding gradients.
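To make this explicit, the Jacobian of $h_t$ with respect to an earlier state $h_k$ factors into one term per intervening step. Writing $z_i = W_{hh} h_{i-1} + W_{xh} x_i + b_h$ for the pre-activation at step $i$:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \operatorname{diag}\!\left(f'(z_i)\right) W_{hh}$$
Each factor contributes the activation derivative $f'(z_i)$ alongside $W_{hh}$, so the size of $f'$ and the scale of the recurrent weights jointly determine whether this product shrinks or grows with the distance $t - k$.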
Let's examine the properties of common activation functions in the context of RNNs.
Hyperbolic Tangent (tanh)
The tanh function is frequently used as the primary activation for the hidden state in simple RNNs and is also prevalent within LSTM and GRU cells.
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
Its output range is $(-1, 1)$, which is zero-centered. This property can be beneficial for optimization compared to functions that are not zero-centered. The derivative of tanh is:
$$\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$$
The derivative's maximum value is 1 (at $z = 0$), and it approaches 0 as $|z|$ increases (saturation). While the gradients are larger around zero compared to the sigmoid function, the saturation still makes tanh-based RNNs susceptible to vanishing gradients, especially over many time steps. However, its bounded output helps prevent the hidden state values themselves from becoming excessively large.
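As a quick numerical illustration (a minimal sketch, not tied to any particular framework, with pre-activation values made up purely for demonstration), the product of tanh derivatives accumulated over many backward steps becomes vanishingly small:

```python
import numpy as np

def tanh_derivative(z):
    """Derivative of tanh: 1 - tanh(z)^2, which peaks at 1 when z = 0."""
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
T = 50                                       # number of backward steps
z = rng.normal(loc=0.0, scale=2.0, size=T)   # hypothetical pre-activations

# BPTT multiplies one derivative factor per step (the W_hh factor is ignored here).
grad_scale = 1.0
for z_i in z:
    grad_scale *= tanh_derivative(z_i)

print(f"Product of tanh'(z) over {T} steps: {grad_scale:.3e}")
# Every individual factor is at most 1, yet the product is tiny, which is
# exactly how saturating activations drive gradients toward zero under BPTT.
```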
Sigmoid (Logistic)
The sigmoid function squashes its input into the range $(0, 1)$.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Its derivative is:
$$\frac{d}{dz}\sigma(z) = \sigma(z)\,(1 - \sigma(z))$$
The maximum value of the sigmoid derivative is 0.25 (at $z = 0$). Because this maximum value is significantly less than 1, repeated multiplication during BPTT causes gradients to vanish very quickly. Consequently, sigmoid is rarely used as the main activation function for the hidden state update in modern RNNs. Its primary role now lies in gating mechanisms within LSTMs and GRUs, where its $(0, 1)$ output range is ideal for controlling information flow (e.g., deciding how much information to forget or pass through).
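The 0.25 ceiling is easy to check directly, and the short sketch below (illustrative only) shows how quickly even this best-case factor decays when compounded over several backward steps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))   # 0.25, the largest value the derivative can take

# Best-case gradient factor after T backward steps (every step exactly at the peak).
for T in (10, 20, 50):
    print(T, 0.25 ** T)
```

Even in this unrealistically favorable case, the factor drops below $10^{-6}$ after only ten steps.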
Rectified Linear Unit (ReLU)
ReLU is a popular choice in feedforward networks and Convolutional Neural Networks (CNNs).
$$\text{ReLU}(z) = \max(0, z)$$
Its derivative is simple: 1 for $z > 0$ and 0 for $z \le 0$.
Because its derivative does not saturate for positive inputs, ReLU can help with vanishing gradients, but its unbounded output means hidden state values (and gradients) can grow without limit over many time steps. As a result, it is less common than tanh as the main hidden state activation in simple RNNs; ReLU and its variants (like Leaky ReLU or Parametric ReLU) are sometimes used, particularly within the update mechanisms of LSTMs/GRUs or in conjunction with careful initialization and regularization.
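To see why the unbounded output matters, the sketch below (purely illustrative, with a deliberately large recurrent weight gain) runs the same state update $h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$ forward with tanh and with ReLU under identical random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, seq_len = 64, 100
# Deliberately large gain on the recurrent weights to make the contrast visible.
W_hh = rng.normal(scale=2.0 / np.sqrt(hidden_size), size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=1.0 / np.sqrt(hidden_size), size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
x = rng.normal(size=(seq_len, hidden_size))   # a made-up input sequence

def final_state_norm(activation):
    """Roll h_t = f(W_hh h_{t-1} + W_xh x_t + b_h) over the sequence, return |h_T|."""
    h = np.zeros(hidden_size)
    for t in range(seq_len):
        h = activation(W_hh @ h + W_xh @ x[t] + b_h)
    return np.linalg.norm(h)

print("final state norm with tanh:", final_state_norm(np.tanh))
print("final state norm with ReLU:", final_state_norm(lambda z: np.maximum(0.0, z)))
# tanh keeps every component in (-1, 1), so the norm stays small; with this gain
# the ReLU state typically grows by orders of magnitude over the sequence.
```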
The plot below shows the shape and derivative of the tanh, sigmoid, and ReLU functions. Pay attention to the output range and where the derivatives become small or zero.
Comparison of tanh, sigmoid, and ReLU activation functions and their first derivatives. Note how tanh and sigmoid saturate (derivatives approach zero), while ReLU's derivative is constant for positive inputs.
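If you want to recreate a comparison like this yourself, the following Matplotlib sketch (not the original figure) plots each activation next to its analytic first derivative:

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5.0, 5.0, 500)
sig = 1.0 / (1.0 + np.exp(-z))

# Each entry maps a name to (function values, first derivative values).
curves = {
    "tanh": (np.tanh(z), 1.0 - np.tanh(z) ** 2),
    "sigmoid": (sig, sig * (1.0 - sig)),
    "ReLU": (np.maximum(0.0, z), (z > 0).astype(float)),
}

fig, (ax_fn, ax_grad) = plt.subplots(1, 2, figsize=(10, 4))
for name, (value, derivative) in curves.items():
    ax_fn.plot(z, value, label=name)
    ax_grad.plot(z, derivative, label=name)

ax_fn.set_title("Activation functions")
ax_grad.set_title("First derivatives")
for ax in (ax_fn, ax_grad):
    ax.set_xlabel("z")
    ax.legend()

plt.tight_layout()
plt.show()
```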
The choice of activation function is interconnected with other aspects of model design and training, including weight initialization (discussed previously) and the specific architecture used. The inherent difficulties in managing gradient flow with these standard activations in simple RNNs are a strong motivation for the development of more sophisticated recurrent architectures like LSTMs and GRUs, which incorporate explicit mechanisms (gates, often using sigmoid and tanh) to better control the information and gradient flow over long sequences. We will explore these advanced architectures in the upcoming chapters.