Having explored simple Recurrent Neural Networks (RNNs) and the vanishing gradient problem they often encounter, we saw how Long Short-Term Memory (LSTM) networks introduce gating mechanisms to manage information flow over long sequences. LSTMs, however, are quite complex, featuring three distinct gates (input, forget, output) and a separate cell state. In 2014, Cho and colleagues introduced the Gated Recurrent Unit (GRU), a simpler alternative that often achieves comparable performance to LSTMs while being computationally more efficient.
GRUs, like LSTMs, use gating mechanisms to control the flow of information, but they achieve this with a more streamlined architecture. Instead of three gates and a separate cell state, a GRU cell employs just two gates: the Update Gate and the Reset Gate. It also merges the cell state and hidden state into a single hidden state vector, $h_t$.

Let's look at how these components work together at a given time step $t$, taking the current input $x_t$ and the previous hidden state $h_{t-1}$ as inputs.

The reset gate ($r_t$) determines how to combine the new input $x_t$ with the previous hidden state $h_{t-1}$. Specifically, it controls how much of the previous hidden state should be "forgotten" or ignored when calculating a candidate for the next hidden state. If the reset gate's value is close to 0 for certain dimensions of $h_{t-1}$, it effectively makes the unit act as if reading the input for the first time, disregarding irrelevant past information for the current context.
The reset gate's activation is calculated using a sigmoid function ($\sigma$), which outputs values between 0 and 1:

$$r_t = \sigma(W_r[h_{t-1}, x_t] + b_r)$$

Here, $[h_{t-1}, x_t]$ represents the concatenation of the previous hidden state and the current input vector, $W_r$ is the weight matrix, and $b_r$ is the bias term for the reset gate.
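To make the shapes concrete, here is a minimal NumPy sketch of the reset gate computation. The dimensions and the random weights are arbitrary placeholders, not values from any trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy sizes: 4-dimensional inputs, 3-dimensional hidden state.
input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)

x_t = rng.standard_normal(input_dim)        # current input x_t
h_prev = rng.standard_normal(hidden_dim)    # previous hidden state h_{t-1}
W_r = 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
b_r = np.zeros(hidden_dim)

concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
r_t = sigmoid(W_r @ concat + b_r)           # each entry lies in (0, 1)
print(r_t)
```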
The update gate ($z_t$) performs a role similar to the combination of the forget and input gates in an LSTM. It decides how much information from the previous hidden state $h_{t-1}$ should be carried over to the current hidden state $h_t$. Simultaneously, it also controls how much of the newly computed candidate hidden state ($\tilde{h}_t$, explained next) should be added.
If $z_t$ is close to 0, the previous hidden state $h_{t-1}$ is largely copied to the new hidden state $h_t$. If $z_t$ is close to 1, the new candidate state $\tilde{h}_t$ predominantly forms the new hidden state $h_t$. This mechanism allows the GRU to retain information from distant past time steps.
The update gate's activation is also calculated using a sigmoid function:

$$z_t = \sigma(W_z[h_{t-1}, x_t] + b_z)$$

where $W_z$ and $b_z$ are the weight matrix and bias term for the update gate.
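The update gate has exactly the same functional form, only with its own parameters. Continuing the toy NumPy sketch (same hypothetical sizes as before):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
x_t = rng.standard_normal(input_dim)        # current input x_t
h_prev = rng.standard_normal(hidden_dim)    # previous hidden state h_{t-1}

# Separate weights and bias for the update gate, same shapes as the reset gate's.
W_z = 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
b_z = np.zeros(hidden_dim)

z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]) + b_z)
print(z_t)   # one value in (0, 1) per hidden dimension
```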
The GRU computes a candidate hidden state $\tilde{h}_t$ that represents the proposed update for the current time step. This calculation is influenced by the reset gate $r_t$, which determines which parts of the previous hidden state $h_{t-1}$ are used. The candidate state is typically computed using a hyperbolic tangent ($\tanh$) activation function, which outputs values between -1 and 1:

$$\tilde{h}_t = \tanh(W_h[r_t \odot h_{t-1}, x_t] + b_h)$$

Notice the element-wise multiplication ($\odot$) between the reset gate $r_t$ and the previous hidden state $h_{t-1}$. This is where the reset gate selectively filters the information from $h_{t-1}$ before it is combined with the current input $x_t$. $W_h$ and $b_h$ are the corresponding weight matrix and bias.
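The effect of the reset gate on the candidate computation is easiest to see with a hand-picked $r_t$. In this small NumPy sketch (toy sizes, arbitrary weights), the reset gate zeroes out the second hidden dimension before the candidate state is formed:

```python
import numpy as np

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
x_t = rng.standard_normal(input_dim)        # current input x_t
h_prev = rng.standard_normal(hidden_dim)    # previous hidden state h_{t-1}

# Hand-picked reset gate: keep dimensions 1 and 3, ignore dimension 2.
r_t = np.array([1.0, 0.0, 1.0])

W_h = 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
b_h = np.zeros(hidden_dim)

gated_h = r_t * h_prev                                         # r_t ⊙ h_{t-1}, element-wise
h_tilde = np.tanh(W_h @ np.concatenate([gated_h, x_t]) + b_h)  # candidate state
print(h_tilde)   # values in (-1, 1)
```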
Finally, the actual hidden state $h_t$ for the current time step is computed by linearly interpolating between the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. The update gate $z_t$ controls this interpolation:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

This equation shows how the update gate acts as an interpolation weight: when $z_t$ is close to 0, the previous hidden state $h_{t-1}$ passes through largely unchanged, and when $z_t$ is close to 1, the candidate $\tilde{h}_t$ dominates. This allows the network to maintain long-term information by keeping $z_t$ close to 0 for multiple time steps, or to update quickly based on new inputs by pushing $z_t$ toward 1.
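Putting the four equations together, a single GRU time step can be sketched in a few lines of NumPy. This is a didactic reference implementation with made-up dimensions and randomly initialized, untrained weights, not a substitute for a framework layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step, following the equations above."""
    W_r, b_r, W_z, b_z, W_h, b_h = params
    concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)                # reset gate
    z_t = sigmoid(W_z @ concat + b_z)                # update gate
    cand_in = np.concatenate([r_t * h_prev, x_t])    # [r_t ⊙ h_{t-1}, x_t]
    h_tilde = np.tanh(W_h @ cand_in + b_h)           # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde      # interpolation -> h_t

# Toy sizes and random (untrained) parameters.
input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W = lambda: 0.1 * rng.standard_normal((hidden_dim, hidden_dim + input_dim))
b = lambda: np.zeros(hidden_dim)
params = (W(), b(), W(), b(), W(), b())

h = np.zeros(hidden_dim)                        # initial hidden state h_0
for x in rng.standard_normal((5, input_dim)):   # a 5-step toy sequence
    h = gru_step(x, h, params)
print(h)   # final hidden state after the sequence
```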
The following diagram illustrates the structure of a GRU cell:
Internal structure of a Gated Recurrent Unit (GRU) cell, showing the flow of information through the reset gate ($r_t$) and update gate ($z_t$) to compute the next hidden state ($h_t$).
While both GRUs and LSTMs effectively address the vanishing gradient problem and capture long-range dependencies, they differ in their internal structure: a GRU uses two gates (reset and update) instead of the LSTM's three, it maintains no separate cell state because cell state and hidden state are merged into a single vector, and it consequently has fewer parameters per unit, which generally makes it somewhat faster to train and lighter in memory.
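One practical consequence of the slimmer design is the parameter count: with the same input and hidden sizes, a GRU layer stores three blocks of gate weights where an LSTM stores four, so it ends up with roughly 25% fewer parameters. A quick check, assuming PyTorch's built-in layers and arbitrarily chosen sizes:

```python
import torch.nn as nn

input_size, hidden_size = 128, 256   # arbitrary example sizes

lstm = nn.LSTM(input_size, hidden_size)
gru = nn.GRU(input_size, hidden_size)

count = lambda module: sum(p.numel() for p in module.parameters())
print("LSTM parameters:", count(lstm))  # 4 gate weight blocks
print("GRU parameters: ", count(gru))   # 3 gate weight blocks -> about 25% fewer
```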
Similar to LSTMs, GRUs are readily available as layers in standard deep learning frameworks like TensorFlow (via Keras) and PyTorch. You can easily swap an LSTM layer for a GRU layer in your model definitions. Concepts like stacking multiple GRU layers (to create deeper networks) or using Bidirectional GRUs (to process sequences in both forward and backward directions) apply just as they do for LSTMs.
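As a brief illustration of how interchangeable the two layer types are, the following PyTorch sketch (sizes chosen arbitrarily) builds a stacked, bidirectional encoder with each layer type. The only differences are the class name and the fact that the LSTM also returns a cell state:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 8, 20, 64, 128
x = torch.randn(batch, seq_len, input_size)   # a dummy batch of sequences

# Two stacked bidirectional LSTM layers ...
lstm = nn.LSTM(input_size, hidden_size, num_layers=2,
               bidirectional=True, batch_first=True)
out_lstm, (h_n, c_n) = lstm(x)                # LSTM returns hidden AND cell state

# ... and the GRU drop-in: same constructor arguments, no separate cell state.
gru = nn.GRU(input_size, hidden_size, num_layers=2,
             bidirectional=True, batch_first=True)
out_gru, h_n = gru(x)

print(out_lstm.shape)   # (8, 20, 256): 2 directions * hidden_size
print(out_gru.shape)    # same shape, so downstream layers are unaffected
```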
In summary, GRUs present a compelling alternative to LSTMs for sequence modeling. Their simpler architecture often translates to computational efficiency without a significant sacrifice in performance for many NLP tasks. By employing reset and update gates, they effectively manage the flow of information through time, enabling them to learn dependencies across sequences, much like their LSTM counterparts. When building sequence models, considering both LSTMs and GRUs is a standard part of the model selection process.