Building upon the foundation laid by LSTMs to mitigate the gradient problems inherent in simple RNNs, Gated Recurrent Units (GRUs), introduced by Cho et al. in 2014, offer a variation on the theme of gated architectures. GRUs aim for a similar outcome, controlling information flow through time, but achieve it with a slightly streamlined structure compared to LSTMs. This simplification often leads to fewer parameters and potentially faster computation, while frequently delivering performance comparable to LSTMs.
A GRU cell operates without a separate cell state, modifying its hidden state $h_t$ directly using two primary gating mechanisms: the reset gate and the update gate. Let's examine their roles and computations.
The reset gate determines how much of the previous hidden state, $h_{t-1}$, should be effectively "forgotten" or ignored when proposing a new candidate hidden state. If the reset gate outputs values close to 0, it allows the cell to drop information from the past that is deemed irrelevant for the current computation. Conversely, values close to 1 retain most of the previous state's information.
The computation involves the current input $x_t$ and the previous hidden state $h_{t-1}$. A sigmoid function $\sigma$ squashes the output to the range $[0, 1]$:
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

Here, $W_r$, $U_r$, and $b_r$ are learned weight matrices and a bias vector specific to the reset gate.
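To make the shapes concrete, here is a minimal NumPy sketch of the reset gate computation. The sizes, random initialization, and variable names are illustrative assumptions, not part of the GRU definition or of any particular library's API.

```python
import numpy as np

def sigmoid(x):
    # Squashes each element into (0, 1), matching the gate activation above
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes; any positive values work
input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)

W_r = rng.standard_normal((hidden_size, input_size)) * 0.1   # input-to-gate weights
U_r = rng.standard_normal((hidden_size, hidden_size)) * 0.1  # recurrent weights
b_r = np.zeros(hidden_size)                                  # bias

x_t = rng.standard_normal(input_size)      # current input x_t
h_prev = rng.standard_normal(hidden_size)  # previous hidden state h_{t-1}

# r_t = sigma(W_r x_t + U_r h_{t-1} + b_r)
r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)
print(r_t.shape, r_t.min() > 0, r_t.max() < 1)  # (16,) True True
```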
The update gate plays a role analogous to a combination of the forget and input gates in an LSTM. It decides how much information from the previous hidden state $h_{t-1}$ should be carried forward to the new hidden state $h_t$. Simultaneously, it controls how much of the newly computed candidate hidden state, $\tilde{h}_t$, should be incorporated.
Its computation is similar in structure to the reset gate:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$

where $W_z$, $U_z$, and $b_z$ are the learned parameters for the update gate.
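The update gate follows exactly the same pattern. Continuing the sketch above (and reusing its `sigmoid`, `rng`, `x_t`, and `h_prev`), it only needs its own set of parameters:

```python
# Update gate parameters, same shapes as the reset gate's
W_z = rng.standard_normal((hidden_size, input_size)) * 0.1
U_z = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_z = np.zeros(hidden_size)

# z_t = sigma(W_z x_t + U_z h_{t-1} + b_z)
z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)
```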
The candidate hidden state represents a proposal for the new hidden state at time $t$. Its computation is influenced by the reset gate $r_t$. Specifically, the contribution of the previous hidden state $h_{t-1}$ is modulated (element-wise multiplied, denoted by $\odot$) by the reset gate's output before being combined with the processed input $x_t$. A hyperbolic tangent function ($\tanh$) is typically used as the activation:
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$

$W_h$, $U_h$, and $b_h$ are the learned parameters for calculating the candidate state. The element-wise multiplication $r_t \odot h_{t-1}$ is the mechanism allowing the GRU to selectively discard parts of the previous state based on $r_t$.
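Continuing the same sketch, the candidate state applies the reset gate element-wise to the previous hidden state before the recurrent transform; `r_t`, `x_t`, `h_prev`, `rng`, and the size variables carry over from the snippets above.

```python
# Candidate-state parameters
W_h = rng.standard_normal((hidden_size, input_size)) * 0.1
U_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

# h_tilde_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
# The element-wise product r_t * h_prev suppresses the parts of the
# previous state that the reset gate has decided to ignore.
h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
```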
The final hidden state $h_t$ for the current time step is computed by linearly interpolating between the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. The update gate $z_t$ determines the balance of this interpolation.
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

When $z_t$ is close to 1, the candidate state $\tilde{h}_t$ contributes more, effectively updating the hidden state with new information. When $z_t$ is close to 0, the previous state $h_{t-1}$ is largely preserved, allowing information to carry across many time steps.
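Putting the four equations together, one GRU time step can be written as a single function. The sketch below is a self-contained reference implementation of the math above (plain NumPy, no batching, illustrative random initialization), not a substitute for an optimized library cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(input_size, hidden_size, rng):
    """Small random weights for the r, z, and h equations (illustrative init)."""
    def group():
        return (rng.standard_normal((hidden_size, input_size)) * 0.1,
                rng.standard_normal((hidden_size, hidden_size)) * 0.1,
                np.zeros(hidden_size))
    p = {}
    for name in ("r", "z", "h"):
        p[f"W_{name}"], p[f"U_{name}"], p[f"b_{name}"] = group()
    return p

def gru_step(x_t, h_prev, p):
    """One GRU time step, following the four equations above."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])             # reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])             # update gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_cand                               # interpolation

# Run the cell over a toy sequence. Note that each step consumes the previous
# hidden state, so the loop over time cannot be parallelized.
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 8, 16, 5
params = init_gru_params(input_size, hidden_size, rng)

h = np.zeros(hidden_size)
for x_t in rng.standard_normal((seq_len, input_size)):
    h = gru_step(x_t, h, params)
print(h.shape)  # (16,)
```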
Data flow within a Gated Recurrent Unit (GRU) cell at time step $t$. $x_t$ is the input and $h_{t-1}$ is the previous hidden state. The reset ($r_t$) and update ($z_t$) gates control the computation of the candidate state ($\tilde{h}_t$) and the final hidden state ($h_t$). Dotted lines indicate usage of a value in a computation.
The GRU architecture can be seen as a simplification of the LSTM: it maintains no separate cell state (the hidden state $h_t$ carries all information across time steps), it uses two gates instead of the LSTM's three (with the update gate covering the combined roles of the forget and input gates), and it therefore has fewer parameters for the same hidden size.
Empirically, neither architecture consistently outperforms the other across all tasks. The choice between LSTM and GRU often comes down to experimental results on the specific problem, though GRUs might be favored when computational resources or parameter efficiency are primary concerns.
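To illustrate the parameter-efficiency point, the short sketch below counts the parameters of single-layer GRU and LSTM modules in PyTorch for the same (arbitrarily chosen) input and hidden sizes. The GRU stacks three weight groups per layer (reset, update, candidate) against the LSTM's four, so its parameter count comes out roughly 25% smaller.

```python
import torch.nn as nn

input_size, hidden_size = 128, 256  # arbitrary illustrative sizes

gru = nn.GRU(input_size, hidden_size, num_layers=1)
lstm = nn.LSTM(input_size, hidden_size, num_layers=1)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print("GRU parameters: ", count_params(gru))   # 3 weight/bias groups per layer
print("LSTM parameters:", count_params(lstm))  # 4 weight/bias groups per layer
```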
Despite their sophisticated gating, GRUs still retain the fundamental characteristic of recurrent models: sequential computation. Information must propagate step-by-step through the sequence length. This inherent sequentiality limits parallelization during training, making them significantly slower to train on very long sequences compared to architectures like the Transformer. Furthermore, while much better than simple RNNs at capturing longer-range dependencies, the reliance on summarizing the past into a fixed-size hidden state can still become a bottleneck for extremely long sequences where intricate dependencies span vast distances. These remaining challenges paved the way for attention-based mechanisms, which we will explore next.