Gated Recurrent Units (GRUs), introduced by Cho et al. in 2014, are another member of the family of gated recurrent architectures. Like Long Short-Term Memory (LSTM) networks, they are designed to mitigate the gradient problems inherent in simple Recurrent Neural Networks (RNNs). GRUs pursue the same goal of controlling information flow through time, but with a more streamlined structure than LSTMs. This simplification usually means fewer parameters and potentially faster computation, while frequently delivering performance comparable to LSTMs.
A GRU cell operates without a separate cell state, modifying its hidden state directly using two primary gating mechanisms: the reset gate and the update gate. Let's examine their roles and computations.
The reset gate $r_t$ determines how much of the previous hidden state, $h_{t-1}$, should be effectively "forgotten" or ignored when proposing a new candidate hidden state. If the reset gate outputs values close to 0, it allows the cell to drop information from the past that is deemed irrelevant for the current computation. Conversely, values close to 1 retain most of the previous state's information.
The computation involves the current input $x_t$ and the previous hidden state $h_{t-1}$. A sigmoid function $\sigma$ squashes the output to the range $[0, 1]$:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

Here, $W_r$, $U_r$, and $b_r$ are learned weight matrices and a bias vector specific to the reset gate.
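To make the computation concrete, here is a minimal NumPy sketch of the reset gate. The variable names and dimensions (`W_r`, `U_r`, `b_r`, `input_size`, `hidden_size`) are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_size, hidden_size = 8, 16
rng = np.random.default_rng(0)

# Learned parameters of the reset gate (random stand-ins here).
W_r = rng.normal(scale=0.1, size=(hidden_size, input_size))
U_r = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_r = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)   # current input x_t
h_prev = np.zeros(hidden_size)      # previous hidden state h_{t-1}

# r_t = sigmoid(W_r x_t + U_r h_{t-1} + b_r); every entry lies in (0, 1)
r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)
print(r_t.min(), r_t.max())
```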
The update gate $z_t$ plays a role analogous to a combination of the forget and input gates in an LSTM. It decides how much information from the previous hidden state $h_{t-1}$ should be carried forward to the new hidden state $h_t$. Simultaneously, it controls how much of the newly computed candidate hidden state, $\tilde{h}_t$, should be incorporated.
Its computation is similar in structure to the reset gate:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$

Where $W_z$, $U_z$, and $b_z$ are the learned parameters for the update gate.
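A sketch of the update gate follows the same pattern, just with its own parameter set (the names and shapes below are again illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

input_size, hidden_size = 8, 16
rng = np.random.default_rng(1)

# The update gate has the same shapes as the reset gate, but separate weights.
W_z = rng.normal(scale=0.1, size=(hidden_size, input_size))
U_z = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_z = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)
h_prev = np.zeros(hidden_size)

# z_t = sigmoid(W_z x_t + U_z h_{t-1} + b_z); later used as a mixing coefficient
z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)
```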
The candidate hidden state $\tilde{h}_t$ represents a proposal for the new hidden state at time $t$. Its computation is influenced by the reset gate $r_t$. Specifically, the contribution of the previous hidden state $h_{t-1}$ is modulated (element-wise multiplied, denoted by $\odot$) by the reset gate's output before being combined with the processed input $x_t$. A hyperbolic tangent function ($\tanh$) is typically used as the activation:

$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$

$W_h$, $U_h$, and $b_h$ are the learned parameters for calculating the candidate state. The element-wise multiplication $r_t \odot h_{t-1}$ is the mechanism allowing the GRU to selectively discard parts of the previous state based on $r_t$.
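The sketch below computes the candidate state; the reset gate values are random stand-ins here purely to show the element-wise modulation (names and shapes are assumptions, as before).

```python
import numpy as np

input_size, hidden_size = 8, 16
rng = np.random.default_rng(2)

# Learned parameters for the candidate hidden state (random stand-ins).
W_h = rng.normal(scale=0.1, size=(hidden_size, input_size))
U_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)
r_t = rng.uniform(size=hidden_size)   # pretend reset gate output, values in (0, 1)

# r_t scales h_{t-1} element-wise before the recurrent transform is applied:
# h_tilde = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)
h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
```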
The final hidden state $h_t$ for the current time step is computed by linearly interpolating between the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. The update gate $z_t$ determines the balance of this interpolation:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

When $z_t$ is close to 1, the candidate state contributes more, effectively updating the hidden state with new information. When $z_t$ is close to 0, the previous state is largely preserved, allowing information to carry over long distances.
Data flow within a Gated Recurrent Unit (GRU) cell at time step $t$. $x_t$ is the input, $h_{t-1}$ is the previous hidden state. The reset ($r_t$) and update ($z_t$) gates control the computation of the candidate state ($\tilde{h}_t$) and the final hidden state ($h_t$). Dotted lines indicate usage of a value in a computation.
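Putting the pieces together, the sketch below implements one complete GRU forward step under the equations above. The function and parameter names (`gru_cell`, the tuple layout of `params`) are illustrative choices, not a specific library API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step: returns h_t given the input x_t and previous state h_{t-1}."""
    W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h = params
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                 # interpolation

input_size, hidden_size = 8, 16
rng = np.random.default_rng(3)

def mat(rows, cols):
    return rng.normal(scale=0.1, size=(rows, cols))

params = (
    mat(hidden_size, input_size), mat(hidden_size, hidden_size), np.zeros(hidden_size),  # reset
    mat(hidden_size, input_size), mat(hidden_size, hidden_size), np.zeros(hidden_size),  # update
    mat(hidden_size, input_size), mat(hidden_size, hidden_size), np.zeros(hidden_size),  # candidate
)

# Run the cell over a short sequence, carrying the hidden state forward step by step.
h_t = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h_t = gru_cell(x_t, h_prev=h_t, params=params)
print(h_t.shape)  # (16,)
```

The explicit loop also makes the sequential dependence visible: each step needs the previous hidden state before it can run, a point we return to below.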
The GRU architecture can be seen as a simplification of the LSTM:

- It maintains no separate cell state; the hidden state $h_t$ carries all information between time steps.
- A single update gate $z_t$ takes on the combined role of the LSTM's forget and input gates.
- There is no output gate; the full hidden state is exposed at each step.
- With two gates instead of three and no cell state, a GRU layer has fewer parameters than an LSTM layer of the same hidden size.
Empirically, neither architecture consistently outperforms the other across all tasks. The choice between LSTM and GRU often comes down to experimental results on the specific problem, though GRUs might be favored when computational resources or parameter efficiency are primary concerns.
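To put the parameter-efficiency point in rough numbers, the snippet below counts parameters for one GRU layer (three gated transformations) versus one LSTM layer (four), assuming each transformation has an input-to-hidden matrix, a hidden-to-hidden matrix, and a bias vector; exact counts vary slightly between library implementations.

```python
def gated_layer_params(input_size, hidden_size, num_transforms):
    # Each gated transformation: W (hidden x input), U (hidden x hidden), bias.
    per_transform = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_transforms * per_transform

input_size, hidden_size = 256, 512

gru = gated_layer_params(input_size, hidden_size, 3)   # reset, update, candidate
lstm = gated_layer_params(input_size, hidden_size, 4)  # forget, input, output, cell

print(gru, lstm, round(gru / lstm, 2))  # the GRU layer has ~75% of the LSTM's parameters
```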
Despite their sophisticated gating, GRUs still retain the fundamental characteristic of recurrent models: sequential computation. Information must propagate step-by-step through the sequence length. This inherent sequentiality limits parallelization during training, making them significantly slower to train on very long sequences compared to architectures like the Transformer. Furthermore, while much better than simple RNNs at capturing longer-range dependencies, the reliance on summarizing the past into a fixed-size hidden state can still become a bottleneck for extremely long sequences where intricate dependencies span extensive distances. These remaining challenges paved the way for attention-based mechanisms, which we will explore next.