Gated Recurrent Units (GRUs) are a type of recurrent neural network architecture designed to effectively capture dependencies in sequential data. They achieve this by employing gating mechanisms that regulate information flow. A GRU cell features two primary gates: the reset gate and the update gate. These gates control how new input data and previous hidden state information are used to update the current hidden state. The GRU simplifies its internal structure by combining the roles of a cell state and hidden state into a single hidden state vector, $h_t$. This design often results in a simpler model than architectures like LSTMs, which typically use three gates and maintain separate cell and hidden states.
Let's examine the components and the flow of information within a single GRU cell at time step $t$. The cell receives the current input $x_t$ and the hidden state from the previous time step, $h_{t-1}$. It then computes the new hidden state $h_t$, which also serves as the output for this time step.
The components are the reset gate $r_t$, the update gate $z_t$, the candidate hidden state $\tilde{h}_t$, and the final hidden state $h_t$.
Here is a view of the GRU cell's architecture:
A simplified view of the GRU cell structure. Inputs $x_t$ and $h_{t-1}$ feed into the reset ($r_t$) and update ($z_t$) gates. The reset gate modulates the influence of $h_{t-1}$ on the candidate state $\tilde{h}_t$. The update gate controls the mix between $h_{t-1}$ and $\tilde{h}_t$ to produce the final output $h_t$.
Now, let's look at the calculations performed within the cell.
The reset gate decides how much information from the previous hidden state $h_{t-1}$ should be disregarded when computing the candidate hidden state $\tilde{h}_t$. It takes the current input $x_t$ and the previous hidden state $h_{t-1}$ as inputs.
The calculation is:
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

Here, $W_r$ and $U_r$ are weight matrices, $b_r$ is a bias vector, and $\sigma$ is the sigmoid activation function. The sigmoid function outputs values between 0 and 1. A value close to 0 means "reset" or ignore the corresponding element of the previous state, while a value close to 1 means "keep" it when calculating the candidate state.
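As a concrete illustration, here is a minimal NumPy sketch of the reset gate computation under the equation above. The toy dimensions, random parameters, and helper names (`sigmoid`, `reset_gate`) are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def reset_gate(x_t, h_prev, W_r, U_r, b_r):
    # r_t = sigmoid(W_r x_t + U_r h_{t-1} + b_r)
    return sigmoid(W_r @ x_t + U_r @ h_prev + b_r)

# Toy dimensions and random parameters, chosen only for illustration.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_r = rng.standard_normal((hidden_size, input_size))
U_r = rng.standard_normal((hidden_size, hidden_size))
b_r = np.zeros(hidden_size)

x_t = rng.standard_normal(input_size)
h_prev = rng.standard_normal(hidden_size)
r_t = reset_gate(x_t, h_prev, W_r, U_r, b_r)
print(r_t)  # each element lies between 0 and 1
```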
The update gate controls the extent to which the hidden state is updated with new information versus retaining old information. It determines how much of the previous hidden state $h_{t-1}$ carries over to the final hidden state $h_t$. Similar to the reset gate, it uses the current input $x_t$ and the previous hidden state $h_{t-1}$.
The calculation is:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$

Again, $W_z$, $U_z$, and $b_z$ are learned parameters (weights and bias), and $\sigma$ is the sigmoid function. A value of $z_t$ close to 1 indicates that the new hidden state $h_t$ should primarily be based on the candidate state $\tilde{h}_t$, while a value close to 0 suggests retaining most of the previous state $h_{t-1}$.
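Since the update gate has exactly the same functional form as the reset gate, a single helper can compute either one; only the learned parameters differ. The `gate` helper and the toy shapes below are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(x_t, h_prev, W, U, b):
    # Shared form of both GRU gates: sigmoid(W x_t + U h_{t-1} + b).
    return sigmoid(W @ x_t + U @ h_prev + b)

# With (W_r, U_r, b_r) this computes r_t; with (W_z, U_z, b_z), z_t.
rng = np.random.default_rng(1)
input_size, hidden_size = 3, 4
x_t = rng.standard_normal(input_size)
h_prev = rng.standard_normal(hidden_size)

W_z = rng.standard_normal((hidden_size, input_size))
U_z = rng.standard_normal((hidden_size, hidden_size))
b_z = np.zeros(hidden_size)
z_t = gate(x_t, h_prev, W_z, U_z, b_z)
print(z_t)  # values in (0, 1): near 1 favors the candidate state, near 0 keeps h_{t-1}
```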
The candidate hidden state is calculated similarly to the hidden state in a simple RNN, but with a modification involving the reset gate. It aims to capture the new information from the current input $x_t$, tempered by the relevant parts of the previous state $h_{t-1}$.
The calculation involves the current input $x_t$ and the previous hidden state $h_{t-1}$, element-wise multiplied ($\odot$) by the reset gate's output $r_t$:
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$

Here, $W_h$, $U_h$, and $b_h$ are learned parameters. The $\tanh$ activation function squashes the output to be between -1 and 1. The element-wise multiplication $r_t \odot h_{t-1}$ is significant: if an element of $r_t$ is close to 0, the corresponding element of $h_{t-1}$ contributes very little to the calculation of $\tilde{h}_t$, effectively allowing the cell to "forget" irrelevant past information when generating the candidate state.
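Here is a corresponding sketch of the candidate-state computation, again with assumed toy shapes and a stand-in reset-gate vector, to show where the element-wise product enters.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4   # toy sizes, chosen arbitrarily

# Assumed parameters and inputs for the candidate state.
W_h = rng.standard_normal((hidden_size, input_size))
U_h = rng.standard_normal((hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
x_t = rng.standard_normal(input_size)
h_prev = rng.standard_normal(hidden_size)
r_t = np.array([0.9, 0.5, 0.1, 0.02])  # stand-in reset gate output

# h~_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h)
# The product r_t * h_prev damps elements of h_prev where r_t is near 0,
# so they barely influence the candidate state.
h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
print(h_tilde)  # each element lies between -1 and 1
```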
The final hidden state $h_t$ for the current time step is computed by linearly interpolating between the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. The update gate $z_t$ controls this interpolation.
The calculation is:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

This equation shows how the GRU updates its state. The vector $z_t$ acts element-wise: where an element of $z_t$ is close to 1, the corresponding element of $h_t$ is drawn mostly from the candidate $\tilde{h}_t$; where it is close to 0, that element largely carries over from $h_{t-1}$, consistent with the role of the update gate described above.
This mechanism allows the GRU to maintain information over long sequences (when $z_t$ stays close to 0 for many steps) or update rapidly based on new inputs (when $z_t$ is close to 1). Notably, the GRU does not have a separate cell state like the LSTM; the hidden state $h_t$ carries all the necessary information forward.
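Putting the four equations together, the following is a minimal NumPy sketch of one full GRU step unrolled over a short sequence. The parameter layout, helper names, and sizes are illustrative choices, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, params):
    """One GRU step following the equations above (illustrative, not optimized)."""
    W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h = params
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)                # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)                # update gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)    # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                   # interpolation
    return h_t

# Toy setup: input size 3, hidden size 4, a sequence of 5 steps (all arbitrary).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4

def param_set():
    # One (W, U, b) triple; scaled down to keep activations away from saturation.
    return (rng.standard_normal((hidden_size, input_size)) * 0.5,
            rng.standard_normal((hidden_size, hidden_size)) * 0.5,
            np.zeros(hidden_size))

params = param_set() + param_set() + param_set()  # reset, update, candidate

h = np.zeros(hidden_size)                  # initial hidden state
for x_t in rng.standard_normal((5, input_size)):
    h = gru_cell(x_t, h, params)           # only h is passed between steps
print(h)
```

Note that the loop passes nothing but $h$ from one step to the next, which makes the point above concrete: unlike the LSTM, there is no separate cell state to carry.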
This architecture, with its two gates and combined state representation, provides a powerful yet potentially more computationally efficient way to handle sequential data than the LSTM, a comparison we will make more directly later in this chapter.
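To get a rough sense of the efficiency argument, you can count the learned parameters per layer directly from the equations: the GRU uses three $(W, U, b)$ sets (reset gate, update gate, candidate), while a standard LSTM uses four (input, forget, and output gates plus the cell candidate). The sizes below are arbitrary, and real implementations may add a second bias vector per gate.

```python
def recurrent_param_count(num_sets, input_size, hidden_size):
    # Each (W, U, b) set contributes W (hidden x input), U (hidden x hidden), and b (hidden).
    return num_sets * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

input_size, hidden_size = 128, 256                                # example sizes
gru_params = recurrent_param_count(3, input_size, hidden_size)    # r, z, candidate
lstm_params = recurrent_param_count(4, input_size, hidden_size)   # i, f, o, cell candidate
print(gru_params, lstm_params)  # 295680 394240 -> the GRU layer has 25% fewer parameters
```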