Like LSTMs, Gated Recurrent Units (GRUs) are designed to capture dependencies in sequential data effectively by using gating mechanisms. However, the GRU cell achieves this with a slightly different and often simpler internal structure than the LSTM cell. Instead of three gates (forget, input, output) and a separate cell state, the GRU employs two primary gates: the reset gate and the update gate. It also combines the functions of the LSTM's cell state and hidden state into a single hidden state vector, $h_t$.
Let's examine the components and the flow of information within a single GRU cell at time step $t$. The cell receives the current input $x_t$ and the hidden state from the previous time step, $h_{t-1}$. It then computes the new hidden state $h_t$, which also serves as the output for this time step.
The key components are:

- The reset gate, $r_t$, which controls how much of the previous hidden state is used when forming the candidate state.
- The update gate, $z_t$, which controls how much of the previous hidden state is carried over versus replaced by new content.
- The candidate hidden state, $\tilde{h}_t$, which proposes new content based on the current input and the reset-modulated previous state.
- The final hidden state, $h_t$, an interpolation between $h_{t-1}$ and $\tilde{h}_t$ controlled by the update gate.
Here is a conceptual view of the GRU cell's architecture:
A simplified view of the GRU cell structure. Inputs $x_t$ and $h_{t-1}$ feed into the reset ($r_t$) and update ($z_t$) gates. The reset gate modulates the influence of $h_{t-1}$ on the candidate state $\tilde{h}_t$. The update gate controls the mix between $h_{t-1}$ and $\tilde{h}_t$ to produce the final output $h_t$.
Now, let's look at the calculations performed within the cell.
The reset gate decides how much information from the previous hidden state $h_{t-1}$ should be disregarded when computing the candidate hidden state $\tilde{h}_t$. It takes the current input $x_t$ and the previous hidden state $h_{t-1}$ as inputs.
The calculation is:
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$

Here, $W_r$ and $U_r$ are weight matrices, $b_r$ is a bias vector, and $\sigma$ is the sigmoid activation function. The sigmoid function outputs values between 0 and 1. A value close to 0 means "reset" or ignore the corresponding element of the previous state, while a value close to 1 means "keep" it when calculating the candidate state.
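To make the shapes concrete, here is a minimal NumPy sketch of the reset gate computation. The toy dimensions, random inputs, and randomly initialized parameters are assumptions chosen purely for illustration; in a trained network, $W_r$, $U_r$, and $b_r$ would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed toy dimensions for illustration
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
x_t = rng.standard_normal(input_size)       # current input x_t
h_prev = rng.standard_normal(hidden_size)   # previous hidden state h_{t-1}

# Reset gate parameters (random stand-ins for learned weights)
W_r = rng.standard_normal((hidden_size, input_size))
U_r = rng.standard_normal((hidden_size, hidden_size))
b_r = np.zeros(hidden_size)

# r_t = sigmoid(W_r x_t + U_r h_{t-1} + b_r): one value in (0, 1) per hidden unit
r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)
```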
The update gate controls the extent to which the hidden state is updated with new information versus retaining old information. It determines how much of the previous hidden state $h_{t-1}$ carries over to the final hidden state $h_t$. Like the reset gate, it uses the current input $x_t$ and the previous hidden state $h_{t-1}$.
The calculation is:
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$

Again, $W_z$, $U_z$, and $b_z$ are learned parameters (weights and bias), and $\sigma$ is the sigmoid function. A value of $z_t$ close to 1 indicates that the new hidden state $h_t$ should primarily be based on the candidate state $\tilde{h}_t$, while a value close to 0 suggests retaining most of the previous state $h_{t-1}$.
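Because the update gate has exactly the same functional form as the reset gate (only with its own parameters), both can be computed by one small helper. The sketch below again uses assumed toy dimensions and random values in place of learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(x_t, h_prev, W, U, b):
    """Generic GRU gate: sigmoid(W x_t + U h_{t-1} + b)."""
    return sigmoid(W @ x_t + U @ h_prev + b)

# Assumed toy dimensions for illustration
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
x_t = rng.standard_normal(input_size)
h_prev = rng.standard_normal(hidden_size)

def random_gate_params():
    # Random stand-ins for the learned (W, U, b) of one gate
    return (rng.standard_normal((hidden_size, input_size)),
            rng.standard_normal((hidden_size, hidden_size)),
            np.zeros(hidden_size))

r_t = gate(x_t, h_prev, *random_gate_params())  # reset gate
z_t = gate(x_t, h_prev, *random_gate_params())  # update gate, with its own parameters
```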
The candidate hidden state is calculated much like the hidden state in a simple RNN, but with a modification involving the reset gate. It aims to capture the new information from the current input $x_t$, tempered by the relevant parts of the previous state $h_{t-1}$.
The calculation uses the current input $x_t$ and the previous hidden state $h_{t-1}$, with the latter element-wise multiplied ($\odot$) by the reset gate's output $r_t$:
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$

Here, $W_h$, $U_h$, and $b_h$ are learned parameters. The $\tanh$ activation function squashes the output to be between -1 and 1. The element-wise multiplication $r_t \odot h_{t-1}$ is significant: if an element of $r_t$ is close to 0, the corresponding element of $h_{t-1}$ contributes very little to the calculation of $\tilde{h}_t$, effectively allowing the cell to "forget" irrelevant past information when generating the candidate state.
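The sketch below shows this step in isolation. The toy dimensions are assumptions, and a random vector in (0, 1) stands in for the reset gate's output so the block runs on its own.

```python
import numpy as np

# Assumed toy dimensions and values for illustration
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
x_t = rng.standard_normal(input_size)       # current input x_t
h_prev = rng.standard_normal(hidden_size)   # previous hidden state h_{t-1}
r_t = rng.uniform(size=hidden_size)         # stand-in for the reset gate output

# Candidate-state parameters (random stand-ins for learned weights)
W_h = rng.standard_normal((hidden_size, input_size))
U_h = rng.standard_normal((hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

# Elements of h_prev with r_t near 0 are suppressed before entering the tanh
h_candidate = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)
```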
The final hidden state $h_t$ for the current time step is computed by linearly interpolating between the previous hidden state $h_{t-1}$ and the candidate hidden state $\tilde{h}_t$. The update gate $z_t$ controls this interpolation.
The calculation is:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

This equation shows how the GRU updates its state. The vector $z_t$ acts element-wise:

- Where an element of $z_t$ is close to 1, the corresponding element of $h_t$ is taken mostly from the candidate state $\tilde{h}_t$.
- Where an element of $z_t$ is close to 0, the corresponding element of $h_t$ is carried over mostly from the previous state $h_{t-1}$.
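Putting the four equations together, here is a compact NumPy sketch of one complete GRU forward step. All names, toy dimensions, and the random parameter initialization are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step: returns the new hidden state h_t."""
    W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h = params
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # update gate
    h_cand = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)    # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand                  # interpolation

# Toy usage: random parameters stand in for learned weights
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
make_W = lambda: rng.standard_normal((hidden_size, input_size))
make_U = lambda: rng.standard_normal((hidden_size, hidden_size))
make_b = lambda: np.zeros(hidden_size)
params = (make_W(), make_U(), make_b(),   # reset gate
          make_W(), make_U(), make_b(),   # update gate
          make_W(), make_U(), make_b())   # candidate state

h_t = np.zeros(hidden_size)                        # initial hidden state
for x_t in rng.standard_normal((5, input_size)):   # a short input sequence
    h_t = gru_step(x_t, h_t, params)               # h_t is the only state carried forward
```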
This mechanism allows the GRU to maintain information over long sequences (when $z_t$ stays close to 0 for many steps) or update rapidly based on new inputs (when $z_t$ is close to 1). Notably, the GRU does not have a separate cell state like the LSTM; the hidden state $h_t$ carries all the necessary information forward.
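This difference shows up directly in common framework APIs. As an illustration using PyTorch (with `batch_first=True`, so inputs are shaped `(batch, sequence, features)`), `nn.GRU` returns only a hidden state, while `nn.LSTM` also returns a separate cell state:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 32)  # (batch, sequence length, input size)

gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

out_gru, h_n = gru(x)              # GRU: only a hidden state is carried forward
out_lstm, (h_l, c_l) = lstm(x)     # LSTM: hidden state plus a separate cell state

print(h_n.shape)                   # torch.Size([1, 8, 64])
print(h_l.shape, c_l.shape)        # both torch.Size([1, 8, 64])
```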
This architecture, with its two gates and combined state representation, offers a capable and often more computationally efficient alternative to the LSTM for handling sequential data. We will compare the two architectures more directly later in this chapter.