While both gates in a GRU control information flow, the reset gate plays a specific and important role: it determines how much of the past information, carried by the previous hidden state ($h_{t-1}$), should be ignored or "reset" when calculating the new candidate hidden state ($\tilde{h}_t$). Think of it as a filter deciding the relevance of past context for proposing an updated memory state.
Like the update gate, the reset gate's activation, denoted $r_t$, is computed from the current input $x_t$ and the previous hidden state $h_{t-1}$. It uses a sigmoid activation function, ensuring its output values lie between 0 and 1.
The calculation involves learning separate weight matrices ($W_{xr}$ and $W_{hr}$) and a bias term ($b_r$):
$$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)$$

Here, $x_t$ is the current input, $h_{t-1}$ is the previous hidden state, $W_{xr}$ and $W_{hr}$ are the learned weight matrices, $b_r$ is the learned bias, and $\sigma$ is the sigmoid function.
The output $r_t$ is a vector of the same dimension as the hidden state. Each element in $r_t$ corresponds to a dimension of the hidden state, acting as a gate value for that specific dimension.
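As a concrete illustration, here is a minimal NumPy sketch of the reset gate computation. The dimensions, random weights, and variable names are made up for this example; in a trained GRU the weight matrices and bias would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3                      # toy sizes, chosen only for illustration

W_xr = rng.normal(size=(hidden_size, input_size))   # input-to-reset weights
W_hr = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-reset weights
b_r = np.zeros(hidden_size)                         # reset gate bias

x_t = rng.normal(size=input_size)                   # current input x_t
h_prev = rng.normal(size=hidden_size)               # previous hidden state h_{t-1}

# r_t = sigmoid(W_xr x_t + W_hr h_{t-1} + b_r)
r_t = sigmoid(W_xr @ x_t + W_hr @ h_prev + b_r)

print(r_t.shape)  # (3,), one gate value per hidden-state dimension
print(r_t)        # every element lies strictly between 0 and 1
```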
The values in the reset gate vector $r_t$ directly control the influence of the previous hidden state $h_{t-1}$ when computing the candidate hidden state $\tilde{h}_t$. A value close to 0 in $r_t$ for a particular dimension effectively "resets" or nullifies the contribution from the corresponding dimension in $h_{t-1}$. Conversely, a value close to 1 allows that part of the previous hidden state to pass through mostly unchanged.
This mechanism is applied via element-wise multiplication ($\odot$) between the reset gate $r_t$ and the previous hidden state $h_{t-1}$. This modulated previous state is then used in the calculation of the candidate hidden state:
$$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h)$$

Notice how $r_t \odot h_{t-1}$ determines exactly which parts of the previous state $h_{t-1}$ are combined with the current input $x_t$ to form the candidate state $\tilde{h}_t$. If an element in $r_t$ is 0, the corresponding element in $h_{t-1}$ is effectively zeroed out before the weighted sum inside the $\tanh$ function.
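Continuing in the same spirit, the sketch below computes the candidate state for one step. Here $r_t$ is fixed to hand-picked values purely to make the gating effect visible; in practice it would come from the reset gate equation above, and all weights are again random stand-ins rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3                      # same toy sizes as above

W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-candidate weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # hidden-to-candidate weights
b_h = np.zeros(hidden_size)                         # candidate-state bias

x_t = rng.normal(size=input_size)                   # current input x_t
h_prev = rng.normal(size=hidden_size)               # previous hidden state h_{t-1}
r_t = np.array([0.05, 0.9, 0.5])                    # example reset gate values, for illustration only

# r_t ⊙ h_{t-1}: each hidden dimension is scaled by its own gate value
gated_past = r_t * h_prev

# h~_t = tanh(W_xh x_t + W_hh (r_t ⊙ h_{t-1}) + b_h)
h_candidate = np.tanh(W_xh @ x_t + W_hh @ gated_past + b_h)

print(gated_past)         # the first dimension of h_prev is almost erased (gate ≈ 0.05)
print(h_candidate.shape)  # (3,), same dimension as the hidden state
```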
Flow showing the calculation of the reset gate $r_t$ and its element-wise multiplication ($\odot$) with the previous hidden state $h_{t-1}$ to influence the candidate hidden state $\tilde{h}_t$.
The reset gate gives the GRU unit the ability to dynamically adjust how much the proposed new state ($\tilde{h}_t$) should depend on the immediate past state ($h_{t-1}$). If the current input $x_t$ suggests a significant shift in context or topic compared to what was encoded in $h_{t-1}$, the reset gate can learn to activate close to 0. This effectively allows the unit to "start fresh" in computing the candidate state, focusing more on the current input $x_t$ rather than blending it with potentially irrelevant past information.
For example, in language modeling, if the network encounters the end of a sentence (signaled perhaps by punctuation in $x_t$), the reset gate can learn to push its values toward 0, diminishing the influence of the previous sentence's hidden state when calculating the candidate state for the beginning of the next sentence.
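To make the "start fresh" behavior concrete, the small NumPy check below (again with made-up sizes and random weights) shows that when the reset gate is driven toward 0, the candidate state no longer depends on which previous hidden state is supplied, whereas a gate near 1 lets the past state matter.

```python
import numpy as np

rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3
W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
x_t = rng.normal(size=input_size)

def candidate(r_t, h_prev):
    """Candidate state h~_t for a given reset gate and previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ (r_t * h_prev) + b_h)

h_prev_a = rng.normal(size=hidden_size)
h_prev_b = rng.normal(size=hidden_size)   # a very different past state

r_closed = np.zeros(hidden_size)          # reset gate driven toward 0
# With r_t ≈ 0 the past is wiped out: the candidate is identical for both histories.
print(np.allclose(candidate(r_closed, h_prev_a), candidate(r_closed, h_prev_b)))  # True

r_open = np.ones(hidden_size)             # reset gate near 1
# With r_t ≈ 1 the candidate still depends on which past state was supplied
# (almost surely different here, since the two histories are random).
print(np.allclose(candidate(r_open, h_prev_a), candidate(r_open, h_prev_b)))      # False
```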
In summary, the reset gate acts as a controller, selectively diminishing parts of the previous hidden state before calculating the candidate hidden state. This allows the GRU to effectively forget information that is deemed irrelevant for the immediate next step, contributing to its ability to handle dependencies over time.