In our look at simple RNNs, we saw their difficulty in handling long-range dependencies partly because they lacked a mechanism to explicitly control their memory. Information from many steps ago could either vanish or overwhelm the network. LSTMs introduce specialized components, called gates, to manage the information stored in the network's memory, which we call the cell state ($C$).
The first gate we'll examine is the forget gate. Its function is straightforward but significant: it decides which information should be discarded or kept from the cell state. Think of the cell state as the LSTM's long-term memory. As new input arrives, the forget gate looks at the previous state and the new input to determine which parts of the existing long-term memory are no longer relevant.
The forget gate works by combining the previous hidden state ($h_{t-1}$) and the current input ($x_t$) with a learned weight matrix and bias, then passing the result through a sigmoid activation function ($\sigma$). The sigmoid function is ideal here because it outputs values between 0 and 1.
Inputs ($h_{t-1}$ and $x_t$) are processed by the forget gate's sigmoid layer to produce the forget vector $f_t$.
Mathematically, the calculation is:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

Let's break this down:
Here, $W_f$ is the forget gate's weight matrix, $b_f$ is its bias, and $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input. The output $f_t$ is a vector with the same dimension as the cell state $C_{t-1}$. Each element in $f_t$ is a number between 0 and 1, corresponding to an element in the cell state $C_{t-1}$.
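To make the shapes concrete, here is a minimal NumPy sketch of this calculation. The dimensions, random weights, and names such as `W_f`, `b_f`, `h_prev`, and `x_t` are toy values chosen for illustration, not part of any particular library's API.

```python
import numpy as np

def sigmoid(z):
    # Squashes each element into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy dimensions: hidden/cell state size 4, input size 3
hidden_size, input_size = 4, 3

rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # forget gate weights
b_f = np.zeros(hidden_size)                                     # forget gate bias

h_prev = rng.normal(size=hidden_size)   # previous hidden state h_{t-1}
x_t = rng.normal(size=input_size)       # current input x_t

# Concatenate [h_{t-1}, x_t], apply the linear map, then the sigmoid
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

print(f_t.shape)                              # (4,) -- same dimension as the cell state
print((f_t > 0).all() and (f_t < 1).all())    # True: every element lies in (0, 1)
```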
This vector $f_t$ acts like a filter. It will be multiplied element-wise with the previous cell state $C_{t-1}$ to decide how much of the old memory should pass through to the next step. We will see this multiplication happen when we discuss updating the cell state later in this chapter.
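The short sketch below, again using made-up numbers, shows how this element-wise filter behaves: entries of $f_t$ near 1 preserve the corresponding memory cell, while entries near 0 erase it.

```python
import numpy as np

# Hypothetical values to illustrate the filtering effect
C_prev = np.array([ 2.0, -1.5,  0.8,  3.0])   # previous cell state C_{t-1}
f_t    = np.array([ 0.9,  0.1,  1.0,  0.0])   # forget gate output f_t

# Element-wise multiplication: values near 1 keep memory, values near 0 erase it
retained = f_t * C_prev
print(retained)   # [ 1.8  -0.15  0.8   0.  ]
```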
The ability to selectively forget irrelevant information based on the current input and the past context (via $h_{t-1}$) is a core reason why LSTMs can maintain useful information over much longer sequences compared to simple RNNs. It prevents the cell state from becoming cluttered with outdated or unnecessary details.