As we saw in the previous section, simple Recurrent Neural Networks (RNNs) struggle to learn dependencies between elements that are far apart in a sequence. This is largely due to the vanishing gradient problem, where gradients become too small during backpropagation to effectively update the network's weights for earlier time steps. To overcome this limitation, a more sophisticated recurrent unit was developed: the Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997.
LSTMs are explicitly designed to remember information for long periods. The central innovation is the cell state (Ct), often conceptualized as the network's "memory," which can carry information relatively unchanged across many time steps. Unlike the single hidden state transformation in a simple RNN, LSTMs use a system of gates to carefully regulate the flow of information into and out of this cell state.
An LSTM unit processes data sequentially, taking the input at the current time step (xt) and the hidden state from the previous time step (ht−1) to compute the new hidden state (ht) and update its internal cell state (Ct). This process is controlled by three primary gates:
Forget Gate (ft): This gate decides what information to discard from the cell state. It looks at ht−1 and xt and outputs a number between 0 and 1 for each number in the previous cell state Ct−1. A 1 represents "completely keep this," while a 0 represents "completely get rid of this." It uses a sigmoid activation function (σ). ft=σ(Wf[ht−1,xt]+bf) Here, [ht−1,xt] denotes the concatenation of the previous hidden state and the current input, Wf are the weights, and bf is the bias for the forget gate.
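To make this concrete, here is a minimal NumPy sketch of the forget gate computation for a single time step. The sizes and random weights are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes (assumptions, not from the text): 4 input features, 3 hidden units.
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # forget-gate weights Wf
b_f = np.zeros(hidden_size)                                         # forget-gate bias bf

h_prev = np.zeros(hidden_size)           # h_{t-1}
x_t = rng.standard_normal(input_size)    # x_t
concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

f_t = sigmoid(W_f @ concat + b_f)        # one value in (0, 1) per entry of C_{t-1}
print(f_t.shape)                         # (3,)
```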
Input Gate (it): This gate determines which new information will be stored in the cell state. It consists of two parts. First, a sigmoid layer decides which values of the cell state to update: it=σ(Wi[ht−1,xt]+bi) Second, a tanh layer creates a vector of new candidate values, C~t, that could be added to the state: C~t=tanh(WC[ht−1,xt]+bC) Here, Wi, WC and bi, bC are the weights and biases for the input gate and the candidate layer.
These two parts are combined to update the cell state.
Updating the Cell State: The old cell state Ct−1 is updated to the new cell state Ct. We multiply the old state by ft, forgetting the things we decided to forget earlier, and then add it∗C~t, the new candidate values scaled by how much we decided to update each state entry. Ct=ft∗Ct−1+it∗C~t Note the use of addition here: this additive interaction allows gradients to flow much more easily than through the repeated multiplications of a simple RNN.
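To see what the gating does numerically, here is a tiny NumPy example of the cell-state update with hand-picked placeholder values (not taken from the text):

```python
import numpy as np

# Placeholder gate activations and states for a 3-dimensional cell state.
f_t     = np.array([1.0, 0.0, 0.5])    # keep 1st entry, erase 2nd, halve 3rd
c_prev  = np.array([2.0, -3.0, 4.0])   # C_{t-1}
i_t     = np.array([0.0, 1.0, 0.5])    # how much of each candidate to write
c_tilde = np.array([9.9, 0.7, -1.0])   # candidate values C~_t

c_t = f_t * c_prev + i_t * c_tilde     # C_t = f_t * C_{t-1} + i_t * C~_t
print(c_t)                             # [2.  0.7 1.5]
```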
Output Gate (ot): This gate decides what the next hidden state ht (the output for this time step) should be. The output is based on a filtered version of the cell state. First, a sigmoid layer decides which parts of the cell state to output: ot=σ(Wo[ht−1,xt]+bo) Then the cell state is passed through tanh (to push its values between -1 and 1) and multiplied by the output of the sigmoid gate ot, so that we only output the parts we decided to. ht=ot∗tanh(Ct)
The hidden state ht is then passed to the next time step (as ht−1) and can also be used as the output of the model at the current step.
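Putting the forget, input, and output gates together, here is a minimal, self-contained NumPy sketch of one LSTM cell step. The shapes, random initialization, and function names are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state h_t and cell state C_t."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]

    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate values
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate

    c_t = f_t * c_prev + i_t * c_tilde                         # Ct = ft*Ct-1 + it*C~t
    h_t = o_t * np.tanh(c_t)                                   # ht = ot*tanh(Ct)
    return h_t, c_t

# Illustrative sizes and random parameters (assumptions, not from the text).
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W_{name}"] = rng.standard_normal((hidden_size, hidden_size + input_size)) * 0.1
    params[f"b_{name}"] = np.zeros(hidden_size)

# Run the cell over a short random sequence, carrying h_t and C_t forward.
h_t = np.zeros(hidden_size)
c_t = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):
    h_t, c_t = lstm_cell_step(x_t, h_t, c_t, params)
print(h_t)  # hidden state after the last time step
```

Framework implementations usually fuse the four weight matrices into a single larger matrix multiplication for efficiency, but the per-gate computation is the same as above.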
The structure of an LSTM cell, with its gates controlling the flow of information through the cell state, can be visualized as follows:
Data flow within a single LSTM cell. Arrows indicate data dependencies. Circles with 'X' denote element-wise multiplication, '+' denotes element-wise addition, 'σ' denotes the sigmoid function, and 'tanh' denotes the hyperbolic tangent function. The cell state acts as a conveyor belt, with gates controlling information removal and addition.
The key mechanism allowing LSTMs to handle long-range dependencies better than simple RNNs lies in the cell state and the gating mechanism.
Think of the cell state as a conveyor belt carrying information through time. The forget gate removes items from the belt, the input gate adds new items, and the output gate reads items off the belt to decide the immediate output (hidden state). This structure is much more effective at preserving signals over long durations.
LSTMs have been highly successful in various NLP tasks, including machine translation, sentiment analysis, and sequence labeling, precisely because of their ability to capture long-range context. While they are more complex than simple RNNs, their effectiveness often justifies the additional computational overhead.
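In practice, you would rarely implement the recurrence by hand; deep learning libraries ship optimized LSTM layers. As a brief illustration, here is a sketch of applying a single-layer LSTM to a batch of sequences in PyTorch, with shapes chosen arbitrarily for the example:

```python
import torch
import torch.nn as nn

# Illustrative sizes: batch of 2 sequences, 5 time steps, 8 input features, 16 hidden units.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=1, batch_first=True)

x = torch.randn(2, 5, 8)        # (batch, time, features)
output, (h_n, c_n) = lstm(x)    # output: hidden state at every step; (h_n, c_n): final states

print(output.shape)  # torch.Size([2, 5, 16])
print(h_n.shape)     # torch.Size([1, 2, 16])
print(c_n.shape)     # torch.Size([1, 2, 16])
```

Here, output holds ht for every time step, while h_n and c_n are the final hidden and cell states, mirroring the ht and Ct described above.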
In the next section, we'll look at Gated Recurrent Units (GRUs), a slightly simpler variant of the LSTM that often achieves comparable performance.