As we saw previously, the core mathematical operations within simple RNNs, involving repeated matrix multiplications across time steps, lead directly to the vanishing and exploding gradient problems. Training deep recurrent networks becomes unstable, making it difficult to capture dependencies between elements far apart in a sequence. Long Short-Term Memory (LSTM) networks, introduced by Hochreiter and Schmidhuber in 1997, were specifically designed to combat these issues through a more sophisticated internal structure incorporating gating mechanisms.
The central innovation of the LSTM is the introduction of a cell state ($C_t$) alongside the hidden state ($h_t$). Think of the cell state as an information highway or a memory conveyor belt. It runs straight down the entire sequence, with only minor linear interactions. Information can be added to or removed from the cell state, operations carefully regulated by structures called gates.
These gates are composed of a sigmoid neural network layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 1 means "let everything through," while a value of 0 means "let nothing through." An LSTM unit typically contains three such gates to protect and control the cell state.
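As a rough illustration of this gating idea, the sketch below (plain NumPy, with made-up values that are not from the text) multiplies a vector element-wise by sigmoid outputs; entries gated near 1 pass through, entries gated near 0 are suppressed.

```python
import numpy as np

def sigmoid(z):
    # Squash values into (0, 1): 1 lets everything through, 0 blocks everything.
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values flowing through a gate (arbitrary numbers).
values      = np.array([0.9, -2.0, 0.5, 1.5])
gate_logits = np.array([4.0, -4.0, 0.0, 4.0])   # pre-sigmoid gate activations

gate     = sigmoid(gate_logits)   # approx [0.98, 0.02, 0.50, 0.98]
filtered = gate * values          # pointwise multiplication: keep, drop, halve, keep
print(gate.round(2), filtered.round(2))
```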
Let's examine each gate at a specific time step $t$, considering the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$.
The first step is to decide what information we're going to throw away from the cell state. This decision is made by the forget gate. It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$.
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Here, $[h_{t-1}, x_t]$ denotes the concatenation of the previous hidden state and the current input vector. $W_f$ represents the weight matrix and $b_f$ the bias vector for the forget gate. The sigmoid function $\sigma$ squashes the output to the range $[0, 1]$. A value close to 0 suggests forgetting the corresponding information in $C_{t-1}$, while a value close to 1 suggests keeping it.
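A minimal NumPy sketch of this computation is shown below; the dimensions, random weights, and variable names are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3                   # illustrative dimensions
rng = np.random.default_rng(0)

h_prev = rng.standard_normal(hidden_size)        # h_{t-1}
x_t    = rng.standard_normal(input_size)         # x_t
W_f    = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f    = np.zeros(hidden_size)

concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ concat + b_f)                # one value in (0, 1) per cell-state entry
print(f_t.round(2))   # near 1: keep that entry of C_{t-1}; near 0: forget it
```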
Next, we need to decide what new information we're going to store in the cell state. This involves two parts. First, a sigmoid layer called the input gate decides which values we'll update. Second, a $\tanh$ layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state.

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$

Similar to the forget gate, $W_i, b_i, W_C, b_C$ are weight matrices and bias vectors learned during training. The $\tanh$ function outputs values between -1 and 1, representing potential updates (positive or negative) to the cell state.
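Continuing the same illustrative setup, the sketch below computes the input gate and the candidate vector; again, all shapes and random weights are assumptions for demonstration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)

h_prev = rng.standard_normal(hidden_size)
x_t    = rng.standard_normal(input_size)
concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]

W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_C = rng.standard_normal((hidden_size, hidden_size + input_size))
b_C = np.zeros(hidden_size)

i_t       = sigmoid(W_i @ concat + b_i)   # which entries to update, in (0, 1)
c_tilde_t = np.tanh(W_C @ concat + b_C)   # candidate updates, in (-1, 1)
print(i_t.round(2), c_tilde_t.round(2))
```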
Now, we update the old cell state $C_{t-1}$ into the new cell state $C_t$. We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$, the new candidate values scaled by how much we decided to update each state value.
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

The symbol $*$ denotes element-wise multiplication.
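The following toy example applies this update to hand-picked vectors (not learned values) to show how the element-wise forget-then-add step behaves.

```python
import numpy as np

# Illustrative vectors: old cell state, gate outputs, and candidates.
C_prev  = np.array([2.0, -1.0, 0.5, 3.0])   # C_{t-1}
f_t     = np.array([1.0,  0.0, 0.9, 1.0])   # forget gate: keep, erase, mostly keep, keep
i_t     = np.array([0.0,  1.0, 0.5, 0.0])   # input gate: ignore, write, half-write, ignore
c_tilde = np.array([0.7,  0.7, 0.7, 0.7])   # candidate values

C_t = f_t * C_prev + i_t * c_tilde          # element-wise: forget first, then add
print(C_t)   # [2.0, 0.7, 0.8, 3.0]
```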
Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output.
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

Then, we put the cell state $C_t$ through $\tanh$ (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate $o_t$, so that we only output the parts we decided to. This result is the new hidden state $h_t$:

$$h_t = o_t * \tanh(C_t)$$
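A short sketch of this output step, under the same illustrative assumptions about shapes and random weights as before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(2)

h_prev = rng.standard_normal(hidden_size)
x_t    = rng.standard_normal(input_size)
C_t    = rng.standard_normal(hidden_size)          # new cell state from the update step
concat = np.concatenate([h_prev, x_t])

W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t = sigmoid(W_o @ concat + b_o)     # which parts of the cell state to expose
h_t = o_t * np.tanh(C_t)              # filtered, squashed view of the cell state
print(h_t.round(2))
```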
This $h_t$ is passed on to the next time step and can also be used as the output of the LSTM unit for prediction at the current time step.
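Putting the gates together, the following is a minimal, untrained single-step LSTM forward pass in NumPy; parameter names and shapes are assumptions chosen to mirror the equations above, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM forward step following the equations above."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # candidate values
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    C_t = f_t * C_prev + i_t * c_tilde                     # update the cell state
    h_t = o_t * np.tanh(C_t)                               # new hidden state
    return h_t, C_t

# Illustrative shapes and randomly initialized (untrained) parameters.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(3)
params = {name: rng.standard_normal((hidden_size, hidden_size + input_size))
          for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: np.zeros(hidden_size) for name in ("b_f", "b_i", "b_C", "b_o")})

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):             # a short input sequence
    h, C = lstm_step(x, h, C, params)
print(h.round(2), C.round(2))
```

In practice the four weight matrices are often stacked into one larger matrix so a single matrix multiplication produces all gate pre-activations at once; the separate matrices above simply mirror the per-gate equations.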
Internal structure of an LSTM cell. The gates (sigmoid $\sigma$) control the flow of information into and out of the cell state ($C_t$), represented by the green path. The hidden state ($h_t$) is a filtered version of the cell state.
The key insight is the cell state's additive interaction. New information is added to the cell state (via $i_t * \tilde{C}_t$) and old information is removed (via multiplication by $f_t$), rather than being transformed repeatedly by matrix multiplications and non-linearities as in simple RNNs. The forget gate allows the cell state to retain information over long periods if needed (by setting $f_t$ close to 1).
This structure creates pathways where gradients can flow backward through time without vanishing as rapidly. The gates learn to control this flow, opening or closing access to the cell state based on the context. If the forget gate is mostly open ($f_t \approx 1$) and the input gate is mostly closed ($i_t \approx 0$), the cell state can pass its information largely unchanged across many time steps, preserving gradients.
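The toy loop below makes this concrete: with the forget gate fully open and the input gate fully closed (hypothetical extreme settings), the cell state is copied forward unchanged no matter how many steps elapse.

```python
import numpy as np

rng = np.random.default_rng(4)

C   = np.array([1.5, -0.3, 2.0])    # some stored information
f_t = np.ones_like(C)               # f_t = 1: keep everything
i_t = np.zeros_like(C)              # i_t = 0: write nothing new

for step in range(1000):
    c_tilde = np.tanh(rng.standard_normal(3))   # whatever candidates arrive
    C = f_t * C + i_t * c_tilde                 # C is unchanged every step

print(C)   # still [1.5, -0.3, 2.0] after 1000 steps
```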
While LSTMs represented a significant advance and enabled progress on many sequence modeling tasks previously intractable for simple RNNs, they are not a perfect solution. They still process information sequentially, limiting parallelization during training and inference. Furthermore, while much better at capturing longer dependencies than simple RNNs, they can still struggle with extremely long sequences where subtle dependencies exist across thousands of time steps. The complexity of the gating mechanism also adds computational overhead compared to simpler models. These remaining challenges paved the way for architectures like the Transformer, which abandons recurrence altogether.