As we saw, SimpleRNN layers, while intuitive, struggle to learn patterns that span long intervals in a sequence. This is largely due to the vanishing gradient problem, where the influence of earlier inputs fades rapidly during backpropagation through time, making it difficult for the network to adjust weights based on long-range dependencies.
To overcome this significant limitation, a more sophisticated recurrent architecture called Long Short-Term Memory (LSTM) was developed by Sepp Hochreiter and Jürgen Schmidhuber in 1997. LSTMs are explicitly designed to remember information for extended periods, making them highly effective for a wide range of sequence modeling tasks.
The core innovation of LSTMs lies in their internal structure. Unlike a SimpleRNN unit, which has only a simple recurrent connection and transformation, an LSTM unit incorporates a memory cell (or cell state) and several gates. Think of the cell state, often denoted $C_t$, as the network's long-term memory. It can carry information relatively unchanged across many time steps. The gates are specialized neural network components that act like controllers, regulating the flow of information into and out of this cell state. They learn which information is important to keep, which to discard, and which to output at each time step.
Let's look at the key components within an LSTM unit at time step $t$:
LSTM units typically have three main gates that control the information flow. These gates use sigmoid activation functions, which output values between 0 and 1. A value close to 0 means "let nothing through," while a value close to 1 means "let everything through." Each gate takes the current input $x_t$ and the previous hidden state $h_{t-1}$ as inputs.
Forget Gate ($f_t$): This gate decides what information should be removed from the cell state. It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each entry in the previous cell state $C_{t-1}$. A 1 represents "completely keep this," while a 0 represents "completely get rid of this."
Input Gate ($i_t$): This gate determines what new information should be stored in the cell state. It consists of two parts: a sigmoid layer decides which values will be updated, and a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. The tanh function outputs values between -1 and 1.
Output Gate ($o_t$): This gate decides what the next hidden state $h_t$ should be. The output is based on a filtered version of the cell state. First, a sigmoid layer determines which parts of the cell state to output. Then, the current cell state $C_t$ is passed through tanh (to push its values between -1 and 1) and multiplied by the output of the sigmoid gate, $o_t$. This filtered result becomes the hidden state $h_t$, which is passed to the next time step and can also be used as the output for the current time step.
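To make the gate computations concrete, here is a minimal NumPy sketch of how the three gates and the candidate values could be computed for a single time step. The weight matrices and biases (W_f, b_f, and so on) are hypothetical, randomly initialized placeholders standing in for parameters that a real network would learn during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes for illustration: 8 input features, 16 hidden units
input_dim, hidden_dim = 8, 16
rng = np.random.default_rng(0)

# Randomly initialized stand-ins for learned parameters
W_f = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))  # forget gate
W_i = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))  # input gate
W_o = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))  # output gate
W_c = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))  # candidates
b_f, b_i, b_o, b_c = (np.zeros(hidden_dim) for _ in range(4))

x_t = rng.normal(size=input_dim)   # current input x_t
h_prev = np.zeros(hidden_dim)      # previous hidden state h_{t-1}

# Every gate looks at [h_{t-1}, x_t] and squashes the result into (0, 1)
concat = np.concatenate([h_prev, x_t])
f_t = sigmoid(W_f @ concat + b_f)      # how much of each C_{t-1} entry to keep
i_t = sigmoid(W_i @ concat + b_i)      # how much new information to write
o_t = sigmoid(W_o @ concat + b_o)      # how much of the cell state to expose as h_t
c_tilde = np.tanh(W_c @ concat + b_c)  # candidate values, squashed into (-1, 1)

print(f_t.shape, f_t.min(), f_t.max())  # (16,) with values strictly between 0 and 1
```

Implementations such as the Keras LSTM layer typically fuse these four linear transformations into one larger matrix multiplication for efficiency, but the underlying logic is the same.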
The cell state acts like a conveyor belt for information. It runs straight down the entire chain of time steps, with only minor linear interactions controlled by the gates. The forget gate selectively removes information from the previous cell state $C_{t-1}$, and the input gate selectively adds new candidate information $\tilde{C}_t$. Conceptually, the update looks like this:
$$C_t = (f_t * C_{t-1}) + (i_t * \tilde{C}_t)$$

Here, $*$ denotes element-wise multiplication. The first term, $f_t * C_{t-1}$, represents the information kept from the previous state, and the second term, $i_t * \tilde{C}_t$, represents the new information added. This additive structure makes it much easier for information to persist over long durations compared to the repeated matrix multiplications and non-linear transformations in a SimpleRNN.
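A quick numerical sketch makes the "conveyor belt" intuition tangible: when the forget gate stays close to 1 and the input gate stays close to 0, the cell state is carried forward almost unchanged for many steps. The gate values and cell state below are toy numbers chosen by hand purely to illustrate the arithmetic, not learned activations.

```python
import numpy as np

C_prev = np.array([0.9, -0.4, 0.3])   # cell state carrying some remembered information

# Hand-picked toy gate activations: the cell "holds" its memory
f_t = np.full(3, 0.99)                # forget gate near 1: keep almost everything
i_t = np.full(3, 0.01)                # input gate near 0: write almost nothing
c_tilde = np.full(3, 0.5)             # candidate values (largely ignored here)

C = C_prev.copy()
for _ in range(10):
    C = f_t * C + i_t * c_tilde       # C_t = f_t * C_{t-1} + i_t * C~_t

print(C_prev)  # [ 0.9 -0.4  0.3]
print(C)       # roughly [ 0.86 -0.31  0.32]: largely preserved after 10 steps
```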
The hidden state $h_t$ is computed from a filtered version of the cell state $C_t$, controlled by the output gate $o_t$.
$$h_t = o_t * \tanh(C_t)$$

This $h_t$ is then passed to the LSTM unit at the next time step ($t+1$) and can also serve as the output prediction for the current time step $t$.
Diagram of information flow within a single LSTM unit. Gates (blue) control how the cell state (green cylinder) is updated and what is output as the hidden state (orange parallelogram). Sigmoid (σ) and tanh activation functions are used within the gates and for state updates.
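Putting the three gates and the two state-update equations together, a single LSTM step can be sketched as a short function. This is a minimal NumPy illustration under assumed shapes, with hypothetical parameter names; it is not how Keras implements its LSTM layer internally, but it computes the same quantities described above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step: returns the new hidden state h_t and cell state C_t."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]

    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate values

    C_t = f_t * C_prev + i_t * c_tilde   # C_t = (f_t * C_{t-1}) + (i_t * C~_t)
    h_t = o_t * np.tanh(C_t)             # h_t = o_t * tanh(C_t)
    return h_t, C_t

# Assumed sizes and randomly initialized parameters, for illustration only
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(42)
params = {}
for name in ("f", "i", "o", "c"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
    params[f"b_{name}"] = np.zeros(hidden_dim)

# Run the step over a toy sequence of 5 time steps
sequence = rng.normal(size=(5, input_dim))
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in sequence:
    h, C = lstm_step(x_t, h, C, params)

print(h.shape, C.shape)  # (8,) (8,): final hidden state and cell state
```

Looping this function over the time steps of a sequence is, conceptually, what a recurrent layer does for you; the Keras LSTM layer covered next adds trainable weights, batching, and an efficient fused implementation of the same computation.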
By using this gating mechanism and the separate cell state pathway, LSTMs can effectively learn which information to store, which to forget, and which to output over long sequences. This design mitigates the vanishing gradient problem and allows them to capture dependencies that may span hundreds of time steps, making them a standard choice for many sequence-related tasks.
In the next section, we will see how to implement LSTMs using the convenient LSTM layer provided by Keras.