Simple Recurrent Neural Networks face significant challenges when learning dependencies across long sequences, primarily due to vanishing and exploding gradients. As gradients shrink or blow up during backpropagation through time, the network's ability to carry relevant information forward from earlier time steps is compromised. Long Short-Term Memory (LSTM) networks were specifically developed to address this limitation by introducing a more complex cell structure capable of maintaining a memory over extended periods.
At the core of an LSTM network is the LSTM cell. It replaces the simple transformation found in a standard RNN cell with a sophisticated system of gates and a dedicated cell state. This architecture allows the network to selectively add, remove, or retain information over time.
The LSTM cell has two primary components that enable this controlled information flow:

- The cell state, $C_t$: a pathway that runs through the chain of cells and acts as the network's long-term memory, carrying information forward with only minor, gate-controlled changes.
- The gates (forget, input, and output): small sigmoid-activated layers that regulate what is removed from, added to, and read out of the cell state.
Let's visualize how these components interact within a single LSTM cell at time step $t$:
Data flow and components within a single LSTM cell at time step $t$. It receives the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$. It computes the new cell state $C_t$ and the new hidden state $h_t$. Circles represent operations (sigmoid $\sigma$, hyperbolic tangent $\tanh$, element-wise multiplication $\odot$, element-wise addition $+$).
Information processing within the LSTM cell proceeds as follows:
Forget Gate ($f_t$): The cell first decides what information to throw away from the previous cell state, $C_{t-1}$. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$. These are passed through a sigmoid function $\sigma$:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

The output $f_t$ contains values between 0 and 1 for each number in $C_{t-1}$. A 1 represents "completely keep this," while a 0 represents "completely get rid of this."
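As a minimal sketch (not a reference implementation), the NumPy snippet below computes a forget gate for a single time step; the vector sizes and the random stand-ins for the learned parameters $W_f$ and $b_f$ are assumptions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3                 # illustrative sizes (assumed)

h_prev = rng.standard_normal(hidden_size)      # previous hidden state h_{t-1}
x_t = rng.standard_normal(input_size)          # current input x_t
C_prev = rng.standard_normal(hidden_size)      # previous cell state C_{t-1}

# Random stand-ins for the learned parameters W_f and b_f
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f): one value in (0, 1) per entry of C_{t-1}
concat = np.concatenate([h_prev, x_t])
f_t = sigmoid(W_f @ concat + b_f)
print(f_t)  # values near 1 keep the matching C_{t-1} entry, values near 0 erase it
```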
Input Gate ($i_t$) and Candidate Values ($\tilde{C}_t$): Next, the cell decides what new information to store in the cell state. This involves two steps:

- A sigmoid layer, the input gate, decides which values will be updated: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$.
- A $\tanh$ layer creates a vector of new candidate values that could be added to the state: $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$.
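Continuing the same illustrative setup (assumed sizes, random stand-in parameters), the input gate and candidate values might be computed like this:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3
h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
concat = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]

# Random stand-ins for the learned parameters
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_C = rng.standard_normal((hidden_size, hidden_size + input_size))
b_C = np.zeros(hidden_size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

i_t = sigmoid(W_i @ concat + b_i)      # input gate: how much of each candidate to write
C_tilde = np.tanh(W_C @ concat + b_C)  # candidate values, each in (-1, 1)
```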
Cell State Update ($C_t$): Now, the old cell state $C_{t-1}$ is updated into the new cell state $C_t$. The previous state is multiplied element-wise ($\odot$) by the forget vector $f_t$ (forgetting the selected parts). Then, $i_t \odot \tilde{C}_t$ (the new information, scaled by how much we decided to update) is added:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
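The update itself is just two element-wise products and a sum. A small numeric example (hand-picked values rather than learned ones) shows how the gates mix old and new information:

```python
import numpy as np

# Hand-picked gate outputs for a 4-unit cell, chosen only to illustrate the update
f_t = np.array([0.9, 0.1, 0.5, 1.0])       # forget gate: mostly keep, erase, half, keep
i_t = np.array([0.2, 0.8, 0.5, 0.0])       # input gate: how much new info to admit
C_tilde = np.array([0.3, -0.7, 0.1, 0.9])  # candidate values
C_prev = np.array([1.2, -0.4, 0.0, 2.0])   # previous cell state C_{t-1}

# C_t = f_t * C_{t-1} + i_t * C_tilde (element-wise)
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # approximately [1.14, -0.6, 0.05, 2.0]
```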
Output Gate ($o_t$) and Hidden State ($h_t$): Finally, the cell determines the output, which is the hidden state $h_t$. This output is a filtered version of the cell state. A sigmoid layer, the output gate, decides which parts of the cell state to expose, and the cell state is passed through $\tanh$ (pushing its values between -1 and 1) before being multiplied by that gate:

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$
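A matching sketch for the output side, again with assumed sizes and random stand-in parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
hidden_size, input_size = 4, 3
h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
concat = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
C_t = rng.standard_normal(hidden_size)         # new cell state from the update step

W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

o_t = sigmoid(W_o @ concat + b_o)  # output gate: which parts of the state to expose
h_t = o_t * np.tanh(C_t)           # hidden state: a filtered view of C_t
```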
In the equations above, $W_f, W_i, W_C, W_o$ are the weight matrices and $b_f, b_i, b_C, b_o$ are the bias vectors, which are learned during training. The notation $[h_{t-1}, x_t]$ means the concatenation of the two vectors.
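Putting the four steps together, one full forward pass of an LSTM cell can be written as a single function. This is a simplified sketch that mirrors the equations above (the function name, parameter dictionary, and sizes are assumptions for illustration; real implementations typically pack the four weight matrices into one for efficiency):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, h_prev, C_prev, params):
    """One LSTM cell step, following the equations above.
    `params` holds W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                    # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                              # new hidden state
    return h_t, C_t

# Usage: run the cell over a short random sequence with random parameters
rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.standard_normal((hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):          # a 5-step sequence
    h, C = lstm_cell_forward(x_t, h, C, params)
print(h.shape, C.shape)  # (4,) (4,)
```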
This gated architecture, especially the distinct cell state that undergoes only minor additions and multiplications controlled by gates, is the reason LSTMs are much better at capturing long-range dependencies than simple RNNs. Information can be preserved over many steps, and the gates learn to control what information is relevant, effectively mitigating the vanishing gradient problem by providing more direct pathways for gradient propagation.
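In practice you rarely write the cell by hand; frameworks ship it as a building block. As a brief sketch, the loop below uses PyTorch's nn.LSTMCell (with illustrative sizes) to step one cell across a 50-step sequence, carrying both the hidden state and the cell state forward so that information can persist across the whole sequence:

```python
import torch
import torch.nn as nn

# One LSTM cell with 3 input features and 4 hidden units (illustrative sizes)
cell = nn.LSTMCell(input_size=3, hidden_size=4)

batch = 2
h = torch.zeros(batch, 4)   # initial hidden state h_0
c = torch.zeros(batch, 4)   # initial cell state C_0

# Step the cell over a 50-step random sequence; the cell state's mostly additive
# path is what lets gradients and information survive across many steps.
for x_t in torch.randn(50, batch, 3):
    h, c = cell(x_t, (h, c))

print(h.shape, c.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```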