As we discussed previously, simple Recurrent Neural Networks face significant challenges when learning dependencies across long sequences, primarily due to vanishing and exploding gradients. The network's ability to carry relevant information from earlier time steps gets compromised. Long Short-Term Memory (LSTM) networks were specifically developed to address this limitation by introducing a more complex cell structure capable of maintaining a memory over extended periods.
At the core of an LSTM network is the LSTM cell. It replaces the simple transformation found in a standard RNN cell with a sophisticated system of gates and a dedicated cell state. This architecture allows the network to selectively add, remove, or retain information over time.
The LSTM cell has two primary components that enable this controlled information flow:

Cell State ($C_t$): a dedicated memory track that runs along the sequence and is modified only by the element-wise operations the gates permit, which lets information persist across many time steps.

Gates: sigmoid-controlled filters (the forget, input, and output gates) that decide what to discard from the cell state, what new information to write into it, and what to expose as the hidden state $h_t$.
Let's visualize how these components interact within a single LSTM cell at time step t:
Data flow and components within a single LSTM cell at time step $t$. The cell receives the current input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$, and computes the new cell state $C_t$ and the new hidden state $h_t$. Circles represent operations (sigmoid $\sigma$, hyperbolic tangent $\tanh$, element-wise multiplication $\times$, element-wise addition $+$).
Information processing within the LSTM cell proceeds through the following steps (a code sketch after this walkthrough ties the equations together):
Forget Gate ($f_t$): The cell first decides what information to throw away from the previous cell state $C_{t-1}$. It looks at the previous hidden state $h_{t-1}$ and the current input $x_t$, which are passed through a sigmoid function $\sigma$. The output $f_t$ contains a value between 0 and 1 for each number in $C_{t-1}$: a 1 represents "completely keep this," while a 0 represents "completely get rid of this."

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
Input Gate ($i_t$) and Candidate Values ($\tilde{C}_t$): Next, the cell decides what new information to store in the cell state. This involves two steps: a sigmoid layer, the input gate, decides which values will be updated, and a $\tanh$ layer produces a vector of candidate values $\tilde{C}_t$ that could be added to the state.

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
Cell State Update ($C_t$): Now, the old cell state $C_{t-1}$ is updated into the new cell state $C_t$. The previous state $C_{t-1}$ is multiplied element-wise ($\odot$) by the forget vector $f_t$, forgetting the selected parts. Then the product $i_t \odot \tilde{C}_t$, the new information scaled by how much we decided to update each value, is added.

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Output Gate ($o_t$) and Hidden State ($h_t$): Finally, the cell determines its output, the hidden state $h_t$, which is a filtered version of the cell state. A sigmoid layer, the output gate, decides which parts of the cell state to output, and the cell state is passed through $\tanh$ before being multiplied by the gate's output.

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$
In the equations above, $W_f, W_i, W_C, W_o$ represent the weight matrices and $b_f, b_i, b_C, b_o$ are the bias vectors, all of which are learned during training. The notation $[h_{t-1}, x_t]$ signifies the concatenation of the two vectors.
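To make these equations concrete, here is a minimal NumPy sketch of a single forward step of an LSTM cell. The function name `lstm_cell_forward`, the argument names, and the choice of separate weight matrices acting on the concatenated vector $[h_{t-1}, x_t]$ are illustrative assumptions, not a reference implementation; practical library implementations typically fuse the four matrix multiplications into one for efficiency.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, h_prev, c_prev,
                      W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM cell step following the gate equations above.

    x_t:    current input, shape (input_size,)
    h_prev: previous hidden state h_{t-1}, shape (hidden_size,)
    c_prev: previous cell state C_{t-1}, shape (hidden_size,)
    Each W_* has shape (hidden_size, hidden_size + input_size) and acts
    on the concatenation [h_{t-1}, x_t]; each b_* has shape (hidden_size,).
    """
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input gate
    c_tilde = np.tanh(W_C @ z + b_C)        # candidate values
    c_t = f_t * c_prev + i_t * c_tilde      # cell state update
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(c_t)                # new hidden state

    return h_t, c_t
```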
This gated architecture, especially the distinct cell state $C_t$ that is modified only by gate-controlled element-wise additions and multiplications, is the reason LSTMs are much better at capturing long-range dependencies than simple RNNs. Information can be preserved over many steps, the gates learn to control which information is relevant, and the largely additive cell-state path provides a more direct route for gradients to flow backward through time, mitigating the vanishing gradient problem.
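As a small illustration of the cell state being carried along a sequence, the sketch below reuses the hypothetical `lstm_cell_forward` function from above and unrolls it over a short random sequence: the same weights are applied at every step, and only $h_t$ and $C_t$ are passed forward from one step to the next.

```python
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 8, 10

# Randomly initialized parameters, purely for illustration.
weights = {name: rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
           for name in ("W_f", "W_i", "W_C", "W_o")}
biases = {name: np.zeros(hidden_size) for name in ("b_f", "b_i", "b_C", "b_o")}

h_t = np.zeros(hidden_size)   # initial hidden state h_0
c_t = np.zeros(hidden_size)   # initial cell state C_0
sequence = rng.normal(size=(seq_len, input_size))

for x_t in sequence:
    # Only h_t and c_t change from step to step; c_t is the memory
    # carried along the sequence, edited by the gates at each step.
    h_t, c_t = lstm_cell_forward(x_t, h_t, c_t,
                                 weights["W_f"], biases["b_f"],
                                 weights["W_i"], biases["b_i"],
                                 weights["W_C"], biases["b_C"],
                                 weights["W_o"], biases["b_o"])

print(h_t.shape, c_t.shape)   # (8,) (8,)
```

Swapping the random weights for trained parameters, or replacing this loop with a framework layer such as `torch.nn.LSTM`, follows the same pattern: the hidden and cell states are threaded through the sequence one time step at a time.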