As we saw in the previous chapter, training simple Recurrent Neural Networks (RNNs) on long sequences presents significant difficulties. The core issue lies in how gradients flow backward through time during training, a process known as backpropagation through time (BPTT). Gradients associated with information from early time steps must travel through many recurrent connections. In simple RNNs, these gradients are repeatedly multiplied by the same weight matrices. This repeated multiplication can cause gradients to either shrink exponentially towards zero (vanishing gradients) or grow exponentially out of control (exploding gradients).
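The effect is easy to see numerically. The short NumPy sketch below repeatedly multiplies a gradient vector by the same recurrent weight matrix, mimicking what happens during BPTT in a simple RNN; the scaling factors 0.5 and 1.5 are arbitrary choices used only to illustrate the two regimes, and activation-function derivatives are ignored for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_after(scale, steps=50, size=8):
    """Apply the same recurrent matrix to a gradient vector at every step,
    as happens when backpropagating through time in a simple RNN."""
    # Entries scaled so the spectral radius is roughly `scale`.
    W = scale * rng.standard_normal((size, size)) / np.sqrt(size)
    grad = np.ones(size)
    for _ in range(steps):
        grad = W.T @ grad  # the same matrix multiplies the gradient every time step
    return np.linalg.norm(grad)

small = gradient_norm_after(scale=0.5)  # spectral radius roughly below 1: vanishing
large = gradient_norm_after(scale=1.5)  # spectral radius roughly above 1: exploding

print(f"gradient norm after 50 steps with small weights: {small:.2e}")
print(f"gradient norm after 50 steps with large weights: {large:.2e}")
```

Running this shows the gradient norm collapsing toward zero in the first case and blowing up by many orders of magnitude in the second, which is exactly the behavior that makes long-range training unstable.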
The consequence? Simple RNNs struggle to learn dependencies between elements that are far apart in a sequence. The vanishing gradient problem means the network effectively forgets information from the distant past, making it very difficult to connect early inputs to later outputs. Exploding gradients lead to unstable training, where weight updates become too large and erratic.
To address these limitations, Long Short-Term Memory (LSTM) networks introduce a fundamental concept: gating mechanisms. Think of these gates as sophisticated valves or switches embedded within the RNN cell. Unlike simple RNNs where information flows relatively unchecked (just passing through matrix multiplications and an activation function), LSTMs use gates to regulate the flow of information dynamically.
These gates are not fixed; they are small neural networks themselves, typically using a sigmoid activation function (σ), which outputs values between 0 and 1. This output acts as a control signal: a value near 0 means almost nothing passes through, a value near 1 means almost everything passes through, and intermediate values let information through partially.
Crucially, the network learns how to operate these gates based on the current input and the previous hidden state. This allows the LSTM cell to make context-dependent decisions at each time step about which information to discard from its memory, which new information to store, and which information to expose in its output.
A gating mechanism regulating information flow: a learned control signal passes through a sigmoid (σ) activation to produce a gate value between 0 and 1, which then multiplies the incoming information, controlling how much passes through.
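A gate of this kind can be written down in a few lines. The sketch below is a minimal, self-contained NumPy illustration of the idea, not the full LSTM cell: the names apply_gate, W_x, W_h, and b are hypothetical, and the random parameters stand in for values the network would learn during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def apply_gate(x, h_prev, info, W_x, W_h, b):
    """Compute a gate value from the current input and previous hidden state,
    then use it to scale the incoming information elementwise."""
    gate = sigmoid(W_x @ x + W_h @ h_prev + b)  # each entry lies between 0 and 1
    return gate * info                          # near 0 blocks, near 1 passes through

# Toy dimensions and random parameters, purely for illustration.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_x = rng.standard_normal((hidden_size, input_size))
W_h = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)

x = rng.standard_normal(input_size)        # current input
h_prev = rng.standard_normal(hidden_size)  # previous hidden state
info = rng.standard_normal(hidden_size)    # information the gate regulates

print(apply_gate(x, h_prev, info, W_x, W_h, b))
```

Because the gate depends on both x and h_prev, the same cell can decide to pass information freely in one context and block it almost entirely in another.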
By introducing this selective control, gating mechanisms directly combat the vanishing and exploding gradient problems. They allow important information to be carried forward over many time steps without significant degradation (addressing vanishing gradients) and prevent irrelevant information from causing excessive updates (helping with exploding gradients). This ability to selectively remember and forget information over long durations is the primary reason LSTMs are far more effective than simple RNNs for many sequence modeling tasks.
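To see why a learned gate helps gradients survive, compare two scalar backpropagation paths over 100 time steps: one repeatedly scaled by a fixed recurrent weight and a stand-in for a tanh derivative, the other scaled by a gate that has learned to stay open (close to 1). The numbers 0.9, 0.7, and 0.99 below are illustrative stand-ins only, not learned values.

```python
import numpy as np

steps = 100

# Path through a simple RNN: the gradient is scaled by the same recurrent
# weight (and an activation derivative below 1) at every step, so it decays fast.
w, grad_simple = 0.9, 1.0
for _ in range(steps):
    grad_simple *= w * 0.7  # 0.7 stands in for a typical tanh derivative

# Path through a learned gate: when the gate stays near 1 for information
# worth keeping, the carried signal is barely attenuated across the sequence.
gate_value, grad_gated = 0.99, 1.0
for _ in range(steps):
    grad_gated *= gate_value

print(f"simple RNN path after {steps} steps: {grad_simple:.2e}")
print(f"gated path after {steps} steps:      {grad_gated:.2e}")
```

The first product vanishes to roughly 1e-20, while the gated product remains around 0.37, which is the intuition behind why gated cells preserve information over long spans.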
In the following sections, we will examine the specific gates used within an LSTM cell: the forget gate, the input gate, and the output gate, along with the central component they regulate – the cell state.