In the previous sections, we discussed the SimpleRNN layer and identified its limitations, particularly the vanishing gradient problem, which makes it difficult for the network to learn dependencies across long sequences. To overcome this, more sophisticated recurrent architectures were developed. One of the most successful and widely used is the Long Short-Term Memory (LSTM) network.
LSTMs introduce a mechanism to explicitly manage the flow of information over time, allowing them to selectively remember or forget information. This is achieved through a system of gates controlling a dedicated cell state ($c_t$), which acts like a conveyor belt running through the entire sequence, carrying information with minimal manipulation.
An LSTM cell processes the input at the current timestep ($x_t$) along with the hidden state from the previous timestep ($h_{t-1}$). Unlike SimpleRNN, it uses three primary gates and updates both a hidden state ($h_t$) and the cell state ($c_t$).

The forget gate is a sigmoid layer that looks at $h_{t-1}$ and $x_t$ and outputs a value between 0 and 1 for each element of the previous cell state $c_{t-1}$, deciding how much of it to keep.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

The input gate decides which new information to store. A sigmoid layer ($i_t$) determines which values to update, while a tanh layer ($\tilde{c}_t$) creates a vector of new candidate values to be added to the state.

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

The cell state is then updated by scaling the old state with the forget gate and adding the candidate values scaled by the input gate:

$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$$

Finally, the output gate determines what to expose as output. A sigmoid layer decides which parts of the cell state to output; the cell state is passed through tanh (to push values between -1 and 1) and multiplied by the output of the sigmoid gate. This filtered version becomes the new hidden state $h_t$.

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(c_t)$$

The hidden state $h_t$ is the output of the LSTM unit for the current timestep. The combination of the cell state and the gates allows LSTMs to maintain relevant information over much longer sequences compared to SimpleRNNs, mitigating the vanishing gradient problem.
A conceptual diagram of an LSTM cell showing the flow of information through the forget, input, and output gates, interacting with the cell state and hidden state.
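To make these updates concrete, here is a minimal NumPy sketch of a single LSTM timestep that follows the equations above. The weight matrices and bias vectors (W_f, b_f, and so on) are illustrative placeholders for this sketch, not the parameter layout Keras actually uses internally.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM step for a single example: x_t has shape (features,),
# h_prev and c_prev have shape (units,), each W_* has shape (units, units + features)
def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)        # forget gate
    i_t = sigmoid(W_i @ z + b_i)        # input gate
    c_tilde = np.tanh(W_C @ z + b_C)    # candidate values
    c_t = f_t * c_prev + i_t * c_tilde  # updated cell state
    o_t = sigmoid(W_o @ z + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t

Running this function once per timestep, carrying h_t and c_t forward, is conceptually what the Keras LSTM layer does across an entire batch of sequences.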
Using LSTMs in Keras is straightforward, thanks to the keras.layers.LSTM layer. It functions similarly to SimpleRNN but incorporates the more complex internal logic described above.
import keras
from keras import layers
# Define an LSTM layer with 64 units
# Assuming input shape is (batch_size, timesteps, features)
# For example, (32, 10, 8) means 32 sequences, 10 timesteps each, 8 features per timestep
lstm_layer = layers.LSTM(units=64)
# You can add it to a Sequential model:
model = keras.Sequential([
    # Input shape required for the first layer
    layers.Input(shape=(None, 8)),  # (timesteps, features) - None allows variable sequence length
    layers.LSTM(units=64, return_sequences=True),  # Returns the full sequence output
    layers.LSTM(units=32),  # Returns only the last output
    layers.Dense(units=10)  # Example final classification layer
])
model.summary()
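As a sanity check that connects the summary output to the gate equations, you can predict each layer's parameter count: an LSTM layer learns four weight matrices over $[h_{t-1}, x_t]$ and four bias vectors (the forget, input, and output gates plus the candidate), giving 4 × (units × (units + input_dim) + units) trainable parameters. For the first LSTM layer above, that is 4 × (64 × (64 + 8) + 64) = 18,688 parameters.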
Key parameters for keras.layers.LSTM:

- units: The dimensionality of the output space, which also corresponds to the dimensionality of the hidden state $h_t$ and the cell state $c_t$. This is a required argument.
- activation: The activation function applied to the candidate cell state ($\tilde{c}_t$) and to the final hidden state calculation ($h_t$). The default is 'tanh'.
- recurrent_activation: The activation function used for the three gates (forget, input, output). The default is 'sigmoid'.
- return_sequences: A boolean value. If False (the default), the layer returns only the hidden state for the last timestep in the input sequence ($h_T$). This is suitable when the LSTM layer is the final recurrent layer before a Dense layer for tasks like sequence classification. If True, the layer returns the hidden state for every timestep ($h_1, h_2, ..., h_T$). This is necessary when stacking LSTM layers (so the next LSTM layer receives a sequence as input) or for sequence-to-sequence tasks where an output is needed at each step. See the shape example after this list.
- input_shape: Like other Keras layers, you need to specify the shape of the input for the first layer in a model. For recurrent layers, this is typically (timesteps, features). You can use None for the timesteps dimension if your sequences have variable lengths.

By default, the LSTM layer uses optimized cuDNN kernels when running on a compatible GPU, providing significant speedups during training; this fast path requires keeping the default activation and recurrent_activation values, among other conditions.
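The effect of return_sequences is easiest to see by comparing output shapes. The sketch below assumes a random batch of 32 sequences, each with 10 timesteps and 8 features, matching the shapes from the earlier example; the shapes in the comments are the expected results.

import numpy as np
from keras import layers

# A batch of 32 sequences, 10 timesteps each, 8 features per timestep
x = np.random.random((32, 10, 8)).astype("float32")

# Default (return_sequences=False): only the last hidden state is returned
last_state = layers.LSTM(units=64)(x)
print(last_state.shape)   # (32, 64)

# return_sequences=True: the hidden state at every timestep is returned
all_states = layers.LSTM(units=64, return_sequences=True)(x)
print(all_states.shape)   # (32, 10, 64)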
Compared to SimpleRNN, the LSTM layer involves more computations per timestep due to its internal gating mechanisms. However, this complexity is precisely what allows it to effectively learn long-range dependencies, making it a much more powerful tool for many sequence modeling tasks. In the practice section later in this chapter, you'll implement an LSTM model for text classification.