Just as we stack dense layers in a feedforward network to create deeper models capable of learning more complex representations, we can also stack recurrent layers like LSTMs or GRUs. This technique allows the network to learn hierarchical temporal features. The first recurrent layer might process the raw input sequence and capture lower-level patterns, while subsequent layers operate on the sequence of hidden states from the layer below, potentially learning higher-level or longer-range temporal abstractions.
Stacking recurrent layers increases the model's representational capacity. Each layer adds more parameters and computational steps, allowing the network to model more intricate relationships within the sequential data.
The return_sequences Parameter

The most important concept when stacking recurrent layers is controlling what each layer outputs. Recurrent layers in frameworks like TensorFlow/Keras and PyTorch typically have an option, often named return_sequences (Keras) or implicitly controlled by how you use the output (PyTorch), that determines whether the layer outputs:

- Only the final hidden state: return_sequences=False (the Keras default). The output shape is typically (batch_size, units), representing the hidden state after processing the very last time step. This is suitable for the final recurrent layer when the task involves summarizing the entire sequence, as in sequence classification.
- The full sequence of hidden states: return_sequences=True. The output shape is (batch_size, time_steps, units), providing the hidden state for each time step in the input sequence. This is necessary for any recurrent layer whose output needs to be fed into another recurrent layer, since the next layer expects a sequence as input. It is also used when the output at each time step is required, such as in sequence-to-sequence tasks or when applying attention mechanisms later.
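A quick shape check makes this distinction concrete. The snippet below is a minimal illustration; the layer width of 8 units and the toy batch dimensions are arbitrary choices for this example:

import tensorflow as tf

# A toy batch: 4 sequences, 12 time steps, 10 features per step
x = tf.random.normal((4, 12, 10))

# return_sequences=False (the default): only the last hidden state is returned
last_state = tf.keras.layers.LSTM(8)(x)
print(last_state.shape)  # (4, 8) -> (batch_size, units)

# return_sequences=True: one hidden state per time step is returned
all_states = tf.keras.layers.LSTM(8, return_sequences=True)(x)
print(all_states.shape)  # (4, 12, 8) -> (batch_size, time_steps, units)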
Let's see how stacking works conceptually in popular frameworks. In Keras, stacking is straightforward using the Sequential API or the Functional API. You simply add recurrent layers one after another, ensuring that all layers except possibly the last one have return_sequences=True.
import tensorflow as tf

# Assume input shape = (time_steps, features)
# For variable-length sequences, time_steps can be None: (None, features)
num_features = 10
num_units_l1 = 64
num_units_l2 = 32
output_dim = 5  # Example for classification

model = tf.keras.Sequential([
    # Input shape is required only for the first layer
    tf.keras.layers.LSTM(num_units_l1, return_sequences=True,
                         input_shape=(None, num_features)),
    # Layer 1 outputs shape: (batch_size, time_steps, num_units_l1)

    # This layer receives the full sequence from the previous layer
    tf.keras.layers.GRU(num_units_l2, return_sequences=False),  # Or True if needed later
    # Layer 2 (with return_sequences=False) outputs shape: (batch_size, num_units_l2)

    # Add a dense layer for classification/regression
    tf.keras.layers.Dense(output_dim, activation='softmax')  # Example activation
])

model.summary()
A simple two-layer stacked recurrent network. The first LSTM layer must return sequences to feed the second GRU layer. The final GRU layer returns only the last hidden state, suitable for a subsequent Dense layer performing classification.
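Since the Functional API was mentioned as an alternative, here is a brief, equivalent sketch of the same stack built that way. The layer sizes simply mirror the Sequential example above and are otherwise arbitrary:

import tensorflow as tf

num_features, num_units_l1, num_units_l2, output_dim = 10, 64, 32, 5

# The same two-layer stack expressed with the Keras Functional API
inputs = tf.keras.Input(shape=(None, num_features))  # variable-length sequences
x = tf.keras.layers.LSTM(num_units_l1, return_sequences=True)(inputs)
x = tf.keras.layers.GRU(num_units_l2)(x)              # return_sequences=False by default
outputs = tf.keras.layers.Dense(output_dim, activation='softmax')(x)

functional_model = tf.keras.Model(inputs, outputs)
functional_model.summary()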
In PyTorch, you define the layers in the __init__ method and specify the connections in the forward method. You need to manually pass the output sequence of one layer as the input to the next. PyTorch's nn.LSTM and nn.GRU modules return the full output sequence and the final hidden/cell states. You typically use the output sequence tensor for stacking.
It's worth noting that PyTorch's nn.LSTM and nn.GRU also have a num_layers parameter, allowing you to create a stacked recurrent layer internally within a single module instance. This is often more computationally efficient than manually stacking separate layer instances, especially on GPUs, due to optimized kernels (like cuDNN). However, manually stacking gives more flexibility if you want different layer types (e.g., LSTM followed by GRU) or configurations per layer.
Here's an example of manual stacking:
import torch
import torch.nn as nn

class StackedRNN(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size):
        super().__init__()
        # batch_first=True makes input/output tensors shape (batch, seq_len, features)
        self.lstm1 = nn.LSTM(input_size, hidden_size1, batch_first=True)
        # The input size for the second layer is the hidden size of the first
        self.gru2 = nn.GRU(hidden_size1, hidden_size2, batch_first=True)
        self.fc = nn.Linear(hidden_size2, output_size)

    def forward(self, x):
        # x shape: (batch_size, seq_length, input_size)
        # output_seq1 shape: (batch_size, seq_length, hidden_size1)
        # final_states1 is a tuple (h_n, c_n) for the LSTM
        output_seq1, final_states1 = self.lstm1(x)

        # Feed the output sequence of lstm1 to gru2
        # output_seq2 shape: (batch_size, seq_length, hidden_size2)
        # final_state2 is h_n for the GRU
        output_seq2, final_state2 = self.gru2(output_seq1)

        # If we need the output of the last time step for classification:
        # output_seq2[:, -1, :] selects the output for the last time step
        # Shape becomes: (batch_size, hidden_size2)
        last_time_step_output = output_seq2[:, -1, :]

        # Pass the final output through a dense layer
        out = self.fc(last_time_step_output)
        # out shape: (batch_size, output_size)
        return out
# Example usage:
# input_features = 10
# seq_len = 20
# batch_size = 4
# model = StackedRNN(input_size=10, hidden_size1=64, hidden_size2=32, output_size=5)
# dummy_input = torch.randn(batch_size, seq_len, input_features)
# output = model(dummy_input)
# print(output.shape) # Should be torch.Size([4, 5])
Using the num_layers argument in PyTorch:
import torch
import torch.nn as nn

class StackedRNNInternal(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        # Creates a stacked LSTM internally
        # (add dropout between layers via the dropout argument if needed)
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # output_seq contains the hidden states of the *last* layer for each time step
        # h_n and c_n contain the final hidden/cell states for *all* layers
        output_seq, (h_n, c_n) = self.lstm(x)

        # Use the output of the last time step from the last layer
        last_time_step_output = output_seq[:, -1, :]
        out = self.fc(last_time_step_output)
        return out
# Example usage:
# model_internal = StackedRNNInternal(input_size=10, hidden_size=64, num_layers=2, output_size=5)
# dummy_input = torch.randn(4, 20, 10)
# output = model_internal(dummy_input)
# print(output.shape) # Should be torch.Size([4, 5])
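As a quick sanity check on how the built-in stacking behaves, the sketch below (toy sizes, chosen arbitrarily) enables inter-layer dropout via the dropout argument and confirms that h_n[-1], the final hidden state of the last layer, matches the last time step of output_seq for a unidirectional LSTM:

import torch
import torch.nn as nn

# Two stacked LSTM layers with dropout applied between them
# (the dropout argument only has an effect when num_layers > 1)
lstm = nn.LSTM(input_size=10, hidden_size=64, num_layers=2,
               batch_first=True, dropout=0.2)
lstm.eval()  # disable dropout for a deterministic check

x = torch.randn(4, 20, 10)
output_seq, (h_n, c_n) = lstm(x)

print(output_seq.shape)  # torch.Size([4, 20, 64]) -> last layer, every time step
print(h_n.shape)         # torch.Size([2, 4, 64])  -> final hidden state of each layer

# For a unidirectional LSTM, the last layer's final hidden state equals
# the last time step of the output sequence
print(torch.allclose(h_n[-1], output_seq[:, -1, :]))  # True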
We can visualize a simple stacked architecture:
Data flow in a two-layer stacked RNN, where the final output is taken from the last time step of the second recurrent layer for processing by a Dense layer. Note the requirement return_sequences=True for the first layer.
While stacking can enhance model performance, keep in mind that every additional layer adds parameters and computation, increases training time, and raises the risk of overfitting, so deeper stacks call for careful regularization and tuning rather than being added by default.
In the upcoming sections, we'll explore another common architectural pattern: bidirectional RNNs, which process sequences in both directions. Following that, we'll put these implementation concepts into practice with a hands-on sentiment analysis example.