Standard recurrent layers, whether simple RNNs, LSTMs, or GRUs, process sequences in a single direction, typically from the beginning to the end (t=1 to T). At any given time step t, the hidden state ht summarizes information only from the past inputs (x1,...,xt). While this captures historical context effectively, many sequence modeling tasks benefit from understanding the context that follows a particular element as well.
Consider tasks like sentiment analysis or named entity recognition. The meaning of a word or its role in a sentence often depends not just on what came before it, but also on what comes after. For instance, identifying whether "bank" refers to a financial institution or a river bank might require looking ahead in the sentence.
This is where Bidirectional RNNs (BiRNNs) come into play. The core idea is straightforward: instead of one RNN processing the sequence in a single direction, we use two independent RNNs. One processes the sequence forward (from x1 to xT), while the other processes it backward (from xT to x1).
At each time step t, the final hidden state representation combines the information from both directions. The most common combination strategy is concatenation:
$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$
Here, [⋅;⋅] denotes vector concatenation of the forward state and the backward state. If the forward and backward layers each have units hidden units, the resulting combined hidden state ht at each time step will have 2 * units dimensions. This combined state captures context from both the past and the future relative to the current time step t.
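To make this concrete, the following is a minimal hand-written sketch of one bidirectional pass using PyTorch GRU cells. The sizes, variable names, and the use of GRUCell here are illustrative assumptions for the example, not how library bidirectional layers are implemented internally.
import torch
import torch.nn as nn
units = 4                                # hidden units per direction (toy value)
features = 3
seq_len = 5
fwd_cell = nn.GRUCell(features, units)   # reads the sequence left to right
bwd_cell = nn.GRUCell(features, units)   # reads the sequence right to left
x = torch.randn(seq_len, features)       # one sequence; batch dimension omitted for clarity
h_fwd, h_bwd = [torch.zeros(1, units)], [torch.zeros(1, units)]
for t in range(seq_len):                 # forward pass: t = 0 .. T-1
    h_fwd.append(fwd_cell(x[t].unsqueeze(0), h_fwd[-1]))
for t in reversed(range(seq_len)):       # backward pass: t = T-1 .. 0
    h_bwd.append(bwd_cell(x[t].unsqueeze(0), h_bwd[-1]))
h_bwd = h_bwd[1:][::-1]                  # re-align backward states with forward time order
# Concatenate the two states at each time step: each combined vector has 2 * units entries
h_combined = [torch.cat([hf, hb], dim=-1) for hf, hb in zip(h_fwd[1:], h_bwd)]
print(h_combined[0].shape)               # torch.Size([1, 8])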
Diagram of a bidirectional RNN at time step t. It comprises two independent RNN layers that both receive the input xt: the forward layer also uses the previous forward state ht−1, and the backward layer uses the next backward state ht+1. Their outputs are typically concatenated to form the final hidden state ht.
Deep learning libraries provide convenient ways to create bidirectional recurrent layers.
TensorFlow (Keras API)
In Keras, you use the tf.keras.layers.Bidirectional wrapper, which takes an instance of a recurrent layer (like LSTM or GRU) as its primary argument.
import tensorflow as tf
# Assume input_shape = (batch_size, timesteps, features)
# Example: (32, 20, 10) -> 20 time steps, 10 features per step
# Create a Bidirectional LSTM layer
# hidden_units defines the dimensionality of the output space for EACH direction
hidden_units = 64
lstm_layer = tf.keras.layers.LSTM(hidden_units, return_sequences=True)
bidirectional_lstm = tf.keras.layers.Bidirectional(lstm_layer)
# If the input shape is (32, 20, 10) and hidden_units=64:
# The output shape will be (32, 20, 128) because forward (64) and backward (64)
# outputs are concatenated by default (merge_mode='concat').
# Example usage in a Sequential model:
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(20, 10)), # Define input shape explicitly
bidirectional_lstm,
# Add subsequent layers; since return_sequences=True, this Dense layer is applied at every time step
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
The Bidirectional wrapper handles creating the forward and backward instances of the provided layer and merging their outputs. The default merge_mode is 'concat', which concatenates the forward and backward outputs along the last axis. Other options include 'sum', 'mul', and 'ave', but concatenation is the most common.
If return_sequences=True in the wrapped layer, the Bidirectional layer outputs the combined hidden state for each time step. If return_sequences=False, it outputs only the final combined hidden state: the concatenation of the last forward state (computed at time step T) with the final backward state (computed at time step 1, after the backward layer has read the whole sequence).
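As a quick check of these shape rules, the short snippet below (a hypothetical example reusing the toy dimensions from above) compares return_sequences=False with a non-default merge_mode:
import tensorflow as tf
x = tf.random.normal((32, 20, 10))   # toy batch: 32 sequences, 20 steps, 10 features
# return_sequences=False: one combined vector per sequence
bi_last = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=False))
print(bi_last(x).shape)   # (32, 128): last forward state concatenated with the final backward state
# merge_mode='sum': forward and backward outputs are added, so the feature size stays 64
bi_sum = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True), merge_mode='sum')
print(bi_sum(x).shape)    # (32, 20, 64)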
PyTorch
In PyTorch, bidirectionality is supported directly as an argument to the nn.LSTM and nn.GRU layer constructors.
import torch
import torch.nn as nn
# Assume input shape: (batch_size, seq_len, input_size)
# Example: (32, 20, 10)
# Create a Bidirectional GRU layer
input_size = 10
hidden_size = 64 # Defines the dimensionality for EACH direction
num_layers = 1 # Number of stacked layers (can be > 1)
# Set bidirectional=True
bi_gru_layer = nn.GRU(input_size=input_size,
                      hidden_size=hidden_size,
                      num_layers=num_layers,
                      batch_first=True,    # Input/output tensors provide batch dim first
                      bidirectional=True)
# Example input tensor
batch_size = 32
seq_len = 20
input_tensor = torch.randn(batch_size, seq_len, input_size)
# Forward pass
# output shape: (batch_size, seq_len, num_directions * hidden_size) -> (32, 20, 2 * 64)
# hn shape: (num_layers * num_directions, batch_size, hidden_size) -> (1 * 2, 32, 64)
output, hn = bi_gru_layer(input_tensor)
print("Output shape:", output.shape)
print("Final hidden state shape:", hn.shape)
# The 'output' tensor contains the concatenated forward and backward hidden states
# at each time step.
# output[batch, t, :hidden_size] is the forward state at time t
# output[batch, t, hidden_size:] is the backward state at time t
# The 'hn' tensor contains the final hidden states for each layer and direction.
# For a single layer BiGRU:
# hn[0, :, :] is the final forward hidden state h_T->
# hn[1, :, :] is the final backward hidden state h_1<- (from the start of the reversed sequence)
Setting bidirectional=True automatically doubles the effective size of the output features (num_directions * hidden_size) because the forward and backward hidden states are concatenated. The batch_first=True argument is convenient because it aligns the tensor dimensions with common practice (batch, sequence, features).
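If you need a single vector per sequence, for example as input to a classification head, you can concatenate the two final states from hn. The following self-contained sketch reuses the toy shapes from above; the variable names are illustrative:
import torch
import torch.nn as nn
bi_gru = nn.GRU(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(32, 20, 10)                      # (batch, seq_len, input_size)
output, hn = bi_gru(x)
# Final forward state (after reading x_1..x_T) and final backward state (after reading x_T..x_1)
h_final = torch.cat([hn[0], hn[1]], dim=-1)      # shape (32, 128)
# Consistency check against 'output': the forward half matches the last time step,
# the backward half matches the first time step.
assert torch.allclose(h_final[:, :64], output[:, -1, :64])
assert torch.allclose(h_final[:, 64:], output[:, 0, 64:])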
Bidirectional RNNs are particularly effective for offline tasks where the entire input sequence is available before making predictions. They often yield better performance than unidirectional RNNs on tasks such as sentiment analysis and named entity recognition, where the interpretation of each element depends on its full surrounding context.
However, BiRNNs are generally not suitable for online tasks or real-time predictions (like stock market forecasting based only on past data), because the backward pass requires information from future time steps, which is not available in a real-time scenario.
Implementing bidirectional layers introduces roughly double the parameters and computation compared to a unidirectional layer with the same hidden size, as it involves training two separate RNNs. This is a trade-off to consider against the potential performance improvement gained from using bidirectional context.
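To see this cost concretely, the quick check below (an illustrative sketch using the toy sizes from the earlier example) counts the parameters of a unidirectional and a bidirectional GRU layer:
import torch.nn as nn
def count_params(module):
    return sum(p.numel() for p in module.parameters())
uni = nn.GRU(input_size=10, hidden_size=64, batch_first=True)
bi = nn.GRU(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)
# For a single layer, the bidirectional version has twice the parameters,
# since the reverse direction is a second, independent set of weights.
print(count_params(uni), count_params(bi))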