Standard recurrent layers, whether simple RNNs, LSTMs, or GRUs, process sequences in a single direction, typically from the beginning to the end (t=1 to T). At any given time step t, the hidden state ht summarizes information only from the past inputs (x1,...,xt). While this captures historical context effectively, many sequence modeling tasks benefit from understanding the context that follows a particular element as well.
Consider tasks like sentiment analysis or named entity recognition. The meaning of a word or its role in a sentence often depends not just on what came before it, but also on what comes after. For instance, identifying whether "bank" refers to a financial institution or a river bank might require looking ahead in the sentence.
This is where Bidirectional RNNs (BiRNNs) come into play. The core idea is straightforward: instead of one RNN processing the sequence in a single direction, we use two independent RNNs. One processes the sequence forward (from x1 to xT), while the other processes it backward (from xT to x1).
At each time step t, the final hidden state representation combines the information from both directions. The most common combination strategy is concatenation:
$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$
Here, [⋅;⋅] denotes vector concatenation of the forward state and the backward state. If the forward and backward layers each have units hidden units, the resulting combined hidden state ht at each time step will have 2 * units dimensions. This combined state captures context from both the past and the future relative to the current time step t.
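To make this concrete, the following is a minimal hand-written sketch of one bidirectional pass using PyTorch GRU cells. The sizes, variable names, and the use of GRUCell here are illustrative assumptions for the example, not how library bidirectional layers are implemented internally.
import torch
import torch.nn as nn
units = 4                                # hidden units per direction (toy value)
features = 3
seq_len = 5
fwd_cell = nn.GRUCell(features, units)   # reads the sequence left to right
bwd_cell = nn.GRUCell(features, units)   # reads the sequence right to left
x = torch.randn(seq_len, features)       # one sequence; batch dimension omitted for clarity
h_fwd, h_bwd = [torch.zeros(1, units)], [torch.zeros(1, units)]
for t in range(seq_len):                 # forward pass: t = 0 .. T-1
    h_fwd.append(fwd_cell(x[t].unsqueeze(0), h_fwd[-1]))
for t in reversed(range(seq_len)):       # backward pass: t = T-1 .. 0
    h_bwd.append(bwd_cell(x[t].unsqueeze(0), h_bwd[-1]))
h_bwd = h_bwd[1:][::-1]                  # re-align backward states with forward time order
# Concatenate the two states at each time step: each combined vector has 2 * units entries
h_combined = [torch.cat([hf, hb], dim=-1) for hf, hb in zip(h_fwd[1:], h_bwd)]
print(h_combined[0].shape)               # torch.Size([1, 8])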
Diagram of a bidirectional RNN at time step t. It comprises two independent RNN layers that both receive the input xt: the forward layer also uses the previous forward state ht−1, and the backward layer uses the next backward state ht+1. Their outputs are typically concatenated to form the final hidden state ht.
Deep learning libraries provide convenient ways to create bidirectional recurrent layers.
TensorFlow (Keras API)
In Keras, you use the tf.keras.layers.Bidirectional wrapper, which takes an instance of a recurrent layer (like LSTM or GRU) as its primary argument.
import tensorflow as tf
# Assume input_shape = (batch_size, timesteps, features)
# Example: (32, 20, 10) -> 20 time steps, 10 features per step
# Create a Bidirectional LSTM layer
# hidden_units defines the dimensionality of the output space for EACH direction
hidden_units = 64
lstm_layer = tf.keras.layers.LSTM(hidden_units, return_sequences=True)
bidirectional_lstm = tf.keras.layers.Bidirectional(lstm_layer)
# If the input shape is (32, 20, 10) and hidden_units=64:
# The output shape will be (32, 20, 128) because forward (64) and backward (64)
# outputs are concatenated by default (merge_mode='concat').
# Example usage in a Sequential model:
model = tf.keras.Sequential([
tf.keras.layers.Input(shape=(20, 10)), # Define input shape explicitly
bidirectional_lstm,
# Add subsequent layers; since return_sequences=True, this Dense layer is applied at every time step
tf.keras.layers.Dense(1, activation='sigmoid')
])
model.summary()
The Bidirectional wrapper handles creating the forward and backward instances of the provided layer and merging their outputs. The default merge_mode is 'concat', which concatenates the forward and backward outputs along the last axis. Other options include 'sum', 'mul', and 'ave', but concatenation is the most common.
If return_sequences=True in the wrapped layer, the Bidirectional layer outputs the combined hidden state for each time step. If return_sequences=False, it outputs only the final combined hidden state: the concatenation of the last forward state (computed at time step T) with the final backward state (computed at time step 1, after the backward layer has read the whole sequence).
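As a quick check of these shape rules, the short snippet below (a hypothetical example reusing the toy dimensions from above) compares return_sequences=False with a non-default merge_mode:
import tensorflow as tf
x = tf.random.normal((32, 20, 10))   # toy batch: 32 sequences, 20 steps, 10 features
# return_sequences=False: one combined vector per sequence
bi_last = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=False))
print(bi_last(x).shape)   # (32, 128): last forward state concatenated with the final backward state
# merge_mode='sum': forward and backward outputs are added, so the feature size stays 64
bi_sum = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True), merge_mode='sum')
print(bi_sum(x).shape)    # (32, 20, 64)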
PyTorch
In PyTorch, bidirectionality is supported directly as an argument to the nn.LSTM and nn.GRU layer constructors.
import torch
import torch.nn as nn
# Assume input shape: (batch_size, seq_len, input_size)
# Example: (32, 20, 10)
# Create a Bidirectional GRU layer
input_size = 10
hidden_size = 64 # Defines the dimensionality for EACH direction
num_layers = 1 # Number of stacked layers (can be > 1)
# Set bidirectional=True
bi_gru_layer = nn.GRU(input_size=input_size,
                      hidden_size=hidden_size,
                      num_layers=num_layers,
                      batch_first=True,    # Input/output tensors provide batch dim first
                      bidirectional=True)
# Example input tensor
batch_size = 32
seq_len = 20
input_tensor = torch.randn(batch_size, seq_len, input_size)
# Forward pass
# output shape: (batch_size, seq_len, num_directions * hidden_size) -> (32, 20, 2 * 64)
# hn shape: (num_layers * num_directions, batch_size, hidden_size) -> (1 * 2, 32, 64)
output, hn = bi_gru_layer(input_tensor)
print("Output shape:", output.shape)
print("Final hidden state shape:", hn.shape)
# The 'output' tensor contains the concatenated forward and backward hidden states
# at each time step.
# output[batch, t, :hidden_size] is the forward state at time t
# output[batch, t, hidden_size:] is the backward state at time t
# The 'hn' tensor contains the final hidden states for each layer and direction.
# For a single layer BiGRU:
# hn[0, :, :] is the final forward hidden state h_T->
# hn[1, :, :] is the final backward hidden state h_1<- (from the start of the reversed sequence)
Setting bidirectional=True automatically doubles the effective size of the output features (num_directions * hidden_size) because the forward and backward hidden states are concatenated. The batch_first=True argument is convenient because it aligns the tensor dimensions with common practice (batch, sequence, features).
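If you need a single vector per sequence, for example as input to a classification head, you can concatenate the two final states from hn. The following self-contained sketch reuses the toy shapes from above; the variable names are illustrative:
import torch
import torch.nn as nn
bi_gru = nn.GRU(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(32, 20, 10)                      # (batch, seq_len, input_size)
output, hn = bi_gru(x)
# Final forward state (after reading x_1..x_T) and final backward state (after reading x_T..x_1)
h_final = torch.cat([hn[0], hn[1]], dim=-1)      # shape (32, 128)
# Consistency check against 'output': the forward half matches the last time step,
# the backward half matches the first time step.
assert torch.allclose(h_final[:, :64], output[:, -1, :64])
assert torch.allclose(h_final[:, 64:], output[:, 0, 64:])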
Bidirectional RNNs are particularly effective for offline tasks where the entire input sequence is available before making predictions. They often yield better performance than unidirectional RNNs on tasks such as sentiment analysis and named entity recognition, where the interpretation of each element depends on its full surrounding context.
However, BiRNNs are generally not suitable for online tasks or real-time predictions (like stock market forecasting based only on past data), because the backward pass requires information from future time steps, which is not available in a real-time scenario.
Implementing bidirectional layers introduces roughly double the parameters and computation compared to a unidirectional layer with the same hidden size, as it involves training two separate RNNs. This is a trade-off to consider against the potential performance improvement gained from using bidirectional context.
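To see this cost concretely, the quick check below (an illustrative sketch using the toy sizes from the earlier example) counts the parameters of a unidirectional and a bidirectional GRU layer:
import torch.nn as nn
def count_params(module):
    return sum(p.numel() for p in module.parameters())
uni = nn.GRU(input_size=10, hidden_size=64, batch_first=True)
bi = nn.GRU(input_size=10, hidden_size=64, batch_first=True, bidirectional=True)
# For a single layer, the bidirectional version has twice the parameters,
# since the reverse direction is a second, independent set of weights.
print(count_params(uni), count_params(bi))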