Having explored the Long Short-Term Memory (LSTM) architecture, we now turn our attention to its close relative: the Gated Recurrent Unit (GRU). As discussed in Chapter 6, GRUs offer a simplified gating mechanism compared to LSTMs, often achieving comparable performance with fewer parameters and potentially faster computation. This section focuses on how to implement GRU layers using popular deep learning frameworks like TensorFlow (with Keras API) and PyTorch.
TensorFlow's high-level Keras API provides a straightforward way to incorporate GRU layers into your models using tf.keras.layers.GRU. Its usage pattern is very similar to the SimpleRNN and LSTM layers we've encountered.
To create a basic GRU layer, you primarily need to specify the number of units in the layer. This number defines the dimensionality of the hidden state and, consequently, of the layer's output: (batch_size, units) when return_sequences=False (the default).
import tensorflow as tf
# Define a GRU layer with 64 units
# 'units' specifies the dimensionality of the hidden state and output space
gru_layer = tf.keras.layers.GRU(units=64)
# Example Input Shape: (batch_size, timesteps, features)
# e.g., a batch of 32 sequences, each 10 steps long, with 8 features per step
sample_input = tf.random.normal([32, 10, 8])
# Pass input through the layer
output = gru_layer(sample_input)
# By default, GRU returns only the output of the last time step
# Default output shape: (batch_size, units)
print(f"Input shape: {sample_input.shape}")
print(f"Output shape (default): {output.shape}")
Here are some of the most commonly used parameters for tf.keras.layers.GRU:
- units: (Required) Integer, the dimensionality of the output space and the hidden state.
- activation: Activation function for the candidate hidden state computation. Defaults to tanh, the standard choice for GRUs, allowing the state to range between -1 and 1.
- recurrent_activation: Activation function used for the reset (r_t) and update (z_t) gates. Defaults to sigmoid. This is essential, as gates need to output values between 0 and 1 to control information flow effectively.
- return_sequences: Boolean. If True, the layer returns the hidden state output for every time step in the input sequence, giving an output shape of (batch_size, timesteps, units). If False (the default), it returns only the final hidden state output from the last time step, giving an output shape of (batch_size, units). You typically set return_sequences=True for all recurrent layers except possibly the last one in a stack, or whenever the subsequent layer expects a sequence.
- return_state: Boolean. If True, the layer returns the final hidden state tensor in addition to the output(s). For a GRU layer, this is a single tensor representing the hidden state at the last time step, and the output becomes a list: [output, final_state]. A short example appears after the next code block.
- go_backwards: Boolean (default: False). If True, the input sequence is processed in reverse order, and the reversed sequence is returned if return_sequences=True.
- reset_after: Boolean (default: True). Determines the GRU calculation variant: True applies the reset gate after the matrix multiplication for the candidate hidden state, while False applies it before. The default True often works well in practice.

Let's observe the effect of return_sequences=True:
# Create a GRU layer that returns the full sequence of outputs
gru_layer_seq = tf.keras.layers.GRU(units=64, return_sequences=True)
# Pass the same sample input
output_seq = gru_layer_seq(sample_input)
# Output shape now includes the time steps dimension
# Output shape: (batch_size, timesteps, units)
print(f"Output shape (return_sequences=True): {output_seq.shape}")
As you can see, setting return_sequences=True preserves the temporal dimension in the output, making it suitable for feeding into subsequent recurrent layers or for tasks requiring outputs at each time step.
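The return_state option can be illustrated in the same way. The following is a minimal sketch reusing the sample_input tensor from above; with the default return_sequences=False, the returned output and the final hidden state of a GRU are the same tensor.

# Create a GRU layer that also returns its final hidden state
gru_layer_state = tf.keras.layers.GRU(units=64, return_state=True)
# The layer now returns a list: [output, final_state]
output_last, final_state = gru_layer_state(sample_input)
# With return_sequences=False (the default), the output and the final
# hidden state are identical for a GRU
print(f"Output shape: {output_last.shape}")       # (32, 64)
print(f"Final state shape: {final_state.shape}")  # (32, 64)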
In PyTorch, the equivalent functionality is provided by the torch.nn.GRU module. When initializing this module, you need to specify the size of the input features and the desired size of the hidden state.
A point requiring careful attention in PyTorch is the default expected shape for input sequences. Unlike Keras, PyTorch's recurrent layers, including torch.nn.GRU, default to expecting input tensors with the shape (sequence_length, batch_size, features). However, data loading and preprocessing pipelines often yield data in the format (batch_size, sequence_length, features). To handle this common format directly, you must set the batch_first argument to True when creating the GRU instance.
import torch
import torch.nn as nn
# Define network parameters
input_features = 8 # Number of features per time step in the input
hidden_units = 64 # Number of units in the GRU hidden state
batch_size = 32 # Number of sequences in a batch
seq_length = 10 # Length of each sequence
# Define a GRU module
# Ensure batch_first=True to use (batch, seq, feature) input format
gru_module = nn.GRU(input_size=input_features,
                    hidden_size=hidden_units,
                    batch_first=True)  # Crucial for the common (batch, seq, feature) data shape
# Example Input: Tensor shape (batch_size, seq_length, input_features)
sample_input_pt = torch.randn(batch_size, seq_length, input_features)
# Pass input through the module.
# By default, PyTorch GRU returns two items:
# 1. output_seq: Tensor containing the output hidden state for each time step.
# 2. final_hidden_state: Tensor containing the hidden state for the final time step.
output_seq_pt, final_hidden_state_pt = gru_module(sample_input_pt)
# Output sequence shape: (batch_size, seq_length, hidden_size)
print(f"PyTorch Input shape: {sample_input_pt.shape}")
print(f"PyTorch Output sequence shape: {output_seq_pt.shape}")
# Final hidden state shape: (num_layers * num_directions, batch_size, hidden_size)
# For a single-layer, unidirectional GRU, num_layers=1, num_directions=1.
print(f"PyTorch Final hidden state shape: {final_hidden_state_pt.shape}")
Key parameters for torch.nn.GRU:
- input_size: (Required) The number of expected features in the input x_t.
- hidden_size: (Required) The number of features in the hidden state h_t.
- num_layers: Integer (default: 1). The number of stacked GRU layers; stacking is handled conveniently by this parameter. We will explore stacking in a later section.
- bias: Boolean (default: True). If False, the layer will not use bias weights (b_ir, b_hr, etc.).
- batch_first: Boolean (default: False). If True, the input and output tensors are provided with the batch dimension first: (batch, seq, feature). Set this to True if your data follows this common convention.
- dropout: Float (default: 0). If non-zero, introduces a Dropout layer on the outputs of each GRU layer except the last one, with the specified dropout probability. Useful for regularization.
- bidirectional: Boolean (default: False). If True, creates a bidirectional GRU. We'll cover this architecture later in the chapter.

Notice that the PyTorch GRU module naturally returns the full sequence of outputs (like return_sequences=True in Keras) as the first element of its output tuple. If your task only requires the final hidden state's output (equivalent to Keras' default return_sequences=False), you can extract it from the output_seq_pt tensor. A common way is to select the last time step's output for each sequence in the batch:
# Extracting the output of the very last time step for each sequence in the batch
last_step_output_pt = output_seq_pt[:, -1, :]
print(f"PyTorch Last time step output shape: {last_step_output_pt.shape}")
# Shape: (batch_size, hidden_size)
The second returned item, final_hidden_state_pt, contains the final hidden state, which is useful for initializing subsequent layers or for sequence-to-sequence tasks. Its shape includes the number of layers and directions, which becomes relevant when using stacked or bidirectional GRUs.
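As a quick check, here is a minimal sketch using the tensors defined above: for a single-layer, unidirectional GRU, the final hidden state equals the output at the last time step, and the hidden state can be passed back in as an initial state. The next_chunk tensor below is just another random batch standing in for a hypothetical follow-up segment of the same sequences.

# For a single-layer, unidirectional GRU, the final hidden state matches the
# output at the last time step for every sequence in the batch
print(torch.allclose(final_hidden_state_pt[0], output_seq_pt[:, -1, :]))  # True

# The returned hidden state can be fed back as the initial state (second
# argument) when processing a follow-up chunk of the same sequences
next_chunk = torch.randn(batch_size, seq_length, input_features)  # hypothetical follow-up batch
output_next, hidden_next = gru_module(next_chunk, final_hidden_state_pt)
print(f"Continued output shape: {output_next.shape}")  # (32, 10, 64)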
Now that you can implement both LSTMs and GRUs, choosing between them is largely an empirical matter: as noted at the start of this section, GRUs often match LSTM performance with fewer parameters, so evaluating both on your task is a reasonable strategy.
Having learned how to instantiate and use basic GRU layers in both TensorFlow/Keras and PyTorch, you are ready to integrate them into sequence modeling pipelines. The following sections will build upon this foundation, showing how to configure these layers further, combine them into deeper architectures by stacking, and enhance their ability to capture context using bidirectional processing.