Now that you know how to add LSTM or GRU layers to your models using framework APIs like TensorFlow/Keras or PyTorch, let's look at the common parameters you'll need to configure to tailor these layers to your specific task. Getting these configurations right is important for building effective sequence models.

The most fundamental parameter for both LSTM and GRU layers is typically called units (in Keras/TensorFlow) or hidden_size (in PyTorch). This integer value determines the dimensionality of the hidden state $h_t$ and, for LSTMs, also the cell state $c_t$.
# TensorFlow/Keras Example
import tensorflow as tf

lstm_layer = tf.keras.layers.LSTM(units=64)
gru_layer = tf.keras.layers.GRU(units=32)

# PyTorch Example
# Note: input_size (the number of input features) must also be specified here
import torch

lstm_layer_pytorch = torch.nn.LSTM(input_size=10, hidden_size=64, batch_first=True)
gru_layer_pytorch = torch.nn.GRU(input_size=10, hidden_size=32, batch_first=True)
Think of units as the number of memory cells or neurons within the recurrent layer. A higher number of units allows the layer to potentially capture more complex patterns and dependencies in the sequence data, increasing the model's representational capacity. However, just like in feedforward networks, too many units can lead to overfitting, especially with smaller datasets, and significantly increase computational cost. Finding the right number often involves experimentation and validation.
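One way to get a feel for how units affects capacity is to compare parameter counts. The sketch below is illustrative only: the sequence length (20) and feature count (10) are arbitrary choices, not values from this lesson.

# Compare how the parameter count grows with units (illustrative sizes)
import tensorflow as tf

for units in (32, 64, 128):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20, 10)),       # 20 time steps, 10 features (arbitrary)
        tf.keras.layers.LSTM(units=units),
    ])
    # An LSTM layer has 4 * (units * (features + units) + units) weights
    print(units, model.count_params())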
Recurrent layers like LSTMs and GRUs involve several internal computations, many of which use activation functions. Frameworks typically allow you to configure two main activations:

activation: This function is applied to the cell state update (in LSTMs) and the candidate hidden state calculation. The default and most common choice is the hyperbolic tangent function, tanh. Its output range of $[-1, 1]$ helps regulate the values within the network.

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

recurrent_activation: This function is applied to the gates (forget gate $f_t$, input gate $i_t$, and output gate $o_t$ in the LSTM; reset gate $r_t$ and update gate $z_t$ in the GRU). The standard choice is the sigmoid function, $\sigma(x)$, often referred to as sigmoid or sometimes hard_sigmoid (a computationally cheaper approximation). The sigmoid function outputs values between 0 and 1, making it ideal for gating mechanisms: a value close to 1 means "let the information pass," while a value close to 0 means "block the information."

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
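To make the gating idea concrete, here is a minimal, framework-free sketch of how a sigmoid output scales a candidate vector element by element; the numbers are arbitrary values chosen for illustration.

# Gating in miniature: sigmoid outputs decide how much of each value passes through
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

candidate = np.array([0.8, -0.5, 0.3])        # hypothetical candidate values
gate = sigmoid(np.array([4.0, 0.0, -4.0]))    # roughly 0.98, 0.5, 0.02
print(gate * candidate)                       # roughly [0.79, -0.25, 0.005]: pass, halve, block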
While you can technically change these defaults, tanh for the main activation and sigmoid (or hard_sigmoid) for the recurrent (gate) activations are standard practice and generally work well.
# TensorFlow/Keras Example with custom activations (less common)
lstm_layer = tf.keras.layers.LSTM(
units=128,
activation='relu', # Non-standard choice for main activation
recurrent_activation='sigmoid' # Standard choice for gates
)
# PyTorch: Activations are often implicitly tanh/sigmoid within the nn.LSTM/nn.GRU modules.
# Customizing them might require implementing the cell logic manually.
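As the comment above suggests, PyTorch's built-in nn.LSTM and nn.GRU keep tanh and sigmoid internally and expose no activation arguments. If you genuinely need a different main activation, one option is to write the cell update yourself. The following is a simplified, illustrative LSTM-style cell with ReLU swapped in as the main activation; the class name and internal layout are assumptions for this sketch, not part of the PyTorch API.

import torch
import torch.nn as nn

class ReluLSTMCell(nn.Module):
    """Simplified LSTM-style cell with ReLU as the main activation (illustrative only)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # One linear map produces all four gate pre-activations at once
        self.linear = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        z = self.linear(torch.cat([x_t, h_prev], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates stay sigmoid
        g = torch.relu(g)                      # main activation swapped from tanh to ReLU
        c_t = f * c_prev + i * g               # cell state update
        h_t = o * torch.relu(c_t)              # output uses the same swapped activation
        return h_t, c_t

# Usage sketch: loop over the time steps manually (batch, time, feature sizes are arbitrary)
cell = ReluLSTMCell(input_size=10, hidden_size=64)
x = torch.randn(8, 20, 10)
h = torch.zeros(8, 64)
c = torch.zeros(8, 64)
for t in range(x.size(1)):
    h, c = cell(x[:, t, :], (h, c))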
return_sequences
A critical parameter is return_sequences. It's a boolean flag that determines the shape of the layer's output.

return_sequences=False (Default): The layer only outputs the hidden state $h_t$ from the final time step. If your input has shape (batch_size, time_steps, features), the output will have shape (batch_size, units). This is common when the recurrent layer is the last layer before a final Dense layer for classification or regression on the entire sequence (e.g., sentiment analysis).
return_sequences=True: The layer outputs the hidden state $h_t$ for every time step. The output shape will be (batch_size, time_steps, units). This is necessary when stacking recurrent layers (the next recurrent layer expects a full sequence as input) or when you need an output at every time step, as in sequence tagging.
# TensorFlow/Keras Example
# Layer outputs only the last hidden state (shape: batch_size, 64)
lstm_last_state = tf.keras.layers.LSTM(units=64, return_sequences=False)
# Layer outputs hidden states for all time steps (shape: batch_size, time_steps, 128)
lstm_all_states = tf.keras.layers.LSTM(units=128, return_sequences=True)
# PyTorch Example
# By default, nn.LSTM/nn.GRU return outputs for all steps and the final hidden/cell states.
# You often select the part you need from the tuple they return.
lstm_pytorch = torch.nn.LSTM(input_size=10, hidden_size=64, batch_first=True)
input_tensor = torch.randn(4, 20, 10)  # (batch, seq_len, features); sizes are arbitrary
output, (hn, cn) = lstm_pytorch(input_tensor)
# 'output' contains the hidden states for all time steps: (batch, seq_len, hidden_size)
# 'hn' contains the final hidden state: (num_layers * num_directions, batch, hidden_size)
Impact of return_sequences on the output shape of an LSTM or GRU layer.
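You can verify these shapes with a quick check like the one below; the batch size, sequence length, and feature count are arbitrary values chosen for illustration.

import tensorflow as tf

x = tf.random.normal((4, 20, 10))  # (batch_size, time_steps, features), arbitrary sizes

last_only = tf.keras.layers.LSTM(units=64, return_sequences=False)(x)
all_steps = tf.keras.layers.LSTM(units=128, return_sequences=True)(x)

print(last_only.shape)  # (4, 64)
print(all_steps.shape)  # (4, 20, 128)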
return_state
Another boolean parameter, return_state, controls whether the layer returns the final state(s) in addition to the sequence output.

return_state=False (Default): The layer only returns the output sequence (either the last state or all states, depending on return_sequences).

return_state=True: The layer returns a list (in Keras) or tuple (in PyTorch) containing the sequence output followed by the final state(s). For a GRU this extra state is [final_hidden_state]; for an LSTM it is [final_hidden_state, final_cell_state].

This is particularly useful in encoder-decoder architectures (covered later), where the final state of the encoder is used to initialize the state of the decoder. It can also be helpful for analyzing the final learned representation of the sequence.
# TensorFlow/Keras Example
lstm_layer = tf.keras.layers.LSTM(units=32, return_sequences=True, return_state=True)
# When called, output is: [all_hidden_states, final_hidden_state, final_cell_state]
gru_layer = tf.keras.layers.GRU(units=64, return_sequences=False, return_state=True)
# When called, output is: [last_hidden_state, final_hidden_state]
# Note: When return_sequences=False, the first two elements are the same.
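As a sketch of the encoder-decoder use case mentioned above, the final encoder states can be handed to a decoder layer through the initial_state argument of a Keras recurrent layer's call. The layer sizes and input feature counts below are arbitrary placeholders.

import tensorflow as tf

# Encoder: we only need its final states, not its per-step outputs
encoder_inputs = tf.keras.Input(shape=(None, 16))   # (time_steps, features), arbitrary
_, state_h, state_c = tf.keras.layers.LSTM(units=64, return_state=True)(encoder_inputs)

# Decoder: start from the encoder's final hidden and cell states
decoder_inputs = tf.keras.Input(shape=(None, 8))
decoder_outputs = tf.keras.layers.LSTM(units=64, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)

model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)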
A few other parameters are also worth knowing about:

Input shape: For the first layer in a model (for example, when using the Keras Sequential API), you typically need to specify the shape of the input per instance, excluding the batch dimension. This is often done via the input_shape argument, which expects a tuple like (time_steps, features). For subsequent layers, the framework usually infers the input shape automatically. PyTorch requires input_size (the number of features) during layer initialization.

Dropout: Recurrent layers offer dropout (applied to the input transformations) and recurrent_dropout (applied to the recurrent connections, i.e., the hidden state updates). These apply dropout masks consistently across time steps for the recurrent connections, which is important for training stability.

Initializers: You can control how the weights are initialized (kernel_initializer, recurrent_initializer, bias_initializer). Standard initializers like Glorot (Xavier) uniform or Orthogonal are often good starting points. A short sketch combining several of these options is shown below.
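Here is one way these options might be combined on a single Keras layer; the specific values (sequence length, feature count, dropout rates) are placeholders for illustration, not recommendations.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(
        units=64,
        return_sequences=False,
        dropout=0.2,                          # dropout on the input transformations
        recurrent_dropout=0.2,                # dropout on the recurrent (state-to-state) connections
        kernel_initializer="glorot_uniform",
        recurrent_initializer="orthogonal",
        bias_initializer="zeros",
        input_shape=(50, 10),                 # placeholder: 50 time steps, 10 features
    ),
    tf.keras.layers.Dense(1),
])
model.summary()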
Understanding these parameters allows you to effectively construct and customize LSTM and GRU layers for your sequence modeling needs. Remember that the optimal configuration often depends on the specific dataset and task, requiring some degree of experimentation. In the following sections, we'll see how to combine these layers into more complex architectures like stacked and bidirectional RNNs.