When transitioning from Keras to PyTorch, you'll find that layers, the fundamental building blocks of neural networks, have direct counterparts. While the underlying principles are similar, the naming conventions, parameterization, and some default behaviors can differ. This section will guide you through implementing common layer types like Dense (Linear), Convolutional (Conv2D), and Recurrent (LSTM) layers in PyTorch, drawing comparisons to their Keras equivalents.
## `tf.keras.layers.Dense` vs `torch.nn.Linear`

The most basic layer, a fully connected or dense layer, performs a linear transformation (y = Wx + b). In Keras, this is `tf.keras.layers.Dense`. In PyTorch, it's `torch.nn.Linear`.
**Differences and Similarities:**

- Keras uses the `units` parameter to define the dimensionality of the output space. PyTorch uses `out_features`.
- `torch.nn.Linear` requires you to specify `in_features`, the dimensionality of the input. Keras often infers this from the input shape when the model is first called (or if an `input_shape` is provided to the first layer).
- Keras lets you specify an activation function directly on the `Dense` layer (e.g., `activation='relu'`). In PyTorch, activation functions are typically applied as separate modules (e.g., `torch.nn.ReLU()`) or as functions from `torch.nn.functional` after the linear layer.
- Keras uses `use_bias=True` (default) to include a bias term. PyTorch uses `bias=True` (default).

Here's a comparison of common parameters:
| Keras (`tf.keras.layers.Dense`) | PyTorch (`torch.nn.Linear`) | Description |
|---|---|---|
| `units` | `out_features` | Size of the output |
| (inferred or `input_shape`) | `in_features` | Size of the input |
| `activation` | (Applied separately) | Activation function |
| `use_bias` | `bias` | Whether to include a bias term |
| `kernel_initializer` | (Handled differently) | Weight initialization strategy |
| `bias_initializer` | (Handled differently) | Bias initialization strategy |
**Example:**

Let's create a dense layer that takes 64 input features and produces 128 output features.

**TensorFlow (Keras):**

```python
import tensorflow as tf

# Keras Dense layer with a built-in ReLU activation
keras_dense_layer = tf.keras.layers.Dense(units=128, input_shape=(64,), activation='relu')

# Example usage with dummy data
dummy_input_keras = tf.random.normal(shape=(32, 64))  # Batch size 32, 64 features
output_keras = keras_dense_layer(dummy_input_keras)
print("Keras Output Shape:", output_keras.shape)
```
**PyTorch:**

```python
import torch
import torch.nn as nn

# PyTorch Linear layer; the activation is a separate module
pytorch_linear_layer = nn.Linear(in_features=64, out_features=128)
pytorch_relu = nn.ReLU()

# Example usage with dummy data
dummy_input_pytorch = torch.randn(32, 64)  # Batch size 32, 64 features
linear_output_pytorch = pytorch_linear_layer(dummy_input_pytorch)
output_pytorch = pytorch_relu(linear_output_pytorch)  # Apply activation separately
print("PyTorch Output Shape:", output_pytorch.shape)
```
In the PyTorch example, `nn.ReLU()` is instantiated as a module. You could also use `torch.nn.functional.relu()`. The weights and biases are initialized automatically in PyTorch, but you can customize this, as discussed in the "Weight Initialization Strategies" section.
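For completeness, here's a minimal sketch of the functional-style activation and of overriding the default initialization (using Xavier/Glorot here is only an assumption chosen to mirror Keras's default `glorot_uniform`; PyTorch's own default is different):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

linear = nn.Linear(in_features=64, out_features=128)

# Functional style: apply the activation without a separate module instance
x = torch.randn(32, 64)
out = F.relu(linear(x))

# Optional: mimic Keras's glorot_uniform initializer (an assumption,
# chosen only to match Keras defaults; not required by the layer)
nn.init.xavier_uniform_(linear.weight)
nn.init.zeros_(linear.bias)
```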
## `tf.keras.layers.Conv2D` vs `torch.nn.Conv2d`

Convolutional layers are fundamental to computer vision tasks. Keras provides `tf.keras.layers.Conv2D` for 2D convolutions, while PyTorch offers `torch.nn.Conv2d`.
**Differences and Similarities:**

- Keras uses `filters` to specify the number of output channels (the depth of the convolution). PyTorch uses `out_channels`.
- `torch.nn.Conv2d` requires `in_channels` to be specified. Keras typically infers this.
- Both use `kernel_size` (an int, or a tuple for asymmetric kernels).
- Keras uses `strides` (a tuple, e.g., `(1, 1)`). PyTorch uses `stride` (an int or a tuple).
- Keras's `padding` is either `'valid'` (no padding) or `'same'` (padding to maintain the input's spatial dimensions). PyTorch's `padding` argument can take an integer (the same symmetric padding on every side), a tuple (one amount per spatial dimension, applied to both sides), or the string values `'valid'` or `'same'` (similar to Keras, though numerical padding offers more control).
- Keras (`tf.keras.layers.Conv2D`) defaults to the `'channels_last'` data format, meaning input tensors are expected in the shape `(batch_size, height, width, channels)`.
- PyTorch (`torch.nn.Conv2d`) expects the `'channels_first'` data format: `(batch_size, channels, height, width)`. You need to ensure your input data adheres to this format (see the conversion sketch after the example below).

Here's a parameter comparison:
| Keras (`tf.keras.layers.Conv2D`) | PyTorch (`torch.nn.Conv2d`) | Description |
|---|---|---|
| `filters` | `out_channels` | Number of output filters/channels |
| (inferred or `input_shape`) | `in_channels` | Number of input channels |
| `kernel_size` | `kernel_size` | Size of the convolution kernel |
| `strides` | `stride` | Step size of the convolution |
| `padding` | `padding` | Padding mode or amount |
| `data_format` | (Implicitly `'channels_first'`) | Tensor data format |
| `activation` | (Applied separately) | Activation function |
| `use_bias` | `bias` | Whether to include a bias term |
**Example:**

A 2D convolutional layer with 32 output filters, a 3x3 kernel, and stride 1. Assume input images are grayscale (1 channel).

**TensorFlow (Keras):**

```python
import tensorflow as tf

# Keras Conv2D layer
# Input: (batch, height, width, channels), e.g., (N, 28, 28, 1)
keras_conv_layer = tf.keras.layers.Conv2D(
    filters=32, kernel_size=(3, 3), strides=(1, 1),
    padding='same', activation='relu', input_shape=(28, 28, 1)
)

# Example usage
dummy_input_keras = tf.random.normal(shape=(32, 28, 28, 1))  # N, H, W, C
output_keras = keras_conv_layer(dummy_input_keras)
print("Keras Conv2D Output Shape:", output_keras.shape)  # (32, 28, 28, 32) due to 'same' padding
```
**PyTorch:**

```python
import torch
import torch.nn as nn

# PyTorch Conv2d layer
# Input: (batch, channels, height, width), e.g., (N, 1, 28, 28)
pytorch_conv_layer = nn.Conv2d(
    in_channels=1, out_channels=32, kernel_size=3, stride=1,
    padding=1  # padding=1 for a 3x3 kernel achieves the "same" effect
)
pytorch_relu = nn.ReLU()

# Example usage
dummy_input_pytorch = torch.randn(32, 1, 28, 28)  # N, C, H, W
conv_output_pytorch = pytorch_conv_layer(dummy_input_pytorch)
output_pytorch = pytorch_relu(conv_output_pytorch)
print("PyTorch Conv2d Output Shape:", output_pytorch.shape)  # (32, 32, 28, 28)
```
**Note on PyTorch `padding` for `Conv2d`:** To achieve the 'same' padding behavior of Keras (where output spatial dimensions match the input for `stride=1`), if your `kernel_size` is k, you generally set `padding = (k - 1) // 2` for odd kernel sizes. For `kernel_size=3`, use `padding=1`; for `kernel_size=5`, use `padding=2`. Recent PyTorch versions also accept the string value `padding='same'`, which simplifies this.
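As a quick sanity check, a minimal sketch comparing the two forms (assuming a PyTorch version recent enough to accept string padding; note that `padding='same'` only supports `stride=1`):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 28, 28)

# Numeric padding: (k - 1) // 2 = 1 for a 3x3 kernel
conv_numeric = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
# String padding: PyTorch computes the equivalent amount itself
conv_string = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding='same')

print(conv_numeric(x).shape)  # torch.Size([8, 16, 28, 28])
print(conv_string(x).shape)   # torch.Size([8, 16, 28, 28])
```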
## `tf.keras.layers.LSTM` vs `torch.nn.LSTM`

For sequence modeling, Long Short-Term Memory (LSTM) networks are a popular choice. Keras provides `tf.keras.layers.LSTM`, and PyTorch offers `torch.nn.LSTM`.
**Differences and Similarities:**

- Keras uses `units` for the dimensionality of the hidden state (and of the per-step output if `return_sequences=True`). PyTorch uses `hidden_size`.
- `torch.nn.LSTM` requires `input_size`, the number of features in the input sequence at each time step.
- Keras expects input in `(batch_size, timesteps, features)` format. `torch.nn.LSTM` defaults to `batch_first=False`, meaning it expects input as `(timesteps, batch_size, features)`. You can set `batch_first=True` to use the more common `(batch_size, timesteps, features)` format. This is a frequent point of attention for developers transitioning.
- Keras's `LSTM` has `return_sequences` (to return the full sequence of outputs) and `return_state` (to return the final hidden and cell states). The `torch.nn.LSTM` `forward` method always returns `output, (h_n, c_n)`:
  - `output`: the output features from the last LSTM layer for each time step. Its shape depends on `batch_first`; if `batch_first=True`, it is `(batch_size, seq_len, num_directions * hidden_size)`.
  - `h_n`: the final hidden state for each element in the batch. Shape: `(num_layers * num_directions, batch_size, hidden_size)`.
  - `c_n`: the final cell state for each element in the batch. Shape: `(num_layers * num_directions, batch_size, hidden_size)`.
- PyTorch's `num_layers` parameter allows easy stacking of LSTMs. In Keras, you'd stack `LSTM` layers sequentially.

**Example:**
An LSTM layer with 128 hidden units, processing sequences of length 10 with 20 features per time step.
**TensorFlow (Keras):**

```python
import tensorflow as tf

# Keras LSTM layer
# Input shape: (batch_size, timesteps, features)
keras_lstm_layer = tf.keras.layers.LSTM(units=128, return_sequences=True, input_shape=(10, 20))

# Example usage
dummy_input_keras = tf.random.normal(shape=(32, 10, 20))  # batch, timesteps, features
output_keras = keras_lstm_layer(dummy_input_keras)
print("Keras LSTM Output Shape (sequences):", output_keras.shape)

# return_sequences=False returns only the last time step's output
keras_lstm_layer_last_step = tf.keras.layers.LSTM(units=128, return_sequences=False)
output_keras_last = keras_lstm_layer_last_step(dummy_input_keras)
print("Keras LSTM Output Shape (last step):", output_keras_last.shape)
```
**PyTorch:**

```python
import torch
import torch.nn as nn

# PyTorch LSTM layer
# input_size: features per time step; hidden_size: LSTM units
pytorch_lstm_layer = nn.LSTM(input_size=20, hidden_size=128, num_layers=1, batch_first=True)

# Example usage
dummy_input_pytorch = torch.randn(32, 10, 20)  # batch, timesteps, features (due to batch_first=True)
output_pytorch, (h_n, c_n) = pytorch_lstm_layer(dummy_input_pytorch)
print("PyTorch LSTM Full Output Shape:", output_pytorch.shape)  # (batch_size, seq_len, hidden_size)
print("PyTorch LSTM Final Hidden State Shape (h_n):", h_n.shape)  # (num_layers, batch_size, hidden_size)
print("PyTorch LSTM Final Cell State Shape (c_n):", c_n.shape)  # (num_layers, batch_size, hidden_size)
```
**Important:** Remember the `batch_first=True` argument in PyTorch's `nn.LSTM` if your data is structured as `(batch, sequence, feature)`, which is common. Without it, PyTorch expects `(sequence, batch, feature)`. The shapes of `h_n` and `c_n` are `(num_layers * num_directions, batch, hidden_size)`, so for a single-layer, unidirectional LSTM they are `(1, batch, hidden_size)`. You might need to `squeeze()` the first dimension if you need just `(batch, hidden_size)`.
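Here's a minimal, self-contained sketch of extracting the final hidden state as `(batch, hidden_size)`:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=20, hidden_size=128, num_layers=1, batch_first=True)
x = torch.randn(32, 10, 20)
output, (h_n, c_n) = lstm(x)

# h_n: (num_layers * num_directions, batch, hidden_size) = (1, 32, 128)
last_hidden = h_n.squeeze(0)  # (32, 128); only safe for single-layer, unidirectional LSTMs
last_hidden = h_n[-1]         # (32, 128); indexing the last layer also handles stacked LSTMs
```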
Many other layers have straightforward translations:

**Pooling Layers:**

- Keras: `tf.keras.layers.MaxPool2D`, `tf.keras.layers.AvgPool2D`
- PyTorch: `torch.nn.MaxPool2d`, `torch.nn.AvgPool2d`
- `pool_size` (Keras) maps to `kernel_size` (PyTorch). `strides` and `padding` behave similarly to the convolutional layers. Remember the channels-first data format for PyTorch 2D pooling layers.

**Dropout Layers:**

- Keras: `tf.keras.layers.Dropout(rate)`
- PyTorch: `torch.nn.Dropout(p)`
- `rate` in Keras and `p` in PyTorch both represent the probability of an element being zeroed out during training.

**Flatten Layers:**

- Keras: `tf.keras.layers.Flatten()`
- PyTorch: `torch.nn.Flatten(start_dim=1, end_dim=-1)`
- PyTorch's `Flatten` is more flexible; `start_dim=1` is common, flattening all dimensions except the batch dimension, similar to Keras's default.

**Batch Normalization:**

- Keras: `tf.keras.layers.BatchNormalization(axis=-1, ...)` (the axis is usually the channel axis)
- PyTorch: `torch.nn.BatchNorm1d(num_features)`, `torch.nn.BatchNorm2d(num_features)`, `torch.nn.BatchNorm3d(num_features)`
- `num_features` corresponds to the number of channels for `BatchNorm2d` (data shaped `(N, C, H, W)`) or the number of features/channels for `BatchNorm1d` (data shaped `(N, C)` or `(N, C, L)`).
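To tie these together, a minimal sketch of a small channels-first block built from the layers above (the specific sizes are illustrative):

```python
import torch
import torch.nn as nn

# A small convolutional block combining the layer types discussed above
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),
    nn.BatchNorm2d(num_features=16),  # num_features = output channels of the conv
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),      # Keras equivalent: pool_size=2
    nn.Flatten(start_dim=1),          # keep the batch dimension
    nn.Dropout(p=0.5),                # Keras equivalent: rate=0.5
)

x = torch.randn(8, 1, 28, 28)  # channels-first input, as PyTorch expects
print(block(x).shape)          # torch.Size([8, 3136]) -> 16 * 14 * 14
```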
By understanding these mappings and paying attention to details like input shapes and parameter names, you can effectively translate your Keras layer knowledge to build models using `torch.nn`. The next step is to assemble these layers into complete model architectures.