Now that we've explored the concepts behind Recurrent Neural Networks, SimpleRNNs, LSTMs, and GRUs, let's put this knowledge into practice. In this section, we'll build and train an LSTM model using Keras to perform text classification, a common task in Natural Language Processing (NLP). We'll use the popular IMDB movie review dataset, where the goal is to classify reviews as either positive or negative based on their text content. This exercise reinforces the concepts of sequence data preparation, embedding layers, and recurrent layer implementation.
The IMDB dataset consists of 50,000 movie reviews, pre-split into 25,000 for training and 25,000 for testing. Each review is labeled as positive (1) or negative (0). Keras provides convenient access to this dataset, already preprocessed into sequences of word indices (integers). Each integer represents a specific word in a dictionary.
Let's start by loading the data. We'll limit our vocabulary to the top 10,000 most frequent words to keep the input manageable.
import keras
from keras.datasets import imdb
from keras import utils
# Load the dataset, keeping only the top 10,000 most frequent words
vocabulary_size = 10000
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocabulary_size)
print(f"Number of training samples: {len(train_data)}")
print(f"Number of test samples: {len(test_data)}")
# Example: Look at the first review (sequence of word indices)
print(f"First training review (indices): {train_data[0][:15]}...")
print(f"Label for first review: {train_labels[0]}")
You'll notice that each train_data and test_data sample is a list of integers, and that the reviews have varying lengths.
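If you're curious what a review looks like as text, one common approach (not required for training) is to invert the word index that ships with the dataset; the indices are offset by 3 because 0, 1, and 2 are reserved for padding, start-of-sequence, and unknown tokens:
# Optional: decode the first review back into words for inspection
word_index = imdb.get_word_index()
reverse_word_index = {index + 3: word for word, index in word_index.items()}
decoded_review = " ".join(reverse_word_index.get(i, "?") for i in train_data[0])
print(decoded_review[:200])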
Although recurrent layers can in principle handle variable-length sequences, batching requires every input in a batch to have the same length. We therefore pad (or truncate) the reviews so that every sequence has the same number of elements, using Keras's pad_sequences utility. Let's choose a maximum sequence length, say 250 words: reviews shorter than this will be padded with zeros at the beginning (padding='pre'), and longer reviews will be truncated.
# Set the maximum length for each review sequence
max_sequence_length = 250
# Pad sequences
padded_train_data = utils.pad_sequences(train_data, maxlen=max_sequence_length, padding='pre')
padded_test_data = utils.pad_sequences(test_data, maxlen=max_sequence_length, padding='pre')
print(f"Shape of padded training data: {padded_train_data.shape}")
print(f"Shape of padded test data: {padded_test_data.shape}")
# Example: Look at the first padded review
print(f"First padded training review: {padded_train_data[0]}")
Now, padded_train_data and padded_test_data are 2D tensors of shape (num_samples, max_sequence_length).
We'll construct a simple sequential model with three layers:
- An Embedding layer that maps each word index to a dense vector of size embedding_dim. This layer requires the input_dim (vocabulary size) and output_dim (embedding dimension).
- An LSTM layer that reads the sequence of embeddings and returns its final hidden state.
- A Dense output layer with a sigmoid activation that produces the probability of a positive review.
Let's define this model using the Keras Sequential API.
from keras import layers
from keras import models
embedding_dim = 32 # Dimension of the word embedding vectors
lstm_units = 32 # Number of units in the LSTM layer
model = models.Sequential(name="imdb_lstm_classifier")
# Declare the input shape so the model is built and summary() reports shapes and parameters
model.add(layers.Input(shape=(max_sequence_length,), name="review_input"))
model.add(layers.Embedding(input_dim=vocabulary_size,
                           output_dim=embedding_dim,
                           name="word_embedding"))
model.add(layers.LSTM(units=lstm_units, name="lstm_layer"))
model.add(layers.Dense(units=1, activation='sigmoid', name="output_classifier"))
model.summary()
The model.summary() output shows the layers, their output shapes, and the number of parameters. Notice the large number of parameters in the Embedding layer (vocabulary_size * embedding_dim) and in the LSTM layer.
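As a quick sanity check on those numbers, you can work out the expected parameter counts by hand from the hyperparameters defined above (an LSTM has four gates, each with input weights, recurrent weights, and a bias):
# Expected parameter counts for vocabulary_size=10000, embedding_dim=32, lstm_units=32
embedding_params = vocabulary_size * embedding_dim                          # 10000 * 32 = 320,000
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)  # 4 * (64*32 + 32) = 8,320
dense_params = lstm_units * 1 + 1                                           # 32 weights + 1 bias = 33
print(embedding_params, lstm_params, dense_params)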
Before training, we need to configure the learning process using the compile() method. We specify:
- Optimizer: adam is a widely used and effective optimization algorithm.
- Loss function: binary_crossentropy is suitable for binary classification problems where the output is a probability.
- Metrics: we'll track accuracy during training and evaluation.
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
print("Model compiled successfully.")
Now we train the model using the fit() method. We provide the padded training data and labels. We also set:
- epochs: the number of times to iterate over the entire training dataset.
- batch_size: the number of samples per gradient update.
- validation_split: a fraction of the training data to be used as validation data. The model's performance on this set is monitored at the end of each epoch, helping us detect overfitting.
num_epochs = 10
batch_size = 128
validation_fraction = 0.2
print("Starting training...")
history = model.fit(padded_train_data,
train_labels,
epochs=num_epochs,
batch_size=batch_size,
validation_split=validation_fraction,
verbose=1) # Set verbose=1 or 2 to see progress per epoch
print("Training finished.")
The fit() method returns a History object containing the training and validation loss and metrics for each epoch. We can use this to visualize the training process.
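A minimal plotting sketch, assuming matplotlib is installed, could look like this (the metric keys match the metrics passed to compile()):
import matplotlib.pyplot as plt
# history.history holds 'loss', 'accuracy', 'val_loss', and 'val_accuracy' per epoch
history_dict = history.history
epochs_range = range(1, len(history_dict['loss']) + 1)
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, history_dict['accuracy'], label='Training accuracy')
plt.plot(epochs_range, history_dict['val_accuracy'], label='Validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(epochs_range, history_dict['loss'], label='Training loss')
plt.plot(epochs_range, history_dict['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()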
Training and validation accuracy over epochs. Note how validation accuracy often plateaus or even decreases while training accuracy continues to rise, indicating potential overfitting.
Training and validation loss over epochs. An increasing validation loss alongside decreasing training loss is a clear sign of overfitting.
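One common way to curb this overfitting, not used in the run above, is Keras's EarlyStopping callback, which halts training once the validation loss stops improving:
# Optional: stop training when validation loss has not improved for 2 consecutive epochs
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss',
                                               patience=2,
                                               restore_best_weights=True)
# Pass callbacks=[early_stopping] to model.fit() to enable it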
Finally, let's evaluate the performance of our trained model on the unseen test data using the evaluate() method.
loss, accuracy = model.evaluate(padded_test_data, test_labels, verbose=0)
print(f"\nTest Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
This gives us the final performance metrics on data the model has never encountered during training. You should typically see an accuracy significantly better than random guessing (50%) for this task.
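You can also inspect individual predictions with model.predict(); here is a small sketch using the first few padded test reviews already in memory:
# Predict probabilities for the first five test reviews and compare with the true labels
sample_probabilities = model.predict(padded_test_data[:5], verbose=0)
for probability, actual_label in zip(sample_probabilities.ravel(), test_labels[:5]):
    predicted_label = 1 if probability >= 0.5 else 0
    print(f"predicted={predicted_label} (p={probability:.2f}), actual={actual_label}")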
In this practice section, you successfully:
- Loaded the IMDB dataset and padded the reviews into fixed-length sequences.
- Built a sequential model combining an Embedding layer and an LSTM layer.
- Compiled, trained, and evaluated the model on unseen test data.
This provides a solid foundation for applying RNNs and LSTMs to sequence-based problems. From here, you could experiment with:
- Using GRU layers instead of LSTM.
- Adding a bidirectional wrapper (keras.layers.Bidirectional) to process sequences in both forward and backward directions, as sketched below.
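As one illustration of both ideas, here is a sketch (reusing the hyperparameters defined earlier; the model name is just illustrative) that swaps the LSTM for a bidirectional GRU while leaving the rest of the workflow unchanged:
# Variant: a bidirectional GRU in place of the LSTM layer
variant_model = models.Sequential(name="imdb_bigru_classifier")
variant_model.add(layers.Input(shape=(max_sequence_length,)))
variant_model.add(layers.Embedding(input_dim=vocabulary_size, output_dim=embedding_dim))
variant_model.add(layers.Bidirectional(layers.GRU(units=lstm_units)))
variant_model.add(layers.Dense(units=1, activation='sigmoid'))
variant_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])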