Understanding the architecture and capabilities of today's large language models requires appreciating the evolution of techniques designed to process sequential data, particularly text. The journey from simple statistical methods to complex neural networks highlights the persistent challenges of capturing context and dependencies in language.
Before the prevalence of deep learning, statistical methods formed the bedrock of language modeling. Among the most fundamental were N-gram models. These models operate on the Markov assumption: the probability of the next word depends only on the preceding $n-1$ words. For instance, a trigram model ($n=3$) estimates the probability of a word $w_i$ given the two previous words $w_{i-1}$ and $w_{i-2}$:
$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$
These probabilities are typically estimated by counting occurrences in large text corpora:
$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-2}, w_{i-1})}$$
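As a quick illustration of this counting estimate, here is a minimal Python sketch over a toy corpus. The corpus and whitespace tokenization are placeholders; real systems use much larger corpora and apply smoothing or back-off for unseen contexts.

```python
from collections import Counter

# Toy corpus; in practice counts come from a large tokenized corpus
corpus = "the cat sat on the mat the cat sat on the rug".split()

# Count trigrams and their bigram prefixes
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w_prev2, w_prev1, w):
    """P(w | w_prev2, w_prev1) estimated by relative frequency."""
    prefix_count = bigram_counts[(w_prev2, w_prev1)]
    if prefix_count == 0:
        return 0.0  # unseen context; real systems apply smoothing or back-off
    return trigram_counts[(w_prev2, w_prev1, w)] / prefix_count

print(trigram_prob("the", "cat", "sat"))  # 1.0 in this toy corpus
```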
While simple and interpretable, N-gram models face significant limitations: counts for longer contexts become extremely sparse, the models cannot generalize across semantically similar words, and no dependency beyond the fixed window of $n-1$ previous words can be captured.
N-gram models predict the next word based on a fixed window of previous words.
Neural networks offered a way to overcome the limitations of N-grams. Recurrent Neural Networks (RNNs) were specifically designed for sequential data. Unlike feed-forward networks, RNNs possess connections that loop back, allowing them to maintain an internal hidden state ($h_t$) that theoretically captures information from all previous time steps in the sequence.
At each time step $t$, an RNN takes the current input $x_t$ and the previous hidden state $h_{t-1}$ to compute the new hidden state $h_t$ and potentially an output $y_t$:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
Here, $W_{hh}$, $W_{xh}$, $W_{hy}$, $b_h$, and $b_y$ are learned parameters (weight matrices and biases), shared across all time steps. The $\tanh$ function is a common activation function.
An RNN cell processes input $x_t$ and the previous state $h_{t-1}$ to produce the next state $h_t$ and output $y_t$. The state is passed through time.
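To make the recurrence concrete, here is a minimal sketch of a single RNN step in PyTorch. The dimensions are arbitrary assumed values, and the parameter names mirror the equations above.

```python
import torch

# Assumed toy dimensions
input_size, hidden_size, output_size = 10, 20, 4

# Learned parameters, shared across all time steps (randomly initialized here)
W_hh = torch.randn(hidden_size, hidden_size)
W_xh = torch.randn(hidden_size, input_size)
W_hy = torch.randn(output_size, hidden_size)
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One recurrence step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h), y_t = W_hy h_t + b_y."""
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Unroll the recurrence over a short sequence
h = torch.zeros(hidden_size)
for x in torch.randn(5, input_size):  # sequence of 5 time steps
    h, y = rnn_step(x, h)
print(h.shape, y.shape)  # torch.Size([20]) torch.Size([4])
```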
While RNNs offered the promise of modeling arbitrarily long dependencies, training them proved difficult due to the vanishing gradient problem. During backpropagation through time, gradients multiplied across many time steps could shrink exponentially, preventing weights associated with earlier time steps from being updated effectively. This meant that, in practice, simple RNNs struggled to learn dependencies beyond a relatively short window, similar in effect to N-grams. The related exploding gradient problem (where gradients grow exponentially) could also occur, though it was often easier to manage with techniques like gradient clipping.
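As a small illustration of that mitigation, gradient clipping in PyTorch is applied between the backward pass and the optimizer step. The model, loss, and dimensions below are placeholders for a real training loop.

```python
import torch
import torch.nn as nn

# Minimal sketch: one training step with gradient clipping (model and loss are placeholders)
model = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(3, 50, 10)      # a fairly long sequence
output, _ = model(x)
loss = output.pow(2).mean()     # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm does not exceed 1.0, mitigating exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```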
To address the vanishing gradient problem, more sophisticated recurrent units were developed.
Long Short-Term Memory (LSTM) networks introduced gating mechanisms: a forget gate, an input gate, and an output gate.
These gates control the flow of information through time via a separate cell state ($C_t$), allowing the network to selectively remember or forget information over long durations.
```python
import torch
import torch.nn as nn

# Example: Instantiate an LSTM layer in PyTorch
input_size = 10    # Input features per time step
hidden_size = 20   # Number of features in the hidden state
num_layers = 2     # Number of recurrent layers

# Create an LSTM layer
lstm_layer = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

# Example input (batch_size=3, sequence_length=5, input_features=10)
batch_size = 3
seq_len = 5
example_input = torch.randn(batch_size, seq_len, input_size)

# Initial hidden and cell states (optional, defaults to zeros)
h0 = torch.randn(num_layers, batch_size, hidden_size)
c0 = torch.randn(num_layers, batch_size, hidden_size)

# Forward pass
output, (hn, cn) = lstm_layer(example_input, (h0, c0))

print("Output shape:", output.shape)            # (batch_size, seq_len, hidden_size)
print("Final hidden state shape:", hn.shape)    # (num_layers, batch_size, hidden_size)
print("Final cell state shape:", cn.shape)      # (num_layers, batch_size, hidden_size)
```
Gated Recurrent Units (GRUs) provide a slightly simpler alternative with two gates (Update and Reset gates) and no separate cell state. GRUs often achieve performance comparable to LSTMs on many tasks but with fewer parameters.
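For comparison with the LSTM example above, a GRU layer is created the same way in PyTorch and returns only a hidden state, since there is no separate cell state. The dimensions below are the same assumed toy values.

```python
import torch
import torch.nn as nn

# Same assumed dimensions as the LSTM example above
gru_layer = nn.GRU(input_size=10, hidden_size=20, num_layers=2, batch_first=True)

example_input = torch.randn(3, 5, 10)   # (batch_size, seq_len, input_features)
output, hn = gru_layer(example_input)   # note: no cell state, unlike the LSTM

print("Output shape:", output.shape)            # (3, 5, 20)
print("Final hidden state shape:", hn.shape)    # (2, 3, 20)
```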
Both LSTMs and GRUs significantly improved the ability to capture longer-range dependencies compared to simple RNNs. They became the standard for many NLP tasks, often used within sequence-to-sequence (Seq2Seq) architectures. Seq2Seq models consist of an encoder RNN that processes the input sequence into a context vector (often the final hidden state) and a decoder RNN that generates the output sequence based on that context vector. While successful in machine translation, summarization, etc., Seq2Seq models still faced a bottleneck: the entire meaning of the input sequence had to be compressed into a single fixed-size context vector, regardless of the input length.
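The sketch below illustrates that bottleneck with two GRU layers standing in for the encoder and decoder. It omits embeddings, output projections, and training, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

hidden_size = 32
encoder = nn.GRU(input_size=16, hidden_size=hidden_size, batch_first=True)
decoder = nn.GRU(input_size=16, hidden_size=hidden_size, batch_first=True)

src = torch.randn(4, 7, 16)   # source sequence (batch, src_len, features)
tgt = torch.randn(4, 9, 16)   # target sequence inputs (batch, tgt_len, features)

# Encoder compresses the whole source into its final hidden state
_, context = encoder(src)            # context: (1, batch, hidden_size)

# Decoder generates conditioned only on that single fixed-size vector
dec_out, _ = decoder(tgt, context)   # (batch, tgt_len, hidden_size)
print(dec_out.shape)
```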
The attention mechanism was introduced to alleviate the Seq2Seq bottleneck. Instead of relying solely on the final encoder hidden state, the decoder was allowed to "attend" to different parts of the entire input sequence at each step of output generation. It calculates attention scores between the current decoder state and all encoder hidden states, creating a weighted context vector specific to that decoding step. This allowed models to focus on relevant input words when generating corresponding output words, dramatically improving performance, especially for longer sequences.
Attention allows the decoder state $s_t$ to selectively weight encoder hidden states ($h_1, h_2, h_3$) to form a context vector $c_t$.
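The weighting itself reduces to a softmax over similarity scores. Below is a minimal dot-product attention sketch for a single decoding step, with assumed shapes; attention variants such as additive or scaled dot-product attention differ mainly in how the scores are computed.

```python
import torch
import torch.nn.functional as F

hidden_size, src_len = 32, 7
encoder_states = torch.randn(src_len, hidden_size)   # h_1 ... h_7
decoder_state = torch.randn(hidden_size)             # s_t

# Dot-product scores between the decoder state and every encoder state
scores = encoder_states @ decoder_state              # (src_len,)
weights = F.softmax(scores, dim=0)                   # attention distribution over input positions

# Context vector: attention-weighted sum of encoder states
context = weights @ encoder_states                   # (hidden_size,)
print(weights.shape, context.shape)
```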
While attention improved RNN-based models, the sequential nature of recurrence remained a bottleneck for training efficiency. Processing a sequence required O(sequence length) sequential operations, hindering parallelization.
The Transformer architecture, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), revolutionized sequence modeling by dispensing with recurrence entirely. It relies solely on attention mechanisms, specifically self-attention. Self-attention allows the model to weigh the importance of all other words in the input sequence when encoding a representation for a specific word.
Critically, the computations within Transformer layers (self-attention and feed-forward networks) can be performed largely in parallel across the sequence positions. This parallelizability was a major breakthrough, enabling the training of much larger models on vastly larger datasets than was previously feasible with RNNs. The computational efficiency gained through parallelization, combined with the effectiveness of self-attention at capturing complex dependencies (both short and long-range), laid the foundation for the development of the very large language models that are the focus of this course. The scaling properties and computational demands spurred by the Transformer architecture are central themes we will revisit throughout our discussions.
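As a brief illustration, PyTorch's built-in multi-head attention module can apply self-attention to an entire sequence in a single parallel call; the dimensions below are arbitrary.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 64, 8
self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 10, embed_dim)   # (batch, seq_len, embed_dim)

# Self-attention: queries, keys, and values all come from the same sequence,
# so every position attends to every other position in one parallel call
attn_out, attn_weights = self_attn(x, x, x)
print(attn_out.shape)       # (2, 10, 64)
print(attn_weights.shape)   # (2, 10, 10): per-position attention over the sequence
```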