At its heart, a language model is a statistical tool designed to predict the probability of a sequence of words or, more accurately, tokens (which can be words, subwords, or characters, as we'll discuss in Chapter 5). Given a sequence of preceding tokens, the model attempts to predict the most likely next token. This fundamental capability can be expressed mathematically as calculating the probability:
$P(\text{token}_i \mid \text{token}_1, \text{token}_2, \dots, \text{token}_{i-1})$

This predictive ability allows language models to generate text, complete sentences, translate languages, and perform various other Natural Language Processing (NLP) tasks. Traditional language models, including n-grams and early neural network approaches like Recurrent Neural Networks (RNNs), operated with parameter counts typically ranging from thousands to hundreds of millions.
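To make the notation concrete, here is a minimal sketch of how a model's raw output scores (logits) over a tiny, invented vocabulary would be turned into the conditional probability above; the vocabulary, context, and logit values are hypothetical.

# Toy next-token prediction: softmax over hypothetical logits for a made-up context
import numpy as np

vocab = ["mat", "dog", "moon", "chair"]       # toy vocabulary (assumption)
logits = np.array([3.2, 0.1, -1.5, 1.8])      # hypothetical model scores for "The cat sat on the"

# Softmax converts raw scores into P(token_i | token_1, ..., token_{i-1})
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(vocab, probs):
    print(f"P({token!r} | context) = {p:.3f}")
# The highest-probability token ('mat' here) is the model's next-token prediction.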
So, what makes a language model "Large"? The term "Large Language Model" (LLM) specifically refers to neural network-based language models characterized by their enormous scale, both in terms of the number of parameters they contain and the vast amounts of data they are trained on.
The primary differentiator is the sheer number of learnable parameters. While earlier models like BERT-Large had around 340 million parameters, LLMs push this boundary significantly further, typically containing billions, tens of billions, hundreds of billions, or even trillions of parameters. These parameters (weights and biases within the neural network) encode the patterns, grammar, knowledge, and nuances learned from the training data. This massive increase in capacity is a defining feature.
Approximate parameter counts for several well-known language models, illustrating the scale difference (note the logarithmic y-axis).
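For a sense of where such counts come from, the following back-of-the-envelope sketch uses a common approximation for decoder-only Transformers (roughly 12 × d_model² parameters per layer for attention plus feed-forward, plus the token embedding matrix); the layer count, width, and vocabulary size below are illustrative, GPT-3-like values rather than an exact published configuration.

# Rough parameter count for a decoder-only Transformer (approximation, not exact)
def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    per_layer = 12 * d_model ** 2       # attention (~4 d^2) + feed-forward (~8 d^2)
    embeddings = vocab_size * d_model   # input token embedding matrix
    return n_layers * per_layer + embeddings

print(f"{approx_params(n_layers=96, d_model=12288, vocab_size=50257):,}")
# ~174,563,733,504, i.e. on the order of 175 billion parameters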
Correspondingly, LLMs are pre-trained on immense datasets, often comprising hundreds of terabytes or even petabytes of text data scraped from the web, books, code repositories, and other sources (Chapters 6-9 explore data sourcing and processing). This contrasts with smaller models trained on more curated, smaller datasets often measured in gigabytes. The scale of the data is necessary to effectively train the vast number of parameters and expose the model to a wide range of language use and world knowledge.
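As a rough illustration of what those dataset sizes mean in tokens, the sketch below assumes an average of about four bytes of English text per token, a common ballpark that varies with language, tokenizer, and content type.

# Approximate token count from raw corpus size (assumes ~4 bytes of text per token)
BYTES_PER_TOKEN = 4  # assumption; real ratios depend on the tokenizer and data mix

def approx_tokens(corpus_bytes: float) -> float:
    return corpus_bytes / BYTES_PER_TOKEN

one_tb = 1e12
print(f"{approx_tokens(10 * one_tb):.2e} tokens in ~10 TB of text")
# ~2.5e12, i.e. a few trillion tokens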
While earlier models explored various architectures, modern LLMs are almost universally based on the Transformer architecture, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). The Transformer's self-attention mechanism allows the model to weigh the importance of different tokens in the input sequence when making predictions, overcoming limitations of previous sequential architectures like RNNs in handling long-range dependencies. We will examine the Transformer in detail in Chapter 4.
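The sketch below shows single-head scaled dot-product self-attention in NumPy, with random matrices standing in for learned weights, purely to illustrate how each token's new representation becomes a weighted mixture of every token's values; Chapter 4 develops the full architecture.

# Minimal single-head self-attention (random weights, for illustration only)
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # project tokens to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])          # pairwise token affinities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ v                               # each output mixes all value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
x = rng.normal(size=(seq_len, d_model))              # 5 token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # (5, 16)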
Perhaps the most fascinating aspect of LLMs is the emergence of capabilities that are not explicitly programmed or trained for but arise as a consequence of scale. Smaller models typically require significant task-specific fine-tuning to perform well on downstream tasks like sentiment analysis or question answering. LLMs, however, often exhibit remarkable zero-shot or few-shot learning abilities.
Examples of emergent abilities include arithmetic reasoning, complex instruction following, translation between languages not explicitly paired in training data, and code generation. These abilities appear somewhat abruptly as model size crosses certain thresholds, suggesting that quantity (scale) can lead to qualitative changes in behavior.
Consider a simplified interaction:
# Hypothetical LLM interaction
# Assume 'llm' is a pre-loaded large language model object
context = "The capital of France is"
next_token_probabilities = llm.predict_next_token_probabilities(context)
# The model assigns high probability to 'Paris'
print(f"Top prediction: {llm.get_most_likely_token(next_token_probabilities)}")
# Output: Top prediction: Paris
context_few_shot = """
Translate English to French:
sea otter => loutre de mer
cheese => fromage
plush toy => peluche
Translate English to French:
cloud => ?
"""
next_token_probabilities_few_shot = llm.predict_next_token_probabilities(context_few_shot)
# The model uses the examples to perform the translation
print(f"Few-shot prediction: {llm.get_most_likely_token(
next_token_probabilities_few_shot
)}")
# Output: Few-shot prediction: nuage
This few-shot capability, driven by the model's scale and vast pre-training, distinguishes LLMs from their predecessors, which would typically require fine-tuning on a dedicated English-to-French translation dataset to achieve similar results.
Finally, defining LLMs inherently involves acknowledging the immense computational resources required for their training. Training runs often involve thousands of high-end GPUs or TPUs running for weeks or months, consuming significant energy and incurring substantial costs. This scale of computation is another practical differentiator from smaller models.
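To put that computation in perspective, the sketch below applies the widely cited approximation that training requires about 6 floating-point operations per parameter per token (C ≈ 6·N·D); the parameter and token counts are illustrative, not figures from any specific training run.

# Back-of-the-envelope training compute using C ~= 6 * N * D
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

flops = training_flops(n_params=175e9, n_tokens=300e9)   # illustrative values
print(f"{flops:.2e} FLOPs")                              # ~3.15e+23
# At ~1e15 usable FLOP/s per accelerator, that is ~3e8 accelerator-seconds,
# i.e. thousands of GPU/TPU-days before counting inefficiencies and restarts.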
In summary, an LLM is defined by:

- A very large number of learnable parameters, typically billions to trillions.
- Pre-training on massive, web-scale text datasets.
- The Transformer architecture and its self-attention mechanism.
- Emergent zero-shot and few-shot capabilities that appear with scale.
- The substantial computational resources required to train it.
Understanding these characteristics sets the stage for exploring the engineering challenges and techniques involved in building, training, and deploying these powerful models, which is the focus of this course.