In the previous sections, we explored how Large Language Models (LLMs) break down text into tokens and predict the next token based on the sequence they've seen so far. But how does the model decide which token is the most probable next one? The answer lies largely in the context – the sequence of tokens that comes before the point of prediction.
Think of reading a story. If you jump into the middle of a paragraph, you might struggle to understand what's happening. You need the preceding sentences and paragraphs to provide context: Who are the characters? What is the setting? What events have already occurred? LLMs operate similarly. The text provided to the model before it generates its output serves as its understanding of the current situation.
At its core, an LLM is constantly calculating probabilities for the next token based on the sequence it has processed. The context directly shapes these probabilities. A slight change in the input context can lead to a completely different output.
Consider these simple examples:

- "The clouds are _" → "white"
- "Bacon and _" → "eggs"
- "To fix the leaky faucet, use a _" → "wrench"
In each case, the words preceding the generation point strongly suggest the most probable next word. The model uses the patterns learned during its training on vast amounts of text to understand that "white" often follows "clouds are", "eggs" often follows "bacon and", and "wrench" is relevant to "fix the leaky faucet".
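You can observe these shifting probabilities directly. The sketch below asks a small causal language model for its top next-token candidates given two slightly different contexts. It assumes the Hugging Face transformers and torch libraries and uses GPT-2 only as a convenient example model; the prompts are illustrative, and any causal LM would behave the same way in principle.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is used here purely as a small, freely available example model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def top_next_tokens(context, k=5):
    """Return the k most probable next tokens for a given context string."""
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]   # scores for the token after the last input token
    probs = torch.softmax(logits, dim=-1)        # convert raw scores into probabilities
    top = torch.topk(probs, k)
    return [(tokenizer.decode(idx.item()), round(p.item(), 3))
            for idx, p in zip(top.indices, top.values)]

# A small change in the context shifts the whole probability distribution.
print(top_next_tokens("The clouds in the sky are"))
print(top_next_tokens("The storm clouds in the sky are"))
```

Comparing the two outputs makes the point concrete: changing a single word in the context changes which continuations the model considers most probable.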
Context isn't just about predicting the very next word; it's crucial for generating longer stretches of text that are coherent and stay on topic. When you provide an LLM with a prompt or continue a conversation, the entire preceding sequence influences subsequent generations.
Imagine you ask an LLM:
"Write a short story about an astronaut who discovers a strange plant on Mars. The plant glows faintly."
The LLM uses this entire request as context. As it generates the story, it constantly refers back to this initial context (and the text it has generated so far) to ensure the story remains about an astronaut, Mars, a strange plant, and its glowing property. If the context only mentioned "write a story", the output would be far less specific.
You might wonder if the model gives equal weight to every word in the context. If the context is very long, does the first word have as much influence as the last word? Not usually.
Modern LLMs, particularly those based on the Transformer architecture mentioned earlier, use mechanisms often referred to as attention. Conceptually, attention allows the model to weigh the importance of different parts of the input context when generating a specific output token. It can "pay more attention" to words or phrases in the context that are most relevant to predicting the next token.
For instance, if the model is generating the next word after "The astronaut picked up the glowing _", the attention mechanism would likely focus heavily on "astronaut", "glowing", and "plant" from the earlier context to predict something relevant, perhaps "specimen" or "flower".
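The sketch below shows the arithmetic behind that weighting for a single attention head, using scaled dot-product attention. The embeddings and projection matrices are random placeholders rather than values from a real model, so the printed weights are arbitrary; in a trained Transformer these projections are learned, which is what steers the weights toward genuinely relevant tokens such as "astronaut" and "glowing".

```python
import numpy as np

# Toy single-head attention sketch with random placeholder parameters.
rng = np.random.default_rng(0)
tokens = ["The", "astronaut", "picked", "up", "the", "glowing"]
d_model = 8

embeddings = rng.normal(size=(len(tokens), d_model))  # one vector per context token
W_q = rng.normal(size=(d_model, d_model))             # query projection (learned in a real model)
W_k = rng.normal(size=(d_model, d_model))             # key projection (learned in a real model)

query = embeddings[-1] @ W_q   # the position the model is predicting from
keys = embeddings @ W_k        # every context token offers a key to be scored

# Score each context token against the query, scale, and normalize with softmax.
scores = keys @ query / np.sqrt(d_model)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

for token, w in zip(tokens, weights):
    print(f"{token:>10}  attention weight: {w:.2f}")
```

Real models run many such heads in parallel and stack them across layers, but the core idea is the same: each prediction is guided by a learned, weighted combination of the context.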
Illustration of how attention might focus on relevant words ("astronaut", "glowing", "plant") from the context when predicting the next word. The thickness and color intensity of the lines suggest the relative importance assigned.
While context is powerful, an LLM's ability to "remember" or consider past text is not infinite. Models have a context window (also called context length), which is the maximum number of tokens the model can consider at any one time. This includes both the input prompt and the generated output.
Think of it like short-term memory. The model can only keep a certain amount of recent information active. If a conversation or document exceeds the context window size, the model effectively forgets the earliest parts.
For example, a model with a context window of 4096 tokens can process roughly 3000 words (since tokens don't map one-to-one with words). If you provide it with a 5000-word document and ask for a summary, it might only be able to consider the last ~3000 words when generating the summary, potentially missing information from the beginning.
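One common way applications cope with this limit is to keep only the most recent tokens when a prompt grows too long. The helper below is a hypothetical sketch of that strategy; the function name, the 4096-token limit, and the space reserved for the model's output are illustrative choices, not part of any particular model's API.

```python
def fit_to_context_window(token_ids, max_tokens=4096, reserve_for_output=512):
    """Keep only the most recent tokens that fit in the context window.

    Illustrative sketch: max_tokens and reserve_for_output are example values,
    not tied to any specific model.
    """
    budget = max_tokens - reserve_for_output
    if len(token_ids) <= budget:
        return token_ids
    # Everything before the cutoff is effectively invisible to the model.
    return token_ids[-budget:]

# Example: a 6000-token document keeps only its last 3584 tokens.
document = list(range(6000))   # stand-in for real token IDs
kept = fit_to_context_window(document)
print(len(kept))  # 3584
```

Anything dropped by that slice is simply never seen by the model, which is why a summary of a long document can miss details from its opening sections.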
The size of the context window varies significantly between different LLMs. Models with larger context windows can handle longer conversations, analyze bigger documents, and maintain coherence over more extended interactions, but they often require more computational resources.
Understanding how context influences generation is fundamental to using LLMs effectively. By providing clear, relevant, and sufficient context within the model's context window, you can guide it to produce more accurate, coherent, and useful outputs. As you start crafting prompts in the next chapter, remember that the context you provide is the primary tool you have to shape the LLM's response.