Large language models, particularly those based on the Transformer architecture, generate text using an autoregressive process. This means they predict the next token in a sequence based on all the tokens generated so far. While powerful, this sequential, step-by-step generation method introduces significant computational and memory bottlenecks, especially when dealing with long sequences or serving real-time applications. Let's examine these challenges in detail.
The fundamental challenge stems from the core definition of autoregressive decoding. To generate the i-th token, the model requires the sequence of i−1 previously generated tokens as input. This creates an inherent sequential dependency: you cannot compute the next token until the current one is known.
Consider a simplified generation loop in PyTorch:
import torch
import torch.nn.functional as F

def generate_next_token(model, input_ids):
    """Generates the next token given the current sequence."""
    with torch.no_grad():
        # Get model logits for the last token position
        outputs = model(input_ids)
        next_token_logits = outputs.logits[:, -1, :]

        # Apply softmax to get probabilities
        probs = F.softmax(next_token_logits, dim=-1)

        # Select the next token (greedy decoding: take the most probable token)
        next_token_id = torch.argmax(probs, dim=-1).unsqueeze(-1)
    return next_token_id

# Example usage
model = ...      # Your pre-trained autoregressive LLM
tokenizer = ...  # Your tokenizer

prompt = "The quick brown fox"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
max_new_tokens = 50

for _ in range(max_new_tokens):
    next_token = generate_next_token(model, input_ids)

    # Check for end-of-sequence token
    if next_token.item() == tokenizer.eos_token_id:
        break

    # Append the new token and continue
    input_ids = torch.cat([input_ids, next_token], dim=-1)

generated_text = tokenizer.decode(input_ids[0])
print(generated_text)
This loop illustrates the iterative process. Each call to generate_next_token performs a full forward pass (or a significant portion of one) through the model. For a prompt of length N, generating M new tokens requires M separate forward passes, and in this naive loop each pass reprocesses the entire growing sequence. This sequential execution directly translates to latency: the total time is roughly proportional to the number of tokens generated.
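You can observe this scaling directly by timing each decoding step. The snippet below is a minimal sketch that reuses the placeholder model, tokenizer, and generate_next_token from the loop above; it simply records wall-clock time per step to show how total latency accumulates with the number of generated tokens.

import time

step_times = []
input_ids = tokenizer.encode(prompt, return_tensors="pt")

for _ in range(max_new_tokens):
    start = time.perf_counter()
    next_token = generate_next_token(model, input_ids)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure GPU work has finished before reading the clock
    step_times.append(time.perf_counter() - start)

    if next_token.item() == tokenizer.eos_token_id:
        break
    input_ids = torch.cat([input_ids, next_token], dim=-1)

# Total latency is roughly the sum of the per-step times
print(f"Tokens generated: {len(step_times)}")
print(f"Total latency:    {sum(step_times):.3f} s")
print(f"Mean per token:   {1000 * sum(step_times) / len(step_times):.1f} ms")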
A major bottleneck in Transformers, especially during inference, is memory bandwidth. In each step of the autoregressive generation, the model needs to compute attention scores. The standard self-attention mechanism requires calculating interactions between the current token's query vector and the key vectors of all preceding tokens in the sequence.
To do this efficiently, the keys (K) and values (V) computed for all previous tokens are typically kept in GPU memory. As the sequence grows longer, the size of this K/V cache increases linearly with the sequence length (n). The attention computation at step i involves loading these cached K and V tensors (of size proportional to i) from high-bandwidth memory (HBM) into the compute units (like Tensor Cores).
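To get a feel for how quickly this cache grows, you can estimate its size from the model dimensions. The helper below is a rough sketch using assumed dimensions loosely based on a 7B-parameter class decoder (32 layers, 32 heads, head dimension 128) stored in 16-bit precision; substitute the dimensions of the model you actually serve.

def kv_cache_bytes(seq_len, num_layers=32, num_heads=32, head_dim=128,
                   bytes_per_element=2, batch_size=1):
    """Approximate K/V cache size: two tensors (K and V) per layer,
    each of shape [batch, heads, seq_len, head_dim]."""
    per_layer = 2 * batch_size * num_heads * seq_len * head_dim * bytes_per_element
    return num_layers * per_layer

for n in (1_024, 4_096, 32_768):
    print(f"seq_len={n:>6}: ~{kv_cache_bytes(n) / 1e9:.2f} GB")

With these illustrative dimensions, the cache reaches roughly 17 GB at 32k tokens, and all of it must be read from HBM at every decoding step.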
$$\text{Attention}(Q_i, K_{1..i}, V_{1..i}) = \text{softmax}\!\left(\frac{Q_i K_{1..i}^{T}}{\sqrt{d_k}}\right) V_{1..i}$$

Here, $Q_i$ is the query for the token being generated at step $i$, while $K_{1..i}$ and $V_{1..i}$ represent the keys and values for all tokens from 1 to $i$. Retrieving $K_{1..i}$ and $V_{1..i}$ from memory at each step consumes significant memory bandwidth. For large models and long sequences, the time spent moving data between HBM and the compute units can dominate the overall latency, making the process memory-bound rather than compute-bound.
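The computation behind this formula for a single decoding step can be sketched in a few lines of PyTorch. This is a simplified single-head version, not a production kernel: k_cache and v_cache stand in for the cached keys and values of the previous tokens, and at large $i$ the dominant cost is streaming those two tensors from HBM rather than the matrix products themselves.

import math
import torch

def attention_step(q_i, k_i, v_i, k_cache, v_cache):
    """One decoding step of single-head attention.
    q_i, k_i, v_i: [1, d_k] projections for the current token.
    k_cache, v_cache: [i-1, d_k] cached projections of earlier tokens."""
    # Extend the cache with the current token's key and value
    k = torch.cat([k_cache, k_i], dim=0)      # [i, d_k]
    v = torch.cat([v_cache, v_i], dim=0)      # [i, d_k]
    d_k = q_i.shape[-1]

    # Scores against all i keys, then a weighted sum over all i values
    scores = (q_i @ k.T) / math.sqrt(d_k)     # [1, i]
    weights = torch.softmax(scores, dim=-1)   # [1, i]
    out = weights @ v                         # [1, d_k]
    return out, k, v

# Illustrative step i = 1025 with d_k = 128, using random tensors
d_k, prev = 128, 1024
out, k, v = attention_step(torch.randn(1, d_k), torch.randn(1, d_k), torch.randn(1, d_k),
                           torch.randn(prev, d_k), torch.randn(prev, d_k))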
Figure: Sequential dependency and increasing memory access in autoregressive generation. Each step requires loading the previously computed keys and values.
While memory bandwidth is often the primary bottleneck for single-token generation latency, the computational cost itself is not negligible. The self-attention calculation has a time complexity of $O(n^2 d)$ per layer, where $n$ is the sequence length and $d$ is the model's hidden dimension. Although techniques like KV caching (discussed next) avoid recomputing keys and values for past tokens, the attention score calculation at step $i$ still involves computing dot products between the current query $Q_i$ and all previous keys $K_{1..i}$, resulting in a cost proportional to $i \times d$ for that part of the computation. The subsequent multiplication with the value vectors $V_{1..i}$ also scales similarly.
Summing the computation across all layers and the feed-forward networks, each generation step requires a significant number of floating-point operations (FLOPs) that grows with the current sequence length. For very long sequences, this computational demand adds to the overall latency.
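The sketch below tallies an approximate FLOP count for just these two attention products at a given step, summed over an assumed 32 layers with hidden dimension 4096; projections and feed-forward layers are ignored, so treat the numbers only as an illustration of how the per-step cost grows linearly with the step index.

def attention_flops_per_step(i, d=4096, num_layers=32):
    """Approximate FLOPs at step i for Q_i K^T (i dot products of length d,
    about 2*i*d FLOPs) plus the weighted sum over V (roughly the same again)."""
    per_layer = 2 * i * d + 2 * i * d
    return num_layers * per_layer

for i in (512, 4_096, 32_768):
    print(f"step {i:>6}: ~{attention_flops_per_step(i) / 1e9:.2f} GFLOPs")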
The sequential nature of autoregressive decoding also poses challenges for throughput, which is the number of output tokens generated per unit of time across all concurrent requests. While you can process multiple independent generation requests in parallel batches, the generation within each sequence remains sequential. This limits the extent to which hardware can be fully utilized, particularly if sequences have vastly different lengths or if individual requests require very long outputs. Optimizing batching strategies becomes important but doesn't fundamentally change the per-sequence sequential constraint.
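For illustration, the sketch below pushes several requests through the same naive greedy loop as a batch, reusing the placeholder model and tokenizer from earlier and assuming a Hugging Face-style tokenizer that supports padding. Every row still advances one token per step, and rows that finish early keep occupying their batch slot, which is one reason optimized batching strategies matter.

prompts = ["The quick brown fox", "Once upon a time", "Write a haiku about GPUs"]

# Left-pad so the last position of every row is a real token
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(prompts, return_tensors="pt", padding=True)
input_ids, attn_mask = batch["input_ids"], batch["attention_mask"]

for _ in range(max_new_tokens):
    with torch.no_grad():
        # One forward pass advances every sequence in the batch by one token
        logits = model(input_ids, attention_mask=attn_mask).logits[:, -1, :]
    next_tokens = torch.argmax(logits, dim=-1, keepdim=True)             # [batch, 1]
    input_ids = torch.cat([input_ids, next_tokens], dim=-1)
    attn_mask = torch.cat([attn_mask, torch.ones_like(next_tokens)], dim=-1)
    # A real server would also track per-row EOS tokens and retire finished rows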
These are the inherent challenges of the standard autoregressive decoding process. The sequential dependency, memory bandwidth limitations, growing computational cost, and throughput constraints necessitate the specialized optimization techniques explored in the following sections of this chapter. Strategies like KV caching, optimized attention implementations, and speculative decoding directly target these bottlenecks to make LLM inference faster and more efficient.