Large language models, particularly those based on the Transformer architecture, generate text using an autoregressive process. This means they predict the next token in a sequence based on all the tokens generated so far. While powerful, this sequential, step-by-step generation method introduces significant computational and memory bottlenecks, especially when dealing with long sequences or serving real-time applications. Let's examine these challenges in detail.
The fundamental challenge stems from the core definition of autoregressive decoding. To generate the i-th token, the model requires the sequence of i−1 previously generated tokens as input. This creates an inherent sequential dependency: you cannot compute the next token until the current one is known.
Consider a simplified generation loop in PyTorch:
import torch
import torch.nn.functional as F

def generate_next_token(model, input_ids):
    """Generates the next token given the current sequence."""
    with torch.no_grad():
        # Get model logits for the last token position
        outputs = model(input_ids)
        next_token_logits = outputs.logits[:, -1, :]

        # Apply softmax to get probabilities
        probs = F.softmax(next_token_logits, dim=-1)

        # Select the next token (greedy decoding: take the most probable token)
        next_token_id = torch.argmax(probs, dim=-1).unsqueeze(-1)
    return next_token_id

# Example usage
model = ...      # Your pre-trained autoregressive LLM
tokenizer = ...  # Your tokenizer

prompt = "The quick brown fox"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
max_new_tokens = 50

for _ in range(max_new_tokens):
    next_token = generate_next_token(model, input_ids)

    # Check for end-of-sequence token
    if next_token.item() == tokenizer.eos_token_id:
        break

    # Append the new token and continue
    input_ids = torch.cat([input_ids, next_token], dim=-1)

generated_text = tokenizer.decode(input_ids[0])
print(generated_text)
This loop illustrates the iterative process. Each call to generate_next_token performs a full forward pass (or a significant portion of one) through the model. For a prompt of length N, generating M new tokens requires M separate forward passes, and in this naive loop each pass reprocesses the entire growing sequence. This sequential execution directly translates to latency: the total time is roughly proportional to the number of tokens generated.
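You can observe this scaling directly by timing each decoding step. The snippet below is a minimal sketch that reuses the placeholder model, tokenizer, and generate_next_token from the loop above; it simply records wall-clock time per step to show how total latency accumulates with the number of generated tokens.

import time

step_times = []
input_ids = tokenizer.encode(prompt, return_tensors="pt")

for _ in range(max_new_tokens):
    start = time.perf_counter()
    next_token = generate_next_token(model, input_ids)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure GPU work has finished before reading the clock
    step_times.append(time.perf_counter() - start)

    if next_token.item() == tokenizer.eos_token_id:
        break
    input_ids = torch.cat([input_ids, next_token], dim=-1)

# Total latency is roughly the sum of the per-step times
print(f"Tokens generated: {len(step_times)}")
print(f"Total latency:    {sum(step_times):.3f} s")
print(f"Mean per token:   {1000 * sum(step_times) / len(step_times):.1f} ms")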
A major bottleneck in Transformers, especially during inference, is memory bandwidth. In each step of the autoregressive generation, the model needs to compute attention scores. The standard self-attention mechanism requires calculating interactions between the current token's query vector and the key vectors of all preceding tokens in the sequence.
To do this efficiently, the keys (K) and values (V) computed for all previous tokens are typically kept in GPU memory. As the sequence grows longer, the size of this K/V cache increases linearly with the sequence length (n). The attention computation at step i involves loading these cached K and V tensors (of size proportional to i) from high-bandwidth memory (HBM) into the compute units (like Tensor Cores).
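To get a feel for how quickly this cache grows, you can estimate its size from the model dimensions. The helper below is a rough sketch using assumed dimensions loosely based on a 7B-parameter class decoder (32 layers, 32 heads, head dimension 128) stored in 16-bit precision; substitute the dimensions of the model you actually serve.

def kv_cache_bytes(seq_len, num_layers=32, num_heads=32, head_dim=128,
                   bytes_per_element=2, batch_size=1):
    """Approximate K/V cache size: two tensors (K and V) per layer,
    each of shape [batch, heads, seq_len, head_dim]."""
    per_layer = 2 * batch_size * num_heads * seq_len * head_dim * bytes_per_element
    return num_layers * per_layer

for n in (1_024, 4_096, 32_768):
    print(f"seq_len={n:>6}: ~{kv_cache_bytes(n) / 1e9:.2f} GB")

With these illustrative dimensions, the cache reaches roughly 17 GB at 32k tokens, and all of it must be read from HBM at every decoding step.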
$$\text{Attention}(Q_i, K_{1..i}, V_{1..i}) = \text{softmax}\!\left(\frac{Q_i K_{1..i}^{T}}{\sqrt{d_k}}\right) V_{1..i}$$

Here, $Q_i$ is the query for the token being generated at step $i$, while $K_{1..i}$ and $V_{1..i}$ represent the keys and values for all tokens from 1 to $i$. Retrieving $K_{1..i}$ and $V_{1..i}$ from memory at each step consumes significant memory bandwidth. For large models and long sequences, the time spent moving data between HBM and the compute units can dominate the overall latency, making the process memory-bound rather than compute-bound.
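The computation behind this formula for a single decoding step can be sketched in a few lines of PyTorch. This is a simplified single-head version, not a production kernel: k_cache and v_cache stand in for the cached keys and values of the previous tokens, and at large $i$ the dominant cost is streaming those two tensors from HBM rather than the matrix products themselves.

import math
import torch

def attention_step(q_i, k_i, v_i, k_cache, v_cache):
    """One decoding step of single-head attention.
    q_i, k_i, v_i: [1, d_k] projections for the current token.
    k_cache, v_cache: [i-1, d_k] cached projections of earlier tokens."""
    # Extend the cache with the current token's key and value
    k = torch.cat([k_cache, k_i], dim=0)      # [i, d_k]
    v = torch.cat([v_cache, v_i], dim=0)      # [i, d_k]
    d_k = q_i.shape[-1]

    # Scores against all i keys, then a weighted sum over all i values
    scores = (q_i @ k.T) / math.sqrt(d_k)     # [1, i]
    weights = torch.softmax(scores, dim=-1)   # [1, i]
    out = weights @ v                         # [1, d_k]
    return out, k, v

# Illustrative step i = 1025 with d_k = 128, using random tensors
d_k, prev = 128, 1024
out, k, v = attention_step(torch.randn(1, d_k), torch.randn(1, d_k), torch.randn(1, d_k),
                           torch.randn(prev, d_k), torch.randn(prev, d_k))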
Figure: Sequential dependency and increasing memory access in autoregressive generation. Each step requires loading the previously computed keys and values.
While memory bandwidth is often the primary bottleneck for single-token generation latency, the computational cost itself is not negligible. The self-attention calculation has a time complexity of $O(n^2 d)$ per layer, where $n$ is the sequence length and $d$ is the model's hidden dimension. Although techniques like KV caching (discussed next) avoid recomputing keys and values for past tokens, the attention score calculation at step $i$ still involves computing dot products between the current query $Q_i$ and all previous keys $K_{1..i}$, resulting in a cost proportional to $i \times d$ for that part of the computation. The subsequent multiplication with the value vectors $V_{1..i}$ also scales similarly.
Summing the computation across all layers and the feed-forward networks, each generation step requires a significant number of floating-point operations (FLOPs) that grows with the current sequence length. For very long sequences, this computational demand adds to the overall latency.
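The sketch below tallies an approximate FLOP count for just these two attention products at a given step, summed over an assumed 32 layers with hidden dimension 4096; projections and feed-forward layers are ignored, so treat the numbers only as an illustration of how the per-step cost grows linearly with the step index.

def attention_flops_per_step(i, d=4096, num_layers=32):
    """Approximate FLOPs at step i for Q_i K^T (i dot products of length d,
    about 2*i*d FLOPs) plus the weighted sum over V (roughly the same again)."""
    per_layer = 2 * i * d + 2 * i * d
    return num_layers * per_layer

for i in (512, 4_096, 32_768):
    print(f"step {i:>6}: ~{attention_flops_per_step(i) / 1e9:.2f} GFLOPs")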
The sequential nature of autoregressive decoding also poses challenges for throughput, which is the number of output tokens generated per unit of time across all concurrent requests. While you can process multiple independent generation requests in parallel batches, the generation within each sequence remains sequential. This limits the extent to which hardware can be fully utilized, particularly if sequences have vastly different lengths or if individual requests require very long outputs. Optimizing batching strategies becomes important but doesn't fundamentally change the per-sequence sequential constraint.
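For illustration, the sketch below pushes several requests through the same naive greedy loop as a batch, reusing the placeholder model and tokenizer from earlier and assuming a Hugging Face-style tokenizer that supports padding. Every row still advances one token per step, and rows that finish early keep occupying their batch slot, which is one reason optimized batching strategies matter.

prompts = ["The quick brown fox", "Once upon a time", "Write a haiku about GPUs"]

# Left-pad so the last position of every row is a real token
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(prompts, return_tensors="pt", padding=True)
input_ids, attn_mask = batch["input_ids"], batch["attention_mask"]

for _ in range(max_new_tokens):
    with torch.no_grad():
        # One forward pass advances every sequence in the batch by one token
        logits = model(input_ids, attention_mask=attn_mask).logits[:, -1, :]
    next_tokens = torch.argmax(logits, dim=-1, keepdim=True)             # [batch, 1]
    input_ids = torch.cat([input_ids, next_tokens], dim=-1)
    attn_mask = torch.cat([attn_mask, torch.ones_like(next_tokens)], dim=-1)
    # A real server would also track per-row EOS tokens and retire finished rows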
These are the inherent challenges of the standard autoregressive decoding process. The sequential dependency, memory bandwidth limitations, growing computational cost, and throughput constraints necessitate the specialized optimization techniques explored in the following sections of this chapter. Strategies like KV caching, optimized attention implementations, and speculative decoding directly target these bottlenecks to make LLM inference faster and more efficient.