LLM Inference Pipeline (Prefill vs Decode)
Watch or read first (recommended order)
- Karpathy - "Let's build GPT from scratch" (YouTube, 2h). Best for code-level understanding. Jump to 1:20:00 for generation.
- Daily Dose DS - "KV Caching in LLMs Explained Visually" (blog). Excellent visuals of prefill vs decode; as it notes, "this is why ChatGPT takes more time to generate the first token than the subsequent tokens".
- vLLM blog - "Efficient Memory Management for LLM Serving with PagedAttention". Production-level view.
TL;DR
Every LLM inference request has two phases with very different economics:
- Prefill: process the entire input prompt in parallel. Heavy compute, fast per token.
- Decode: generate output tokens one at a time. Light compute per token, but bound by memory access.
Understanding this split is the foundation for understanding the [KV cache](/kb/01-llms/kv-cache), [vLLM](/kb/07-llm-optimization/vllm), and why ChatGPT has that "thinking..." pause before streaming.
The historical problem
Naive sequence generation with a transformer looks deceptively simple: feed prompt in, generate next token, append, repeat.
But this hides a performance cliff. If you recompute attention from scratch over all past tokens for every generated token, you are doing O(N^2) work per step and O(N^3) work in total for a response of length N. A 500-token response would take minutes.
The prefill / decode separation was the key insight that made real-time LLM chat possible. It also set up the KV cache.
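A quick op count makes the cliff concrete. The sketch below counts only query-key attention interactions (ignoring constant factors and the MLP); the point is the growth rate, not exact FLOPs:

```python
def naive_ops(n_generated: int, prompt_len: int = 0) -> int:
    """Recompute attention over the full sequence at every step: O(N^3) total."""
    total = 0
    for step in range(1, n_generated + 1):
        seq_len = prompt_len + step
        total += seq_len * seq_len   # full self-attention, redone every step
    return total

def cached_ops(n_generated: int, prompt_len: int = 0) -> int:
    """With a KV cache, each step only attends for the new token: O(N^2) total."""
    total = 0
    for step in range(1, n_generated + 1):
        seq_len = prompt_len + step
        total += seq_len             # one query row against all cached keys
    return total

ratio = naive_ops(500) / cached_ops(500)
print(f"500-token response: naive recomputation does {ratio:.0f}x the attention work")
```

For a 500-token response the naive approach does roughly 330x the attention work, and the gap widens linearly with response length.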
How it works
The two phases, side by side
Input Prompt: "Once upon a time" (4 tokens)
|
v
+---------------------+
| PREFILL | (processes all 4 tokens in PARALLEL)
+---------------------+
| Compute K1, V1 |
| Compute K2, V2 |
| Compute K3, V3 |
| Compute K4, V4 |
| Store in KV cache |
| Output first token | <- "there"
+---------------------+
|
v (first token, now append)
+---------------------+
| DECODE | (one token at a time, SEQUENTIAL)
+---------------------+
| Read KV cache |
| Compute Q5, K5, V5 | <- just for the new token
| Attention |
| Output next token | <- "was"
| Append K5, V5 to |
| cache |
+---------------------+
|
v (repeat for each new token)
"a", "brave", "hero", "."
Prefill: compute-bound
During prefill, the model processes all input tokens simultaneously. This is heavy matmul (matrix multiplication) work. The GPU's arithmetic units are the bottleneck.
Key characteristics:
- Parallelizable: N tokens processed at once
- Compute-bound: bottleneck is FLOPs
- Duration: roughly proportional to prompt length
- First-token latency (TTFT): mostly prefill time
For a 10K token prompt on a modern GPU, prefill can take 1-3 seconds. This is the "ChatGPT is thinking..." pause before streaming starts.
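A compute-bound back-of-envelope estimate, using the common ~2 FLOPs per parameter per token rule. All numbers here are assumptions (7B parameters, roughly H100-class peak throughput, 50% matmul utilization), and the rule ignores attention's quadratic term and framework overhead, so it is a lower bound on real TTFT:

```python
def prefill_seconds(prompt_tokens: int,
                    params: float = 7e9,          # assumed 7B model
                    peak_flops: float = 1e15,     # assumed ~H100-class peak
                    utilization: float = 0.5) -> float:
    """Compute-bound prefill estimate: ~2 FLOPs per parameter per token."""
    flops = 2 * params * prompt_tokens
    return flops / (peak_flops * utilization)

print(f"{prefill_seconds(10_000):.2f}s lower bound for a 10K-token prompt")
```

Real systems land above this floor once attention cost, scheduling, and batching with other requests are included.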
Decode: memory-bound
During decode, the model generates ONE token at a time. For each new token:
- Read the entire KV cache from memory (potentially gigabytes)
- Compute Q, K, V for the SINGLE new token (cheap)
- Attention: Q dot-product against cached K, then weighted sum of cached V
- Run through MLP, output logit vector
- Sample next token
- Append the new K, V to cache
Key characteristics:
- Sequential: token N+1 needs token N (cannot parallelize within a request)
- Memory-bound: bottleneck is reading the KV cache, not computing with it
- Duration per token: mostly constant (small compute, fixed memory read)
- Time between tokens (TBT): the streaming rate you see in ChatGPT
For a 7B model on one H100 GPU, you can decode at ~50-100 tokens/second. The limit is how fast you can stream the KV cache from GPU memory to the compute units.
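The memory-bound ceiling can be estimated the same way: every decode step must stream the model weights plus the KV cache from HBM at least once. The bandwidth figure below is an assumption (roughly H100 SXM class); real systems land well below this ceiling due to kernel overhead and sharing bandwidth across a batch:

```python
def decode_tokens_per_sec(model_bytes: float,
                          kv_cache_bytes: float,
                          hbm_bandwidth: float = 3.35e12  # assumed ~H100 SXM
                          ) -> float:
    """Memory-bound ceiling: each step streams weights + KV cache once."""
    return hbm_bandwidth / (model_bytes + kv_cache_bytes)

# 7B model in FP16 (~14 GB) plus an assumed 2 GB KV cache:
print(f"~{decode_tokens_per_sec(14e9, 2e9):.0f} tok/s theoretical ceiling")
```

Note that compute barely appears in this formula: for single-request decode, arithmetic is so cheap relative to the memory traffic that bandwidth alone sets the ceiling.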
Why the split matters
This asymmetry drives every optimization in LLM serving:
| Concern | Prefill strategy | Decode strategy |
|---|---|---|
| Maximize throughput | Batch many requests | Continuous batching |
| Reduce latency | Cache prompt prefixes (prompt caching) | Speculative decoding |
| Memory | Less critical (transient) | CRITICAL (KV cache dominates) |
| Compute | Heavy (matmul) | Light (one token) |
| Parallelism | High (across tokens) | Low (one at a time) |
Where the time goes
Rough breakdown for a "short answer" query (1K token prompt, 200 tokens output) on a 7B model:
Total time: ~3 seconds
|
+-- Prefill: ~0.3s (10%) <- processes all 1K input tokens
+-- Decode: ~2.7s (90%) <- generates 200 output tokens at ~75 tok/s
Decode dominates. This is why KV cache and vLLM optimizations target decode latency more than prefill.
For a "long context, short answer" query (100K input, 200 output):
Total time: ~15 seconds
|
+-- Prefill: ~12s (80%) <- 100K tokens is a LOT to process
+-- Decode: ~3s (20%)
Here prefill dominates. This is where prompt caching becomes a game changer.
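The two breakdowns above can be reproduced with a toy calculator. The throughput numbers are assumptions, picked only to land in the same ballpark as the figures above:

```python
def latency_split(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float = 8_000,    # assumed prefill throughput
                  decode_tps: float = 75):       # assumed decode rate
    """Return (prefill seconds, decode seconds, prefill share of total)."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s / (prefill_s + decode_s)

for prompt, out in [(1_000, 200), (100_000, 200)]:
    p, d, frac = latency_split(prompt, out)
    print(f"{prompt:>7} in / {out} out: prefill {p:.1f}s, "
          f"decode {d:.1f}s ({frac:.0%} prefill)")
```

The crossover is the useful intuition: prefill share grows linearly with prompt length while decode cost is fixed by output length, so every serving stack eventually needs both decode optimizations and prompt caching.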
Relevance today (2026)
The prefill / decode framing became widely understood in 2023 and drives every production LLM stack in 2026.
1. Chunked prefill
For very long prompts (1M+ context in Gemini 2 Ultra, Claude Opus 4.5), prefill is chunked and interleaved with ongoing decodes of other requests. vLLM and SGLang support this as "chunked prefill" to prevent long prefills from blocking other users.
2. Continuous batching
Static batching waited for all requests in a batch to finish before starting new ones. Continuous batching (introduced by Orca in 2022, popularized by vLLM in 2023, now standard) admits new requests into the decode batch as slots free up. This lets you serve many more requests on the same GPU.
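A toy simulation shows why this matters for mixed workloads. It counts only decode steps (prefill and per-step cost are ignored), with each request defined just by how many output tokens it needs:

```python
from collections import deque

def static_batching(requests, max_slots):
    """Old style: each batch runs until EVERY request in it finishes."""
    steps = 0
    for i in range(0, len(requests), max_slots):
        steps += max(requests[i:i + max_slots])    # batch waits on its longest
    return steps

def continuous_batching(requests, max_slots):
    """Each decode step every active request emits one token, and freed
    slots are refilled from the queue immediately."""
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        while queue and len(active) < max_slots:   # refill freed slots NOW
            active.append(queue.popleft())
        active = [r - 1 for r in active if r > 1]  # drop requests finishing now
        steps += 1
    return steps

work = [5, 100, 5, 100]               # mixed short and long generations
print(static_batching(work, 2), continuous_batching(work, 2))
```

With two slots, static batching pays for the longest request in every batch (200 steps total), while continuous batching slots the short requests into gaps (110 steps), nearly doubling throughput on the same hardware.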
3. Prompt caching
In 2024, OpenAI and Anthropic launched prompt caching: the system remembers the KV cache of a prompt prefix across requests. If you send the same 20K system prompt 100 times, you pay prefill once. See prompt caching.
Huyen mentions this briefly. In 2026 it is table stakes for any agent system.
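The core mechanism can be sketched in a few lines. This is a toy: the key is a hash of the full token prefix and the "KV cache" is an opaque value, whereas real systems (PagedAttention prefix caching, Anthropic's cache_control) match at block granularity and evict under memory pressure:

```python
import hashlib

class PrefixCache:
    """Toy prompt cache: key = hash of the token prefix, value = its KV cache."""
    def __init__(self):
        self.store, self.hits, self.misses = {}, 0, 0

    def _key(self, tokens):
        return hashlib.sha256(str(tokens).encode()).hexdigest()

    def get_or_prefill(self, prefix_tokens, prefill_fn):
        key = self._key(prefix_tokens)
        if key in self.store:
            self.hits += 1                 # skip prefill entirely
        else:
            self.misses += 1
            self.store[key] = prefill_fn(prefix_tokens)   # pay prefill once
        return self.store[key]

cache = PrefixCache()
system_prompt = list(range(20_000))        # stand-in for a 20K-token prefix
for _ in range(100):
    cache.get_or_prefill(system_prompt, prefill_fn=lambda t: f"kv[{len(t)}]")
print(cache.hits, cache.misses)            # 99 hits, 1 prefill
```

One prefill amortized over 100 requests is exactly the "pay once" economics described above; the cost model shifts from per-request prefill to cache storage.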
4. Speculative decoding
A small draft model generates K tokens ahead, the big model verifies them in parallel. If the draft is right (often), you get K tokens per big-model step. 2-3x speedup on decode, no quality loss. Implemented in vLLM, Medusa, EAGLE.
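The accept/reject loop can be illustrated with toy deterministic "models" (both are made-up arithmetic functions standing in for real networks; real speculative decoding also uses rejection sampling to preserve the target distribution, and a bonus token from each verify pass):

```python
def target_next(seq):                  # stand-in "big model": deterministic
    return (seq[-1] * 3 + 1) % 17

def draft_next(seq):                   # cheap draft: right most of the time
    t = target_next(seq)
    return t if len(seq) % 5 else (t + 1) % 17   # wrong at every 5th position

def speculative_generate(prompt, n_tokens, k=4):
    seq, big_model_steps = list(prompt), 0
    while len(seq) - len(prompt) < n_tokens:
        ctx, proposal = list(seq), []
        for _ in range(k):             # draft proposes k tokens (cheap, serial)
            ctx.append(draft_next(ctx))
            proposal.append(ctx[-1])
        big_model_steps += 1           # ONE parallel verify pass over all k
        for tok in proposal:
            correct = target_next(seq)
            seq.append(correct)        # always a valid target-model token
            if tok != correct:         # draft diverged: discard the rest
                break
    return seq[len(prompt):][:n_tokens], big_model_steps

out, steps = speculative_generate([2], 40, k=4)
print(f"40 tokens in {steps} big-model steps")
```

The output is token-for-token identical to plain greedy decoding with the target model; the win is purely that the big model runs far fewer sequential steps, each verifying k positions in parallel like a mini prefill.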
5. Reasoning models change the accounting
OpenAI o1, Claude Opus 4.5 thinking, DeepSeek R1 generate LOTS of thinking tokens before the visible answer. Decode dominates even more. Latency per response is 10-60 seconds, not 2-3.
For these, test-time compute scales inference cost. See test time compute.
Critical questions
- Why can prefill parallelize N input tokens but decode cannot parallelize output tokens within one request?
- If prefill is compute-bound and decode is memory-bound, do they need different hardware? (Yes! Some production stacks even consider separate pools.)
- Why is prompt caching a bigger deal for long contexts than short? (Prefill cost grows linearly with prompt length.)
- How does speculative decoding fit into this pipeline? (Decode side, with a twist.)
- For a chat app serving 1000 users, are you prefill-limited or decode-limited? (Usually decode at scale.)
Production pitfalls
- Confusing TTFT with TBT. Time-to-first-token (TTFT) = prefill + first decode. Time-between-tokens (TBT) = decode steady state. Optimizing the wrong one leads nowhere.
- Under-provisioning KV cache memory. Decode reads the cache constantly. If you run out, the system thrashes. vLLM's PagedAttention manages this.
- Prefill saturating GPU at peak. Long prompts during peak traffic can starve other users' decodes. Use chunked prefill.
- Mistaking latency for throughput. A batched decode is slower PER REQUEST but higher THROUGHPUT. Choose what matters for your app.
- Ignoring reasoning model economics. A reasoning model spends 90% of its time decoding thinking tokens. Your cost estimate must account for those.
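The first pitfall (TTFT vs TBT) is easy to avoid if you compute both from the same token-timestamp trace. A minimal sketch, assuming you can record a wall-clock timestamp per streamed token:

```python
def ttft_and_tbt(request_start: float, token_times: list[float]):
    """Split a per-token timestamp trace into the two metrics that matter."""
    ttft = token_times[0] - request_start          # prefill + first decode step
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps)                    # steady-state decode rate
    return ttft, tbt

# e.g. request at t=0, first token at 0.8s, then one token every 20 ms
times = [0.8 + 0.02 * i for i in range(50)]
ttft, tbt = ttft_and_tbt(0.0, times)
print(f"TTFT {ttft * 1000:.0f} ms, TBT {tbt * 1000:.0f} ms")
```

Chunked prefill lowers the first number; speculative decoding and batching policy move the second. Measuring them separately tells you which lever to pull.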
Mental parallels (non-AI)
- Book vs exam. Prefill = reading the whole book (parallel, heavy). Decode = taking the exam (sequential, answer one question at a time, look things up in your notes). The "notes" are the KV cache.
- Cooking a multi-course meal. Mise-en-place (prefill) = chop all vegetables, prepare sauces in parallel. Service (decode) = plate each dish sequentially, pulling from prepped mise-en-place.
- Compilation vs execution. Compilation = analyze and optimize everything in bulk (slow but parallel). Execution = run instructions one at a time (fast per step, sequential).
- A lawyer reviewing a contract. Prefill = read the whole contract, flag every clause. Decode = answer questions one by one, referring to flagged clauses. Flagged clauses = KV cache.
Mini-lab
labs/measure-prefill-decode/ (to create) - measure prefill vs decode times on a local model:
- Load a small model (Phi-3 mini or Llama 3 3B) via `llama-cpp-python` or `transformers`.
- Send a short prompt (10 tokens) and generate 200 tokens. Measure total time, split prefill (first token) vs decode (next 199).
- Send a long prompt (10K tokens) and generate 200 tokens. Measure the same split.
- Plot prefill time vs prompt length.
- Plot decode time per token vs generation position.
Goal: see with your own eyes that prefill scales with prompt length, decode scales with output length. Feel where your app's bottleneck actually is.
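A measurement harness for the lab might look like the sketch below. The `generate_step` interface is hypothetical (a callable `state -> (state, token)` standing in for whatever streaming API `llama-cpp-python` or `transformers` gives you); here it is smoke-tested with a do-nothing stub:

```python
import time

def measure_split(generate_step, prompt, n_tokens):
    """Time prefill (everything up to the first token) separately from decode."""
    t0 = time.perf_counter()
    state, _first = generate_step(prompt)          # includes prefill
    ttft = time.perf_counter() - t0
    t1 = time.perf_counter()
    for _ in range(n_tokens - 1):
        state, _tok = generate_step(state)         # pure decode steps
    decode_s = time.perf_counter() - t1
    return ttft, decode_s / (n_tokens - 1)         # TTFT, mean per-token TBT

# smoke test with a stub model; swap in a real streaming step for the lab
ttft, tbt = measure_split(lambda s: (s, 0), prompt="hi", n_tokens=200)
print(f"TTFT {ttft:.6f}s, TBT {tbt:.6f}s")
```

With a real model plugged in, repeating the run across prompt lengths gives you the two plots the lab asks for directly from the returned pair.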
Further reading
Canonical explanations
- Karpathy, "Let's build GPT from scratch" (YouTube, 2023): https://www.youtube.com/watch?v=kCc8FmEb1nY
- Daily Dose DS, "KV Caching in LLMs Explained Visually": https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/
- Jay Alammar, "The Illustrated GPT-2": https://jalammar.github.io/illustrated-gpt2/
Production LLM serving
- vLLM paper, "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., 2023): https://arxiv.org/abs/2309.06180
- SGLang paper, "Efficiently Programming Large Language Models using SGLang" (Zheng et al., 2024)
- Orca / Continuous Batching (Yu et al., 2022): https://www.usenix.org/conference/osdi22/presentation/yu
Speculative decoding
- Medusa (Cai et al., 2024): https://arxiv.org/abs/2401.10774
- EAGLE (Li et al., 2024): https://arxiv.org/abs/2401.15077
For the 2026 context
- Huyen, "AI Engineering", Chapter 9 (inference optimization)
- Anthropic prompt caching docs: https://docs.anthropic.com/claude/docs/prompt-caching
- OpenAI prompt caching announcement (Oct 2024)