LLM Inference Pipeline (Prefill vs Decode)
Watch or read first (recommended order)
- Karpathy - "Let's build GPT from scratch" (YouTube, 2h): https://www.youtube.com/watch?v=kCc8FmEb1nY - best for code-level understanding. Jump to 1:20:00 for generation.
- Daily Dose DS - "KV Caching in LLMs Explained Visually" (blog): https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/ - excellent visual of the prefill hint.
- vLLM paper - "Efficient Memory Management for LLM Serving with PagedAttention" (Kwon et al., 2023): https://arxiv.org/abs/2309.06180 - production-level view.
TL;DR
When you ask an LLM something, the model does TWO different jobs back to back:
- Prefill: read your prompt. Fast because it reads everything in parallel.
- Decode: write the answer. Slow because it writes one token at a time.
That's the whole concept. The "ChatGPT is thinking..." pause you see before streaming starts is prefill finishing. The words appearing one by one after are decode.
Everything else on this page is details about WHY each phase behaves the way it does, and WHY that matters in production.
The everyday observation
Open ChatGPT. Type a question. Hit Enter.
Two things happen:
- Pause for a moment (nothing visible)
- Words stream out, one chunk at a time, until the answer is done
That pause + stream is not a UI quirk. It is the model doing two fundamentally different things.
The pause is prefill: the model reads your entire prompt in one parallel pass, building an internal representation of every token. For a short prompt, this is fast. For a 20,000-token system prompt plus history, this can take seconds.
The stream is decode: the model generates the answer one token at a time. Each new word depends on all the words before it, so they cannot be predicted in parallel. You see them pop out at the speed the GPU can compute them.
Once you see these two phases, you cannot unsee them. Every LLM API and nearly every serving optimization (vLLM, prompt caching, speculative decoding) is aimed primarily at one phase or the other.
A simple analogy before the details
A lawyer prepares for a deposition:
- Prefill = reading the case file. Long, dense, but the lawyer can skim sections in parallel (different bookmarks at once). A thick file takes a while, but it only happens once.
- Decode = answering questions in the deposition. Each answer comes out one sentence at a time. The lawyer keeps glancing at their bookmarks (the KV cache) to stay consistent. Fast per question, but strictly sequential.
If the case file is short, most of the time is spent answering. If the case file is huge, most of the time is spent reading.
That trade-off, case-file-size vs answer-length, is the entire design space of LLM inference.
The historical problem
The naive way of generating text with a transformer looks simple: feed the prompt in, predict the next token, append, predict again, repeat.
But there is a hidden problem. If you do that naively, every new token forces the model to reprocess ALL previous tokens, recomputing their K and V values from scratch. So step #100 reprocesses 100 tokens, step #500 reprocesses 500, and so on.
For a 500-token response, that adds up to about 125,000 token recomputations (1 + 2 + ... + 500 = 125,250) instead of 500, roughly 250x more work. On a modest GPU that can stretch a response to minutes. ChatGPT in real time would be impossible.
The fix is a two-part trick:
- Split the work into prefill (read the prompt once) and decode (generate tokens)
- Keep a running memory of past K and V values so you never recompute them (that memory is the KV cache)
Together, these make LLM inference feasible in real time.
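To make the arithmetic concrete, here is a tiny counting sketch. It only counts how many times a token's K and V get computed under each strategy; the function names are made up for this illustration and no real model is involved.

```python
def naive_recomputations(n_tokens: int) -> int:
    """Without a cache, step n recomputes K and V for all n tokens so far."""
    return sum(range(1, n_tokens + 1))


def cached_recomputations(n_tokens: int) -> int:
    """With a KV cache, each token's K and V are computed exactly once."""
    return n_tokens


naive = naive_recomputations(500)
cached = cached_recomputations(500)
print(naive, cached, naive / cached)  # 125250 500 250.5
```

The gap grows with the square of the response length, which is why the cache stops being optional very quickly.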
How it works
The two phases, side by side
Input Prompt: "Once upon a time" (4 tokens)
|
v
+---------------------+
| PREFILL | (processes all 4 tokens in PARALLEL)
+---------------------+
| Compute K1, V1 |
| Compute K2, V2 |
| Compute K3, V3 |
| Compute K4, V4 |
| Store in KV cache |
| Output first token | <- "there"
+---------------------+
|
v (first token, now append)
+---------------------+
| DECODE | (one token at a time, SEQUENTIAL)
+---------------------+
| Read KV cache |
| Compute Q5, K5, V5 | <- just for the new token
| Attention |
| Output next token | <- "was"
| Append K5, V5 to |
| cache |
+---------------------+
|
v (repeat for each new token)
"a", "brave", "hero", "."
Prefill: heavy math, but parallel
During prefill, the model crunches all your prompt tokens at the same time.
Imagine giving a textbook to ten students and asking each to study one page. They all work in parallel. That is prefill.
Why it is fast per token: the GPU is built for parallel math. A thousand matrix multiplications at once is the GPU's happy place.
Why it can still feel slow: the TOTAL amount of math scales with the prompt length. A 10,000-token prompt is at least 10x more math than a 1,000-token one (the attention part grows faster than linearly). For a big prompt on a single GPU, prefill can take 1-3 seconds before anything streams.
That pause you feel before ChatGPT starts typing? That is prefill finishing.
Rule of thumb: prefill time ~ prompt length. Double the prompt, roughly double the pause. At very long contexts the quadratic attention cost makes it grow faster than that.
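A back-of-the-envelope way to sanity-check that rule of thumb: a common approximation is that a forward pass costs about 2 FLOPs per model parameter per token, ignoring the attention term. The sketch below uses that approximation; the GPU throughput and the 50% utilization figure are hypothetical placeholders, not measurements.

```python
def prefill_seconds(n_params, prompt_tokens, gpu_flops_per_s, utilization=0.5):
    """Rough compute-bound estimate: ~2 FLOPs per parameter per prompt token.

    Ignores the quadratic attention term, so it underestimates long prompts.
    """
    flops = 2 * n_params * prompt_tokens
    return flops / (gpu_flops_per_s * utilization)


# Hypothetical setup: 7e9-parameter model, GPU with 100e12 FLOP/s peak.
for tokens in (1_000, 2_000, 10_000):
    print(tokens, round(prefill_seconds(7e9, tokens, 100e12), 2))
```

Under these assumptions the estimate is linear by construction: doubling `prompt_tokens` doubles the result. Real measurements bend upward once attention starts to dominate.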
Decode: light math, but sequential
During decode, the model generates one token at a time. You cannot predict token #5 without token #4 (it depends on what #4 was).
So for every new token, the GPU does a small job:
- Read the saved K and V from the KV cache (potentially gigabytes of data)
- Compute Q, K, V for the ONE new token (cheap)
- Do attention: compare the new Q to all cached K, then combine the cached V
- Run through the feed-forward network (MLP)
- Pick the next token (see sampling and temperature)
- Save the new K, V to the cache
- Repeat
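The steps above can be sketched with a single attention head in NumPy. This is a toy under stated assumptions: one head, no MLP, no sampling, random weights, and an 8-dimensional embedding picked for the example. The point is only to show that a decode step using the cache gives the same attention output as recomputing everything from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension (assumption for this sketch)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def full_causal_attention(X):
    """Recompute attention for every position from scratch (no cache)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide future tokens
    return softmax(scores) @ V


def prefill(X):
    """Process all prompt tokens at once; return outputs and the KV cache."""
    return full_causal_attention(X), (X @ Wk, X @ Wv)


def decode_step(x_new, cache):
    """Process ONE new token: compute its q, k, v and attend over the cache."""
    K, V = cache
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv
    K, V = np.vstack([K, k]), np.vstack([V, v])  # append to the cache
    weights = softmax(q @ K.T / np.sqrt(d))
    return weights @ V, (K, V)


X = rng.normal(size=(6, d))  # 4 "prompt" embeddings + 2 "generated" ones
_, cache = prefill(X[:4])
out5, cache = decode_step(X[4], cache)
out6, cache = decode_step(X[5], cache)

reference = full_causal_attention(X)
print(np.allclose(out5, reference[4]), np.allclose(out6, reference[5]))
```

Notice that `decode_step` does three tiny matrix-vector products for the new token, but has to touch the whole of `K` and `V`. That imbalance is the memory-bound behaviour described next.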
Why it is cheap per token: the math is tiny (one token worth of compute).
Why it can still feel slow: for every new token the GPU has to stream the model weights and the KV cache, gigabytes in total, from memory. The bottleneck is not the math, it is how fast data moves from memory to compute units.
Rule of thumb: decode speed ~ 50-100 tokens per second for a 7B model on one GPU, depending on the hardware and precision. Bigger models are slower.
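A rough way to see where a number like that comes from: if decode is memory-bound, one request cannot generate tokens faster than the GPU can stream the weights plus the KV cache once per step. The sketch below computes that upper bound. The 1 TB/s bandwidth and the 1 GB cache are hypothetical values chosen for illustration, and batching is ignored.

```python
def decode_tokens_per_s(weight_bytes, kv_cache_bytes, mem_bandwidth_bytes_per_s):
    """Upper bound for a single request: each step streams weights + cache once."""
    return mem_bandwidth_bytes_per_s / (weight_bytes + kv_cache_bytes)


weights = 7e9 * 2   # 7B parameters at 2 bytes each (16-bit) = 14 GB
kv = 1e9            # hypothetical: 1 GB of cached K and V
bandwidth = 1e12    # hypothetical: 1 TB/s of GPU memory bandwidth

print(round(decode_tokens_per_s(weights, kv, bandwidth), 1))  # 66.7
```

This also shows why quantized weights and smaller KV caches (MQA/GQA) speed up decode: they shrink the bytes moved per token, not the math.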
Why the two phases matter in production
Every serving optimization targets ONE of the two phases. Mixing them up means you optimize the wrong thing.
| Question | Prefill answer | Decode answer |
|---|---|---|
| Is it fast per token? | Yes, everything at once | No, one at a time |
| Is it heavy on GPU math? | Yes, lots of matrices | Not really, tiny math |
| Is it heavy on GPU memory traffic? | Not so much | YES, reads the cache constantly |
| Does it scale with prompt length? | Yes, roughly linearly | Mostly with output length; each step also reads the cached prompt |
| How do I make it faster? | Cache common prompt prefixes (prompt caching) | Speculative decoding, smarter cache |
Where the time goes
Rough breakdown for a "short answer" query (1K token prompt, 200 tokens output) on a 7B model:
Total time: ~3 seconds
|
+-- Prefill: ~0.3s (10%) <- processes all 1K input tokens
+-- Decode: ~2.7s (90%) <- generates 200 output tokens at ~75 tok/s
Decode dominates. This is why KV cache and vLLM optimizations target decode latency more than prefill.
For a "long context, short answer" query (100K input, 200 output):
Total time: ~15 seconds
|
+-- Prefill: ~12s (80%) <- 100K tokens is a LOT to process
+-- Decode: ~3s (20%)
Here prefill dominates. This is where prompt caching becomes a game changer.
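The flip between the two cases can be reproduced with a two-rate latency model. This is a sketch under assumptions: the 75 tok/s decode rate comes from the first breakdown above, while the 5,000 tok/s prefill rate is a made-up constant. The absolute numbers will not match the breakdowns above exactly, since real prefill and decode rates both shift with context length.

```python
def latency_breakdown(prompt_tokens, output_tokens,
                      prefill_tok_per_s=5_000, decode_tok_per_s=75):
    """Split end-to-end latency into prefill time, decode time, prefill share."""
    prefill_s = prompt_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    return prefill_s, decode_s, prefill_s / (prefill_s + decode_s)


def crossover_prompt_tokens(output_tokens,
                            prefill_tok_per_s=5_000, decode_tok_per_s=75):
    """Prompt length at which prefill time equals decode time."""
    return output_tokens / decode_tok_per_s * prefill_tok_per_s


for prompt in (1_000, 100_000):
    prefill_s, decode_s, share = latency_breakdown(prompt, 200)
    print(f"{prompt:>7} prompt tokens: prefill {prefill_s:.1f}s, "
          f"decode {decode_s:.1f}s, prefill share {share:.0%}")

print("crossover at ~", round(crossover_prompt_tokens(200)), "prompt tokens")
```

Plug in your own measured rates and typical request shape to see which side of the crossover your app lives on.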
Prefill share flips with context length
Short prompts: decode dominates, so optimize with speculative decoding, MQA/GQA, and KV cache throughput. Long prompts: prefill dominates, so optimize with prompt caching and chunked prefill.
The pipeline, end to end
Relevance today (2026)
The prefill / decode framing became widely understood in 2023 and drives every production LLM stack in 2026.
1. Chunked prefill
For very long prompts (context windows of 1M+ tokens, as in Gemini 1.5 Pro), prefill is chunked and interleaved with ongoing decodes of other requests. vLLM and SGLang support this as "chunked prefill" to prevent long prefills from blocking other users.
2. Continuous batching
Old batching waited for all requests in a batch to finish before starting new ones. Continuous batching (introduced by Orca in 2022, popularized by vLLM in 2023, now standard) adds new requests into the running batch as slots free up. This lets you serve many more requests on the same GPU.
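A toy simulation makes the difference visible. It counts decode steps only (prefill cost is ignored), treats every request as "N tokens to generate", and uses invented output lengths. It is a sketch of the scheduling idea, not of how vLLM or Orca implement it.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot and a waiting request takes it."""
    waiting = list(output_lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [10, 200, 15, 180, 20, 12, 190, 8]  # made-up output lengths
print(static_batching_steps(lengths, 4))      # 390
print(continuous_batching_steps(lengths, 4))  # 217
```

With static batching the short requests sit idle in their slots while the 200-token one finishes. With continuous batching those slots are reused immediately.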
3. Prompt caching
In 2024, OpenAI and Anthropic launched prompt caching: the system keeps the KV cache of a prompt prefix across requests. If you send the same 20K system prompt 100 times, you pay the full prefill once, as long as the cached entry has not expired. See prompt caching.
Huyen mentions this briefly. In 2026 it is table stakes for any agent system.
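Here is a minimal sketch of the bookkeeping behind that saving. `ToyPrefixCache` is a hypothetical class written for this page, not any provider's API: it only counts how many tokens would need a real prefill, and it ignores expiry, partial prefix matches, and pricing.

```python
import hashlib


class ToyPrefixCache:
    """Counts prefill work when identical prompt prefixes are reused."""

    def __init__(self):
        self._seen = set()          # hashes of prefixes whose KV is "stored"
        self.prefilled_tokens = 0   # tokens that needed a real prefill

    def prefill(self, prefix_tokens, suffix_tokens):
        key = hashlib.sha256("\x1f".join(prefix_tokens).encode()).hexdigest()
        if key not in self._seen:   # cache miss: pay for the prefix once
            self._seen.add(key)
            self.prefilled_tokens += len(prefix_tokens)
        self.prefilled_tokens += len(suffix_tokens)  # new part is always paid


system_prompt = ["sys"] * 20_000   # stand-in for a 20K-token system prompt
user_message = ["hi"] * 50         # stand-in for a short user turn

cache = ToyPrefixCache()
for _ in range(100):
    cache.prefill(system_prompt, user_message)

without_cache = 100 * (len(system_prompt) + len(user_message))
print(cache.prefilled_tokens, without_cache)  # 25000 2005000
```

The longer the shared prefix relative to the per-request part, the bigger the saving, which is why agent systems with large fixed system prompts benefit most.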
4. Speculative decoding
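A small draft model generates K tokens ahead, and the big model verifies them in a single parallel pass. When the draft is right (often), you get up to K+1 tokens per big-model step. Reported speedups are typically 2-3x on decode, with no quality loss because every accepted token is checked by the big model. vLLM supports it; Medusa and EAGLE are variants that draft with lightweight heads attached to the big model instead of a separate draft model.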
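The draft-then-verify loop can be sketched with two stand-in "models". Both `target_next` and `draft_next` are hypothetical deterministic functions invented for this example, and this is the greedy variant only (sampling-based versions use a rejection-sampling acceptance rule instead of an equality check).

```python
def target_next(ctx):
    """Hypothetical 'big model': next token is a fixed function of the context."""
    return (sum(ctx) * 31 + len(ctx)) % 10


def draft_next(ctx):
    """Hypothetical 'small model': agrees with the target most of the time."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 10


def greedy_decode(prompt, n):
    out = list(prompt)
    for _ in range(n):               # one big-model pass per token
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n, k=3):
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < n:
        draft = []
        for _ in range(k):           # 1. draft proposes k tokens (cheap)
            draft.append(draft_next(out + draft))
        target_passes += 1           # 2. one big-model pass checks k+1 positions
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        accepted = 0                 # 3. keep the longest matching prefix
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1
        out += draft[:accepted] + [verified[accepted]]
    return out[len(prompt):][:n], target_passes


baseline = greedy_decode([1, 2, 3], 20)
tokens, passes = speculative_decode([1, 2, 3], 20)
print(tokens == baseline, passes)
```

The output is identical to plain greedy decoding, but the big model runs far fewer than 20 passes. On a real GPU the verification pass in step 2 is a single batched forward pass, which is where the speedup comes from.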
5. Reasoning models change the accounting
OpenAI o1, Claude Opus 4.5 thinking, and DeepSeek R1 generate LOTS of thinking tokens before the visible answer. Decode dominates even more. Latency per response can be 10-60 seconds, not 2-3.
For these, test-time compute scales inference cost. See test time compute.
Critical questions
- Why can prefill parallelize N input tokens but decode cannot parallelize output tokens within one request?
- If prefill is compute-bound and decode is memory-bound, do they need different hardware? (Often, yes. Some production stacks run separate prefill and decode pools.)
- Why is prompt caching a bigger deal for long contexts than short? (Prefill cost grows at least linearly with prompt length.)
- How does speculative decoding fit into this pipeline? (Decode side, with a twist.)
- For a chat app serving 1000 users, are you prefill-limited or decode-limited? (Usually decode at scale.)
Production pitfalls
- Mixing up the "pause" with the "streaming speed". Two different metrics. Time-to-first-token (TTFT = pause before anything appears) is mostly prefill. Time-between-tokens (TBT = how fast words stream after) is decode. Teams often optimize one and complain nothing is faster, because the other was the bottleneck.
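- Running out of KV cache room. Decode reads the cache constantly, and the cache grows with every token of every active request. If GPU memory runs out, requests get queued or preempted and latency spikes. vLLM's PagedAttention manages this memory in fixed-size blocks to reduce waste. See KV cache.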
- Letting long prompts block everyone at peak. One user sends a 500K-token prompt during peak traffic, and prefill grabs all the GPU for seconds. Other users' decodes stall. Fix: chunked prefill (interleave with decodes).
- Confusing latency and throughput. Batching many requests = slower per user, but more users served per second. Which matters for your app? Chat apps care about latency. Batch jobs care about throughput. Know the difference.
- Forgetting reasoning models are mostly decode. A reasoning model (o1, R1, Opus 4.5 thinking) generates thousands of invisible "thinking" tokens before the visible answer. Your cost and latency estimates must account for them.
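For the KV cache pitfall above, it helps to know how big the cache actually gets. The formula below is the standard one (K and V, per layer, per KV head, per token). The model shape in the example, 32 layers with 32 KV heads of dimension 128 in 16-bit precision, is an assumed 7B-class configuration used for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Size of the KV cache for one sequence: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


per_token = kv_cache_bytes(32, 32, 128, n_tokens=1)
full_ctx = kv_cache_bytes(32, 32, 128, n_tokens=4096)
gqa_ctx = kv_cache_bytes(32, 8, 128, n_tokens=4096)  # GQA with 8 KV heads

print(per_token / 2**20, "MiB per token")      # 0.5
print(full_ctx / 2**30, "GiB for 4096 tokens") # 2.0
print(gqa_ctx / 2**30, "GiB with 8 KV heads")  # 0.5
```

Multiply the per-sequence figure by the number of concurrent requests to see how quickly a serving GPU fills up, and why MQA/GQA matter for decode.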
Mental parallels (non-AI)
- Book vs exam. Prefill = reading the whole book (parallel, heavy). Decode = taking the exam (sequential, answer one question at a time, look things up in your notes). The "notes" are the KV cache.
- Cooking a multi-course meal. Mise-en-place (prefill) = chop all vegetables, prepare sauces in parallel. Service (decode) = plate each dish sequentially, pulling from prepped mise-en-place.
- Compilation vs execution. Compilation = analyze and optimize everything in bulk (slow but parallel). Execution = run instructions one at a time (fast per step, sequential).
- A lawyer reviewing a contract. Prefill = read the whole contract, flag every clause. Decode = answer questions one by one, referring to flagged clauses. Flagged clauses = KV cache.
Mini-lab
labs/measure-prefill-decode/ (to create) - measure prefill vs decode times on a local model:
- Load a small model (Phi-3 mini or Llama 3.2 3B) via llama-cpp-python or transformers
- Send a short prompt (10 tokens) and generate 200 tokens. Measure total time, split prefill (first token) vs decode (next 199).
- Send a long prompt (10K tokens) and generate 200 tokens. Measure the same split.
- Plot prefill time vs prompt length.
- Plot decode time per token vs generation position.
Goal: see with your own eyes that prefill scales with prompt length, decode scales with output length. Feel where your app's bottleneck actually is.
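A possible starting point for the lab's timing code. `measure_stream` works on any Python iterator that yields tokens, so it is independent of the library you pick. `fake_model` is a stand-in that just sleeps, with made-up delays, so the harness can be tested before a real model is wired in.

```python
import time


def measure_stream(token_iter):
    """Return (ttft_s, mean_tbt_s, n_tokens) for an iterator that yields tokens."""
    start = time.perf_counter()
    stamps = []
    for _ in token_iter:
        stamps.append(time.perf_counter())
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start                       # time to first token
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_tbt, len(stamps)


def fake_model(prefill_s=0.05, per_token_s=0.01, n_tokens=5):
    """Stand-in generator: sleep to imitate prefill, then stream tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


ttft, tbt, n = measure_stream(fake_model())
print(f"TTFT {ttft * 1000:.0f} ms, mean TBT {tbt * 1000:.0f} ms, {n} tokens")
```

For the real lab, replace `fake_model()` with the streaming iterator of whichever library you loaded the model with, then repeat the measurement across prompt lengths to get the two plots.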
Further reading
Canonical explanations
- Karpathy, "Let's build GPT from scratch" (YouTube, 2023): https://www.youtube.com/watch?v=kCc8FmEb1nY
- Daily Dose DS, "KV Caching in LLMs Explained Visually": https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/
- Jay Alammar, "The Illustrated GPT-2": https://jalammar.github.io/illustrated-gpt2/
Production LLM serving
- vLLM paper, "Efficient Memory Management for Large Language Model Serving with PagedAttention" (Kwon et al., 2023): https://arxiv.org/abs/2309.06180
- SGLang paper, "Efficiently Programming Large Language Models using SGLang" (Zheng et al., 2024): https://arxiv.org/abs/2312.07104
- Orca / Continuous Batching (Yu et al., 2022): https://www.usenix.org/conference/osdi22/presentation/yu
Speculative decoding
- Medusa (Cai et al., 2024): https://arxiv.org/abs/2401.10774
- EAGLE (Li et al., 2024): https://arxiv.org/abs/2401.15077
For the 2026 context
- Huyen, "AI Engineering", Chapter 9 - inference optimization (O'Reilly 2025, paid): https://www.oreilly.com/library/view/ai-engineering/9781098166298/
- Anthropic prompt caching docs: https://docs.anthropic.com/claude/docs/prompt-caching
- OpenAI prompt caching announcement (Oct 2024): https://openai.com/index/api-prompt-caching/