01·LLMs·updated 2026-04-22

LLM Inference Pipeline (Prefill vs Decode)



Watch or read first (recommended order)

  1. Karpathy - "Let's build GPT from scratch" (YouTube, 2h). Best for code-level understanding. Jump to 1:20:00 for generation.
  2. Daily Dose DS - "KV Caching in LLMs Explained Visually" (blog). Excellent visuals of prefill; explains why ChatGPT takes longer to generate the first token than the subsequent tokens.
  3. vLLM blog - "Efficient Memory Management for LLM Serving with PagedAttention". Production-level view.

TL;DR

Every LLM inference request has two phases with very different economics:

  • Prefill: process the entire input prompt in parallel. Heavy compute, fast per token.
  • Decode: generate output tokens one at a time. Light compute per token, but bound by memory access.

Understanding this split is the foundation for understanding [kv cache](/kb/01-llms/kv-cache), [vllm](/kb/07-llm-optimization/vllm), and why ChatGPT has that "thinking..." pause before streaming.

The historical problem

Naive sequence generation with a transformer looks deceptively simple: feed prompt in, generate next token, append, repeat.

But this hides a performance cliff. If you recompute attention from scratch over all past tokens for every generated token, you do O(N^2) work per step and O(N^3) work in total for a response of length N. A 500-token response would take minutes.

The prefill / decode separation was the key insight that made real-time LLM chat possible. It also set up the KV cache.
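The cliff is easy to make concrete with a toy op count. A pure-Python sketch that counts attention score computations only (the absolute numbers are invented; only the growth rates matter):

```python
def naive_attention_ops(prompt_len: int, gen_len: int) -> int:
    """Recompute attention from scratch at every decode step: O(T^2) per step."""
    total = 0
    for t in range(prompt_len, prompt_len + gen_len):
        total += t * t  # full T x T attention matrix rebuilt each step
    return total

def cached_attention_ops(prompt_len: int, gen_len: int) -> int:
    """With a KV cache: one prefill pass, then one new row per decode step."""
    total = prompt_len * prompt_len  # prefill attends all-pairs once
    for t in range(prompt_len, prompt_len + gen_len):
        total += t  # the new query attends over t cached keys
    return total

naive = naive_attention_ops(100, 500)
cached = cached_attention_ops(100, 500)
print(naive // cached)  # hundreds of times less attention work with the cache
```

At a 100-token prompt and a 500-token response the cached path does a few hundred times less attention work, and the gap keeps widening with length.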

How it works

The two phases, side by side

Input Prompt: "Once upon a time"                  (4 tokens)
                    |
                    v
          +---------------------+
          |      PREFILL        |  (processes all 4 tokens in PARALLEL)
          +---------------------+
          | Compute K1, V1      |
          | Compute K2, V2      |
          | Compute K3, V3      |
          | Compute K4, V4      |
          | Store in KV cache   |
          | Output first token  |  <- "there"
          +---------------------+
                    |
                    v (first token, now append)
          +---------------------+
          |       DECODE        |  (one token at a time, SEQUENTIAL)
          +---------------------+
          | Read KV cache       |
          | Compute Q5, K5, V5  |  <- just for the new token
          | Attention           |
          | Output next token   |  <- "was"
          | Append K5, V5 to    |
          |   cache             |
          +---------------------+
                    |
                    v (repeat for each new token)
              "a", "brave", "hero", "."

Prefill: compute-bound

During prefill, the model processes all input tokens simultaneously. This is heavy matmul (matrix multiplication) work. The GPU's arithmetic units are the bottleneck.

Key characteristics:

  • Parallelizable: N tokens processed at once
  • Compute-bound: bottleneck is FLOPs
  • Duration: roughly proportional to prompt length
  • First-token latency (TTFT): mostly prefill time

For a 10K token prompt on a modern GPU, prefill can take 1-3 seconds. This is the "ChatGPT is thinking..." pause before streaming starts.

Decode: memory-bound

During decode, the model generates ONE token at a time. For each new token:

  1. Read the entire KV cache from memory (potentially gigabytes)
  2. Compute Q, K, V for the SINGLE new token (cheap)
  3. Attention: Q dot-product against cached K, then weighted sum of cached V
  4. Run through MLP, output logit vector
  5. Sample next token
  6. Append the new K, V to cache

Key characteristics:

  • Sequential: token N+1 needs token N (cannot parallelize within a request)
  • Memory-bound: bottleneck is reading the KV cache, not computing with it
  • Duration per token: mostly constant (small compute, fixed memory read)
  • Time between tokens (TBT): the streaming rate you see in ChatGPT

For a 7B model on one H100 GPU, you can decode at ~50-100 tokens/second. The limit is how fast you can stream the KV cache from GPU memory to the compute units.
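A back-of-envelope roofline makes the 50-100 tokens/second figure plausible. The bandwidth and size numbers below are rough public figures, not measurements:

```python
# At batch size 1, each decode step must stream the model weights plus the
# KV cache through the memory system, so tokens/sec <= bandwidth / bytes_read.

def decode_ceiling_tok_s(params_b: float, bytes_per_param: int,
                         kv_cache_gb: float, hbm_bw_tb_s: float) -> float:
    weight_bytes = params_b * 1e9 * bytes_per_param
    kv_bytes = kv_cache_gb * 1e9
    return hbm_bw_tb_s * 1e12 / (weight_bytes + kv_bytes)

# 7B model, fp16 weights (2 bytes/param), 2 GB of KV cache, ~3.3 TB/s HBM3
ceiling = decode_ceiling_tok_s(7, 2, 2.0, 3.3)
print(round(ceiling))  # ~200 tok/s theoretical ceiling
```

The theoretical ceiling comes out around 200 tokens/second; real systems land at 50-100 because kernels never achieve peak bandwidth and there is per-step overhead.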

Why the split matters

This asymmetry drives every optimization in LLM serving:

Concern             | Prefill strategy                   | Decode strategy
--------------------+------------------------------------+--------------------------------
Maximize throughput | Batch many requests                | Continuous batching
Reduce latency      | Cache key prefixes (prompt caching)| Speculative decoding
Memory              | Less critical (transient)          | CRITICAL (KV cache dominates)
Compute             | Heavy (matmul)                     | Light (one token)
Parallelism         | High (across tokens)               | Low (one at a time)

Where the time goes

Rough breakdown for a "short answer" query (1K token prompt, 200 tokens output) on a 7B model:

Total time: ~3 seconds
|
+-- Prefill:  ~0.3s  (10%)   <- processes all 1K input tokens
+-- Decode:   ~2.7s  (90%)   <- generates 200 output tokens at ~75 tok/s

Decode dominates. This is why KV cache and vllm optimizations target decode latency more than prefill.

For a "long context, short answer" query (100K input, 200 output):

Total time: ~15 seconds
|
+-- Prefill:  ~12s   (80%)   <- 100K tokens is a LOT to process
+-- Decode:   ~3s    (20%)

Here prefill dominates. This is where prompt caching becomes a game changer.
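Both breakdowns fall out of a two-parameter model. The throughput numbers are assumptions chosen to match the rough figures above; effective prefill throughput varies with prompt length (short prefills underutilize the GPU, very long ones pay quadratic attention cost), which is why a different rate is passed per case:

```python
def latency_split(prompt_tokens: int, output_tokens: int,
                  prefill_tok_s: float, decode_tok_s: float = 75.0):
    """Return (prefill seconds, decode seconds, prefill share of total)."""
    prefill_s = prompt_tokens / prefill_tok_s
    decode_s = output_tokens / decode_tok_s
    return prefill_s, decode_s, prefill_s / (prefill_s + decode_s)

# short-answer query: decode dominates
p, d, share = latency_split(1_000, 200, prefill_tok_s=3500)
print(f"prefill {p:.1f}s, decode {d:.1f}s, prefill share {share:.0%}")

# long-context query: prefill dominates
p, d, share = latency_split(100_000, 200, prefill_tok_s=8300)
print(f"prefill {p:.1f}s, decode {d:.1f}s, prefill share {share:.0%}")
```

Plugging in your own measured rates (see the mini-lab below) tells you which phase to optimize first.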

Relevance today (2026)

The prefill / decode framing became widely understood in 2023 and drives every production LLM stack in 2026.

1. Chunked prefill

For very long prompts (1M+ context in Gemini 2 Ultra, Claude Opus 4.5), prefill is chunked and interleaved with ongoing decodes of other requests. vLLM and SGLang support this as "chunked prefill" to prevent long prefills from blocking other users.

2. Continuous batching

Old batching waited for all requests in a batch to finish before starting new ones. Continuous batching (vLLM, 2023; now standard) admits new requests into the decode batch as slots free up, which lets you serve many more requests on the same GPU.
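The difference shows up in a few lines of simulation. A toy step-count comparison (the request lengths and batch size are made up; one "step" is one decode iteration for the whole batch):

```python
import random

random.seed(1)
lengths = [random.randint(50, 400) for _ in range(8)]  # tokens each request needs

def static_batch_steps(lengths, batch=4):
    """Old style: the whole batch waits for its LONGEST member to finish."""
    return sum(max(lengths[i:i + batch]) for i in range(0, len(lengths), batch))

def continuous_batch_steps(lengths, batch=4):
    """Continuous batching: a freed slot immediately admits the next request."""
    slots, pending, steps = list(lengths[:batch]), list(lengths[batch:]), 0
    while slots:
        n = min(slots)  # run until the shortest active request finishes
        steps += n
        slots = [s - n for s in slots if s > n]
        while pending and len(slots) < batch:
            slots.append(pending.pop(0))
    return steps

print(static_batch_steps(lengths), "vs", continuous_batch_steps(lengths))
```

With mixed request lengths the continuous scheduler finishes the same work in fewer steps, because short requests no longer leave their slot idle while a long neighbor decodes.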

3. Prompt caching

In 2024, OpenAI and Anthropic launched prompt caching: the system remembers the KV cache of a prompt prefix across requests. If you send the same 20K system prompt 100 times, you pay prefill once. See prompt caching.

Huyen mentions this briefly. In 2026 it is table stakes for any agent system.
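A sketch of the accounting, with a dictionary standing in for the real block-level KV cache (production systems key on hashes of token blocks; the names and costs here are hypothetical):

```python
import hashlib

PREFILL_COST_PER_TOKEN = 1.0
cache: dict[str, object] = {}  # prefix hash -> (pretend) KV cache

def serve(system_prompt: str, user_msg: str) -> float:
    """Return the prefill cost 'paid' for this request, in arbitrary units."""
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    cost = 0.0
    if key not in cache:
        cache[key] = f"kv-cache-for-{key[:8]}"  # stand-in for real K/V tensors
        cost += len(system_prompt.split()) * PREFILL_COST_PER_TOKEN
    cost += len(user_msg.split()) * PREFILL_COST_PER_TOKEN  # suffix is always prefilled
    return cost

big_system_prompt = "you are a helpful agent " * 1000  # ~5000 words of preamble
first = serve(big_system_prompt, "hello there")
repeat = serve(big_system_prompt, "another question")
print(first, repeat)  # the repeat request pays only for its own short suffix
```

The key property: the cache hit must cover an exact prefix, which is why stable system prompts and tool definitions belong at the front of the prompt.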

4. Speculative decoding

A small draft model generates K tokens ahead, the big model verifies them in parallel. If the draft is right (often), you get K tokens per big-model step. 2-3x speedup on decode, no quality loss. Implemented in vLLM, Medusa, EAGLE.
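A toy version of the draft/verify loop, with stub functions standing in for both models (the 80% agreement rate and the token arithmetic are invented; a real implementation verifies all K drafts with one batched forward pass of the target model):

```python
import random

random.seed(0)

def target_next(ctx):  # the big model: one expensive call per token...
    return (sum(ctx) * 31 + len(ctx)) % 100

def draft_next(ctx):   # ...vs a cheap draft that agrees with it ~80% of the time
    return target_next(ctx) if random.random() < 0.8 else (target_next(ctx) + 1) % 100

def speculative_generate(ctx, n_tokens, k=4):
    """Draft K tokens, verify them in ONE target pass, keep the longest
    agreeing prefix plus the target's own token at the first mismatch."""
    big_model_steps = 0
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        drafts, tmp = [], list(out)
        for _ in range(k):
            t = draft_next(tmp)
            drafts.append(t)
            tmp.append(t)
        big_model_steps += 1  # one verification pass covers all K drafts
        accepted, tmp = [], list(out)
        for t in drafts:
            if target_next(tmp) == t:
                accepted.append(t)
                tmp.append(t)
            else:
                accepted.append(target_next(tmp))  # target's token replaces the miss
                break
        else:
            accepted.append(target_next(tmp))  # bonus token when all K are accepted
        out.extend(accepted)
    return out[len(ctx):][:n_tokens], big_model_steps

tokens, steps = speculative_generate([1, 2, 3], 40)
print(len(tokens), steps)  # 40 tokens in far fewer than 40 big-model passes
```

Every loop iteration yields at least one guaranteed-correct token (the target's own), so output quality is unchanged; the speedup comes from the iterations where several drafts are accepted at once.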

5. Reasoning models change the accounting

OpenAI o1, Claude Opus 4.5 thinking, DeepSeek R1 generate LOTS of thinking tokens before the visible answer. Decode dominates even more. Latency per response is 10-60 seconds, not 2-3.

For these, test-time compute scales inference cost. See test time compute.

Critical questions

  • Why can prefill parallelize N input tokens but decode cannot parallelize output tokens within one request?
  • If prefill is compute-bound and decode is memory-bound, do they need different hardware? (Yes! Some production stacks even consider separate pools.)
  • Why is prompt caching a bigger deal for long contexts than short? (Prefill cost grows linearly with prompt length.)
  • How does speculative decoding fit into this pipeline? (Decode side, with a twist.)
  • For a chat app serving 1000 users, are you prefill-limited or decode-limited? (Usually decode at scale.)

Production pitfalls

  • Confusing TTFT with TBT. Time-to-first-token (TTFT) = prefill + first decode. Time-between-tokens (TBT) = decode steady state. Optimizing the wrong one leads nowhere.
  • Under-provisioning KV cache memory. Decode reads the cache constantly. If you run out, the system thrashes. vLLM's PagedAttention manages this.
  • Prefill saturating GPU at peak. Long prompts during peak traffic can starve other users' decodes. Use chunked prefill.
  • Mistaking latency for throughput. A batched decode is slower PER REQUEST but higher THROUGHPUT. Choose what matters for your app.
  • Ignoring reasoning model economics. A reasoning model spends 90% of its time decoding thinking tokens. Your cost estimate must account for those.

Mental parallels (non-AI)

  • Book vs exam. Prefill = reading the whole book (parallel, heavy). Decode = taking the exam (sequential, answer one question at a time, look things up in your notes). The "notes" are the KV cache.
  • Cooking a multi-course meal. Mise-en-place (prefill) = chop all vegetables, prepare sauces in parallel. Service (decode) = plate each dish sequentially, pulling from prepped mise-en-place.
  • Compilation vs execution. Compilation = analyze and optimize everything in bulk (slow but parallel). Execution = run instructions one at a time (fast per step, sequential).
  • A lawyer reviewing a contract. Prefill = read the whole contract, flag every clause. Decode = answer questions one by one, referring to flagged clauses. Flagged clauses = KV cache.

Mini-lab

labs/measure-prefill-decode/ (to create) - measure prefill vs decode times on a local model:

  1. Load a small model (Phi-3 mini or Llama 3.2 3B) via llama-cpp-python or transformers
  2. Send a short prompt (10 tokens) and generate 200 tokens. Measure total time, split prefill (first token) vs decode (next 199).
  3. Send a long prompt (10K tokens) and generate 200 tokens. Measure the same split.
  4. Plot prefill time vs prompt length.
  5. Plot decode time per token vs generation position.

Goal: see with your own eyes that prefill scales with prompt length, decode scales with output length. Feel where your app's bottleneck actually is.
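A sketch of the measurement harness for steps 2-3. The stub generator stands in for a real backend so the split logic can be exercised without downloading a model; with llama-cpp-python or a transformers streamer you would pass the real token stream instead:

```python
import time

def measure(stream):
    """Split a token stream into TTFT (prefill + first decode step) and
    mean TBT (steady-state decode). Any backend that yields tokens fits."""
    start = time.perf_counter()
    stamps = []
    for _ in stream:
        stamps.append(time.perf_counter())
    ttft = stamps[0] - start
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    return ttft, sum(gaps) / len(gaps)

def fake_model(prompt_tokens: int, out_tokens: int):
    """Stub: prefill time proportional to prompt length, constant decode step."""
    time.sleep(prompt_tokens * 0.0001)  # pretend prefill
    for i in range(out_tokens):
        time.sleep(0.002)               # pretend decode step
        yield i

ttft, tbt = measure(fake_model(prompt_tokens=1000, out_tokens=50))
print(f"TTFT {ttft * 1000:.0f} ms, TBT {tbt * 1000:.1f} ms")
```

Swapping in a real model and varying `prompt_tokens` reproduces the plots in steps 4-5: TTFT grows with prompt length while TBT stays roughly flat.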


inference · prefill · decode · pipeline · autoregressive · fundamentals