KV Cache
60-second version
- What: store the Key (K) and Value (V) vectors of past tokens, so the model does not recompute them at every new token.
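- Why: without the cache, generating 500 tokens means computing about 125,000 K/V pairs (1 + 2 + ... + 500), because every step redoes all the earlier ones. With the cache, it is 500, one per token. About 250x less of that work. Real-world (Daily Dose DS benchmark): 4.5x faster decoding.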
- Cost: gigabytes of GPU memory. The cache grows linearly with context length AND with the number of concurrent users.
If you remember nothing else, remember those three lines. The rest of this notion is detail.
Watch or read first (prerequisites)
This notion is dense. For a smoother path, review in this order:
- 3Blue1Brown, "Attention in transformers, visually explained" (YouTube, 26 min): https://www.youtube.com/watch?v=eMlx5fFNoYc - if you do not feel comfortable with what Q, K, V are, stop and watch this first.
- Daily Dose DS, "KV Caching in LLMs Explained Visually" (short blog): https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/ - the clearest visual walkthrough.
- Karpathy, "Let's build GPT from scratch" (YouTube, 2h): https://www.youtube.com/watch?v=kCc8FmEb1nY - he naturally shows where the cache fits.
Prerequisites in this KB:
- attention mechanism - understand Q, K, V
- inference pipeline - understand prefill vs decode
With those in hand, this notion should click.
TL;DR
The KV cache stores the Key (K) and Value (V) vectors from past tokens so they are not recomputed for every new token generated. It turns the K/V computation from quadratic work (O(N^2): doubling response length means 4x the compute) into linear work (O(N): doubling response length just doubles the compute). Each step still reads the whole cache once, so the attention read itself keeps growing with context. In practice, the cited benchmark shows a 4.5x speedup on decoding, at the cost of gigabytes of GPU memory.
The historical problem
In a transformer, generating token N+1 requires attention between the new token's Query and every past token's Key and Value.
Without caching, you would recompute ALL past K and V vectors at every decoding step:
- Step 1 (generate token 2): compute K_1, V_1
- Step 2 (generate token 3): compute K_1, V_1, K_2, V_2
- Step 3 (generate token 4): compute K_1, V_1, K_2, V_2, K_3, V_3
- Step N: compute all N previous K and V again
In math terms, that is O(N^2) work, which means: when N doubles, the work is multiplied by 4. For a 500-token response, that is 1 + 2 + ... + 500 = 125,250 K/V computations, almost all of them redundant. On modest hardware, that can take minutes per response.
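The counts above can be reproduced in a few lines. This is a minimal sketch that only counts how many (K, V) pairs get computed, following the step list above. The function names are illustrative, and the cost of the attention read itself is ignored.

```python
def kv_computations_without_cache(num_steps: int) -> int:
    """Step t recomputes K and V for all t tokens seen so far."""
    return sum(range(1, num_steps + 1))


def kv_computations_with_cache(num_steps: int) -> int:
    """Step t computes K and V only for the one new token."""
    return num_steps


for n in (250, 500, 1000):
    naive = kv_computations_without_cache(n)
    cached = kv_computations_with_cache(n)
    print(f"N={n}: without cache {naive}, with cache {cached}, ratio {naive / cached:.1f}x")
```

Going from N=500 to N=1000 takes the uncached count from 125,250 to 500,500 (about 4x), while the cached count only doubles.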
Quick refresher on Big-O notation (ignore if you know it). O(N) = linear (double input, double work). O(N^2) = quadratic (double input, 4x work). O(N^3) = cubic (double input, 8x work). Smaller exponent = scales better as N grows.
The KV cache fixes this with a one-line insight: K_i and V_i do not change once token i is placed in the sequence.
The key insight (inspired by Daily Dose DS)
Step 1: What you actually need to generate token N+1
You need only the attention output for the LAST token's query. The attention matrix looks like this during generation:
         K1    K2    K3    K4   K_last
      +-----+-----+-----+-----+-----+
Q1    |  .  |     |     |     |     |
      +-----+-----+-----+-----+-----+
Q2    |  .  |  .  |     |     |     |
      +-----+-----+-----+-----+-----+
Q3    |  .  |  .  |  .  |     |     |
      +-----+-----+-----+-----+-----+
Q4    |  .  |  .  |  .  |  .  |     |
      +-----+-----+-----+-----+-----+
Qlast |  X  |  X  |  X  |  X  |  X  | <- ONLY THIS ROW MATTERS
      +-----+-----+-----+-----+-----+
Huge realization: you compute an entire matrix just to use its LAST ROW.
Step 2: What the last row needs
The last row uses:
- Q_last: the query vector of the new token (new, changes every step)
- K1, K2, ..., K_last: keys of ALL tokens so far (the old ones were already computed in earlier steps, only K_last is new)
- V1, V2, ..., V_last: values of ALL tokens so far (same split: old ones already computed, only V_last is new)
Step 3: The revelation
For an OLD token i, is K_i going to change when token N+1 is generated?
Recall from attention mechanism:
K_i = x_i * W_K
V_i = x_i * W_V
Where:
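- x_i = the input vector for token i (its embedding at the first layer, its hidden state at deeper layers). With causal attention it depends only on tokens 1..i, so it is fixed once the token is in the sequence.
- W_K, W_V = the model's weights (fixed at inference time)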
Both inputs are frozen. So K_i and V_i are deterministic AND immutable for i < last.
What about Q? Q_i = x_i * W_Q. Q is also computed from the same input, but here is the subtle point: when generating token N+1, we ONLY use Q_last (the query of the new token). We do NOT need Q_1, Q_2, ..., Q_(N-1) anymore. Their queries were used in earlier steps and discarded.
So:
- K, V for past tokens: computed once, reused many times -> CACHE THEM
- Q for past tokens: used once, never needed again -> DO NOT CACHE
- K, V, Q for the NEW token: computed fresh each step
Step 4: The caching strategy
For each new token generation:
1. Append new token to sequence
2. Compute Q_new, K_new, V_new from the new token's embedding
3. Append K_new, V_new to the KV cache (cache grows by one token)
4. Attention = softmax(Q_new * cache.K^T / sqrt(d)) * cache.V
5. Output logit vector, sample next token
6. Repeat from step 1 with the newly generated token
No re-computation of past K, V. Just one new K and V per step, plus the attention read against the cache.
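The loop above can be sketched in plain Python to check that reading from the cache gives the same attention output as recomputing everything. This is a toy under stated assumptions: one layer, one head, random weights, and random vectors standing in for the token inputs. The names (`KVCache`, `decode_step`, `step_without_cache`) are illustrative, not taken from any library.

```python
import math
import random


def project(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    """softmax(q . K^T / sqrt(d)) . V for a single query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x_new, W_q, W_k, W_v, cache):
    """Steps 2-4 of the list above: one new Q, K, V, then read the cache."""
    q_new = project(x_new, W_q)
    cache.append(project(x_new, W_k), project(x_new, W_v))
    return attend(q_new, cache.keys, cache.values)


def step_without_cache(xs, W_q, W_k, W_v):
    """Reference: recompute every K and V from scratch, use only the last query."""
    keys = [project(x, W_k) for x in xs]
    values = [project(x, W_v) for x in xs]
    return attend(project(xs[-1], W_q), keys, values)


rng = random.Random(0)
d = 4
W_q, W_k, W_v = ([[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))
xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(6)]  # stand-in token inputs

cache = KVCache()
max_diff = 0.0
for t in range(len(xs)):
    cached_out = decode_step(xs[t], W_q, W_k, W_v, cache)
    naive_out = step_without_cache(xs[: t + 1], W_q, W_k, W_v)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached_out, naive_out)))

print(f"cache holds {len(cache.keys)} tokens, max difference vs recompute: {max_diff}")
```

In a real model there is one such cache per layer and per head, and the inputs at deeper layers are hidden states rather than raw embeddings. The bookkeeping is the same.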
Per-token flow during decode
What happens at every new-token generation step:
For each new token, you compute ONE new Q, K, V triple (just for that token). You reuse the N previously cached K and V vectors. The cache grows by one token per step.
Benchmark (Daily Dose DS, concrete numbers)
For a 500-token response on a typical setup:
| Configuration | Time |
|---|---|
| Without KV cache | 40 seconds |
| With KV cache | 9 seconds |
| Speedup | 4.5x |
This is a floor, not a ceiling. For long responses (2000+ tokens), the gap widens dramatically.
The cost: memory
The speedup is not free. Every cached token takes memory in VRAM.
Formula
Per layer, per token, the cache stores K and V. In standard multi-head attention:
cache_size_per_token = 2 * hidden_dim * num_layers * bytes_per_value
Where 2 accounts for both K and V (some sources include a factor for the batch).
Llama 3 70B example (Daily Dose DS)
- Layers: 80
- Hidden size: 8192
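- Per token: ~2.5 MB in the KV cache (2 * 8192 * 80 * 2 bytes at FP16, using the multi-head formula above)
- At a context of 4K tokens: about 10.5 GB just for the cache
On an 80GB H100, the cache eats 13% of VRAM for ONE request at that context length. At 10 parallel requests, the caches alone would need more memory than the GPU has. Note that the released Llama 3 70B uses GQA with 8 KV heads instead of 64, which makes its real cache about 8x smaller. The numbers above are the multi-head upper bound and are here to illustrate the scaling.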
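The arithmetic behind those numbers, as a small sketch. It applies the formula above as written, and assumes 2 bytes per value (FP16) and 4K = 4,096 tokens, neither of which the example states explicitly. The function name is illustrative.

```python
def kv_cache_bytes_per_token(hidden_dim: int, num_layers: int, bytes_per_value: int) -> int:
    # The factor 2 accounts for storing both K and V.
    return 2 * hidden_dim * num_layers * bytes_per_value


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed: FP16 values (2 bytes) and a 4,096-token context.
per_token = kv_cache_bytes_per_token(hidden_dim=8192, num_layers=80, bytes_per_value=2)
per_request = per_token * 4096

print(f"per token: {per_token / MIB:.2f} MiB")
print(f"one 4K-token request: {per_request / GIB:.2f} GiB")
print(f"share of an 80 GB card: {per_request / 80e9:.1%}")
```

This gives 2.5 MiB per token and 10 GiB (about 10.7 GB) per 4,096-token request, close to the ~10.5 GB quoted above.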
This is why 2026 production LLM serving obsesses about KV cache memory.
Memory growth visualized
Linear growth. ~2.5 MB per token. At 4K tokens of context (the config above), the cache holds ~10.5 GB. Every concurrent request multiplies that.
The cache lifecycle (mental model)
Input Prompt: "Once upon a time" (4 tokens)
              |
              v
+---------------------------+
|    PREFILL (parallel)     |
+---------------------------+
Cache: [ K1 | K2 | K3 | K4 ]
       [ V1 | V2 | V3 | V4 ]
              |
              v
Generate token 5: "there"
  Compute Q5, K5, V5
  Append K5, V5 to cache
              |
              v
Cache: [ K1 | K2 | K3 | K4 | K5 ]
       [ V1 | V2 | V3 | V4 | V5 ]
              |
              v
Generate token 6: "was"
  Compute Q6, K6, V6
  Append K6, V6 to cache
              |
              v
Cache: [ K1 | K2 | K3 | K4 | K5 | K6 ]
       [ V1 | V2 | V3 | V4 | V5 | V6 ]
              |
              v
... and so on, cache grows by one token per step
Key observations:
- Cache grows linearly with tokens
- K and V of old tokens never change
- When conversation ends, cache is usually discarded (unless prompt caching persists it)
Prefill vs decode at a glance
The asymmetry is the whole game: prefill burns compute but runs in parallel, decode is light per step but serial and bottlenecked by KV cache reads. Every production optimization targets one side of this split. See inference pipeline for the deeper take.
Why this connects to everything
The KV cache is the center of LLM inference economics.
- Context length limit is often a KV cache memory limit, not a model architecture limit. A 7B model "with 128K context" means "with enough VRAM to fit a 128K-token KV cache".
- Paged attention (see vllm) manages the KV cache like OS virtual memory to reduce fragmentation.
- Prompt caching exposes the server-side KV cache to the API user, so repeated prompt prefixes skip prefill (see prompt caching).
- Multi-Query / Grouped-Query Attention (MQA, GQA) share K and V across heads to cut cache size 4-32x.
- KV quantization stores K and V in 8-bit or 4-bit instead of 16-bit to halve or quarter the cache size.
- Inference batching is limited by how much KV cache you can hold.
In 2026, "how much KV cache can I hold" is often the first question an inference engineer asks.
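To put numbers on the MQA/GQA and quantization bullets, here is a hedged sketch. It generalizes the per-token formula by replacing hidden_dim with num_kv_heads * head_dim, which is how the sharing shows up in the cache size. The config (32 layers, 32 query heads of size 128) is a hypothetical example, not a specific model's spec sheet.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    # 2 = one K and one V vector per KV head, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical config: 32 layers, 32 query heads, head size 128.
layers, query_heads, head_dim = 32, 32, 128

mha = kv_bytes_per_token(layers, query_heads, head_dim, 2)  # every head keeps its own K, V
gqa = kv_bytes_per_token(layers, 8, head_dim, 2)            # 8 KV heads shared by 32 query heads
mqa = kv_bytes_per_token(layers, 1, head_dim, 2)            # a single KV head for all query heads
gqa_int8 = kv_bytes_per_token(layers, 8, head_dim, 1)       # GQA plus 8-bit KV quantization

for name, size in [("MHA fp16", mha), ("GQA fp16", gqa), ("MQA fp16", mqa), ("GQA int8", gqa_int8)]:
    print(f"{name}: {size / 1024:.0f} KiB per token ({mha / size:.0f}x smaller than MHA)")
```

With this config the reduction is 4x for GQA, 32x for MQA, and 8x for GQA combined with 8-bit values, which is where the 4-32x range comes from.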
Relevance today (2026)
The fundamental mechanism has not changed. The optimizations around it have exploded. Snapshot of the 2026 landscape:
| Optimization | Year | What it does | When it matters most |
|---|---|---|---|
| MQA / GQA (Multi/Grouped-Query Attention) | 2019 / 2023 | All attention heads (or groups of them) share K and V. Cuts cache 4-32x. | Most recent open-weight models use one of them, e.g. Llama 3 and Mistral 7B (GQA). Closed models such as Claude and GPT do not publish their architecture. |
| Paged Attention (vLLM) | 2023 | Allocates the cache in small fixed-size pages, like OS virtual memory. Kills fragmentation. | Multi-user serving. See vllm. |
| Prompt caching APIs (Anthropic, OpenAI) | 2024 | Server keeps the cache of a prompt prefix across user requests. | Agent systems with long, repeated system prompts. Cost drops up to 90%. See prompt caching. |
| KV quantization (KVQuant, FP8) | 2024 | Stores K and V in 8 or 4 bits instead of 16. Saves 2-4x memory. | Long context, tight VRAM. Standard on H100+. |
| Sliding window attention | 2023+ | Each token only attends to the last N tokens. Cache can be truncated beyond the window. | Mistral 7B, newer Llamas. Useful for very long inputs. |
| Tiered cache (LMCache) | 2024 | Hot in GPU VRAM, warm in CPU RAM, cold on disk. | Long-running agent systems with intermittent reuse. |
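As an illustration of the sliding window row, the cache can be thought of as a fixed-length queue: once the window is full, appending a new token evicts the oldest one. This is a conceptual sketch only. `SlidingWindowKVCache` is a made-up name, strings stand in for tensors, and real implementations use rolling tensor buffers per layer.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps K and V only for the most recent `window` tokens."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)    # deque drops the oldest entry when full
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for token_index in range(10):
    cache.append(f"K{token_index}", f"V{token_index}")

print(len(cache), list(cache.keys))
```

After 10 tokens with a window of 4, only K6 through K9 remain. Memory stays constant instead of growing with the sequence.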
Two trends worth knowing in detail:
Reasoning models stress the cache harder
Reasoning models (o1, Opus 4.5 thinking, R1) generate thousands of "thinking" tokens per response. Decode dominates. The cache per request grows fast and stays alive longer. Cost per request is 3-10x higher than a normal model for this reason. Plan capacity accordingly.
Ollama vs vLLM: the production question
Ollama is great for local hacking but had poor KV cache management historically. It is not production grade for chat apps with concurrent users. If you are shipping to users, use vLLM, TGI, or SGLang, all of which handle the cache well.
Questions to master
The questions you should be able to answer after reading this notion. Mix of comprehension, trade-off reasoning, and production thinking. Read the question, try to answer in your head, then check.
Q: What is the KV cache and why does it speed up LLM inference?
The KV cache stores the Key and Value vectors from past tokens so the model does not recompute them at every new token. Without it, generating token 500 would redo the K and V computation for tokens 1 through 499 from scratch. With it, each new token only adds one K and V to the cache and reads the rest. The K/V work goes from quadratic (O(N^2): doubling tokens means 4x the work) to linear (O(N): doubling tokens means 2x the work). For a 500-token response, that is roughly 125,000 K/V computations vs 500. Real-time chat would be impossible without this.
Q: Why do we cache K and V but not Q?
Because Q of past tokens is used once and then discarded. The query of token N was only needed when generating token N+1; after that, we never need it again. K and V, in contrast, are reused at every future step: the new token's Q attends against all past K, and reads all past V. So K and V are the only vectors worth keeping around.
Q: Multi-Query Attention (MQA) reduces KV cache memory by up to 32x on a Llama 2-7B. Why does quality not drop 32x?
K and V across attention heads capture correlated signals. Full multi-head is over-parameterized for the caching role. Sharing one K and V projection across all heads removes redundancy without losing most of the useful information. In practice, MQA loses 1-3% on benchmarks, not 32x. That trade-off is why MQA or GQA (grouped-query) is the norm in recent production models.
Q: For a 100-user chat app on one GPU, what limits you first: model weights, activations, or KV cache?
Almost always the KV cache. Model weights are fixed regardless of how many users you serve. Activations are transient. The KV cache grows with context length AND with the number of concurrent requests. A 70B model at FP16 with 100 users at 4K context each needs about a terabyte of VRAM just for caches (100 x ~10.5 GB, using the multi-head estimate from the memory section). This is why vLLM's paged attention and KV cache quantization matter more than raw model size for multi-user apps.
Q: In production, you notice decoding latency is much higher than expected. Where do you look first?
Three suspects, in order:
- KV cache memory pressure: if the GPU is near full, decode starts thrashing. Check VRAM usage.
- Batch size too low or too high: too low wastes GPU, too high blows memory. Continuous batching (vLLM, SGLang) auto-tunes this.
- Wrong serving stack: if you are on Ollama with concurrent users, that is the bug. Move to vLLM or a hosted vLLM.
Do not chase model optimizations (quantization, speculative decoding) until these three are ruled out.
Q: Why is the KV cache the main bottleneck for context length, not the model weights?
Model weights are loaded once and shared across all requests. Their size does not scale with context. The KV cache grows linearly with tokens per request AND with the number of active requests. A 7B model uses ~14 GB of weights at FP16 regardless of context. Its KV cache at 128K context can exceed 50 GB (with standard multi-head attention at FP16). That is why "128K context" claims are really "enough VRAM to hold a 128K-token cache" claims.
Q: What is the relationship between the KV cache and prompt caching APIs (Anthropic, OpenAI)?
Prompt caching is the server-side KV cache exposed to API users. The provider keeps the KV cache of your prompt prefix across requests. If you send the same 20K-token system prompt 100 times, you pay full prefill once instead of 100 times. Anthropic reports up to 90% cost reduction on cached input. It is the same underlying mechanism as the KV cache, just persisted and billed differently.
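Why only a shared prefix can be reused: K and V for token i depend on every token before it, so the cache stays valid only up to the first position where two prompts differ. A toy sketch of that rule follows. The function name and token lists are made up for illustration, and this is not how any provider documents its implementation (real systems typically match on hashed blocks with minimum lengths).

```python
def shared_prefix_length(cached_tokens, new_tokens):
    """Number of leading tokens whose cached K, V can be reused."""
    count = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break  # everything after the first mismatch must be recomputed
        count += 1
    return count


cached = ["SYS", "You", "are", "helpful", "USER", "Hi"]
changed_end = ["SYS", "You", "are", "helpful", "USER", "Bye"]
changed_start = ["SYS2", "You", "are", "helpful", "USER", "Hi"]

print(shared_prefix_length(cached, changed_end))    # static part first: most of it is reused
print(shared_prefix_length(cached, changed_start))  # one change at the start: nothing is reused
```

The first call returns 5 and the second returns 0, which is the reason for putting static content first and dynamic content last.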
Common misconceptions
❌ "KV cache and prompt caching are the same thing." ✅ Related but distinct. KV cache is the GPU mechanism that stores K and V vectors during inference. Prompt caching is the API feature (Anthropic, OpenAI) that persists that cache across multiple user requests for billing purposes. Same underlying tech, different layer.
❌ "More cache = faster generation." ✅ Up to a point. Past a threshold, the GPU spends more time reading the cache than computing new tokens. Bigger cache = more memory traffic = slower per-token throughput. This is why serving stacks like vLLM page the cache instead of letting it grow unchecked.
❌ "Setting temperature=0 changes the KV cache." ✅ Zero relation. Temperature affects how the next token is picked from the model's output. The KV cache stores intermediate vectors. Two completely independent parts of the inference pipeline.
❌ "You need to manage the KV cache manually." ✅ Modern serving stacks (vLLM, SGLang, TGI) handle it for you. You only touch cache code when running raw HuggingFace transformers, which is rare in production. If you find yourself writing cache code, you are probably reinventing what vLLM already does better.
❌ "Bigger model = bigger cache." ✅ Partly true. Cache size scales with model layers and hidden dimension. But MQA and GQA (Multi-Query / Grouped-Query Attention) cut cache memory 4-32x WITHOUT changing model size. A 70B model with GQA can have less cache than a 13B model without.
Production pitfalls
- OOM on long conversations. Cache grows with conversation length. Check your max context sizing BEFORE prod. Do the math: layers * hidden * 2 * bytes * max_tokens * concurrent_requests.
- Ollama in prod. Limited KV cache management. Latency degrades under concurrent load. Use vLLM/TGI/SGLang for any multi-user app.
- Cold start on a new conversation. First prefill is expensive. If your app is latency-sensitive, keep a warm pool with pre-prefilled system prompts (via prompt caching).
- Cache invalidation on prompt change. Any modification to the prompt prefix invalidates the cached prefix hash. Even ONE changed character at the start means full recompute. Structure prompts: static parts first, dynamic parts last.
- Multi-tenant leakage risk. Without proper isolation, one user's KV cache could theoretically be exposed to another request. Prod systems should enforce tenant-scoped caches. Rare in practice, but worth auditing.
- Mid-stream prompt edits. Some agent frameworks mutate the prompt mid-generation (insert tool results, etc). Each mutation can break the cache. Use providers that support cache-friendly tool injection (Anthropic's cache breakpoints, OpenAI's implicit cache).
- Assuming more cache = better. Bigger cache means more memory, slower memory reads. At some point, cache reads become the decode bottleneck. Measure, do not assume.
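The "do the math" pitfall can be turned into a quick capacity check. This is a sketch under assumptions: it uses the multi-head sizing expression from the first pitfall as written, and the model shape and 40 GiB budget are hypothetical values to replace with your own. Both function names are illustrative.

```python
def kv_cache_bytes(layers: int, hidden: int, bytes_per_value: int,
                   max_tokens: int, concurrent_requests: int) -> int:
    # layers * hidden * 2 * bytes * max_tokens * concurrent_requests
    return layers * hidden * 2 * bytes_per_value * max_tokens * concurrent_requests


def max_concurrent_requests(vram_budget_bytes: int, layers: int, hidden: int,
                            bytes_per_value: int, max_tokens: int) -> int:
    """How many full-length requests fit in the VRAM left over for the cache."""
    per_request = kv_cache_bytes(layers, hidden, bytes_per_value, max_tokens, 1)
    return vram_budget_bytes // per_request


GIB = 1024 ** 3

# Hypothetical: 32 layers, hidden size 4096, FP16, 8,192-token max context,
# and 40 GiB of VRAM left for the cache once weights are loaded.
per_request = kv_cache_bytes(32, 4096, 2, 8192, 1)
capacity = max_concurrent_requests(40 * GIB, 32, 4096, 2, 8192)

print(f"{per_request / GIB:.0f} GiB per request, room for {capacity} concurrent requests")
```

With these assumed numbers each full-length request needs 4 GiB, so the budget holds 10 of them. Serving stacks with paged attention usually do better than this worst case because most requests never reach max context.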
Mental parallels (non-AI)
Three analogies, ranked by usefulness. Pick the one that sticks for you.
1. Chef's mise-en-place
A chef serves 100 plates of the same dish during a service.
- Without KV cache: for each plate, the chef washes, peels, chops all vegetables from scratch. Re-boils water. Re-makes the sauce. Service collapses by plate 5.
- With KV cache: before service, the chef does mise-en-place (pre-chops vegetables, pre-makes sauce, boils stock). During service, each plate is quick because the prep is cached and reused.
Mise-en-place IS the KV cache. Pay the prep cost once, reuse across many plates.
2. A lawyer at trial
A lawyer prepares for a trial.
- First read = prefill. The lawyer reads the entire case file, flags every relevant clause, takes notes. Slow but done in parallel (across pages).
- Trial Q&A = decode. Questions come one at a time. For each, the lawyer glances at their flagged notes (the cache), combines with the new question, answers fast.
If the lawyer loses their notes, they would have to re-read the whole file per question. Unthinkable in court. That is what running an LLM without a KV cache feels like.
3. Operating system virtual memory
This one is literal, not metaphorical.
OS virtual memory breaks RAM into small pages and manages them flexibly to avoid fragmentation when programs come and go. Paged Attention (vLLM) applies the SAME idea to the KV cache. It is not "inspired by" virtual memory, it is the same concept ported to GPU memory. Hence the name.
If you understand how Linux manages memory pages, you understand Paged Attention.
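To make the analogy concrete, here is a toy block allocator in the spirit of paging. Each request gets a block table mapping its tokens to fixed-size physical blocks, and blocks return to a shared free list when the request ends. This is a conceptual sketch, not vLLM's code or API. The class, its methods, and the block size of 16 are all illustrative choices.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request_id -> list of physical block ids
        self.lengths = {}       # request_id -> number of tokens stored

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token; returns the physical block used."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[-1]

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    allocator.append_token("request-a")  # 20 tokens -> 2 blocks
for _ in range(5):
    allocator.append_token("request-b")  # 5 tokens -> 1 block

blocks_a = len(allocator.block_tables["request-a"])
free_before = len(allocator.free_blocks)
allocator.release("request-a")
free_after = len(allocator.free_blocks)
print(blocks_a, free_before, free_after)
```

A request only ever wastes the unused tail of its last block, and freed blocks are immediately usable by any other request. That is the fragmentation win the analogy points at.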
Mini-lab
labs/kv-cache-from-scratch/ (to create) - implement naive transformer decoding with and without KV cache:
- Load a small pretrained model (GPT-2 small or Phi-3 mini)
- Write two generate functions:
  - generate_no_cache: recompute ALL K, V every step
  - generate_with_cache: reuse K, V, compute only the new token's K, V
- Generate a 200-token response with each. Measure time.
- Plot time per token for both. Show divergence as context grows.
- Plot cache memory usage over generation steps.
Goal: reproduce the Daily Dose DS 4.5x benchmark on your own machine. Feel the speedup.
Stretch: add paged attention manually. Then compare with vLLM.
Stack: uv + torch + transformers. Fits on an 8GB GPU for small models.
Alternatives / Comparisons (inference servers)
| Serving stack | KV cache quality | Paged attention | Prefix sharing | Verdict 2026 |
|---|---|---|---|---|
| vLLM | Excellent | Native (original author) | Yes | Reference for prod |
| SGLang | Excellent | Yes | Excellent (radix tree) | Rising fast |
| TGI (HuggingFace) | Good | Yes | Partial | Solid default |
| Together, Fireworks, Anyscale | Good (vLLM / SGLang under the hood) | Yes | Yes | Hosted convenience |
| Ollama | Limited | No | No | Homelab only |
| Raw HuggingFace Transformers | Basic | No | No | Prototype only |
For any multi-user chat app: vLLM or a hosted vLLM (Together, Fireworks). Never Ollama in production.
Further reading
Best visual explanations
- Daily Dose DS - "KV Caching in LLMs Explained Visually": https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/
- 3Blue1Brown - "Attention in transformers": https://www.youtube.com/watch?v=eMlx5fFNoYc
- Karpathy - "Let's build GPT from scratch": https://www.youtube.com/watch?v=kCc8FmEb1nY
Papers
- "Efficiently Scaling Transformer Inference" (Pope et al., Google, 2022): https://arxiv.org/abs/2211.05102
- "Efficient Memory Management for LLM Serving with PagedAttention" (Kwon et al., 2023): https://arxiv.org/abs/2309.06180 - vLLM paper
- KVQuant (Hooper et al., 2024) - aggressive KV quantization: https://arxiv.org/abs/2401.18079
- "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (Brandon et al., 2024): https://arxiv.org/abs/2405.12981
Related concepts in this KB
- attention mechanism - Q, K, V deep dive
- inference pipeline - prefill and decode
- prompt caching - exposing the cache to API users
- vllm - production serving with paged attention
- hallucinations - unrelated but also a thing you should know
- sampling and temperature - how the next token is actually picked once attention is done
Documentation references
- vLLM docs: https://docs.vllm.ai
- Anthropic prompt caching: https://docs.anthropic.com/claude/docs/prompt-caching
- LMCache (tiered KV caching): https://github.com/LMCache/LMCache
- CNCF talk, "vLLM on Kubernetes, 5x GPU efficiency with cache": https://www.youtube.com/results?search_query=CNCF+vLLM+Kubernetes+cache
- Chip Huyen, "AI Engineering", Chapter 9 - inference optimization (O'Reilly 2025, paid): https://www.oreilly.com/library/view/ai-engineering/9781098166298/