LLMs
01·LLMs·updated 2026-04-19

KV Cache

Cache of the keys and values (K, V) from the attention mechanism so we do not recompute attention over tokens we have already seen. Essential for inference server performance.

The historical problem

In an autoregressive Transformer, each newly generated token needs attention computed over ALL previous tokens. Without a cache, the K and V of tokens 1..N-1 are recomputed from scratch at every step: the same matrix products, over and over, N times. Costly in GPU time, and pointless, because those values never change.

In a chat conversation, every turn resends the whole history plus the new message. Without a KV cache, the server retokenizes and recomputes attention over the full past on every turn.

How it works

The 2 inference phases (prefill vs decode)

Every LLM request has 2 distinct phases:

  • Prefill: processes the whole input prompt at once. Dense matmul on all tokens to build the internal representation. Compute-bound, expensive. This is where we compute K and V for each token.
  • Decode: generates tokens one by one. Each new token attends over the whole past but does very little compute. Memory-bound: dominated by reading the KV cache.

The cache is built during prefill and read during decode.
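The two phases can be sketched with a toy single-head attention in NumPy. This is an illustration, not any real model's code: the weights Wq, Wk, Wv and the dimension d are made up.

```python
# Toy single-head attention illustrating prefill vs decode.
# Wq, Wk, Wv and d are illustrative, not from any real model.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention of one query over all cached keys/values."""
    scores = K @ q / np.sqrt(d)          # (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                         # (d,)

# --- Prefill: process the whole prompt at once, build the cache ---
prompt = rng.standard_normal((5, d))     # 5 prompt "token embeddings"
K_cache = prompt @ Wk                    # (5, d) keys, computed once
V_cache = prompt @ Wv                    # (5, d) values, computed once

# --- Decode: one new token -> compute ONLY its k, v and append ---
x_new = rng.standard_normal(d)
K_cache = np.vstack([K_cache, x_new @ Wk])   # append, no recompute
V_cache = np.vstack([V_cache, x_new @ Wv])
out = attend(x_new @ Wq, K_cache, V_cache)
print(K_cache.shape, out.shape)          # cache grew from 5 to 6 entries
```

Note that decode touches only one row of new matmuls per step; everything else is a read of the cache.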

Mechanics

  1. During the forward pass in prefill, each attention layer computes Query, Key and Value for each token. The K/V of a token only depend on tokens before it -> once computed, they never change.
  2. Instead of throwing them away, we store them in a cache (K, V tensors per layer).
  3. On the next token (decode), we only compute the K, V of the new token and append it to the cache.
  4. Attention for the new token attends over the WHOLE cache (without recomputing the past).
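Step 4 is worth verifying: attending over the cache gives exactly the same output as recomputing everything. A minimal NumPy check (toy shapes and weights, nothing model-specific):

```python
# Check that cached decoding matches full recomputation: same math, less work.
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 6
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((T, d))          # token embeddings

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Without cache: for the last token, recompute K/V of ALL tokens
K_full, V_full = X @ Wk, X @ Wv
q = X[-1] @ Wq
ref = softmax(K_full @ q / np.sqrt(d)) @ V_full

# With cache: K/V of past tokens were stored earlier; only the new token is computed
K_cache, V_cache = X[:-1] @ Wk, X[:-1] @ Wv      # built during earlier steps
k_new, v_new = X[-1] @ Wk, X[-1] @ Wv            # the only new work
K_cache = np.vstack([K_cache, k_new])
V_cache = np.vstack([V_cache, v_new])
cached = softmax(K_cache @ q / np.sqrt(d)) @ V_cache

assert np.allclose(ref, cached)          # identical outputs
```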

Gains:

  • Faster generation: per-token attention cost drops from O(N^2) to O(N)
  • Fewer GPU FLOPs: only the new token's K/V are computed
  • Stable latency: each next token is nearly free in compute (the cost shifts to memory)

Cost: memory. The cache grows linearly with context length. For a 7B-class model in 16 bits, with 32 layers, 32 heads and head dimension 128, that is about 0.5 MB per token, i.e. roughly 16 GB for a 32k-token context.
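A quick back-of-envelope for that memory cost, assuming head dimension 128 (typical for a 7B model with a 4096 hidden size) and fp16 (2 bytes per value):

```python
# KV cache size: 2 tensors (K and V) x layers x heads x head_dim x bytes, per token.
layers, heads, head_dim, bytes_per = 32, 32, 128, 2   # assumed 7B-class config
per_token = 2 * layers * heads * head_dim * bytes_per  # bytes per cached token
print(per_token / 1024)                  # -> 512.0 KB per token

ctx = 32_000
total_gb = per_token * ctx / 1024**3
print(round(total_gb, 1))                # -> 15.6 GB for a 32k context
```

This is why long contexts are priced by memory: the cache alone can exceed the model weights.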

Hash-based cross-request caching

Beyond the intra-generation cache, modern inference servers (vLLM, SGLang) persist the cache across requests, indexed by a hash of the token prefix (typically computed per fixed-size block of tokens):

  • New request comes in -> hash the prefix -> match -> K/V loaded from memory, prefill skipped for these tokens
  • Only the new tokens are prefilled

This is the mechanism exposed to the user via prompt caching on the API side (Anthropic, OpenAI, Gemini). Fragility: any change in the prefix (even reordering) changes the hashes from that point onward, so everything after the first modified token is invalidated.
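The lookup and its fragility can be sketched with a chained hash per prefix. This is a toy, not vLLM's actual implementation (which hashes fixed-size blocks of token ids); the store holds strings standing in for real K/V tensors.

```python
# Toy cross-request prefix cache: one chained hash per prefix position.
import hashlib

kv_store = {}   # prefix-hash -> "computed KV" (stand-in for real tensors)

def prefix_hashes(tokens):
    """One chained hash per prefix: changing token i invalidates i..N."""
    h, out = hashlib.sha256(), []
    for t in tokens:
        h.update(str(t).encode())
        out.append(h.copy().hexdigest())
    return out

def prefill(tokens):
    """Compute KV only for tokens whose prefix is not already cached."""
    computed = 0
    for i, hx in enumerate(prefix_hashes(tokens)):
        if hx not in kv_store:
            kv_store[hx] = f"kv[{i}]"    # placeholder for the K/V tensors
            computed += 1
    return computed

print(prefill([1, 2, 3, 4]))      # first request: 4 tokens prefilled
print(prefill([1, 2, 3, 4, 5]))   # shared prefix: only token 5 prefilled
print(prefill([9, 2, 3, 4]))      # first token differs: all 4 recomputed
```

The last call shows the invalidation: tokens 2, 3, 4 are identical, but because they sit after a changed token, their chained hashes differ and nothing is reused.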

Relevance today (2026)

The KV cache is more critical than ever:

  • Contexts explode (1M+ tokens). Without a cache, inference is unusable.
  • Inference prices are driven by the KV cache (GPU memory = cost)
  • New research directions:
    • PagedAttention (vLLM): paged management of the KV cache, inspired by OS virtual memory. Reduces fragmentation.
    • Prefix sharing: several requests with the same system prompt share the cache
    • KV cache quantization (KVQuant, 2024): compress K/V to fit more
    • Prompt caching (Anthropic, OpenAI, Gemini): expose the cache to the user via API to reduce the cost of long repeated prompts. Claude Code reaches 92% hit-rate and -81% cost through this discipline. See prompt caching.
    • Tiered cache (LMCache): GPU -> RAM -> disk depending on hit-rate
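On the quantization direction: the simplest form is per-tensor int8 with a single scale, which halves the cache vs fp16 at the cost of a small reconstruction error. A toy sketch (KVQuant itself is more sophisticated: per-channel scales, outlier handling):

```python
# Toy per-tensor int8 quantization of a K cache: half the memory of fp16.
import numpy as np

rng = np.random.default_rng(2)
K = rng.standard_normal((64, 128)).astype(np.float16)   # cached keys, fp16

Kf = K.astype(np.float32)
scale = np.abs(Kf).max() / 127.0                 # one scale for the tensor
K_int8 = np.clip(np.round(Kf / scale), -127, 127).astype(np.int8)
K_back = K_int8.astype(np.float32) * scale       # dequantize on read

print(K_int8.nbytes / K.nbytes)                  # -> 0.5, half the memory
print(float(np.abs(Kf - K_back).max()))          # worst-case error, <= scale/2
```

The error bound is scale/2 per value; whether that is acceptable end to end is exactly the question raised in the critical questions below.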

Question to ask today: if your inference engine does not handle the KV cache properly, you pay 2x to 10x more than necessary. It has become table stakes for a production stack.

Critical questions

  • Why does the cache grow with context length? What exactly is stored?
  • With 10 simultaneous users on the same model, how do I share or isolate the caches?
  • A 10 GB cache for 1 user: is that sustainable at 100 users? Which strategies?
  • Ollama keeps only a limited KV cache across requests (as of 2024). What does that mean for my chat latency?
  • If I reset a conversation, what happens to the cache? And for a very long conversation?
  • API-side prompt caching (Anthropic): what does it add over server-side caching?
  • Quantizing the KV cache (8 bits instead of 16): is the quality loss measurable?

Production pitfalls

  • GPU OOM on long conversations: compute max memory based on max context
  • Ollama: no persistent KV cache across requests = slow in chat. Pick vLLM/TGI/SGLang for prod.
  • Cache miss when changing the system prompt: full invalidation (gain lost)
  • Cold start: first token very slow if the cache has to be built from scratch
  • Multi-tenant: without isolation, potential context leak between users (rare but possible)

Alternatives / Comparisons

Inference servers and KV cache:

| Server | KV cache | PagedAttention | Prefix sharing | Production verdict |
|---|---|---|---|---|
| vLLM | Yes (top) | Yes (native) | Yes | Production reference |
| TGI (HuggingFace) | Yes | Partial | Yes | Solid |
| SGLang | Yes | Yes | Excellent (radix tree) | Rising |
| Ollama | Limited | No | No | Homelab, not production |
| Transformers (HF) raw | Basic | No | No | Prototype only |

Mini-lab

[[labs/01-inference-with-kv-cache/]] - implement inference with and without KV cache on GPT-2, benchmark token-by-token latency.

To create: /lab kv-cache.

Further reading

  • "Efficiently Scaling Transformer Inference" (Pope et al., 2022)
  • vLLM paper "Efficient Memory Management for Large Language Model Serving with PagedAttention"
  • LMCache: tiered KV caching (GPU/CPU/disk)
  • Anthropic prompt caching docs: https://docs.anthropic.com/claude/docs/prompt-caching
  • CNCF video: "VLLM on Kubernetes, Squeeze 5x GPU efficiency with cache"
  • KVQuant paper (2024): aggressive quantization of the KV cache
inference · optimization · memory · attention