Attention Mechanism (Q, K, V deep dive)
Watch or read first (recommended order)
Before diving into this note, get the visual intuition:
- 3Blue1Brown - "Attention in transformers, visually explained" (YouTube, 26 min). The single best visual explanation of Q, K, V. Watch this even if you think you understand.
- Jay Alammar - "The Illustrated Transformer" (blog, 10 min read). Static diagrams, classic reference.
- Karpathy - "Let's build GPT from scratch" (YouTube, 2h). Code-first. Skip to 40:00 for attention if short on time.
If you already watched those, read on.
TL;DR
Attention is a weighted lookup. Given a **Query** vector (what you are looking for), you compare it against **Key** vectors (what each item advertises), and retrieve a weighted combination of **Value** vectors (what each item actually contains). In transformers, Q, K, V are learned linear projections of the same input.
The historical problem
Before attention, sequence models (RNNs, LSTMs) processed tokens one by one and squeezed everything into a single hidden state. Problem: when the sequence is long, old information gets washed out. A token at position 1 barely influences position 500.
Seq2seq encoder-decoder architectures (2014) pushed the whole input through an RNN and handed only the final state to the decoder. That last state was a bottleneck.
The attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017) fixed it by letting each output token look back at ALL input tokens directly, weighted by relevance.
How it works
The core formula
For a single attention head:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
Where:
- Q, K, V are matrices, each row is a vector per token
- d is the dimension of each K vector (for scaling stability)
Three steps:
- Compute similarity between each Q and each K: the Q K^T matrix
- Convert to attention weights: divide by sqrt(d) for gradient stability, then softmax each row
- Weighted sum of V vectors using those weights
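The three steps above can be sketched in a few lines of NumPy. This is an illustrative single-head implementation, not the optimized kernel a real inference stack would use:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d), K: (n, d), V: (n, d) -- one row per token
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # step 1: similarity matrix, scaled
    weights = softmax(scores, axis=-1)  # step 2: each row sums to 1
    return weights @ V                  # step 3: weighted sum of value rows

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))  # toy random projections for 5 tokens
out = attention(Q, K, V)
print(out.shape)  # (5, 8): one output vector per token
```

Each output row is a mixture of the value rows, with mixture weights given by how well that token's query matched every key.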
Mental parallel: library search
- Q (query) = what you are searching for. "books about cooking Italian pasta"
- K (key) = the card catalog index. Each book has a tag: "pasta recipes", "history of Italy", "baking 101".
- V (value) = the actual content of each book.
To search:
- Compare your Q to each K (dot product = similarity score)
- Softmax the scores to get weights (mostly on the most relevant books)
- Return a weighted mix of book contents (V)
The model learns Q, K, V projections that make this lookup meaningful.
Where do Q, K, V come from?
Given an input token embedding x (a vector), we apply three learned weight matrices:
Q = x W_Q
K = x W_K
V = x W_V
W_Q, W_K, W_V are parameters of the model. They are learned during pretraining.
Key insight for understanding kv cache: once x is fixed, K and V are deterministic. If you have already computed K_3 and V_3 for token 3, they will not change when you process token 4, 5, 6...
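This determinism is easy to verify numerically: compute K for the first three tokens, then again after three more tokens arrive, and compare. A minimal sketch with toy random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_K = rng.normal(size=(d, d))  # model weights, frozen at inference time

x = rng.normal(size=(6, d))    # embeddings for tokens 1..6

K_first3 = x[:3] @ W_K         # K computed when only 3 tokens existed
K_all6 = x[:6] @ W_K           # K recomputed after 3 more tokens arrived

# The first three K rows are identical: later tokens change nothing
print(np.allclose(K_all6[:3], K_first3))  # True
```

The same check holds for V. This is precisely why caching K and V is safe.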
Self-attention vs cross-attention
- Self-attention: Q, K, V all come from the same sequence. Used in decoders (GPT, Llama). Every token attends to every other token in the same sequence.
- Cross-attention: Q comes from one sequence (decoder), K and V come from another (encoder). Used in translation models (T5).
Modern LLMs (GPT, Claude, Llama) are decoder-only with self-attention.
Causal mask (the subtlety for generation)
In autoregressive generation, a token should not attend to future tokens (they do not exist yet during inference). A causal mask sets those attention scores to -infinity before softmax, so they become 0.
Visualized as a lower-triangular matrix (only past tokens, including self):
K1 K2 K3 K4 K5
+-----+-----+-----+-----+-----+
Q1 | X | | | | |
+-----+-----+-----+-----+-----+
Q2 | X | X | | | |
+-----+-----+-----+-----+-----+
Q3 | X | X | X | | |
+-----+-----+-----+-----+-----+
Q4 | X | X | X | X | |
+-----+-----+-----+-----+-----+
Q5 | X | X | X | X | X |
+-----+-----+-----+-----+-----+
Empty cells are masked (effectively 0 attention weight). X is computed.
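The mask in the diagram above is straightforward to build: set everything strictly above the diagonal to -infinity before softmax. A NumPy sketch:

```python
import numpy as np

n = 5
scores = np.random.default_rng(2).normal(size=(n, n))

# Strictly-upper-triangular positions are "future" tokens: mask them out
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# exp(-inf) = 0, so masked entries get exactly zero attention weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.allclose(np.triu(weights, k=1), 0.0))  # True
```

Row i of `weights` now distributes attention only over tokens 1..i, matching the lower-triangular X pattern.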
Multi-head attention (the parallel attention)
Instead of one attention operation with vectors of size d, transformers run multiple heads in parallel, each with smaller vectors.
Example: Llama 2-7B has hidden dimension 4096 and 32 heads. Each head has Q, K, V of dimension 4096 / 32 = 128.
Each head can learn to attend to different things:
- Head 1: syntax (subject-verb relationships)
- Head 2: coreference (he/she -> name)
- Head 3: long-range context (topic consistency)
- etc.
Outputs of all heads are concatenated and projected back to dimension 4096 via an output matrix W_O.
Dimensions recap (Llama 2-7B example)
| Element | Dimension |
|---|---|
| Token embedding x | 4096 |
| W_Q, W_K, W_V (per head) | 4096 x 128 |
| Q, K, V per head per token | 128 |
| Number of heads | 32 |
| Total Q / K / V per token across heads | 4096 |
| W_O (output projection) | 4096 x 4096 |
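The shape bookkeeping in the table can be exercised with a scaled-down model (toy dimensions standing in for Llama 2-7B's 4096 / 32 / 128; the mechanics are identical):

```python
import numpy as np

# Scaled-down stand-in: d_model=16, 4 heads of 4 (vs 4096, 32 heads of 128)
d_model, n_heads = 16, 4
d_head = d_model // n_heads

rng = np.random.default_rng(3)
x = rng.normal(size=(5, d_model))          # 5 token embeddings
W_Q = rng.normal(size=(d_model, d_model))  # concatenation of per-head W_Q's
W_O = rng.normal(size=(d_model, d_model))  # output projection

# Project, then split the last dim into heads: (tokens, heads, d_head)
Q = (x @ W_Q).reshape(5, n_heads, d_head)
print(Q.shape)  # (5, 4, 4)

# After per-head attention, head outputs are concatenated and projected by W_O
head_outputs = Q  # placeholder array with the right per-head shape
y = head_outputs.reshape(5, d_model) @ W_O
print(y.shape)  # (5, 16): back to model dimension
```

One big 4096x4096 projection followed by a reshape is mathematically equivalent to 32 separate 4096x128 matrices, which is how most implementations store it.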
The critical insight for KV cache
Here is the revelation that makes everything click:
For already-processed tokens, K and V are frozen. Only Q changes as new tokens are generated.
Why? Because:
- K_i = x_i W_K. Once token i is in the sequence, x_i is fixed forever. W_K is the model (also fixed at inference). So K_i is fixed.
- Same logic for V_i = x_i W_V.
- But Q is used to ASK a question. When a new token is generated, the NEW token's Q is different from all previous Q values.
Consequence: if you cache all the K and V vectors of past tokens, you NEVER have to recompute them. This is the entire KV cache. See kv cache.
Illustration: what "attention" looks like at decoding step 5
Input so far: 5 tokens. We want to generate token 6.
Previous cache:
K = [K1, K2, K3, K4, K5] (frozen, reused)
V = [V1, V2, V3, V4, V5] (frozen, reused)
Step 1: compute Q6 from new token's embedding
Step 2: compute K6 and V6 from new token's embedding
Step 3: append K6, V6 to the cache
Step 4: attention_6 = softmax(Q6 * [K1,...,K6]^T / sqrt(d)) * [V1,...,V6]
Step 5: use attention_6 to generate token 7
For each new token, you compute ONE new Q, K, V (just for that token). You reuse the N previously cached K and V. That is why decoding is fast.
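The five decoding steps above fit naturally into a small loop. This is a sketch of cached decoding with toy random weights, not a real model:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
W_Q, W_K, W_V = rng.normal(size=(3, d, d))  # toy stand-ins for learned weights

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

K_cache, V_cache = [], []

def decode_step(x_new):
    # ONE new Q, K, V for the incoming token; all past K/V come from the cache
    q = x_new @ W_Q
    K_cache.append(x_new @ W_K)  # step 3: append K to the cache
    V_cache.append(x_new @ W_V)  # ...and V
    K, V = np.stack(K_cache), np.stack(V_cache)
    weights = softmax(q @ K.T / np.sqrt(d))  # step 4: attend over all tokens
    return weights @ V

for t in range(6):
    out = decode_step(rng.normal(size=d))  # feed a fake new-token embedding
print(out.shape, len(K_cache))  # (8,) 6
```

Per step, the only matrix products involving the new token are three d x d projections; everything else is reuse.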
Relevance today (2026)
The fundamental Q, K, V mechanism has not changed since 2017. But the optimizations around it have:
- Multi-Query Attention (MQA): all heads share a single K and V projection. Saves a lot of KV cache memory at slight quality cost.
- Grouped-Query Attention (GQA): heads share K/V in groups (e.g., 8 groups of 4 heads in Llama 3). Middle ground, now standard.
- FlashAttention 2 and 3 (2023-2024): GPU-optimized kernel that computes attention without materializing the full attention matrix. Saves memory, standard in every production stack.
- Rotary Position Embedding (RoPE): replaces absolute positional encoding. Applied inside Q and K. Better generalization to long context.
- Sliding Window Attention (Mistral): each token attends only to a fixed window of past tokens. Caps KV cache growth for very long sequences.
All of these are optimizations of the SAME core Q, K, V mechanism. Master that, then layer on the tricks.
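The memory stakes behind MQA and GQA come down to simple arithmetic. A back-of-envelope comparison, assuming Llama-2-7B-like shapes (32 layers, head dim 128, fp16); exact per-model numbers vary:

```python
# Per-token KV cache bytes = 2 (K and V) * layers * KV heads * head dim * bytes/value
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

mha = kv_bytes_per_token(32, 32, 128)  # full multi-head: 32 KV heads
gqa = kv_bytes_per_token(32, 8, 128)   # grouped-query: 8 KV heads
mqa = kv_bytes_per_token(32, 1, 128)   # multi-query: 1 KV head

print(mha // 1024, gqa // 1024, mqa // 1024)  # KiB per token: 512 128 16
```

At a 32K-token context, that is the difference between roughly 16 GiB (MHA) and 0.5 GiB (MQA) of cache per sequence, which is why KV-head sharing became standard.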
Critical questions
- Why do we scale the dot product Q K^T by sqrt(d)? What breaks if you skip it?
- In self-attention, Q, K, V all come from the same input. What is the point of having three different projections instead of just one?
- In multi-head attention with 32 heads, is every head learning something different? How would you verify?
- MQA reduces KV cache memory by 32x for Llama 2-7B. Why does it not reduce quality by 32x?
- If a model has 4096 hidden dimension and 32 heads of 128, why not use 1 head of 4096? What is lost or gained?
Production pitfalls
- Forgetting the causal mask. Without it, a decoder can attend to future tokens during training, learning to "cheat". Bug is silent: training loss looks fine, generation is garbage.
- Mixing up Q, K, V dimensions. Especially with MQA/GQA, K and V can have different head counts than Q. This trips up engineers writing custom inference code.
- Not using FlashAttention. Naive attention at 32K context is O(N^2) memory. One request OOMs on a single GPU. Use vLLM or HuggingFace Transformers with FlashAttention backend.
- Wrong scaling. Some custom implementations forget the sqrt(d) or use the wrong d (head dim vs model dim). Model trains but converges poorly.
- Assuming all heads are equal. Pruning attention heads is a real compression technique, but some heads are more critical than others. See Michel et al. 2019.
Mental parallels (non-AI)
- Library + search engine: Q = query, K = index tags, V = content. You compute Q-K similarity, you retrieve weighted V.
- Panel of experts: Q = question asked, each K = expert's area of expertise (advertised specialty), V = their opinion. You weigh answers by how well each expert's specialty matches your question.
- Google search: Q = what you typed, K = titles + meta of indexed pages, V = page content. Ranking algorithm combines.
- Memory retrieval: Q = current thought, K = cues stored in memory, V = memories themselves. Your brain does something like attention constantly.
Paul Graham's essay "How to Write Usefully" has an attention-like metaphor: "write to find what you did not know you knew". The query changes what surfaces.
Mini-lab
labs/attention-from-scratch/ (to create) - implement one self-attention head in NumPy:
- Define a toy vocabulary (10 tokens)
- Define embeddings (random vectors of dim 8)
- Define W_Q, W_K, W_V as random 8x8 matrices
- Compute Q, K, V for a sequence of 5 tokens
- Compute attention scores (Q K^T / sqrt(d))
- Apply softmax row-wise
- Multiply by V to get output
- Visualize the attention weights as a heatmap
Goal: feel the math with your own fingers. 30 minutes of code, lifelong understanding.
Next step: add multi-head (split each vector into 2 heads of dim 4) and a causal mask.
Further reading
Canonical explanations
- 3Blue1Brown, "Attention in transformers, visually explained" (YouTube, 2024) - https://www.youtube.com/watch?v=eMlx5fFNoYc
- Jay Alammar, "The Illustrated Transformer" - https://jalammar.github.io/illustrated-transformer/
- Andrej Karpathy, "Let's build GPT from scratch" (YouTube, 2023) - https://www.youtube.com/watch?v=kCc8FmEb1nY
Original papers
- "Attention Is All You Need" (Vaswani et al., 2017) - the paper that started it all: https://arxiv.org/abs/1706.03762
- "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014) - first attention in deep learning
Modern variants
- FlashAttention (Dao et al., 2022) - https://arxiv.org/abs/2205.14135
- FlashAttention-2 (Dao, 2023) - https://arxiv.org/abs/2307.08691
- Multi-Query Attention (Shazeer, 2019) - https://arxiv.org/abs/1911.02150
- GQA: Grouped-Query Attention (Ainslie et al., 2023) - https://arxiv.org/abs/2305.13245
- RoFormer / RoPE (Su et al., 2021) - https://arxiv.org/abs/2104.09864
Deeper dives
- Lilian Weng, "The Transformer Family" (blog) - https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/
- Harvard NLP, "The Annotated Transformer" - http://nlp.seas.harvard.edu/annotated-transformer/
- Huyen, "AI Engineering", Chapter 2 (see raw/papers)