Attention Mechanism (Q, K, V deep dive)
Watch or read first (recommended order)
Before diving into this note, get the visual intuition:
- 3Blue1Brown - "Attention in transformers, visually explained" (YouTube, 26 min). The single best visual explanation of Q, K, V. Watch this even if you think you understand.
- Jay Alammar - "The Illustrated Transformer" (blog, 10 min read). Static diagrams, classic reference.
- Karpathy - "Let's build GPT from scratch" (YouTube, 2h). Code-first. Skip to 40:00 for attention if short on time.
If you already watched those, read on.
TL;DR
Attention is a weighted lookup. Given a **Query** vector (what you are looking for), you compare it against **Key** vectors (what each item advertises), and retrieve a weighted combination of **Value** vectors (what each item actually contains). In transformers, Q, K, V are learned linear projections of the same input.
The historical problem
Before attention, sequence models (RNNs, LSTMs) processed tokens one by one and squeezed everything into a single hidden state. Problem: when the sequence is long, old information gets washed out. A token at position 1 barely influences position 500.
Seq2seq encoder-decoder architectures (2014) pushed the whole input through an RNN and handed only the final state to the decoder. That last state was a bottleneck.
The attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017) fixed it by letting each output token look back at ALL input tokens directly, weighted by relevance.
How it works
The core formula
For a single attention head:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V
Where:
- Q, K, V are matrices, each row is a vector per token
- d is the dimension of each K vector (for scaling stability)
Three steps:
- Compute similarity between each Q and each K: the Q K^T matrix
- Convert to attention weights: divide by sqrt(d) for gradient stability, then softmax each row
- Weighted sum of V vectors using those weights
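The three steps above can be sketched in a few lines of NumPy. This is an illustrative single-head implementation, not the optimized kernel a real inference stack would use:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d), K: (n, d), V: (n, d) -- one row per token
    d = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # step 1: similarity matrix, scaled
    weights = softmax(scores, axis=-1)  # step 2: each row sums to 1
    return weights @ V                  # step 3: weighted sum of value rows

rng = np.random.default_rng(0)
n, d = 5, 8
Q, K, V = rng.normal(size=(3, n, d))  # toy random projections for 5 tokens
out = attention(Q, K, V)
print(out.shape)  # (5, 8): one output vector per token
```

Each output row is a mixture of the value rows, with mixture weights given by how well that token's query matched every key.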
Mental parallel: library search
- Q (query) = what you are searching for. "books about cooking Italian pasta"
- K (key) = the card catalog index. Each book has a tag: "pasta recipes", "history of Italy", "baking 101".
- V (value) = the actual content of each book.
To search:
- Compare your Q to each K (dot product = similarity score)
- Softmax the scores to get weights (mostly on the most relevant books)
- Return a weighted mix of book contents (V)
The model learns Q, K, V projections that make this lookup meaningful.
Where do Q, K, V come from?
Given an input token embedding x (a vector), we apply three learned weight matrices:
Q = x W_Q
K = x W_K
V = x W_V
W_Q, W_K, W_V are parameters of the model. They are learned during pretraining.
Key insight for understanding kv cache: once x is fixed, K and V are deterministic. If you have already computed K_3 and V_3 for token 3, they will not change when you process token 4, 5, 6...
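This determinism is easy to verify numerically: compute K for the first three tokens, then again after three more tokens arrive, and compare. A minimal sketch with toy random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W_K = rng.normal(size=(d, d))  # model weights, frozen at inference time

x = rng.normal(size=(6, d))    # embeddings for tokens 1..6

K_first3 = x[:3] @ W_K         # K computed when only 3 tokens existed
K_all6 = x[:6] @ W_K           # K recomputed after 3 more tokens arrived

# The first three K rows are identical: later tokens change nothing
print(np.allclose(K_all6[:3], K_first3))  # True
```

The same check holds for V. This is precisely why caching K and V is safe.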
Self-attention vs cross-attention
- Self-attention: Q, K, V all come from the same sequence. Used in decoders (GPT, Llama). Every token attends to every other token in the same sequence.
- Cross-attention: Q comes from one sequence (decoder), K and V come from another (encoder). Used in translation models (T5).
Modern LLMs (GPT, Claude, Llama) are decoder-only with self-attention.
Causal mask (the subtlety for generation)
In autoregressive generation, a token should not attend to future tokens (they do not exist yet during inference). A causal mask sets those attention scores to -infinity before softmax, so they become 0.
Visualized as a lower-triangular matrix (only past tokens, including self):
K1 K2 K3 K4 K5
+-----+-----+-----+-----+-----+
Q1 | X | | | | |
+-----+-----+-----+-----+-----+
Q2 | X | X | | | |
+-----+-----+-----+-----+-----+
Q3 | X | X | X | | |
+-----+-----+-----+-----+-----+
Q4 | X | X | X | X | |
+-----+-----+-----+-----+-----+
Q5 | X | X | X | X | X |
+-----+-----+-----+-----+-----+
Empty cells are masked (effectively 0 attention weight). X is computed.
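The mask in the diagram above is straightforward to build: set everything strictly above the diagonal to -infinity before softmax. A NumPy sketch:

```python
import numpy as np

n = 5
scores = np.random.default_rng(2).normal(size=(n, n))

# Strictly-upper-triangular positions are "future" tokens: mask them out
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

# exp(-inf) = 0, so masked entries get exactly zero attention weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.allclose(np.triu(weights, k=1), 0.0))  # True
```

Row i of `weights` now distributes attention only over tokens 1..i, matching the lower-triangular X pattern.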
Multi-head attention (the parallel attention)
Instead of one attention operation with vectors of size d, transformers run multiple heads in parallel, each with smaller vectors.
Example: Llama 2-7B has hidden dimension 4096 and 32 heads. Each head has Q, K, V of dimension 4096 / 32 = 128.
Each head can learn to attend to different things:
- Head 1: syntax (subject-verb relationships)
- Head 2: coreference (he/she -> name)
- Head 3: long-range context (topic consistency)
- etc.
Outputs of all heads are concatenated and projected back to dimension 4096 via an output matrix W_O.
Dimensions recap (Llama 2-7B example)
| Element | Dimension |
|---|---|
| Token embedding x | 4096 |
| W_Q, W_K, W_V (per head) | 4096 x 128 |
| Q, K, V per head per token | 128 |
| Number of heads | 32 |
| Total Q / K / V per token across heads | 4096 |
| W_O (output projection) | 4096 x 4096 |
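The shape bookkeeping in the table can be exercised with a scaled-down model (toy dimensions standing in for Llama 2-7B's 4096 / 32 / 128; the mechanics are identical):

```python
import numpy as np

# Scaled-down stand-in: d_model=16, 4 heads of 4 (vs 4096, 32 heads of 128)
d_model, n_heads = 16, 4
d_head = d_model // n_heads

rng = np.random.default_rng(3)
x = rng.normal(size=(5, d_model))          # 5 token embeddings
W_Q = rng.normal(size=(d_model, d_model))  # concatenation of per-head W_Q's
W_O = rng.normal(size=(d_model, d_model))  # output projection

# Project, then split the last dim into heads: (tokens, heads, d_head)
Q = (x @ W_Q).reshape(5, n_heads, d_head)
print(Q.shape)  # (5, 4, 4)

# After per-head attention, head outputs are concatenated and projected by W_O
head_outputs = Q  # placeholder array with the right per-head shape
y = head_outputs.reshape(5, d_model) @ W_O
print(y.shape)  # (5, 16): back to model dimension
```

One big 4096x4096 projection followed by a reshape is mathematically equivalent to 32 separate 4096x128 matrices, which is how most implementations store it.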
The critical insight for KV cache
Here is the revelation that makes everything click:
For already-processed tokens, K and V are frozen. Only Q changes as new tokens are generated.
Why? Because:
- K_i = x_i W_K. Once token i is in the sequence, x_i is fixed forever. W_K is the model (also fixed at inference). So K_i is fixed.
- Same logic for V_i = x_i W_V.
- But Q is used to ASK a question. When a new token is generated, the NEW token's Q is different from all previous Q values.
Consequence: if you cache all the K and V vectors of past tokens, you NEVER have to recompute them. This is the entire KV cache. See kv cache.
Illustration: what "attention" looks like at decoding step 5
Input so far: 5 tokens. We want to generate token 6.
Previous cache:
K = [K1, K2, K3, K4, K5] (frozen, reused)
V = [V1, V2, V3, V4, V5] (frozen, reused)
Step 1: compute Q6 from new token's embedding
Step 2: compute K6 and V6 from new token's embedding
Step 3: append K6, V6 to the cache
Step 4: attention_6 = softmax(Q6 * [K1,...,K6]^T / sqrt(d)) * [V1,...,V6]
Step 5: use attention_6 to generate token 7
For each new token, you compute ONE new Q, K, V (just for that token). You reuse the N previously cached K and V. That is why decoding is fast.
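The five decoding steps above fit naturally into a small loop. This is a sketch of cached decoding with toy random weights, not a real model:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 8
W_Q, W_K, W_V = rng.normal(size=(3, d, d))  # toy stand-ins for learned weights

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

K_cache, V_cache = [], []

def decode_step(x_new):
    # ONE new Q, K, V for the incoming token; all past K/V come from the cache
    q = x_new @ W_Q
    K_cache.append(x_new @ W_K)  # step 3: append K to the cache
    V_cache.append(x_new @ W_V)  # ...and V
    K, V = np.stack(K_cache), np.stack(V_cache)
    weights = softmax(q @ K.T / np.sqrt(d))  # step 4: attend over all tokens
    return weights @ V

for t in range(6):
    out = decode_step(rng.normal(size=d))  # feed a fake new-token embedding
print(out.shape, len(K_cache))  # (8,) 6
```

Per step, the only matrix products involving the new token are three d x d projections; everything else is reuse.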
Relevance today (2026)
The fundamental Q, K, V mechanism has not changed since 2017. But the optimizations around it have:
- Multi-Query Attention (MQA): all heads share a single K and V projection. Saves a lot of KV cache memory at slight quality cost.
- Grouped-Query Attention (GQA): heads share K/V in groups (e.g., 8 groups of 4 heads in Llama 3). Middle ground, now standard.
- FlashAttention 2 and 3 (2023-2024): GPU-optimized kernel that computes attention without materializing the full attention matrix. Saves memory, standard in every production stack.
- Rotary Position Embedding (RoPE): replaces absolute positional encoding. Applied inside Q and K. Better generalization to long context.
- Sliding Window Attention (Mistral): each token attends only to a fixed window of past tokens. Caps KV cache growth for very long sequences.
All of these are optimizations of the SAME core Q, K, V mechanism. Master that, then layer on the tricks.
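The memory stakes behind MQA and GQA come down to simple arithmetic. A back-of-envelope comparison, assuming Llama-2-7B-like shapes (32 layers, head dim 128, fp16); exact per-model numbers vary:

```python
# Per-token KV cache bytes = 2 (K and V) * layers * KV heads * head dim * bytes/value
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * d_head * dtype_bytes

mha = kv_bytes_per_token(32, 32, 128)  # full multi-head: 32 KV heads
gqa = kv_bytes_per_token(32, 8, 128)   # grouped-query: 8 KV heads
mqa = kv_bytes_per_token(32, 1, 128)   # multi-query: 1 KV head

print(mha // 1024, gqa // 1024, mqa // 1024)  # KiB per token: 512 128 16
```

At a 32K-token context, that is the difference between roughly 16 GiB (MHA) and 0.5 GiB (MQA) of cache per sequence, which is why KV-head sharing became standard.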
Critical questions
- Why do we scale the dot product Q K^T by sqrt(d)? What breaks if you skip it?
- In self-attention, Q, K, V all come from the same input. What is the point of having three different projections instead of just one?
- In multi-head attention with 32 heads, is every head learning something different? How would you verify?
- MQA reduces KV cache memory by 32x for Llama 2-7B. Why does it not reduce quality by 32x?
- If a model has 4096 hidden dimension and 32 heads of 128, why not use 1 head of 4096? What is lost or gained?
Production pitfalls
- Forgetting the causal mask. Without it, a decoder can attend to future tokens during training, learning to "cheat". Bug is silent: training loss looks fine, generation is garbage.
- Mixing up Q, K, V dimensions. Especially with MQA/GQA, K and V can have different head counts than Q. This trips up engineers writing custom inference code.
- Not using FlashAttention. Naive attention at 32K context is O(N^2) memory. One request OOMs on a single GPU. Use vLLM or HuggingFace Transformers with FlashAttention backend.
- Wrong scaling. Some custom implementations forget the sqrt(d) or use the wrong d (head dim vs model dim). Model trains but converges poorly.
- Assuming all heads are equal. Pruning attention heads is a real compression technique, but some heads are more critical than others. See Michel et al. 2019.
Mental parallels (non-AI)
- Library + search engine: Q = query, K = index tags, V = content. You compute Q-K similarity, you retrieve weighted V.
- Panel of experts: Q = question asked, each K = expert's area of expertise (advertised specialty), V = their opinion. You weigh answers by how well each expert's specialty matches your question.
- Google search: Q = what you typed, K = titles + meta of indexed pages, V = page content. Ranking algorithm combines.
- Memory retrieval: Q = current thought, K = cues stored in memory, V = memories themselves. Your brain does something like attention constantly.
Paul Graham's essay "How to Write Usefully" has an attention-like metaphor: "write to find what you did not know you knew". The query changes what surfaces.
Mini-lab
labs/attention-from-scratch/ (to create) - implement one self-attention head in NumPy:
- Define a toy vocabulary (10 tokens)
- Define embeddings (random vectors of dim 8)
- Define W_Q, W_K, W_V as random 8x8 matrices
- Compute Q, K, V for a sequence of 5 tokens
- Compute attention scores (Q K^T / sqrt(d))
- Apply softmax row-wise
- Multiply by V to get output
- Visualize the attention weights as a heatmap
Goal: feel the math with your own fingers. 30 minutes of code, lifelong understanding.
Next step: add multi-head (split each vector into 2 heads of dim 4) and a causal mask.
Further reading
Canonical explanations
- 3Blue1Brown, "Attention in transformers, visually explained" (YouTube, 2024) - https://www.youtube.com/watch?v=eMlx5fFNoYc
- Jay Alammar, "The Illustrated Transformer" - https://jalammar.github.io/illustrated-transformer/
- Andrej Karpathy, "Let's build GPT from scratch" (YouTube, 2023) - https://www.youtube.com/watch?v=kCc8FmEb1nY
Original papers
- "Attention Is All You Need" (Vaswani et al., 2017) - the paper that started it all: https://arxiv.org/abs/1706.03762
- "Neural Machine Translation by Jointly Learning to Align and Translate" (Bahdanau et al., 2014) - first attention in deep learning
Modern variants
- FlashAttention (Dao et al., 2022) - https://arxiv.org/abs/2205.14135
- FlashAttention-2 (Dao, 2023) - https://arxiv.org/abs/2307.08691
- Multi-Query Attention (Shazeer, 2019) - https://arxiv.org/abs/1911.02150
- GQA: Grouped-Query Attention (Ainslie et al., 2023) - https://arxiv.org/abs/2305.13245
- RoFormer / RoPE (Su et al., 2021) - https://arxiv.org/abs/2104.09864
Deeper dives
- Lilian Weng, "The Transformer Family" (blog) - https://lilianweng.github.io/posts/2023-01-27-the-transformer-family-v2/
- Harvard NLP, "The Annotated Transformer" - http://nlp.seas.harvard.edu/annotated-transformer/
- Huyen, "AI Engineering", Chapter 2 (see raw/papers)