LLMs

The models themselves: loading, serving, quantization, and inference-time optimization.

KV Cache

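Cache of the key and value tensors (K, V) that each attention layer computes for tokens already processed. With it, each decoding step computes K and V only for the new token and reuses the stored ones, instead of re-projecting the whole sequence. Essential for inference server performance: without it, every generated token requires reprocessing the entire prefix. The trade-off is memory, since the cache grows linearly with sequence length.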
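
A minimal sketch of the idea, assuming a single attention head, a single layer, and tiny hand-picked weights. The names `KVCache`, `decode_step_cached`, and `decode_step_uncached` are hypothetical and not taken from any library; real inference servers store the cache as preallocated tensors per layer and per head. The point is only that the cached step yields the same output as recomputing K and V for the whole prefix.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]


def attend(q, keys, values):
    # Scaled dot-product attention of one query over all keys and values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Hypothetical cache: one K vector and one V vector per token seen."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step_cached(x, Wq, Wk, Wv, cache):
    # Project only the new token; reuse cached K and V for earlier tokens.
    cache.append(matvec(Wk, x), matvec(Wv, x))
    return attend(matvec(Wq, x), cache.keys, cache.values)


def decode_step_uncached(xs, Wq, Wk, Wv):
    # Baseline: re-project K and V for every token seen so far.
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


def demo():
    # Toy 2x2 projection weights and three 2-dimensional token embeddings.
    Wq = [[0.5, -0.2], [0.1, 0.3]]
    Wk = [[0.4, 0.0], [-0.3, 0.2]]
    Wv = [[1.0, 0.5], [0.0, -1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

    cache = KVCache()
    max_diff = 0.0
    for t in range(len(tokens)):
        cached = decode_step_cached(tokens[t], Wq, Wk, Wv, cache)
        full = decode_step_uncached(tokens[: t + 1], Wq, Wk, Wv)
        step_diff = max(abs(a - b) for a, b in zip(cached, full))
        max_diff = max(max_diff, step_diff)
    return max_diff, len(cache.keys)
```

Calling `demo()` returns the largest difference between the cached and uncached outputs, along with the number of cached entries. The cached path does one K and one V projection per step, while the baseline does one per token in the prefix, which is the work the cache avoids.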