01·1 notion
LLMs
The models themselves: loading, serving, quantization, and inference-time optimization.
KV Cache
A cache of the attention keys and values (K, V) for tokens already processed, so each decoding step only computes projections for the newest token instead of re-running the whole prefix. Essential for inference-server throughput.
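A minimal single-head sketch of the idea, assuming NumPy, toy random weights, and hypothetical names (KVCache, decode_step) that are not from any particular library: each decode step projects only the newest token and attends against the accumulated cache, so the K and V of past tokens are computed exactly once.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Accumulates the K and V projections of every token seen so far."""
    def __init__(self):
        self.keys = []    # one (d_head,) vector per past token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def stacked(self):
        return np.stack(self.keys), np.stack(self.values)

def decode_step(x_new, W_q, W_k, W_v, cache):
    """One autoregressive step: project only the newest token,
    then attend over all cached keys/values (including the new one)."""
    q = x_new @ W_q                 # (d_head,)
    k = x_new @ W_k
    v = x_new @ W_v
    cache.append(k, v)              # past tokens' K/V are never recomputed
    K, V = cache.stacked()          # (t, d_head)
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = softmax(scores)       # attention over the full prefix
    return weights @ V              # attention output for the new token

# Toy usage: random weights, tokens fed one at a time as in decoding.
rng = np.random.default_rng(0)
d_model, d_head = 16, 8
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
cache = KVCache()
for token_embedding in rng.normal(size=(5, d_model)):
    out = decode_step(token_embedding, W_q, W_k, W_v, cache)
print(out.shape)  # (8,) -- only the latest token's output is produced
```

Real servers differ from this sketch: the cache is kept per layer and per head in preallocated (or paged) GPU tensors, and its memory footprint is what usually bounds batch size and context length.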