
LLMs

Models themselves: loading, serving, quantization, inference-time optimization.

LLM Inference Pipeline (Prefill vs Decode)

When you ask an LLM something, the model does TWO different jobs back to back:

- **Prefill**: read your prompt. Fast because it reads everything in parallel.
- **Decode**: write the answer. Slow because it writes one token at a time.

That's the whole concept. The "ChatGPT is thinking..." pause you see before streaming starts is prefill finishing. The words appearing one by one after are decode. Everything else on this page is details about WHY each phase behaves the way it does, and WHY that matters in production.
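To make the two phases concrete, here is a toy latency model, not a benchmark. The function name `estimate_latency` and every throughput number are assumptions chosen only to show the shape of the result: prefill sets the time to first token, decode sets how long the rest takes.

```python
def estimate_latency(prompt_tokens, output_tokens,
                     prefill_tokens_per_s, decode_tokens_per_s):
    """Split request latency into prefill and decode parts (toy model)."""
    # Prefill: the whole prompt is processed in one parallel pass.
    time_to_first_token = prompt_tokens / prefill_tokens_per_s
    # Decode: one token per step, and each step waits for the previous one.
    decode_time = output_tokens / decode_tokens_per_s
    return time_to_first_token, decode_time, time_to_first_token + decode_time


# Made-up throughput numbers (assumptions, not measurements).
ttft, decode, total = estimate_latency(
    prompt_tokens=2000, output_tokens=500,
    prefill_tokens_per_s=10000, decode_tokens_per_s=50,
)
print(f"time to first token: {ttft:.2f}s, decode: {decode:.2f}s, total: {total:.2f}s")
```

Even with a prompt four times longer than the answer, decode dominates the total in this sketch, because it cannot be parallelized across output tokens.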

KV Cache

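The KV cache stores the Key (K) and Value (V) vectors of past tokens so they are not recomputed for every new token generated. Without it, each decode step re-processes the whole sequence so far, so total generation work grows quadratically (O(N^2): doubling response length means roughly 4x the compute). With it, each step processes only the newest token, so the work grows roughly linearly (O(N): doubling response length about doubles the compute), although attention still has to read the whole cache at every step. In practice, that is a speedup on the order of 4.5x on decoding (the exact figure depends on the model and sequence length), at the cost of gigabytes of GPU memory.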
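A rough way to see both sides of that trade is to count work and memory directly. The sketch below is a simplification: `token_passes` counts only how many token positions go through the model and ignores the per-step cost of attention. The helper names are hypothetical, and the model shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) is an assumed 7B-class configuration, not taken from this page.

```python
def token_passes(num_tokens, use_cache):
    """Count token positions pushed through the model while decoding."""
    if use_cache:
        # Each step only processes the newest token.
        return num_tokens
    # Step t re-processes all t tokens seen so far: 1 + 2 + ... + N.
    return num_tokens * (num_tokens + 1) // 2


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Memory for cached K and V: two tensors per layer, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Doubling the response length: about 4x the work without a cache, 2x with one.
print(token_passes(2000, use_cache=False) / token_passes(1000, use_cache=False))
print(token_passes(2000, use_cache=True) / token_passes(1000, use_cache=True))

# Assumed 7B-class shape, fp16 values, one 4096-token sequence.
size = kv_cache_bytes(seq_len=4096, num_layers=32, num_kv_heads=32, head_dim=128)
print(f"{size / 2**30:.1f} GiB")
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache, and that figure scales linearly with batch size and context length.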

Sampling and Temperature

Sampling is how a model picks the next token from the probability distribution over its vocabulary. The main knobs are **temperature** (flatten or sharpen probabilities), **top-k** (limit to the k most likely tokens), and **top-p** (limit to the smallest set whose cumulative probability exceeds p). These knobs control the creativity vs consistency trade-off.
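The three knobs can be sketched in plain Python as below. This is a minimal illustration, not any particular library's implementation: the function name `sample_next_token` is hypothetical, and the order used here (temperature, then top-k, then top-p on the renormalized survivors) is one common choice that real libraries may vary.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding (always the top token).
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature: divide logits before softmax (<1 sharpens, >1 flattens).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Rank token indices from most to least likely.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Renormalize over the survivors so probabilities sum to 1.
    total = sum(weights[i] for i in ranked)
    probs = {i: weights[i] / total for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # Sample among survivors; random.choices rescales the weights itself.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]


logits = [1.0, 3.0, 2.0]  # toy 3-token vocabulary
print(sample_next_token(logits, temperature=0.7, top_k=2, top_p=0.9))
```

Low temperature with a small top-k or top-p pushes output toward the single most likely token (consistency); high temperature with loose limits lets unlikely tokens through (creativity).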

Post-Training (SFT + RLHF/DPO)

Post-training turns a raw pre-trained language model into something usable. Two main steps: **Supervised Fine-Tuning (SFT)** teaches the model to follow instructions, then **Preference Fine-Tuning** (for example RLHF or DPO) aligns it with human preferences. Every modern assistant (ChatGPT, Claude, Gemini) goes through some version of this pipeline.
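For the preference step, the DPO objective is compact enough to write out for a single preference pair. The sketch below assumes each argument is the summed log-probability of a full response under either the model being trained (the policy) or a frozen reference copy. The function name and the default `beta=0.1` are illustrative assumptions, not values from this page.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: loss is log(2), about 0.693.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen answer: loss goes down.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy raises the preferred response relative to the rejected one, measured against the reference model, which is how DPO aligns to preferences without training a separate reward model.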

Hallucinations

A hallucination is when a model generates content that is not grounded in facts. Two dominant hypotheses: (1) self-delusion (the model cannot distinguish its own output from given facts), (2) knowledge mismatch (SFT teaches the model to mimic answers that use facts the model itself does not know). Mitigation: retrieval, citations, reward models that penalize fabrication.

Test-Time Compute

Test-time compute is about spending more inference budget per query to improve quality. Instead of answering in one shot, the model can generate multiple candidates and pick the best, or reason for longer before committing to an answer. It unlocked a new era of **reasoning models** (o1, o3, Claude Opus 4.x thinking mode, DeepSeek R1). DeepMind showed this can sometimes beat scaling model size.
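The "generate multiple candidates and pick the best" strategy is often called best-of-N, and it fits in a few lines. In this sketch `generate` and `score` are hypothetical callables standing in for an LLM call and a reward model or checker. The toy versions below only guess numbers so the example runs on its own.

```python
import random


def best_of_n(prompt, generate, score, n=4):
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))


# Toy stand-ins (assumptions for illustration only): a "model" that guesses
# integers and a "verifier" that prefers guesses close to 42.
rng = random.Random(0)


def toy_generate(prompt):
    return rng.randint(0, 100)


def toy_score(prompt, answer):
    return -abs(answer - 42)


single = toy_generate("guess the number")
best = best_of_n("guess the number", toy_generate, toy_score, n=16)
print("one shot:", single, "| best of 16:", best)
```

Cost scales linearly with `n`, and quality is capped by how reliable the scorer is, which is why this approach helps most where answers can be verified.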