vLLM
TL;DR
High-performance LLM inference server, the de facto production reference in 2024-2026. Implements PagedAttention, continuous batching, automatic KV caching, and runtime LoRA adapters.
The historical problem
Naively serving an LLM with HuggingFace Transformers gives you:
- No efficient KV cache between requests
- No continuous batching (each request waits for the previous one to finish)
- Fragmented GPU memory
- Laughable throughput (1-10 tokens/s for a 7B)
In 2022, no open-source inference server was truly production-ready for LLMs, and existing alternatives like DeepSpeed-Inference were complex to deploy and operate.
How it works
vLLM stacks several innovations:
- [[page-attention|PagedAttention]]: paged management of the KV cache, inspired by OS virtual memory. Reduces GPU memory fragmentation and allows memory sharing across requests.
- Continuous batching: no fixed batch; requests enter and leave the batch on the fly, maximizing throughput.
- Native [[../01-llms/kv-cache|KV Cache]]: handled automatically across requests, with prefix sharing.
- Runtime [[../10-fine-tuning/lora-adapters|LoRA adapters]]: hot load/unload of adapters without redeploying. Enables multi-tenant serving with per-use-case specialization.
- Quantization support: AWQ, GPTQ, FP8, INT4...
- OpenAI-compatible API: drop-in replacement, easy to integrate.
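The core PagedAttention idea can be sketched as a block allocator: KV-cache memory is carved into fixed-size blocks ("pages"), and each sequence keeps a block table mapping logical token positions to physical blocks, so no contiguous reservation is needed and memory is reclaimed the instant a request finishes. A toy sketch of the idea only; vLLM's real allocator lives in C++/CUDA and is far more involved (the class and method names here are illustrative, not vLLM's API):

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class PagedKVAllocator:
    """Toy paged KV-cache allocator: block tables over a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: str, pos: int) -> int:
        """Ensure a physical block backs logical position `pos`; return it."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = pos // BLOCK_SIZE
        if logical_block == len(table):  # sequence grew past its last block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted -> request would be preempted")
            table.append(self.free_blocks.pop())
        return table[logical_block]

    def free(self, seq_id: str) -> None:
        """Request finished: return its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=4)
for pos in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks, no over-reservation
    alloc.append_token("req-A", pos)
print(len(alloc.block_tables["req-A"]))  # 2
alloc.free("req-A")
print(len(alloc.free_blocks))  # 4, all blocks reclaimed
```

Contrast this with naive serving, which reserves a contiguous max-length KV buffer per request up front: the paged scheme wastes at most one partially filled block per sequence.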
Relevance today (2026)
vLLM is THE reference in 2026, but the landscape is moving:
- SGLang is rising fast in 2025-2026; its radix-tree prefix caching (RadixAttention) can beat vLLM on cache-heavy workloads
- TensorRT-LLM (NVIDIA): better raw performance on H100/H200, but less portable
- vLLM v1 (2024-2025): architectural rewrite, faster and less buggy
- MLX (Apple Silicon): for local Mac use, not a production competitor
Honest question: will vLLM still be the reference in 2027?
- For Kubernetes multi-GPU stacks: yes (ecosystem, CNCF talks, integrations)
- For single-tenant ultra-optimized setups: SGLang and TensorRT-LLM may take over
Where vLLM shines:
- Multi-tenant (multiple users, multiple LoRAs)
- K8s scale with inference gateway
- Need for an OpenAI-compatible API
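Since the API is OpenAI-compatible, any standard HTTP client works unchanged. A minimal sketch, assuming a local deployment on vLLM's default port and a hypothetical model name (substitute your own):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM serving locally on its default port

# Same request shape as the OpenAI Chat Completions API
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical: use your deployed model id
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 128,
}

def chat_completion(payload: dict) -> dict:
    """POST a chat completion request to the vLLM server and parse the JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Only run against a live server
    print(chat_completion(payload)["choices"][0]["message"]["content"])
```

The official `openai` Python SDK also works by pointing `base_url` at the vLLM server, which is what makes it a drop-in replacement.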
Critical questions
- Why is continuous batching more efficient than classic batching? What's the tradeoff?
- What is PagedAttention, concretely? Why is it a game changer?
- If I have 10 LoRA adapters to serve, do I load them all on the GPU or at runtime? Tradeoff?
- What are the cases where vLLM is worse than a simpler server ?
- What is vLLM's memory overhead vs raw Transformers ?
- How does vLLM handle failover ? And graceful shutdown of a long request ?
- What about observability (Prometheus metrics, traces)? Is it production-grade?
Production pitfalls
- Cold start: loading a 7B model takes 30s-2min; anticipate it (warm pool of replicas)
- Memory config: the default `gpu_memory_utilization` is too aggressive and can OOM under load
- Mixed-precision traps: some architectures do not support FP8; verify before deploying
- LoRA adapters: you quickly hit the loaded-adapter limit if too many are active at the same time
- Upgrades: vLLM moves fast, breaking changes between versions. Pin your versions.
- Multi-GPU: tensor parallelism vs pipeline parallelism, pick based on the model
- Debugging: logs are not always clear under heavy load, integrate Langfuse/Arize properly
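A launch config sketch reflecting the pitfalls above (the flags exist in recent vLLM CLI versions, but verify them against your pinned release, since the CLI changes between versions; the model id and version pin are placeholders):

```shell
# Pin the version explicitly: vLLM has breaking changes between releases.
pip install "vllm==0.6.3"   # hypothetical pin; use the version you validated

# Lower gpu-memory-utilization below the default (0.90) to leave OOM headroom,
# cap max-model-len instead of taking the model's full context window,
# bound the number of LoRA adapters resident on GPU, and choose tensor
# parallelism explicitly for multi-GPU.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --enable-lora \
  --max-loras 4 \
  --tensor-parallel-size 2
```

Treat this as a starting point to load-test, not as tuned values.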
Alternatives / Comparisons
See vllm vs alternatives (to create via /compare vllm vs tgi vs sglang).
Quick:
- vLLM: solid default, community, ecosystem
- SGLang: pure perf, radix tree prefix cache, rising fast
- TensorRT-LLM: top NVIDIA perf, less portable, harder to deploy
- TGI (HuggingFace): stable, good HF integration, less buzz
- Ollama: homelab only, not production
- llama.cpp: edge / embedded, not a multi-tenant server
Mini-lab
[[labs/01-vllm-vs-ollama-benchmark/]] - deploy the same model on vLLM and Ollama, benchmark throughput and latency on 100 concurrent requests.
To create: /lab vllm.
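The concurrent part of the benchmark can be sketched with a small harness: fire N requests in parallel and collect per-request latency. `send_request` is a placeholder to replace with a real HTTP call against vLLM's or Ollama's API in the lab:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(send_request, n_requests: int = 100, concurrency: int = 16) -> dict:
    """Run `send_request` n_requests times across a thread pool; report stats."""
    start = time.perf_counter()

    def timed(i: int) -> float:
        t0 = time.perf_counter()
        send_request(i)  # placeholder: swap in a real API call
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_requests)))
    wall = time.perf_counter() - start
    return {
        "throughput_rps": n_requests / wall,
        "p50_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }

# Dry run with a stub that sleeps 10 ms instead of calling a server:
stats = benchmark(lambda i: time.sleep(0.01), n_requests=32, concurrency=8)
print(stats)
```

For token-level throughput (tokens/s rather than requests/s), the lab version should also count tokens in each response body.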
Further reading
- vLLM paper (SOSP 2023): "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- https://docs.vllm.ai
- Repo: https://github.com/vllm-project/vllm
- CNCF talks on vLLM + Kubernetes + inference gateway
- gpu autoscaling k8s: scaling vLLM on K8s with custom metrics