vLLM
TL;DR
High-performance LLM inference server, the de facto production reference in 2024-2026. Implements PagedAttention, continuous batching, automatic KV caching, and runtime LoRA adapters.
The historical problem
Naively serving an LLM with HuggingFace Transformers gives you:
- No efficient KV cache between requests
- No continuous batching (each request waits for the previous one to finish)
- Fragmented GPU memory
- Laughable throughput (1-10 tokens/s for a 7B)
In 2022, no open-source inference server was truly production-ready for LLMs, and existing alternatives like DeepSpeed-Inference were complex to deploy and operate.
How it works
vLLM stacks several innovations:
- [[page-attention|PagedAttention]]: paged management of the KV cache, inspired by OS virtual memory. Reduces GPU memory fragmentation and allows memory sharing across requests.
- Continuous batching: no fixed batch; requests enter and leave the batch on the fly, maximizing throughput.
- Native [[../01-llms/kv-cache|KV Cache]]: handled automatically across requests, with prefix sharing.
- Runtime [[../10-fine-tuning/lora-adapters|LoRA adapters]]: hot load/unload of adapters without redeploying. Enables multi-tenant serving with per-use-case specialization.
- Quantization support: AWQ, GPTQ, FP8, INT4...
- OpenAI-compatible API: drop-in replacement, easy to integrate.
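The core PagedAttention idea can be sketched as a block allocator: KV-cache memory is carved into fixed-size blocks ("pages"), and each sequence keeps a block table mapping logical token positions to physical blocks, so no contiguous reservation is needed and memory is reclaimed the instant a request finishes. A toy sketch of the idea only; vLLM's real allocator lives in C++/CUDA and is far more involved (the class and method names here are illustrative, not vLLM's API):

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class PagedKVAllocator:
    """Toy paged KV-cache allocator: block tables over a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical blocks

    def append_token(self, seq_id: str, pos: int) -> int:
        """Ensure a physical block backs logical position `pos`; return it."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = pos // BLOCK_SIZE
        if logical_block == len(table):  # sequence grew past its last block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted -> request would be preempted")
            table.append(self.free_blocks.pop())
        return table[logical_block]

    def free(self, seq_id: str) -> None:
        """Request finished: return its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = PagedKVAllocator(num_blocks=4)
for pos in range(20):  # 20 tokens -> ceil(20/16) = 2 blocks, no over-reservation
    alloc.append_token("req-A", pos)
print(len(alloc.block_tables["req-A"]))  # 2
alloc.free("req-A")
print(len(alloc.free_blocks))  # 4, all blocks reclaimed
```

Contrast this with naive serving, which reserves a contiguous max-length KV buffer per request up front: the paged scheme wastes at most one partially filled block per sequence.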
Relevance today (2026)
vLLM is THE reference in 2026, but the landscape is moving:
- SGLang is rising fast in 2025-2026; its radix-tree prefix caching (RadixAttention) can beat vLLM on cache-heavy workloads
- TensorRT-LLM (NVIDIA): better raw performance on H100/H200, but less portable
- vLLM v1 (2024-2025): architectural rewrite, faster and less buggy
- MLX (Apple Silicon): for local Mac use, not a production competitor
Honest question: will vLLM still be the reference in 2027?
- For Kubernetes multi-GPU stacks: yes (ecosystem, CNCF talks, integrations)
- For single-tenant ultra-optimized setups: SGLang and TensorRT-LLM may take over
Where vLLM shines:
- Multi-tenant (multiple users, multiple LoRAs)
- K8s scale with inference gateway
- Need for an OpenAI-compatible API
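Since the API is OpenAI-compatible, any standard HTTP client works unchanged. A minimal sketch, assuming a local deployment on vLLM's default port and a hypothetical model name (substitute your own):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM serving locally on its default port

# Same request shape as the OpenAI Chat Completions API
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical: use your deployed model id
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 128,
}

def chat_completion(payload: dict) -> dict:
    """POST a chat completion request to the vLLM server and parse the JSON reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Only run against a live server
    print(chat_completion(payload)["choices"][0]["message"]["content"])
```

The official `openai` Python SDK also works by pointing `base_url` at the vLLM server, which is what makes it a drop-in replacement.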
Critical questions
- Why is continuous batching more efficient than classic batching? What's the tradeoff?
- What is PagedAttention, concretely? Why is it a game changer?
- If I have 10 LoRA adapters to serve, do I load them all on the GPU or at runtime? Tradeoff?
- What are the cases where vLLM is worse than a simpler server ?
- What is vLLM's memory overhead vs raw Transformers ?
- How does vLLM handle failover ? And graceful shutdown of a long request ?
- What about observability (Prometheus metrics, traces)? Is it production-grade?
Production pitfalls
- Cold start: loading a 7B model takes 30s-2min; anticipate it (warm pool of replicas)
- Memory config: the default `gpu_memory_utilization` is too aggressive and can OOM under load
- Mixed-precision traps: some architectures do not support FP8; verify before deploying
- LoRA adapters: you quickly hit the loaded-adapter limit if too many are active at the same time
- Upgrades: vLLM moves fast, breaking changes between versions. Pin your versions.
- Multi-GPU: tensor parallelism vs pipeline parallelism, pick based on the model
- Debugging: logs are not always clear under heavy load, integrate Langfuse/Arize properly
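A launch config sketch reflecting the pitfalls above (the flags exist in recent vLLM CLI versions, but verify them against your pinned release, since the CLI changes between versions; the model id and version pin are placeholders):

```shell
# Pin the version explicitly: vLLM has breaking changes between releases.
pip install "vllm==0.6.3"   # hypothetical pin; use the version you validated

# Lower gpu-memory-utilization below the default (0.90) to leave OOM headroom,
# cap max-model-len instead of taking the model's full context window,
# bound the number of LoRA adapters resident on GPU, and choose tensor
# parallelism explicitly for multi-GPU.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --enable-lora \
  --max-loras 4 \
  --tensor-parallel-size 2
```

Treat this as a starting point to load-test, not as tuned values.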
Alternatives / Comparisons
See vllm vs alternatives (to create via /compare vllm vs tgi vs sglang).
Quick:
- vLLM: solid default, community, ecosystem
- SGLang: pure perf, radix tree prefix cache, rising fast
- TensorRT-LLM: top NVIDIA perf, less portable, harder to deploy
- TGI (HuggingFace): stable, good HF integration, less buzz
- Ollama: homelab only, not production
- llama.cpp: edge / embedded, not a multi-tenant server
Mini-lab
[[labs/01-vllm-vs-ollama-benchmark/]] - deploy the same model on vLLM and Ollama, benchmark throughput and latency on 100 concurrent requests.
To create: /lab vllm.
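The concurrent part of the benchmark can be sketched with a small harness: fire N requests in parallel and collect per-request latency. `send_request` is a placeholder to replace with a real HTTP call against vLLM's or Ollama's API in the lab:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(send_request, n_requests: int = 100, concurrency: int = 16) -> dict:
    """Run `send_request` n_requests times across a thread pool; report stats."""
    start = time.perf_counter()

    def timed(i: int) -> float:
        t0 = time.perf_counter()
        send_request(i)  # placeholder: swap in a real API call
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_requests)))
    wall = time.perf_counter() - start
    return {
        "throughput_rps": n_requests / wall,
        "p50_s": latencies[len(latencies) // 2],
        "max_s": latencies[-1],
    }

# Dry run with a stub that sleeps 10 ms instead of calling a server:
stats = benchmark(lambda i: time.sleep(0.01), n_requests=32, concurrency=8)
print(stats)
```

For token-level throughput (tokens/s rather than requests/s), the lab version should also count tokens in each response body.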
Further reading
- vLLM paper (SOSP 2023): "Efficient Memory Management for Large Language Model Serving with PagedAttention"
- https://docs.vllm.ai
- Repo: https://github.com/vllm-project/vllm
- CNCF talks on vLLM + Kubernetes + inference gateway
- gpu autoscaling k8s: scaling vLLM on K8s with custom metrics