LLM Optimization
07·LLM Optimization·updated 2026-04-13

vLLM

TL;DR

High-performance LLM inference server, the de facto production reference in 2024-2026. Implements PagedAttention, continuous batching, automatic KV caching, and runtime LoRA adapters.

The historical problem

Naively serving an LLM with HuggingFace Transformers gives you:

  • No efficient KV cache between requests
  • No continuous batching (each request waits for the previous one to finish)
  • Fragmented GPU memory
  • Laughable throughput (1-10 tokens/s for a 7B)

In 2022, no open-source inference server was truly production-ready for LLMs, and alternatives such as DeepSpeed-Inference were too complex to deploy.

How it works

vLLM stacks several innovations:

  1. [[page-attention|PagedAttention]]: paged management of the KV cache (inspired by OS virtual memory). Reduces GPU memory fragmentation and allows memory sharing across requests.

  2. Continuous batching: no "fixed" batch; requests enter and leave the batch on the fly, maximizing throughput.

  3. Native [[../01-llms/kv-cache|KV Cache]]: handled automatically across requests, with prefix sharing.

  4. Runtime [[../10-fine-tuning/lora-adapters|LoRA adapters]]: hot-load/unload adapters without redeploying. Enables multi-tenant serving with per-use-case specialization.

  5. Quantization support: AWQ, GPTQ, FP8, INT4...

  6. OpenAI-compatible API: drop-in replacement, easy to integrate.
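The PagedAttention idea can be illustrated with a toy allocator (a sketch, not vLLM's actual implementation; vLLM's default block size is also 16 tokens, and real prefix sharing uses copy-on-write, omitted here):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class PagedKVCache:
    """Toy allocator: each sequence maps logical blocks to physical blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}    # seq_id -> [physical block ids]
        self.lengths = {}   # seq_id -> token count

    def fork(self, parent, child):
        # Prefix sharing: child reuses the parent's blocks (copy-on-write omitted).
        self.tables[child] = list(self.tables[parent])
        self.lengths[child] = self.lengths[parent]

    def append(self, seq_id, n_tokens=1):
        self.tables.setdefault(seq_id, [])
        self.lengths.setdefault(seq_id, 0)
        for _ in range(n_tokens):
            if self.lengths[seq_id] % BLOCK_SIZE == 0:
                # Allocate a physical block only when the current one is full:
                # no large contiguous preallocation, hence no fragmentation.
                self.tables[seq_id].append(self.free.pop())
            self.lengths[seq_id] += 1

cache = PagedKVCache(num_blocks=64)
cache.append("a", 40)   # 40 tokens -> ceil(40/16) = 3 blocks
cache.fork("a", "b")    # "b" shares all 3 blocks with "a"
cache.append("b", 10)   # 50 tokens -> 4 blocks, only 1 newly allocated
print(len(cache.tables["a"]), len(cache.tables["b"]), 64 - len(cache.free))
```

Naive serving would reserve max-sequence-length memory per request up front; here only 4 physical blocks are in use for two 40+ token sequences.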

Relevance today (2026)

vLLM is still THE reference in 2026, but the landscape is moving:

  • SGLang: rising fast in 2025-2026; its radix-tree prefix caching is more efficient than vLLM's
  • TensorRT-LLM (NVIDIA): better raw performance on H100/H200, but less portable
  • vLLM V1 (2024-2025): architectural rewrite, faster and less buggy
  • MLX (Apple Silicon): for local Macs, not a production competitor

Honest question: will vLLM stay the reference in 2027?

  • For Kubernetes multi-GPU stacks: yes (ecosystem, CNCF talks, integrations)
  • For single-tenant ultra-optimized setups: SGLang and TensorRT-LLM may take over

Where vLLM shines:

  • Multi-tenant (multiple users, multiple LoRAs)
  • K8s scale with inference gateway
  • Need for an OpenAI-compatible API
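The multi-tenant LoRA point can be sketched via the OpenAI-compatible API: assuming a server started with `vllm serve <base> --enable-lora --lora-modules billing=/adapters/billing support=/adapters/support` (adapter names and paths are hypothetical), each request selects its adapter through the `model` field:

```python
import json

# Build an OpenAI-style chat-completions payload for a vLLM server.
# `adapter` is either the base model name or a registered LoRA adapter name.
def chat_request(adapter: str, prompt: str) -> bytes:
    return json.dumps({
        "model": adapter,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()

# Two tenants, same base model, different adapters, same endpoint
# (POST http://<host>:8000/v1/chat/completions, host assumed).
body = chat_request("billing", "Explain invoice INV-42.")
print(json.loads(body)["model"])
```

The server keeps one copy of the base weights and swaps the small adapter matrices per request, which is what makes per-use-case specialization cheap.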

Critical questions

  • Why is continuous batching more efficient than classic batching? What is the tradeoff?
  • What is PagedAttention concretely? Why is it a game changer?
  • If I have 10 LoRA adapters to serve, do I load them all on GPU or at runtime? Tradeoff?
  • In which cases is vLLM worse than a simpler server?
  • What is vLLM's memory overhead vs raw Transformers?
  • How does vLLM handle failover? And graceful shutdown of a long-running request?
  • And observability (Prometheus metrics, traces)? Is it good enough in production?
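The continuous-batching question can be made concrete with a toy simulation: in static batching every batch waits for its longest sequence, while continuous batching backfills a freed slot immediately (simplified model, ignoring prefill and scheduling overheads):

```python
# 4 requests needing different numbers of decode steps; the server runs
# 2 sequences at a time, one decode step per time unit.
lengths = [10, 2, 2, 2]

# Static batching: batches of 2, each batch runs for its longest member.
static_steps = max(lengths[0], lengths[1]) + max(lengths[2], lengths[3])

# Continuous batching: a finished sequence is replaced immediately.
slots = [lengths[0], lengths[1]]  # decode steps remaining per running slot
queue = lengths[2:]
t = 0
while slots:
    step = min(slots)                       # run until the next sequence finishes
    t += step
    slots = [s - step for s in slots if s - step > 0]
    while queue and len(slots) < 2:         # backfill the freed slot
        slots.append(queue.pop(0))
continuous_steps = t
print(static_steps, continuous_steps)       # 12 vs 10
```

The tradeoff: short requests no longer wait behind long ones, at the cost of a more complex scheduler and per-step batch recomposition.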

Production pitfalls

  • Cold start: loading a 7B model takes 30s-2min; plan for it (warm pool of replicas)
  • Memory config: the default gpu_memory_utilization is too aggressive and can OOM under load
  • Mixed-precision traps: some architectures do not support FP8; verify before deploying
  • LoRA adapters: GPU memory limits are quickly exceeded if too many adapters are loaded at once
  • Upgrades: vLLM moves fast, with breaking changes between versions. Pin your versions.
  • Multi-GPU: tensor parallelism vs pipeline parallelism; pick based on the model
  • Debugging: logs are not always clear under heavy load; integrate Langfuse/Arize properly
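For the memory-config pitfall, a back-of-the-envelope KV-cache budget helps sanity-check gpu_memory_utilization. The figures below assume a Llama-2-7B-like shape (MHA: 32 layers, 32 KV heads, head dim 128, FP16) and a hypothetical 24 GiB card; GQA models need far less per token:

```python
# KV cache cost per token = K and V, for every layer and KV head.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 512 KiB

gpu_gib = 24                     # hypothetical 24 GiB card
weights_gib = 14                 # ~7B params in FP16
gpu_memory_utilization = 0.90    # fraction of VRAM vLLM is allowed to claim
kv_budget_gib = gpu_gib * gpu_memory_utilization - weights_gib
max_cached_tokens = int(kv_budget_gib * 2**30 / kv_bytes_per_token)
print(kv_bytes_per_token, max_cached_tokens)  # ~15.5k tokens across all requests
```

If that token budget is smaller than (expected concurrency × typical context length), either lower concurrency, quantize, or get more VRAM; pushing gpu_memory_utilization toward 1.0 just trades the shortfall for OOM risk.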

Alternatives / Comparisons

See vllm vs alternatives (to create via /compare vllm vs tgi vs sglang).

Quick:

  • vLLM: solid default, community, ecosystem
  • SGLang: pure perf, radix tree prefix cache, rising fast
  • TensorRT-LLM: top NVIDIA perf, less portable, harder to deploy
  • TGI (HuggingFace): stable, good HF integration, less buzz
  • Ollama: homelab only, not production
  • llama.cpp: edge / embedded, not a multi-tenant server

Mini-lab

[[labs/01-vllm-vs-ollama-benchmark/]] - deploy the same model on vLLM and Ollama, benchmark throughput and latency on 100 concurrent requests.

To create: /lab vllm.
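A minimal harness for that benchmark could look like this (the request function is a stub; swap in a real HTTP call to the vLLM or Ollama endpoint to run the actual comparison):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(i: int) -> float:
    """Stub for one inference call; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for the HTTP round trip to the server
    return time.perf_counter() - start

N = 100  # concurrent requests, as in the lab
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = sorted(pool.map(send_request, range(N)))
elapsed = time.perf_counter() - t0
print(f"{N / elapsed:.0f} req/s, p50 {latencies[N // 2] * 1000:.1f} ms, "
      f"p99 {latencies[int(N * 0.99)] * 1000:.1f} ms")
```

Measure throughput and tail latency together: continuous batching should show its advantage mostly at p99 under concurrency, not at p50.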

Further reading

  • vLLM paper (SOSP 2023): "Efficient Memory Management for Large Language Model Serving with PagedAttention"
  • https://docs.vllm.ai
  • Repo: https://github.com/vllm-project/vllm
  • CNCF talks on vLLM + Kubernetes + inference gateway
  • gpu autoscaling k8s: scaling vLLM on K8s with custom metrics
inference-server · optimization · production