The AI Engineering Stack (3 Layers)
TL;DR
Every AI application runs on a 3-layer stack: application development (top), model development (middle), infrastructure (bottom). You typically start at the top and move down only when you need more control or performance.
The historical problem
When AI was synonymous with ML engineering, the job description was "train a model". One role did it all: collect data, architect the model, train, serve, monitor. As the tooling matured (AutoML, MLOps, feature stores, etc.), clear layers emerged, but they all still counted as "ML engineering".
Foundation models flipped the stack. Suddenly:
- You do not need to train (the model is a service)
- You need to adapt (prompt, RAG, fine-tune)
- The work shifted up the stack
Without a clear model of the stack, teams argued about scope, responsibilities, and tooling. Huyen's 3-layer model is a map to cut through this confusion.
How it works
The 3 layers
Huyen decomposes every AI application stack into 3 layers, top to bottom:
+--------------------------------------------+
| Application development (TOP)              |
| - Prompt engineering                       |
| - Context construction (RAG, agents)       |
| - Evaluation                               |
| - User interface                           |
+--------------------------------------------+
| Model development (MIDDLE)                 |
| - Modeling and training (pretrain)         |
| - Fine-tuning                              |
| - Dataset engineering                      |
| - Inference optimization (quant, distill)  |
| - Evaluation                               |
+--------------------------------------------+
| Infrastructure (BOTTOM)                    |
| - Model serving (vLLM, TGI, SGLang)        |
| - Compute management (GPU clusters)        |
| - Data management                          |
| - Monitoring and observability             |
+--------------------------------------------+
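The diagram can also be written down as a small lookup structure, which makes it easy to ask "which layer owns this responsibility?". The sketch below is only an illustration: the names `Layer`, `STACK`, and `layers_for` are invented here and do not come from Huyen's book; the responsibilities are copied from the diagram.

```python
from enum import Enum


class Layer(Enum):
    APPLICATION = "application development"
    MODEL = "model development"
    INFRASTRUCTURE = "infrastructure"


# Responsibilities copied from the diagram, ordered top to bottom.
STACK = {
    Layer.APPLICATION: [
        "prompt engineering",
        "context construction",
        "evaluation",
        "user interface",
    ],
    Layer.MODEL: [
        "modeling and training",
        "fine-tuning",
        "dataset engineering",
        "inference optimization",
        "evaluation",
    ],
    Layer.INFRASTRUCTURE: [
        "model serving",
        "compute management",
        "data management",
        "monitoring and observability",
    ],
}


def layers_for(responsibility: str) -> list[Layer]:
    """Return every layer that lists the given responsibility, top to bottom."""
    wanted = responsibility.strip().lower()
    return [layer for layer, items in STACK.items() if wanted in items]


# Evaluation is the one item that appears in two layers.
print(layers_for("evaluation"))
```

Looking up "evaluation" returns both the application and the model layer, which is the same point the diagram makes by listing it twice.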
Who works at which layer?
- Most AI engineers: application layer. This is where the hottest growth is (2023-2026). You do not need to train a model to ship great AI products.
- Some AI engineers: model layer. Fine-tuning, PEFT/LoRA, model merging, quantization. Requires more ML background.
- Few AI engineers: infrastructure. Usually platform teams or specialized companies (Anyscale, Modal, BentoML, RunPod).
Huyen's key observation (GitHub analysis, 2024)
In March 2024, she searched GitHub for AI-related repos with 500+ stars. Out of 920 repos:
- Application development and applications: fastest growth (2023 explosion post ChatGPT)
- Infrastructure: slowest growth
This makes sense: infrastructure problems (serving, monitoring, GPU management) are mostly the same as in traditional MLOps. What changed with foundation models is what you build on top of that infrastructure.
Start at the top, move down only when needed
The path Huyen recommends, in order:
- Ship with a foundation-model API (OpenAI, Anthropic) + good prompts -> fastest time to market
- Hit limits (cost, latency, quality, privacy) -> consider fine-tuning or RAG
- Hit deeper limits (scale, compliance, vendor risk) -> consider self-hosting with vLLM, quantization, custom ops
Do not start at the bottom unless you have a specific reason. Building your own inference stack for a prototype is almost always overkill in 2026.
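The three-step path can be read as a simple rule: pick the shallowest step that addresses the limits you have actually hit. A minimal sketch of that rule follows, assuming the limit names used in the list above; the function `next_step` and its return strings are hypothetical, and a real decision would weigh far more than a set of labels.

```python
# Limits named in the path above, grouped by how deep they push you.
SHALLOW_LIMITS = {"cost", "latency", "quality", "privacy"}
DEEP_LIMITS = {"scale", "compliance", "vendor risk"}


def next_step(limits_hit: set[str]) -> str:
    """Map the limits a team has hit to the shallowest step that addresses them."""
    unknown = limits_hit - SHALLOW_LIMITS - DEEP_LIMITS
    if unknown:
        raise ValueError(f"unrecognized limits: {sorted(unknown)}")
    if limits_hit & DEEP_LIMITS:
        return "consider self-hosting (vLLM, quantization, custom ops)"
    if limits_hit & SHALLOW_LIMITS:
        return "consider fine-tuning or RAG"
    # No limit hit yet: stay at the top of the stack.
    return "ship with a foundation-model API and good prompts"


print(next_step(set()))
print(next_step({"latency"}))
print(next_step({"cost", "compliance"}))
```

The default branch is the important one: with no concrete limit in hand, the function never recommends moving down.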
Relevance today (2026)
Huyen's 3-layer model still holds, but the layer boundaries are shifting:
- Application layer swallowed more. Agent frameworks (LangGraph, CrewAI, OpenAI Agents SDK, MCP-based tooling), eval platforms (Braintrust, Langfuse, LangSmith), and RAG frameworks and retrieval platforms (LlamaIndex, Ragie, Weaviate, Vectara) have matured. An AI engineer in 2026 composes these rather than building from scratch.
- Model layer democratized. PEFT, QLoRA, Unsloth, and Together/Replicate/Modal abstractions made fine-tuning a weekend project. The gap between "just prompt it" and "fine-tune" narrowed.
- Infrastructure mostly hidden. Modal, RunPod, SkyPilot, Together AI, and Fireworks hide GPU management. Most AI engineers never touch raw K8s. The exceptions: hyperscale (Anthropic, Google) and privacy-sensitive self-hosters.
- A 4th layer emerging: agents/tools/MCP. Between application dev and model dev, a new "agentic infrastructure" layer appeared (MCP servers, tool registries, agent orchestrators). Huyen does not include this explicitly; a second edition would probably need to.
- Observability is now first-class. In 2024 Huyen lumped monitoring into infrastructure. In 2026, LLM-specific observability (traces, token usage, cost attribution) is its own category; see 09-observability/.
Question: in 2026, does "model development" still make sense as a middle layer, or has it split between "closed-API adapters" (tiny) and "open-weights fine-tuners" (separate discipline)?
Critical questions
- If the infrastructure layer is "mostly the same", why do vLLM, SGLang, and TGI keep innovating? What is actually new?
- Can a startup ship a valuable AI product working only at the application layer? (Yes, most do.) When do you have to go deeper?
- Is MCP a new layer, a new protocol within the application layer, or an evolution of tool-use APIs?
- Huyen wrote in 2024 that model development requires "specialized ML knowledge". Is that still true in 2026 with Unsloth + LoRA + 1-line fine-tune APIs?
- How do you decide when to self-host vs use an API? What numbers should drive that decision (cost per token, QPS, latency P99, privacy)?
Production pitfalls
- Premature descent. Teams jump from "API works" to "let's self-host" before they have the traffic to justify it. As a rule of thumb, staying on the API is cheaper until your monthly inference bill reaches roughly $10K-$50K.
- Building what you should buy. Writing your own RAG pipeline from scratch in 2026 is almost always a mistake. Use LlamaIndex, Weaviate, or similar. See 03-rag/vector-stores.
- Missing the evaluation layer. Huyen lists eval in BOTH application and model dev. Teams often skip it in both. Without eval, you cannot improve anything systematically.
- Conflating model serving with inference optimization. Model serving (vLLM, TGI) is HOW you expose a model. Inference optimization (quantization, speculative decoding, KV cache) is HOW FAST it runs. Different knobs.
- Vendor layering trap. Stacking many SaaS layers (OpenAI + LangSmith + Pinecone + Braintrust + Vercel) can silently cost more than self-hosting. Check cost aggregation monthly.
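The cost-related pitfalls above (premature descent, vendor layering) both come down to one monthly comparison: the sum of all vendor bills against a rough self-hosting estimate. The sketch below shows that arithmetic only. Every number in it is made up for illustration, and the names `SelfHostEstimate`, `monthly_vendor_total`, and `cheaper_option` are hypothetical; a real estimate would also need utilization, redundancy, and migration cost.

```python
from dataclasses import dataclass


@dataclass
class SelfHostEstimate:
    gpu_hourly_usd: float        # assumed rental price per GPU per hour
    gpu_count: int               # GPUs kept running around the clock
    engineer_monthly_usd: float  # share of salaries spent operating the stack

    def monthly_usd(self, hours_per_month: float = 730.0) -> float:
        """GPU rental for the month plus the people cost of running it."""
        gpu_cost = self.gpu_hourly_usd * self.gpu_count * hours_per_month
        return gpu_cost + self.engineer_monthly_usd


def monthly_vendor_total(bills: dict[str, float]) -> float:
    """Aggregate every SaaS line item, not just the model API bill."""
    return sum(bills.values())


def cheaper_option(bills: dict[str, float], estimate: SelfHostEstimate) -> str:
    """Compare the aggregated vendor bill with the self-hosting estimate."""
    if monthly_vendor_total(bills) <= estimate.monthly_usd():
        return "stay on vendors"
    return "self-hosting may pay off"


# Illustrative figures only.
bills = {
    "model API": 6000.0,
    "tracing": 400.0,
    "vector DB": 700.0,
    "evals": 500.0,
    "hosting": 200.0,
}
estimate = SelfHostEstimate(gpu_hourly_usd=2.5, gpu_count=4, engineer_monthly_usd=15000.0)

print(monthly_vendor_total(bills), estimate.monthly_usd(), cheaper_option(bills, estimate))
```

Including the engineering cost in the estimate is deliberate: leaving it out is the usual way a self-hosting plan looks cheaper on paper than it turns out to be.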
Alternatives / Comparisons
Different companies structure their AI stack differently:
| Company archetype | Layer focus | Example |
|---|---|---|
| Startup, B2C AI app | Application layer only | Perplexity-like product on GPT/Claude API |
| Startup, AI-first SaaS | Application + some model dev (fine-tuning) | Cursor, Harvey, Glean |
| Enterprise, internal AI | Application + infra (self-hosted for privacy) | Bank chatbot on Llama 3 via vLLM |
| Model provider | All 3 layers, especially model dev | Anthropic, OpenAI, Mistral |
| Inference provider | Infrastructure | Together, Fireworks, Modal, Anyscale |
Your stack should match your competitive advantage, not be exhaustive.
Mini-lab
Take 3 AI products you know and map each to the 3-layer stack. Which layers does the team behind them touch? Which do they consume as a service?
Example candidates:
- A B2C chatbot (Perplexity, Character AI): usually application layer only, all models via API.
- A coding assistant (Cursor, Windsurf): application + some model dev (they fine-tune for code, route between models).
- A dev tools platform with LLMs (Sentry AI, Retool AI): application layer grafted onto an existing product, models via OpenAI/Anthropic API.
- An inference provider (Together, Fireworks, Modal): infrastructure layer, they hide GPU management so others can stay at higher layers.
- A frontier model lab (Anthropic, OpenAI, Mistral): all 3 layers, with most of the expertise at model development.
For each: write one sentence per layer stating what they own vs what they consume. After about two hours, you will have a clean vocabulary for discussing AI companies at interview level.
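If it helps to keep the exercise consistent across products, the "one sentence per layer" output can be generated from a tiny own/consume record. This is only a note-taking aid, sketched under the assumption that each layer is either owned or consumed; the function `describe` and the placeholder product name are hypothetical.

```python
LAYERS = ("application development", "model development", "infrastructure")


def describe(product: str, owns: set[str]) -> list[str]:
    """Produce one sentence per layer: owned in-house or consumed as a service."""
    unknown = owns - set(LAYERS)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return [
        f"{product} {'owns' if layer in owns else 'consumes'} the {layer} layer."
        for layer in LAYERS
    ]


# Placeholder product that works at the application layer only.
for sentence in describe("Example chatbot", {"application development"}):
    print(sentence)
```

Real products rarely split this cleanly (a team may own fine-tuning but consume pretraining), so treat the binary as a starting point and refine each sentence by hand.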
Further reading
- Huyen, Chapter 1 of "AI Engineering" (O'Reilly 2025, paid), "The AI Engineering Stack" section: https://www.oreilly.com/library/view/ai-engineering/9781098166298/
- Huyen, "Designing Machine Learning Systems" (O'Reilly 2022, paid) - deeper on infra patterns from traditional ML: https://www.oreilly.com/library/view/designing-machine-learning/9781098107956/
- Swyx, "The State of AI Engineering" - annual stack overview: https://www.latent.space/p/state-of-ai-engineering
- The MAD (Machine Learning, AI, Data) Landscape (Matt Turck, annual) - visual stack of tools: https://mad.firstmark.com/