Foundations
00·Foundations·updated 2026-04-21

The AI Engineering Stack (3 Layers)

Every AI application runs on a 3-layer stack: application development (top), model development (middle), infrastructure (bottom). You typically start at the top and move down only when you need more control or performance.

The historical problem

When AI was synonymous with ML engineering, the job description was "train a model". One role did it all: collect data, architect the model, train, serve, monitor. As the tooling matured (AutoML, MLOps, feature stores, etc.), clear layers emerged, but the work was still all "ML engineering".

Foundation models flipped the stack. Suddenly:

  • You do not need to train (the model is a service)
  • You need to adapt (prompt, RAG, fine-tune)
  • The work shifted up the stack

Without a clear model of the stack, teams argued about scope, responsibilities, and tooling. Huyen's 3-layer model is a map to cut through this confusion.

How it works

The 3 layers

Huyen decomposes every AI application stack into 3 layers, top to bottom:

+--------------------------------------------+
|  Application development (TOP)             |
|  - Prompt engineering                      |
|  - Context construction (RAG, agents)      |
|  - Evaluation                              |
|  - User interface                          |
+--------------------------------------------+
|  Model development (MIDDLE)                |
|  - Modeling and training (pretrain)        |
|  - Fine-tuning                             |
|  - Dataset engineering                     |
|  - Inference optimization (quant, distill) |
|  - Evaluation                              |
+--------------------------------------------+
|  Infrastructure (BOTTOM)                   |
|  - Model serving (vLLM, TGI, SGLang)       |
|  - Compute management (GPU clusters)       |
|  - Data management                         |
|  - Monitoring and observability            |
+--------------------------------------------+

Who works at which layer?

  • Most AI engineers: application layer. This is where the hottest growth is (2023-2026). You do not need to train a model to ship great AI products.
  • Some AI engineers: model layer. Fine-tuning, PEFT/LoRA, model merging, quantization. Requires more ML background.
  • Few AI engineers: infrastructure. Usually platform teams or specialized companies (Anyscale, Modal, BentoML, RunPod).

Huyen's key observation (GitHub analysis, 2024)

In March 2024, she searched GitHub for AI-related repos with 500+ stars. Out of 920 repos:

  • Application development and applications: fastest growth (2023 explosion, post-ChatGPT)
  • Infrastructure: slowest growth

This makes sense: infrastructure problems (serving, monitoring, GPU management) are mostly the same as traditional ML ops. What changed with foundation models is what you build ON TOP of it.

Start at the top, move down only when needed

The Huyen-recommended path:

  1. Ship with a foundation-model API (OpenAI, Anthropic) + good prompts -> fastest time to market
  2. Hit limits (cost, latency, quality, privacy) -> consider fine-tuning or RAG
  3. Hit deeper limits (scale, compliance, vendor risk) -> consider self-hosting with vLLM, quantization, custom ops

Do not start at the bottom unless you have a specific reason. Building your own inference stack for a prototype is almost always overkill in 2026.
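
The escalation path above can be sketched as a simple decision function. The thresholds below are illustrative assumptions (the $50K figure echoes the rough range in the pitfalls section), not Huyen's numbers:

```python
# Sketch of the "start at the top, descend only when needed" path.
# All thresholds are illustrative assumptions -- tune them to your product.

def next_step(monthly_api_cost_usd: float,
              p99_latency_ms: float,
              quality_ok: bool,
              privacy_ok: bool) -> str:
    """Suggest where to go next on the stack, given current pain points."""
    if not quality_ok:
        return "improve prompts, add RAG, or consider fine-tuning"
    if not privacy_ok:
        return "consider self-hosting open weights"
    if monthly_api_cost_usd > 50_000:   # rough break-even region, see pitfalls
        return "evaluate self-hosting (vLLM + quantization)"
    if p99_latency_ms > 2_000:
        return "try a faster model tier or an inference provider"
    return "stay on the API"

print(next_step(1_000, 400, True, True))  # -> stay on the API
```

The ordering encodes the same priority as the numbered list: quality and privacy force a move first; cost and latency only justify descent once they actually hurt.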

Relevance today (2026)

Huyen's 3-layer model still holds, but the layer boundaries are shifting:

  • Application layer swallowed more. Agent frameworks (LangGraph, CrewAI, OpenAI Agents SDK, MCP-based), eval platforms (Braintrust, Langfuse, LangSmith), and RAG frameworks (LlamaIndex, Ragie, Weaviate, Vectara) have matured. An AI engineer in 2026 composes these rather than building from scratch.
  • Model layer democratized. PEFT, QLoRA, unsloth, and Together/Replicate/Modal abstractions made fine-tuning a weekend project. The gap between "just prompt it" and "fine-tune" narrowed.
  • Infrastructure mostly hidden. Modal, RunPod, SkyPilot, Together AI, and Fireworks hide GPU management. Most AI engineers never touch raw K8s. The exceptions: hyperscale (Anthropic, Google) and privacy-sensitive self-hosters.
  • A 4th layer emerging: agents/tools/MCP. Between application dev and model dev, a new "agentic infrastructure" layer appeared (MCP servers, tool registries, agent orchestrators). Huyen does not include this explicitly; a second edition will need to.
  • Observability is now first-class. In 2024 Huyen lumped monitoring into infrastructure. In 2026, LLM-specific observability (traces, token usage, cost attribution) is its own category, see 09-observability/.

Question: in 2026, does "model development" still make sense as a middle layer, or has it split between "closed-API adapters" (tiny) and "open-weights fine-tuners" (separate discipline)?

Critical questions

  • If the infrastructure layer is "mostly the same", why do vLLM, SGLang, and TGI keep innovating? What is actually new?
  • Can a startup ship a valuable AI product working only at the application layer? (Yes, most do.) When do you have to go deeper?
  • Is MCP a new layer, a new protocol within the application layer, or an evolution of tool-use APIs?
  • Huyen wrote in 2024 that model development requires "specialized ML knowledge". Is that still true in 2026 with Unsloth + LoRA + 1-line fine-tune APIs?
  • How do you decide when to self-host vs use an API? What numbers should drive that decision (cost per token, QPS, latency P99, privacy)?
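
One way to ground the last question is to compute both monthly bills from your own numbers. A back-of-envelope sketch (all prices, volumes, and the ops-overhead figure are illustrative assumptions, not real quotes):

```python
# Back-of-envelope break-even: hosted API vs self-hosted GPU serving.
# Every number here is an illustrative assumption; plug in your own quotes.

def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def monthly_selfhost_cost(gpu_hourly_usd: float, num_gpus: int, ops_overhead_usd: float) -> float:
    # ~730 hours/month; ops_overhead covers on-call, monitoring, upgrades.
    return gpu_hourly_usd * num_gpus * 730 + ops_overhead_usd

# Example: 2B tokens/month at $0.60 per million tokens, vs 4 GPUs at $2.50/h.
api = monthly_api_cost(2_000_000_000, 0.60)        # 1200.0
selfhost = monthly_selfhost_cost(2.50, 4, 5_000)   # 12300.0
print("API:", api, "Self-host:", selfhost)
```

At this volume the API wins by an order of magnitude; the crossover only appears at much higher token volume, sustained QPS, or a hard privacy requirement, which is exactly why cost per token, QPS, and P99 belong in the decision.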

Production pitfalls

  • Premature descent. Teams jump from "API works" to "let's self-host" before they have traffic to justify it. The API usually stays cheaper until you hit roughly $10K-$50K in monthly inference bills.
  • Building what you should buy. Writing your own RAG pipeline from scratch in 2026 is almost always a mistake. Use LlamaIndex, Weaviate, or similar. See 03-rag/vector-stores.
  • Missing the evaluation layer. Huyen lists eval in BOTH application and model dev. Teams often skip it in both. Without eval, you cannot improve anything systematically.
  • Conflating model serving with inference optimization. Model serving (vLLM, TGI) is HOW you expose a model. Inference optimization (quantization, speculative decoding, KV cache) is HOW FAST it runs. Different knobs.
  • Vendor layering trap. Stacking many SaaS layers (OpenAI + LangSmith + Pinecone + Braintrust + Vercel) can silently cost more than self-hosting. Check cost aggregation monthly.
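
The vendor-layering check is just disciplined addition: sum every SaaS line item before comparing against a self-hosted estimate. The vendor categories and dollar figures below are illustrative assumptions:

```python
# Vendor layering trap: total the WHOLE SaaS stack, not just the model API.
# Categories and figures are illustrative assumptions.
monthly_bills_usd = {
    "model API": 8_000,
    "observability/tracing": 1_200,
    "vector database": 900,
    "eval platform": 600,
    "hosting/edge": 400,
}

total = sum(monthly_bills_usd.values())
print(f"Total SaaS stack: ${total:,}/month")  # Total SaaS stack: $11,100/month
```

Compare this total, not the model-API line item alone, when deciding whether descending the stack would actually save money.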

Alternatives / Comparisons

Different companies structure their AI stack differently:

Company archetype        Layer focus                                      Example
Startup, B2C AI app      Application layer only                           Perplexity-like product on GPT/Claude API
Startup, AI-first SaaS   Application + some model dev (fine-tuning)       Cursor, Harvey, Glean
Enterprise, internal AI  Application + infra (self-hosted for privacy)    Bank chatbot on Llama 3 via vLLM
Model provider           All 3 layers, especially model dev               Anthropic, OpenAI, Mistral
Inference provider       Infrastructure                                   Together, Fireworks, Modal, Anyscale

Your stack should match your competitive advantage, not be exhaustive.

Mini-lab

Map your own AI projects to the 3-layer stack:

  • Torah Study AI: which layers are you touching? (Probably app + a bit of model dev for fine-tuning Gemini embeddings; no infra beyond Weaviate-as-a-service.)
  • Torah Timeline: app layer only?

Goal: get a clean mental model of what YOU build vs what you CONSUME. Useful for interview storytelling.
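
The build-vs-consume mapping above can be captured as data; the layer assignments below mirror the guesses in the bullets:

```python
# Mini-lab sketch: which layers does each project BUILD at vs CONSUME?
# Layer assignments mirror the guesses in the bullets above.
STACK = ("application", "model", "infrastructure")

projects = {
    "Torah Study AI": {"application", "model"},  # app + embedding fine-tuning
    "Torah Timeline": {"application"},           # app layer only
}

for name, built in projects.items():
    consumed = [layer for layer in STACK if layer not in built]
    print(f"{name}: build at {sorted(built)}, consume {consumed}")
```

Anything in the "consume" column is a vendor or open-source dependency, which is exactly the build-vs-consume story an interviewer wants to hear.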

Further reading

  • Huyen, Chapter 1 of AI Engineering, "The AI Engineering Stack" section
  • Huyen, "Designing Machine Learning Systems" (DMLS) - deeper on infra patterns from traditional ML
  • Swyx, "The State of AI Engineering" (latentspace.com) - annual stack overview
  • The MAD (Machine Learning, AI, Data) Landscape (Matt Turck, annual) - visual stack of tools
architecture · stack · layers · infrastructure · application-development