Foundations
00·Foundations·updated 2026-04-21

The AI Engineering Stack (3 Layers)

Every AI application runs on a 3-layer stack: application development (top), model development (middle), infrastructure (bottom). You typically start at the top and move down only when you need more control or performance.

The historical problem

When AI was synonymous with ML engineering, the job description was "train a model". One role did it all: collect data, architect the model, train, serve, monitor. As the tooling matured (AutoML, MLOps, feature stores, etc.), clear layers emerged, but the work was still all "ML engineering".

Foundation models flipped the stack. Suddenly:

  • You do not need to train (the model is a service)
  • You need to adapt (prompt, RAG, fine-tune)
  • The work shifted up the stack

Without a clear model of the stack, teams argued about scope, responsibilities, and tooling. Huyen's 3-layer model is a map to cut through this confusion.

How it works

The 3 layers

Huyen decomposes every AI application stack into 3 layers, top to bottom:

+--------------------------------------------+
|  Application development (TOP)             |
|  - Prompt engineering                      |
|  - Context construction (RAG, agents)      |
|  - Evaluation                              |
|  - User interface                          |
+--------------------------------------------+
|  Model development (MIDDLE)                |
|  - Modeling and training (pretrain)        |
|  - Fine-tuning                             |
|  - Dataset engineering                     |
|  - Inference optimization (quant, distill) |
|  - Evaluation                              |
+--------------------------------------------+
|  Infrastructure (BOTTOM)                   |
|  - Model serving (vLLM, TGI, SGLang)       |
|  - Compute management (GPU clusters)       |
|  - Data management                         |
|  - Monitoring and observability            |
+--------------------------------------------+

Who works at which layer?

  • Most AI engineers: application layer. This is where the hottest growth is (2023-2026). You do not need to train a model to ship great AI products.
  • Some AI engineers: model layer. Fine-tuning, PEFT/LoRA, model merging, quantization. Requires more ML background.
  • Few AI engineers: infrastructure. Usually platform teams or specialized companies (Anyscale, Modal, BentoML, RunPod).

Huyen's key observation (GitHub analysis, 2024)

In March 2024, she searched GitHub for AI-related repos with 500+ stars. Out of 920 repos:

  • Application development and applications: fastest growth (2023 explosion, post-ChatGPT)
  • Infrastructure: slowest growth

This makes sense: infrastructure problems (serving, monitoring, GPU management) are mostly the same as traditional ML ops. What changed with foundation models is what you build ON TOP of it.

Start at the top, move down only when needed

The Huyen-recommended path:

  1. Ship with a foundation-model API (OpenAI, Anthropic) + good prompts -> fastest time to market
  2. Hit limits (cost, latency, quality, privacy) -> consider fine-tuning or RAG
  3. Hit deeper limits (scale, compliance, vendor risk) -> consider self-hosting with vLLM, quantization, custom ops

Do not start at the bottom unless you have a specific reason. Building your own inference stack for a prototype is almost always overkill in 2026.
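
The escalation path above can be sketched as a simple decision function. The thresholds below are illustrative assumptions (the $50K figure echoes the rough range in the pitfalls section), not Huyen's numbers:

```python
# Sketch of the "start at the top, descend only when needed" path.
# All thresholds are illustrative assumptions -- tune them to your product.

def next_step(monthly_api_cost_usd: float,
              p99_latency_ms: float,
              quality_ok: bool,
              privacy_ok: bool) -> str:
    """Suggest where to go next on the stack, given current pain points."""
    if not quality_ok:
        return "improve prompts, add RAG, or consider fine-tuning"
    if not privacy_ok:
        return "consider self-hosting open weights"
    if monthly_api_cost_usd > 50_000:   # rough break-even region, see pitfalls
        return "evaluate self-hosting (vLLM + quantization)"
    if p99_latency_ms > 2_000:
        return "try a faster model tier or an inference provider"
    return "stay on the API"

print(next_step(1_000, 400, True, True))  # -> stay on the API
```

The ordering encodes the same priority as the numbered list: quality and privacy force a move first; cost and latency only justify descent once they actually hurt.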

Relevance today (2026)

Huyen's 3-layer model still holds, but the layer boundaries are shifting:

  • Application layer swallowed more. Agent frameworks (LangGraph, CrewAI, OpenAI Agents SDK, MCP-based), eval platforms (Braintrust, Langfuse, LangSmith), and RAG frameworks (LlamaIndex, Ragie, Weaviate, Vectara) have matured. An AI engineer in 2026 composes these rather than building from scratch.
  • Model layer democratized. PEFT, QLoRA, unsloth, and Together/Replicate/Modal abstractions made fine-tuning a weekend project. The gap between "just prompt it" and "fine-tune" narrowed.
  • Infrastructure mostly hidden. Modal, RunPod, SkyPilot, Together AI, and Fireworks hide GPU management. Most AI engineers never touch raw K8s. The exceptions: hyperscale (Anthropic, Google) and privacy-sensitive self-hosters.
  • A 4th layer emerging: agents/tools/MCP. Between application dev and model dev, a new "agentic infrastructure" layer appeared (MCP servers, tool registries, agent orchestrators). Huyen does not include this explicitly; a second edition will need to.
  • Observability is now first-class. In 2024 Huyen lumped monitoring into infrastructure. In 2026, LLM-specific observability (traces, token usage, cost attribution) is its own category, see 09-observability/.

Question: in 2026, does "model development" still make sense as a middle layer, or has it split between "closed-API adapters" (tiny) and "open-weights fine-tuners" (separate discipline)?

Critical questions

  • If the infrastructure layer is "mostly the same", why do vLLM, SGLang, and TGI keep innovating? What is actually new?
  • Can a startup ship a valuable AI product working only at the application layer? (Yes, most do.) When do you have to go deeper?
  • Is MCP a new layer, a new protocol within the application layer, or an evolution of tool-use APIs?
  • Huyen wrote in 2024 that model development requires "specialized ML knowledge". Is that still true in 2026 with Unsloth + LoRA + 1-line fine-tune APIs?
  • How do you decide when to self-host vs use an API? What numbers should drive that decision (cost per token, QPS, latency P99, privacy)?
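
One way to ground the last question is to compute both monthly bills from your own numbers. A back-of-envelope sketch (all prices, volumes, and the ops-overhead figure are illustrative assumptions, not real quotes):

```python
# Back-of-envelope break-even: hosted API vs self-hosted GPU serving.
# Every number here is an illustrative assumption; plug in your own quotes.

def monthly_api_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

def monthly_selfhost_cost(gpu_hourly_usd: float, num_gpus: int, ops_overhead_usd: float) -> float:
    # ~730 hours/month; ops_overhead covers on-call, monitoring, upgrades.
    return gpu_hourly_usd * num_gpus * 730 + ops_overhead_usd

# Example: 2B tokens/month at $0.60 per million tokens, vs 4 GPUs at $2.50/h.
api = monthly_api_cost(2_000_000_000, 0.60)        # 1200.0
selfhost = monthly_selfhost_cost(2.50, 4, 5_000)   # 12300.0
print("API:", api, "Self-host:", selfhost)
```

At this volume the API wins by an order of magnitude; the crossover only appears at much higher token volume, sustained QPS, or a hard privacy requirement, which is exactly why cost per token, QPS, and P99 belong in the decision.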

Production pitfalls

  • Premature descent. Teams jump from "API works" to "let's self-host" before they have traffic to justify it. The API usually stays cheaper until you hit roughly $10K-$50K in monthly inference bills.
  • Building what you should buy. Writing your own RAG pipeline from scratch in 2026 is almost always a mistake. Use LlamaIndex, Weaviate, or similar. See 03-rag/vector-stores.
  • Missing the evaluation layer. Huyen lists eval in BOTH application and model dev. Teams often skip it in both. Without eval, you cannot improve anything systematically.
  • Conflating model serving with inference optimization. Model serving (vLLM, TGI) is HOW you expose a model. Inference optimization (quantization, speculative decoding, KV cache) is HOW FAST it runs. Different knobs.
  • Vendor layering trap. Stacking many SaaS layers (OpenAI + LangSmith + Pinecone + Braintrust + Vercel) can silently cost more than self-hosting. Check cost aggregation monthly.
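
The vendor-layering check is just disciplined addition: sum every SaaS line item before comparing against a self-hosted estimate. The vendor categories and dollar figures below are illustrative assumptions:

```python
# Vendor layering trap: total the WHOLE SaaS stack, not just the model API.
# Categories and figures are illustrative assumptions.
monthly_bills_usd = {
    "model API": 8_000,
    "observability/tracing": 1_200,
    "vector database": 900,
    "eval platform": 600,
    "hosting/edge": 400,
}

total = sum(monthly_bills_usd.values())
print(f"Total SaaS stack: ${total:,}/month")  # Total SaaS stack: $11,100/month
```

Compare this total, not the model-API line item alone, when deciding whether descending the stack would actually save money.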

Alternatives / Comparisons

Different companies structure their AI stack differently:

Company archetype        Layer focus                                      Example
Startup, B2C AI app      Application layer only                           Perplexity-like product on GPT/Claude API
Startup, AI-first SaaS   Application + some model dev (fine-tuning)       Cursor, Harvey, Glean
Enterprise, internal AI  Application + infra (self-hosted for privacy)    Bank chatbot on Llama 3 via vLLM
Model provider           All 3 layers, especially model dev               Anthropic, OpenAI, Mistral
Inference provider       Infrastructure                                   Together, Fireworks, Modal, Anyscale

Your stack should match your competitive advantage, not be exhaustive.

Mini-lab

Map your own AI projects to the 3-layer stack:

  • Torah Study AI: which layers are you touching? (Probably app + a bit of model dev for fine-tuning Gemini embeddings; no infra beyond Weaviate-as-a-service.)
  • Torah Timeline: app layer only?

Goal: get a clean mental model of what YOU build vs what you CONSUME. Useful for interview storytelling.
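
The build-vs-consume mapping above can be captured as data; the layer assignments below mirror the guesses in the bullets:

```python
# Mini-lab sketch: which layers does each project BUILD at vs CONSUME?
# Layer assignments mirror the guesses in the bullets above.
STACK = ("application", "model", "infrastructure")

projects = {
    "Torah Study AI": {"application", "model"},  # app + embedding fine-tuning
    "Torah Timeline": {"application"},           # app layer only
}

for name, built in projects.items():
    consumed = [layer for layer in STACK if layer not in built]
    print(f"{name}: build at {sorted(built)}, consume {consumed}")
```

Anything in the "consume" column is a vendor or open-source dependency, which is exactly the build-vs-consume story an interviewer wants to hear.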

Further reading

  • Huyen, Chapter 1 of AI Engineering, "The AI Engineering Stack" section
  • Huyen, "Designing Machine Learning Systems" (DMLS) - deeper on infra patterns from traditional ML
  • Swyx, "The State of AI Engineering" (latentspace.com) - annual stack overview
  • The MAD (Machine Learning, AI, Data) Landscape (Matt Turck, annual) - visual stack of tools
architecture · stack · layers · infrastructure · application-development