Foundations
00·Foundations·updated 2026-04-21

Foundation Models

TL;DR

A foundation model is a large, general-purpose model trained on vast data with self-supervision that can be adapted to many tasks. The word "foundation" captures both their importance and the fact that you build applications on top of them. Covers both LLMs (text-only) and LMMs (multimodal).

The historical problem

For decades, AI research was split by data modality:

  • NLP for text (translation, spam detection)
  • Computer vision for images (object detection, classification)
  • Audio models for speech (STT, TTS)

Each modality had its own architectures, its own teams, its own benchmarks. A spam classifier could not translate. A translator could not recognize cats.

Worse, each task needed its own model. One model for sentiment analysis. Another for translation. Another for summarization. You trained from scratch, or close to it, every time. For most companies this was too expensive.

The shift to foundation models broke both barriers at once: one model, many modalities, many tasks.

How it works

From LLM to foundation model

A language model processes text tokens. A multimodal model processes more than one data modality (text + image, text + audio, text + video).

A generative multimodal model is called a Large Multimodal Model (LMM). It generates the next token conditioned on text AND image tokens (or whatever modalities it supports). GPT-4V, Claude 3, and Gemini are LMMs.
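At the application level, "conditioned on text AND image tokens" shows up as a single message that interleaves both modalities. A minimal sketch, using the OpenAI-style content-parts request shape; the question and image URL are made-up placeholders, not recommendations:

```python
# Sketch: an LMM request interleaves text and image parts in one message.
# The content-parts shape follows the OpenAI Chat Completions format;
# the text and image URL below are illustrative placeholders.

def build_multimodal_message(text: str, image_url: str) -> dict:
    """Build one user message conditioning the model on text AND an image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = build_multimodal_message("What breed is this cat?",
                               "https://example.com/cat.jpg")
# The model then generates next tokens conditioned on both parts.
```

The key point is that both parts land in one context; the model does not treat the image as a separate request.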

Huyen uses "foundation models" to mean both LLMs and LMMs.

Self-supervision scales to multimodal

The same trick that made LLMs work also works for multimodal data. OpenAI trained CLIP (2021) on 400 million (image, caption) pairs scraped from the web. No manual labeling. Roughly 400x bigger than ImageNet.

CLIP is not generative. It is an embedding model: it produces joint embeddings for text and image that can be compared. CLIP embeddings are the backbone of generative multimodal models like Flamingo, LLaVA, and Gemini.
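"Joint embeddings that can be compared" boils down to this: text and images map into the same vector space, so cross-modal matching is just cosine similarity. A toy sketch with tiny made-up vectors standing in for real CLIP embeddings (which are 512+ dimensions):

```python
# Sketch of how CLIP-style joint embeddings are used: because text and
# image live in the SAME vector space, cross-modal similarity is plain
# cosine similarity. Vectors here are tiny made-up stand-ins, not real
# CLIP outputs.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

image_emb = [0.9, 0.1, 0.3]                 # pretend CLIP image embedding
captions = {
    "a photo of a cat": [0.8, 0.2, 0.3],    # pretend text embeddings
    "a photo of a dog": [0.1, 0.9, 0.2],
}
# Zero-shot classification: pick the caption closest to the image.
best = max(captions, key=lambda c: cosine(image_emb, captions[c]))
# best == "a photo of a cat"
```

This is why CLIP works as a backbone without being generative: it scores alignment between modalities instead of generating tokens.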

Task-specific -> general-purpose

Foundation models move AI from task-specific models (one model per task) to general-purpose models (one model, many tasks out of the box).

An LLM can do sentiment analysis AND translation AND summarization without retraining. You adapt it via three techniques:

  1. Prompt engineering - give it detailed instructions and examples, weights unchanged. See 02-prompt-engineering/.
  2. RAG (Retrieval-Augmented Generation) - connect it to a database of knowledge it can pull from. See 03-rag/.
  3. Finetuning - continue training on your specific data, weights change. See 10-fine-tuning/.
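The cheapest of the three is worth seeing concretely. A minimal sketch of technique 1, prompt engineering: instructions plus a few examples, weights untouched. The task, labels, and reviews are invented for illustration:

```python
# Sketch of technique 1 (prompt engineering): adapt the model with
# instructions + a few examples in the prompt, no weight changes.
# The sentiment task and example reviews are invented for illustration.

def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in examples:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")     # the model completes from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("Great battery life.", "positive"), ("Broke after a week.", "negative")],
    "Works exactly as advertised.",
)
```

RAG and finetuning follow the same pattern of "adapt, don't retrain from scratch" — they just move the adaptation into retrieved context and into the weights, respectively.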

Economics shift

Building a task-specific model from scratch: 1M+ labeled examples, 6 months, specialized team. Adapting a foundation model: 10-100 examples, one weekend, generalist engineer.

This is the economic reason AI engineering exists as a discipline.

Relevance today (2026)

Huyen's 2024 take is mostly still true but sharper today:

  • Foundation is now the default. In 2024, Huyen still explained "why foundation models matter". In 2026, every new frontier model is multimodal from day one. The LLM/LMM distinction is fading. You pick a model, you pick what it accepts.
  • Open-weights are competitive. Huyen's examples (GPT-4, Claude 3, Gemini) are closed. In 2026, Llama 4, DeepSeek-V3, Qwen 3, Mistral Large 2 are within striking distance for many tasks. For AI engineers, open-weights means self-hosting is realistic again (see 07-llm-optimization/vllm).
  • Small foundation models are real. Phi-4, Llama 4 8B, Gemma 3 4B are good enough for many tasks and can run on laptops. "Foundation" no longer implies "huge and API-only".
  • Reasoning models are a new category. OpenAI o3, Claude Opus 4.5 thinking mode, DeepSeek R1, Gemini 2.5 Thinking. These are foundation models with test-time reasoning compute. Huyen does not really cover this; she will need a chapter 2.5 in edition 2.
  • Economics shifted again. Inference cost per million tokens dropped 10x since Dec 2024. Huyen's "cheaper than ever" is now "cheaper than you think".

Question: what does "foundation" even mean if every new model can do everything? The useful distinction in 2026 is probably between frontier models (GPT-5, Opus 4.5, Gemini 2 Ultra) and commodity models (small Llama, Qwen), not between LLM and LMM.

Critical questions

  • If your task is narrow (classify support tickets into 5 categories), is a foundation model the right tool or is a fine-tuned BERT-medium cheaper and faster?
  • Prompt engineering vs RAG vs finetuning: given the same task, how do you choose? What data do you have?
  • CLIP was 2021. Why do we still use it in 2026 as a backbone instead of replacing it with something newer?
  • If open-weights models are 90% as good as closed models, why do most startups still build on closed APIs?
  • What does "general-purpose" really mean when models still fail at arithmetic, long-horizon planning, and up-to-date facts?

Production pitfalls

  • Over-reliance on generalism. A foundation model gets 80% of a narrow task right. The last 20% often needs fine-tuning or heavy prompting. Budget for it.
  • Vendor lock-in on closed APIs. Moving from GPT-4 to Claude or Gemini is not a one-line change. Prompt behavior, tool-use format, output distribution all differ. Design prompts and eval suites to be model-portable.
  • Benchmark overfitting. A model that tops MMLU does not guarantee it's best for your use case. Always build a domain-specific eval set (see 08-evaluations/).
  • Cost explosion with multimodal. Image and video tokens are 10-100x more expensive per "unit" than text. Test cost assumptions before scaling.
  • Context length is not free. Just because a model accepts 1M tokens does not mean you should stuff it. Latency grows, cost grows, and quality often degrades past a certain point ("lost in the middle" effect).
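The last two pitfalls are easy to sanity-check with arithmetic before you ship. A back-of-envelope sketch; the per-million-token rates are placeholders, not any provider's real pricing:

```python
# Sketch: back-of-envelope cost check before stuffing a long context.
# The per-million-token rates are PLACEHOLDERS; look up your provider's
# real pricing. Image/video token counts also vary by provider and
# resolution, so test those assumptions the same way.

def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float = 3.0,        # placeholder rate
                 usd_per_m_output: float = 15.0) -> float:  # placeholder rate
    return (input_tokens * usd_per_m_input +
            output_tokens * usd_per_m_output) / 1_000_000

# A 1M-token stuffed context vs a 4k-token retrieved context, same output:
stuffed = request_cost(1_000_000, 500)
focused = request_cost(4_000, 500)
# Input cost alone differs by 250x (1M / 4k tokens), before any
# "lost in the middle" quality loss.
```

Running this kind of estimate per request class, before scaling, is the cheap way to catch the cost explosion early.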

Alternatives / Comparisons

| Approach | Cost to build | Best for | Downside |
| --- | --- | --- | --- |
| Build your own model | Huge (1M+ labeled examples, 6 months) | Very narrow, very large scale | Out of reach for 99% of teams |
| Adapt a foundation model | Low (10-100 examples, 1 weekend) | Most production apps | Generic behavior until adapted |
| Closed API (GPT/Claude/Gemini) | Zero infra | Fast time to market | Vendor lock-in, data privacy, cost at scale |
| Open-weights self-hosted | Infra investment | Privacy, latency, cost at scale | Ops overhead, slower than frontier |
| Small fine-tuned classifier | Medium | Narrow, repetitive tasks | Fragile when tasks drift |

Mini-lab

labs/foundation-model-adaptation/ (to create) - take one open-weights model (Llama 3 8B or Phi-4), run it locally with Ollama or vLLM, then adapt it to a narrow task with:

  1. Prompt engineering only
  2. RAG on a small corpus
  3. LoRA fine-tune on 50 examples

Measure quality, latency, cost for each. Goal: feel which adaptation technique fits which problem.
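The measurement loop for the lab can be tiny. A sketch of the harness, with `generate` as a stub standing in for a real Ollama/vLLM client call; the eval examples are invented:

```python
# Sketch of the lab's measurement loop: run each adaptation variant over
# the same eval set and record accuracy and latency. `generate` is a
# stub; replace it with a real call to your Ollama/vLLM endpoint.
import time

def generate(prompt: str) -> str:   # stub standing in for a model call
    return "positive"

def evaluate(variant_name: str, eval_set: list[tuple[str, str]]) -> dict:
    correct, latencies = 0, []
    for prompt, expected in eval_set:
        t0 = time.perf_counter()
        answer = generate(prompt)
        latencies.append(time.perf_counter() - t0)
        correct += int(answer.strip() == expected)
    return {
        "variant": variant_name,
        "accuracy": correct / len(eval_set),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

report = evaluate("prompt-only",
                  [("Great phone. Sentiment:", "positive"),
                   ("Awful phone. Sentiment:", "negative")])
```

Run the same `evaluate` over the prompt-only, RAG, and LoRA variants and the comparison falls out of one table of reports; add a per-request cost column from your provider's pricing to complete the picture.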

Further reading

  • Bommasani et al., "On the Opportunities and Risks of Foundation Models" (Stanford HAI, 2021) - coined the term
  • OpenAI, CLIP paper (Radford et al., 2021) - foundational multimodal backbone
  • Huyen, Chapter 2 - deeper on how foundation models are built
  • Anthropic Claude 4.x model cards - for the 2026 frontier
  • DeepSeek-V3 technical report (2024) - open-weights frontier comparison
foundation-models · LLM · LMM · multimodal · generative-ai · fundamentals