Structured Outputs
TL;DR
Structured outputs force a model to produce machine-readable content (JSON, XML, SQL, regex) that downstream code can parse. Four layers of control: prompting, post-processing, constrained sampling, and finetuning. In 2026, native API support (OpenAI structured outputs, Anthropic tool use, grammars in open-weight servers) has mostly solved this.
The historical problem
Many production tasks require machine-readable output:
- Text-to-SQL: user asks "average revenue last 6 months", model must return valid SQL
- Extracting data from documents: must output a JSON matching a specific schema
- Agent tool calls: model must output a tool name and arguments a system can execute
- Classification: output must be one of a fixed set of labels
Free-form LLM outputs break systems. A single missing bracket, a hallucinated field name, or a truncated JSON blob crashes the downstream parser. In early LLM production (2022-2023), teams spent huge effort on fragile retry loops and custom parsers.
How it works
Four layers of control
Huyen lays out the stack from lightest to heaviest:
1. Prompting
Instruct the model in plain language: "Output JSON with keys title and body. Do not include any other text."
Pros: zero setup cost, works on any model. Cons: no guarantee. A few percent of outputs will be malformed. For some models and tasks, more.
Models like Claude and GPT-4 are quite good at this now, but "quite good" is not "reliable".
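A minimal sketch of the prompt-only layer: the instruction goes in the prompt, and a try/except around parsing is the only safeguard (the model reply strings below are illustrative):

```python
import json

PROMPT = "Output JSON with keys title and body. Do not include any other text."

def parse_json_or_none(raw: str):
    """Layer 1 gives no guarantee: parsing is the only check we have."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None  # caller must retry or fall back

# A well-behaved reply parses...
print(parse_json_or_none('{"title": "Hi", "body": "Text"}'))
# ...but a chatty reply wrapped in prose does not.
print(parse_json_or_none('Sure! Here is the JSON: {"title": "Hi"}'))  # None
```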
2. Post-processing
The model outputs nearly-correct JSON. You fix predictable errors:
- Missing closing bracket: append it
- Trailing comma: remove it
- Smart quotes instead of straight: replace
LinkedIn reported: defensive YAML parsing went from 90% to 99.99% validity. Cheap win.
Works when errors are few and predictable. Fails when the model's output is wildly off-schema.
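The fixes listed above can be sketched as a small repair pass (illustrative only, not LinkedIn's actual code; note the brace counting is naive and miscounts braces inside strings):

```python
import json
import re

def repair_json(raw: str) -> str:
    """Apply the cheap, predictable fixes before parsing."""
    s = raw.strip()
    # Smart quotes -> straight quotes
    s = s.replace("\u201c", '"').replace("\u201d", '"')
    # Missing closing braces: append as many as are unbalanced
    s += "}" * max(0, s.count("{") - s.count("}"))
    # Trailing commas before a closing brace/bracket
    s = re.sub(r",\s*([}\]])", r"\1", s)
    return s

broken = '{"title": "Hi", "tags": ["a", "b",],'
print(json.loads(repair_json(broken)))  # {'title': 'Hi', 'tags': ['a', 'b']}
```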
Tip from Huyen: YAML is less verbose than JSON, uses fewer output tokens. LinkedIn went with YAML for this reason despite JSON being more common.
3. Constrained sampling (grammar-based)
Filter the model's logit vector at each decoding step to only allow tokens that satisfy a grammar. The model literally CANNOT emit an invalid JSON bracket.
How it works:
- Define a grammar (BNF, regex, JSON schema)
- At each step, compute logits
- Mask (set to -inf) logits for tokens that violate the grammar
- Sample from the remaining valid tokens
Tools that implement this:
- Outlines (open source, most popular)
- Guidance (Microsoft, older)
- instructor (Python library on top of OpenAI)
- llama.cpp (built-in grammar support)
- xgrammar (fastest as of 2025, used in SGLang/vLLM)
Cons: overhead (grammar check adds latency), only supports formats with defined grammars.
Some researchers argue constrained sampling is a hack. The compute would be better spent training models to follow instructions. But empirically, constrained sampling works NOW and is widely used.
4. Finetuning
Fine-tune the model on examples of the desired format. This is the most general solution: it works for any output format.
Special case: feature-based transfer for classification. Replace the final layer with a classifier head that can only output one of N classes. Guarantees class membership.
Expensive but reliable. Used when constrained sampling is insufficient (complex domain-specific formats).
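The classifier-head special case can be shown in a pure-Python sketch: whatever the hidden state is, an argmax over N class logits can only ever return one of the N labels (the weights here are random stand-ins, not a trained model):

```python
import random

LABELS = ["positive", "negative", "neutral"]  # fixed N=3 classes
HIDDEN = 8                                    # toy hidden size

random.seed(0)
# Classifier head: an N x HIDDEN weight matrix replacing the LM head
W = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in LABELS]

def classify(hidden_state: list[float]) -> str:
    logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in W]
    # argmax over N logits: the output is a valid label by construction
    return LABELS[max(range(len(LABELS)), key=lambda i: logits[i])]

h = [random.gauss(0, 1) for _ in range(HIDDEN)]
assert classify(h) in LABELS  # always true; no invalid class is possible
```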
Relevance today (2026)
Huyen's framework still applies. Major changes since Dec 2024:
1. Native "structured outputs" mode is now default
OpenAI launched Structured Outputs (Aug 2024) with 100% schema compliance guarantee. Anthropic's tool use has similar behavior. Google Gemini has function calling.
Usage:
```python
# OpenAI structured outputs
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {...},
    },
)
```
Under the hood, these use constrained sampling. You just do not see it.
Practical impact: for common cases (JSON with schema), you no longer need Outlines or instructor. Native API does it. Teams still use external tools for custom grammars (SQL, DSL, regex).
2. Tool use / function calling is the agent primitive
In 2026, structured output has merged with tool use / function calling. When a model "calls a tool", it outputs structured JSON matching the tool's schema. MCP (Model Context Protocol) standardized this even further.
See [[../05-ai-agents/README]] and [[../06-mcp/README]].
3. Grammar support in open-weight servers
vLLM, SGLang, TGI all support grammar-constrained generation natively in 2026. Open-weight teams get constrained outputs as a serving-time feature, not a wrapper.
4. Pydantic-style DSLs everywhere
Instructor and similar libraries let you define output structure as Pydantic models. Library handles the JSON schema generation and parsing. Pattern is now standard:
```python
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    interests: list[str]

# client here is an instructor-patched OpenAI client
user = client.chat.completions.create(
    response_model=UserProfile,
    ...
)
```
5. Self-validation loops are cheap now
With cheaper inference (prices dropped 10x since 2024), you can afford a 2-model validation: generate with Model A, validate with Model B. Catches semantic errors the schema cannot.
Works well for "valid JSON that ALSO makes sense" tasks.
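The generate-then-validate loop can be sketched with the two models passed in as callables (`generate` and `validate` below are placeholders for real API calls; the stubs only demonstrate the retry logic):

```python
def generate_validated(generate, validate, max_retries: int = 3):
    """Model A generates; Model B (or a rule) judges semantic sanity.
    Schema-valid but nonsensical outputs get retried."""
    last = None
    for _ in range(max_retries):
        last = generate()
        if validate(last):
            return last
    raise ValueError(f"no valid output after {max_retries} tries: {last!r}")

# Stub demo: the "generator" yields a bad record first, then a good one.
outputs = iter([{"age": -5}, {"age": 34}])
result = generate_validated(
    generate=lambda: next(outputs),
    validate=lambda r: 0 <= r["age"] <= 120,  # Model B stand-in
)
print(result)  # {'age': 34}
```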
6. XML is Claude's favorite
Anthropic documents that Claude prefers XML-like tags (<answer>, <reasoning>) over JSON for structured content within prose. For pure data, JSON is still standard.
Critical questions
- Why does constrained sampling add latency? What is being computed per token?
- If native structured outputs guarantee schema compliance, what can still go wrong? (Answer: semantic errors. "age": -5 is valid schema, invalid reality.)
- XML vs JSON for structured outputs: what does Anthropic's data show?
- Is constrained sampling doing the model's job wrong? Should models just be trained better?
- For agents, should tool schemas be detailed (many fields) or minimal (few fields)? What's the tradeoff?
Production pitfalls
- Early stopping breaks JSON. If `max_tokens` is too low, the JSON gets truncated mid-object: no closing bracket, unparseable. Set `max_tokens` generously for structured outputs.
- Prompting + no schema validation. Relying on the model "usually outputting JSON" will fail 2% of the time. Over 10M requests, that is 200K broken outputs.
- Overly complex schemas. Deeply nested or 50-field schemas confuse the model. Simpler schemas = higher reliability. Break into multiple calls if needed.
- Missing default values. A schema with 20 optional fields often has the model hallucinate plausible defaults. Mark required vs optional explicitly.
- Trusting the values, not just the structure. Structured output guarantees shape, not correctness. "age": 150 is valid schema, probably a hallucination.
- YAML trap. Whitespace-sensitive, easy to break with subtle formatting issues. Use JSON unless you have a strong reason (LinkedIn's verbosity argument).
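The value-level checks behind the last two pitfalls can be sketched with stdlib Python (field names and ranges are illustrative; in practice this layer is usually a Pydantic model with constrained fields):

```python
def semantic_errors(record: dict) -> list[str]:
    """A record can be schema-valid while its values are nonsense;
    check required fields and plausible ranges explicitly."""
    errors = []
    if "age" not in record:
        errors.append("missing required field: age")
    elif not 0 <= record["age"] <= 120:
        errors.append(f"implausible age: {record['age']}")
    return errors

print(semantic_errors({"age": 150}))  # ['implausible age: 150']
print(semantic_errors({"age": 34}))   # []
```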
Alternatives / Comparisons
| Layer | Effort | Reliability | When to use |
|---|---|---|---|
| Prompting alone | Zero | 95-99% | Quick prototypes, small schemas |
| Prompt + post-processing | Low | 99-99.9% | LinkedIn-style fix common errors |
| Native structured outputs (API) | Low | ~100% schema | Default in 2026 for OpenAI, Anthropic |
| Constrained sampling (Outlines, xgrammar) | Medium | ~100% | Self-hosted, custom grammar |
| Fine-tuning on format | High | ~100% | Domain-specific output you cannot express as grammar |
| Classifier head (feature-based transfer) | High | 100% | Classification with few classes |
Typical 2026 production config:
- Use native structured outputs if the API supports it
- Add Pydantic validation for semantic correctness
- Retry with a different prompt on validation failure
- Fall back to human review for critical fields
Mini-lab
labs/structured-outputs-compare/ (to create) - benchmark 5 approaches on the same task (e.g., extract structured metadata from a news article):
- Prompt only
- Prompt + post-processing
- OpenAI structured outputs API
- Outlines with grammar
- Instructor with Pydantic
Measure:
- Schema compliance rate
- Semantic accuracy
- Latency
- Cost per request
Stack: uv + OpenAI API + local Ollama model + Outlines + instructor.
Goal: build intuition for which layer to reach for in real projects.
Further reading
- OpenAI Structured Outputs announcement (Aug 2024): https://openai.com/index/introducing-structured-outputs-in-the-api/
- Outlines documentation: https://dottxt-ai.github.io/outlines/
- Instructor library: https://python.useinstructor.com/
- Anthropic tool use docs: https://docs.anthropic.com/en/docs/tool-use
- xgrammar paper (2024) - fastest grammar-constrained sampling
- Huyen, Chapter 2 section on Structured Outputs - the version this notion extends
- Brandon T. Willard, "Efficient Guided Generation for LLMs" (Outlines creator, 2023)