Prompt Engineering
02·Prompt Engineering·updated 2026-04-21

Structured Outputs

TL;DR

Structured outputs force a model to produce machine-readable content (JSON, XML, SQL, regex) that downstream code can parse. Four layers of control: prompting, post-processing, constrained sampling, and finetuning. In 2026, native API support (OpenAI structured outputs, Anthropic tool use, grammars in open-weight servers) has mostly solved this.

The historical problem

Many production tasks require machine-readable output:

  • Text-to-SQL: user asks "average revenue last 6 months", model must return valid SQL
  • Extracting data from documents: must output a JSON matching a specific schema
  • Agent tool calls: model must output a tool name and arguments a system can execute
  • Classification: output must be one of a fixed set of labels

Free-form LLM outputs break systems. A single missing bracket, a hallucinated field name, or a truncated JSON blob crashes the downstream parser. In early LLM production (2022-2023), teams spent huge effort on fragile retry loops and custom parsers.

How it works

Four layers of control

Huyen lays out the stack from lightest to heaviest:

1. Prompting

Instruct the model in plain language: "Output JSON with keys title and body. Do not include any other text."

Pros: zero setup cost; works on any model. Cons: no guarantee; a few percent of outputs will be malformed, and for some models and tasks more than that.

Models like Claude and GPT-4 are quite good at this now, but "quite good" is not "reliable".

2. Post-processing

The model outputs nearly-correct JSON. You fix predictable errors:

  • Missing closing bracket: append it
  • Trailing comma: remove it
  • Smart quotes instead of straight: replace

LinkedIn reported: defensive YAML parsing went from 90% to 99.99% validity. Cheap win.

Works when errors are few and predictable. Fails when the model's output is wildly off-schema.
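The three fixes above can be sketched as a single repair pass (a naive illustration: the bracket-appending step counts characters and ignores nesting order and brackets inside strings):

```python
import json
import re

def repair_json(raw: str) -> str:
    """Best-effort fixes for the predictable errors listed above."""
    s = raw.strip()
    # Smart quotes -> straight quotes
    s = s.replace("\u201c", '"').replace("\u201d", '"')
    # Trailing comma before a closing brace/bracket
    s = re.sub(r",\s*([}\]])", r"\1", s)
    # Append missing closers (naive: assumes simple, flat truncation)
    s += "}" * max(s.count("{") - s.count("}"), 0)
    s += "]" * max(s.count("[") - s.count("]"), 0)
    return s

broken = '{"title": \u201cHi\u201d, "tags": ["a", "b",]'
print(json.loads(repair_json(broken)))  # {'title': 'Hi', 'tags': ['a', 'b']}
```

A real deployment would log every repair so you can see which errors the model actually makes.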

Tip from Huyen: YAML is less verbose than JSON, uses fewer output tokens. LinkedIn went with YAML for this reason despite JSON being more common.

3. Constrained sampling (grammar-based)

Filter the model's logit vector at each decoding step to only allow tokens that satisfy a grammar. The model literally CANNOT emit an invalid JSON bracket.

How it works:

  1. Define a grammar (BNF, regex, JSON schema)
  2. At each step, compute logits
  3. Mask (set to -inf) logits for tokens that violate the grammar
  4. Sample from the remaining valid tokens
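The masking step can be sketched with a toy vocabulary. The `allowed` set here is hand-picked for one decoding step; a real grammar engine (Outlines, xgrammar) derives it from a compiled automaton at every step:

```python
import math
import random

vocab = ["{", "}", '"name"', ":", '"Ada"', "hello"]
logits = [2.0, 1.0, 1.5, 0.5, 0.3, 3.0]

def constrained_sample(logits: list[float], allowed: set[int]) -> int:
    """Mask disallowed tokens to -inf, softmax, sample the rest."""
    masked = [x if i in allowed else float("-inf")
              for i, x in enumerate(logits)]
    m = max(masked)
    weights = [math.exp(x - m) for x in masked]  # exp(-inf) == 0.0
    return random.choices(range(len(logits)), weights=weights)[0]

# At the first step of a JSON object, only "{" is grammatically valid:
token = constrained_sample(logits, allowed={0})
print(vocab[token])  # "{" -- "hello" can never be emitted here
```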

Tools that implement this:

  • Outlines (open-source, most popular)
  • Guidance (Microsoft, older)
  • instructor (Python library on top of OpenAI)
  • llama.cpp (built-in grammar support)
  • xgrammar (fastest as of 2025, used in SGLang/vLLM)

Cons: overhead (grammar check adds latency), only supports formats with defined grammars.

Some researchers argue constrained sampling is a hack. The compute would be better spent training models to follow instructions. But empirically, constrained sampling works NOW and is widely used.

4. Finetuning

Fine-tune the model on examples of desired format. Most general solution, works for any output format.

Special case: feature-based transfer for classification. Replace the final layer with a classifier head that can only output one of N classes. Guarantees class membership.

Expensive but reliable. Used when constrained sampling is insufficient (complex domain-specific formats).
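The classifier-head idea in miniature (the labels, hidden state, and weights below are made up; the point is that the output space is the label set by construction):

```python
import random

LABELS = ["positive", "negative", "neutral"]

def classify(hidden: list[float], head: list[list[float]]) -> str:
    """A linear head maps the final hidden state to exactly len(LABELS)
    logits; argmax over them guarantees class membership."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in head]
    return LABELS[max(range(len(logits)), key=logits.__getitem__)]

# Hypothetical 4-dim hidden state and a randomly initialized 3x4 head
hidden = [0.2, -1.0, 0.5, 0.7]
head = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
print(classify(hidden, head))  # always one of LABELS, never free text
```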

Relevance today (2026)

Huyen's framework still applies. Major changes since Dec 2024:

1. Native "structured outputs" mode is now default

OpenAI launched Structured Outputs (Aug 2024) with 100% schema compliance guarantee. Anthropic's tool use has similar behavior. Google Gemini has function calling.

Usage:

# OpenAI structured outputs (Python SDK)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "my_schema",   # required by the API
            "strict": True,
            "schema": {...},       # your JSON Schema object
        },
    },
)

Under the hood, these use constrained sampling. You just do not see it.

Practical impact: for common cases (JSON with schema), you no longer need Outlines or instructor. Native API does it. Teams still use external tools for custom grammars (SQL, DSL, regex).

2. Tool use / function calling is the agent primitive

In 2026, structured output has merged with tool use / function calling. When a model "calls a tool", it outputs structured JSON matching the tool's schema. MCP (Model Context Protocol) standardized this even further.

See [[../05-ai-agents/README]] and [[../06-mcp/README]].

3. Grammar support in open-weight servers

vLLM, SGLang, TGI all support grammar-constrained generation natively in 2026. Open-weight teams get constrained outputs as a serving-time feature, not a wrapper.

4. Pydantic-style DSLs everywhere

Instructor and similar libraries let you define output structure as Pydantic models. Library handles the JSON schema generation and parsing. Pattern is now standard:

from pydantic import BaseModel
import instructor
from openai import OpenAI

# instructor patches the client so it accepts response_model
client = instructor.from_openai(OpenAI())

class UserProfile(BaseModel):
    name: str
    age: int
    interests: list[str]

user = client.chat.completions.create(
    response_model=UserProfile,
    ...  # model, messages, etc.
)
# user is a validated UserProfile instance, not a raw string

5. Self-validation loops are cheap now

With cheaper inference (prices dropped 10x since 2024), you can afford a 2-model validation: generate with Model A, validate with Model B. Catches semantic errors the schema cannot.

Works well for "valid JSON that ALSO makes sense" tasks.
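A sketch of the loop with plain callables standing in for the two models (all names hypothetical): `generate` takes optional feedback from the previous round, and `validate` returns None on success or a critique string that steers the retry.

```python
def generate_then_validate(generate, validate, max_retries=2):
    """Generate with model A, validate with model B (here: callables)."""
    feedback = None
    for _ in range(max_retries + 1):
        candidate = generate(feedback)
        feedback = validate(candidate)
        if feedback is None:
            return candidate
    raise ValueError(f"validation kept failing: {feedback}")

# Stubbed run: a schema-valid but semantically wrong age gets caught
attempts = iter([{"name": "Ada", "age": -5}, {"name": "Ada", "age": 36}])
result = generate_then_validate(
    generate=lambda fb: next(attempts),
    validate=lambda c: None if c["age"] >= 0 else "age must be >= 0",
)
print(result)  # {'name': 'Ada', 'age': 36}
```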

6. XML is Claude's favorite

Anthropic documents that Claude prefers XML-like tags (<answer>, <reasoning>) over JSON for structured content within prose. For pure data, JSON is still standard.

Critical questions

  • Why does constrained sampling add latency? What is being computed per token?
  • If native structured outputs guarantee schema compliance, what can still go wrong? (Answer: semantic errors. "age": -5 is valid schema, invalid reality.)
  • XML vs JSON for structured outputs: what does Anthropic's data show?
  • Is constrained sampling doing the model's job wrong? Should models just be trained better?
  • For agents, should tool schemas be detailed (many fields) or minimal (few fields)? What's the tradeoff?

Production pitfalls

  • Early stopping breaks JSON. If max_tokens is too low, the JSON gets truncated mid-object. No closing bracket, unparseable. Set max_tokens generously for structured outputs.
  • Prompting + no schema validation. Relying on the model "usually outputting JSON" will fail 2% of the time. Over 10M requests, that is 200K broken outputs.
  • Overly complex schemas. Deeply nested or 50-field schemas confuse the model. Simpler schemas = higher reliability. Break into multiple calls if needed.
  • Missing default values. A schema with 20 optional fields often has the model hallucinate plausible defaults. Mark required vs optional explicitly.
  • Trusting the values, not just the structure. Structured output guarantees shape, not correctness. "age": 150 is valid schema, probably a hallucination.
  • YAML trap. Whitespace-sensitive, easy to break with subtle formatting issues. Use JSON unless you have a strong reason (LinkedIn's verbosity argument).
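The truncation pitfall is easy to reproduce: a reply cut off mid-string by max_tokens cannot be parsed at all.

```python
import json

# A reply cut off by max_tokens: no closing quote or brace survives
truncated = '{"title": "Breaking news", "body": "The quick bro'
try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    print("unparseable:", err.msg)
```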

Alternatives / Comparisons

| Layer | Effort | Reliability | When to use |
|---|---|---|---|
| Prompting alone | Zero | 95-99% | Quick prototypes, small schemas |
| Prompt + post-processing | Low | 99-99.9% | LinkedIn-style fixes for common errors |
| Native structured outputs (API) | Low | ~100% schema | Default in 2026 for OpenAI, Anthropic |
| Constrained sampling (Outlines, xgrammar) | Medium | ~100% | Self-hosted, custom grammars |
| Fine-tuning on format | High | ~100% | Domain-specific output you cannot express as a grammar |
| Classifier head (feature-based transfer) | High | 100% | Classification with few classes |

Typical 2026 production config:

  1. Use native structured outputs if the API supports it
  2. Add Pydantic validation for semantic correctness
  3. Retry with a different prompt on validation failure
  4. Fall back to human review for critical fields

Mini-lab

labs/structured-outputs-compare/ (to create) - benchmark 5 approaches on the same task (e.g., extract structured metadata from a news article):

  1. Prompt only
  2. Prompt + post-processing
  3. OpenAI structured outputs API
  4. Outlines with grammar
  5. Instructor with Pydantic

Measure:

  • Schema compliance rate
  • Semantic accuracy
  • Latency
  • Cost per request

Stack: uv + OpenAI API + local Ollama model + Outlines + instructor.

Goal: build intuition for which layer to reach for in real projects.

Further reading

Tags: structured-output · json · function-calling · constrained-sampling · tool-use