Prompt Engineering
02·Prompt Engineering·updated 2026-04-21

Structured Outputs

TL;DR

Structured outputs force a model to produce machine-readable content (JSON, XML, SQL, regex) that downstream code can parse. Four layers of control: prompting, post-processing, constrained sampling, and finetuning. In 2026, native API support (OpenAI structured outputs, Anthropic tool use, grammars in open-weight servers) has mostly solved this.

The historical problem

Many production tasks require machine-readable output:

  • Text-to-SQL: user asks "average revenue last 6 months", model must return valid SQL
  • Extracting data from documents: must output a JSON matching a specific schema
  • Agent tool calls: model must output a tool name and arguments a system can execute
  • Classification: output must be one of a fixed set of labels

Free-form LLM outputs break systems. A single missing bracket, a hallucinated field name, or a truncated JSON blob crashes the downstream parser. In early LLM production (2022-2023), teams spent huge effort on fragile retry loops and custom parsers.

How it works

Four layers of control

Huyen lays out the stack from lightest to heaviest:

1. Prompting

Instruct the model in plain language: "Output JSON with keys title and body. Do not include any other text."

Pros: zero setup cost; works on any model. Cons: no guarantee; a few percent of outputs will be malformed, and for some models and tasks more than that.

Models like Claude and GPT-4 are quite good at this now, but "quite good" is not "reliable".

2. Post-processing

The model outputs nearly-correct JSON. You fix predictable errors:

  • Missing closing bracket: append it
  • Trailing comma: remove it
  • Smart quotes instead of straight: replace

LinkedIn reported: defensive YAML parsing went from 90% to 99.99% validity. Cheap win.

Works when errors are few and predictable. Fails when the model's output is wildly off-schema.
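The three fixes above can be sketched as a single repair pass (a naive illustration: the bracket-appending step counts characters and ignores nesting order and brackets inside strings):

```python
import json
import re

def repair_json(raw: str) -> str:
    """Best-effort fixes for the predictable errors listed above."""
    s = raw.strip()
    # Smart quotes -> straight quotes
    s = s.replace("\u201c", '"').replace("\u201d", '"')
    # Trailing comma before a closing brace/bracket
    s = re.sub(r",\s*([}\]])", r"\1", s)
    # Append missing closers (naive: assumes simple, flat truncation)
    s += "}" * max(s.count("{") - s.count("}"), 0)
    s += "]" * max(s.count("[") - s.count("]"), 0)
    return s

broken = '{"title": \u201cHi\u201d, "tags": ["a", "b",]'
print(json.loads(repair_json(broken)))  # {'title': 'Hi', 'tags': ['a', 'b']}
```

A real deployment would log every repair so you can see which errors the model actually makes.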

Tip from Huyen: YAML is less verbose than JSON, uses fewer output tokens. LinkedIn went with YAML for this reason despite JSON being more common.

3. Constrained sampling (grammar-based)

Filter the model's logit vector at each decoding step to only allow tokens that satisfy a grammar. The model literally CANNOT emit an invalid JSON bracket.

How it works:

  1. Define a grammar (BNF, regex, JSON schema)
  2. At each step, compute logits
  3. Mask (set to -inf) logits for tokens that violate the grammar
  4. Sample from the remaining valid tokens
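The masking step can be sketched with a toy vocabulary. The `allowed` set here is hand-picked for one decoding step; a real grammar engine (Outlines, xgrammar) derives it from a compiled automaton at every step:

```python
import math
import random

vocab = ["{", "}", '"name"', ":", '"Ada"', "hello"]
logits = [2.0, 1.0, 1.5, 0.5, 0.3, 3.0]

def constrained_sample(logits: list[float], allowed: set[int]) -> int:
    """Mask disallowed tokens to -inf, softmax, sample the rest."""
    masked = [x if i in allowed else float("-inf")
              for i, x in enumerate(logits)]
    m = max(masked)
    weights = [math.exp(x - m) for x in masked]  # exp(-inf) == 0.0
    return random.choices(range(len(logits)), weights=weights)[0]

# At the first step of a JSON object, only "{" is grammatically valid:
token = constrained_sample(logits, allowed={0})
print(vocab[token])  # "{" -- "hello" can never be emitted here
```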

Tools that implement this:

  • Outlines (open-source, most popular)
  • Guidance (Microsoft, older)
  • instructor (Python library on top of OpenAI)
  • llama.cpp (built-in grammar support)
  • xgrammar (fastest as of 2025, used in SGLang/vLLM)

Cons: overhead (grammar check adds latency), only supports formats with defined grammars.

Some researchers argue constrained sampling is a hack. The compute would be better spent training models to follow instructions. But empirically, constrained sampling works NOW and is widely used.

4. Finetuning

Fine-tune the model on examples of desired format. Most general solution, works for any output format.

Special case: feature-based transfer for classification. Replace the final layer with a classifier head that can only output one of N classes. Guarantees class membership.

Expensive but reliable. Used when constrained sampling is insufficient (complex domain-specific formats).
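The classifier-head idea in miniature (the labels, hidden state, and weights below are made up; the point is that the output space is the label set by construction):

```python
import random

LABELS = ["positive", "negative", "neutral"]

def classify(hidden: list[float], head: list[list[float]]) -> str:
    """A linear head maps the final hidden state to exactly len(LABELS)
    logits; argmax over them guarantees class membership."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in head]
    return LABELS[max(range(len(logits)), key=logits.__getitem__)]

# Hypothetical 4-dim hidden state and a randomly initialized 3x4 head
hidden = [0.2, -1.0, 0.5, 0.7]
head = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
print(classify(hidden, head))  # always one of LABELS, never free text
```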

Relevance today (2026)

Huyen's framework still applies. Major changes since Dec 2024:

1. Native "structured outputs" mode is now default

OpenAI launched Structured Outputs (Aug 2024) with 100% schema compliance guarantee. Anthropic's tool use has similar behavior. Google Gemini has function calling.

Usage:

# OpenAI structured outputs (Python SDK)
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "my_schema",   # required by the API
            "strict": True,
            "schema": {...},       # your JSON Schema object
        },
    },
)

Under the hood, these use constrained sampling. You just do not see it.

Practical impact: for common cases (JSON with schema), you no longer need Outlines or instructor. Native API does it. Teams still use external tools for custom grammars (SQL, DSL, regex).

2. Tool use / function calling is the agent primitive

In 2026, structured output has merged with tool use / function calling. When a model "calls a tool", it outputs structured JSON matching the tool's schema. MCP (Model Context Protocol) standardized this even further.

See [[../05-ai-agents/README]] and [[../06-mcp/README]].

3. Grammar support in open-weight servers

vLLM, SGLang, TGI all support grammar-constrained generation natively in 2026. Open-weight teams get constrained outputs as a serving-time feature, not a wrapper.

4. Pydantic-style DSLs everywhere

Instructor and similar libraries let you define output structure as Pydantic models. Library handles the JSON schema generation and parsing. Pattern is now standard:

from pydantic import BaseModel
import instructor
from openai import OpenAI

# instructor patches the client so it accepts response_model
client = instructor.from_openai(OpenAI())

class UserProfile(BaseModel):
    name: str
    age: int
    interests: list[str]

user = client.chat.completions.create(
    response_model=UserProfile,
    ...  # model, messages, etc.
)
# user is a validated UserProfile instance, not a raw string

5. Self-validation loops are cheap now

With cheaper inference (prices dropped 10x since 2024), you can afford a 2-model validation: generate with Model A, validate with Model B. Catches semantic errors the schema cannot.

Works well for "valid JSON that ALSO makes sense" tasks.
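A sketch of the loop with plain callables standing in for the two models (all names hypothetical): `generate` takes optional feedback from the previous round, and `validate` returns None on success or a critique string that steers the retry.

```python
def generate_then_validate(generate, validate, max_retries=2):
    """Generate with model A, validate with model B (here: callables)."""
    feedback = None
    for _ in range(max_retries + 1):
        candidate = generate(feedback)
        feedback = validate(candidate)
        if feedback is None:
            return candidate
    raise ValueError(f"validation kept failing: {feedback}")

# Stubbed run: a schema-valid but semantically wrong age gets caught
attempts = iter([{"name": "Ada", "age": -5}, {"name": "Ada", "age": 36}])
result = generate_then_validate(
    generate=lambda fb: next(attempts),
    validate=lambda c: None if c["age"] >= 0 else "age must be >= 0",
)
print(result)  # {'name': 'Ada', 'age': 36}
```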

6. XML is Claude's favorite

Anthropic documents that Claude prefers XML-like tags (<answer>, <reasoning>) over JSON for structured content within prose. For pure data, JSON is still standard.

Critical questions

  • Why does constrained sampling add latency? What is being computed per token?
  • If native structured outputs guarantee schema compliance, what can still go wrong? (Answer: semantic errors. "age": -5 is valid schema, invalid reality.)
  • XML vs JSON for structured outputs: what does Anthropic's data show?
  • Is constrained sampling doing the model's job wrong? Should models just be trained better?
  • For agents, should tool schemas be detailed (many fields) or minimal (few fields)? What's the tradeoff?

Production pitfalls

  • Early stopping breaks JSON. If max_tokens is too low, the JSON gets truncated mid-object. No closing bracket, unparseable. Set max_tokens generously for structured outputs.
  • Prompting + no schema validation. Relying on the model "usually outputting JSON" will fail 2% of the time. Over 10M requests, that is 200K broken outputs.
  • Overly complex schemas. Deeply nested or 50-field schemas confuse the model. Simpler schemas = higher reliability. Break into multiple calls if needed.
  • Missing default values. A schema with 20 optional fields often has the model hallucinate plausible defaults. Mark required vs optional explicitly.
  • Trusting the values, not just the structure. Structured output guarantees shape, not correctness. "age": 150 is valid schema, probably a hallucination.
  • YAML trap. Whitespace-sensitive, easy to break with subtle formatting issues. Use JSON unless you have a strong reason (LinkedIn's verbosity argument).
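The truncation pitfall is easy to reproduce: a reply cut off mid-string by max_tokens cannot be parsed at all.

```python
import json

# A reply cut off by max_tokens: no closing quote or brace survives
truncated = '{"title": "Breaking news", "body": "The quick bro'
try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    print("unparseable:", err.msg)
```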

Alternatives / Comparisons

| Layer | Effort | Reliability | When to use |
|---|---|---|---|
| Prompting alone | Zero | 95-99% | Quick prototypes, small schemas |
| Prompt + post-processing | Low | 99-99.9% | LinkedIn-style fixes for common errors |
| Native structured outputs (API) | Low | ~100% schema | Default in 2026 for OpenAI, Anthropic |
| Constrained sampling (Outlines, xgrammar) | Medium | ~100% | Self-hosted, custom grammars |
| Fine-tuning on format | High | ~100% | Domain-specific output you cannot express as a grammar |
| Classifier head (feature-based transfer) | High | 100% | Classification with few classes |

Typical 2026 production config:

  1. Use native structured outputs if the API supports it
  2. Add Pydantic validation for semantic correctness
  3. Retry with a different prompt on validation failure
  4. Fall back to human review for critical fields

Mini-lab

labs/structured-outputs-compare/ (to create) - benchmark 5 approaches on the same task (e.g., extract structured metadata from a news article):

  1. Prompt only
  2. Prompt + post-processing
  3. OpenAI structured outputs API
  4. Outlines with grammar
  5. Instructor with Pydantic

Measure:

  • Schema compliance rate
  • Semantic accuracy
  • Latency
  • Cost per request

Stack: uv + OpenAI API + local Ollama model + Outlines + instructor.

Goal: build intuition for which layer to reach for in real projects.

Further reading

Tags: structured-output · json · function-calling · constrained-sampling · tool-use