# Function Calling (tool use primitive)
## Watch or read first
- OpenAI function calling guide: https://platform.openai.com/docs/guides/function-calling
- Anthropic tool use guide: https://docs.anthropic.com/claude/docs/tool-use
- Daily Dose DS, "Building blocks of AI Agents - Tools" in the AI Engineering Guidebook (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
## TL;DR
Function calling is the primitive that turns an LLM into something that can act. You describe tools as JSON schemas. The LLM decides when to call them and with what arguments, returning a structured function call you execute in your code. The result goes back to the LLM. This is the atom of every agent.
## The historical problem
Before function calling, making an LLM call an API meant:
- Prompt-engineer the model to emit a format like "CALL: search(query=X)"
- Parse that format with regex
- Hope the model stays on format
- Handle failures constantly
It worked in demos but broke at scale: the format drifted, and error handling was brittle.
OpenAI introduced function calling in June 2023. Anthropic followed in 2024. Google, Cohere, Mistral, open-source models adopted it. By 2026, function calling is a standard feature of every serious LLM API.
## How it works

### The three-step dance
1. You define tools (name, description, JSON schema for arguments).
2. You call the LLM with your prompt and the tools list. The LLM replies with either:
   - a text answer, OR
   - a function call (tool name + arguments)
3. If a function call: you execute it in your code, send the result back to the LLM, and loop until it returns a text answer.
### Example: OpenAI API shape

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
)
# Model responds with a tool_call:
# { name: "get_weather", arguments: '{"city": "Paris"}' }
```
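One easy-to-miss detail when reading the call off the response: the OpenAI SDK returns `arguments` as a JSON *string*, not a dict, so you decode it yourself. A sketch, with the assistant message simulated as a plain dict (the `call_abc123` id is made up):

```python
import json

# Simulated shape of the assistant message carrying a tool call,
# written as a plain dict for illustration.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "get_weather",
            "arguments": '{"city": "Paris"}',  # a JSON string, not a dict
        },
    }],
}

for call in message["tool_calls"]:
    # Decode before use; json.loads raises on malformed arguments.
    args = json.loads(call["function"]["arguments"])
    print(call["function"]["name"], args)  # get_weather {'city': 'Paris'}
```

Validating the decoded dict against your schema before executing the tool catches hallucinated or missing fields early.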
### Example: Anthropic API shape

```python
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"}
        },
        "required": ["city"]
    }
}]

response = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
)
# response.content is a list of blocks. If tool_use, you get:
# { type: "tool_use", name: "get_weather", input: { "city": "Paris" } }
```
Anthropic calls it `tool_use`. OpenAI calls it `tool_calls`. Same mechanism.
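The result message differs in shape too: OpenAI uses a dedicated `tool` role, while Anthropic wraps a `tool_result` block inside a `user` message. A sketch of both shapes, with made-up ids standing in for the ones the model returns:

```python
# Anthropic: the follow-up user message carrying the tool result.
anthropic_tool_result = {
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": "toolu_01A",  # hypothetical id copied from the tool_use block
        "content": "18°C, partly cloudy",
    }],
}

# OpenAI: the equivalent is a "tool" role message.
openai_tool_result = {
    "role": "tool",
    "tool_call_id": "call_abc123",  # hypothetical id copied from the tool_call
    "content": "18°C, partly cloudy",
}
```

In both cases the id must match the call it answers; that matching is what makes parallel tool calls unambiguous.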
Full round-trip loop
messages = [{"role": "user", "content": "Weather in Paris, in Celsius?"}]
while True:
response = llm.complete(messages, tools=tools)
if response.is_text():
return response.text
for tool_call in response.tool_calls:
result = execute_tool(tool_call.name, tool_call.arguments)
messages.append({"role": "assistant", "content": None, "tool_calls": [tool_call]})
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})
# loop
This is the heart of every ReAct agent. See react pattern.
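The `execute_tool` helper in the loop can be a plain dispatch table that also guards against unknown tool names and bad arguments. A minimal sketch (`get_weather` here is a stub, not a real weather API):

```python
import json

def get_weather(city: str) -> str:
    # Stub: a real implementation would call a weather API.
    return json.dumps({"city": city, "temp_c": 18})

TOOLS = {"get_weather": get_weather}

def execute_tool(name, arguments):
    # OpenAI sends arguments as a JSON string; Anthropic as a dict.
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    fn = TOOLS.get(name)
    if fn is None:
        # Report the error as an observation so the model can recover.
        return f"Error: unknown tool '{name}'"
    try:
        return fn(**arguments)
    except TypeError as exc:  # missing or unexpected arguments
        return f"Error: bad arguments for '{name}': {exc}"
```

Returning errors as strings rather than raising keeps the loop alive and lets the model retry with corrected arguments.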
### Parallel tool calls
Since late 2023, OpenAI and Anthropic support parallel tool calls: the model returns multiple function calls in one turn. You execute them in parallel and return all results together.

```python
# Response contains:
[
    ToolCall(name="get_weather", args={"city": "Paris"}),
    ToolCall(name="get_weather", args={"city": "Tokyo"}),
]
# You execute both concurrently, feed both results back.
```

This cuts latency for multi-tool tasks significantly.
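Concurrent execution can be as simple as a thread pool. A sketch, with `ToolCall` and `execute_tool` as stand-ins for your own types:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def execute_tool(call: ToolCall) -> str:
    # Placeholder: pretend every city is sunny.
    return f"{call.args['city']}: sunny"

calls = [
    ToolCall(name="get_weather", args={"city": "Paris"}),
    ToolCall(name="get_weather", args={"city": "Tokyo"}),
]

with ThreadPoolExecutor() as pool:
    # map preserves input order, so results line up with their calls.
    results = list(pool.map(execute_tool, calls))
```

Threads are fine for I/O-bound tools (HTTP calls); truly dependent calls (tool B needs tool A's output) must stay sequential, which is the dependency-management pitfall noted later.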
### Tool choice control
You can force the model's behavior:
- `auto` (default): model decides
- `required` (OpenAI) / `any` (Anthropic): model must call a tool
- `none`: model must not call a tool
- Specific tool: model must call THIS tool
Useful for structured extraction: define a single tool whose schema matches your output, force it.
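The request parameter shapes differ slightly per provider. A sketch of the `tool_choice` values as documented at the time of writing:

```python
# OpenAI: force a specific function...
openai_tool_choice = {"type": "function", "function": {"name": "get_weather"}}
# ...or pass "auto" / "required" / "none" as a plain string.

# Anthropic: force a specific tool...
anthropic_tool_choice = {"type": "tool", "name": "get_weather"}
# ...or {"type": "auto"} / {"type": "any"} to let it pick / require a tool.
```

For the structured-extraction trick, you would pass the specific-tool form and read your output straight from the forced call's arguments.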
### Function calling vs MCP
Function calling is the API-level primitive. MCP (Model Context Protocol, Anthropic, Nov 2024) is a protocol for exposing tools.

Without MCP:

```
Your app -- embeds tool logic --> LLM
```

Each framework redefines tools its own way.

With MCP:

```
MCP server -- exposes tools --> MCP client -- converts to function calls --> LLM
```

Any app can use any MCP server.

Under the hood, the LLM still uses function calling. MCP is the transport/discovery/auth layer above it. See [[../06-mcp/README]] and agent protocols.
## Relevance today (2026)

### Function calling is a commodity
Every serious LLM in 2026 supports function calling:
- OpenAI (GPT-4o, o1, o3)
- Anthropic (Claude 3/4/4.5)
- Google (Gemini 2.5)
- Cohere (Command R+)
- Mistral (Large, Medium)
- Open-source: Llama 3.1+, Qwen 2.5+, DeepSeek
Quality varies. OpenAI and Anthropic set the bar; open-source models often need extra prompt engineering.
### Structured outputs blurred the line
Both OpenAI and Anthropic offer structured output modes with grammar-constrained decoding. These use the same JSON schema machinery as tool calls. For pure extraction tasks, structured outputs are cleaner; for actions, tools remain the primitive. See structured outputs.
### MCP changes how tools are shared
Before MCP: each team wrote its tools as Python/TS functions specific to its framework. After MCP: tools become network-accessible services. One MCP server for Gmail, one for GitHub, one for Stripe. Any agent can use any of them.
By late 2026, the MCP marketplace has hundreds of servers. Teams expose their internal tools via MCP for their own agents.
### Multi-turn tool loops are cheap
With prompt caching (see prompt caching), the conversation history cost stays flat across a long agent loop. Tool-heavy agents went from expensive to routine.
### Pitfalls have not disappeared
- The LLM still hallucinates tool arguments
- It still chooses the wrong tool sometimes
- Parallel tool calls need dependency management
- Tool errors must be surfaced, not buried
Function calling is mature, not magical.
## Critical questions
- Why not just parse free-form "call this function" text? (Function calling uses grammar-constrained decoding in many providers; the JSON is guaranteed valid. Free-form parsing is brittle.)
- What if the LLM calls a non-existent tool? (Most providers reject it at generation. Some leak. Your code must handle unknown tool names.)
- Should descriptions be long or short? (Specific, example-rich, 1-3 sentences. Tool name alone is often ambiguous.)
- How do you decide if your app needs tools or structured outputs? (Structured outputs for "extract fields from this text". Tools for "take an action based on the user's intent".)
- Can you combine tool use with reasoning models? (Yes, on OpenAI o3 and Claude Opus 4.5 thinking. The model thinks before calling tools. Slower, more reliable.)
- Why do some providers require you to wrap tool_call IDs in responses? (So the LLM can match results to calls when multiple tools fired in parallel.)
## Production pitfalls
- Too many tools. Accuracy drops past 5-10 tools. Cluster by agent, or use dynamic tool selection (describe top-k relevant tools only).
- Ambiguous descriptions. LLM picks the wrong tool. Disambiguate with examples.
- Missing required arguments. The LLM can still hallucinate missing fields. Validate strictly.
- Tool output too long. A 50KB HTML page as tool output blows the context. Summarize or truncate tool results.
- No retry / timeout. External APIs fail. Wrap with retry + backoff; report failures to the LLM as observations, not exceptions.
- Destructive tools with no confirmation. `delete_user` accessible in a chat agent = disaster. Add a confirm step.
- No tool-level guardrails. Model calls `send_email` with content you never approved. Add middleware that checks every tool call against policy.
- Schema drift between LLM providers. Porting from OpenAI to Anthropic requires schema translation. MCP helps standardize.
- Hallucinated tool names after long conversations. Periodically re-inject the tool list in the system prompt if the conversation is very long.
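Several of these pitfalls (oversized output, destructive tools, surfacing errors as observations) can be handled in one thin middleware layer around tool execution. A minimal sketch with hypothetical policy values:

```python
MAX_RESULT_CHARS = 4000          # truncate oversized tool output
DESTRUCTIVE = {"delete_user"}    # hypothetical tools requiring confirmation

def guarded_execute(name, args, execute, confirmed=False):
    """Wrap a raw `execute(name, args)` callable with policy checks."""
    if name in DESTRUCTIVE and not confirmed:
        return f"Refused: '{name}' requires explicit user confirmation."
    try:
        result = execute(name, args)
    except Exception as exc:
        # Surface the failure as an observation the model can react to,
        # instead of crashing the agent loop.
        return f"Tool '{name}' failed: {exc}"
    if len(result) > MAX_RESULT_CHARS:
        result = result[:MAX_RESULT_CHARS] + "\n[truncated]"
    return result
```

Retry with backoff would wrap the `execute(...)` call inside the `try`; the key design choice is that nothing past this function ever raises into the loop.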
## Alternatives / Comparisons
| Approach | When |
|---|---|
| Prompt-only ("say CALL: search(X)") | Pre-2023 hack. Do not use. |
| Function calling (native API) | Default for 2026 agents |
| Structured outputs | Pure extraction, no action |
| MCP over function calling | Multi-agent, reusable tools |
| Python REPL as a tool | Code-heavy tasks (pandas, math) |
| Direct code generation (level 5) | Advanced agents, more risk |
## Mental parallels (non-AI)
- Remote control for appliances: the LLM is the couch user. The tool schema is the remote's button layout. Pressing a button produces a defined effect. Without the remote (tools), the user can only complain. With it, they can act.
- CLI vs GUI: function calling is the CLI for LLMs. Structured, typed, discoverable. Raw text prompting is the GUI - flexible, less precise.
- REST APIs in a web app: each tool is an endpoint. The schema is the contract. The LLM is the client.
- Delegation: like when you tell a junior engineer "use these 5 tools exactly; any other way, come back to me". Function calling enforces this.
## Mini-lab
`labs/function-calling/` (to create):
- Build a small agent with three tools:
  - `search_web` (Tavily)
  - `read_url` (fetch + strip)
  - `save_note` (append to SQLite)
- Give it a task: "Research the current state of MCP adoption and save a one-page summary."
- Use OpenAI first, then port to Anthropic. Log schema differences.
- Expose the same tools via an MCP server. Rewrite the agent to consume them via MCP. Compare code size and reusability.
Stack: `uv`, `openai`, `anthropic`, `tavily-python`, and the official MCP Python SDK.
## Further reading

### Canonical
- OpenAI function calling guide - https://platform.openai.com/docs/guides/function-calling
- Anthropic tool use guide - https://docs.anthropic.com/claude/docs/tool-use
- Gemini function calling - https://ai.google.dev/gemini-api/docs/function-calling
### Related in this KB
- what is an agent
- agent building blocks
- react pattern
- agent protocols
- json prompting
- structured outputs
- [[../06-mcp/README]]
### Frameworks
- LangChain `@tool` decorator: https://python.langchain.com/docs/how_to/custom_tools/
- LlamaIndex `FunctionTool`: https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/tools/
- CrewAI `BaseTool`: https://docs.crewai.com/concepts/tools
- PydanticAI `@agent.tool`: https://ai.pydantic.dev/tools/
- OpenAI Agents SDK: https://openai.github.io/openai-agents-python/