ChangeGamer

← All resources

Reliable Tool Calling and Structured Outputs

Guide · updated 2026-06-15 · Markdown variant

How providers guarantee schema-valid tool calls and structured output — mechanisms, failure modes, and mitigations — for production agent builders.


Tool-call and JSON reliability is the single most important property for production agents. A model that hallucinates a tool name, drops a required argument, or emits malformed JSON turns every downstream step into an error-handling problem. This guide covers the two primitives, how each major provider enforces them, common failure modes, and concrete mitigations.

The two primitives

Function / tool calling — the model's intermediate output: it emits a structured call to a named tool (function name + arguments) instead of, or in addition to, a text reply. The calling application executes the tool and returns results for the model to incorporate. Tool calls appear mid-conversation; they are not the final answer.

Structured outputs / JSON mode — the model's final answer is constrained to a schema. No tool is called; the response itself must be valid JSON (or match a stricter JSON Schema). Use this when you need the model's conclusion in a machine-parseable form, not when you need it to invoke external functions.

The two are orthogonal: you can require tool calls without constraining the surrounding text, constrain the final answer without any tools, or combine both.

Constrained decoding: how providers guarantee schema-valid output

OpenAI — Structured Outputs (strict mode) Set strict: true on a function definition or response format. OpenAI's constrained decoding then guarantees the output matches the supplied JSON Schema exactly. Requirements: every object must set additionalProperties: false; every property must appear in the required array (mark optional fields with a union type that includes null). Without strict: true, the model uses best-effort JSON mode, which does not guarantee schema compliance. Source: platform.openai.com/docs/guides/function-calling (Structured Outputs section).

Anthropic — tool use + tool_choice Pass tool schemas via the tools array. By default (tool_choice: {"type": "auto"}), Claude decides whether to call a tool. To force a call, use {"type": "any"} (must use at least one of the provided tools) or {"type": "tool", "name": "<name>"} (must call that specific tool). Anthropic does not expose a separate JSON-mode endpoint; constrained JSON output is achieved through tool definitions or by instructing the model to fill a named tool whose schema is your target schema. Source: docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use.

Google Gemini — function calling + tool_config + responseSchema Function calling uses tool_config.function_calling_config.mode: AUTO (model decides), ANY (must call a function — guarantees schema-typed output), or NONE (no tool calls). ANY mode with a function declaration gives schema adherence comparable to strict mode. For final-answer structured output, set response_mime_type: "application/json" and response_schema in the generation config. Note: you cannot simultaneously use response_schema and tool calling with AUTO mode — use ANY mode instead. Source: ai.google.dev/gemini-api/docs/function-calling; ai.google.dev/gemini-api/docs/structured-output.

Open models — grammar-based generation Local inference runtimes enforce schemas at the token-sampling layer. Two mainstream approaches:

Reliability failure modes and mitigations

Failure mode Mitigation
Hallucinated tool name — model calls a tool not in the declared set Validate the returned tool name against your schema before executing; reject unknown names
Missing required arguments — model omits a field the schema marks required Use strict mode / required + additionalProperties: false; validation library catches missing fields pre-execution
Extra / unexpected arguments — model adds fields not in schema additionalProperties: false (OpenAI strict) or schema validation; strip unknown keys defensively
Malformed JSON — output is not parseable Enable provider-level strict mode or constrained decoding; wrap parse in try/catch and retry with an error message
Wrong types — string where int expected, etc. Declare enum or const values where possible; use schema validation (e.g. Pydantic, Zod) before consuming arguments
Over-calling — model calls tools unnecessarily Use tool_choice: "auto" and a minimal toolset; evaluate on your task distribution, not just benchmark scores
Under-calling — model answers in text when a tool call was required Force tool use via tool_choice: "any" (Anthropic), mode: "ANY" (Gemini), or remove text-only response option entirely
Parallel tool calls in wrong order — parallel calls with dependencies Declare dependencies explicitly; use sequential tool_choice forcing when order matters
Chat-template mismatch — open model's tool schema injected with wrong template Always match the inference framework's chat template exactly to the model's training template; mismatches silently degrade reliability

Cross-cutting mitigations:

Evaluation

The standard benchmark for tool-calling reliability is the Berkeley Function Calling Leaderboard (BFCL), maintained by Gorilla LLM at UC Berkeley. BFCL evaluates serial and parallel function calls across multiple languages using Abstract Syntax Tree (AST) scoring. BFCL V4 (current) extends evaluation to multi-turn and holistic agentic scenarios. Live leaderboard: gorilla.cs.berkeley.edu/leaderboard.html.

For open-weight model tool-calling scores and license terms, see /resources/open-weight-models-for-agents. For validating tool outputs as a security control, see /resources/agentic-security-checklist.

Verified sources

#tool-calling #structured-outputs #json-mode #constrained-decoding #agents #reliability

Category: Guide