Reliable Tool Calling and Structured Outputs

Guide · updated 2026-06-15 · Markdown variant

How providers guarantee schema-valid tool calls and structured output — mechanisms, failure modes, and mitigations — for production agent builders.

Tool-call and JSON reliability is the single most important property for production agents. A model that hallucinates a tool name, drops a required argument, or emits malformed JSON turns every downstream step into an error-handling problem. This guide covers the two primitives, how each major provider enforces them, common failure modes, and concrete mitigations.

The two primitives

Function / tool calling — the model's intermediate output: it emits a structured call to a named tool (function name + arguments) instead of, or in addition to, a text reply. The calling application executes the tool and returns results for the model to incorporate. Tool calls appear mid-conversation; they are not the final answer.

Structured outputs / JSON mode — the model's final answer is constrained to a schema. No tool is called; the response itself must be valid JSON (or match a stricter JSON Schema). Use this when you need the model's conclusion in a machine-parseable form, not when you need it to invoke external functions.

The two are orthogonal: you can require tool calls without constraining the surrounding text, constrain the final answer without any tools, or combine both.

Constrained decoding: how providers guarantee schema-valid output

OpenAI — Structured Outputs (strict mode) Set strict: true on a function definition or response format. OpenAI's constrained decoding then guarantees the output matches the supplied JSON Schema exactly. Requirements: every object must set additionalProperties: false; every property must appear in the required array (mark optional fields with a union type that includes null). Without strict: true, the model uses best-effort JSON mode, which does not guarantee schema compliance. Source: platform.openai.com/docs/guides/function-calling (Structured Outputs section).

Anthropic — tool use + tool_choice Pass tool schemas via the tools array. By default (tool_choice: {"type": "auto"}), Claude decides whether to call a tool. To force a call, use {"type": "any"} (must use at least one of the provided tools) or {"type": "tool", "name": "<name>"} (must call that specific tool). Anthropic does not expose a separate JSON-mode endpoint; constrained JSON output is achieved through tool definitions or by instructing the model to fill a named tool whose schema is your target schema. Source: docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use.

Google Gemini — function calling + tool_config + responseSchema Function calling uses tool_config.function_calling_config.mode: AUTO (model decides), ANY (must call a function — guarantees schema-typed output), or NONE (no tool calls). ANY mode with a function declaration gives schema adherence comparable to strict mode. For final-answer structured output, set response_mime_type: "application/json" and response_schema in the generation config. Note: you cannot simultaneously use response_schema and tool calling with AUTO mode — use ANY mode instead. Source: ai.google.dev/gemini-api/docs/function-calling; ai.google.dev/gemini-api/docs/structured-output.

Open models — grammar-based generation Local inference runtimes enforce schemas at the token-sampling layer. Two mainstream approaches:

llama.cpp GBNF — GGML BNF (GBNF), an extension of Backus-Naur Form, defines grammars that constrain token selection. Pass a grammar string or a JSON Schema (auto-converted to GBNF) at inference time. Note: grammar and function-calling cannot be used simultaneously in llama.cpp; function calling uses its own internal grammar. Source: github.com/ggml-org/llama.cpp/blob/master/grammars/README.md.
XGrammar — the default structured-generation backend for vLLM, SGLang, TensorRT-LLM, and MLC-LLM as of 2025. Compiles JSON Schema / EBNF to a pushdown automaton; applies bitwise token masking at under 40 µs overhead per token. Source: github.com/mlc-ai/xgrammar.
Outlines (dottxt-ai) — Python library that compiles JSON Schema or regex constraints to finite-state machines and masks invalid tokens during sampling. Works with Transformers, vLLM, Ollama, and others. Source: github.com/dottxt-ai/outlines.

Reliability failure modes and mitigations

Failure mode	Mitigation
Hallucinated tool name — model calls a tool not in the declared set	Validate the returned tool name against your schema before executing; reject unknown names
Missing required arguments — model omits a field the schema marks required	Use strict mode / `required` + `additionalProperties: false`; validation library catches missing fields pre-execution
Extra / unexpected arguments — model adds fields not in schema	`additionalProperties: false` (OpenAI strict) or schema validation; strip unknown keys defensively
Malformed JSON — output is not parseable	Enable provider-level strict mode or constrained decoding; wrap parse in try/catch and retry with an error message
Wrong types — string where int expected, etc.	Declare enum or `const` values where possible; use schema validation (e.g. Pydantic, Zod) before consuming arguments
Over-calling — model calls tools unnecessarily	Use `tool_choice: "auto"` and a minimal toolset; evaluate on your task distribution, not just benchmark scores
Under-calling — model answers in text when a tool call was required	Force tool use via `tool_choice: "any"` (Anthropic), `mode: "ANY"` (Gemini), or remove text-only response option entirely
Parallel tool calls in wrong order — parallel calls with dependencies	Declare dependencies explicitly; use sequential `tool_choice` forcing when order matters
Chat-template mismatch — open model's tool schema injected with wrong template	Always match the inference framework's chat template exactly to the model's training template; mismatches silently degrade reliability

Cross-cutting mitigations:

Prefer enums and const values over free-text fields wherever the value space is bounded.
Keep schemas shallow and required fields minimal — every optional field is a reliability risk.
Add a validation + retry/repair loop: parse and validate the model's output; on failure, send the validation error back as a user message and request a corrected call.
Use few-shot examples of correct tool calls in the system prompt.
Lower temperature (toward 0) improves schema adherence on models without constrained decoding.

Evaluation

The standard benchmark for tool-calling reliability is the Berkeley Function Calling Leaderboard (BFCL), maintained by Gorilla LLM at UC Berkeley. BFCL evaluates serial and parallel function calls across multiple languages using Abstract Syntax Tree (AST) scoring. BFCL V4 (current) extends evaluation to multi-turn and holistic agentic scenarios. Live leaderboard: gorilla.cs.berkeley.edu/leaderboard.html.

For open-weight model tool-calling scores and license terms, see /resources/open-weight-models-for-agents. For validating tool outputs as a security control, see /resources/agentic-security-checklist.

Verified sources

OpenAI function calling (includes Structured Outputs strict mode): https://platform.openai.com/docs/guides/function-calling
OpenAI Structured Outputs guide: https://developers.openai.com/api/docs/guides/structured-outputs
Anthropic tool use — implement tool use: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use
Anthropic tool choice overview: https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
Google Gemini function calling: https://ai.google.dev/gemini-api/docs/function-calling
Google Gemini structured output: https://ai.google.dev/gemini-api/docs/structured-output
llama.cpp GBNF grammar README: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
XGrammar (mlc-ai, default backend for vLLM/SGLang/TensorRT-LLM): https://github.com/mlc-ai/xgrammar
Outlines (dottxt-ai, constrained decoding library): https://github.com/dottxt-ai/outlines
Berkeley Function Calling Leaderboard (BFCL V4): https://gorilla.cs.berkeley.edu/leaderboard.html

#tool-calling #structured-outputs #json-mode #constrained-decoding #agents #reliability

Category: Guide