Reliable Tool Calling and Structured Outputs
How providers guarantee schema-valid tool calls and structured output — mechanisms, failure modes, and mitigations — for production agent builders.
Tool-call and JSON reliability is the single most important property for production agents. A model that hallucinates a tool name, drops a required argument, or emits malformed JSON turns every downstream step into an error-handling problem. This guide covers the two primitives, how each major provider enforces them, common failure modes, and concrete mitigations.
The two primitives
Function / tool calling — the model's intermediate output: it emits a structured call to a named tool (function name + arguments) instead of, or in addition to, a text reply. The calling application executes the tool and returns results for the model to incorporate. Tool calls appear mid-conversation; they are not the final answer.
Structured outputs / JSON mode — the model's final answer is constrained to a schema. No tool is called; the response itself must be valid JSON (or match a stricter JSON Schema). Use this when you need the model's conclusion in a machine-parseable form, not when you need it to invoke external functions.
The two are orthogonal: you can require tool calls without constraining the surrounding text, constrain the final answer without any tools, or combine both.
Constrained decoding: how providers guarantee schema-valid output
OpenAI — Structured Outputs (strict mode)
Set strict: true on a function definition or response format. OpenAI's constrained decoding then guarantees the output matches the supplied JSON Schema exactly. Requirements: every object must set additionalProperties: false; every property must appear in the required array (mark optional fields with a union type that includes null). Without strict: true, the model uses best-effort JSON mode, which does not guarantee schema compliance. Source: platform.openai.com/docs/guides/function-calling (Structured Outputs section).
Anthropic — tool use + tool_choice
Pass tool schemas via the tools array. By default (tool_choice: {"type": "auto"}), Claude decides whether to call a tool. To force a call, use {"type": "any"} (must use at least one of the provided tools) or {"type": "tool", "name": "<name>"} (must call that specific tool). Anthropic does not expose a separate JSON-mode endpoint; constrained JSON output is achieved through tool definitions or by instructing the model to fill a named tool whose schema is your target schema. Source: docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use.
Google Gemini — function calling + tool_config + responseSchema
Function calling uses tool_config.function_calling_config.mode: AUTO (model decides), ANY (must call a function — guarantees schema-typed output), or NONE (no tool calls). ANY mode with a function declaration gives schema adherence comparable to strict mode. For final-answer structured output, set response_mime_type: "application/json" and response_schema in the generation config. Note: you cannot simultaneously use response_schema and tool calling with AUTO mode — use ANY mode instead. Source: ai.google.dev/gemini-api/docs/function-calling; ai.google.dev/gemini-api/docs/structured-output.
Open models — grammar-based generation Local inference runtimes enforce schemas at the token-sampling layer. Two mainstream approaches:
llama.cpp GBNF — GGML BNF (GBNF), an extension of Backus-Naur Form, defines grammars that constrain token selection. Pass a grammar string or a JSON Schema (auto-converted to GBNF) at inference time. Note: grammar and function-calling cannot be used simultaneously in llama.cpp; function calling uses its own internal grammar. Source: github.com/ggml-org/llama.cpp/blob/master/grammars/README.md.
XGrammar — the default structured-generation backend for vLLM, SGLang, TensorRT-LLM, and MLC-LLM as of 2025. Compiles JSON Schema / EBNF to a pushdown automaton; applies bitwise token masking at under 40 µs overhead per token. Source: github.com/mlc-ai/xgrammar.
Outlines (dottxt-ai) — Python library that compiles JSON Schema or regex constraints to finite-state machines and masks invalid tokens during sampling. Works with Transformers, vLLM, Ollama, and others. Source: github.com/dottxt-ai/outlines.
Reliability failure modes and mitigations
| Failure mode | Mitigation |
|---|---|
| Hallucinated tool name — model calls a tool not in the declared set | Validate the returned tool name against your schema before executing; reject unknown names |
| Missing required arguments — model omits a field the schema marks required | Use strict mode / required + additionalProperties: false; validation library catches missing fields pre-execution |
| Extra / unexpected arguments — model adds fields not in schema | additionalProperties: false (OpenAI strict) or schema validation; strip unknown keys defensively |
| Malformed JSON — output is not parseable | Enable provider-level strict mode or constrained decoding; wrap parse in try/catch and retry with an error message |
| Wrong types — string where int expected, etc. | Declare enum or const values where possible; use schema validation (e.g. Pydantic, Zod) before consuming arguments |
| Over-calling — model calls tools unnecessarily | Use tool_choice: "auto" and a minimal toolset; evaluate on your task distribution, not just benchmark scores |
| Under-calling — model answers in text when a tool call was required | Force tool use via tool_choice: "any" (Anthropic), mode: "ANY" (Gemini), or remove text-only response option entirely |
| Parallel tool calls in wrong order — parallel calls with dependencies | Declare dependencies explicitly; use sequential tool_choice forcing when order matters |
| Chat-template mismatch — open model's tool schema injected with wrong template | Always match the inference framework's chat template exactly to the model's training template; mismatches silently degrade reliability |
Cross-cutting mitigations:
- Prefer enums and
constvalues over free-text fields wherever the value space is bounded. - Keep schemas shallow and required fields minimal — every optional field is a reliability risk.
- Add a validation + retry/repair loop: parse and validate the model's output; on failure, send the validation error back as a user message and request a corrected call.
- Use few-shot examples of correct tool calls in the system prompt.
- Lower temperature (toward 0) improves schema adherence on models without constrained decoding.
Evaluation
The standard benchmark for tool-calling reliability is the Berkeley Function Calling Leaderboard (BFCL), maintained by Gorilla LLM at UC Berkeley. BFCL evaluates serial and parallel function calls across multiple languages using Abstract Syntax Tree (AST) scoring. BFCL V4 (current) extends evaluation to multi-turn and holistic agentic scenarios. Live leaderboard: gorilla.cs.berkeley.edu/leaderboard.html.
For open-weight model tool-calling scores and license terms, see /resources/open-weight-models-for-agents. For validating tool outputs as a security control, see /resources/agentic-security-checklist.
Verified sources
- OpenAI function calling (includes Structured Outputs strict mode): https://platform.openai.com/docs/guides/function-calling
- OpenAI Structured Outputs guide: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic tool use — implement tool use: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use
- Anthropic tool choice overview: https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
- Google Gemini function calling: https://ai.google.dev/gemini-api/docs/function-calling
- Google Gemini structured output: https://ai.google.dev/gemini-api/docs/structured-output
- llama.cpp GBNF grammar README: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
- XGrammar (mlc-ai, default backend for vLLM/SGLang/TensorRT-LLM): https://github.com/mlc-ai/xgrammar
- Outlines (dottxt-ai, constrained decoding library): https://github.com/dottxt-ai/outlines
- Berkeley Function Calling Leaderboard (BFCL V4): https://gorilla.cs.berkeley.edu/leaderboard.html