# Reliable Tool Calling and Structured Outputs

> How providers guarantee schema-valid tool calls and structured output — mechanisms, failure modes, and mitigations — for production agent builders.

Category: Guide · Updated: 2026-06-15 · Tags: tool-calling, structured-outputs, json-mode, constrained-decoding, agents, reliability
Canonical: https://changegamer.ai/resources/reliable-tool-calling

Tool-call and JSON reliability is the single most important property for production agents. A model that hallucinates a tool name, drops a required argument, or emits malformed JSON turns every downstream step into an error-handling problem. This guide covers the two primitives, how each major provider enforces them, common failure modes, and concrete mitigations.

## The two primitives

**Function / tool calling** — the model's intermediate output: it emits a structured call to a named tool (function name + arguments) instead of, or in addition to, a text reply. The calling application executes the tool and returns results for the model to incorporate. Tool calls appear mid-conversation; they are not the final answer.

**Structured outputs / JSON mode** — the model's *final* answer is constrained to a schema. No tool is called; the response itself must be valid JSON (or match a stricter JSON Schema). Use this when you need the model's conclusion in a machine-parseable form, not when you need it to invoke external functions.

The two are orthogonal: you can require tool calls without constraining the surrounding text, constrain the final answer without any tools, or combine both.

## Constrained decoding: how providers guarantee schema-valid output

**OpenAI — Structured Outputs (strict mode)**
Set `strict: true` on a function definition or response format. OpenAI's constrained decoding then guarantees the output matches the supplied JSON Schema exactly. Requirements: every object must set `additionalProperties: false`; every property must appear in the `required` array (mark optional fields with a union type that includes `null`). Without `strict: true`, the model uses best-effort JSON mode, which does not guarantee schema compliance. Source: platform.openai.com/docs/guides/function-calling (Structured Outputs section).

**Anthropic — tool use + `tool_choice`**
Pass tool schemas via the `tools` array. By default (`tool_choice: {"type": "auto"}`), Claude decides whether to call a tool. To force a call, use `{"type": "any"}` (must use at least one of the provided tools) or `{"type": "tool", "name": "<name>"}` (must call that specific tool). Anthropic does not expose a separate JSON-mode endpoint; constrained JSON output is achieved through tool definitions or by instructing the model to fill a named tool whose schema is your target schema. Source: docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use.

**Google Gemini — function calling + `tool_config` + `responseSchema`**
Function calling uses `tool_config.function_calling_config.mode`: `AUTO` (model decides), `ANY` (must call a function — guarantees schema-typed output), or `NONE` (no tool calls). `ANY` mode with a function declaration gives schema adherence comparable to strict mode. For final-answer structured output, set `response_mime_type: "application/json"` and `response_schema` in the generation config. Note: you cannot simultaneously use `response_schema` and tool calling with `AUTO` mode — use `ANY` mode instead. Source: ai.google.dev/gemini-api/docs/function-calling; ai.google.dev/gemini-api/docs/structured-output.

**Open models — grammar-based generation**
Local inference runtimes enforce schemas at the token-sampling layer. Two mainstream approaches:

- *llama.cpp GBNF* — GGML BNF (GBNF), an extension of Backus-Naur Form, defines grammars that constrain token selection. Pass a grammar string or a JSON Schema (auto-converted to GBNF) at inference time. Note: grammar and function-calling cannot be used simultaneously in llama.cpp; function calling uses its own internal grammar. Source: github.com/ggml-org/llama.cpp/blob/master/grammars/README.md.

- *XGrammar* — the default structured-generation backend for vLLM, SGLang, TensorRT-LLM, and MLC-LLM as of 2025. Compiles JSON Schema / EBNF to a pushdown automaton; applies bitwise token masking at under 40 µs overhead per token. Source: github.com/mlc-ai/xgrammar.

- *Outlines (dottxt-ai)* — Python library that compiles JSON Schema or regex constraints to finite-state machines and masks invalid tokens during sampling. Works with Transformers, vLLM, Ollama, and others. Source: github.com/dottxt-ai/outlines.

## Reliability failure modes and mitigations

| Failure mode | Mitigation |
|---|---|
| **Hallucinated tool name** — model calls a tool not in the declared set | Validate the returned tool name against your schema before executing; reject unknown names |
| **Missing required arguments** — model omits a field the schema marks required | Use strict mode / `required` + `additionalProperties: false`; validation library catches missing fields pre-execution |
| **Extra / unexpected arguments** — model adds fields not in schema | `additionalProperties: false` (OpenAI strict) or schema validation; strip unknown keys defensively |
| **Malformed JSON** — output is not parseable | Enable provider-level strict mode or constrained decoding; wrap parse in try/catch and retry with an error message |
| **Wrong types** — string where int expected, etc. | Declare enum or `const` values where possible; use schema validation (e.g. Pydantic, Zod) before consuming arguments |
| **Over-calling** — model calls tools unnecessarily | Use `tool_choice: "auto"` and a minimal toolset; evaluate on your task distribution, not just benchmark scores |
| **Under-calling** — model answers in text when a tool call was required | Force tool use via `tool_choice: "any"` (Anthropic), `mode: "ANY"` (Gemini), or remove text-only response option entirely |
| **Parallel tool calls in wrong order** — parallel calls with dependencies | Declare dependencies explicitly; use sequential `tool_choice` forcing when order matters |
| **Chat-template mismatch** — open model's tool schema injected with wrong template | Always match the inference framework's chat template exactly to the model's training template; mismatches silently degrade reliability |

**Cross-cutting mitigations:**

- Prefer enums and `const` values over free-text fields wherever the value space is bounded.
- Keep schemas shallow and required fields minimal — every optional field is a reliability risk.
- Add a validation + retry/repair loop: parse and validate the model's output; on failure, send the validation error back as a user message and request a corrected call.
- Use few-shot examples of correct tool calls in the system prompt.
- Lower temperature (toward 0) improves schema adherence on models without constrained decoding.

## Evaluation

The standard benchmark for tool-calling reliability is the **Berkeley Function Calling Leaderboard (BFCL)**, maintained by Gorilla LLM at UC Berkeley. BFCL evaluates serial and parallel function calls across multiple languages using Abstract Syntax Tree (AST) scoring. BFCL V4 (current) extends evaluation to multi-turn and holistic agentic scenarios. Live leaderboard: gorilla.cs.berkeley.edu/leaderboard.html.

For open-weight model tool-calling scores and license terms, see /resources/open-weight-models-for-agents. For validating tool outputs as a security control, see /resources/agentic-security-checklist.

## Verified sources

- OpenAI function calling (includes Structured Outputs strict mode): https://platform.openai.com/docs/guides/function-calling
- OpenAI Structured Outputs guide: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic tool use — implement tool use: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/implement-tool-use
- Anthropic tool choice overview: https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
- Google Gemini function calling: https://ai.google.dev/gemini-api/docs/function-calling
- Google Gemini structured output: https://ai.google.dev/gemini-api/docs/structured-output
- llama.cpp GBNF grammar README: https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md
- XGrammar (mlc-ai, default backend for vLLM/SGLang/TensorRT-LLM): https://github.com/mlc-ai/xgrammar
- Outlines (dottxt-ai, constrained decoding library): https://github.com/dottxt-ai/outlines
- Berkeley Function Calling Leaderboard (BFCL V4): https://gorilla.cs.berkeley.edu/leaderboard.html