Open-Weight Models for Agents

Reference · updated 2026-06-19 · Markdown variant

Cross-vendor comparison table of major open-weight LLM families — license, tool-calling support, context window, and agent-builder notes — as of June 2026.

Open-weight models let agent builders control inference, eliminate per-call vendor fees, and avoid rate-limit ceilings. The tradeoff is hosting cost and model-update lag. This page compares the major families through the lens of what agent builders actually need. All claims are dated June 2026; this space moves fast — verify before pinning a model version.

Comparison table

Family	Latest open-weight release	License	Native tool/function-calling	Notable for agents
Meta Llama 4	Llama 4 Maverick (17B active / 400B total MoE, Apr 2025)	Llama 4 Community License (not OSI open; MAU cap >700M requires Meta approval; commercial use otherwise permitted)	Yes — natively optimized for tool-calling and agentic use	Scout (10M-token context) and Maverick (1M) both remain available on Hugging Face under the Llama 4 Community License; Maverick is the stronger general/agentic pick. Multimodal. Behemoth was previewed but never open-released.
Mistral	Mistral Small 4 (119B total / 6.5B active MoE, Mar 2026); Mistral Medium 3.5 (128B dense, Apr 2026); Mistral Large 3 (675B total / 41B active MoE, Dec 2025)	Apache 2.0 (Small 4, Large 3); Modified MIT with revenue cap (Medium 3.5)	Yes — function calling and structured output supported across all three; Small 4 also unifies reasoning and vision	Small 4 (256K ctx) unifies Magistral reasoning + Pixtral vision + Devstral coding in one model. Medium 3.5 (256K ctx) is a frontier-class open coding/agentic model. Large 3 (256K ctx) is the largest open-weight Mistral. Mistral Small 3.2 deprecated April 30, 2026.
Alibaba Qwen	Qwen 3.6-27B / 3.6-35B-A3B (Apr 2026); Qwen3 base series (Apr 2025)	Apache 2.0	Yes — native tool-calling and MCP support via Qwen-Agent; all sizes	Dense and MoE variants from 0.6B to 235B. Up to 262K context (extensible to 1M via YaRN). Hybrid thinking/non-thinking mode. Qwen 3.7 is closed-weight API-only as of Jun 2026.
DeepSeek	DeepSeek V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active), both Apr 2026 preview	MIT	Yes — V4 natively supports function calling, JSON output, tool calls, and thinking / non-thinking modes	1M context window (default across both V4 variants). V4-Pro: frontier-class agentic coding. V4-Flash: fast/cheap inference. Weights on Hugging Face (deepseek-ai). V4 labeled preview; stable release expected later 2026.
Google Gemma 4	Gemma 4: E2B, E4B, 26B-A4B, 31B (Mar–Apr 2026); Gemma 4 12B Unified (Jun 2026, encoder-free, native audio)	Apache 2.0 (first Gemma release under true Apache 2.0)	Yes — native function-calling built into Gemma 4; FunctionGemma 270M for edge/on-device	128K context (E2B/E4B); 256K context (12B+, 26B, 31B). Gemma 4 12B Unified (Jun 3 2026) adds native audio + video via encoder-free architecture; runs on 16 GB RAM. Multimodal across the family.
Microsoft Phi-4	Phi-4-reasoning-vision-15B (15B, Mar 2026); Phi-4-reasoning (14B, May 2025); Phi-4-mini (3.8B)	MIT	Yes — Phi-4-mini has built-in function calling; the Phi-4 line supports tool use; Phi-4-reasoning for chain-of-thought agentic tasks	Efficiency-first: strong reasoning per parameter. MIT license. Phi-4-reasoning-vision-15B adds selective thinking mode + high-res vision. Phi-4-multimodal adds audio+vision.
IBM Granite 4.1	Granite 4.1 (3B, 8B, 30B, Apr 2026)	Apache 2.0	Yes — tool calling follows OpenAI function definition schema; benchmarked on Berkeley BFCL	512K context window. Enterprise-focused; ISO 42001 certified (Granite 4.0 line). 30B uses hybrid Mamba-Transformer architecture for long-context efficiency; 3B/8B are dense. Sizes 3B–30B.
OpenAI gpt-oss	gpt-oss-20b (21B total / 3.6B active, MoE); gpt-oss-120b (117B total / 5.1B active, MoE); released Aug 2025	Apache 2.0	Yes — native function calling, structured outputs, web browsing, code execution	Reasoning models (comparable to o3-mini / o4-mini). 128K context. MXFP4 quantized weights; 120B fits on a single 80GB GPU. Weights on Hugging Face: huggingface.co/openai.

What to look for when picking an open-weight model for agents

Tool-calling and JSON-mode reliability

Reliable structured output is the single most important property for agents. A model that hallucinates tool names, omits required arguments, or produces malformed JSON turns every downstream step into an error-handling problem. Check: (a) whether the model was instruction-tuned on a tool-use dataset, not just base-pretrained; (b) its scores on the Berkeley Function Calling Leaderboard (BFCL) for your target task category; (c) whether your inference framework applies the model's chat template exactly (template mismatches silently degrade tool-call reliability).
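The three failure modes named above (hallucinated tool names, missing arguments, malformed JSON) can each be caught before a call is dispatched. The following is a minimal sketch, not any vendor's API: the `TOOLS` registry, the tool names, and the `{"name": ..., "arguments": ...}` call shape are all assumptions chosen for illustration.

```python
import json

# Hypothetical tool registry: tool name -> required arguments and their types.
TOOLS = {
    "get_weather": {"city": str},
    "search_docs": {"query": str, "top_k": int},
}


def validate_tool_call(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for raw model output that should be a JSON tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"malformed JSON: {exc.msg}"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    # Reject tool names the agent never declared.
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return False, f"unknown tool: {name!r}"

    args = call.get("arguments")
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    # Every required argument must be present with the expected type.
    for arg, expected in TOOLS[name].items():
        if arg not in args:
            return False, f"missing required argument: {arg}"
        if not isinstance(args[arg], expected):
            return False, f"wrong type for argument: {arg}"
    return True, "ok"
```

Running a fixed set of prompts through a candidate model and counting how often this check fails gives a rough, task-specific reliability number to set beside published BFCL scores.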

License terms: true-open vs source-available

Not all "open-weight" licenses are equal. For commercial agent deployments, the key questions are: (1) Is the license OSI-approved (Apache 2.0, MIT)? If yes, there are no usage caps and no vendor approval step. (2) Is there a monthly active user (MAU) cap requiring vendor approval? Under the Llama 4 Community License, deployments serving more than 700M MAU require Meta's approval, which is irrelevant for most builders but material at scale. (3) Does the license permit sublicensing or redistribution of fine-tunes? Apache 2.0 and MIT do; the Llama 4 Community License restricts this. As of June 2026, Qwen 3.6, Gemma 4, Phi-4, Granite 4.1, Mistral Small 4 / Large 3, DeepSeek V4, and gpt-oss are all Apache 2.0 or MIT, with no MAU caps. Mistral Medium 3.5 uses a Modified MIT license with a revenue-threshold clause for large enterprises (not OSI-approved, but commercially permissive for most builders).

Context length

Long context matters for agents that hold large tool outputs, long conversation histories, or multi-document retrieval results in the context window. Current verified windows: Qwen 3.6 up to 262K (extensible to 1M via YaRN); Llama 4 Maverick 1M / Scout 10M; Gemma 4 E2B/E4B 128K and 12B+ 256K; DeepSeek V4-Pro and V4-Flash 1M; gpt-oss 128K; Mistral Small 4, Medium 3.5, and Large 3 256K; Granite 4.1 512K. Long-context performance degrades before the nominal limit, so test your actual retrieval patterns, not just the window size.
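Because quality can drop before the nominal limit, one common pattern is to budget against a smaller working window and trim the oldest history first. The sketch below illustrates that idea only: the four-characters-per-token estimate is a rough assumption (use the model's own tokenizer for real budgeting), and `estimate_tokens` and `trim_history` are hypothetical helper names.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; an assumption, not a real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def trim_history(messages: list[str], window: int, reserve: int) -> list[str]:
    """Keep the newest messages whose estimated size fits in window - reserve.

    window:  working token budget (can be set below the nominal context size)
    reserve: tokens held back for the model's own output
    """
    budget = window - reserve
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break  # this message and everything older is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

Setting `window` well below the advertised figure, then raising it while measuring retrieval accuracy on your own data, is one way to find the usable limit for a given model.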

Inference cost and the self-hosting tradeoff

Self-hosting eliminates per-token vendor fees but introduces GPU cost, model-update ops, and batching complexity. MoE architectures (Llama 4, Qwen 3.6 MoE, gpt-oss, DeepSeek) activate only a fraction of their parameters per token, so compute and latency per token are lower than for dense models of comparable quality. All expert weights must still be resident in memory, so VRAM requirements follow total parameter count, not active count. Dense models (Phi-4 14B, Granite 4.1 3B/8B) are simpler to serve. For burst or experimental workloads, use an inference provider (Together AI, Fireworks, DeepInfra, Replicate) that hosts the weights; under Apache 2.0 or MIT licenses, no additional fee is owed to the model vendor.
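A quick back-of-the-envelope calculation helps when sizing hardware: per-token compute tracks active parameters, while the memory needed to hold the weights tracks total parameters. The sketch below uses parameter counts from the comparison table. The bits-per-parameter figures are assumptions: 16 for bf16, and roughly 4.25 for MXFP4 (4-bit values plus one shared 8-bit scale per block of 32), applied uniformly even though real checkpoints keep some tensors at higher precision. KV cache and activations come on top of these numbers.

```python
# (total parameters, active parameters) in billions, from the comparison table.
MODELS = {
    "Llama 4 Maverick": (400.0, 17.0),
    "Mistral Large 3": (675.0, 41.0),
    "DeepSeek V4-Pro": (1600.0, 49.0),
    "gpt-oss-120b": (117.0, 5.1),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token (a proxy for per-token compute)."""
    return active_b / total_b


def weight_memory_gb(total_b: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1e9 bytes); excludes KV cache."""
    return total_b * bits_per_param / 8


if __name__ == "__main__":
    for name, (total_b, active_b) in MODELS.items():
        frac = active_fraction(total_b, active_b)
        bf16 = weight_memory_gb(total_b, 16)
        print(f"{name}: {frac:.1%} active, ~{bf16:.0f} GB of weights at bf16")

    # Assumed uniform 4.25 bits/param for MXFP4; see the caveats above.
    print(f"gpt-oss-120b at ~4.25 bits: ~{weight_memory_gb(117.0, 4.25):.0f} GB")
```

Under these assumptions gpt-oss-120b needs about 62 GB for weights, which is consistent with the table's note that it fits on a single 80GB GPU, while Maverick at bf16 needs about 800 GB despite activating only 17B parameters per token.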

Verified sources

Meta Llama 4 blog: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Llama 4 model cards and prompt formats: https://www.llama.com/docs/model-cards-and-prompt-formats/llama4/
Mistral Small 4 announcement: https://mistral.ai/news/mistral-small-4/
Mistral Small 4 model card: https://docs.mistral.ai/models/model-cards/mistral-small-4-0-26-03
Mistral Medium 3.5 model card: https://docs.mistral.ai/models/model-cards/mistral-medium-3-5-26-04
Mistral Medium 3.5 on Hugging Face: https://huggingface.co/mistralai/Mistral-Medium-3.5-128B
Mistral Large 3 announcement: https://mistral.ai/news/mistral-3/
Qwen3 blog: https://qwenlm.github.io/blog/qwen3/
Qwen3.6 GitHub: https://github.com/QwenLM/Qwen3.6
Qwen-Agent (tool-calling + MCP): https://github.com/QwenLM/Qwen-Agent
DeepSeek V4 preview release notes: https://api-docs.deepseek.com/news/news260424
DeepSeek V4-Pro on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
DeepSeek V4-Flash on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
Gemma 4 releases page: https://ai.google.dev/gemma/docs/releases
Gemma 4 12B Unified announcement: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12b/
Function calling with Gemma 4: https://ai.google.dev/gemma/docs/capabilities/text/function-calling-gemma4
Gemma 4 Apache 2.0 announcement: https://opensource.googleblog.com/2026/03/gemma-4-expanding-the-gemmaverse-with-apache-20.html
Microsoft Phi-4 on Azure: https://azure.microsoft.com/en-us/products/phi/
Phi-4-reasoning-vision-15B (Microsoft Research): https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/
Phi-4-reasoning-vision-15B on Hugging Face: https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B
IBM Granite 4.1 IBM Research blog: https://research.ibm.com/blog/granite-4-1-ai-foundation-models
OpenAI gpt-oss introduction: https://openai.com/index/introducing-gpt-oss/
gpt-oss GitHub (weights + model card): https://github.com/openai/gpt-oss