Open-Weight Models for Agents

Reference · updated 2026-06-19 · Markdown variant

Cross-vendor comparison table of major open-weight LLM families — license, tool-calling support, context window, and agent-builder notes — as of June 2026.


Open-weight models let agent builders control inference, eliminate per-call vendor fees, and avoid rate-limit ceilings. The tradeoff is hosting cost and model-update lag. This page compares the major families through the lens of what agent builders actually need. All claims are dated June 2026; this space moves fast — verify before pinning a model version.

Comparison table

| Family | Latest open-weight release | License | Native tool/function-calling | Notable for agents |
| --- | --- | --- | --- | --- |
| Meta Llama 4 | Llama 4 Maverick (17B active / 400B total MoE, Apr 2025) | Llama 4 Community License (not OSI open; MAU cap >700M requires Meta approval; commercial use otherwise permitted) | Yes: natively optimized for tool-calling and agentic use | Scout (10M-token context) and Maverick (1M) both remain available on Hugging Face under the Llama 4 Community License; Maverick is the stronger general/agentic pick. Multimodal. Behemoth was previewed but never open-released. |
| Mistral | Mistral Small 4 (119B total / 6.5B active MoE, Mar 2026); Mistral Medium 3.5 (128B dense, Apr 2026); Mistral Large 3 (675B total / 41B active MoE, Dec 2025) | Apache 2.0 (Small 4, Large 3); Modified MIT with revenue cap (Medium 3.5) | Yes: function calling and structured output supported across all three; Small 4 also unifies reasoning and vision | Small 4 (256K ctx) unifies Magistral reasoning + Pixtral vision + Devstral coding in one model. Medium 3.5 (256K ctx) is a frontier-class open coding/agentic model. Large 3 (256K ctx) is the largest open-weight Mistral. Mistral Small 3.2 deprecated April 30, 2026. |
| Alibaba Qwen | Qwen 3.6-27B / 3.6-35B-A3B (Apr 2026); Qwen3 base series (Apr 2025) | Apache 2.0 | Yes: native tool-calling and MCP support via Qwen-Agent; all sizes | Dense and MoE variants from 0.6B to 235B. Up to 262K context (extensible to 1M via YaRN). Hybrid thinking/non-thinking mode. Qwen 3.7 is closed-weight API-only as of Jun 2026. |
| DeepSeek | DeepSeek V4-Pro (1.6T total / 49B active MoE) and V4-Flash (284B total / 13B active), both Apr 2026 preview | MIT | Yes: V4 natively supports function calling, JSON output, tool calls, and thinking / non-thinking modes | 1M context window (default across both V4 variants). V4-Pro: frontier-class agentic coding. V4-Flash: fast/cheap inference. Weights on Hugging Face (deepseek-ai). V4 labeled preview; stable release expected later 2026. |
| Google Gemma 4 | Gemma 4: E2B, E4B, 26B-A4B, 31B (Mar–Apr 2026); Gemma 4 12B Unified (Jun 2026, encoder-free, native audio) | Apache 2.0 (first Gemma release under true Apache 2.0) | Yes: native function-calling built into Gemma 4; FunctionGemma 270M for edge/on-device | 128K context (E2B/E4B); 256K context (12B+, 26B, 31B). Gemma 4 12B Unified (Jun 3 2026) adds native audio + video via encoder-free architecture; runs on 16 GB RAM. Multimodal across the family. |
| Microsoft Phi-4 | Phi-4-reasoning-vision-15B (15B, Mar 2026); Phi-4-reasoning (14B, May 2025); Phi-4-mini (3.8B) | MIT | Yes: Phi-4-mini has built-in function calling; the Phi-4 line supports tool use; Phi-4-reasoning for chain-of-thought agentic tasks | Efficiency-first: strong reasoning per parameter. MIT license. Phi-4-reasoning-vision-15B adds selective thinking mode + high-res vision. Phi-4-multimodal adds audio+vision. |
| IBM Granite 4.1 | Granite 4.1 (3B, 8B, 30B, Apr 2026) | Apache 2.0 | Yes: tool calling follows OpenAI function definition schema; benchmarked on Berkeley BFCL | 512K context window. Enterprise-focused; ISO 42001 certified (Granite 4.0 line). 30B uses hybrid Mamba-Transformer architecture for long-context efficiency; 3B/8B are dense. Sizes 3B–30B. |
| OpenAI gpt-oss | gpt-oss-20b (21B total / 3.6B active, MoE); gpt-oss-120b (117B total / 5.1B active, MoE); released Aug 2025 | Apache 2.0 | Yes: native function calling, structured outputs, web browsing, code execution | Reasoning models (comparable to o3-mini / o4-mini). 128K context. MXFP4 quantized weights; 120B fits on a single 80GB GPU. Weights on Hugging Face: huggingface.co/openai. |

What to look for when picking an open-weight model for agents

Tool-calling and JSON-mode reliability

Reliable structured output is the single most important property for agents. A model that hallucinates tool names, omits required arguments, or produces malformed JSON turns every downstream step into an error-handling problem. Check: (a) whether the model was instruction-tuned with a tool-use dataset, not just base-pretrained; (b) benchmark scores on Berkeley Function Calling Leaderboard (BFCL) for your target task category; (c) whether the inference framework you use supports the model's chat template exactly (template mismatches silently degrade tool-call reliability).
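The failure modes named above (hallucinated tool names, missing required arguments, malformed JSON) can be caught before a call is dispatched. The following is a minimal sketch, not any framework's API: the `TOOLS` registry, the `{"name": ..., "arguments": ...}` call shape, and the function name `validate_tool_call` are all assumptions made for illustration.

```python
import json

# Hypothetical tool registry: tool names and parameters are illustrative only.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
    "search_docs": {"required": {"query"}, "optional": {"top_k"}},
}


def validate_tool_call(raw, tools=TOOLS):
    """Return a list of problems found in one raw tool-call string (empty list = OK)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call must be a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]

    spec = tools[name]
    given = set(args)
    problems = []
    missing = spec["required"] - given
    if missing:
        problems.append(f"missing required arguments: {sorted(missing)}")
    unexpected = given - spec["required"] - spec["optional"]
    if unexpected:
        problems.append(f"unexpected arguments: {sorted(unexpected)}")
    return problems
```

Running a check like this over a few hundred sampled calls from your own prompts gives a task-specific error rate to set beside the public BFCL numbers.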

License terms: true-open vs source-available

Not all "open-weight" licenses are equal. For commercial agent deployments, the key questions are: (1) Is the license OSI-approved (Apache 2.0, MIT)? If yes, there are no usage caps or approval gatekeepers. (2) Is there a monthly active user (MAU) cap that requires vendor approval? Under the Llama 4 Community License, deployments serving more than 700M MAU need Meta's approval, which is irrelevant for most builders but material at scale. (3) Does the license permit sublicensing or redistribution of fine-tunes? Apache 2.0 and MIT do; the Llama 4 Community License restricts this. As of June 2026, Qwen 3.6, Gemma 4, Phi-4, Granite 4.1, Mistral Small 4 / Large 3, DeepSeek V4, and gpt-oss are all Apache 2.0 or MIT, with no MAU caps. Mistral Medium 3.5 uses a Modified MIT license with a revenue-threshold clause for large enterprises (not OSI-approved, but permissive for commercial use by most builders).
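One way to keep these questions in front of a team is to encode the license terms as data and flag anything that is not plainly OSI-approved and cap-free. The sketch below transcribes the entries from this page as of June 2026; the `LicenseTerms` type and `needs_legal_review` helper are hypothetical names, and the output is a prompt to read the license text, not a legal determination.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenseTerms:
    family: str
    license: str
    osi_approved: bool
    usage_cap: Optional[str]  # None means no MAU or revenue cap is noted on this page


# Transcribed from this page (June 2026); verify against the actual license text.
LICENSES = [
    LicenseTerms("Llama 4", "Llama 4 Community License", False, ">700M MAU requires Meta approval"),
    LicenseTerms("Mistral Small 4 / Large 3", "Apache 2.0", True, None),
    LicenseTerms("Mistral Medium 3.5", "Modified MIT", False, "revenue threshold for large enterprises"),
    LicenseTerms("Qwen 3.6", "Apache 2.0", True, None),
    LicenseTerms("DeepSeek V4", "MIT", True, None),
    LicenseTerms("Gemma 4", "Apache 2.0", True, None),
    LicenseTerms("Phi-4", "MIT", True, None),
    LicenseTerms("Granite 4.1", "Apache 2.0", True, None),
    LicenseTerms("gpt-oss", "Apache 2.0", True, None),
]


def needs_legal_review(terms):
    """Flag any license that is not OSI-approved or that carries a usage cap."""
    return (not terms.osi_approved) or terms.usage_cap is not None


flagged = [t.family for t in LICENSES if needs_legal_review(t)]
print(flagged)  # ['Llama 4', 'Mistral Medium 3.5']
```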

Context length

Long context matters for agents that hold large tool outputs, long conversation histories, or multi-document retrieval results in the context window. Current verified windows: Qwen 3.6 up to 262K (extendable to 1M via YaRN); Llama 4 Maverick 1M / Scout 10M; Gemma 4 12B+ 256K; DeepSeek V4-Pro and V4-Flash 1M; gpt-oss 128K; Mistral Small 4 and Medium 3.5 256K; Granite 4.1 512K. Long-context performance degrades before the nominal limit — test your actual retrieval patterns, not just the window size.
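Because quality can degrade before the nominal limit, it can help to budget against a fraction of the window rather than the full figure. The sketch below uses the nominal windows listed above (reading K as 1,000 tokens). The 4-characters-per-token estimate, the 50% usable fraction, and the output reserve are placeholder assumptions, not measured values; replace them with your tokenizer's counts and your own long-context test results.

```python
import math

# Nominal windows in tokens, taken from the list above (K read as 1,000).
NOMINAL_WINDOW = {
    "gpt-oss": 128_000,
    "Mistral Small 4": 256_000,
    "Qwen 3.6": 262_000,
    "Granite 4.1": 512_000,
    "DeepSeek V4-Pro": 1_000_000,
}


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; the chars-per-token ratio is an assumed heuristic."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_window(model, parts, reserve_for_output=4096, usable_fraction=0.5):
    """Return (fits, tokens_used, token_budget) for a list of prompt parts."""
    budget = int(NOMINAL_WINDOW[model] * usable_fraction) - reserve_for_output
    used = sum(estimate_tokens(p) for p in parts)
    return used <= budget, used, budget


# Example: a 400,000-character tool output (about 100,000 estimated tokens).
tool_output = "x" * 400_000
print(fits_in_window("gpt-oss", [tool_output]))
print(fits_in_window("Qwen 3.6", [tool_output]))
```

With these placeholder settings the same payload is rejected for a 128K window and accepted for a 262K one, which is the kind of margin worth checking before an agent accumulates several large tool results.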

Inference cost and the self-hosting tradeoff

Self-hosting eliminates per-token vendor fees but introduces GPU cost, model-update ops, and batching complexity. MoE architectures (Llama 4, Qwen 3.6 MoE, gpt-oss, DeepSeek) activate only a fraction of their parameters per token, so compute and latency per token are lower than for dense models of comparable quality. The full parameter set must still be loaded, however, so weight memory (VRAM) scales with total parameters, not active ones. Dense models (Phi-4 14B, Granite 4.1 3B/8B) are simpler to serve. For burst or experimental workloads, use an inference provider (Together AI, Fireworks, DeepInfra, Replicate) that hosts the weights; under Apache 2.0 or MIT licenses, no additional fee is owed to the model vendor.
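A back-of-the-envelope calculation helps size hardware for an MoE model: weight memory follows total parameters, while per-token compute follows active parameters. The sketch below uses parameter counts from the comparison table. It assumes roughly 2 FLOPs per active parameter per generated token (a common forward-pass rule of thumb) and, for gpt-oss-120b, that every weight is stored at MXFP4's 4.25 bits per parameter; both are simplifications, and KV cache and activations are ignored.

```python
def weight_memory_gb(total_params_billions, bits_per_param):
    """Approximate weight memory in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return total_params_billions * bits_per_param / 8


def forward_flops_per_token(active_params_billions):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2 * active_params_billions * 1e9


# (total B params, active B params, assumed bits per stored parameter)
models = {
    "gpt-oss-120b": (117, 5.1, 4.25),         # MXFP4: 4-bit values plus a shared scale per block
    "Llama 4 Maverick": (400, 17, 16),         # assumes 16-bit weights
    "DeepSeek V4-Pro": (1600, 49, 8),          # assumes 8-bit weights
}

for name, (total_b, active_b, bits) in models.items():
    mem = weight_memory_gb(total_b, bits)
    flops = forward_flops_per_token(active_b)
    print(f"{name}: ~{mem:.0f} GB weights, ~{flops:.2e} FLOPs/token")
```

Under these assumptions gpt-oss-120b needs about 62 GB for weights, consistent with the single 80GB GPU figure in the table, while its per-token compute is closer to that of a 5B dense model.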
