Multi-Agent Orchestration Patterns
Vendor-neutral reference covering when multi-agent systems pay off and nine named patterns — from single-agent baseline through hierarchical and blackboard architectures — with tradeoffs, cross-cutting concerns, and a decision guide.
Multi-agent systems add real cost, latency, and operational complexity. The decision to use one should be driven by a concrete failure or scaling need in a single-agent design, not by the appeal of the architecture. This reference covers when to go multi-agent, the nine canonical patterns (with tradeoffs), and the cross-cutting concerns every builder encounters.
When to use multi-agent vs single-agent
Anthropic's "Building Effective Agents" (Schluntz & Zhang) frames agentic systems as either workflows — where LLMs and tools are orchestrated through predefined code paths — or agents — where the LLM dynamically directs its own process and tool use. Multi-agent adds a third dimension: multiple LLM-driven actors cooperating.
Add multi-agent only when one of these conditions holds:
- Separable subtasks: the problem cleanly decomposes into independent work units that do not require tight shared state. If subtasks are tightly coupled, coordination overhead outweighs the benefit.
- Parallelism pays: subtasks can run concurrently and the wall-clock gain justifies the added token fan-out cost. Anthropic reports that its multi-agent research system, with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents, outperformed single-agent Claude Opus 4 by 90.2% on its internal research eval. The same post notes that such systems excel especially on breadth-first queries that pursue many independent directions at once.
- Specialization: different subtasks demand different system prompts, tool sets, or even models (e.g., a cheap fast model for routing, a large model for synthesis).
- Context-window limits: a task genuinely exceeds one context window and cannot be handled by summarization or retrieval alone.
The default should be: start single-agent with tools. Add structure only when a real failure or scaling need demands it.
The nine patterns
1. Single-agent-with-tools (baseline)
One LLM with access to a toolset in a loop. The baseline against which all multi-agent patterns should be benchmarked. When to use: any task whose scope fits in one context window and that does not require parallel execution. Tradeoff: limited by context window; no parallelism.
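The loop described above can be sketched in a few lines. This is a minimal illustration, not a framework API: `fake_model`, `TOOLS`, and `run_agent` are hypothetical names, and `fake_model` is a deterministic stand-in for a real LLM call so the example runs offline.

```python
# Hypothetical tool registry: tool name -> plain Python function.
TOOLS = {
    "word_count": lambda text: str(len(text.split())),
}

def fake_model(history):
    """Stand-in for an LLM call (assumption): ask for one tool, then answer."""
    observations = [content for role, content in history if role == "tool"]
    if not observations:
        task = history[0][1]
        return {"action": "tool", "name": "word_count", "input": task}
    return {"action": "final", "answer": f"The task has {observations[-1]} words."}

def run_agent(task, model=fake_model, max_steps=5):
    """Single-agent loop: the model picks a tool or a final answer each step."""
    history = [("user", task)]
    for _ in range(max_steps):
        decision = model(history)
        if decision["action"] == "final":
            return decision["answer"]
        tool = TOOLS[decision["name"]]
        # Feed the tool result back so the next model call can use it.
        history.append(("tool", tool(decision["input"])))
    raise RuntimeError("agent exceeded max_steps without a final answer")
```

The `max_steps` bound is the only guard here; a real baseline would also record cost, latency, and success rate for later comparison.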
2. Prompt chaining / sequential pipeline
Output of step N becomes input to step N+1; each step uses a focused prompt. Named in Anthropic's "Building Effective Agents" as prompt chaining. When to use: tasks that decompose naturally into ordered stages (draft → critique → refine; extract → classify → summarize). Tradeoff: errors propagate forward; latency is additive; no parallelism.
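A sequential pipeline can be sketched as a list of step functions applied in order. The step functions below (`draft`, `critique`, `refine`) are hypothetical string transforms standing in for focused LLM prompts, and the empty-output check is an assumed example of a programmatic gate between steps.

```python
def run_chain(steps, text):
    """Apply each step to the previous step's output, failing fast on empty output."""
    for step in steps:
        text = step(text)
        # Gate between stages: stop early instead of propagating a bad result.
        if not text.strip():
            raise ValueError(f"step {step.__name__} returned empty output")
    return text

# Stand-ins for three focused prompts (assumptions, not real model calls).
def draft(topic):
    return f"Draft about {topic}."

def critique(draft_text):
    return draft_text + " Critique: add an example."

def refine(annotated):
    return annotated.replace("Critique: add an example.", "Example added.")

result = run_chain([draft, critique, refine], "routing")
```

Because each stage only sees the previous stage's output, a gate like this is the cheapest place to catch the forward error propagation named in the tradeoff.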
3. Routing (classifier dispatch)
A classifier step reads the input and routes it to the appropriate specialist agent or prompt. Named in Anthropic's "Building Effective Agents" as routing. When to use: handling diverse input types that each require different handling (customer service triage, language detection, intent classification). Tradeoff: classification errors send tasks to the wrong handler; requires maintaining multiple specialist configurations.
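A minimal sketch of classifier dispatch follows. The keyword-based `classify` function is a stand-in for an LLM or trained classifier, and the handler names and labels are hypothetical; the fallback to a general handler is one assumed way to contain misclassification.

```python
def classify(message):
    """Stand-in classifier (assumption): keyword rules instead of a model call."""
    lowered = message.lower()
    if "refund" in lowered or "charge" in lowered:
        return "billing"
    if "error" in lowered or "crash" in lowered:
        return "technical"
    return "general"

# Each label maps to a specialist; here they only tag the message.
HANDLERS = {
    "billing": lambda m: f"[billing agent] {m}",
    "technical": lambda m: f"[technical agent] {m}",
    "general": lambda m: f"[general agent] {m}",
}

def route(message):
    label = classify(message)
    # Unknown labels fall back to the general handler rather than failing.
    handler = HANDLERS.get(label, HANDLERS["general"])
    return handler(message)
```

In practice each handler would carry its own system prompt and tool set, which is the maintenance cost the tradeoff refers to.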
4. Parallelization (sectioning + voting)
Multiple agents work simultaneously, either on separate parts of a task or on the same task. Two sub-variants from Anthropic's "Building Effective Agents": sectioning (divide a task into parallel independent chunks) and voting (multiple agents independently solve the same task; majority or best answer wins). When to use: long documents that can be chunked, independent research threads, or high-stakes decisions where redundancy reduces error rate. Tradeoff: token fan-out (cost multiplies with the number of parallel agents); requires aggregation logic.
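Both sub-variants can be sketched with the standard library. In this illustration the worker is any callable standing in for an agent call, and the function names `section` and `vote` are hypothetical; majority voting via `Counter` is one simple aggregation choice among several.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def section(chunks, worker, max_workers=4):
    """Sectioning: run one worker per chunk concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))

def vote(answers):
    """Voting: return the most common answer and its share of the votes.

    Ties resolve to the answer that appeared first in the list.
    """
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

word_counts = section(["a b", "c d e"], lambda chunk: len(chunk.split()))
decision, agreement = vote(["yes", "no", "yes"])
```

The agreement share is worth keeping alongside the winner: a low share is a cheap signal that the redundant runs disagree and the result may need review.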
5. Orchestrator-workers
A lead orchestrator agent dynamically spawns, delegates to, and aggregates results from worker subagents. Named in Anthropic's "Building Effective Agents" as orchestrator-workers and demonstrated in their multi-agent research system (where the orchestrator plans the research strategy and spawns parallel search subagents). When to use: tasks with dynamic scope, where the number and type of subtasks are not known in advance. Tradeoff: orchestrator becomes a single point of failure; inter-agent communication cost; harder to debug.
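What distinguishes this pattern from fixed sectioning is that the subtask list is produced at run time. The sketch below assumes a hypothetical `plan` function (a string split standing in for an LLM planning call), a hypothetical `worker`, and an assumed `max_subtasks` cap to keep fan-out bounded.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(query):
    """Stand-in planner (assumption): a real orchestrator would ask an LLM."""
    return [part.strip() for part in query.split(" and ") if part.strip()]

def worker(subtask):
    """Stand-in subagent (assumption): returns a canned finding."""
    return f"findings for '{subtask}'"

def orchestrate(query, max_subtasks=5, max_workers=4):
    # The number of subtasks is decided here, at run time, then capped.
    subtasks = plan(query)[:max_subtasks]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, subtasks))
    # Aggregation step: merge worker outputs into one report.
    return "\n".join(f"- {s}: {r}" for s, r in zip(subtasks, results))

report = orchestrate("pricing and latency")
```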
6. Evaluator-optimizer (generator + critic loop)
One agent generates a candidate output; a second evaluates it against a rubric and returns feedback; the generator revises. Loop repeats until the evaluator is satisfied or a termination condition is met. Named in Anthropic's "Building Effective Agents" as evaluator-optimizer. When to use: tasks with a verifiable quality criterion (code that must pass tests, text that must meet a rubric). Tradeoff: requires a reliable evaluator — a weak critic produces useless loops; loop count must be bounded (termination guard mandatory).
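A bounded generator and critic loop might look like the following. `generate` and `evaluate` are hypothetical stand-ins for two model calls, and the `Verdict` shape (a pass flag plus feedback) is an assumed structured output that lets the loop stop deterministically.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    passed: bool
    feedback: str

def generate(task, previous: Optional[str], feedback: Optional[str]):
    """Stand-in generator (assumption): revises by acting on the feedback."""
    if previous is None:
        return f"Answer to: {task}."
    return f"{previous} Example: added after feedback '{feedback}'."

def evaluate(candidate):
    """Stand-in critic (assumption): the rubric is 'must contain an example'."""
    if "Example:" in candidate:
        return Verdict(True, "")
    return Verdict(False, "add an example")

def refine_loop(task, max_iters=3):
    """Return (candidate, iterations_used, passed); never loops past max_iters."""
    candidate, feedback = None, None
    for attempt in range(1, max_iters + 1):
        candidate = generate(task, candidate, feedback)
        verdict = evaluate(candidate)
        if verdict.passed:
            return candidate, attempt, True
        feedback = verdict.feedback
    return candidate, max_iters, False
```

Returning the `passed` flag alongside the last candidate lets the caller decide whether an unapproved result is still usable or should be surfaced as a failure.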
7. Hierarchical / manager-of-managers
A top-level orchestrator delegates to sub-orchestrators, each of which manages its own worker pool. Extends orchestrator-workers to multiple tiers. When to use: very large decomposable tasks where a single orchestrator would exceed context or coordination limits. Tradeoff: coordination overhead grows with depth; error propagation is harder to trace; observability becomes critical (see /resources/agent-observability).
8. Group chat / debate
Multiple agents participate in a shared conversation, each contributing from its own perspective or role. A moderator (human or LLM) synthesizes or selects the final output. Sometimes called multi-agent debate. When to use: tasks benefiting from adversarial review, brainstorming, or simulated stakeholder perspectives. Tradeoff: verbose; expensive; convergence is not guaranteed without a strong moderator or termination criterion.
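As a rough sketch, a group chat can be modeled as agents taking turns over a shared transcript, with the moderator logic reduced to a convergence check and a round limit. The agent functions here are hypothetical rule-based stand-ins for LLM participants, and "all agents gave the same message in a round" is an assumed, deliberately simple termination criterion.

```python
def debate(agents, question, max_rounds=3):
    """Round-robin chat; stops on agreement or after max_rounds.

    Returns (decision or None, rounds_used, transcript).
    """
    transcript = [("moderator", question)]
    for round_no in range(1, max_rounds + 1):
        positions = []
        for name, speak in agents.items():
            message = speak(transcript)  # every agent sees the full transcript
            transcript.append((name, message))
            positions.append(message)
        if len(set(positions)) == 1:
            return positions[0], round_no, transcript
    return None, max_rounds, transcript

def optimist(transcript):
    return "ship"

def skeptic(transcript):
    # Concedes only after seeing the "ship" argument made twice.
    seen = sum(1 for _, message in transcript if message == "ship")
    return "ship" if seen >= 2 else "hold"

decision, rounds, transcript = debate(
    {"optimist": optimist, "skeptic": skeptic}, "Ship the release?"
)
```

A `None` decision signals non-convergence, which the caller can hand to a human or LLM moderator instead of letting the conversation continue.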
9. Blackboard / shared state
Agents read from and write to a shared structured artifact (the "blackboard") — a document, database, or structured object — rather than passing messages directly. Each agent acts when its triggering conditions are met. When to use: long-running tasks where agents work asynchronously and on different parts of the same artifact (co-authoring, iterative document refinement). Tradeoff: write conflicts require locking or versioning; shared state is a single point of corruption if an agent writes bad data.
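One way to handle the write conflicts mentioned above is optimistic versioning: a write succeeds only if the writer saw the latest version. The `Blackboard` class below is a hypothetical in-memory sketch of that idea; a real deployment would more likely rely on a database's own concurrency controls.

```python
import threading

class VersionConflict(Exception):
    """Raised when a writer's view of an entry is stale."""

class Blackboard:
    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); missing keys read as version 0, value None."""
        with self._lock:
            return self._entries.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Write only if the entry is still at expected_version; return new version."""
        with self._lock:
            current, _ = self._entries.get(key, (0, None))
            if current != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current}"
                )
            self._entries[key] = (current + 1, value)
            return current + 1

board = Blackboard()
version, _ = board.read("intro")
new_version = board.write("intro", "draft A", version)
```

An agent that hits `VersionConflict` re-reads the entry and decides whether its change still applies, which keeps a stale agent from silently overwriting newer work.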
Cross-cutting concerns
Context and state sharing — choose between shared memory (blackboard/database) and message passing. Shared memory enables tight coordination but requires conflict handling. Message passing is simpler to reason about but increases latency per hop.
Handoffs vs delegation — a handoff transfers full control (the calling agent stops); delegation keeps the orchestrator in control and aggregates results. Handoffs lose context; delegation multiplies context cost.
Error propagation and partial failure — in a multi-agent pipeline, a subagent failure can silently corrupt downstream results. Design explicit error contracts: subagents must return structured success/failure signals, not just text. The orchestrator must handle partial failure (retry, degrade, or surface the gap).
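An explicit error contract can be as small as a result type with a success flag. The sketch below uses hypothetical names (`SubagentResult`, `run_subagent`, `aggregate`) and assumes a simple retry-then-report policy; the point is that failures reach the orchestrator as data rather than as missing or misleading text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SubagentResult:
    task: str
    ok: bool
    output: Optional[str] = None
    error: Optional[str] = None

def run_subagent(task, fn, retries=1):
    """Call fn(task), retrying on exceptions, and always return a structured result."""
    last_error = ""
    for _ in range(retries + 1):
        try:
            return SubagentResult(task, True, output=fn(task))
        except Exception as exc:
            last_error = f"{type(exc).__name__}: {exc}"
    return SubagentResult(task, False, error=last_error)

def aggregate(results):
    """Keep successful outputs and surface failed tasks as explicit gaps."""
    outputs = [r.output for r in results if r.ok]
    gaps = [r.task for r in results if not r.ok]
    return {"outputs": outputs, "gaps": gaps, "complete": not gaps}

def flaky(task):
    # Stand-in subagent (assumption): task "b" always times out.
    if task == "b":
        raise TimeoutError("no response")
    return task.upper()

results = [run_subagent(t, flaky) for t in ["a", "b"]]
summary = aggregate(results)
```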
Cost explosion (token fan-out) — parallelization and orchestrator-workers multiply token spend. Model the cost before deploying: N parallel subagents at M tokens each cost N×M tokens, before counting the orchestrator's own planning and aggregation tokens. A 10-subagent orchestrator-workers run can therefore cost roughly 10× or more what the single-agent baseline costs for the same task.
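The fan-out arithmetic is easy to put in a small estimator. The token figures below are made-up inputs for illustration only, and `orchestrator_tokens` is an assumed extra term for the lead agent's planning and aggregation.

```python
def fanout_tokens(n_subagents, tokens_per_subagent, orchestrator_tokens=0):
    """Total tokens for N subagents at M tokens each, plus orchestrator overhead."""
    return n_subagents * tokens_per_subagent + orchestrator_tokens

def cost_multiplier(total_tokens, baseline_tokens):
    """How many times more tokens the multi-agent run uses than the baseline."""
    return total_tokens / baseline_tokens

baseline = 20_000  # hypothetical single-agent token count
total = fanout_tokens(10, 20_000, orchestrator_tokens=15_000)
multiplier = cost_multiplier(total, baseline)
```

With these inputs the subagents alone account for 200,000 tokens, and the orchestrator term pushes the multiplier slightly above 10.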
Termination and loop guards — evaluator-optimizer and group-chat patterns can loop indefinitely without a hard stop condition. Always set a maximum iteration count; prefer an evaluator that returns a structured {pass: bool, feedback: string} output so the loop can terminate deterministically.
Observability — multi-agent runs require a shared trace_id propagated across all subagent calls. Without it, cross-agent debugging is impossible. See /resources/agent-observability for the OpenTelemetry GenAI semantic conventions and tooling.
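Propagating one trace identifier through every subagent call can be sketched without any tracing library. The code below is a plain-Python stand-in, not the OpenTelemetry API: `Span`, `new_span`, and the `sink` list are hypothetical, and a production system would use a real tracer and exporter instead.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str

def new_span(name, trace_id, parent_id, sink):
    """Record a span in the sink (a list standing in for a trace backend)."""
    span = Span(trace_id, uuid.uuid4().hex[:16], parent_id, name)
    sink.append(span)
    return span

def run_subagent(task, trace_id, parent_id, sink):
    # The subagent reuses the caller's trace_id instead of minting its own.
    new_span(f"subagent:{task}", trace_id, parent_id, sink)
    return f"result for {task}"

def run_orchestrator(tasks, sink):
    trace_id = uuid.uuid4().hex  # created once, at the root of the run
    root = new_span("orchestrator", trace_id, None, sink)
    return [run_subagent(t, trace_id, root.span_id, sink) for t in tasks]

spans = []
outputs = run_orchestrator(["search", "summarize"], spans)
```

Filtering the sink by `trace_id` then reconstructs the whole run, and `parent_id` shows which agent spawned which.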
Inter-agent trust and security — subagents are not implicitly trusted. An agent receiving instructions from an orchestrator should apply the same prompt-injection and tool-abuse mitigations as it would for user input. For A2A delegation protocols and token audience binding across agents, see /resources/mcp-vs-a2a. For the full security checklist, see /resources/agentic-security-checklist.
Decision guide
- Start single-agent. Build a single agent: one LLM with the minimum toolset that could plausibly solve the task. Measure cost, latency, and success rate.
- Identify the concrete failure. Is it a context-window limit? A parallelism gap? A quality problem that needs a critic? Identify one specific failure before adding structure.
- Apply the minimum pattern. Prompt chaining before orchestrator-workers. Evaluator-optimizer before group chat. Each added tier multiplies complexity and cost.
- Add observability first. Before scaling to multi-agent, instrument your single-agent run with traces. You will need those signals to debug the multi-agent version.
For which frameworks implement which patterns, see /resources/agent-frameworks-compared.
Verified sources
- Anthropic — Building Effective Agents (Schluntz & Zhang): https://www.anthropic.com/research/building-effective-agents
- Anthropic — How we built our multi-agent research system: https://www.anthropic.com/engineering/multi-agent-research-system