Agent Cost and Latency Optimization

Guide · updated 2026-06-15 · Markdown variant

Practitioner reference for reducing the cost and latency of production AI agents: the compounding model, token-level levers (caching, pruning), request-level levers (Batch API, parallelism), model-level levers (routing, reasoning-effort controls), and architecture-level levers (step reduction, semantic caching, code offloading).

Production agents are expensive and slow for a structural reason: agent loops multiply both cost and latency. A single model call is cheap; a 20-step agentic trajectory at 10k tokens per call is not. Optimization levers exist at four levels — token, request, model, and architecture — and the highest-ROI wins are almost always architectural.

The compounding model

For a trajectory of N model calls:

Cost ≈ tokens_per_call × price_per_token × N (fan-out multiplies spend linearly).
Latency ≈ per_call_latency × N + tool_io_latency (steps are serial by default; parallelism is the main structural escape).

Fan-out is the dominant cost driver. A pipeline that spawns 5 sub-agents each making 10 calls multiplies spend by 50× relative to a single-agent 1-call solution. Measure steps-per-task before optimizing individual calls — see /resources/agent-observability (you cannot optimize what you do not measure; track cost and latency per span).

Token-level levers

Prompt caching (provider prefix caching) is the highest-ROI single lever for agents with a stable system prompt or large reference context that repeats across calls:

Anthropic: explicit cache control via cache_control breakpoints. Cache reads cost ~10% of standard input tokens (90% discount). Cache writes cost 1.25× standard input. TTL 5 minutes (extendable). Supported across the Claude 3/4 families.
OpenAI: automatic for prompts >1 024 tokens on supported models (GPT-4o, o-series, GPT-5 family). Cache hits receive a 50% discount on input tokens. No explicit API knob required — prefix stability determines hit rate.
Google Gemini: "implicit caching" enabled by default on Gemini 2.5 models. Minimum 1 024 tokens (Flash) or 2 048 tokens (Pro). Cache hits on Gemini 2.5 receive a 75% discount on cached tokens. Explicit context caching (named cache objects) is also available with configurable TTL.

To exploit caching: put stable content (system prompt, reference documents, few-shot examples) at the top of the context; vary only the dynamic suffix. Cross-link: /resources/agent-memory-context covers prompt caching as a context-management technique.

Concise prompts and context pruning: every unnecessary token in the context costs money on every call. Remove boilerplate, truncate stale history, and summarize instead of concatenating. Sliding-window and summarization strategies are covered in /resources/agent-memory-context.

Structured outputs to avoid re-asks: a malformed tool call or unparseable JSON response forces a retry, doubling token spend for that turn. Enable provider-level constrained decoding (OpenAI strict mode, Anthropic tool_choice forcing, Gemini ANY mode) to eliminate format retries. See /resources/reliable-tool-calling.

Limit injected context: retrieve only the chunks an agent actually needs (RAG top-k), not the entire knowledge base. See /resources/rag-retrieval-for-agents.

Request-level levers

Batch API (async, 50% discount): both Anthropic and OpenAI offer an asynchronous batch processing endpoint that processes requests within a 24-hour window in exchange for a 50% discount on all token costs (input and output). Verified for both providers as of June 2026. Use for non-latency-sensitive workloads: bulk document processing, nightly eval runs, data enrichment pipelines, and offline agentic sweeps. Batch and prompt caching discounts stack independently.

Streaming for perceived latency: for user-facing agents, enable streaming responses. The model starts returning tokens immediately; time-to-first-token drops significantly even when total generation time is unchanged. This does not reduce cost but reduces perceived latency, which matters for interactive agents.

Parallelizing independent tool calls and sub-agents: when an agent step fans out to multiple independent tool calls, run them in parallel (concurrent API calls or sub-agent threads) rather than sequentially. This reduces latency from sum-of-steps to max-of-steps. Most frameworks (LangGraph, OpenAI Agents SDK, CrewAI) support parallel tool execution natively. See /resources/multi-agent-orchestration-patterns for fan-out cost analysis.

Model-level levers

Model routing / cascades (cheap model first, escalate): route simple sub-tasks to a smaller, cheaper model and escalate to a larger model only when the cheaper model signals low confidence or the task matches known complexity thresholds. A well-tuned cascade can cut average cost per task substantially with minimal quality loss. See /resources/ai-gateways-llm-routing for routing infrastructure.

Smaller / distilled models for sub-tasks: use a large model for complex reasoning and orchestration, but delegate deterministic sub-tasks (classification, extraction, translation) to smaller distilled models. Open-weight models (Llama 4 Scout, Qwen 3.6, Phi-4-mini, Granite 4.1 3B) eliminate per-token vendor fees entirely for sub-tasks you can self-host. See /resources/open-weight-models-for-agents.

Reasoning-effort controls: providers expose knobs to trade reasoning depth for cost and latency:

OpenAI: reasoning_effort parameter (values: low, medium, high, xhigh on newer models; none/minimal on some). Lower effort uses fewer reasoning tokens — faster and cheaper. Default is medium on GPT-5.5. Verified via OpenAI reasoning docs.
Anthropic: extended thinking uses budget_tokens to set the maximum reasoning token budget. Newer Claude 4.x models support type: "adaptive" (the older type: "enabled" with explicit budget_tokens is deprecated but still functional). Reducing the budget cuts thinking cost directly.

Tune reasoning effort per task category — high effort for complex multi-step decisions, low/minimal for classification or extraction.

Speculative decoding (self-hosted): an inference-level optimization where a small "draft" model generates candidate tokens that the large "verifier" model checks in parallel, rejecting tokens that deviate from its distribution. Reduces latency by 2–4× on CPU-bounded inference hardware with no quality loss. Relevant for self-hosted deployments; not controllable via hosted provider APIs. See /resources/open-weight-models-for-agents.

Architecture-level levers

Reduce steps (fewer agent turns): the single highest-leverage intervention. Every additional turn multiplies both cost and latency. Audit your trajectories for redundant turns: unnecessary clarification rounds, sequential tool calls that could be one batched call, and loops that could be replaced by a single tool with richer output. If an agent reliably completes a task in 4 turns, refactoring to 2 turns halves both cost and latency with no model change. See /resources/multi-agent-orchestration-patterns.

Semantic caching of whole responses: cache complete LLM responses for semantically equivalent queries. When a new query is within a configurable cosine-similarity threshold of a cached query, return the cached response without a model call. Tools: GPTCache (github.com/zilliztech/GPTCache), Redis with vector similarity, or LangChain's built-in semantic cache. Effective for agents with repetitive sub-tasks or high query overlap. Not effective for inherently unique or user-specific queries.

Early-exit and termination guards: add explicit completion detection so the agent stops as soon as it has a satisfactory answer rather than continuing for a fixed maximum step count. An agent that runs 20 steps when 6 would suffice wastes 14 turns of compute.

Offload deterministic work to code and tools: do not ask the model to do arithmetic, date calculations, regex matching, or JSON transformation — call a function instead. Model inference is the expensive operation; a Python function call costs microseconds. This also improves reliability: models make arithmetic errors; code does not.

Measure first

No optimization is worth implementing without measurement. Instrument every agent run with span-level token counts, cost attribution, and latency per step before tuning. The highest-cost spans are rarely where intuition points. See /resources/agent-observability.

Highest-ROI checklist (in approximate order)

Instrument traces: cost and latency per span, steps per task.
Count steps per trajectory — reduce unnecessary turns before anything else.
Enable prompt caching for stable system-prompt prefixes.
Parallelize independent tool calls and sub-agent delegations.
Route simple sub-tasks to smaller/cheaper models.
Use constrained decoding to eliminate format-retry turns.
Shift non-latency-sensitive workloads to Batch API (50% discount).
Tune reasoning_effort / budget_tokens per task category.
Prune injected context to only what the current step needs.
Add semantic caching for repeated sub-task queries.
Offload deterministic computation to tool functions, not the model.

Verified sources

Anthropic prompt caching docs: https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
Anthropic Message Batches API docs: https://docs.anthropic.com/en/docs/build-with-claude/message-batches
OpenAI prompt caching announcement: https://openai.com/index/api-prompt-caching/
OpenAI Batch API pricing: https://openai.com/api/pricing/
OpenAI reasoning effort docs: https://developers.openai.com/api/docs/guides/reasoning
Google Gemini implicit caching announcement: https://developers.googleblog.com/gemini-2-5-models-now-support-implicit-caching/
Google Gemini context caching docs: https://ai.google.dev/gemini-api/docs/caching
Anthropic extended thinking (adaptive mode / budget_tokens): https://platform.claude.com/docs/en/build-with-claude/extended-thinking
GPTCache (semantic whole-response caching): https://github.com/zilliztech/GPTCache

#cost #latency #optimization #agents #prompt-caching #batch-api #model-routing #architecture

Category: Guide