AI Gateways and LLM Routing

Reference · updated 2026-06-15 · Markdown variant

What an AI gateway is, routing strategies (failover, cost-cascade, latency, capability), the tooling landscape, the OpenAI-compatible API convention, and tradeoffs.

An AI gateway (also called an LLM proxy or LLM router) is a single reverse-proxy that sits in front of multiple model providers and exposes one unified endpoint to callers. Instead of coding provider-specific clients for OpenAI, Anthropic, Google, Bedrock, and others, every agent sends requests to the gateway; the gateway routes, retries, caches, rate-limits, and logs on behalf of the caller.

For production agents, a gateway solves five categories of problems simultaneously: provider reliability (failover and retries), cost control (budget caps, model cascades), performance (caching, load balancing), security (key management, PII redaction), and observability (centralized logging of every request — see /resources/agent-observability).

Core gateway capabilities

Unified endpoint — one POST /v1/chat/completions-compatible URL in front of many providers.
Provider failover — if provider A returns 5xx or times out, retry on provider B automatically.
Load balancing — distribute traffic across multiple deployments of the same model.
Caching — exact-match response caching for repeated prompts; semantic caching for near-duplicate queries using embedding similarity (see /resources/agent-memory-context).
Rate limiting and budget controls — per-key, per-team, or per-model spend caps and request-per-minute limits enforced before upstream charges accrue.
Key management — virtual keys issued to callers; the gateway holds the real provider keys.
Observability — every request logged with latency, cost, tokens, and model used (see /resources/agent-observability).

Routing strategies

Provider failover — primary provider first; on error (5xx, timeout) automatically retry on one or more fallback providers. Protects uptime without code changes in the agent.

Latency-based routing — measure provider response time and weight traffic toward the fastest responder. Useful when several providers offer the same model (e.g., via OpenRouter).

Cost-based / model cascade — route the request to the cheapest model capable of handling it. A classifier or confidence score determines whether to escalate to a more capable (and expensive) model. Also called "mixture-of-agents" or "model cascade." RouteLLM (see below) implements this as a standalone router.

Capability routing — send image inputs to a vision-capable model, audio to a speech-capable model, and code to a code-specialized model, based on detected input type or an explicit hint in the request.

The OpenAI-compatible API convention

Most gateways and aggregators expose the POST /v1/chat/completions endpoint shape (request: model, messages, temperature; response: choices[].message). This is a de-facto industry standard, not a ratified spec — it originated with OpenAI and was copied by most providers. The newer POST /v1/responses shape (OpenAI Responses API, released March 2025) is increasingly supported but adoption is not yet universal. Because clients target this shape, swapping the gateway's upstream provider requires no code change on the agent side.

Tooling landscape

LiteLLM — open-source Python SDK and self-hosted proxy (MIT license, Apache 2.0 for the enterprise proxy). Normalizes 100+ providers (140+ models) behind one OpenAI-compatible endpoint. Features: virtual keys, budget caps, fallbacks, load balancing, semantic caching, logging integrations. 43k+ GitHub stars as of June 2026. Self-host via Docker; enterprise SaaS tier available. Source: github.com/BerriAI/litellm.

OpenRouter — hosted SaaS aggregator (no self-host option). One OpenAI-compatible API reaching 400+ models from 60+ providers. Passes through provider pricing; charges no markup on requests. Includes an Auto Router (powered by NotDiamond) that selects the best model per prompt automatically. Useful for latency-based and cost-based routing without running infrastructure. Source: openrouter.ai/docs.

Portkey — open-source AI gateway (Apache 2.0, gateway 2.0 fully open-sourced March 2026; github.com/Portkey-AI/gateway) with a managed SaaS cloud option. Routes to 250+ providers; includes semantic caching, guardrails, RBAC, observability, and MCP gateway support. Palo Alto Networks announced its acquisition of Portkey in April 2026 (expected to close later in 2026). Source: portkey.ai/docs.

Cloudflare AI Gateway — hosted SaaS (no self-host; Cloudflare-infrastructure only). Free tier available. One URL change routes any supported provider through Cloudflare's edge. Features: caching, rate limiting, logging, guardrails, request retry/fallback. Added a unified REST API (api.cloudflare.com) in May 2026. Source: developers.cloudflare.com/ai-gateway.

Helicone — open-source LLM observability platform and AI gateway (github.com/Helicone/helicone; YC W23). Integration model: change one URL. Provides request logging, cost tracking, caching, rate limiting, and agent trace inspection. Self-host via Docker/Helm or use the SaaS cloud (10k requests/month free). Actively maintained as of June 2026.

Kong AI Gateway — AI routing and governance layer built on top of Kong Gateway (the established API gateway). Adds AI-specific plugins: AI Proxy, AI Proxy Advanced (multi-provider load balancing), PII redaction, allow/deny lists for prompts. Can govern LLM, MCP, and A2A traffic in one plane. Open-source core (Apache 2.0); enterprise tier available. Source: developer.konghq.com/ai-gateway.

RouteLLM — open-source model router framework (Apache 2.0; github.com/lm-sys/routellm; by LMSYS / Anyscale). Trains classifiers on human preference data to predict which model will produce better output. Routes simpler queries to cheap models, complex ones to capable models. Exposes an OpenAI-compatible server. Not a full gateway (no caching, key management, etc.) — a focused cost-based routing layer.

Tradeoffs

Added latency — the gateway is an extra network hop; well-operated gateways add <10 ms but under load this can grow.
New failure point — if the gateway is down, all model calls fail. Self-hosted gateways require you to operate and scale them; SaaS gateways transfer that risk to the vendor.
Data privacy — all prompts and responses pass through the gateway operator's infrastructure. For sensitive workloads, prefer self-hosted gateways or providers with strong data-processing agreements.
Vendor lock-in vs self-hosting — SaaS gateways are operationally lightweight but create a dependency; self-hosted gateways (LiteLLM, Portkey, Kong) give full control at the cost of infrastructure work.

Cross-links: gateway logs are a primary telemetry source for agent observability (/resources/agent-observability); semantic caching at the gateway layer is a form of context reuse (/resources/agent-memory-context); routing to self-hosted models is a key gateway use case (/resources/open-weight-models-for-agents).

Verified sources

LiteLLM GitHub (BerriAI): https://github.com/BerriAI/litellm
LiteLLM proxy docs: https://docs.litellm.ai/docs/simple_proxy
OpenRouter docs: https://openrouter.ai/docs
OpenRouter Auto Router docs: https://openrouter.ai/docs/guides/routing/routers/auto-router
Portkey gateway GitHub (Portkey-AI): https://github.com/Portkey-AI/gateway
Portkey docs: https://portkey.ai/docs
Cloudflare AI Gateway overview: https://developers.cloudflare.com/ai-gateway/
Cloudflare AI Gateway REST API changelog (May 2026): https://developers.cloudflare.com/changelog/post/2026-05-21-rest-api/
Helicone GitHub: https://github.com/Helicone/helicone
Kong AI Gateway docs: https://developer.konghq.com/ai-gateway/
RouteLLM GitHub (lm-sys): https://github.com/lm-sys/routellm

#gateway #routing #llm #proxy #infrastructure #agents #openai-compatible

Category: Reference