ChangeGamer

← All resources

AI Gateways and LLM Routing

Reference · updated 2026-06-15 · Markdown variant

What an AI gateway is, routing strategies (failover, cost-cascade, latency, capability), the tooling landscape, the OpenAI-compatible API convention, and tradeoffs.


An AI gateway (also called an LLM proxy or LLM router) is a single reverse-proxy that sits in front of multiple model providers and exposes one unified endpoint to callers. Instead of coding provider-specific clients for OpenAI, Anthropic, Google, Bedrock, and others, every agent sends requests to the gateway; the gateway routes, retries, caches, rate-limits, and logs on behalf of the caller.

For production agents, a gateway solves five categories of problems simultaneously: provider reliability (failover and retries), cost control (budget caps, model cascades), performance (caching, load balancing), security (key management, PII redaction), and observability (centralized logging of every request — see /resources/agent-observability).

Core gateway capabilities

Routing strategies

Provider failover — primary provider first; on error (5xx, timeout) automatically retry on one or more fallback providers. Protects uptime without code changes in the agent.

Latency-based routing — measure provider response time and weight traffic toward the fastest responder. Useful when several providers offer the same model (e.g., via OpenRouter).

Cost-based / model cascade — route the request to the cheapest model capable of handling it. A classifier or confidence score determines whether to escalate to a more capable (and expensive) model. Also called "mixture-of-agents" or "model cascade." RouteLLM (see below) implements this as a standalone router.

Capability routing — send image inputs to a vision-capable model, audio to a speech-capable model, and code to a code-specialized model, based on detected input type or an explicit hint in the request.

The OpenAI-compatible API convention

Most gateways and aggregators expose the POST /v1/chat/completions endpoint shape (request: model, messages, temperature; response: choices[].message). This is a de-facto industry standard, not a ratified spec — it originated with OpenAI and was copied by most providers. The newer POST /v1/responses shape (OpenAI Responses API, released March 2025) is increasingly supported but adoption is not yet universal. Because clients target this shape, swapping the gateway's upstream provider requires no code change on the agent side.

Tooling landscape

LiteLLM — open-source Python SDK and self-hosted proxy (MIT license, Apache 2.0 for the enterprise proxy). Normalizes 100+ providers (140+ models) behind one OpenAI-compatible endpoint. Features: virtual keys, budget caps, fallbacks, load balancing, semantic caching, logging integrations. 43k+ GitHub stars as of June 2026. Self-host via Docker; enterprise SaaS tier available. Source: github.com/BerriAI/litellm.

OpenRouter — hosted SaaS aggregator (no self-host option). One OpenAI-compatible API reaching 400+ models from 60+ providers. Passes through provider pricing; charges no markup on requests. Includes an Auto Router (powered by NotDiamond) that selects the best model per prompt automatically. Useful for latency-based and cost-based routing without running infrastructure. Source: openrouter.ai/docs.

Portkey — open-source AI gateway (Apache 2.0, gateway 2.0 fully open-sourced March 2026; github.com/Portkey-AI/gateway) with a managed SaaS cloud option. Routes to 250+ providers; includes semantic caching, guardrails, RBAC, observability, and MCP gateway support. Palo Alto Networks announced its acquisition of Portkey in April 2026 (expected to close later in 2026). Source: portkey.ai/docs.

Cloudflare AI Gateway — hosted SaaS (no self-host; Cloudflare-infrastructure only). Free tier available. One URL change routes any supported provider through Cloudflare's edge. Features: caching, rate limiting, logging, guardrails, request retry/fallback. Added a unified REST API (api.cloudflare.com) in May 2026. Source: developers.cloudflare.com/ai-gateway.

Helicone — open-source LLM observability platform and AI gateway (github.com/Helicone/helicone; YC W23). Integration model: change one URL. Provides request logging, cost tracking, caching, rate limiting, and agent trace inspection. Self-host via Docker/Helm or use the SaaS cloud (10k requests/month free). Actively maintained as of June 2026.

Kong AI Gateway — AI routing and governance layer built on top of Kong Gateway (the established API gateway). Adds AI-specific plugins: AI Proxy, AI Proxy Advanced (multi-provider load balancing), PII redaction, allow/deny lists for prompts. Can govern LLM, MCP, and A2A traffic in one plane. Open-source core (Apache 2.0); enterprise tier available. Source: developer.konghq.com/ai-gateway.

RouteLLM — open-source model router framework (Apache 2.0; github.com/lm-sys/routellm; by LMSYS / Anyscale). Trains classifiers on human preference data to predict which model will produce better output. Routes simpler queries to cheap models, complex ones to capable models. Exposes an OpenAI-compatible server. Not a full gateway (no caching, key management, etc.) — a focused cost-based routing layer.

Tradeoffs

Cross-links: gateway logs are a primary telemetry source for agent observability (/resources/agent-observability); semantic caching at the gateway layer is a form of context reuse (/resources/agent-memory-context); routing to self-hosted models is a key gateway use case (/resources/open-weight-models-for-agents).

Verified sources

#gateway #routing #llm #proxy #infrastructure #agents #openai-compatible

Category: Reference