# Deploying and Serving LLMs for Agents

> Serving-stack reference for teams self-hosting open-weight models for agents: production inference servers, local/dev runtimes, managed GPU endpoints, and key serving concepts — with decision guidance by load profile and verified sources.

Category: Reference · Updated: 2026-06-16 · Tags: llm, inference, serving, vllm, sglang, ollama, open-weight, gpu, infrastructure, agents
Canonical: https://changegamer.ai/resources/deploying-serving-llms

Agents make many sequential and parallel model calls. Serving-stack choices determine per-call latency, throughput, cost, and how easily existing OpenAI-compatible agent code can route to self-hosted models. The landscape splits into three tiers: production inference servers you self-host on GPUs, local/dev runtimes for development and edge, and managed/serverless GPU endpoints where someone else runs the hardware.

## Why the serving choice matters for agents

An agent making 20 sequential tool calls at 2 s per call waits 40 s end-to-end. Serving decisions affect all three cost axes: latency per call (time-to-first-token, token throughput), cost (per-token fees vs amortized GPU cost), and integration effort (OpenAI-compatible endpoints drop in; non-compatible ones require adapter code).

Key distinction: agents need **continuous/in-flight batching** — the ability to start processing a new request before the previous one finishes generating. Without it, concurrent agent calls queue behind one another and throughput collapses.

## Production inference servers (self-hosted)

**vLLM** (Apache 2.0, github.com/vllm-project/vllm) — the dominant open-source LLM serving framework. Core innovations: **PagedAttention** (paged KV-cache management borrowed from OS virtual memory — eliminates memory fragmentation and allows larger batches) and **continuous batching** (new requests join the batch mid-flight without waiting for a full batch to complete). Ships an **OpenAI-compatible HTTP server** out of the box. Supports tensor parallelism across multiple GPUs, speculative decoding, and structured output via XGrammar. 83k+ GitHub stars as of June 2026. Apache 2.0.

**SGLang** (Apache 2.0, github.com/sgl-project/sglang) — high-performance serving framework from UC Berkeley / LMSYS. Core innovation: **RadixAttention** — a radix-tree data structure for automatic, fine-grained **prefix/KV-cache reuse** across requests that share common prefixes (system prompts, few-shot examples, RAG context). Delivers up to 6x higher throughput than alternatives on workloads with shared prefixes. OpenAI-compatible endpoint. v0.5.8 (January 2026); powers 400k+ GPUs in production at xAI, NVIDIA, AMD, LinkedIn.

**Hugging Face TGI** (Apache 2.0, github.com/huggingface/text-generation-inference) — Rust + Python + gRPC inference server used by Hugging Face in production for the Inference API and Hugging Chat. Features continuous batching, tensor parallelism, flash attention, and quantization. OpenAI-compatible Messages API (`/v1/chat/completions`). Note: as of March 2026 TGI is in maintenance mode — Hugging Face recommends vLLM or SGLang for new production deployments.

**NVIDIA TensorRT-LLM + Triton** (Apache 2.0, github.com/NVIDIA/TensorRT-LLM) — NVIDIA's Python API for compiling LLMs into optimized TensorRT engines for NVIDIA GPUs, paired with NVIDIA Triton Inference Server (now part of the NVIDIA Dynamo platform as of March 2025) for serving. Key optimizations: kernel fusion, FP8/INT4 quantization, in-flight batching, and paged KV-caching. Highest throughput on NVIDIA hardware; highest ops complexity. Used by Baseten in production.

**LMDeploy** (Apache 2.0, github.com/InternLM/lmdeploy) — toolkit from the InternLM team for compressing, deploying, and serving LLMs. Two engines: **TurboMind** (C++/CUDA, maximum performance) and **PyTorch** (pure Python, easier to extend). OpenAI-compatible API server via `api_server`. Strong performance on vision-language models. v0.13 (June 2026).

## Local / dev runtimes

**Ollama** (MIT, ollama.com) — the simplest way to run open-weight models locally. CLI and REST API (OpenAI-compatible at `http://localhost:11434/v1`). One command to pull and run a model; handles quantization, GPU detection, and memory management automatically. macOS, Linux, Windows. v0.22.1 (April 2026). On Apple Silicon, Ollama is migrating its inference backend to MLX (announced March 2026, currently in preview).

**llama.cpp** (MIT, github.com/ggml-org/llama.cpp) — the foundational C/C++ LLM inference library. Introduced the **GGUF** model format (all weights + metadata in one portable file). Runs on CPU (with SIMD optimization), NVIDIA GPUs, AMD GPUs, Apple Silicon Metal, and edge hardware. Supports 1.5-bit through 8-bit quantization. Grammar-based constrained generation (GBNF) for structured outputs (see /resources/reliable-tool-calling). Ships an OpenAI-compatible HTTP server. The engine inside LM Studio and the predecessor to many production stacks.

**LM Studio** (lmstudio.ai) — cross-platform GUI application (macOS, Windows, Linux) that wraps llama.cpp and MLX backends behind a model browser, chat interface, and OpenAI-compatible local server. Supports running GGUF and MLX models simultaneously. v0.4.0 (January 2026) added parallel requests with continuous batching and a headless server mode. Free for personal use. Best fit: dev and prototyping; not designed for multi-tenant production.

**MLX** (MIT, github.com/ml-explore/mlx) — Apple's array framework for Apple Silicon, built around the unified memory architecture (CPU and GPU share the same DRAM). **MLX LM** (the companion package) enables LLM text generation and fine-tuning on-device. MLX leads llama.cpp by 20–87% on models under 14B on Apple Silicon where inference is compute-bound. Apple established MLX as the preferred Apple Silicon inference framework at WWDC 2025.

## Managed / serverless GPU and model endpoints

### Per-token APIs (serverless, shared infrastructure)

- **Together AI** (together.ai) — 200+ open-weight models via a unified serverless API. OpenAI-compatible endpoint. Also offers dedicated GPU endpoints, fine-tuning, and batch inference. Best for variable or bursty loads on open-source models.
- **Fireworks AI** (fireworks.ai) — serverless inference for 400+ models with strong latency optimization (P50 TTFT ~150 ms on Llama 3.3 70B). Per-token pricing; on-demand dedicated GPU endpoints on A100/H100/H200/B200 with per-second billing. Fine-tuning to serverless endpoint.
- **Replicate** (replicate.com) — model marketplace + per-second GPU billing; 50,000+ community models plus curated official models. Convenience-first; acquired by Cloudflare (November 2025), continues as an independent brand with planned Workers AI integration.

### Serverless GPU + code-defined infrastructure

- **Modal** (modal.com) — Python-native serverless GPU cloud. One decorator turns a Python function into a serverless GPU endpoint. GPU memory snapshots (alpha) enable fast model cold starts. gVisor sandbox isolation. $87M Series B (October 2025). Used by OpenAI Agents SDK as an official sandbox execution environment.
- **RunPod** (runpod.io) — GPU cloud with serverless endpoints (pay-per-second) and persistent pods. Competitive per-second pricing; FlashBoot for reduced cold starts. Good for bursty inference workloads and model experimentation.
- **Baseten** (baseten.co) — production-grade model serving platform. TensorRT-LLM Engine Builder for automatic model compilation; Triton-based serving; supports LoRA multi-adapter serving and speculative decoding. Targets teams that need maximum NVIDIA GPU utilization.
- **Anyscale** (anyscale.com) — LLM inference on Ray Serve + vLLM. Serverless and dedicated endpoints for popular open-source models. Strong for teams already on the Ray ecosystem.

### Hyperscaler managed endpoints

- **Amazon Bedrock** (aws.amazon.com/bedrock) — managed foundation model layer on AWS. Per-token API access to open-weight models (Llama 4, Mistral, Titan, and others) plus proprietary models via AWS IAM and CloudTrail governance.
- **Azure AI Foundry** (azure.microsoft.com) — Microsoft's managed model platform (formerly Azure AI Studio). Serverless per-token APIs and dedicated managed compute for open-weight models (Llama, Mistral, Phi) plus OpenAI and partner models. Azure-native identity and governance.
- **Google Vertex AI Model Garden** (cloud.google.com/vertex-ai) — Google Cloud managed model endpoints. Per-token APIs for Gemini, open-weight models (Llama, Mistral, Gemma), and partner models. GCP IAM governance.

## Key serving concepts

- **Continuous / in-flight batching** — new requests join the active generation batch immediately, without waiting for a full batch to complete. Essential for agent workloads with concurrent calls. Implemented in vLLM, SGLang, TGI, TensorRT-LLM, and LMDeploy.
- **Paged attention / KV-cache management** — allocates KV-cache in fixed-size pages to eliminate fragmentation, enabling larger effective batch sizes. vLLM's PagedAttention is the canonical implementation; SGLang's RadixAttention extends this with prefix sharing.
- **Prefix / prompt caching** — reuses KV-cache entries for shared prompt prefixes across requests. High-value for agents with identical system prompts or RAG context blocks repeated across calls. See /resources/agent-cost-latency-optimization for provider-level caching.
- **Quantization** — reduces model weight precision (FP16 → INT8 → INT4 → FP8, etc.) to cut VRAM and increase throughput at a small quality cost. GGUF quantization in llama.cpp; AWQ/GPTQ/FP8 in vLLM and TGI. See /resources/open-weight-models-for-agents.
- **Tensor / pipeline parallelism** — splits a model across multiple GPUs. Tensor parallelism splits weight matrices; pipeline parallelism splits layers. Required for models that exceed single-GPU VRAM. All production servers support tensor parallelism.
- **Structured output / grammar support** — constrains token generation to a JSON Schema or BNF grammar at the serving layer. vLLM and SGLang use XGrammar; llama.cpp uses GBNF. See /resources/reliable-tool-calling.

## Decision guidance

| Scenario | Recommended tier |
|---|---|
| Variable / bursty agent load; no GPU ops | Per-token API (Together AI, Fireworks, Bedrock) |
| High steady-volume or data-residency requirement | Self-hosted vLLM or SGLang on dedicated GPU |
| NVIDIA-maximum performance (H100/H200/B200) | TensorRT-LLM + Triton (via Baseten or self-hosted) |
| Dev / local testing | Ollama or LM Studio |
| Apple Silicon on-device / edge | MLX LM or Ollama (MLX backend preview) |
| Low-resource / CPU-only / air-gapped | llama.cpp |

Cross-links: for gateway/routing across providers see /resources/ai-gateways-llm-routing; for cost and latency optimization see /resources/agent-cost-latency-optimization; for open-weight model selection see /resources/open-weight-models-for-agents.

## Verified sources

- vLLM GitHub (Apache 2.0): https://github.com/vllm-project/vllm
- vLLM docs: https://docs.vllm.ai/en/latest/
- SGLang GitHub (Apache 2.0): https://github.com/sgl-project/sglang
- SGLang docs: https://docs.sglang.io/
- Hugging Face TGI GitHub (Apache 2.0): https://github.com/huggingface/text-generation-inference
- TGI Messages API (OpenAI-compatible): https://huggingface.co/docs/text-generation-inference/en/messages_api
- NVIDIA TensorRT-LLM GitHub (Apache 2.0): https://github.com/NVIDIA/TensorRT-LLM
- TensorRT-LLM Triton backend: https://github.com/triton-inference-server/tensorrtllm_backend
- LMDeploy GitHub (Apache 2.0): https://github.com/InternLM/lmdeploy
- LMDeploy OpenAI-compatible server: https://lmdeploy.readthedocs.io/en/latest/llm/api_server.html
- Ollama blog: https://ollama.com/blog
- llama.cpp GitHub (MIT): https://github.com/ggml-org/llama.cpp
- LM Studio homepage: https://lmstudio.ai/
- MLX GitHub (MIT, Apple): https://github.com/ml-explore/mlx
- Together AI serverless inference: https://docs.together.ai/docs/serverless/models
- Fireworks AI homepage: https://fireworks.ai/
- Baseten inference stack guide: https://www.baseten.co/resources/guide/the-baseten-inference-stack/
- Modal homepage: https://modal.com/
- RunPod homepage: https://www.runpod.io/
- Replicate homepage: https://replicate.com/
- Anyscale LLM inference: https://www.anyscale.com/use-case/llm-online-inference
- Amazon Bedrock homepage: https://aws.amazon.com/bedrock/
- Azure AI Foundry homepage: https://azure.microsoft.com/en-us/products/ai-foundry/
- Google Vertex AI Model Garden: https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/explore-models
