ChangeGamer

← All resources

Deploying and Serving LLMs for Agents

Reference · updated 2026-06-16 · Markdown variant

Serving-stack reference for teams self-hosting open-weight models for agents: production inference servers, local/dev runtimes, managed GPU endpoints, and key serving concepts — with decision guidance by load profile and verified sources.


Agents make many sequential and parallel model calls. Serving-stack choices determine per-call latency, throughput, cost, and how easily existing OpenAI-compatible agent code can route to self-hosted models. The landscape splits into three tiers: production inference servers you self-host on GPUs, local/dev runtimes for development and edge, and managed/serverless GPU endpoints where someone else runs the hardware.

Why the serving choice matters for agents

An agent making 20 sequential tool calls at 2 s per call waits 40 s end-to-end. Serving decisions affect all three cost axes: latency per call (time-to-first-token, token throughput), cost (per-token fees vs amortized GPU cost), and integration effort (OpenAI-compatible endpoints drop in; non-compatible ones require adapter code).

Key distinction: agents need continuous/in-flight batching — the ability to start processing a new request before the previous one finishes generating. Without it, concurrent agent calls queue behind one another and throughput collapses.

Production inference servers (self-hosted)

vLLM (Apache 2.0, github.com/vllm-project/vllm) — the dominant open-source LLM serving framework. Core innovations: PagedAttention (paged KV-cache management borrowed from OS virtual memory — eliminates memory fragmentation and allows larger batches) and continuous batching (new requests join the batch mid-flight without waiting for a full batch to complete). Ships an OpenAI-compatible HTTP server out of the box. Supports tensor parallelism across multiple GPUs, speculative decoding, and structured output via XGrammar. 83k+ GitHub stars as of June 2026. Apache 2.0.

SGLang (Apache 2.0, github.com/sgl-project/sglang) — high-performance serving framework from UC Berkeley / LMSYS. Core innovation: RadixAttention — a radix-tree data structure for automatic, fine-grained prefix/KV-cache reuse across requests that share common prefixes (system prompts, few-shot examples, RAG context). Delivers up to 6x higher throughput than alternatives on workloads with shared prefixes. OpenAI-compatible endpoint. v0.5.8 (January 2026); powers 400k+ GPUs in production at xAI, NVIDIA, AMD, LinkedIn.

Hugging Face TGI (Apache 2.0, github.com/huggingface/text-generation-inference) — Rust + Python + gRPC inference server used by Hugging Face in production for the Inference API and Hugging Chat. Features continuous batching, tensor parallelism, flash attention, and quantization. OpenAI-compatible Messages API (/v1/chat/completions). Note: as of March 2026 TGI is in maintenance mode — Hugging Face recommends vLLM or SGLang for new production deployments.

NVIDIA TensorRT-LLM + Triton (Apache 2.0, github.com/NVIDIA/TensorRT-LLM) — NVIDIA's Python API for compiling LLMs into optimized TensorRT engines for NVIDIA GPUs, paired with NVIDIA Triton Inference Server (now part of the NVIDIA Dynamo platform as of March 2025) for serving. Key optimizations: kernel fusion, FP8/INT4 quantization, in-flight batching, and paged KV-caching. Highest throughput on NVIDIA hardware; highest ops complexity. Used by Baseten in production.

LMDeploy (Apache 2.0, github.com/InternLM/lmdeploy) — toolkit from the InternLM team for compressing, deploying, and serving LLMs. Two engines: TurboMind (C++/CUDA, maximum performance) and PyTorch (pure Python, easier to extend). OpenAI-compatible API server via api_server. Strong performance on vision-language models. v0.13 (June 2026).

Local / dev runtimes

Ollama (MIT, ollama.com) — the simplest way to run open-weight models locally. CLI and REST API (OpenAI-compatible at http://localhost:11434/v1). One command to pull and run a model; handles quantization, GPU detection, and memory management automatically. macOS, Linux, Windows. v0.22.1 (April 2026). On Apple Silicon, Ollama is migrating its inference backend to MLX (announced March 2026, currently in preview).

llama.cpp (MIT, github.com/ggml-org/llama.cpp) — the foundational C/C++ LLM inference library. Introduced the GGUF model format (all weights + metadata in one portable file). Runs on CPU (with SIMD optimization), NVIDIA GPUs, AMD GPUs, Apple Silicon Metal, and edge hardware. Supports 1.5-bit through 8-bit quantization. Grammar-based constrained generation (GBNF) for structured outputs (see /resources/reliable-tool-calling). Ships an OpenAI-compatible HTTP server. The engine inside LM Studio and the predecessor to many production stacks.

LM Studio (lmstudio.ai) — cross-platform GUI application (macOS, Windows, Linux) that wraps llama.cpp and MLX backends behind a model browser, chat interface, and OpenAI-compatible local server. Supports running GGUF and MLX models simultaneously. v0.4.0 (January 2026) added parallel requests with continuous batching and a headless server mode. Free for personal use. Best fit: dev and prototyping; not designed for multi-tenant production.

MLX (MIT, github.com/ml-explore/mlx) — Apple's array framework for Apple Silicon, built around the unified memory architecture (CPU and GPU share the same DRAM). MLX LM (the companion package) enables LLM text generation and fine-tuning on-device. MLX leads llama.cpp by 20–87% on models under 14B on Apple Silicon where inference is compute-bound. Apple established MLX as the preferred Apple Silicon inference framework at WWDC 2025.

Managed / serverless GPU and model endpoints

Per-token APIs (serverless, shared infrastructure)

Serverless GPU + code-defined infrastructure

Hyperscaler managed endpoints

Key serving concepts

Decision guidance

Scenario Recommended tier
Variable / bursty agent load; no GPU ops Per-token API (Together AI, Fireworks, Bedrock)
High steady-volume or data-residency requirement Self-hosted vLLM or SGLang on dedicated GPU
NVIDIA-maximum performance (H100/H200/B200) TensorRT-LLM + Triton (via Baseten or self-hosted)
Dev / local testing Ollama or LM Studio
Apple Silicon on-device / edge MLX LM or Ollama (MLX backend preview)
Low-resource / CPU-only / air-gapped llama.cpp

Cross-links: for gateway/routing across providers see /resources/ai-gateways-llm-routing; for cost and latency optimization see /resources/agent-cost-latency-optimization; for open-weight model selection see /resources/open-weight-models-for-agents.

Verified sources

#llm #inference #serving #vllm #sglang #ollama #open-weight #gpu #infrastructure #agents

Category: Reference