Deploying and Serving LLMs for Agents

Reference · updated 2026-06-16 · Markdown variant

Serving-stack reference for teams self-hosting open-weight models for agents: production inference servers, local/dev runtimes, managed GPU endpoints, and key serving concepts — with decision guidance by load profile and verified sources.

Agents make many sequential and parallel model calls. Serving-stack choices determine per-call latency, throughput, cost, and how easily existing OpenAI-compatible agent code can route to self-hosted models. The landscape splits into three tiers: production inference servers you self-host on GPUs, local/dev runtimes for development and edge, and managed/serverless GPU endpoints where someone else runs the hardware.

Why the serving choice matters for agents

An agent making 20 sequential tool calls at 2 s per call waits 40 s end-to-end. Serving decisions affect all three cost axes: latency per call (time-to-first-token, token throughput), cost (per-token fees vs amortized GPU cost), and integration effort (OpenAI-compatible endpoints drop in; non-compatible ones require adapter code).

Key distinction: agents need continuous/in-flight batching — the ability to start processing a new request before the previous one finishes generating. Without it, concurrent agent calls queue behind one another and throughput collapses.

Production inference servers (self-hosted)

vLLM (Apache 2.0, github.com/vllm-project/vllm) — the dominant open-source LLM serving framework. Core innovations: PagedAttention (paged KV-cache management borrowed from OS virtual memory — eliminates memory fragmentation and allows larger batches) and continuous batching (new requests join the batch mid-flight without waiting for a full batch to complete). Ships an OpenAI-compatible HTTP server out of the box. Supports tensor parallelism across multiple GPUs, speculative decoding, and structured output via XGrammar. 83k+ GitHub stars as of June 2026. Apache 2.0.

SGLang (Apache 2.0, github.com/sgl-project/sglang) — high-performance serving framework from UC Berkeley / LMSYS. Core innovation: RadixAttention — a radix-tree data structure for automatic, fine-grained prefix/KV-cache reuse across requests that share common prefixes (system prompts, few-shot examples, RAG context). Delivers up to 6x higher throughput than alternatives on workloads with shared prefixes. OpenAI-compatible endpoint. v0.5.8 (January 2026); powers 400k+ GPUs in production at xAI, NVIDIA, AMD, LinkedIn.

Hugging Face TGI (Apache 2.0, github.com/huggingface/text-generation-inference) — Rust + Python + gRPC inference server used by Hugging Face in production for the Inference API and Hugging Chat. Features continuous batching, tensor parallelism, flash attention, and quantization. OpenAI-compatible Messages API (/v1/chat/completions). Note: as of March 2026 TGI is in maintenance mode — Hugging Face recommends vLLM or SGLang for new production deployments.

NVIDIA TensorRT-LLM + Triton (Apache 2.0, github.com/NVIDIA/TensorRT-LLM) — NVIDIA's Python API for compiling LLMs into optimized TensorRT engines for NVIDIA GPUs, paired with NVIDIA Triton Inference Server (now part of the NVIDIA Dynamo platform as of March 2025) for serving. Key optimizations: kernel fusion, FP8/INT4 quantization, in-flight batching, and paged KV-caching. Highest throughput on NVIDIA hardware; highest ops complexity. Used by Baseten in production.

LMDeploy (Apache 2.0, github.com/InternLM/lmdeploy) — toolkit from the InternLM team for compressing, deploying, and serving LLMs. Two engines: TurboMind (C++/CUDA, maximum performance) and PyTorch (pure Python, easier to extend). OpenAI-compatible API server via api_server. Strong performance on vision-language models. v0.13 (June 2026).

Local / dev runtimes

Ollama (MIT, ollama.com) — the simplest way to run open-weight models locally. CLI and REST API (OpenAI-compatible at http://localhost:11434/v1). One command to pull and run a model; handles quantization, GPU detection, and memory management automatically. macOS, Linux, Windows. v0.22.1 (April 2026). On Apple Silicon, Ollama is migrating its inference backend to MLX (announced March 2026, currently in preview).

llama.cpp (MIT, github.com/ggml-org/llama.cpp) — the foundational C/C++ LLM inference library. Introduced the GGUF model format (all weights + metadata in one portable file). Runs on CPU (with SIMD optimization), NVIDIA GPUs, AMD GPUs, Apple Silicon Metal, and edge hardware. Supports 1.5-bit through 8-bit quantization. Grammar-based constrained generation (GBNF) for structured outputs (see /resources/reliable-tool-calling). Ships an OpenAI-compatible HTTP server. The engine inside LM Studio and the predecessor to many production stacks.

LM Studio (lmstudio.ai) — cross-platform GUI application (macOS, Windows, Linux) that wraps llama.cpp and MLX backends behind a model browser, chat interface, and OpenAI-compatible local server. Supports running GGUF and MLX models simultaneously. v0.4.0 (January 2026) added parallel requests with continuous batching and a headless server mode. Free for personal use. Best fit: dev and prototyping; not designed for multi-tenant production.

MLX (MIT, github.com/ml-explore/mlx) — Apple's array framework for Apple Silicon, built around the unified memory architecture (CPU and GPU share the same DRAM). MLX LM (the companion package) enables LLM text generation and fine-tuning on-device. MLX leads llama.cpp by 20–87% on models under 14B on Apple Silicon where inference is compute-bound. Apple established MLX as the preferred Apple Silicon inference framework at WWDC 2025.

Managed / serverless GPU and model endpoints

Per-token APIs (serverless, shared infrastructure)

Together AI (together.ai) — 200+ open-weight models via a unified serverless API. OpenAI-compatible endpoint. Also offers dedicated GPU endpoints, fine-tuning, and batch inference. Best for variable or bursty loads on open-source models.
Fireworks AI (fireworks.ai) — serverless inference for 400+ models with strong latency optimization (P50 TTFT ~150 ms on Llama 3.3 70B). Per-token pricing; on-demand dedicated GPU endpoints on A100/H100/H200/B200 with per-second billing. Fine-tuning to serverless endpoint.
Replicate (replicate.com) — model marketplace + per-second GPU billing; 50,000+ community models plus curated official models. Convenience-first; acquired by Cloudflare (November 2025), continues as an independent brand with planned Workers AI integration.

Serverless GPU + code-defined infrastructure

Modal (modal.com) — Python-native serverless GPU cloud. One decorator turns a Python function into a serverless GPU endpoint. GPU memory snapshots (alpha) enable fast model cold starts. gVisor sandbox isolation. $87M Series B (October 2025). Used by OpenAI Agents SDK as an official sandbox execution environment.
RunPod (runpod.io) — GPU cloud with serverless endpoints (pay-per-second) and persistent pods. Competitive per-second pricing; FlashBoot for reduced cold starts. Good for bursty inference workloads and model experimentation.
Baseten (baseten.co) — production-grade model serving platform. TensorRT-LLM Engine Builder for automatic model compilation; Triton-based serving; supports LoRA multi-adapter serving and speculative decoding. Targets teams that need maximum NVIDIA GPU utilization.
Anyscale (anyscale.com) — LLM inference on Ray Serve + vLLM. Serverless and dedicated endpoints for popular open-source models. Strong for teams already on the Ray ecosystem.

Hyperscaler managed endpoints

Amazon Bedrock (aws.amazon.com/bedrock) — managed foundation model layer on AWS. Per-token API access to open-weight models (Llama 4, Mistral, Titan, and others) plus proprietary models via AWS IAM and CloudTrail governance.
Azure AI Foundry (azure.microsoft.com) — Microsoft's managed model platform (formerly Azure AI Studio). Serverless per-token APIs and dedicated managed compute for open-weight models (Llama, Mistral, Phi) plus OpenAI and partner models. Azure-native identity and governance.
Google Vertex AI Model Garden (cloud.google.com/vertex-ai) — Google Cloud managed model endpoints. Per-token APIs for Gemini, open-weight models (Llama, Mistral, Gemma), and partner models. GCP IAM governance.

Key serving concepts

Continuous / in-flight batching — new requests join the active generation batch immediately, without waiting for a full batch to complete. Essential for agent workloads with concurrent calls. Implemented in vLLM, SGLang, TGI, TensorRT-LLM, and LMDeploy.
Paged attention / KV-cache management — allocates KV-cache in fixed-size pages to eliminate fragmentation, enabling larger effective batch sizes. vLLM's PagedAttention is the canonical implementation; SGLang's RadixAttention extends this with prefix sharing.
Prefix / prompt caching — reuses KV-cache entries for shared prompt prefixes across requests. High-value for agents with identical system prompts or RAG context blocks repeated across calls. See /resources/agent-cost-latency-optimization for provider-level caching.
Quantization — reduces model weight precision (FP16 → INT8 → INT4 → FP8, etc.) to cut VRAM and increase throughput at a small quality cost. GGUF quantization in llama.cpp; AWQ/GPTQ/FP8 in vLLM and TGI. See /resources/open-weight-models-for-agents.
Tensor / pipeline parallelism — splits a model across multiple GPUs. Tensor parallelism splits weight matrices; pipeline parallelism splits layers. Required for models that exceed single-GPU VRAM. All production servers support tensor parallelism.
Structured output / grammar support — constrains token generation to a JSON Schema or BNF grammar at the serving layer. vLLM and SGLang use XGrammar; llama.cpp uses GBNF. See /resources/reliable-tool-calling.

Decision guidance

Scenario	Recommended tier
Variable / bursty agent load; no GPU ops	Per-token API (Together AI, Fireworks, Bedrock)
High steady-volume or data-residency requirement	Self-hosted vLLM or SGLang on dedicated GPU
NVIDIA-maximum performance (H100/H200/B200)	TensorRT-LLM + Triton (via Baseten or self-hosted)
Dev / local testing	Ollama or LM Studio
Apple Silicon on-device / edge	MLX LM or Ollama (MLX backend preview)
Low-resource / CPU-only / air-gapped	llama.cpp

Cross-links: for gateway/routing across providers see /resources/ai-gateways-llm-routing; for cost and latency optimization see /resources/agent-cost-latency-optimization; for open-weight model selection see /resources/open-weight-models-for-agents.

Verified sources

vLLM GitHub (Apache 2.0): https://github.com/vllm-project/vllm
vLLM docs: https://docs.vllm.ai/en/latest/
SGLang GitHub (Apache 2.0): https://github.com/sgl-project/sglang
SGLang docs: https://docs.sglang.io/
Hugging Face TGI GitHub (Apache 2.0): https://github.com/huggingface/text-generation-inference
TGI Messages API (OpenAI-compatible): https://huggingface.co/docs/text-generation-inference/en/messages_api
NVIDIA TensorRT-LLM GitHub (Apache 2.0): https://github.com/NVIDIA/TensorRT-LLM
TensorRT-LLM Triton backend: https://github.com/triton-inference-server/tensorrtllm_backend
LMDeploy GitHub (Apache 2.0): https://github.com/InternLM/lmdeploy
LMDeploy OpenAI-compatible server: https://lmdeploy.readthedocs.io/en/latest/llm/api_server.html
Ollama blog: https://ollama.com/blog
llama.cpp GitHub (MIT): https://github.com/ggml-org/llama.cpp
LM Studio homepage: https://lmstudio.ai/
MLX GitHub (MIT, Apple): https://github.com/ml-explore/mlx
Together AI serverless inference: https://docs.together.ai/docs/serverless/models
Fireworks AI homepage: https://fireworks.ai/
Baseten inference stack guide: https://www.baseten.co/resources/guide/the-baseten-inference-stack/
Modal homepage: https://modal.com/
RunPod homepage: https://www.runpod.io/
Replicate homepage: https://replicate.com/
Anyscale LLM inference: https://www.anyscale.com/use-case/llm-online-inference
Amazon Bedrock homepage: https://aws.amazon.com/bedrock/
Azure AI Foundry homepage: https://azure.microsoft.com/en-us/products/ai-foundry/
Google Vertex AI Model Garden: https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/explore-models

#llm #inference #serving #vllm #sglang #ollama #open-weight #gpu #infrastructure #agents

Category: Reference