Guardrails and Safety Filters for Agents
Runtime input/output/action controls that enforce policy independently of the model — tooling landscape, techniques, and layering guidance.
Guardrails are runtime checks that sit around — not inside — the model and enforce policy regardless of what the model itself produces. They complement but do not replace model-level safety training. See also: /resources/agentic-security-checklist for the threat-surface checklist that motivates these controls.
Three guardrail positions
Input guardrails run before the model sees a request. Targets: jailbreak and prompt-injection attempts, PII in user turns, off-topic or policy-violating content. A classifier or rule fires here and can block, rewrite, or flag the request.
Output guardrails run after the model responds, before the response reaches the caller. Targets: harmful or policy-violating content, PII in generated text (redact before returning), schema or format violations, and groundedness/hallucination checks (does the answer stay within the supplied context?).
Action guardrails run before a tool call executes. Targets: tool calls outside an allowlist, calls that exceed defined parameter ranges, and irreversible or high-stakes operations that require human approval. See /resources/agentic-security-checklist §7 for the human-in-the-loop gate pattern.
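The three action-guardrail targets (allowlist, parameter ranges, approval for high-stakes tools) can be sketched as a single pre-execution check. This is a minimal illustration, not any library's API: the tool names `search_docs` and `send_refund`, the `ToolPolicy` structure, and the numeric limits are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    NEEDS_APPROVAL = "needs_approval"


@dataclass
class ToolPolicy:
    # Inclusive numeric bounds per parameter name, e.g. {"amount": (0, 100)}.
    ranges: dict = field(default_factory=dict)
    # True for irreversible or high-stakes tools that need a human sign-off.
    requires_approval: bool = False


# Hypothetical allowlist: any tool not listed here is rejected outright.
POLICIES = {
    "search_docs": ToolPolicy(ranges={"top_k": (1, 20)}),
    "send_refund": ToolPolicy(ranges={"amount": (0, 100)}, requires_approval=True),
}


def check_tool_call(name: str, args: dict) -> tuple[Verdict, str]:
    """Decide whether a proposed tool call may run, before it executes."""
    policy = POLICIES.get(name)
    if policy is None:
        return Verdict.BLOCK, f"tool '{name}' is not on the allowlist"
    for param, (low, high) in policy.ranges.items():
        value = args.get(param)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not low <= value <= high:
            return Verdict.BLOCK, f"parameter '{param}' is missing or outside [{low}, {high}]"
    if policy.requires_approval:
        return Verdict.NEEDS_APPROVAL, "high-stakes tool: human approval required"
    return Verdict.ALLOW, "ok"
```

Range checks run before the approval check so that a human reviewer is only asked about calls that are already within policy.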
Techniques
- Classifier-based filters: a fine-tuned model labels input or output as safe/unsafe per a taxonomy. Fast and accurate for known harm categories; needs retraining as the taxonomy evolves.
- Regex / deterministic validators: pattern matching for PII (credit-card numbers, email addresses) and format enforcement (JSON schema, date formats). Negligible latency; brittle for semantic violations.
- LLM-as-judge guardrails: a second LLM evaluates the primary model's output for policy compliance or groundedness. High flexibility; higher latency and cost.
- Constrained decoding: grammar-constrained generation forces the model to emit only tokens that satisfy a format specification at decode time. This removes format violations at the source, although it does not check whether the content itself is correct. See /resources/reliable-tool-calling for structured-output details.
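As an illustration of the deterministic-validator technique, the sketch below redacts email addresses and card numbers from text. The patterns are deliberately simple assumptions rather than a complete PII detector, and the Luhn checksum is added here only to reduce false positives on arbitrary digit runs.

```python
import re

EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# 13 to 19 digits, each optionally preceded by a single space or hyphen.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    """Replace emails and Luhn-valid card numbers with placeholder tags."""

    def _card(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group())
        # Leave digit runs that fail the checksum untouched (order IDs and so on).
        return "[CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(_card, text)
    return EMAIL_RE.sub("[EMAIL]", text)
```

A validator like this is cheap enough to run on every input and output, but it only sees surface patterns: a card number spelled out in words passes straight through.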
Defense-in-depth principle: stack multiple techniques. A regex catches known PII patterns cheaply; a classifier catches semantic violations the regex misses; an LLM-as-judge catches subtler policy issues. No single guardrail is sufficient, and prompt injection in particular has no complete solution — attackers can craft payloads that evade any single detector.
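The stacking idea can be sketched as an ordered list of checks that runs cheapest first and stops at the first rejection. The classifier and judge layers here are placeholders (a keyword test and a no-op); in a real deployment they would call a safety classifier and a second LLM. All function names are hypothetical.

```python
import re
from typing import Callable, NamedTuple, Optional


class Finding(NamedTuple):
    layer: str
    reason: str


# A check returns a reason string when it rejects the text, else None.
Check = Callable[[str], Optional[str]]

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def regex_layer(text: str) -> Optional[str]:
    return "possible US SSN" if SSN_RE.search(text) else None


def classifier_layer(text: str) -> Optional[str]:
    # Placeholder for a safety-classifier call; a keyword test stands in for the model.
    if "ignore previous instructions" in text.lower():
        return "flagged as injection attempt"
    return None


def judge_layer(text: str) -> Optional[str]:
    # Placeholder for an LLM-as-judge call; always passes in this sketch.
    return None


# Ordered cheapest to most expensive so later layers see less traffic.
LAYERS: list[tuple[str, Check]] = [
    ("regex", regex_layer),
    ("classifier", classifier_layer),
    ("judge", judge_layer),
]


def run_stack(text: str) -> Optional[Finding]:
    """Return the first rejection, or None if every layer passes."""
    for name, check in LAYERS:
        reason = check(text)
        if reason is not None:
            return Finding(name, reason)
    return None
```

Short-circuiting keeps cost down, at the price of reporting only the first layer that fired; running every layer and collecting all findings is a reasonable alternative when the logs matter more than latency.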
Tooling landscape
Llama Guard 4 (Meta, open-weight) — 12B multimodal safety classifier pruned from Llama 4 Scout. Classifies both prompt and response against the MLCommons hazards taxonomy; supports text and multiple images. Released April 2025. Model card: huggingface.co/meta-llama/Llama-Guard-4-12B.
Llama Prompt Guard 2 (Meta, open-weight) — lightweight BERT-style (DeBERTa) classifiers (22M and 86M params) for detecting direct jailbreaks and prompt-injection attacks. Outputs benign/malicious label; 512-token context. Model card: huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M.
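Because of the 512-token context, an input longer than the window has to be split before classification, otherwise a payload placed late in the text is never seen. One way to handle this is a sliding window that keeps the worst score. In the sketch below, whitespace tokens stand in for the model's real tokenizer and `score_fn` is a hypothetical callable returning the probability that a text is malicious; both are assumptions, not part of the model's API.

```python
from typing import Callable, Sequence


def windows(tokens: Sequence[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split tokens into overlapping windows of at most `size` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    out = []
    for start in range(0, max(len(tokens), 1), step):
        out.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return out


def max_malicious_score(
    text: str,
    score_fn: Callable[[str], float],
    size: int = 512,
    overlap: int = 64,
) -> float:
    """Score every window and keep the highest score."""
    tokens = text.split()  # stand-in for the classifier's tokenizer
    return max(score_fn(" ".join(w)) for w in windows(tokens, size, overlap))
```

The overlap is there so that a payload straddling a window boundary still appears whole in at least one window, provided it is shorter than the overlap.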
NeMo Guardrails (NVIDIA, open-source, Apache 2.0) — Python toolkit for adding programmable guardrails to LLM conversational systems. Intercepts input and output; policies expressed in Colang configuration; integrates with multiple embedding providers. GitHub: github.com/NVIDIA-NeMo/Guardrails.
Guardrails AI (guardrails-ai, open-source) — Python/JS framework for specifying and enforcing structure, type, and semantic constraints on LLM outputs. Includes a Guardrails Hub of pre-built validators; supports re-asking on failure. GitHub: github.com/guardrails-ai/guardrails.
OpenAI Moderation API + Agents SDK guardrails (OpenAI, vendor/SaaS): the Moderation API is a free endpoint that classifies text (and images) for harmful content across categories (hate, harassment, self-harm, etc.). Moderation scores can also be requested inline with Responses API calls. The OpenAI Agents SDK exposes explicit input and output guardrail hooks: input guardrails check the initial user input to an agent run and output guardrails check the final agent output, rather than running on every tool invocation. Docs: platform.openai.com/docs/guides/moderation.
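A minimal sketch of using the Moderation endpoint as an input or output guardrail follows. It assumes the official `openai` Python package, whose client exposes `moderations.create`; the wrapper name `moderate`, the returned dictionary shape, and the choice to treat API errors as a block are assumptions made for this example.

```python
from typing import Any


def moderate(client: Any, text: str, model: str = "omni-moderation-latest") -> dict:
    """Call the Moderation endpoint and return a small, loggable verdict."""
    try:
        result = client.moderations.create(model=model, input=text).results[0]
    except Exception as exc:
        # Assumption for this sketch: an API failure counts as "flagged" (fail closed).
        return {"flagged": True, "error": repr(exc)}
    return {"flagged": bool(result.flagged), "error": None}


# Usage, assuming the `openai` package is installed and OPENAI_API_KEY is set:
#   from openai import OpenAI
#   verdict = moderate(OpenAI(), "some user text")
#   if verdict["flagged"]:
#       ...  # block, rewrite, or escalate
```

Passing the client in as an argument keeps the wrapper easy to test with a stub and leaves credential handling to the caller.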
Azure AI Content Safety / Prompt Shields (Microsoft, vendor/SaaS) — Content Safety API covers text and image harm categories. Prompt Shields (GA) detects user-prompt injection attacks and document attacks (indirect prompt injection embedded in retrieved content), with a Spotlighting capability announced at Build 2025. Docs: learn.microsoft.com/en-us/azure/ai-services/content-safety/.
ShieldGemma (Google, open-weight): safety classifiers built on Gemma 2 (text, 2B/9B/27B params) covering four harm categories. ShieldGemma 2 (March 2025, released alongside Gemma 3) is a 4B model built on Gemma 3 that adds image safety classification. Model card: huggingface.co/google/shieldgemma-9b.
Granite Guardian (IBM, open-weight) — safety models fine-tuned from IBM Granite. Latest: Granite Guardian 4.1 8B (April 2026), which adds improved bring-your-own-criteria (BYOC) support for custom judging criteria beyond pre-baked safety and hallucination detectors. Model card: huggingface.co/ibm-granite/granite-guardian-4.1-8b.
Lakera Guard (Lakera, vendor/SaaS): real-time API for detecting prompt injection, jailbreaks, and data leakage; the vendor claims sub-50 ms latency. Check Point announced its acquisition of Lakera in September 2025. Docs: docs.lakera.ai/guard.
Practical guidance
- Layer multiple guardrails. Input + output + action coverage at minimum. A regex pre-filter reduces classifier load; a classifier catches what regex misses.
- Fail closed on high-stakes actions. If a guardrail errors or is inconclusive, block or escalate — do not default to allowing the action.
- Log guardrail decisions. Record every guardrail verdict (allowed/blocked, score, rule fired) alongside the trace ID for the agent run. See /resources/agent-observability for the broader observability pattern.
- Measure false-positive and false-negative rates. Guardrails that block too much degrade usability; guardrails that miss too much provide false confidence. Tune thresholds against a representative sample of real traffic.
- Keep humans in the loop for irreversible actions. No guardrail stack eliminates risk entirely, especially for prompt injection. For financial transfers, external communications, and data deletion, require explicit human confirmation regardless of guardrail output.
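Three of the points above (fail closed, log every verdict, measure error rates) can be sketched together. The wrapper below assumes a hypothetical `check` callable that returns a risk score in [0, 1]; the log field names are illustrative and not tied to any observability product.

```python
import json
import logging
from typing import Callable, Sequence

logger = logging.getLogger("guardrails")


def guarded(check: Callable[[str], float], threshold: float,
            trace_id: str, rule: str, text: str) -> bool:
    """Return True only if the check ran and scored below the threshold."""
    try:
        score = check(text)
        # A NaN score compares False here, so an inconclusive result also blocks.
        allowed = score < threshold
        error = None
    except Exception as exc:
        # Fail closed: a guardrail that errors blocks the request.
        score, allowed, error = None, False, repr(exc)
    logger.info(json.dumps({"trace_id": trace_id, "rule": rule,
                            "score": score, "allowed": allowed, "error": error}))
    return allowed


def error_rates(labels: Sequence[bool], blocked: Sequence[bool]) -> dict:
    """labels[i] is True for a real violation; blocked[i] is the guardrail decision."""
    fp = sum(1 for y, b in zip(labels, blocked) if b and not y)
    fn = sum(1 for y, b in zip(labels, blocked) if y and not b)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }
```

Running `error_rates` over a labeled sample of real traffic at several thresholds gives the trade-off curve needed to pick a threshold deliberately rather than by guesswork.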
Verified sources
- Llama Guard 4 model card (Meta/HuggingFace): https://huggingface.co/meta-llama/Llama-Guard-4-12B
- Llama Prompt Guard 2 model card (Meta/HuggingFace): https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M
- NeMo Guardrails GitHub (NVIDIA): https://github.com/NVIDIA-NeMo/Guardrails
- Guardrails AI GitHub: https://github.com/guardrails-ai/guardrails
- OpenAI Moderation API docs: https://platform.openai.com/docs/guides/moderation
- OpenAI Agents SDK guardrails: https://openai.github.io/openai-agents-python/guardrails/
- Azure AI Content Safety / Prompt Shields (Microsoft Learn): https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection
- ShieldGemma model card (Google/HuggingFace): https://huggingface.co/google/shieldgemma-9b
- ShieldGemma 2 on HuggingFace: https://huggingface.co/google/shieldgemma-2-4b-it
- Granite Guardian 4.1 8B model card (IBM/HuggingFace): https://huggingface.co/ibm-granite/granite-guardian-4.1-8b
- Lakera Guard docs (Lakera): https://docs.lakera.ai/guard