ChangeGamer


Guardrails and Safety Filters for Agents

Guide · updated 2026-06-15 · Markdown variant

Runtime input/output/action controls that enforce policy independently of the model — tooling landscape, techniques, and layering guidance.


Guardrails are runtime checks that sit around — not inside — the model and enforce policy regardless of what the model itself produces. They complement but do not replace model-level safety training. See also: /resources/agentic-security-checklist for the threat-surface checklist that motivates these controls.

Three guardrail positions

Input guardrails run before the model sees a request. Targets: jailbreak and prompt-injection attempts, PII in user turns, off-topic or policy-violating content. A classifier or rule fires here and can block, rewrite, or flag the request.
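The block, rewrite, and flag outcomes described above can be sketched with a purely rule-based check. This is a minimal illustration, not a production detector: the pattern lists, the `Decision` type, the `MAX_CHARS` threshold, and `check_input` are all hypothetical names, and a real deployment would likely pair rules like these with a trained classifier.

```python
import re
from dataclasses import dataclass

# Hypothetical, deliberately tiny rule sets; real lists would be far larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MAX_CHARS = 8000  # assumed threshold for "unusually long" input


@dataclass
class Decision:
    action: str   # "allow", "block", "rewrite", or "flag"
    text: str     # text to forward to the model (empty when blocked)
    reason: str = ""


def check_input(user_text: str) -> Decision:
    """Run before the model sees the request."""
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_text):
            return Decision("block", "", "possible prompt injection")
    if EMAIL.search(user_text):
        # Rewrite: strip PII from the user turn instead of rejecting it.
        return Decision("rewrite", EMAIL.sub("[EMAIL]", user_text), "PII in user turn")
    if len(user_text) > MAX_CHARS:
        # Flag: forward unchanged but record the request for review.
        return Decision("flag", user_text, "unusually long input")
    return Decision("allow", user_text)
```

The caller forwards `decision.text` to the model unless `decision.action` is `"block"`, and logs the reason for anything other than `"allow"`.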

Output guardrails run after the model responds, before the response reaches the caller. Targets: harmful or policy-violating content, PII in generated text (redact before returning), schema or format violations, and groundedness/hallucination checks (does the answer stay within the supplied context?).
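Two of the output checks above, PII redaction and schema validation, can be sketched with the standard library alone. The response schema (`answer`, `sources`), the regexes, and the function names are assumptions made for illustration; groundedness checks are not covered here because they typically need a model or retrieval comparison.

```python
import json
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
REQUIRED_KEYS = {"answer": str, "sources": list}  # hypothetical response schema


def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return US_PHONE.sub("[PHONE]", text)


def check_output(raw: str) -> tuple[bool, str]:
    """Return (ok, payload): redacted JSON on success, an error message otherwise."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value must be an object"
    for key, expected in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected):
            return False, f"field {key!r} missing or not {expected.__name__}"
    # Redact before the response reaches the caller.
    data["answer"] = redact_pii(data["answer"])
    return True, json.dumps(data)
```

On a failed check the caller can return a fallback message or re-ask the model; which is appropriate depends on the application.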

Action guardrails run before a tool call executes. Targets: tool calls outside an allowlist, calls that exceed defined parameter ranges, and irreversible or high-stakes operations that require human approval. See /resources/agentic-security-checklist §7 for the human-in-the-loop gate pattern.
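The three action checks above (allowlist, parameter ranges, human approval) might look like the following sketch. The `POLICY` table, the tool names, and the `approve` callback are hypothetical; in practice `approve` would route to whatever review channel the system uses.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical policy table: tool name -> numeric parameter bounds and approval flag.
POLICY = {
    "search_docs": {"limits": {"top_k": (1, 20)}, "needs_approval": False},
    "refund_payment": {"limits": {"amount": (0, 500)}, "needs_approval": True},
}


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


def check_action(call: ToolCall, approve: Callable[[ToolCall], bool]) -> tuple[bool, str]:
    """Run before the tool executes; approve() stands in for a human reviewer."""
    rule = POLICY.get(call.name)
    if rule is None:
        return False, f"tool {call.name!r} is not on the allowlist"
    for param, (low, high) in rule["limits"].items():
        value = call.args.get(param)
        if not isinstance(value, (int, float)) or not low <= value <= high:
            return False, f"{param}={value!r} outside [{low}, {high}]"
    # Only ask a human once the cheap automatic checks have passed.
    if rule["needs_approval"] and not approve(call):
        return False, "human approval denied"
    return True, "ok"
```

Unknown tools are denied by default, so adding a new tool requires an explicit policy entry.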

Techniques

Defense-in-depth principle: stack multiple techniques. A regex catches known PII patterns cheaply; a classifier catches semantic violations the regex misses; an LLM-as-judge catches subtler policy issues. No single guardrail is sufficient, and prompt injection in particular has no complete solution — attackers can craft payloads that evade any single detector.
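One way to picture the stacking described above is a list of checks ordered from cheapest to most expensive. In this sketch the classifier and judge layers are stand-ins (a keyword count and a no-op), not real models, and all function names are hypothetical.

```python
import re
from typing import Callable, Optional

# Each layer returns a reason string when it objects, or None to pass.
Check = Callable[[str], Optional[str]]

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def regex_layer(text: str) -> Optional[str]:
    """Cheap pattern match for a known PII format."""
    return "SSN-like pattern" if SSN.search(text) else None


def classifier_layer(text: str) -> Optional[str]:
    # Stand-in for a trained safety classifier: counts suspicious keywords.
    hits = sum(word in text.lower() for word in ("bypass", "exploit", "jailbreak"))
    return f"{hits} suspicious terms" if hits >= 2 else None


def judge_layer(text: str) -> Optional[str]:
    # Stand-in for an LLM-as-judge call; a real one would send text plus policy to a model.
    return None


def run_layers(text: str, layers: list[Check]) -> Optional[str]:
    """Run layers in order and return the first objection, or None if all pass."""
    for layer in layers:
        reason = layer(text)
        if reason is not None:
            return f"{layer.__name__}: {reason}"
    return None
```

Stopping at the first objection keeps the expensive judge call off the path for inputs the cheap layers already reject.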

Tooling landscape

Llama Guard 4 (Meta, open-weight) — 12B multimodal safety classifier pruned from Llama 4 Scout. Classifies both prompt and response against the MLCommons hazards taxonomy; supports text and multiple images. Released April 2025. Model card: huggingface.co/meta-llama/Llama-Guard-4-12B.

Llama Prompt Guard 2 (Meta, open-weight) — lightweight BERT-style (DeBERTa) classifiers (22M and 86M params) for detecting direct jailbreaks and prompt-injection attacks. Outputs benign/malicious label; 512-token context. Model card: huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M.
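Because of the 512-token context, inputs longer than that have to be split and each piece scored separately. The sketch below shows one way to do this with overlapping windows. It assumes whitespace-separated words as a rough proxy for model tokens (a real integration should use the model's own tokenizer) and takes a hypothetical `classify` callable in place of the actual classifier.

```python
from typing import Callable, Iterator


def windows(tokens: list[str], size: int = 512, overlap: int = 64) -> Iterator[list[str]]:
    """Yield overlapping windows so a payload straddling a boundary is still seen whole."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + size]


def is_malicious(text: str, classify: Callable[[str], bool], size: int = 512) -> bool:
    """Flag the whole input if any single window is flagged."""
    tokens = text.split()  # assumption: words approximate tokens
    return any(classify(" ".join(w)) for w in windows(tokens, size=size, overlap=size // 8))
```

The overlap is a design choice: without it, an injection string split across two windows could look benign in both halves.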

NeMo Guardrails (NVIDIA, open-source, Apache 2.0) — Python toolkit for adding programmable guardrails to LLM conversational systems. Intercepts input and output; policies expressed in Colang configuration; integrates with multiple embedding providers. GitHub: github.com/NVIDIA-NeMo/Guardrails.

Guardrails AI (guardrails-ai, open-source) — Python/JS framework for specifying and enforcing structure, type, and semantic constraints on LLM outputs. Includes a Guardrails Hub of pre-built validators; supports re-asking on failure. GitHub: github.com/guardrails-ai/guardrails.

OpenAI Moderation API + Agents SDK guardrails (OpenAI, vendor/SaaS) — the Moderation API is a free endpoint that classifies text (and images) for harmful content across categories (hate, harassment, self-harm, etc.). Moderation scores can also be requested inline with Responses API calls. The OpenAI Agents SDK exposes explicit input and output guardrail hooks: input guardrails check the initial user input to an agent run, and output guardrails check the final agent output. Docs: platform.openai.com/docs/guides/moderation.

Azure AI Content Safety / Prompt Shields (Microsoft, vendor/SaaS) — Content Safety API covers text and image harm categories. Prompt Shields (GA) detects user-prompt injection attacks and document attacks (indirect prompt injection embedded in retrieved content), with a Spotlighting capability announced at Build 2025. Docs: learn.microsoft.com/en-us/azure/ai-services/content-safety/.

ShieldGemma (Google, open-weight) — safety classifiers built on Gemma 2 (text, 2B/9B/27B params) covering four harm categories. ShieldGemma 2 (released alongside Gemma 3 in March 2025) is a 4B model built on Gemma 3 that adds image safety classification. Model card: huggingface.co/google/shieldgemma-9b.

Granite Guardian (IBM, open-weight) — safety models fine-tuned from IBM Granite. Latest: Granite Guardian 4.1 8B (April 2026), which adds improved bring-your-own-criteria (BYOC) support for custom judging criteria beyond pre-baked safety and hallucination detectors. Model card: huggingface.co/ibm-granite/granite-guardian-4.1-8b.

Lakera Guard (Lakera, vendor/SaaS) — real-time API for detecting prompt injection, jailbreaks, and data leakage; claims sub-50 ms latency. Check Point Software announced an agreement to acquire Lakera in September 2025. Docs: docs.lakera.ai/guard.

Practical guidance

Verified sources

#safety #guardrails #agents #security #moderation #prompt-injection

Category: Guide