# Guardrails and Safety Filters for Agents

> Runtime input/output/action controls that enforce policy independently of the model — tooling landscape, techniques, and layering guidance.

Category: Guide · Updated: 2026-06-15 · Tags: safety, guardrails, agents, security, moderation, prompt-injection
Canonical: https://changegamer.ai/resources/agent-guardrails

Guardrails are runtime checks that sit around — not inside — the model and enforce policy regardless of what the model itself produces. They complement but do not replace model-level safety training. See also: /resources/agentic-security-checklist for the threat-surface checklist that motivates these controls.

## Three guardrail positions

**Input guardrails** run before the model sees a request. Targets: jailbreak and prompt-injection attempts, PII in user turns, off-topic or policy-violating content. A classifier or rule fires here and can block, rewrite, or flag the request.

**Output guardrails** run after the model responds, before the response reaches the caller. Targets: harmful or policy-violating content, PII in generated text (redact before returning), schema or format violations, and groundedness/hallucination checks (does the answer stay within the supplied context?).

**Action guardrails** run before a tool call executes. Targets: tool calls outside an allowlist, calls that exceed defined parameter ranges, and irreversible or high-stakes operations that require human approval. See /resources/agentic-security-checklist §7 for the human-in-the-loop gate pattern.
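
The three positions can be sketched as plain functions wrapped around a model call. This is a minimal illustration, not a production design: `BLOCKED_PHRASES`, `TOOL_ALLOWLIST`, `MAX_AUTO_REFUND`, and the `refund` tool are hypothetical values invented for the example, and the substring match stands in for a real injection classifier.

```python
import json
from enum import Enum
from typing import Any, Callable


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"  # hand off to a human approver


# Hypothetical policy values, chosen only for illustration.
BLOCKED_PHRASES = ("ignore previous instructions",)
TOOL_ALLOWLIST = {"search_docs", "refund"}
MAX_AUTO_REFUND = 100.0


def input_guardrail(user_text: str) -> Decision:
    """Runs before the model sees the request (toy substring check)."""
    lowered = user_text.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return Decision.BLOCK
    return Decision.ALLOW


def output_guardrail(raw_response: str) -> Decision:
    """Runs after the model responds: require a JSON object with a string 'answer'."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return Decision.BLOCK
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        return Decision.BLOCK
    return Decision.ALLOW


def action_guardrail(tool: str, args: dict[str, Any]) -> Decision:
    """Runs before a tool call executes: allowlist, parameter range, approval gate."""
    if tool not in TOOL_ALLOWLIST:
        return Decision.BLOCK
    if tool == "refund":
        amount = args.get("amount")
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
            return Decision.BLOCK
        if amount > MAX_AUTO_REFUND:
            return Decision.ESCALATE
    return Decision.ALLOW


def guarded_turn(user_text: str, model: Callable[[str], str]) -> str:
    """Wrap one model call with the input and output guardrails."""
    if input_guardrail(user_text) is not Decision.ALLOW:
        return '{"answer": "Request blocked by input policy."}'
    raw = model(user_text)
    if output_guardrail(raw) is not Decision.ALLOW:
        return '{"answer": "Response withheld by output policy."}'
    return raw
```

An agent loop would call `action_guardrail` on each proposed tool call and execute it only on `Decision.ALLOW`, routing `Decision.ESCALATE` to a human approver.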

## Techniques

- **Classifier-based filters**: a fine-tuned model labels input or output as safe/unsafe against a taxonomy. Fast and accurate for known harm categories; needs retraining as the taxonomy evolves.
- **Regex / deterministic validators**: pattern matching for PII (credit-card numbers, email addresses) and format enforcement (JSON schema, date formats). Negligible latency; brittle for semantic violations.
- **LLM-as-judge guardrails**: a second LLM evaluates the primary model's output for policy compliance or groundedness. Highly flexible, at the price of higher latency and cost.
- **Constrained decoding**: grammar-constrained generation forces the model to emit only tokens that satisfy a format specification at decode time. Eliminates format violations at the source, though it cannot check whether the content itself is correct. See /resources/reliable-tool-calling for structured-output details.
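
As an illustration of the deterministic-validator technique, the sketch below redacts email addresses and card-like numbers from text. The patterns and the `[CARD]` / `[EMAIL]` placeholders are assumptions made for this example; the Luhn checksum is added to cut false positives on arbitrary digit runs, and real PII detection needs broader patterns than these two.

```python
import re

# Simplified patterns for illustration; they will not cover every valid format.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    """Replace Luhn-valid card numbers and email addresses with placeholders."""

    def _redact_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_valid(digits) else match.group()

    text = CARD_RE.sub(_redact_card, text)
    return EMAIL_RE.sub("[EMAIL]", text)
```

The same function can serve as an input guardrail (scrubbing user turns) or an output guardrail (scrubbing generated text before it is returned).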

Defense-in-depth principle: stack multiple techniques. A regex catches known PII patterns cheaply; a classifier catches semantic violations the regex misses; an LLM-as-judge catches subtler policy issues. No single guardrail is sufficient, and prompt injection in particular has no complete solution — attackers can craft payloads that evade any single detector.
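
One way to express this stacking is an ordered list of checks, cheapest first, that stops at the first layer to flag the text. The sketch below assumes each layer is a callable returning a reason string or `None`; `toy_classifier_score` is a hypothetical stand-in for a real safety classifier, and an LLM-as-judge layer would be appended to `LAYERS` in the same way.

```python
import re
from typing import Callable, NamedTuple, Optional

# A check returns a human-readable reason when it flags the text, else None.
Check = Callable[[str], Optional[str]]


class Finding(NamedTuple):
    layer: str
    reason: str


def regex_layer(text: str) -> Optional[str]:
    """Cheap first layer: flag unbroken digit runs that look like card numbers."""
    return "card-like number" if re.search(r"\b\d{13,19}\b", text) else None


def make_score_layer(score_fn: Callable[[str], float], threshold: float) -> Check:
    """Adapt any scoring model (classifier or judge) into a layer with a threshold."""

    def check(text: str) -> Optional[str]:
        score = score_fn(text)
        return f"score {score:.2f} >= {threshold}" if score >= threshold else None

    return check


def toy_classifier_score(text: str) -> float:
    """Hypothetical placeholder for a real safety classifier."""
    return 0.9 if "attack" in text.lower() else 0.1


LAYERS: list[tuple[str, Check]] = [
    ("regex", regex_layer),
    ("classifier", make_score_layer(toy_classifier_score, 0.5)),
]


def run_stack(text: str, layers: list[tuple[str, Check]] = LAYERS) -> Optional[Finding]:
    """Run layers in order and return the first finding, or None if all pass."""
    for name, check in layers:
        reason = check(text)
        if reason is not None:
            return Finding(name, reason)
    return None
```

Ordering by cost means the expensive layers only run on traffic the cheap ones let through, which is where most of the latency saving comes from.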

## Tooling landscape

**Llama Guard 4** (Meta, open-weight) — 12B multimodal safety classifier pruned from Llama 4 Scout. Classifies both prompt and response against the MLCommons hazards taxonomy; supports text and multiple images. Released April 2025. Model card: huggingface.co/meta-llama/Llama-Guard-4-12B.

**Llama Prompt Guard 2** (Meta, open-weight) — lightweight BERT-style (DeBERTa) classifiers (22M and 86M params) for detecting direct jailbreaks and prompt-injection attacks. Outputs benign/malicious label; 512-token context. Model card: huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M.
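
A 512-token context means longer inputs have to be split and scanned in pieces. The sketch below shows one way to do that with overlapping windows, taking the highest score across them. It is an assumption-laden illustration: `score_fn` is a hypothetical callable wrapping whichever classifier is deployed, the 64-token overlap is an arbitrary choice, and whitespace splitting stands in for the model's real tokenizer, which should be used in practice to respect the actual token limit.

```python
from typing import Callable, Iterator


def windows(tokens: list[str], size: int = 512, overlap: int = 64) -> Iterator[list[str]]:
    """Yield overlapping windows that together cover every token."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + size]


def max_injection_score(
    text: str,
    score_fn: Callable[[str], float],
    size: int = 512,
    overlap: int = 64,
) -> float:
    """Score each window and return the worst (highest) score."""
    tokens = text.split()  # assumption: whitespace tokens approximate model tokens
    if not tokens:
        return 0.0
    return max(score_fn(" ".join(w)) for w in windows(tokens, size, overlap))
```

The overlap reduces the chance that a payload straddling a window boundary is split so that neither half looks malicious on its own.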

**NeMo Guardrails** (NVIDIA, open-source, Apache 2.0) — Python toolkit for adding programmable guardrails to LLM conversational systems. Intercepts input and output; policies expressed in Colang configuration; integrates with multiple embedding providers. GitHub: github.com/NVIDIA-NeMo/Guardrails.

**Guardrails AI** (guardrails-ai, open-source) — Python/JS framework for specifying and enforcing structure, type, and semantic constraints on LLM outputs. Includes a Guardrails Hub of pre-built validators; supports re-asking on failure. GitHub: github.com/guardrails-ai/guardrails.

**OpenAI Moderation API + Agents SDK guardrails** (OpenAI, vendor/SaaS): the Moderation API is a free endpoint that classifies text (and images) for harmful content across categories (hate, harassment, self-harm, etc.). Moderation scores can also be requested inline with Responses API calls. The OpenAI Agents SDK exposes explicit input and output guardrail hooks: input guardrails run on the initial user input, and output guardrails run on the final agent output. Docs: platform.openai.com/docs/guides/moderation.

**Azure AI Content Safety / Prompt Shields** (Microsoft, vendor/SaaS) — Content Safety API covers text and image harm categories. Prompt Shields (GA) detects user-prompt injection attacks and document attacks (indirect prompt injection embedded in retrieved content), with a Spotlighting capability announced at Build 2025. Docs: learn.microsoft.com/en-us/azure/ai-services/content-safety/.

**ShieldGemma** (Google, open-weight): safety classifiers built on Gemma 2 (text, 2B/9B/27B params) covering four harm categories. ShieldGemma 2 (released alongside Gemma 3 in March 2025) is a 4B model built on Gemma 3 for image safety classification. Model card: huggingface.co/google/shieldgemma-9b.

**Granite Guardian** (IBM, open-weight) — safety models fine-tuned from IBM Granite. Latest: Granite Guardian 4.1 8B (April 2026), which adds improved bring-your-own-criteria (BYOC) support for custom judging criteria beyond pre-baked safety and hallucination detectors. Model card: huggingface.co/ibm-granite/granite-guardian-4.1-8b.

**Lakera Guard** (Check Point/Lakera, vendor/SaaS): real-time API for detecting prompt injection, jailbreaks, and data leakage; the vendor claims sub-50 ms latency. Check Point Software announced its acquisition of Lakera in September 2025. Docs: docs.lakera.ai/guard.

## Practical guidance

- **Layer multiple guardrails.** Cover the input, output, and action positions at minimum. A regex pre-filter reduces classifier load; a classifier catches what the regex misses.
- **Fail closed on high-stakes actions.** If a guardrail errors or is inconclusive, block or escalate — do not default to allowing the action.
- **Log guardrail decisions.** Record every guardrail verdict (allowed/blocked, score, rule fired) alongside the trace ID for the agent run. See /resources/agent-observability for the broader observability pattern.
- **Measure false-positive and false-negative rates.** Guardrails that block too much degrade usability; guardrails that miss too much provide false confidence. Tune thresholds against a representative sample of real traffic.
- **Keep humans in the loop for irreversible actions.** No guardrail stack eliminates risk entirely, especially for prompt injection. For financial transfers, external communications, and data deletion, require explicit human confirmation regardless of guardrail output.
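
The fail-closed, logging, and measurement points above can be sketched together. In this illustration `fail_closed` treats anything other than an explicit `True` from the check (an exception, `None`, a timeout) as a block and logs the verdict with the trace ID, while `error_rates` computes false-positive and false-negative rates from labeled samples. The function names, the log field names, and the `guardrails` logger name are all assumptions made for the example.

```python
import json
import logging
from typing import Callable, Optional

logger = logging.getLogger("guardrails")  # hypothetical logger name


def fail_closed(
    check: Callable[[str], Optional[bool]],
    payload: str,
    *,
    rule: str,
    trace_id: str,
) -> bool:
    """Return True only when the check ran and explicitly allowed the payload."""
    error = None
    try:
        allowed = check(payload) is True  # None or other values count as inconclusive
    except Exception as exc:  # guardrail errored: block rather than allow
        allowed = False
        error = repr(exc)
    logger.info(json.dumps({
        "trace_id": trace_id,
        "rule": rule,
        "verdict": "allowed" if allowed else "blocked",
        "error": error,
    }))
    return allowed


def error_rates(samples: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Given (is_harmful, was_blocked) pairs, return (false-positive rate, false-negative rate)."""
    benign = sum(1 for harmful, _ in samples if not harmful)
    harmful_count = len(samples) - benign
    false_pos = sum(1 for harmful, blocked in samples if not harmful and blocked)
    false_neg = sum(1 for harmful, blocked in samples if harmful and not blocked)
    fpr = false_pos / benign if benign else 0.0
    fnr = false_neg / harmful_count if harmful_count else 0.0
    return fpr, fnr
```

Re-running `error_rates` on the same labeled sample at several thresholds gives the trade-off curve needed to pick an operating point.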

## Verified sources

- Llama Guard 4 model card (Meta/HuggingFace): https://huggingface.co/meta-llama/Llama-Guard-4-12B
- Llama Prompt Guard 2 model card (Meta/HuggingFace): https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M
- NeMo Guardrails GitHub (NVIDIA): https://github.com/NVIDIA-NeMo/Guardrails
- Guardrails AI GitHub: https://github.com/guardrails-ai/guardrails
- OpenAI Moderation API docs: https://platform.openai.com/docs/guides/moderation
- OpenAI Agents SDK guardrails: https://openai.github.io/openai-agents-python/guardrails/
- Azure AI Content Safety / Prompt Shields (Microsoft Learn): https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection
- ShieldGemma model card (Google/HuggingFace): https://huggingface.co/google/shieldgemma-9b
- ShieldGemma 2 on HuggingFace: https://huggingface.co/google/shieldgemma-2-4b-it
- Granite Guardian 4.1 8B model card (IBM/HuggingFace): https://huggingface.co/ibm-granite/granite-guardian-4.1-8b
- Lakera Guard docs (Check Point/Lakera): https://docs.lakera.ai/guard
