Agentic Security Checklist
Cross-vendor, threat-surface-organized security checklist for building and operating AI agents — synthesizing OWASP, NIST, Anthropic, OpenAI, Google SAIF, and MITRE ATLAS.
Labs publish vendor-specific guidance. This checklist synthesizes it across vendors into one agent-consumable reference, organized by threat surface. Check each item before shipping an agent to production.
Summary table
| Threat surface | Highest-impact control |
|---|---|
| Prompt injection | Treat all external content as untrusted data, not instructions |
| Tool/function-call abuse | Least-privilege toolset; confirm before destructive calls |
| Excessive agency | Scope permissions to the minimum required for each task |
| Secrets & credentials | Never pass secrets through the context window |
| MCP / supply chain | Pin versions; audit tool descriptions before connecting |
| Output / action sandboxing | Human-in-the-loop gate on irreversible actions |
| Memory & context poisoning | Validate and sanitize retrieved context before injection |
| Data exfiltration | Network egress allowlist; outbound content inspection |
| Auth & OAuth scopes | Request only the OAuth scopes the agent needs per task |
| Logging & auditability | Log every tool call, input, and output with a trace ID |
1. Prompt injection — direct and indirect
- Enforce a strict separation between trusted instructions (system prompt) and untrusted data (user input, tool outputs, retrieved documents). Never concatenate them without a structural delimiter.
- Treat all content fetched from the web, email, databases, or tool responses as untrusted user data — not as instructions — regardless of where it originated (OWASP Agentic ASI01 Agent Goal Hijack; OWASP LLM01:2025 Prompt Injection).
- Strip or escape instruction-like patterns (
Ignore previous instructions,<system>,ASSISTANT:) from retrieved content before inserting into context. - Validate that tool outputs conform to the expected schema and type before the model acts on them.
- Run a separate classification step or guardrail on user input to detect jailbreak patterns before the main agent sees them (OpenAI agent builder safety guidance).
2. Tool and function-call abuse
- Maintain a minimal toolset: expose only the tools needed for the current task; disable or detach others at runtime (Anthropic trustworthy-agents framework — harness layer).
- Require explicit confirmation before any tool call that is irreversible, has financial impact, or modifies shared state (email, file system, database write, API mutation).
- Log the full tool-call request (name + arguments) and response before execution.
- Rate-limit tool calls per agent turn to bound blast radius from runaway loops.
- Reject tool calls with arguments that reference paths, URIs, or identifiers outside the expected domain.
3. Excessive agency and least-privilege
- Grant the agent only the permissions it needs for the specific task, not for all tasks it might ever do (OWASP LLM06:2025 Excessive Agency; Google SAIF principle of minimizing blast radius).
- Prefer read-only tool variants when write access is not required for the step.
- Time-bound credentials: issue short-lived tokens per task rather than long-lived agent credentials.
- Review and prune the tool list each time the agent's scope changes.
4. Untrusted content handling
- Sanitize HTML, Markdown, and JSON retrieved from external sources before insertion into the context window.
- Do not render or execute content returned by tools without validation.
- Limit the size of individual tool responses injected into context to prevent context-flooding attacks.
- Apply output encoding when agent-generated content is passed to downstream systems (SQL, shell, HTML rendering).
5. Secrets and credential management
- Never place API keys, passwords, or tokens in the system prompt, user message, or any field visible to the model.
- Inject secrets at the infrastructure layer (environment variables, secret managers) and resolve them in tool-call wrappers, not in context.
- Rotate credentials on a schedule; revoke immediately on suspected compromise.
- Audit which tool servers hold credentials on your behalf — long-lived third-party tokens stored server-side are high blast-radius targets.
6. MCP server trust and supply chain
- Only connect to MCP servers you can review or whose publisher you trust; third-party registries vary in vetting rigour.
- Audit the tool description text of every MCP server before connecting — malicious servers embed instructions in descriptions to hijack model behavior (prompt injection via tool metadata).
- Pin package versions with lock files and verify checksums; the first malicious MCP package appeared in September 2025.
- Prefer OAuth 2.1 + PKCE for remote MCP server auth (mandatory for HTTP transport per the June 2025 MCP spec); use per-agent client registrations, not shared credentials.
- Scope OAuth tokens to the minimum required tools; use per-tool scopes where the server supports them.
- Revoke server access when the agent task is complete.
7. Output and action sandboxing; human-in-the-loop gates
- Define a pre-flight checklist of action categories that always require human approval: financial transfers, sending external communications, deleting data, provisioning infrastructure (Anthropic trustworthy-agents framework; NIST AI RMF GOVERN function).
- Sandbox code execution and file operations in isolated environments with no network egress by default.
- Implement a rollback or undo path for every reversible action the agent takes.
- Emit a structured pre-action summary to the user before high-stakes tool calls and wait for acknowledgement.
8. Memory and context poisoning
- Validate retrieved memories or RAG results against a known-good schema before injecting into context.
- Treat vector-store content with the same distrust as external web content — it may have been poisoned at ingestion time (MITRE ATLAS AML.T0020 Poison Training Data, which covers fine-tuning and RAG data sources).
- Separate short-term working memory from long-term persistent storage; apply stricter validation before promoting content to persistent memory.
- Periodically audit long-term memory stores for injected instructions.
9. Data exfiltration channels
- Restrict agent network egress to a known allowlist of destinations; deny by default.
- Inspect outbound tool-call arguments for PII and secrets before execution; block calls that reference data outside their expected domain.
- Do not let the agent construct arbitrary URLs or shell commands from user-supplied input without sanitization.
- Monitor for unusual data volumes or frequencies in tool outputs that may indicate exfiltration via covert channels.
10. Auth and OAuth scopes
- Request OAuth scopes at the minimum granularity needed for each specific task; do not request broad scopes for future convenience.
- Use separate OAuth clients (and credentials) per agent instance; never share a client ID across agent instances.
- Implement token refresh and revocation; treat access tokens as ephemeral.
- Verify that the OAuth resource indicator (RFC 8707) matches the intended MCP server to prevent token mis-redemption attacks.
11. Logging and auditability
- Assign a unique trace ID to every agent run; propagate it through all tool calls and sub-agent invocations.
- Log: timestamp, trace ID, tool name, full arguments, full response, latency, and outcome for every tool call.
- Store logs in an append-only, tamper-evident store; agents must not be able to delete their own logs.
- Alert on anomalous patterns: high tool-call rates, calls to unexpected endpoints, or sudden changes in action type distribution (Google SAIF — Monitor and Respond principle).
- Retain logs long enough to support incident reconstruction; NIST AI RMF MANAGE function recommends documented response plans.
Verified sources
- OWASP Top 10 for Agentic Applications (2026): https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/
- OWASP LLM01:2025 Prompt Injection: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
- Anthropic — Our framework for developing safe and trustworthy agents: https://www.anthropic.com/news/our-framework-for-developing-safe-and-trustworthy-agents
- OpenAI — Safety in building agents: https://platform.openai.com/docs/guides/agent-builder-safety
- OpenAI — A practical guide to building agents: https://openai.com/business/guides-and-resources/a-practical-guide-to-building-ai-agents/
- Google SAIF (Secure AI Framework): https://saif.google/secure-ai-framework
- MITRE ATLAS (adversarial threat landscape for AI): https://atlas.mitre.org/
- NIST AI Risk Management Framework: https://www.nist.gov/itl/ai-risk-management-framework