Code Execution Sandboxing for Agents

Guide · updated 2026-06-15 · Markdown variant

Isolation spectrum from language sandboxes to microVMs, WebAssembly as a portable sandbox, and a verified comparison of hosted agent-sandbox APIs — for agents that need to run model-generated code safely.

Running model-generated code is arbitrary code execution. Without isolation, a single malicious or buggy output can read host secrets, exfiltrate data, pivot to other tenants, or destroy infrastructure. Sandboxing is not optional for production agents that execute generated code or computer-use actions. See also: /resources/agentic-security-checklist (§7 — output and action sandboxing) and /resources/computer-use-browser-automation.

The isolation spectrum

The layers below are ordered from weakest to strongest isolation. Each layer adds real isolation at the cost of startup time, resource overhead, or operational complexity.

1. In-process language sandboxes

What they are: code is restricted at the language level before it runs — no separate process, no OS boundary.

RestrictedPython (github.com/zopefoundation/RestrictedPython) — provides compile_restricted as a stand-in for Python's compile(). It transforms the AST to reject unsafe constructs and to route attribute access and similar operations through guard functions. The maintainers explicitly state it is not a full sandbox and must be combined with other controls; effectiveness depends on guard-function configuration.
Pyodide / WASM — CPython compiled to WebAssembly and run inside the browser's or Node's WASM runtime. Network and filesystem access are unavailable by default, so isolation comes from the WASM runtime itself rather than from language-level guards or configuration. See "WebAssembly as a portable sandbox" below for WASM details.

Isolation strength: weak. Pure language-level sandboxes have no OS-level boundary and can be defeated by native extensions, JIT vulnerabilities, or overlooked builtins. Use them only for very-low-risk inputs or as a first filter in a layered stack.
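To make "first filter in a layered stack" concrete, here is a minimal sketch of a static pre-check built on Python's standard ast module. It is not RestrictedPython and not a sandbox: the function name first_filter and the BLOCKED_NAMES set are illustrative assumptions, and the blocklist is deliberately incomplete.

```python
import ast

# Names this sketch refuses outright. The set is an illustrative assumption,
# not a complete inventory of dangerous builtins.
BLOCKED_NAMES = {"eval", "exec", "compile", "open", "__import__", "getattr", "setattr"}


def first_filter(source: str) -> list[str]:
    """Return reasons to reject `source`; an empty list means it passed this filter only."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append(f"line {node.lineno}: import statement")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            problems.append(f"line {node.lineno}: blocked name {node.id!r}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"line {node.lineno}: dunder attribute {node.attr!r}")
    return problems
```

An empty result only means the code cleared a cheap syntactic check. Code that passes should still run inside one of the stronger layers described next.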

2. OS containers (Docker / standard runtimes)

What they are: Linux namespaces (PID, net, mount, UTS, IPC) plus cgroups isolate a process from the host filesystem and network. Docker is the dominant packaging and runtime implementation.

Isolation strength: moderate. Containers share the host kernel. A kernel exploit inside a container can escape to the host. For internal developer tooling or low-privilege workloads this is often acceptable, but standard Docker is NOT a strong security boundary against untrusted or adversarially generated code. The attack surface is the entire Linux kernel syscall table.

Common misconception: "We run it in Docker so it is safe." A container restricts what the workload can see, not which kernel vulnerabilities it can trigger. Run untrusted code behind a stronger boundary, such as a hardened runtime or a microVM.
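Where a standard container is acceptable, it is still worth narrowing what the workload can see and use. The sketch below only builds a docker run argument list and does not start anything. The helper name, the resource values, and the 65534 uid:gid (conventionally "nobody") are assumptions chosen for illustration.

```python
def hardened_docker_argv(image: str, command: list[str]) -> list[str]:
    """Build a `docker run` argv that narrows what the container can see and use."""
    return [
        "docker", "run",
        "--rm",                                 # remove the container when it exits
        "--network", "none",                    # no network beyond loopback
        "--read-only",                          # root filesystem mounted read-only
        "--tmpfs", "/tmp:rw,size=64m",          # small writable scratch space
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block setuid-style privilege gain
        "--user", "65534:65534",                # run as an unprivileged uid:gid
        "--pids-limit", "128",                  # bound the number of processes
        "--memory", "256m",                     # cgroup memory cap
        "--cpus", "1",                          # cgroup CPU cap
        image,
        *command,
    ]
```

These flags shrink the container's view and resource budget, but the workload still talks to the shared host kernel, so the caveat above continues to apply.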

3. Hardened container runtimes

What they are: drop-in replacements for the container runtime that add a security layer between the container and the host kernel — without requiring full VM provisioning.

gVisor (gvisor.dev) — developed by Google and open-sourced under Apache 2.0. Implements ~200 Linux syscalls in a user-space Go process called the Sentry. Application syscalls are intercepted by the Sentry rather than reaching the host kernel directly; the Sentry in turn makes a minimal set of host syscalls. To escape a gVisor sandbox, an attacker must first compromise the Sentry and then separately exploit the host kernel through that narrow syscall set; the two share no code. Integrates with Docker and Kubernetes via the runsc runtime. Used by Google Cloud Run, OpenAI's Code Interpreter, and Modal.
Kata Containers (katacontainers.io) — open-source project under the OpenInfra Foundation. Each container runs inside its own lightweight VM with a dedicated kernel. Kata supports multiple hypervisors as backends: Firecracker, Cloud Hypervisor, and QEMU/KVM. From the orchestration layer (Kubernetes, containerd) it looks like a standard container. Used by Northflank's sandbox platform.

Isolation strength: strong. The host kernel's attack surface is either sharply reduced to the few host syscalls the Sentry makes (gVisor) or moved behind the hypervisor boundary (Kata). Startup overhead: gVisor adds milliseconds; Kata adds a VM boot (100–300 ms depending on hypervisor).
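The "drop-in" property can be illustrated with Docker's --runtime flag: the same docker run invocation is reused and only the runtime changes. This is a minimal sketch that assumes runsc has already been registered with the Docker daemon. The helper name and the example image are illustrative, and any runtime name other than runsc depends on how the host is configured.

```python
import shlex


def with_runtime(docker_run_argv: list[str], runtime: str) -> list[str]:
    """Return a copy of a `docker run ...` argv that selects an alternative runtime."""
    if docker_run_argv[:2] != ["docker", "run"]:
        raise ValueError("expected an argv starting with ['docker', 'run']")
    return docker_run_argv[:2] + [f"--runtime={runtime}"] + docker_run_argv[2:]


# Same workload, standard runtime versus gVisor's runsc.
base = ["docker", "run", "--rm", "--network", "none",
        "python:3.12-slim", "python", "-c", "print(40 + 2)"]
print(shlex.join(with_runtime(base, "runsc")))
```

Nothing else in the invocation changes, which is why these runtimes are practical to adopt behind existing container tooling.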

4. MicroVMs

What they are: purpose-built VMMs (Virtual Machine Monitors) that boot a minimal Linux kernel in a hardware-virtualized VM in under 200 ms, with a minimal device model and small memory footprint.

Firecracker (github.com/firecracker-microvm/firecracker) — open-sourced by Amazon Web Services under Apache 2.0. The isolation technology behind AWS Lambda and AWS Fargate. Boots microVMs in ~125 ms, adds less than 5 MiB of memory overhead per microVM, and supports creating up to 150 microVMs/second on one host. Uses a jailer process and seccomp filters in production. Also used by E2B, Vercel Sandbox, and as a Kata Containers backend.
Cloud Hypervisor (github.com/cloud-hypervisor/cloud-hypervisor) — open-source VMM written in Rust, hosted under the Linux Foundation. Targets modern cloud workloads with minimal device emulation, low latency, and small memory footprint. Rust's memory and thread safety reduce the risk of memory-corruption bugs in the VMM itself. Supported by the Kata Containers project as a hypervisor backend; used by Northflank.

Isolation strength: very strong. Hardware virtualization boundary separates each workload's kernel from the host. Industry-standard for multi-tenant serverless infrastructure.
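As an illustration of how small a microVM definition can be, the sketch below builds a Firecracker-style JSON configuration with a kernel, a read-only root drive, and a CPU and memory budget. The field names follow Firecracker's JSON config-file format as understood at the time of writing and should be checked against the version in use. The helper name, default sizes, and file paths are hypothetical.

```python
import json


def microvm_config(kernel_path: str, rootfs_path: str,
                   vcpus: int = 1, mem_mib: int = 256) -> dict:
    """Describe a minimal microVM: one kernel, one root drive, no network device."""
    return {
        "boot-source": {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [{
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": True,   # guest cannot modify the shared image
        }],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
        # No "network-interfaces" entry: the guest gets no NIC at all.
    }


# Hypothetical paths, for illustration only.
print(json.dumps(microvm_config("/srv/vm/vmlinux", "/srv/vm/rootfs.ext4"), indent=2))
```

A file like this would typically be handed to the firecracker binary through its --config-file option, with the jailer wrapping the launch in production.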

5. Full VMs

Conventional VMs (KVM/QEMU, Hyper-V, VMware) provide the strongest isolation at the cost of the highest startup time (seconds to minutes) and resource overhead. They are rarely the right choice for agent code execution, where fast ephemeral sandboxes are needed; microVMs rely on the same hardware-virtualization boundary with far lower startup latency.

WebAssembly as a portable sandbox

WebAssembly (WASM) modules run in a capability-based, deny-by-default sandbox. A module cannot access memory outside its own linear memory, cannot make syscalls directly, and cannot use network or filesystem unless the host explicitly grants those capabilities.

Wasmtime (wasmtime.dev) — production-grade WASM runtime by the Bytecode Alliance, written in Rust. Implements WASI (WebAssembly System Interface), which defines capability-based access to filesystem, network, clocks, and random — each must be explicitly granted by the host. A Wasmtime module with no imports has zero host access. Wasmtime is the dominant server-side and edge WASM runtime.
Pyodide (github.com/pyodide/pyodide) — CPython compiled to WASM; runs in the browser or Node.js WASM sandbox. Supports NumPy, Pandas, and other C-extension packages via precompiled wheels. Network and filesystem are unavailable by default. Runs ~3–5× slower than native CPython. Ideal for in-browser Python REPL and lightweight sandboxed computation without server infrastructure.
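The deny-by-default capability model can be shown with a toy host in plain Python. This is a conceptual model only: it is not Wasmtime's or WASI's API, the class and capability names are invented for illustration, and ordinary Python objects provide none of the memory isolation a real WASM runtime enforces.

```python
from typing import Callable


class CapabilityHost:
    """Toy host: a guest can only call functions the host explicitly granted."""

    def __init__(self) -> None:
        self._granted: dict[str, Callable[..., object]] = {}

    def grant(self, name: str, func: Callable[..., object]) -> None:
        """Expose one host function to the guest under `name`."""
        self._granted[name] = func

    def call(self, name: str, *args: object) -> object:
        """Guest-side entry point: anything not granted is refused."""
        try:
            func = self._granted[name]
        except KeyError:
            raise PermissionError(f"capability {name!r} was not granted") from None
        return func(*args)


host = CapabilityHost()
host.grant("add", lambda a, b: a + b)   # the only capability this guest receives
```

With this setup, host.call("add", 2, 3) succeeds, while a request for an ungranted capability such as "fs_read" raises PermissionError. The point mirrors the text: access is something the host adds, not something the guest starts with.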

WASM's limitation: languages that compile to WASM (Rust, C, Go) map well to it. Python via Pyodide is usable for many agent tasks, but performance-sensitive workloads may be too slow, and packages with native C extensions will not run unless a precompiled wheel exists. For general agent code execution, microVM-backed APIs (below) are typically the right default.

Hosted agent-sandbox APIs

For agents that need to execute code without managing isolation infrastructure, these APIs provide sandboxed environments callable from agent code. All are web-verified as of June 2026.

Product	Isolation	Cold start	GPU	Key differentiator
E2B (e2b.dev)	Firecracker microVM	~150 ms	No	Agent-first SDK; Python/JS; MCP integration; free tier; production references (Perplexity, Manus)
Modal (modal.com)	gVisor containers	Fast	Yes (A100, H100)	50k–100k concurrent sandboxes; GPU access; Lovable and Quora in production
Daytona (daytona.io)	gVisor	<90 ms	No	Open-source; stateful persistent workspace; sub-90 ms via pre-warmed pools; $24M Series A Feb 2026
Cloudflare Sandbox (developers.cloudflare.com/sandbox)	Containers (GA Apr 2026) + Dynamic Workers V8 isolates (beta)	ms (isolates) / container	No	Two-tier: full Linux containers via Sandbox SDK + isolate-based Dynamic Workers (100× faster); edge-distributed
Northflank (northflank.com)	Kata Containers (Cloud Hypervisor) + gVisor	—	No	Only platform offering both Kata and gVisor; BYOC (AWS/GCP/Azure); unlimited sessions; 2M+ isolated workloads/month
Vercel Sandbox (vercel.com/sandbox)	Firecracker microVM	—	No	Beta; free; 45 min–5 hr session cap; backed by Vercel Fluid compute; open-source Open Agents stack

Provider built-ins

OpenAI Code Interpreter (Responses API, developers.openai.com) — runs Python in gVisor-backed sandboxed containers. Available as the code_interpreter tool in the Responses API (added May 2025; Assistants API deprecated, shutdown August 26, 2026). Containers expire after 20 minutes of inactivity; supports file upload and download.
Anthropic code execution tool (docs.anthropic.com/en/docs/agents-and-tools/tool-use/code-execution-tool) — runs Python and Bash in Anthropic's sandboxed container. Latest version code_execution_20260120 adds REPL state persistence and programmatic tool calling from within the sandbox. Supported on Claude Opus 4.5+, Sonnet 4.5+.
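Both built-ins are enabled by adding a tool entry to the request. The sketch below builds those entries as plain dictionaries, assuming the payload shapes in each provider's documentation at the time of writing: an automatically created container for OpenAI, and the version string quoted above with the tool name code_execution for Anthropic. The helper names are illustrative, no request is sent, and model names and any required beta headers are left out, so the exact fields should be confirmed against the linked docs.

```python
def openai_code_interpreter_tool() -> dict:
    """Tool entry for the Responses API `tools` list (assumed shape)."""
    return {"type": "code_interpreter", "container": {"type": "auto"}}


def anthropic_code_execution_tool(version: str = "code_execution_20260120") -> dict:
    """Tool entry for the Messages API `tools` list (assumed shape)."""
    return {"type": version, "name": "code_execution"}


# Each entry would be placed in the `tools` array of the respective request body.
tools_for_openai = [openai_code_interpreter_tool()]
tools_for_anthropic = [anthropic_code_execution_tool()]
```

Keeping the version string as a parameter makes it easy to pin an older tool version while testing a newer one.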

Hardening checklist

Regardless of the isolation layer, apply these controls at the harness level:

No network egress by default. Block all outbound connections; allowlist specific hosts only (package registries, APIs the agent legitimately needs). Unrestricted egress allows data exfiltration and SSRF attacks against internal services.
Drop privileges. Run sandbox processes as non-root and apply a seccomp filter to restrict the syscall surface. Firecracker applies seccomp filters by default and its jailer drops privileges before starting the VMM; configure both explicitly for gVisor and container-based options.
Ephemeral filesystem. Give each sandbox execution a fresh, ephemeral filesystem. Do not persist state between unrelated runs unless explicitly designed for it.
Resource limits. Cap CPU time, memory, and wall-clock timeout per execution. Unbounded resource usage enables denial-of-service against the host infrastructure.
No host secrets in the sandbox. Never inject API keys, database credentials, or other secrets into sandbox environment variables. Resolve secrets in the harness layer and pass only the minimum necessary result. See /resources/agentic-security-checklist §5.
Treat sandbox output as untrusted. Validate and sanitize all output from a sandbox before using it in subsequent agent steps. A malicious payload could attempt prompt injection through generated output. See /resources/agentic-security-checklist §1.
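Several of these controls can be sketched at the harness level with the standard library alone. The example below runs code in a child Python process with an empty environment, a throwaway working directory, CPU and address-space limits, a wall-clock timeout, and a cap on the output passed onward. It is Unix-only (it uses the resource module), the limits are illustrative values, and a bare subprocess is not an isolation layer: this belongs inside one of the sandboxes above, not in place of it.

```python
import resource
import subprocess
import sys
import tempfile

MAX_OUTPUT_BYTES = 64 * 1024  # illustrative cap on output handed to later steps


def _apply_limits() -> None:
    """Runs in the child before exec: cap CPU seconds and address space."""
    for limit, wanted in ((resource.RLIMIT_CPU, 2), (resource.RLIMIT_AS, 1 << 30)):
        _, hard = resource.getrlimit(limit)
        # Never ask for more than the existing hard limit allows.
        value = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
        resource.setrlimit(limit, (value, value))


def run_limited(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run `code` in a constrained child process; return (exit code, truncated stdout)."""
    with tempfile.TemporaryDirectory() as workdir:   # fresh directory, deleted afterwards
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,
                env={},                              # no inherited secrets
                stdin=subprocess.DEVNULL,
                capture_output=True,
                timeout=timeout_s,                   # wall-clock limit
                preexec_fn=_apply_limits,            # avoid in multithreaded parents
            )
        except subprocess.TimeoutExpired:
            return -1, "timeout"
    # Truncate what is passed on. A production harness would also bound the
    # pipe read itself rather than buffering everything first.
    out = proc.stdout[:MAX_OUTPUT_BYTES].decode("utf-8", errors="replace")
    return proc.returncode, out
```

The returned string is still untrusted input. Parse or validate it before it reaches a prompt or another tool call.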

Verified sources

Firecracker GitHub (AWS, Apache 2.0): https://github.com/firecracker-microvm/firecracker
Firecracker — AWS Lambda and Fargate origin: https://aws.amazon.com/blogs/opensource/firecracker-open-source-secure-fast-microvm-serverless/
gVisor security model: https://gvisor.dev/docs/architecture_guide/security/
gVisor — What is gVisor: https://gvisor.dev/docs/
Kata Containers GitHub: https://github.com/kata-containers/kata-containers
Cloud Hypervisor GitHub: https://github.com/cloud-hypervisor/cloud-hypervisor
Wasmtime security docs: https://docs.wasmtime.dev/security.html
Pyodide GitHub: https://github.com/pyodide/pyodide
RestrictedPython GitHub: https://github.com/zopefoundation/RestrictedPython
E2B docs: https://e2b.dev/docs
Modal — code execution sandboxes for AI agents (2026): https://modal.com/resources/best-code-execution-sandboxes-ai-agents
Daytona homepage: https://www.daytona.io/
Cloudflare Sandbox SDK docs: https://developers.cloudflare.com/sandbox/
Cloudflare Sandboxes GA (April 2026): https://developers.cloudflare.com/changelog/post/2026-04-13-containers-sandbox-ga/
Northflank — how to sandbox AI agents: https://northflank.com/blog/how-to-sandbox-ai-agents
Vercel Sandbox docs: https://vercel.com/docs/sandbox
OpenAI Code Interpreter tool: https://developers.openai.com/api/docs/guides/tools-code-interpreter
Anthropic code execution tool: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/code-execution-tool
