Why AI Agents Need Sandboxes
Every coding agent follows the same loop: generate code, execute it, read the output, fix errors, repeat. SWE-Bench tasks average 12-15 iterations per solution; production debugging agents iterate even more. That execution step is where sandboxes come in.
Running AI-generated code on bare metal is a non-starter, for four reasons.
Security Isolation
LLMs hallucinate system calls, write to arbitrary paths, and generate network requests to unexpected endpoints. A sandbox limits file access to the project directory, restricts network egress, and prevents privilege escalation. Without isolation, a single bad generation can read your SSH keys or exfiltrate environment variables.
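The file-access restriction described above can be sketched in a few lines of plain Python: a guard that resolves any requested path and refuses anything that escapes the project directory. This is an illustrative check only (the function name and layout are my own), not a substitute for a real sandbox, which enforces the same boundary at the kernel level.

```python
import os

def resolve_in_project(project_root: str, requested: str) -> str:
    """Resolve `requested` against the project root, refusing any path
    that escapes it (via `..`, symlinks, or absolute components)."""
    root = os.path.realpath(project_root)
    target = os.path.realpath(os.path.join(root, requested))
    # commonpath equals the root only if the target stays inside it
    if os.path.commonpath([root, target]) != root:
        raise PermissionError(f"path escapes sandbox: {requested}")
    return target
```

So `resolve_in_project(root, "src/app.py")` succeeds, while `resolve_in_project(root, "../../etc/passwd")` raises `PermissionError` instead of reading your SSH keys.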
Resource Containment
Infinite loops, memory leaks, and fork bombs are common failure modes of generated code. Sandboxes enforce CPU time limits, memory caps, and process count restrictions. When an agent generates an O(n!) algorithm, the sandbox kills it after the timeout instead of taking down your dev machine.
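The containment pattern above can be modeled with nothing but the standard library: run the generated code in a child process with a wall-clock timeout and an address-space cap. A minimal Unix-only sketch (the function name and limits are my own choices, not any provider's API):

```python
import resource
import subprocess
import sys

def run_limited(code: str, timeout_s: float = 2.0, mem_mb: int = 512):
    """Run untrusted code in a child process with a timeout and a
    memory cap. Returns the CompletedProcess, or None if killed."""
    def set_limits():
        # Cap the child's address space; a runaway allocation gets
        # MemoryError instead of exhausting the host.
        limit = mem_mb * 1024 * 1024
        resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    try:
        return subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=set_limits,
        )
    except subprocess.TimeoutExpired:
        return None  # the O(n!) case: killed, host unaffected
```

An infinite loop is killed after `timeout_s` seconds rather than hanging the machine; real sandboxes apply the same idea with cgroups or hypervisor limits.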
Reproducibility
Agents need consistent environments across iterations. If the first execution installs numpy 2.1 and the third installs 2.3, the agent wastes tokens debugging phantom dependency errors. Sandboxes provide snapshot-and-restore so each iteration starts from a known state, or persistent filesystems so installed packages carry forward predictably.
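The snapshot-and-restore idea can be illustrated with a toy directory-copy version. Real sandboxes snapshot filesystems or entire VM memory; the class below is only a model of the contract (all names are mine):

```python
import shutil
import tempfile
from pathlib import Path

class DirSnapshot:
    """Toy snapshot-and-restore: copy the workspace aside at a known
    state, and restore it so each iteration starts identically."""
    def __init__(self, workspace: Path):
        self.workspace = Path(workspace)
        self._snap = Path(tempfile.mkdtemp()) / "snap"
        shutil.copytree(self.workspace, self._snap)

    def restore(self):
        # Throw away whatever the last iteration did and reset
        shutil.rmtree(self.workspace)
        shutil.copytree(self._snap, self.workspace)
```

With this contract, an iteration that pins numpy 2.3 cannot leak into the next run: `restore()` puts the workspace back to the snapshotted state.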
Parallelism
Multi-agent architectures run 4-8 agents simultaneously, each generating and executing code. Without sandboxes, those agents compete for ports, write to the same files, and corrupt each other's state. Isolated sandboxes let each agent operate in its own filesystem and network namespace.
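The filesystem half of that isolation is easy to sketch: give every agent its own scratch directory so parallel writes to the same filename never collide. (This covers files only; real sandboxes also isolate network namespaces. The helper name is mine.)

```python
import tempfile
from pathlib import Path

def isolated_workspaces(n_agents: int) -> list[Path]:
    """One private scratch directory per agent, so concurrent agents
    writing 'main.py' never clobber each other's state."""
    return [Path(tempfile.mkdtemp(prefix=f"agent{i}-")) for i in range(n_agents)]
```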
What Makes a Good Agent Sandbox
Not all sandboxes are built for AI agents. A sandbox designed for human developers (spin up a VM, install your tools, code for hours) has different requirements than one designed for an agent that needs to boot, execute 50 lines of Python, read stdout, and tear down in under a second.
Six properties matter most for agent workloads:
Cold Start < 1 Second
Agents iterate fast. A 3-second cold start on each of 15 iterations adds 45 seconds of dead time per task. Firecracker microVMs achieve sub-150ms. Warm pools can bring this close to zero. Any sandbox targeting agent use cases should boot in under a second.
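The arithmetic behind that budget is worth making explicit:

```python
def dead_time_s(cold_start_ms: float, iterations: int) -> float:
    """Total boot overhead an agent pays across one task."""
    return cold_start_ms * iterations / 1000

# A 3s cold start over 15 iterations is 45s of dead time per task;
# a 150ms Firecracker boot cuts that to 2.25s.
print(dead_time_s(3000, 15))  # 45.0
print(dead_time_s(150, 15))   # 2.25
```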
Persistent Filesystem
Agents install packages, write config files, and build artifacts across iterations. The sandbox needs to preserve these between executions within a session. Stateless sandboxes force the agent to reinstall dependencies every iteration, wasting tokens and time.
Package Management
Generated code imports libraries. The sandbox needs pip, npm, cargo, or equivalent. Pre-built templates with common packages reduce cold starts. Custom templates let teams bake in their specific dependencies.
Network Access Control
Some agent tasks require network access (API calls, git clone, package downloads). Others should be fully isolated. A good sandbox provides granular network policies: allow PyPI but block everything else, or allow outbound HTTPS but block raw sockets.
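A granular policy like "allow PyPI but block everything else" reduces to an allowlist check at the egress point. A hedged sketch of the policy logic only (the host list and function name are illustrative; real sandboxes enforce this at the network layer, not in application code):

```python
from urllib.parse import urlparse

# Example policy: package downloads only
ALLOWED_HOSTS = {"pypi.org", "files.pythonhosted.org"}

def egress_allowed(url: str) -> bool:
    """Permit outbound HTTPS to approved hosts; deny everything else,
    including plaintext HTTP to an otherwise-approved host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS
```

Under this policy, `pip install` traffic to PyPI passes, while a generated request to an unexpected endpoint is rejected.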
Multi-Language Support
Coding agents work across Python, TypeScript, Go, Rust, Java, and C++. The sandbox needs to support multiple runtimes, ideally with pre-installed toolchains. Switching languages mid-task (Python for prototyping, then rewriting in Go) should not require a new sandbox.
Per-Second Billing
Agent sandbox sessions last seconds to minutes, not hours. Per-hour or per-month pricing penalizes agent workloads. Per-second billing (or per-execution billing) aligns cost with actual usage. A typical agent task uses 5-30 seconds of total sandbox compute.
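The pricing mismatch is easy to quantify. Using the $0.000014/s per-second figure quoted for E2B later in this article, and assuming a hypothetical $0.10/hour plan billed in whole hours for comparison:

```python
import math

def per_second_cost(seconds: float, rate_per_second: float) -> float:
    """Cost under per-second billing: pay exactly for what you use."""
    return seconds * rate_per_second

def per_hour_cost(seconds: float, rate_per_hour: float) -> float:
    """Cost under per-hour billing: usage rounds up to whole hours."""
    return math.ceil(seconds / 3600) * rate_per_hour

# A 30-second agent task: a fraction of a cent per-second,
# but a full hour's charge under hourly billing.
print(per_second_cost(30, 0.000014))  # 0.00042 (about $0.0004)
print(per_hour_cost(30, 0.10))        # 0.1
```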
Sandboxing Approaches
Four architectures dominate the agent sandbox space in 2026. Each makes different tradeoffs between isolation strength, cold start speed, flexibility, and operational complexity.
Firecracker MicroVMs
Firecracker, originally built by AWS for Lambda, boots a minimal Linux kernel in a lightweight VM. Each sandbox gets its own kernel, memory space, and virtual network interface. E2B and Fly.io Machines use this approach.
Firecracker Tradeoffs
Strengths: sub-150ms cold starts, hardware-level isolation (each sandbox has its own kernel), small memory footprint (as low as 5MB overhead per VM). The isolation model is equivalent to running separate physical machines.
Weaknesses: less ecosystem tooling than Docker, GPU passthrough requires custom kernel configuration, and snapshot/restore adds operational complexity. You are managing VMs, not containers.
Container-Based Sandboxes
Docker containers share the host kernel and use Linux namespaces and cgroups for isolation. Modal, Railway, and most CI/CD systems use this model. Developers already know Docker, which reduces the learning curve.
Container Tradeoffs
Strengths: familiar tooling (Dockerfile, docker-compose), massive ecosystem of pre-built images, GPU support via NVIDIA Container Toolkit, easy local development parity.
Weaknesses: shared kernel means a kernel exploit in one container can compromise others. Cold starts range from 500ms to 5s depending on image size. Not suitable for running truly untrusted code without additional sandboxing layers (gVisor, Kata).
Full Dev Environments
Daytona and Gitpod provision complete development environments with IDE support, Git integration, and persistent workspaces. These are heavier than microVMs or containers but provide everything an agent (or human) needs for sustained development sessions.
Dev Environment Tradeoffs
Strengths: complete toolchain out of the box (LSP, debugger, Git, terminal), persistent workspaces that survive disconnections, IDE integration for human-in-the-loop workflows.
Weaknesses: cold starts measured in seconds to minutes, higher cost per session, overkill for agents that just need to run a script and read stdout. Best suited for long-running agent sessions or hybrid human-agent workflows.
Integrated Platforms
Morph combines sandbox execution with fast-apply (10,500 tok/s file editing), agentic code search, and context management in a single SDK. Instead of stitching together a sandbox provider, a diff-apply tool, and a code search engine, the entire agent loop runs through one API.
Integrated Platform Tradeoffs
Strengths: lowest total agent loop latency (execution + editing + search in one round trip), single billing relationship, no integration overhead between components. Morph's fast-apply processes edits at 10,500 tok/s, which means an agent applying a 200-line diff waits ~19ms instead of 2-3 seconds with naive search-and-replace.
Weaknesses: less flexibility to swap individual components. If you already have a sandbox solution and only need code search, a standalone tool like WarpGrep may fit better.
Provider Comparison
| Feature | E2B | Modal | Daytona | Morph |
|---|---|---|---|---|
| Isolation model | Firecracker microVM | Container (gVisor) | Container/VM | Integrated platform |
| Cold start | ~150ms | ~500ms-1s | 5-30s | Sub-second |
| Persistent filesystem | Yes (session-scoped) | Yes (volumes) | Yes (workspace) | Yes (session-scoped) |
| GPU support | No | Yes (A100, H100) | Via provider | No |
| Languages | Any (Linux) | Any (Docker) | Any (Docker) | Any (Linux) |
| SDK | Python, JS/TS | Python | REST API | Python, JS/TS |
| Code search | No | No | No | Yes (agentic) |
| Fast apply | No | No | No | Yes (10,500 tok/s) |
| Billing model | Per-second | Per-second + GPU | Per-workspace-hour | Per-request |
| Open source | Yes (SDK) | No | Yes | SDK open source |
| Best for | Pure execution | GPU workloads | Long dev sessions | Full agent stack |
The comparison reveals a clear split. E2B and Modal are sandbox-first: they do code execution well and leave everything else to you. Daytona provides complete environments for longer sessions. Morph is agent-first: the sandbox exists as one component of a larger system designed to minimize total agent loop time.
Beyond the Sandbox: The Full Agent Loop
Code execution gets the most attention, but it is often not the bottleneck. Cognition measured that their Devin agent spent 60% of its time on search and navigation, not execution. Anthropic reported 90% improvement when splitting tasks across specialized sub-agents. The sandbox is necessary but not sufficient.
A complete agent infrastructure stack has three components beyond the sandbox:
Fast Apply
Agents generate diffs and edits constantly. Naive search-and-replace fails on ambiguous matches and takes 2-3 seconds per file. Morph's fast-apply engine processes edits at 10,500 tok/s with 0.73 F1 accuracy on the SWE-Bench edit benchmark, turning a 200-line diff into a 19ms operation.
Code Search
Before an agent can fix a bug, it needs to find the relevant code. Sequential grep-based search takes 10+ seconds on large repos. Agentic search (8 parallel tool calls per turn, 4 turns, sub-6s total) finds the right files without burning the agent's context window on irrelevant results.
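The fan-out pattern behind agentic search (many tool calls in flight per turn) can be sketched with a thread pool over per-file scans. This is a toy model of the parallelism, not Morph's actual search engine; every name below is my own:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def scan_file(path: Path, needle: str) -> list[tuple[Path, int]]:
    """One 'tool call': line numbers in a file mentioning the query."""
    hits = []
    try:
        for lineno, line in enumerate(
            path.read_text(errors="ignore").splitlines(), 1
        ):
            if needle in line:
                hits.append((path, lineno))
    except OSError:
        pass  # unreadable file: skip rather than fail the search
    return hits

def parallel_search(root: str, needle: str, workers: int = 8):
    """Fan scans out across 8 workers, mirroring the 8-parallel-
    tool-calls-per-turn pattern, instead of grepping sequentially."""
    files = list(Path(root).rglob("*.py"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: scan_file(p, needle), files)
    return [hit for sub in results for hit in sub]
```

Sequential scanning pays file-IO latency one file at a time; the pool overlaps it, which is where the sub-6s totals on large repos come from.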
Context Management
LLM context windows are finite. An agent that stuffs its entire codebase into context wastes tokens and hits accuracy degradation (context rot). Effective context management feeds the model only the files it needs for the current step, keeping the working set small and the output quality high.
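Keeping the working set small is, at its simplest, a budgeted selection problem: rank candidate files by relevance and pack them until the token budget is spent. A greedy sketch under that assumption (real systems also chunk within files; the function and tuple layout are mine):

```python
def select_context(
    candidates: list[tuple[str, float, int]],  # (path, relevance, tokens)
    budget_tokens: int,
) -> list[str]:
    """Greedy context packing: take files in descending relevance
    until the token budget is exhausted, skipping files that don't fit."""
    chosen, used = [], 0
    for path, _score, tokens in sorted(candidates, key=lambda c: -c[1]):
        if used + tokens <= budget_tokens:
            chosen.append(path)
            used += tokens
    return chosen

# Only the most relevant files that fit reach the model
print(select_context(
    [("a.py", 0.9, 500), ("b.py", 0.8, 700), ("c.py", 0.3, 900)],
    budget_tokens=1300,
))  # ['a.py', 'b.py']
```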
The Integration Tax
Using separate providers for sandbox, apply, and search means three API calls per agent iteration, three billing relationships, three failure modes, and serialization overhead between each component. Morph's integrated SDK eliminates this tax. A single API call can execute code, apply an edit, and search the codebase, with results returned in one response.
Getting Started with Morph Sandbox
The Morph SDK provides sandbox access alongside fast-apply and code search. Here is a minimal example of an agent loop that uses all three.
Agent loop with Morph Sandbox SDK
```python
from morph import MorphClient

client = MorphClient(api_key="your-api-key")

# Create a sandbox session
sandbox = client.sandbox.create(
    template="python-3.12",
    timeout=300,  # 5 minute max session
)

# Write a file into the sandbox
sandbox.files.write("main.py", """
import requests

def fetch_status(url: str) -> int:
    return requests.get(url).status_code

print(fetch_status("https://httpbin.org/status/200"))
""")

# Execute and read output
result = sandbox.exec("python main.py")
print(result.stdout)     # "200"
print(result.exit_code)  # 0

# If the agent needs to edit the file, use fast-apply
edit_result = client.apply(
    original=sandbox.files.read("main.py"),
    update="Add retry logic with exponential backoff",
    model="morph-v3-fast",
)
sandbox.files.write("main.py", edit_result.content)

# Search the codebase for related files
search = client.search(
    query="retry logic implementation",
    codebase_path="/workspace",
)
for file in search.results:
    print(f"{file.path}: {file.relevance_score}")

# Clean up
sandbox.stop()
```

The full SDK documentation, including custom templates, GPU sandboxes, and multi-agent session management, is available at docs.morphllm.com.
Frequently Asked Questions
What is an AI agent sandbox?
An isolated execution environment where AI coding agents run arbitrary code without affecting the host system. Sandboxes provide resource limits, filesystem isolation, and network controls. Common implementations use Firecracker microVMs, containers, or full cloud dev environments.
Why do AI coding agents need sandboxes?
Agents generate and execute code in a loop: write, run, observe, fix, repeat. Running untrusted AI-generated code on production infrastructure creates security risks (arbitrary file access, network exfiltration) and stability risks (infinite loops, memory leaks). Sandboxes isolate these risks while preserving the agent's ability to iterate.
What is cold start latency and why does it matter?
Cold start latency is the time between requesting a sandbox and having a running environment. For agents iterating 12-15 times per task, every 100ms of cold start compounds. Firecracker microVMs achieve sub-150ms. Container-based solutions range from 500ms to 5s. Warm pools can reduce cold starts to near-zero for frequently used templates.
Firecracker microVMs vs containers: which is better for AI agents?
Firecracker provides stronger isolation (separate kernel per sandbox) with faster cold starts (sub-150ms vs 500ms-5s for containers). Containers provide a more familiar tooling ecosystem (Docker) and better GPU support via NVIDIA Container Toolkit. For running untrusted AI-generated code, microVMs are the safer choice. For GPU workloads or teams already invested in Docker, containers make more sense.
How does Morph compare to E2B?
E2B is a standalone sandbox provider built on Firecracker. Morph is an integrated agent infrastructure platform where sandbox execution works alongside fast-apply (10,500 tok/s) and agentic code search. If you only need code execution, E2B is purpose-built for it. If you are building a full coding agent, Morph reduces total loop latency by combining execution, editing, and search in one SDK.
Can I use agent sandboxes for production workloads?
Sandboxes are designed for ephemeral tasks: running tests, executing scripts, validating builds. For long-running services, you need persistent infrastructure. Some providers (Daytona, Gitpod) offer longer-lived environments for development but not production hosting. Most sandbox providers bill per second, making them cost-effective for burst agent workloads.
What languages do agent sandboxes support?
Most sandboxes support any language that runs on Linux. Pre-built templates typically cover Python, Node.js, Go, Rust, Java, and C++. Container-based solutions accept custom Docker images with any language stack. The practical constraint is cold start time: heavier runtimes with more pre-installed packages increase boot latency.
How much do AI agent sandboxes cost?
E2B charges per second of sandbox runtime (from $0.000014/s). Modal charges per GPU-second with a free tier. Daytona and Gitpod charge per workspace-hour. For typical agent workloads running 10-50 sandbox sessions per task at 30 seconds each, expect $0.01-0.10 per task in sandbox costs. Morph includes sandbox access as part of its agent infrastructure pricing.
Related Articles
Build Agents with Integrated Sandbox, Apply, and Search
Morph SDK gives your coding agents sandbox execution, 10,500 tok/s fast-apply, and agentic code search in a single API. Stop stitching together separate providers.