What Is nsjail
nsjail is a process isolation tool for Linux. It wraps a single process (or process tree) in a combination of kernel-level isolation mechanisms: namespaces, cgroups, rlimits, and seccomp-bpf syscall filters. The result is a sandbox that restricts what the process can see, do, and consume, without a container runtime or virtual machine.
Google developed nsjail internally and released it as open source. It is not an official Google product, but it has been continuously maintained since 2015. Google uses it for hosting CTF (Capture The Flag) security challenges, where contestants upload and execute arbitrary code. That origin matters: nsjail was built for adversarial workloads, not just development convenience.
The tool ships as a single binary. You point it at an executable, specify isolation constraints, and it launches a sandboxed child process. No daemon, no image registry, no orchestrator. This simplicity is the core design decision: nsjail does one thing (process isolation) and composes with whatever infrastructure you already have.
Who uses nsjail in production
Windmill uses nsjail to sandbox Python and Go workflow executions. Each job runs in its own nsjail instance with filesystem isolation, network restrictions, and resource limits. Google uses nsjail for CTF hosting and internal security testing. Multiple code evaluation platforms use it for running untrusted submissions in competitive programming and hiring pipelines.
Architecture: How nsjail Isolates Processes
nsjail layers four Linux kernel mechanisms. Each addresses a different class of escape or abuse.
Linux Namespaces
Isolates the process's view of system resources. PID namespace gives it its own process tree. Mount namespace controls filesystem visibility. Network namespace isolates network interfaces. User namespace maps UID/GID. UTS namespace isolates hostname. IPC namespace isolates inter-process communication. Cgroup namespace isolates cgroup visibility.
Cgroups (v1 and v2)
Limits resource consumption. Memory limits keep a runaway sandbox from exhausting host memory and triggering the host's OOM killer. CPU limits prevent monopolizing cores. PID limits cap the number of processes the sandbox can spawn. Net_cls tags network traffic for QoS enforcement. Cgroups ensure a runaway process cannot degrade the host.
Seccomp-BPF
Filters system calls at the kernel boundary. Before the kernel executes any syscall from the sandboxed process, the BPF program inspects the call number and arguments. A process that attempts a disallowed syscall is killed immediately (or receives an error, depending on the action the policy assigns). This blocks kernel exploits that rely on vulnerable syscall handlers.
Resource Limits (rlimits)
Per-process caps on file descriptors, virtual memory, CPU time, and file sizes. Unlike cgroups (which limit the group), rlimits constrain individual processes. nsjail applies both for defense in depth: cgroups for the sandbox as a whole, rlimits per process inside it.
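To make the rlimit layer concrete, here is a minimal Python sketch that lowers per-process limits in a child before it starts. It uses the standard `resource` module rather than nsjail itself; the names `run_with_rlimits` and `child_nofile_limit` and the chosen limits are illustrative assumptions, not part of nsjail.

```python
import resource
import subprocess
import sys


def _capped(which: int, wanted: int) -> tuple:
    """Return a (soft, hard) pair no higher than the current hard limit."""
    _soft, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY and wanted > hard:
        wanted = hard
    return (wanted, wanted)


def run_with_rlimits(argv, max_open_files=64, cpu_seconds=10):
    """Run argv with RLIMIT_NOFILE and RLIMIT_CPU lowered in the child only."""
    nofile = _capped(resource.RLIMIT_NOFILE, max_open_files)
    cpu = _capped(resource.RLIMIT_CPU, cpu_seconds)

    def apply_limits():
        # Runs in the child after fork() and before exec(), so the parent
        # process keeps its own limits.
        resource.setrlimit(resource.RLIMIT_NOFILE, nofile)
        resource.setrlimit(resource.RLIMIT_CPU, cpu)

    return subprocess.run(
        argv, preexec_fn=apply_limits, capture_output=True, text=True
    )


def child_nofile_limit(max_open_files=64) -> int:
    """Ask a child interpreter which soft RLIMIT_NOFILE it actually received."""
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE)[0])"
    result = run_with_rlimits([sys.executable, "-c", probe], max_open_files=max_open_files)
    return int(result.stdout.strip())
```

Because the limits are set per process, every child inherits them individually; a cgroup limit, by contrast, would be shared by all processes in the sandbox.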
Namespace Isolation in Detail
nsjail supports the seven namespace types listed below (newer kernels also provide a time namespace). Each is independently toggleable via a configuration field:
| Namespace | Isolates | Config field |
|---|---|---|
| PID | Process ID tree (sandbox sees PID 1 as init) | clone_newpid |
| Mount | Filesystem mounts (chroot/pivot_root) | clone_newns |
| Network | Network interfaces, routing tables, firewall rules | clone_newnet |
| User | UID/GID mappings (enables rootless operation) | clone_newuser |
| UTS | Hostname and domain name | clone_newuts |
| IPC | System V IPC, POSIX message queues | clone_newipc |
| Cgroup | Cgroup root directory visibility | clone_newcgroup |
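One way to check which namespaces a process actually ended up in is to read the symlinks under `/proc/<pid>/ns`, which the Linux kernel exposes for every process. The sketch below is a small Python helper for that; the function name `namespace_ids` is an assumption for illustration, and it returns an empty dict on systems without `/proc`.

```python
import os


def namespace_ids(pid="self") -> dict:
    """Map namespace names (pid, mnt, net, ...) to their kernel identifiers."""
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        # Not Linux, or /proc is not mounted: nothing to report.
        return ids
    for name in sorted(entries):
        try:
            # Each entry is a symlink whose target looks like "pid:[4026531836]".
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:20s} {ident}")
```

Running this once on the host and once inside the sandbox should show different identifiers for every namespace type that was unshared.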
Filesystem Constraints
nsjail uses pivot_root() (preferred) or chroot() to change the filesystem root for the sandboxed process. You define explicit mount points: which host paths are visible, whether they are read-only or read-write, and where tmpfs or proc filesystems are mounted. Everything not explicitly mounted is invisible.
Typical mount configuration
mount {
src: "/usr"
dst: "/usr"
is_bind: true
rw: false
}
mount {
src: "/lib"
dst: "/lib"
is_bind: true
rw: false
}
mount {
dst: "/tmp"
fstype: "tmpfs"
rw: true
is_bind: false
}
mount {
dst: "/proc"
fstype: "proc"
rw: false
}
# No /home, /root, /etc/shadow visible to sandbox
Configuration: Protobuf Config Files
nsjail accepts command-line flags for simple cases and protobuf-based configuration files for production deployments. The protobuf format is defined in config.proto in the nsjail repository. It covers every isolation knob: namespaces, mounts, cgroups, rlimits, seccomp policies, UID/GID mappings, and execution parameters.
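Because mount blocks and limits are repetitive, some teams generate the config text rather than hand-edit it. The following Python sketch renders a fragment using only field names that appear in this article (`name`, `mode`, `cgroup_mem_max`, `cgroup_pids_max`, `cgroup_cpu_ms_per_sec`, and the `mount` fields). The `Mount` class and `render_config` function are hypothetical helpers, not part of nsjail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mount:
    dst: str
    src: Optional[str] = None      # set for bind mounts
    fstype: Optional[str] = None   # set for tmpfs/proc mounts
    rw: bool = False

    def render(self) -> str:
        lines = ["mount {"]
        if self.src is not None:
            lines.append(f'  src: "{self.src}"')
        lines.append(f'  dst: "{self.dst}"')
        if self.fstype is not None:
            lines.append(f'  fstype: "{self.fstype}"')
        if self.src is not None and self.fstype is None:
            lines.append("  is_bind: true")
        lines.append(f"  rw: {'true' if self.rw else 'false'}")
        lines.append("}")
        return "\n".join(lines)


def render_config(name: str, mounts: List[Mount], mem_mib: int,
                  cpu_fraction: float, pids_max: int) -> str:
    """Render a config fragment; memory is converted from MiB to bytes."""
    header = [
        f'name: "{name}"',
        "mode: ONCE",
        f"cgroup_mem_max: {mem_mib * 1024 * 1024}",
        f"cgroup_pids_max: {pids_max}",
        # 1000 ms of CPU time per second equals one full core.
        f"cgroup_cpu_ms_per_sec: {round(cpu_fraction * 1000)}",
    ]
    return "\n".join(header + [m.render() for m in mounts]) + "\n"


if __name__ == "__main__":
    print(render_config(
        "python-sandbox",
        [Mount(dst="/usr", src="/usr"), Mount(dst="/tmp", fstype="tmpfs", rw=True)],
        mem_mib=512, cpu_fraction=0.5, pids_max=64,
    ))
```

Generating the file keeps unit conversions (512 MiB to 536870912 bytes, 50% of a core to 500 ms per second) in one place instead of in comments.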
Complete nsjail config for sandboxing Python execution
name: "python-sandbox"
description: "Sandbox for executing untrusted Python code"
mode: ONCE
hostname: "sandbox"
cwd: "/app"
time_limit: 30
max_cpus: 1
clone_newuser: true
clone_newnet: true
clone_newns: true
clone_newpid: true
clone_newipc: true
clone_newuts: true
clone_newcgroup: true
rlimit_as_type: HARD
rlimit_cpu_type: HARD
rlimit_fsize: 64 # Max file size: 64 MB
rlimit_nofile: 128 # Max open file descriptors
uidmap {
inside_id: "1000"
outside_id: ""
count: 1
}
gidmap {
inside_id: "1000"
outside_id: ""
count: 1
}
cgroup_mem_max: 536870912 # 512 MB
cgroup_pids_max: 64 # Max 64 processes
cgroup_cpu_ms_per_sec: 500 # 50% of one CPU core
mount {
src: "/usr"
dst: "/usr"
is_bind: true
rw: false
}
mount {
src: "/lib"
dst: "/lib"
is_bind: true
rw: false
}
mount {
src: "/lib64"
dst: "/lib64"
is_bind: true
rw: false
mandatory: false
}
mount {
src: "/bin"
dst: "/bin"
is_bind: true
rw: false
}
mount {
dst: "/tmp"
fstype: "tmpfs"
rw: true
}
mount {
dst: "/app"
fstype: "tmpfs"
rw: true
}
mount {
dst: "/proc"
fstype: "proc"
rw: false
}
mount {
src: "/dev/null"
dst: "/dev/null"
is_bind: true
rw: false
}
mount {
src: "/dev/urandom"
dst: "/dev/urandom"
is_bind: true
rw: false
}
exec_bin {
path: "/usr/bin/python3"
arg: "-c"
}
Command-Line Usage
For quick tests and simple sandboxing, nsjail's CLI flags work without a config file. This is useful during development, though production deployments should use config files for reproducibility.
nsjail CLI: sandbox a Python script
# Basic: run a Python script with network isolation.
# The network, PID, mount, and user namespaces are enabled by default on the
# command line; nsjail exposes --disable_clone_new* switches to turn them off.
nsjail \
  --mode o \
  --chroot / \
  --user 65534 \
  --group 65534 \
  --time_limit 30 \
  --rlimit_as 512 \
  --rlimit_cpu 10 \
  --rlimit_nofile 64 \
  -- /usr/bin/python3 /tmp/untrusted_script.py
# Using a protobuf config file
nsjail --config /etc/nsjail/python-sandbox.cfg \
-- /usr/bin/python3 -c "print('hello from sandbox')"
# TCP listener mode: accept connections, sandbox each
nsjail --mode l --port 9999 \
  --config /etc/nsjail/service.cfg \
  -- /usr/local/bin/my_service
Execution Modes
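nsjail's execution modes suit different workload patterns. The three used most often are described below; a fourth, EXECVE, runs the command directly via execve without a supervising nsjail process: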
ONCE
Runs the command once and exits. The standard mode for code execution: launch, run, collect output, done. Used by code evaluation pipelines and AI agent sandboxes.
RERUN
Re-executes the command each time it exits. Useful for persistent services that should restart on crash. The sandbox constraints persist across restarts.
LISTEN_TCP
Binds to a TCP port and forks a new sandboxed process for each incoming connection. Used for network services like HTTP handlers and CTF challenges. Each connection gets its own isolated process. In config files this mode is written as mode: LISTEN.
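The process-per-connection idea behind the listener mode can be modelled in a few lines of Python. This is a sketch of the pattern only, not nsjail's implementation: it applies no isolation, handles a single connection, and the names `serve_one_connection` and `demo` are assumptions. In a real deployment the child command would itself be the sandboxed program.

```python
import socket
import subprocess
import sys
import threading


def serve_one_connection(listener: socket.socket, argv: list) -> int:
    """Accept one client and run argv with the socket as its stdin and stdout."""
    conn, _addr = listener.accept()
    with conn:
        proc = subprocess.Popen(
            argv,
            stdin=conn.fileno(),
            stdout=conn.fileno(),
            stderr=subprocess.DEVNULL,
        )
        return proc.wait()


def demo() -> bytes:
    """Start a listener on an ephemeral loopback port and talk to it once."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    child = [sys.executable, "-c", "print('hello from child')"]
    server = threading.Thread(target=serve_one_connection, args=(listener, child))
    server.start()

    received = b""
    with socket.create_connection(("127.0.0.1", port), timeout=10) as client:
        while True:
            chunk = client.recv(1024)
            if not chunk:
                break
            received += chunk

    server.join()
    listener.close()
    return received


if __name__ == "__main__":
    print(demo())
```

Each client talks to its own child process over the accepted socket, so one session's state never reaches another.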
Seccomp-BPF and the Kafel Policy Language
Seccomp-BPF is the strongest isolation layer nsjail provides. Namespaces control visibility. Cgroups control resources. Seccomp-BPF controls what the process can ask the kernel to do. A sandboxed process that can only call read, write, mmap, and exit_group cannot open files, fork children, create sockets, or exploit kernel vulnerabilities through obscure syscall handlers.
Writing raw BPF bytecode is tedious and error-prone. nsjail uses Kafel, a domain-specific language that compiles human-readable rules into BPF programs. Kafel supports named policies, argument-level filtering, and policy composition.
Kafel policy: restrictive sandbox for code execution
POLICY code_execution {
/* File I/O */
ALLOW {
read, write, readv, writev,
open, openat, close,
stat, fstat, lstat, newfstatat,
lseek, access, faccessat
}
/* Memory management */
ALLOW {
mmap, munmap, mprotect, mremap, brk,
madvise
}
/* Process lifecycle */
ALLOW {
clone, fork, vfork, execve,
wait4, exit, exit_group,
getpid, getppid, gettid
}
/* Signals */
ALLOW {
rt_sigaction, rt_sigprocmask,
rt_sigreturn, kill
}
/* Filesystem metadata */
ALLOW {
getcwd, readlink, readlinkat,
getdents, getdents64
}
/* Misc required for Python/Node */
ALLOW {
futex, clock_gettime, clock_nanosleep,
nanosleep, getrandom, pipe, pipe2,
dup, dup2, dup3, fcntl, ioctl,
set_tid_address, set_robust_list,
sched_getaffinity, sched_yield,
arch_prctl, prctl, prlimit64
}
/* Everything not listed above gets the default action set by USE below */
}
USE code_execution DEFAULT KILL
Syscall filtering is defense in depth
Seccomp-BPF policies do not replace namespace isolation. They complement it. A namespace escape exploit still needs to make syscalls, and seccomp-BPF blocks the syscalls the exploit relies on. Conversely, a seccomp bypass (rare, since the filter runs in kernel space) is contained by namespace isolation. The combination is stronger than either alone.
Argument-Level Filtering
Kafel can filter based on syscall arguments, not just syscall numbers. This lets you allow open() for reading but block open() for writing, or allow clone() for threads but block it for new processes.
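Argument-level rules boil down to integer comparisons and bit tests on the raw syscall arguments. The Python sketch below models those checks so their behaviour can be tried outside a filter; the function names are illustrative assumptions. It also shows a common pitfall: comparing open() flags with exactly 0 rejects read-only opens that carry extra flags such as O_CLOEXEC.

```python
import os

CLONE_THREAD = 0x00010000  # flag bit set when clone() creates a thread
AF_UNIX = 1                # address family for local sockets


def open_is_read_only(flags: int) -> bool:
    """Compare only the access-mode bits, ignoring flags like O_CLOEXEC."""
    return (flags & os.O_ACCMODE) == os.O_RDONLY


def open_flags_exactly_zero(flags: int) -> bool:
    """The stricter check: passes only when no flag bits at all are set."""
    return flags == 0


def clone_is_thread(flags: int) -> bool:
    """True when the CLONE_THREAD bit is present in the clone() flags."""
    return bool(flags & CLONE_THREAD)


def socket_is_local(domain: int) -> bool:
    """True when socket() is asked for an AF_UNIX socket."""
    return domain == AF_UNIX


if __name__ == "__main__":
    typical = os.O_RDONLY | os.O_CLOEXEC
    print(open_is_read_only(typical), open_flags_exactly_zero(typical))
```

A filter can only compare these integer values; it cannot follow pointer arguments such as the path passed to open().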
Kafel: argument-level syscall filtering
POLICY restricted_io {
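/* Allow open() only for read (O_RDONLY = 0). An exact match on 0 also rejects
   O_RDONLY combined with flags such as O_CLOEXEC; mask the access-mode bits
   instead if those calls should pass. */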
ALLOW {
open { arg1 == 0 }
openat { arg2 == 0 }
}
/* Allow clone() for threads only (CLONE_THREAD flag) */
ALLOW {
clone { arg0 & 0x00010000 } /* CLONE_THREAD */
}
/* Allow socket() only for AF_UNIX (local IPC) */
ALLOW {
socket { arg0 == 1 } /* AF_UNIX */
}
/* Anything that matches no rule gets the default action set by USE below */
}
USE restricted_io DEFAULT KILL
nsjail vs Docker vs gVisor
nsjail, Docker, and gVisor operate at different layers. Choosing between them depends on what you are isolating, how fast you need it, and how much infrastructure you want to manage.
| | nsjail | Docker | gVisor |
|---|---|---|---|
| Isolation target | Single process / process tree | Application stack (full OS userspace) | Application stack (intercepted syscalls) |
| Startup time | < 20ms | 500ms - 2s | 200 - 500ms |
| Daemon required | No | Yes (dockerd) | No (OCI runtime) |
| Syscall handling | Kernel executes allowed syscalls directly | Kernel executes every syscall the default seccomp profile permits | Sentry reimplements syscalls in userspace |
| CPU overhead | Near zero (BPF filter only) | Near zero (namespace overhead) | 5-20% (userspace syscall translation) |
| Kernel attack surface | Reduced (blocked syscalls never reach kernel) | Large (the default seccomp profile blocks only a few dozen syscalls) | Minimal (only ~20 host syscalls used) |
| Filesystem model | Explicit bind mounts | Layered image filesystem (overlayfs) | Layered image filesystem (overlayfs) |
| Network isolation | Namespace-based (no networking by default) | Bridge networking (veth pairs) | Netstack (userspace TCP/IP) or host |
| Configuration | Protobuf config + Kafel policies | Dockerfile + compose | OCI spec + runsc flags |
| Best for | High-throughput code execution | Application deployment | Defense-in-depth sandboxing |
When to Choose nsjail
nsjail is the right tool when you need to sandbox many short-lived processes with minimal overhead. A code evaluation platform that runs 10,000 student submissions per hour benefits from 20ms launch times. An AI agent loop that executes code 20 times per session benefits from no daemon dependency. If you are already running on Linux and have engineers comfortable with namespace and seccomp configuration, nsjail gives you fine-grained control that no higher-level tool matches.
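The throughput argument is simple arithmetic, sketched below in Python. The startup figures are the ones quoted in this article's comparison table, not measurements, and `startup_overhead_seconds` is a hypothetical helper.

```python
def startup_overhead_seconds(launches_per_hour: int, startup_ms: float) -> float:
    """Total time per hour spent only on starting sandboxes."""
    return launches_per_hour * startup_ms / 1000.0


if __name__ == "__main__":
    launches = 10_000
    for label, startup_ms in [("20 ms", 20), ("500 ms", 500), ("2 s", 2000)]:
        seconds = startup_overhead_seconds(launches, startup_ms)
        print(f"{label:>6} startup: {seconds:,.0f} s of launch overhead per hour")
```

At 10,000 launches per hour, 20 ms startups add up to about 200 seconds of overhead, while 500 ms to 2 s startups add up to 5,000 to 20,000 seconds, which is more than one core can absorb in an hour at the upper end.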
When to Choose Docker
Docker is the right tool when you need reproducible environments with dependency management. If your sandboxed code requires specific system libraries, language runtimes, or complex dependency trees, Docker images solve that cleanly. Docker also provides a well-understood ecosystem: registries, CI/CD integration, orchestration (Kubernetes). The tradeoff is startup time, daemon dependency, and a larger attack surface (Docker's default seccomp profile blocks only a few dozen syscalls, so most of the syscall set stays reachable unless you supply a stricter profile).
When to Choose gVisor
gVisor is the right tool when you need the strongest isolation and can tolerate the performance cost. Its userspace kernel (Sentry) intercepts every syscall before it reaches the host kernel, reducing the kernel attack surface to roughly 20 host syscalls. This matters when running truly adversarial code. The cost is 5-20% CPU overhead and higher memory usage. gVisor is used by Google Cloud Run and GKE Sandbox.
Production Deployment Patterns
Running nsjail in production requires solving problems that do not appear in local testing: process lifecycle management, log collection, concurrent execution, and failure handling.
Pattern 1: nsjail Inside Docker
The most common production pattern is running nsjail inside a Docker container. Docker provides the base environment (system libraries, language runtimes, nsjail binary). nsjail provides per-execution isolation within that container. This gives you Docker's dependency management and image distribution with nsjail's sub-20ms per-process sandboxing.
Dockerfile: nsjail execution environment
FROM ubuntu:24.04
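# Install nsjail and language runtimes (assumes an nsjail package is available
# from your apt sources; otherwise build nsjail from source in an earlier stage)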
RUN apt-get update && apt-get install -y \
nsjail \
python3 python3-pip \
nodejs npm \
&& rm -rf /var/lib/apt/lists/*
# Copy nsjail configuration
COPY nsjail-configs/ /etc/nsjail/
# Create sandbox workspace
RUN mkdir -p /sandbox/workspace && \
chown 65534:65534 /sandbox/workspace
# The execution wrapper handles:
# - Writing user code to /sandbox/workspace
# - Invoking nsjail with the right config
# - Collecting stdout/stderr/exit code
COPY executor /usr/local/bin/executor
EXPOSE 8080
CMD ["executor", "--listen", ":8080"]
Pattern 2: Windmill-Style Worker Pool
Windmill's architecture demonstrates a production-grade pattern: a pool of worker processes, each running inside its own container, with nsjail providing per-job isolation. The worker receives a job from the queue, writes the code to a temporary directory, invokes nsjail with the appropriate config, collects output, and cleans up. nsjail's ONCE mode maps directly to this execute-and-discard pattern.
Worker loop: nsjail per-job isolation (pseudocode)
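import subprocess
import tempfile
from pathlib import Path


def execute_sandboxed(code: str, language: str, timeout: int = 30) -> dict:
    """Execute untrusted code inside an nsjail sandbox.

    The config file is selected by language; this sketch always launches
    python3, so other runtimes need their own interpreter path.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # Write user code to a per-job workspace
        code_path = Path(workdir) / "main.py"
        code_path.write_text(code)

        command = [
            "nsjail",
            "--config", f"/etc/nsjail/{language}-sandbox.cfg",
            "--bindmount_ro", f"{code_path}:/app/main.py",
            "--time_limit", str(timeout),
            "--", "/usr/bin/python3", "/app/main.py",
        ]
        try:
            # Execute inside nsjail, with a few extra seconds for setup
            result = subprocess.run(
                command,
                capture_output=True,
                text=True,
                timeout=timeout + 5,
            )
        except subprocess.TimeoutExpired:
            # nsjail's own time limit should fire first; this is the backstop
            return {"stdout": "", "stderr": "timed out", "exit_code": None}

    return {
        "stdout": result.stdout,
        "stderr": result.stderr,
        "exit_code": result.returncode,
    }


def worker_loop(queue) -> None:
    """Process jobs forever.

    `queue` is a hypothetical client exposing dequeue() and complete().
    """
    while True:
        job = queue.dequeue()
        result = execute_sandboxed(job.code, job.language)
        queue.complete(job.id, result)
Pattern 3: TCP Listener for Network Services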
nsjail's LISTEN_TCP mode binds to a port and forks a sandboxed process per connection. This is how Google hosts CTF challenges: each contestant connects to a port, gets a fresh sandbox, and their session is fully isolated from other contestants. The same pattern works for sandboxed REPL services or interactive code execution endpoints.
nsjail TCP listener config
name: "repl-service"
mode: LISTEN
port: 9999
max_conns: 100 # Max concurrent sandboxes
max_conns_per_ip: 5 # Per-IP limit
time_limit: 60 # 60 second session timeout
cgroup_mem_max: 268435456 # 256 MB per session
clone_newuser: true
clone_newnet: true
clone_newns: true
clone_newpid: true
mount {
src: "/usr"
dst: "/usr"
is_bind: true
rw: false
}
mount {
dst: "/tmp"
fstype: "tmpfs"
rw: true
}
exec_bin {
path: "/usr/bin/python3"
arg: "-i" # Interactive mode
}
Limitations and Tradeoffs
nsjail is not a universal sandbox. Understanding its limitations is necessary before choosing it for production.
Linux-only
nsjail depends on Linux kernel features (namespaces, cgroups, seccomp-bpf). It does not work on macOS or Windows. Development and CI on non-Linux platforms require a Linux VM or Docker container.
Shared kernel attack surface
Unlike gVisor or Firecracker, nsjail runs sandboxed processes against the real kernel. A kernel exploit in an allowed syscall can escape the sandbox. Seccomp-BPF reduces the attack surface but does not eliminate it.
No built-in image management
nsjail does not have an image format or registry. You manage the root filesystem yourself through bind mounts and tmpfs. For complex dependency trees, this means either pre-installing packages on the host or using nsjail inside Docker.
Configuration complexity
A production nsjail config requires understanding namespaces, mount semantics, UID mappings, seccomp policies, and cgroup parameters. Getting the seccomp policy right for a given language runtime is iterative: run, hit a blocked syscall, allow it, repeat.
Seccomp Policy Iteration
The hardest part of deploying nsjail is getting the seccomp policy right. Python, Node.js, Go, and Rust each use different sets of syscalls. A policy tuned for one runtime often blocks another: the Go runtime creates OS threads with clone and calls sigaltstack during startup, while Node.js drives its event loop through the epoll family (epoll_create1, epoll_ctl, epoll_wait). You end up maintaining per-language seccomp policies, each developed through trial and error.
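One way to keep per-language policies manageable is to generate them from a shared base plus per-runtime additions. The Python sketch below renders policy text in the POLICY / ALLOW / USE shape used in this article. The syscall lists are deliberately short, illustrative assumptions rather than vetted allowlists, and `render_policy` is a hypothetical helper.

```python
# Shared by every runtime in this sketch (illustrative, not exhaustive).
BASE_SYSCALLS = [
    "read", "write", "close", "mmap", "munmap", "brk",
    "exit_group", "rt_sigaction", "rt_sigreturn", "futex",
]

# Per-runtime additions; real policies need far more entries than these.
RUNTIME_EXTRAS = {
    "python": ["openat", "fstat", "lseek", "getdents64", "ioctl"],
    "go": ["clone", "sigaltstack", "epoll_create1", "epoll_ctl", "epoll_pwait"],
    "node": ["clone", "epoll_create1", "epoll_ctl", "epoll_wait", "eventfd2"],
}


def render_policy(runtime: str) -> str:
    """Render a policy named <runtime>_sandbox with a KILL default."""
    if runtime not in RUNTIME_EXTRAS:
        raise ValueError(f"no syscall list for runtime {runtime!r}")
    name = f"{runtime}_sandbox"
    syscalls = sorted(set(BASE_SYSCALLS) | set(RUNTIME_EXTRAS[runtime]))
    body = ",\n    ".join(syscalls)
    return (
        f"POLICY {name} {{\n"
        f"  ALLOW {{\n"
        f"    {body}\n"
        f"  }}\n"
        f"}}\n"
        f"USE {name} DEFAULT KILL\n"
    )


if __name__ == "__main__":
    print(render_policy("go"))
```

Keeping the base list in one place means a syscall discovered during testing of one runtime can be promoted to all of them with a single edit.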
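nsjail provides a --seccomp_log flag, which sets the kernel's SECCOMP_FILTER_FLAG_LOG so that every filter action other than an allow is recorded in the kernel's audit log. The policy is still enforced, so a blocked syscall is logged and the process is then killed as usual. This is essential for policy development: run your workload with logging enabled, see which syscall triggered the action, add it to the policy, and repeat.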
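Reading the log by hand gets tedious after a few iterations. The Python sketch below tallies syscall numbers from kernel seccomp audit records, assuming they contain a `syscall=NNN` field as such records typically do. The sample lines are made up for illustration, the exact record format varies by kernel and audit setup, and the number-to-name table covers only a few x86_64 entries.

```python
import re
from collections import Counter

SYSCALL_FIELD = re.compile(r"\bsyscall=(\d+)\b")

# A few x86_64 syscall numbers; extend from your architecture's syscall table.
X86_64_NAMES = {
    0: "read", 1: "write", 2: "open", 3: "close",
    9: "mmap", 56: "clone", 59: "execve", 257: "openat",
}


def blocked_syscalls(log_text: str) -> Counter:
    """Count syscalls named in seccomp audit records (type 1326 / SECCOMP)."""
    counts = Counter()
    for line in log_text.splitlines():
        if "type=1326" not in line and "type=SECCOMP" not in line:
            continue
        match = SYSCALL_FIELD.search(line)
        if match:
            number = int(match.group(1))
            counts[X86_64_NAMES.get(number, f"syscall_{number}")] += 1
    return counts


# Illustrative records only; real entries carry more fields.
SAMPLE_LOG = """\
audit: type=1326 audit(1700000000.100:10): pid=4242 comm="python3" sig=31 arch=c000003e syscall=257 compat=0
audit: type=1326 audit(1700000000.200:11): pid=4243 comm="python3" sig=31 arch=c000003e syscall=56 compat=0
audit: type=1326 audit(1700000000.300:12): pid=4244 comm="python3" sig=31 arch=c000003e syscall=257 compat=0
kernel: unrelated message mentioning syscall=1
"""

if __name__ == "__main__":
    for name, count in blocked_syscalls(SAMPLE_LOG).most_common():
        print(f"{name}: {count}")
```

Feeding the output back into the policy one syscall at a time, and re-running the workload, is the loop the steps below describe.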
Developing a seccomp policy iteratively
# Step 1: Run with seccomp logging enabled (the policy is still enforced)
nsjail --config sandbox.cfg \
  --seccomp_log \
  -- /usr/bin/python3 -c "import numpy; print(numpy.__version__)"
# Step 2: Check the kernel log (dmesg or the audit log) for seccomp entries.
# Each entry names the offending syscall number, e.g. 257 (openat on x86_64)
# Step 3: Add the missing syscalls to your Kafel policy
# Step 4: Test again without the logging flag
nsjail --config sandbox.cfg \
  -- /usr/bin/python3 -c "import numpy; print(numpy.__version__)"
# Repeat until the workload runs clean
Managed Alternatives: When nsjail Is Not Worth the Effort
nsjail gives you full control over process isolation. That control comes with cost: you write and maintain seccomp policies, manage the host filesystem, handle process lifecycle, build monitoring, and operate the infrastructure. For many teams, this is not the right tradeoff.
If sandboxing is not your core product (for example, you are building an AI coding tool and need code execution as a feature), the engineering time spent on nsjail configuration and operations is time not spent on your actual product. Managed sandbox APIs exist specifically for this case.
| | Self-Hosted nsjail | Managed Sandbox API |
|---|---|---|
| Setup time | Days to weeks (config, testing, infra) | Minutes (install SDK, get API key) |
| Seccomp policy | You write and maintain per-language policies | Pre-configured, tested across workloads |
| Dependency management | Manual (host filesystem or Docker) | Built-in (templates, package managers) |
| Scaling | You manage worker pools, load balancing | Automatic (provider infrastructure) |
| Multi-language | Separate config per language runtime | Pre-built templates for common languages |
| Filesystem persistence | You implement session state management | Session-scoped by default |
| Cost model | Infrastructure cost + engineering time | Per-second or bundled pricing |
| Control | Full (every kernel knob exposed) | Limited (provider's isolation model) |
Morph Sandbox SDK: equivalent of 200 lines of nsjail config
import { MorphSandbox } from "@anthropic-ai/morph-sandbox";
// Create an isolated sandbox (sub-300ms cold start)
const sandbox = await MorphSandbox.create({
apiKey: process.env.MORPH_API_KEY,
template: "python-3.12",
timeout: 300,
});
// Write untrusted code
await sandbox.filesystem.write("/app/main.py", untrustedCode);
// Execute with full isolation
const result = await sandbox.exec("cd /app && python main.py");
console.log(result.stdout);
console.log(result.exitCode);
// Filesystem persists between calls within the session
await sandbox.exec("pip install numpy pandas");
const analysis = await sandbox.exec("python /app/analyze.py");
await sandbox.destroy();
When to self-host nsjail
Self-hosted nsjail makes sense in three cases: (1) you need sub-20ms process launch and cannot tolerate network latency to a remote sandbox API, (2) you run on air-gapped or regulated infrastructure where external API calls are prohibited, or (3) sandboxing is your product and you need full control over the isolation stack. For everyone else, a managed API saves weeks of engineering time.
Frequently Asked Questions
What is nsjail?
nsjail is a lightweight, open-source process isolation tool developed at Google. It uses Linux namespaces (PID, mount, network, user, UTS, IPC, cgroup), cgroups for resource limits, and seccomp-bpf for syscall filtering. It sandboxes a single process or process tree without a daemon or container runtime and, when user namespaces are available, without root privileges.
How does nsjail differ from Docker?
nsjail isolates individual processes with sub-20ms startup and no daemon. Docker isolates application stacks with 500ms-2s startup and requires dockerd. nsjail uses protobuf configs and Kafel seccomp policies. Docker uses Dockerfiles. nsjail is purpose-built for sandboxing untrusted code. Docker is a general-purpose container platform. The most common production pattern combines both: Docker for environment management, nsjail for per-execution isolation inside the container.
Can I use nsjail for AI agent code execution?
Yes. Windmill uses nsjail in production for sandboxing workflow executions, including AI agent workloads. The sub-20ms startup and zero-daemon design make it efficient for agent loops that execute code many times per session. The tradeoff is operational: you maintain seccomp policies, filesystem configuration, and infrastructure yourself.
What is the Kafel language?
Kafel is a domain-specific language for defining seccomp-bpf policies. It compiles human-readable rules into BPF bytecode. Instead of writing raw BPF instructions, you define policies like ALLOW { read, write, open } DEFAULT KILL. Kafel supports named policies, argument-level filtering, and composition.
Does nsjail require root?
Not always. With user namespaces enabled (clone_newuser: true), nsjail can run as an unprivileged user. Some features like network namespace creation may require CAP_SYS_ADMIN or sysctl adjustments. Many production deployments run nsjail inside Docker containers that provide the necessary capabilities.
How does nsjail compare to gVisor?
nsjail filters syscalls at the kernel boundary. Allowed syscalls execute on the real kernel with near-zero overhead. gVisor intercepts all syscalls in a userspace kernel (Sentry), adding 5-20% CPU overhead but reducing the host kernel attack surface to roughly 20 syscalls. Choose nsjail for throughput, gVisor for defense-in-depth against kernel exploits.
What are managed alternatives to building with nsjail?
Managed sandbox APIs like Morph Sandbox SDK, E2B, and Modal handle isolation, scaling, and lifecycle management for you. Morph Sandbox SDK provides sub-300ms cold starts with session-scoped filesystem persistence, bundled with Morph API plans. These are the better choice when sandboxing is a feature of your product, not the product itself.