Ollama makes local LLM inference simple. Docker Compose makes it reproducible. This guide gives you copy-paste configs for every stage: a minimal dev setup, GPU acceleration, health checks, reverse proxy, Open WebUI, and multi-instance production scaling.
Minimal Development Setup
This is the starting point. One service, one volume, one port. Models persist between restarts. Everything else is optional until you need it.
docker-compose.yml — minimal
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
volumes:
ollama_data:
Start Ollama and pull a model
# Start the container
docker compose up -d
# Pull a model (downloads once, persists on the volume)
docker exec ollama ollama pull llama3.2
# Test the API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?",
"stream": false
}'
Why OLLAMA_HOST=0.0.0.0
By default, Ollama binds to 127.0.0.1, which means only processes inside the container can reach it. Setting OLLAMA_HOST=0.0.0.0 lets the port mapping work so your host machine (and other containers) can connect.
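To confirm the bind is working, hit the version endpoint from the host; this sketch assumes the minimal compose file above is running:
# From the host: Docker's port mapping forwards this to the container
curl http://localhost:11434/api/version
# Inside the container: confirm the variable made it into the environment
docker exec ollama printenv OLLAMA_HOST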
GPU Passthrough with NVIDIA
Without GPU access, Ollama falls back to CPU inference. For a 7B parameter model that means seconds per token instead of tens of tokens per second. The difference is not subtle.
Two prerequisites on the host: working NVIDIA drivers and the NVIDIA Container Toolkit. Docker Compose handles the rest.
Host prerequisites (run once)
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker-compose.yml — with NVIDIA GPU
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
ollama_data:
Verify GPU inference is working
# Pull a model and check processor allocation
docker exec ollama ollama pull llama3.2
docker exec ollama ollama run llama3.2 "hello" --verbose
# The 'ollama ps' command shows where the model loaded
docker exec ollama ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3.2 ... 4.7 GB 100% GPU 4 minutes from now
Use deploy.resources, not runtime: nvidia
The older runtime: nvidia syntax causes "unknown or invalid runtime name: nvidia" errors on newer Docker versions. The deploy.resources.reservations.devices block is the supported method.
AMD GPUs and Vulkan
For AMD GPUs, use the ollama/ollama:rocm image with --device /dev/kfd --device /dev/dri passed through. Vulkan support is bundled in the standard image and enabled with OLLAMA_VULKAN=1.
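A minimal sketch of the ROCm variant for Docker Compose, assuming working AMD drivers on the host (device paths and permissions vary by distribution):
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    environment:
      - OLLAMA_HOST=0.0.0.0
volumes:
  ollama_data: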
Environment Variables Reference
Ollama's behavior is configured entirely through environment variables. These are the ones that matter for Docker deployments.
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address. Set to 0.0.0.0 in Docker. |
| OLLAMA_MODELS | /root/.ollama/models | Model storage directory. Override for custom bind mounts. |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per loaded model. RAM scales linearly. |
| OLLAMA_MAX_LOADED_MODELS | 3 (CPU) or 3x GPUs | Models kept hot in memory simultaneously. |
| OLLAMA_KEEP_ALIVE | 5m | How long a model stays loaded after last request. Use -1 for permanent. |
| OLLAMA_MAX_QUEUE | 512 | Max queued requests before rejecting with 429. |
| OLLAMA_CONTEXT_LENGTH | 4096 | Default context window in tokens. Overridable per request. |
| OLLAMA_FLASH_ATTENTION | 0 | Set to 1 to enable Flash Attention. Reduces memory usage. |
| OLLAMA_KV_CACHE_TYPE | f16 | KV cache quantization: f16, q8_0, or q4_0. |
| OLLAMA_ORIGINS | (empty) | Allowed CORS origins for browser-based clients. |
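Some of these have per-request counterparts in the API: the context window via options.num_ctx and the keep-alive window via keep_alive. A sketch (assumes llama3.2 is already pulled):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize the plot of Hamlet in one sentence.",
  "stream": false,
  "keep_alive": "10m",
  "options": { "num_ctx": 8192 }
}'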
docker-compose.yml — tuned for multi-model serving
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=3
- OLLAMA_KEEP_ALIVE=10m
- OLLAMA_FLASH_ATTENTION=1
- OLLAMA_KV_CACHE_TYPE=q8_0
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
ollama_data:
Memory math
KV-cache memory scales with OLLAMA_NUM_PARALLEL * context length. Setting OLLAMA_NUM_PARALLEL=4 with a 7B model at 4096 context quadruples the context allocation on top of the model weights, not the whole footprint, but at long contexts the cache can rival the weights in size. Measure before committing to a number.
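As a rough worked example (illustrative numbers, not measurements):
# 7B model at q4 quantization: ~4-5 GB of weights
# f16 KV cache for one 4096-token slot: ~0.5-2 GB depending on architecture
# OLLAMA_NUM_PARALLEL=4 -> 4 slots -> ~2-8 GB of cache on top of the weights
# OLLAMA_KV_CACHE_TYPE=q8_0 roughly halves the cache figure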
Health Checks and Persistence
Two things that look optional until they bite you: health checks and persistent storage. Without health checks, dependent services start before Ollama is ready. Without persistent storage, every docker compose down && docker compose up triggers a full re-download of every model.
docker-compose.yml — with health check and bind mount
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "127.0.0.1:11434:11434"
volumes:
- ./ollama-models:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
healthcheck:
test: ["CMD-SHELL", "ollama list || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
Named volumes vs bind mounts
Named volumes (ollama_data:/root/.ollama) are simplest and Docker manages them. Bind mounts (./ollama-models:/root/.ollama) let you point to a specific disk, which matters when models are 4-70 GB each.
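If models should live on a dedicated disk, point the bind mount there directly; a sketch, where /mnt/models is an assumed path:
services:
  ollama:
    image: ollama/ollama:latest
    volumes:
      - /mnt/models/ollama:/root/.ollama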
Health check without curl
The Ollama Docker image does not ship with curl. Use 'ollama list' as the health check command instead, or build a custom image that adds curl and check /api/tags directly.
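A sketch of the custom-image route, assuming the ollama/ollama base image is Ubuntu-based with apt available:
# Dockerfile
FROM ollama/ollama:latest
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
With curl present, the health check test becomes ["CMD", "curl", "-f", "http://localhost:11434/api/tags"].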
start_period matters
Model loading takes 5-30 seconds depending on size and storage speed. Set start_period to at least 60 seconds so Docker does not restart the container during initial model load.
Pin image tags
Use ollama/ollama:0.6.2 instead of :latest in production. Pin the tag in a .env file. Rollback means reverting the tag. Model data on the volume is unaffected.
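One way to wire that up (a sketch; OLLAMA_TAG is an arbitrary variable name):
# .env
OLLAMA_TAG=0.6.2
# docker-compose.yml (excerpt)
services:
  ollama:
    image: ollama/ollama:${OLLAMA_TAG:-latest}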
Production Stack with Nginx
Never expose port 11434 directly to the network. Ollama has no built-in authentication, rate limiting, or TLS. Nginx handles all three and adds proper timeout configuration for LLM streaming, which is the part most guides skip.
docker-compose.yml — production with Nginx
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
- OLLAMA_NUM_PARALLEL=4
- OLLAMA_MAX_LOADED_MODELS=2
- OLLAMA_KEEP_ALIVE=10m
- OLLAMA_FLASH_ATTENTION=1
healthcheck:
test: ["CMD-SHELL", "ollama list || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
networks:
- internal
nginx:
image: nginx:alpine
container_name: ollama-proxy
restart: unless-stopped
ports:
- "443:443"
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- ./certs:/etc/nginx/certs:ro
- ./.htpasswd:/etc/nginx/.htpasswd:ro
depends_on:
ollama:
condition: service_healthy
networks:
- internal
volumes:
ollama_data:
networks:
internal:
driver: bridge
nginx.conf — streaming-aware reverse proxy
events {
worker_connections 1024;
}
http {
upstream ollama {
server ollama:11434;
}
server {
listen 80;
server_name your-domain.com;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl;
server_name your-domain.com;
ssl_certificate /etc/nginx/certs/fullchain.pem;
ssl_certificate_key /etc/nginx/certs/privkey.pem;
# Basic auth
auth_basic "Ollama API";
auth_basic_user_file /etc/nginx/.htpasswd;
location / {
proxy_pass http://ollama;
proxy_http_version 1.1;
# Streaming support
proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding on;
# LLM responses can take minutes for long generations
proxy_read_timeout 300s;
proxy_connect_timeout 10s;
proxy_send_timeout 300s;
# Headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
}
proxy_read_timeout is the critical setting
Nginx defaults to a 60-second read timeout. A long LLM generation (high token count, large model) easily exceeds that. Set proxy_read_timeout to at least 300 seconds or your streaming responses will be cut mid-generation with no error message to the client.
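The auth_basic_user_file above has to exist and be mounted into the proxy container before Nginx starts. One way to generate it (a sketch; the httpd image ships htpasswd, apache2-utils on the host works too):
# Create .htpasswd next to docker-compose.yml; it is mounted read-only into Nginx
docker run --rm httpd:alpine htpasswd -nbB admin 'change-me' > .htpasswd
Rate limiting, if you need it, is a limit_req_zone directive in the http block plus limit_req in the location block.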
Adding Open WebUI
Open WebUI gives you a ChatGPT-style interface for any model Ollama can run. It connects to Ollama over Docker's internal network using the service name as hostname.
docker-compose.yml — Ollama + Open WebUI
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "127.0.0.1:11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
healthcheck:
test: ["CMD-SHELL", "ollama list || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
volumes:
- webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
ollama:
condition: service_healthy
volumes:
ollama_data:
webui_data:
After docker compose up -d, Open WebUI is at http://localhost:3000. The first user to register becomes admin. Pull a model through the Ollama container and it appears in the Open WebUI model selector immediately.
http://ollama:11434, not http://localhost:11434
Inside Docker Compose, services reach each other by service name over the internal bridge network. localhost inside the Open WebUI container points to the Open WebUI container itself, not the Ollama container. This is the most common misconfiguration.
Multi-Instance Scaling
Ollama handles at most OLLAMA_NUM_PARALLEL requests per loaded model at a time, so a single instance under concurrent load becomes a bottleneck. The fix is multiple instances, each with explicit GPU assignment and a shared model volume.
Use named services, not deploy: replicas. Named services give you per-instance GPU control and independent health monitoring.
docker-compose.yml — multi-instance with Nginx load balancing
services:
ollama-1:
image: ollama/ollama:latest
container_name: ollama-1
restart: unless-stopped
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
- OLLAMA_NUM_PARALLEL=2
- CUDA_VISIBLE_DEVICES=0
healthcheck:
test: ["CMD-SHELL", "ollama list || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0"]
capabilities: [gpu]
networks:
- internal
ollama-2:
image: ollama/ollama:latest
container_name: ollama-2
restart: unless-stopped
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_HOST=0.0.0.0
- OLLAMA_NUM_PARALLEL=2
- CUDA_VISIBLE_DEVICES=1
healthcheck:
test: ["CMD-SHELL", "ollama list || exit 1"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["1"]
capabilities: [gpu]
networks:
- internal
nginx:
image: nginx:alpine
container_name: ollama-lb
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ./nginx-lb.conf:/etc/nginx/nginx.conf:ro
depends_on:
ollama-1:
condition: service_healthy
ollama-2:
condition: service_healthy
networks:
- internal
volumes:
ollama_data:
networks:
internal:
driver: bridge
nginx-lb.conf — least-connections load balancing
events {
worker_connections 1024;
}
http {
upstream ollama_cluster {
least_conn;
server ollama-1:11434;
server ollama-2:11434;
}
server {
listen 11434;
location / {
proxy_pass http://ollama_cluster;
proxy_http_version 1.1;
proxy_buffering off;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
}
}
Why least_conn instead of round-robin
LLM request durations vary wildly: 1 second for a short completion, 2 minutes for a long generation. Round-robin would stack requests on a busy instance while another sits idle. least_conn routes to whichever instance has the fewest active connections.
Development vs Production
| Dimension | Development | Production |
|---|---|---|
| Port binding | "11434:11434" (all interfaces) | "127.0.0.1:11434:11434" or no host port |
| Image tag | ollama/ollama:latest | ollama/ollama:0.6.2 (pinned in .env) |
| Reverse proxy | None needed | Nginx with TLS, auth, rate limiting |
| Health check | Optional | Required. Dependent services use condition: service_healthy |
| GPU reservation | count: all | Explicit device_ids per instance |
| OLLAMA_NUM_PARALLEL | 1 (default) | 2-8 depending on VRAM and model size |
| OLLAMA_KEEP_ALIVE | 5m (default) | 10m-30m or -1 for always-loaded |
| Storage | Named volume | Bind mount to dedicated disk |
| Scaling | Single instance | Multiple named instances with load balancer |
| Monitoring | docker logs | Prometheus metrics + Grafana dashboards |
When Docker Compose Stops Being Enough
Docker Compose works for single-machine deployments. When you need multi-node GPU clusters, automatic scaling based on queue depth, or zero-downtime model updates across a fleet, you are looking at Kubernetes with the NVIDIA GPU Operator.
Ollama is also not optimized for maximum throughput at high concurrency. It prioritizes ease of use. For consumer-facing products that need sub-200ms time-to-first-token at hundreds of concurrent connections, vLLM or Hugging Face Text Generation Inference (TGI) are better runtime choices, though they require significantly more operational work.
For most teams, the practical boundary is simpler: Docker Compose works until you need more GPUs than fit in one machine.
Local dev with Docker Compose, production agents with Morph
Docker Compose is the right tool for running Ollama locally during development. For production agent execution, where you need sandboxed environments, code execution, and model inference at scale without managing GPU infrastructure, Morph Sandbox handles the infrastructure so your team focuses on the agent logic. The split is natural: Docker Compose for local model experimentation, Morph for production workloads.
FAQ
Do I need Docker Compose, or can I just use docker run?
docker run works for a single Ollama container. Docker Compose becomes valuable when you add a second service (Open WebUI, Nginx, monitoring) or want reproducible configuration that lives in version control.
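For reference, a docker run roughly equivalent to the minimal compose file (add --gpus=all for NVIDIA acceleration):
docker run -d --name ollama \
  --restart unless-stopped \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0 \
  ollama/ollama:latest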
Can Ollama use my GPU inside Docker on macOS?
No. Docker Desktop on macOS runs containers in a Linux VM without GPU passthrough. GPU acceleration in Docker is available on Linux natively and on Windows via WSL2. On macOS, install Ollama directly to use the Apple Silicon GPU.
How much disk space do Ollama models need?
It depends on the model and quantization. Llama 3.2 3B is about 2 GB. Llama 3.1 70B at q4 quantization is about 40 GB. Plan your volume mount location accordingly, especially if your Docker root is on a small boot drive.
How do I pre-pull models so the container starts ready?
Build a custom Dockerfile that runs ollama pull during the image build, or use an init container / entrypoint script that pulls models on first start. The simplest approach: docker exec ollama ollama pull llama3.2 after the container is healthy, scripted in a post-deploy hook.
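A sketch of the entrypoint-script approach (the script name and model list are assumptions):
#!/bin/sh
# entrypoint.sh — start the server, pull models once it answers, then keep serving
ollama serve &
pid=$!
until ollama list >/dev/null 2>&1; do
  sleep 1
done
ollama pull llama3.2
wait $pid
In docker-compose.yml, bind-mount the script and set entrypoint: ["/bin/sh", "/entrypoint.sh"].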
Can two Ollama instances share the same model volume?
Yes. Model files are read-only at inference time. Multiple Ollama instances can share a single named volume for model storage, which avoids duplicating 10-70 GB model files per instance.
What is the difference between OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS?
OLLAMA_NUM_PARALLEL controls how many requests a single model processes concurrently. OLLAMA_MAX_LOADED_MODELS controls how many different models stay loaded in memory at once. The first scales request throughput for one model. The second controls multi-model availability.
Docker Compose handles local inference. What handles production agents?
Morph Sandbox gives your coding agents isolated execution environments with model inference, code execution, and file system access. No GPU infrastructure to manage.