Ollama Docker Compose: Production-Ready Local LLM Setup in One File

Complete docker-compose.yml files for Ollama with GPU passthrough, persistent storage, health checks, Open WebUI, Nginx reverse proxy, and multi-instance scaling. Copy-paste configs for development and production.

April 5, 2026 · 2 min read

Ollama makes local LLM inference simple. Docker Compose makes it reproducible. This guide gives you copy-paste configs for every stage: a minimal dev setup, GPU acceleration, health checks, reverse proxy, Open WebUI, and multi-instance production scaling.

At a glance: 1 file (docker-compose.yml) · :11434 default API port · NVIDIA GPU passthrough · zero extra dependencies (models download on first pull)

Minimal Development Setup

This is the starting point. One service, one volume, one port. Models persist between restarts. Everything else is optional until you need it.

docker-compose.yml — minimal

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0

volumes:
  ollama_data:

Start Ollama and pull a model

# Start the container
docker compose up -d

# Pull a model (downloads once, persists on the volume)
docker exec ollama ollama pull llama3.2

# Test the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Why OLLAMA_HOST=0.0.0.0

By default, Ollama binds to 127.0.0.1, which means only processes inside the container can reach it. Setting OLLAMA_HOST=0.0.0.0 lets the port mapping work so your host machine (and other containers) can connect.

GPU Passthrough with NVIDIA

Without GPU access, Ollama falls back to CPU inference. For a 7B parameter model that means seconds per token instead of tens of tokens per second. The difference is not subtle.

Two prerequisites on the host: working NVIDIA drivers and the NVIDIA Container Toolkit. Docker Compose handles the rest.

Host prerequisites (run once)

# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

docker-compose.yml — with NVIDIA GPU

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

Verify GPU inference is working

# Pull a model and check processor allocation
docker exec ollama ollama pull llama3.2
docker exec ollama ollama run llama3.2 "hello" --verbose

# The 'ollama ps' command shows where the model loaded
docker exec ollama ollama ps
# NAME        ID       SIZE   PROCESSOR    UNTIL
# llama3.2    ...      4.7 GB 100% GPU     4 minutes from now

Use deploy.resources, not runtime: nvidia

The older runtime: nvidia syntax causes "unknown or invalid runtime name: nvidia" errors on newer Docker versions. The deploy.resources.reservations.devices block is the supported method.

AMD GPUs and Vulkan

For AMD GPUs, use the ollama/ollama:rocm image with --device /dev/kfd --device /dev/dri passed through. Vulkan support is bundled in the standard image and enabled with OLLAMA_VULKAN=1.
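A compose translation of those flags, as a sketch (the device paths assume a standard Linux ROCm setup):

docker-compose.yml — AMD GPU via ROCm (sketch)

services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    devices:
      - /dev/kfd
      - /dev/dri

volumes:
  ollama_data: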

Environment Variables Reference

Ollama's behavior is configured entirely through environment variables. These are the ones that matter for Docker deployments.

Variable | Default | Purpose
OLLAMA_HOST | 127.0.0.1:11434 | Bind address. Set to 0.0.0.0 in Docker.
OLLAMA_MODELS | /root/.ollama/models | Model storage directory. Override for custom bind mounts.
OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per loaded model. RAM scales linearly.
OLLAMA_MAX_LOADED_MODELS | 3 (CPU) or 3 × GPU count | Models kept hot in memory simultaneously.
OLLAMA_KEEP_ALIVE | 5m | How long a model stays loaded after the last request. Use -1 to keep it loaded indefinitely.
OLLAMA_MAX_QUEUE | 512 | Max queued requests before rejecting new ones with a 503.
OLLAMA_CONTEXT_LENGTH | 4096 | Default context window in tokens. Overridable per request.
OLLAMA_FLASH_ATTENTION | 0 | Set to 1 to enable Flash Attention. Reduces memory usage.
OLLAMA_KV_CACHE_TYPE | f16 | KV cache quantization: f16, q8_0, or q4_0.
OLLAMA_ORIGINS | localhost only | Extra allowed CORS origins for browser-based clients.

docker-compose.yml — tuned for multi-model serving

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=3
      - OLLAMA_KEEP_ALIVE=10m
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KV_CACHE_TYPE=q8_0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

Memory math

KV-cache memory scales with OLLAMA_NUM_PARALLEL × context length. Setting OLLAMA_NUM_PARALLEL=4 with a 7B model at a 4096-token context roughly quadruples the KV-cache allocation on top of the model weights. Measure before committing to a number.
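A rough worked example using the standard per-token KV-cache formula (figures are approximate and assume a Llama-3-8B-class model: 32 layers, 8 KV heads, head dimension 128, f16 cache):

bytes per token ≈ 2 (K and V) × 32 layers × 8 KV heads × 128 head dim × 2 bytes ≈ 131 kB
4096-token context ≈ 0.5 GB of KV cache per parallel slot
OLLAMA_NUM_PARALLEL=4 ≈ 2 GB of KV cache, on top of the model weights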

Health Checks and Persistence

Two things that look optional until they bite you: health checks and persistent storage. Without health checks, dependent services start before Ollama is ready. Without persistent storage, every docker compose down && docker compose up triggers a full re-download of every model.

docker-compose.yml — with health check and bind mount

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ./ollama-models:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    healthcheck:
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Named volumes vs bind mounts

Named volumes (ollama_data:/root/.ollama) are simplest and Docker manages them. Bind mounts (./ollama-models:/root/.ollama) let you point to a specific disk, which matters when models are 4-70 GB each.

Health check without curl

The Ollama Docker image does not ship with curl. Use 'ollama list' as the health check command instead, or build a custom image that adds curl and check /api/tags directly.
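If you would rather check /api/tags over HTTP, a minimal custom image does it; a sketch, assuming the official image stays Debian/Ubuntu-based:

Dockerfile — Ollama with curl for HTTP health checks (sketch)

FROM ollama/ollama:latest
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

With that image, swap the health check test for the following (interval, timeout, retries, and start_period stay the same):

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]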

start_period matters

Server startup, and any model pre-loading you script into it, can take 5-30 seconds or more depending on size and storage speed. start_period: 60s means failed probes during that window do not count toward retries, so the container is not flagged unhealthy, and services waiting on condition: service_healthy are not blocked, while Ollama is still coming up.

Pin image tags

Use ollama/ollama:0.6.2 instead of :latest in production. Pin the tag in a .env file. Rollback means reverting the tag. Model data on the volume is unaffected.
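One way to wire the pin up (the OLLAMA_TAG variable name is illustrative):

.env

OLLAMA_TAG=0.6.2

docker-compose.yml excerpt

services:
  ollama:
    image: ollama/ollama:${OLLAMA_TAG:-latest}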

Production Stack with Nginx

Never expose port 11434 directly to the network. Ollama has no built-in authentication, rate limiting, or TLS. Nginx handles all three and adds proper timeout configuration for LLM streaming, which is the part most guides skip.

docker-compose.yml — production with Nginx

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=2
      - OLLAMA_KEEP_ALIVE=10m
      - OLLAMA_FLASH_ATTENTION=1
    healthcheck:
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - internal

  nginx:
    image: nginx:alpine
    container_name: ollama-proxy
    restart: unless-stopped
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./certs:/etc/nginx/certs:ro
      - ./.htpasswd:/etc/nginx/.htpasswd:ro
    depends_on:
      ollama:
        condition: service_healthy
    networks:
      - internal

volumes:
  ollama_data:

networks:
  internal:
    driver: bridge

nginx.conf — streaming-aware reverse proxy

events {
    worker_connections 1024;
}

http {
    upstream ollama {
        server ollama:11434;
    }

    server {
        listen 80;
        server_name your-domain.com;
        return 301 https://$host$request_uri;
    }

    server {
        listen 443 ssl;
        server_name your-domain.com;

        ssl_certificate     /etc/nginx/certs/fullchain.pem;
        ssl_certificate_key /etc/nginx/certs/privkey.pem;

        # Basic auth
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;

        location / {
            proxy_pass http://ollama;
            proxy_http_version 1.1;

            # Streaming support
            proxy_buffering off;
            proxy_cache off;
            chunked_transfer_encoding on;

            # LLM responses can take minutes for long generations
            proxy_read_timeout 300s;
            proxy_connect_timeout 10s;
            proxy_send_timeout 300s;

            # Headers
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}

proxy_read_timeout is the critical setting

Nginx's proxy_read_timeout defaults to 60 seconds, the longest it will wait between reads from the upstream. With "stream": false, Ollama sends nothing until the whole generation finishes, so a long generation (high token count, large model) easily exceeds that; even streamed responses can stall while a model loads. Set proxy_read_timeout to at least 300 seconds or responses get cut off mid-generation with no useful error for the client.
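The config above covers TLS and basic auth; rate limiting is a small addition, and the .htpasswd file referenced by auth_basic_user_file has to exist before Nginx starts. A sketch (the zone name, rate, and credentials are illustrative):

nginx.conf excerpt — per-client rate limiting (sketch)

http {
    limit_req_zone $binary_remote_addr zone=ollama_api:10m rate=10r/s;
    ...
    server {
        location / {
            limit_req zone=ollama_api burst=20 nodelay;
            # ... existing proxy settings ...
        }
    }
}

Create the basic-auth file

# Any image that ships htpasswd works; this writes ./.htpasswd next to the compose file
docker run --rm httpd:alpine htpasswd -Bbn apiuser 'change-me' > .htpasswd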

Adding Open WebUI

Open WebUI gives you a ChatGPT-style interface for any model Ollama can run. It connects to Ollama over Docker's internal network using the service name as hostname.

docker-compose.yml — Ollama + Open WebUI

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    healthcheck:
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      ollama:
        condition: service_healthy

volumes:
  ollama_data:
  webui_data:

After docker compose up -d, Open WebUI is at http://localhost:3000. The first user to register becomes admin. Pull a model through the Ollama container and it appears in the Open WebUI model selector immediately.

http://ollama:11434, not http://localhost:11434

Inside Docker Compose, services reach each other by service name over the internal bridge network. localhost inside the Open WebUI container points to the Open WebUI container itself, not the Ollama container. This is the most common misconfiguration.

Multi-Instance Scaling

Ollama processes only a limited number of concurrent requests per loaded model (set by OLLAMA_NUM_PARALLEL), so a single instance under concurrent load becomes a bottleneck. The fix is multiple instances, each with explicit GPU assignment and a shared model volume.

Use named services, not deploy: replicas. Named services give you per-instance GPU control and independent health monitoring.

docker-compose.yml — multi-instance with Nginx load balancing

services:
  ollama-1:
    image: ollama/ollama:latest
    container_name: ollama-1
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=2
      # GPU selection is handled by device_ids below; CUDA_VISIBLE_DEVICES would
      # conflict, since the container only sees the GPU it was given (as index 0)
    healthcheck:
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    networks:
      - internal

  ollama-2:
    image: ollama/ollama:latest
    container_name: ollama-2
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_NUM_PARALLEL=2
      # As above, GPU selection comes from device_ids, not CUDA_VISIBLE_DEVICES
    healthcheck:
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    networks:
      - internal

  nginx:
    image: nginx:alpine
    container_name: ollama-lb
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./nginx-lb.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      ollama-1:
        condition: service_healthy
      ollama-2:
        condition: service_healthy
    networks:
      - internal

volumes:
  ollama_data:

networks:
  internal:
    driver: bridge

nginx-lb.conf — least-connections load balancing

events {
    worker_connections 1024;
}

http {
    upstream ollama_cluster {
        least_conn;
        server ollama-1:11434;
        server ollama-2:11434;
    }

    server {
        listen 11434;

        location / {
            proxy_pass http://ollama_cluster;
            proxy_http_version 1.1;
            proxy_buffering off;
            proxy_read_timeout 300s;
            proxy_send_timeout 300s;
        }
    }
}

Why least_conn instead of round-robin

LLM request durations vary wildly: 1 second for a short completion, 2 minutes for a long generation. Round-robin would stack requests on a busy instance while another sits idle. least_conn routes to whichever instance has the fewest active connections.

Development vs Production

Dimension | Development | Production
Port binding | "11434:11434" (all interfaces) | "127.0.0.1:11434:11434" or no host port
Image tag | ollama/ollama:latest | ollama/ollama:0.6.2 (pinned in .env)
Reverse proxy | None needed | Nginx with TLS, auth, rate limiting
Health check | Optional | Required; dependent services use condition: service_healthy
GPU reservation | count: all | Explicit device_ids per instance
OLLAMA_NUM_PARALLEL | 1 (default) | 2-8 depending on VRAM and model size
OLLAMA_KEEP_ALIVE | 5m (default) | 10m-30m or -1 for always-loaded
Storage | Named volume | Bind mount to dedicated disk
Scaling | Single instance | Multiple named instances with load balancer
Monitoring | docker logs | Prometheus metrics + Grafana dashboards

When Docker Compose Stops Being Enough

Docker Compose works for single-machine deployments. When you need multi-node GPU clusters, automatic scaling based on queue depth, or zero-downtime model updates across a fleet, you are looking at Kubernetes with the NVIDIA GPU Operator.

Ollama is also not optimized for maximum throughput at high concurrency. It prioritizes ease of use. For consumer-facing products that need sub-200ms time-to-first-token at hundreds of concurrent connections, vLLM or Hugging Face Text Generation Inference (TGI) are better runtime choices, though they require significantly more operational work.

For most teams, the practical boundary is simpler: Docker Compose works until you need more GPUs than fit in one machine.

Local dev with Docker Compose, production agents with Morph

Docker Compose is the right tool for running Ollama locally during development. For production agent execution, where you need sandboxed environments, code execution, and model inference at scale without managing GPU infrastructure, Morph Sandbox handles the infrastructure so your team focuses on the agent logic. The split is natural: Docker Compose for local model experimentation, Morph for production workloads.

FAQ

Do I need Docker Compose, or can I just use docker run?

docker run works for a single Ollama container. Docker Compose becomes valuable when you add a second service (Open WebUI, Nginx, monitoring) or want reproducible configuration that lives in version control.

Can Ollama use my GPU inside Docker on macOS?

No. Docker Desktop on macOS runs containers in a Linux VM without GPU passthrough. GPU acceleration in Docker is available on Linux natively and on Windows via WSL2. On macOS, install Ollama directly to use the Apple Silicon GPU.
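For reference, a native install on Apple Silicon (the Homebrew formula exists; downloading the desktop app from ollama.com also works):

# Install and start Ollama natively on macOS
brew install ollama
brew services start ollama   # or run 'ollama serve' in a separate terminal
ollama run llama3.2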

How much disk space do Ollama models need?

It depends on the model and quantization. Llama 3.2 3B is about 2 GB. Llama 3.1 70B at q4 quantization is about 40 GB. Plan your volume mount location accordingly, especially if your Docker root is on a small boot drive.
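To see what is actually on disk (assumes the volume and bind-mount names from the examples above):

# Per-model sizes as Ollama reports them
docker exec ollama ollama list

# Space used by the named volume (or the bind mount, if you used one)
docker system df -v | grep ollama_data
du -sh ./ollama-models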

How do I pre-pull models so the container starts ready?

Build a custom Dockerfile that runs ollama pull during the image build, or use an init container / entrypoint script that pulls models on first start. The simplest approach: docker exec ollama ollama pull llama3.2 after the container is healthy, scripted in a post-deploy hook.
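One pattern is a short-lived puller service that shares the model volume and exits when the pulls finish; a sketch (the model name and the sleep are illustrative):

docker-compose.yml excerpt — one-shot model puller (sketch)

services:
  model-puller:
    image: ollama/ollama:latest
    restart: "no"
    volumes:
      - ollama_data:/root/.ollama
    entrypoint: ["/bin/sh", "-c"]
    command:
      - |
        ollama serve &
        sleep 3
        ollama pull llama3.2

Run it on demand with docker compose run --rm model-puller, or keep it in the stack and gate other services on it with depends_on and condition: service_completed_successfully.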

Can two Ollama instances share the same model volume?

Yes. Model files are read-only at inference time. Multiple Ollama instances can share a single named volume for model storage, which avoids duplicating 10-70 GB model files per instance.

What is the difference between OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS?

OLLAMA_NUM_PARALLEL controls how many requests a single model processes concurrently. OLLAMA_MAX_LOADED_MODELS controls how many different models stay loaded in memory at once. The first scales request throughput for one model. The second controls multi-model availability.

Docker Compose handles local inference. What handles production agents?

Morph Sandbox gives your coding agents isolated execution environments with model inference, code execution, and file system access. No GPU infrastructure to manage.