DeepSeek R1 on Ollama: Run a Reasoning Model Locally in 5 Minutes

Run DeepSeek R1 locally with Ollama: complete setup, VRAM requirements for every quantization level, performance benchmarks on consumer hardware, reasoning chain examples, and when local R1 beats cloud APIs.

April 5, 2026 · 2 min read

Quick Start: Zero to Inference in 3 Commands

Ollama handles model downloading, quantization, and serving behind a single CLI. You do not need to configure Python environments, download GGUF files manually, or set up model configs. Pick a model size, pull it, run it.

Linux / WSL

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull DeepSeek R1 (14B distill, good balance of speed and quality)
ollama pull deepseek-r1:14b

# Run it
ollama run deepseek-r1:14b

On macOS, download the Ollama desktop app from ollama.com instead of using the curl installer. After installation, the ollama command is available in your terminal. The pull and run commands are identical.

macOS (after installing the app)

ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b

First run

The first ollama pull downloads the model weights. The 14B Q4 quantization is about 9 GB. Subsequent runs load from local cache instantly. If you have limited bandwidth, start with deepseek-r1:7b (4.7 GB) or even deepseek-r1:1.5b (1.1 GB) to verify your setup works before pulling a larger model.

Which Model Size to Pick

DeepSeek released six distilled versions of R1, plus the full 671B model. The distilled models are not simplified versions: they were trained via knowledge distillation from the full model using 800K curated reasoning samples. The smaller variants are based on Qwen 2.5 (1.5B, 7B, 14B, 32B) and Llama 3.1/3.3 (8B, 70B).

| Model Tag | Parameters | Download Size | Base Architecture | Context Window |
| --- | --- | --- | --- | --- |
| deepseek-r1:1.5b | 1.5B | 1.1 GB | Qwen 2.5 | 128K |
| deepseek-r1:7b | 7B | 4.7 GB | Qwen 2.5 | 128K |
| deepseek-r1:8b | 8B | 5.2 GB | Llama 3.1 | 128K |
| deepseek-r1:14b | 14B | 9.0 GB | Qwen 2.5 | 128K |
| deepseek-r1:32b | 32B | 20 GB | Qwen 2.5 | 128K |
| deepseek-r1:70b | 70B | 43 GB | Llama 3.3 | 128K |
| deepseek-r1:671b | 671B | 404 GB | DeepSeek MoE | 128K |
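If you script your environment setup, the table above collapses into a simple picker. A minimal sketch (`pick_r1_tag` and its thresholds are my own rule of thumb derived from the Q4 download sizes above plus headroom, not an official sizing guide):

```python
def pick_r1_tag(free_memory_gb: float) -> str:
    """Suggest a DeepSeek R1 distill tag for the given free RAM/VRAM.

    Thresholds are the Q4 download sizes from the table plus headroom
    for the KV cache and OS; treat them as a starting point, not a rule.
    """
    tiers = [
        (48, "deepseek-r1:70b"),   # 43 GB weights
        (24, "deepseek-r1:32b"),   # 20 GB weights
        (12, "deepseek-r1:14b"),   # 9 GB weights
        (6,  "deepseek-r1:7b"),    # 4.7 GB weights
        (2,  "deepseek-r1:1.5b"),  # 1.1 GB weights
    ]
    for min_gb, tag in tiers:
        if free_memory_gb >= min_gb:
            return tag
    raise ValueError("Not enough memory for any R1 distill")
```

For example, `pick_r1_tag(32)` suggests the 32B model, while `pick_r1_tag(16)` falls back to the 14B.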

Recommendations by Use Case

Testing and prototyping

deepseek-r1:7b or :8b. Runs on 8 GB RAM. Fast enough for iterating on prompts and validating reasoning behavior. Not production quality, but a good sanity check before committing to a larger model.

Daily driver for coding and math

deepseek-r1:14b. Needs 10-12 GB RAM. Beats QwQ-32B-Preview on every evaluation metric despite being half the size. The best ratio of quality to resource cost.

Maximum local quality

deepseek-r1:32b. Needs 20-24 GB RAM (fits on RTX 4090 or M-series Mac with 32GB+). Outperforms o1-mini on most benchmarks. The right choice if you have the hardware.

Near-frontier reasoning

deepseek-r1:70b. Needs 48+ GB RAM or a multi-GPU setup. Scores 86.7% on AIME 2024 with majority voting over 64 samples and 57.5 on LiveCodeBench. Approaches the full 671B model on many tasks.

VRAM Requirements by Quantization

Quantization reduces model precision to fit in less memory. Q4_K_M (4-bit, mixed precision) is the Ollama default and the best tradeoff for most people. More important layers keep higher precision, so reasoning quality degrades less than you would expect from a 4x compression.

| Model | Q4_K_M (4-bit) | Q5_K_M (5-bit) | Q8_0 (8-bit) | FP16 (full) |
| --- | --- | --- | --- | --- |
| 1.5B | ~2 GB | ~2.5 GB | ~3 GB | ~4 GB |
| 7B | ~5 GB | ~6 GB | ~9 GB | ~15 GB |
| 8B | ~6 GB | ~7 GB | ~10 GB | ~17 GB |
| 14B | ~9 GB | ~11 GB | ~16 GB | ~30 GB |
| 32B | ~20 GB | ~24 GB | ~36 GB | ~68 GB |
| 70B | ~43 GB | ~50 GB | ~75 GB | ~145 GB |
| 671B | ~250 GB | ~300 GB | ~400 GB | ~1.3 TB |

Quantization quality

Q4_K_M uses mixed precision: attention layers and the first/last transformer blocks retain higher precision while less critical layers are quantized more aggressively. On reasoning benchmarks, Q4_K_M models lose 1-3% accuracy compared to FP16. Q5_K_M closes most of that gap if you have the headroom. Q8_0 is near-lossless but doubles the memory cost. For reasoning tasks specifically, Q5_K_M is measurably better than Q4 on multi-step math and code generation.
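The memory table above follows a simple rule of thumb: weight memory is roughly parameters × bits per weight / 8, plus a little overhead for embeddings, norms, and file metadata. A sketch (the 1.10 overhead factor and the ~4.8 effective bits per weight for Q4_K_M are approximations on my part, not published constants):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.10) -> float:
    """Rough weight-memory estimate: params x bits/8, plus ~10% overhead.

    KV cache is extra and grows with context length. Q4_K_M averages
    around 4.8 effective bits per weight because of its mixed precision.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9
```

Plugging in 14B at ~4.8 bits gives roughly 9 GB, which lines up with the Q4_K_M column above.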

Pulling a Specific Quantization

Ollama defaults to Q4_K_M. To use a different quantization, check available tags on the Ollama model page and pull the specific tag:

Pulling a Q8 quantization

# Pull the 14B model at 8-bit quantization
ollama pull deepseek-r1:14b-q8_0

# Or the 32B at 5-bit for better reasoning accuracy
ollama pull deepseek-r1:32b-q5_K_M

Performance Benchmarks

These are published numbers from DeepSeek's paper and independent evaluations. The distilled models punch well above their weight class because distillation transfers reasoning patterns, not just knowledge.

| Benchmark | R1 7B | R1 14B | R1 32B | R1 70B | R1 671B (full) |
| --- | --- | --- | --- | --- | --- |
| AIME 2024 | 55.5% | 69.7% | 72.6% | 86.7% | 79.8% |
| MATH-500 | 92.8% | 93.9% | 94.3% | 94.5% | 97.3% |
| LiveCodeBench | 39.6% | ~45% | 57.2% | 57.5% | 65.9% |
| GPQA Diamond | ~49% | ~56% | ~62% | ~65% | 71.5% |

Notice that the 70B distill appears to beat the full 671B on AIME 2024 (86.7% vs 79.8%). Treat that comparison with care: 86.7% is the 70B's majority-vote (cons@64) score from DeepSeek's paper, while 79.8% is the full model's single-attempt pass@1; the 70B's own pass@1 is 70.0%. Distillation concentrates the teacher's reasoning patterns remarkably well for the size, but measured the same way, the full model remains the stronger math reasoner.


R1-0528 Update

In May 2025, DeepSeek released R1-0528 with notable improvements. AIME 2025 accuracy jumped from 70% to 87.5%. Average reasoning depth increased from 12K to 23K tokens per question. Hallucination rate dropped. Function calling support was added. System prompts are now supported natively, removing the earlier workaround of prepending think tokens manually.

Apple Silicon Guide

Ollama runs natively on Apple Silicon. Macs use unified memory, which means system RAM doubles as GPU memory. This gives Apple hardware an advantage for local LLMs: a 64 GB M4 Pro can run models that would need a dedicated GPU on x86.

| Mac | RAM | Best Model Size | Approx. Speed |
| --- | --- | --- | --- |
| M1/M2 (16 GB) | 16 GB | 7B or 8B | 20-30 tok/s |
| M2 Pro / M3 (32 GB) | 32 GB | 14B | 15-20 tok/s |
| M3 Pro / M4 Pro (36-48 GB) | 36-48 GB | 32B (Q4) | 11-14 tok/s |
| M4 Pro / M4 Max (64 GB) | 64 GB | 32B (Q5/Q8) | 12-16 tok/s |
| M3 Ultra / M4 Max (128+ GB) | 128+ GB | 70B | 8-12 tok/s |
| M3 Ultra (512 GB) | 512 GB | 671B (full) | ~17 tok/s |

The speed numbers are for token generation (not prompt processing, which is faster). For coding tasks, 10-15 tok/s is comfortable for interactive use. Below 5 tok/s, you will notice lag. The 32B model on an M4 Pro is the sweet spot for Mac users who want quality reasoning without waiting.
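To turn those speeds into wall-clock expectations, divide the response length by the generation rate. A sketch (this ignores prompt processing, which is faster, and assumes the think phase dominates the token count):

```python
def response_seconds(total_tokens: int, tok_per_s: float) -> float:
    """Rough wall-clock time for the generation phase of a response."""
    return total_tokens / tok_per_s

# A reasoning answer with a ~2,000-token think phase plus final answer:
#   at 12 tok/s (32B on an M4 Pro) -> ~167 seconds
#   at 25 tok/s (7B on an M1)      -> 80 seconds
```

A multi-minute wait for a single 32B response is normal; that is the speed-for-accuracy tradeoff reasoning models make.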

Memory pressure

Leave 4-8 GB of RAM free for the OS and KV cache. If you have 32 GB total, the 14B model (9 GB weights + cache) will run well. The 32B model (20 GB weights) will work but may hit swap if you have other apps open. Close browsers and Electron apps before running large models.

How the Reasoning Works

R1 is not a standard chat model. Every response starts with a chain-of-thought reasoning phase wrapped in <think> tags. The model works through the problem step by step, then produces a final answer. This is the mechanism behind R1's benchmark scores: it trades speed for accuracy by spending more tokens thinking before committing.

Example: R1 reasoning through a coding question

> How many times does the letter 'r' appear in 'strawberry'?

<think>
Let me examine each letter in "strawberry" one at a time:
s - not r
t - not r
r - this is r (count: 1)
a - not r
w - not r
b - not r
e - not r
r - this is r (count: 2)
r - this is r (count: 3)
y - not r

So I found 3 occurrences of 'r'.
</think>

The letter 'r' appears **3 times** in "strawberry" - at positions 3, 8, and 9.

The thinking phase is visible by default in Ollama's terminal output. When using the API, set the think parameter to true to receive the reasoning trace separately from the final answer.

Accessing the reasoning trace via Ollama's Python API

from ollama import chat

response = chat(
    model='deepseek-r1:14b',
    messages=[{'role': 'user', 'content': 'Solve: 2x + 5 = 17'}],
    think=True
)

print("Reasoning:", response.message.thinking)
print("Answer:", response.message.content)

Standard LLM

Generates the answer directly. Fast, but prone to errors on multi-step problems. No visibility into how the model reached its conclusion.

DeepSeek R1 (reasoning model)

Generates a step-by-step reasoning chain first, then produces the final answer. Slower per response, but significantly more accurate on math, logic, and code tasks.

The tradeoff is token count. A reasoning response can be 5-10x longer than a standard response because of the thinking phase. For simple questions, this is wasted compute. For debugging a complex function or solving a multi-step math problem, the reasoning chain is what makes R1 competitive with models 10x its size.
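If you are handling raw transcript text rather than the structured API response (for example, with tooling that predates the think parameter), separating the phases is a small parsing job. A minimal sketch, assuming the <think>...</think> convention shown above:

```python
import re

def split_reasoning(raw: str) -> tuple[str, str]:
    """Split an R1 transcript into (reasoning, answer).

    Assumes at most one <think>...</think> block at the start of the
    response, which is how R1 formats its output.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer
```

This lets you log or hide the reasoning trace independently of the final answer in your own tooling.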

API Server Mode

Ollama automatically serves an OpenAI-compatible REST API at http://localhost:11434/v1. Any tool built for the OpenAI SDK can connect to your local R1 instance with a one-line URL change. No additional configuration needed.

Using the OpenAI SDK with local DeepSeek R1

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama',  // Ollama doesn't validate this
});

const response = await client.chat.completions.create({
  model: 'deepseek-r1:14b',
  messages: [
    { role: 'user', content: 'Write a TypeScript function to debounce async calls' }
  ],
});

console.log(response.choices[0].message.content);

This compatibility layer means you can swap between local R1 and cloud APIs by changing two variables: the base URL and the model name. During development, run locally for free. In production, point at a cloud endpoint for reliability.
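That two-variable swap is easy to centralize. A minimal sketch in Python (the environment-variable names here are my own choice for illustration, not an Ollama or OpenAI convention):

```python
import os

def inference_config() -> dict:
    """Return base_url/api_key/model for the OpenAI SDK, chosen by env.

    LOCAL_LLM=1 selects the local Ollama endpoint; anything else falls
    through to a cloud endpoint read from the environment.
    """
    if os.environ.get("LOCAL_LLM") == "1":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",          # Ollama ignores the key
            "model": "deepseek-r1:14b",
        }
    return {
        "base_url": os.environ["CLOUD_BASE_URL"],
        "api_key": os.environ["CLOUD_API_KEY"],
        "model": os.environ.get("CLOUD_MODEL", "deepseek-reasoner"),
    }

# Usage with the OpenAI SDK:
#   cfg = inference_config()
#   client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```

The rest of your code calls the client the same way regardless of which backend is selected.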

Adding a Web Interface

For a ChatGPT-like UI on top of your local R1, install Open WebUI:

Open WebUI with Docker

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Then visit http://localhost:3000
# Select deepseek-r1:14b from the model dropdown

Local R1 vs Cloud APIs

Running R1 locally is not always the right choice. Here is when it makes sense and when it does not.

| Dimension | Local (Ollama) | Cloud (DeepSeek API) | Cloud (OpenAI o1) |
| --- | --- | --- | --- |
| Cost per 1M input tokens | $0 (after hardware) | $0.55 | $15.00 |
| Cost per 1M output tokens | $0 (after hardware) | $2.19 | $60.00 |
| Data privacy | Complete (nothing leaves machine) | Data sent to DeepSeek servers | Data sent to OpenAI servers |
| Rate limits | None | Tier-based | Tier-based |
| Uptime | Depends on your hardware | 99.9%+ SLA | 99.9%+ SLA |
| Model size | Limited by your VRAM | Full 671B | Full o1 |
| Latency (first token) | Low (no network hop) | 50-200ms network | 50-200ms network |
| Speed (tok/s) | 10-30 (consumer GPU) | 50-100+ | 50-100+ |

When Local Wins

High-volume coding agents

A coding agent making 200+ inference calls per session racks up API bills fast. At $2.19/M output tokens, a heavy session costs $5-20. Locally, the same workload costs electricity. If you run agents daily, the hardware pays for itself in weeks.
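The back-of-envelope math is easy to check. A sketch using the DeepSeek API output price from the table and an assumed GPU cost (the $1,600 hardware figure and the 5M tokens/day workload are illustrative, not measurements):

```python
def breakeven_days(hardware_cost_usd: float,
                   output_tokens_per_day: float,
                   price_per_million_usd: float = 2.19) -> float:
    """Days of daily usage until local hardware matches the API bill.

    Ignores electricity and input-token costs, so it understates the
    local advantage somewhat.
    """
    daily_api_cost = output_tokens_per_day / 1e6 * price_per_million_usd
    return hardware_cost_usd / daily_api_cost

# e.g. a $1,600 GPU vs an agent emitting 5M output tokens/day:
# $10.95/day in API costs -> ~146 days to break even
```

Heavier agent workloads or a GPU you already own shift the break-even point much earlier.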

Sensitive codebases

Enterprise code, proprietary algorithms, client data. If the code cannot leave your machine, local inference is the only option. No API provider can guarantee they won't log or train on your prompts, regardless of what their ToS says.

Offline development

Airplanes, trains, restricted networks. Local models work without connectivity. Pull the model once, run anywhere. Particularly useful for on-site consulting where client networks block external APIs.

Experimentation and fine-tuning

Testing prompt strategies, comparing quantization levels, building custom workflows. Local inference has no per-call cost, so you can iterate without watching a billing dashboard.

When Cloud Wins

If you need the full 671B model quality and do not have 400+ GB of VRAM, cloud is the only option. If uptime matters more than cost, cloud is more reliable than a single machine. If you run inference infrequently (a few calls per day), the API bill is negligible and not worth the hardware investment.

Practical Tips

1. Set Context Length Explicitly

R1 supports 128K context, but Ollama's default context window is far smaller (2,048 tokens in older releases, 4,096 in current ones). For coding tasks that need more context, set it explicitly:

Increasing context length

# Set 32K context (good for most coding tasks) from inside the REPL
ollama run deepseek-r1:14b
>>> /set parameter num_ctx 32768

# Or via the API with num_ctx
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "your prompt here",
  "options": { "num_ctx": 32768 }
}'

Context vs VRAM

Larger context windows consume more VRAM for the KV cache. At 32K context, add roughly 2-4 GB on top of the model weights. At 128K, add 8-16 GB. If you are running close to your VRAM limit, keep context at 8K-16K and feed relevant code snippets rather than entire files.
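The KV-cache growth can be estimated from the architecture. A sketch using illustrative dimensions (the layer/head numbers below are ballpark figures for a 14B-class model with grouped-query attention, not the exact R1-distill config):

```python
def kv_cache_gb(ctx_tokens: int, n_layers: int = 48, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    x context length x element size. FP16 cache -> bytes_per_elem = 2."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total / 1e9

# ~6.4 GB at 32K context with these dimensions; an 8-bit KV cache
# halves that, which is closer to the 2-4 GB range cited above.
```

Models without grouped-query attention (more KV heads) pay substantially more per token of context, which is why KV-cache estimates vary so much between architectures.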

2. Use the Right Model for the Task

R1 is a reasoning model. It is overkill for simple tasks like reformatting JSON or generating boilerplate. Use a standard model (like llama3:8b or qwen2.5:14b) for straightforward generation, and switch to R1 when you need the model to actually think through a problem: debugging, algorithm design, complex refactors, math proofs.

3. Keep Ollama Updated

Ollama ships performance improvements regularly. Thinking-model support (the think parameter that separates reasoning traces from final answers) requires Ollama 0.9 or later. Function calling with R1-0528 requires the latest version. Run ollama --version and update if you are behind.

4. GPU Offloading on Mixed Systems

If you have a GPU that cannot fit the entire model, Ollama automatically splits layers between GPU and CPU. GPU handles what fits, CPU handles the rest. This is slower than full GPU inference but faster than pure CPU. Check layer allocation in the Ollama logs at startup.

Frequently Asked Questions

How much VRAM do I need to run DeepSeek R1 on Ollama?

The 7B distill at Q4 quantization needs about 5 GB. The 14B needs 9 GB. The 32B needs 20 GB, which fits on an RTX 4090 or any Mac with 32 GB+ unified memory. The 70B needs 43 GB at Q4. The full 671B needs 400+ GB across multiple GPUs. For most developers, the 14B or 32B distill is the sweet spot.

What is the fastest way to install DeepSeek R1 with Ollama?

Three commands on Linux: install Ollama, pull the model, run it. On macOS, download the desktop app from ollama.com, then pull and run from terminal. The whole process takes under 5 minutes on a fast connection. The model download is the bottleneck: 9 GB for the 14B, 20 GB for the 32B.

How does DeepSeek R1 compare to OpenAI o1?

Close on reasoning benchmarks. R1 scores 79.8% on AIME 2024 (o1 scores ~83%) and 97.3% on MATH-500 (matching o1). R1 generates longer reasoning chains, so individual responses are slower. The API cost difference is 27x: $0.55 vs $15 per million input tokens. The biggest distinction: R1 is open-weight. You can run it locally at zero marginal cost. o1 is closed-source and API-only.

Can I run DeepSeek R1 on a MacBook?

Yes. Ollama supports Apple Silicon natively and uses the unified memory architecture. A 16 GB M1 MacBook runs the 7B distill at 20-30 tok/s. A 64 GB M4 Pro runs the 32B at 11-14 tok/s. An M3 Ultra Mac Studio with 512 GB can run the full 671B model at about 17 tok/s.

What are the distilled DeepSeek R1 models?

Six smaller variants trained via knowledge distillation from the full 671B model. Qwen-based: 1.5B, 7B, 14B, 32B. Llama-based: 8B, 70B. They inherit R1's reasoning behavior, including the visible chain-of-thought in think tags. The 32B outperforms OpenAI o1-mini. The 14B outperforms QwQ-32B-Preview. All have 128K context windows and support commercial use.

Does Ollama expose an OpenAI-compatible API for DeepSeek R1?

Yes. http://localhost:11434/v1 implements the OpenAI chat completions API. Point the OpenAI SDK at this URL and use deepseek-r1:14b as the model name. Ollama does not validate API keys, so any string works. This lets you swap between local and cloud inference by changing two config values.

Is R1-0528 available on Ollama?

The 0528 update is available as a community model and the default deepseek-r1:latest tag now points to the updated version for the 671B and 8B variants. For distilled models, check the tags page for the latest available quantizations.

Can I use DeepSeek R1 for coding agents?

Yes, and this is one of the strongest use cases. The reasoning chain helps R1 break down complex coding problems step by step. The 32B distill scores 57.2 on LiveCodeBench, competitive with much larger models. Pair it with Morph's code tools for a fully local coding agent that keeps your code private and runs without API costs.

Related Articles

Build a Private Coding Agent with Local R1 + Morph

Run DeepSeek R1 locally for reasoning. Connect Morph's fast-apply engine for code edits. Zero API costs, zero data leakage, full coding agent capabilities.
