Ollama + Claude Code: How to Run Claude Code with Local Models (and Why You'd Want To)

Set up Claude Code with Ollama to run local models for free. Step-by-step configuration, model recommendations, performance benchmarks vs cloud Claude, hardware requirements, and privacy benefits.

April 5, 2026 · 12 min read

Ollama v0.14 shipped native Anthropic API compatibility. Claude Code can now talk to local models directly, no proxy needed. You get the same CLI, the same tool-use workflow, the same CLAUDE.md files, but the inference happens on your hardware.

The search queries tell the story. “Ollama Claude Code” pulls 390 monthly searches at KD 4. “Claude Code Ollama” adds another 320 at KD 3. Combined CPC is $62, which means people searching this are ready to build, not just browse.

  • 390 — monthly search volume (primary keyword)
  • KD 4 — keyword difficulty
  • $36 — CPC (primary keyword)
  • $0 — cost to run locally

Why Run Claude Code with Local Models

Claude Code against Anthropic’s API is the default. It works well. But three scenarios push developers toward local inference:

Zero API Cost

No per-token billing. Heavy users spending $60-100/month on Claude API break even on a $500 GPU in 6-8 months. Ollama's cloud models are free with generous limits.

Data Never Leaves Your Machine

Code stays on localhost. No cross-border transfer, no third-party processing agreements, no audit trails. Critical for healthcare, defense, and regulated industries.

No Rate Limits

Anthropic caps requests per minute. During crunch time, you hit them. Local models run as fast as your hardware allows, with no throttling.

The tradeoff is speed. Local models on consumer hardware are significantly slower than cloud inference. The rest of this guide quantifies exactly how much slower, so you can decide whether the tradeoffs work for your situation.

Prerequisites

  • Ollama v0.14+ (required for Anthropic API compatibility). Check with ollama --version.
  • Claude Code CLI installed via curl -fsSL https://claude.ai/install.sh | bash (macOS/Linux) or irm https://claude.ai/install.ps1 | iex (Windows).
  • Hardware: 16GB+ RAM (Apple Silicon) or 16GB+ VRAM (GPU) for local models. Cloud models have no hardware floor.

Pre-release note

Streaming tool calls, which Claude Code depends on for its agentic workflow, require Ollama 0.14.3-rc1 or later. If tool calls fail silently on stable Ollama, install the pre-release:

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh

Quick Start: ollama launch claude

The fastest path. One command, no environment variables, no config files. Ollama handles everything.

One-command setup

# Launch Claude Code with Ollama's default coding model
ollama launch claude

# Or specify a model
ollama launch claude --model glm-4.7-flash

# Cloud model (no local hardware needed)
ollama launch claude --model qwen3.5:cloud

# Headless mode for scripts and CI
ollama launch claude --model glm-4.7-flash --yes -- -p "explain this codebase"

That’s it. ollama launch sets ANTHROPIC_AUTH_TOKEN, ANTHROPIC_BASE_URL, and ANTHROPIC_API_KEY automatically, then starts Claude Code pointed at your local Ollama instance.

Manual Configuration

If you prefer explicit control, or need to integrate with existing shell profiles, set the environment variables yourself.

Environment variables

# Set these in your terminal or add to ~/.zshrc / ~/.bashrc
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

# Pull a model first
ollama pull glm-4.7-flash

# Start Claude Code with that model
claude --model glm-4.7-flash

Or inline, without touching your shell profile:

Inline (single session)

ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude --model glm-4.7-flash

Settings.json Approach

You can also configure this in Claude Code’s settings file for persistence across sessions:

~/.claude/settings.json

{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_BASE_URL": "http://localhost:11434"
  },
  "model": "glm-4.7-flash"
}
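Whichever configuration you choose, a quick curl confirms Ollama is answering before you start a session. This sketch assumes the Anthropic-compatible route lives at /v1/messages with Anthropic-style headers, mirroring Anthropic's Messages API; check your Ollama version's docs if the request 404s:

```shell
# Smoke test the Anthropic-compatible endpoint (path and headers are
# assumptions based on Anthropic's Messages API shape; verify against
# your Ollama version's documentation).
curl -s http://localhost:11434/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: ollama" \
  -d '{"model": "glm-4.7-flash", "max_tokens": 32,
       "messages": [{"role": "user", "content": "ping"}]}' \
  || echo "No response: is ollama serve running?"
```

A JSON reply means Claude Code will be able to reach the endpoint; a connection error means Ollama is not serving on port 11434.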

The role model problem

Claude Code routes different request types to different model tiers: haiku for quick tasks, sonnet for standard work, opus for complex reasoning. If you only set ANTHROPIC_MODEL, the role-model requests still try to reach Anthropic’s servers.

Fix this by also setting ANTHROPIC_DEFAULT_HAIKU_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL to your local model name. Or use claude-launcher (npm install -g claude-launcher), which remaps all role models to your Ollama model automatically.

Which Models Work

Not every Ollama model is a good fit. Claude Code needs large context windows (64K+ tokens minimum) and tool calling support. Models without tool calling can generate text but cannot read files, run commands, or apply edits autonomously.
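You can check what a model declares before wiring it up. Recent Ollama versions print a Capabilities section in ollama show output, which lists tools for models that support tool calling:

```shell
# Inspect a pulled model's metadata; look for "tools" under Capabilities.
# Models without it can still chat, but cannot drive Claude Code's
# file reads, edits, or shell commands.
ollama show glm-4.7-flash \
  || echo "Model not pulled yet — run: ollama pull glm-4.7-flash"
```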

Local Models

| Model | Parameters | Context | Tool Calling | RAM/VRAM |
|---|---|---|---|---|
| glm-4.7-flash | 30B MoE (3B active) | 128K | Yes (79.5% benchmark) | ~6.5GB (Q4) |
| qwen3-coder | 30B | 128K | Yes | ~20GB (Q4) |
| gpt-oss:20b | 20B | 32K | Yes | ~12GB (Q4) |
| devstral-small-2 | 24B | 128K | Yes | ~16GB (Q4) |
| qwen2.5-coder:32b | 32B | 128K | Limited | ~24GB (Q4) |

Top pick: GLM-4.7-Flash. It uses mixture-of-experts architecture, so only 3B parameters activate per token despite being a 30B model. That means it runs on 16GB RAM or an RTX 3060. The 128K context window handles substantial codebases, and its tool calling scored 79.5% on agent benchmarks.

Cloud Models (via Ollama)

Ollama also proxies cloud models. These run on remote infrastructure but use the same Ollama interface. Free tier with generous rate limits.

| Model | Context | Tool Calling | Cost |
|---|---|---|---|
| qwen3.5:cloud | 128K+ | Yes | Free (rate limited) |
| glm-5:cloud | 128K+ | Yes | Free (rate limited) |
| kimi-k2.5:cloud | 128K+ | Yes | Free (rate limited) |
| minimax-m2.7:cloud | 128K+ | Yes | Free (rate limited) |
| qwen3-coder:480b-cloud | 128K+ | Yes | Free (rate limited) |
| gpt-oss:120b-cloud | 128K+ | Yes | Free (rate limited) |

Cloud models are the pragmatic middle ground

If your goal is free Claude Code usage without buying a GPU, cloud models through Ollama are the move. You get frontier-level quality (Qwen 3.5 and GLM-5 are competitive with proprietary models on coding benchmarks), full context windows, and the same ollama launch claude setup. The tradeoff is that you’re sending code to external servers, which defeats the privacy argument.

Cloud Models: Free Inference Through Ollama

This is the setup most people actually want. You get Claude Code’s full agentic workflow, powered by models that rival Claude Sonnet, for $0.

Using Ollama cloud models

# Qwen 3.5 — strong all-around coding model
ollama launch claude --model qwen3.5:cloud

# GLM-5 — competitive on code generation benchmarks
ollama launch claude --model glm-5:cloud

# Kimi K2.5 — good at multi-step reasoning
ollama launch claude --model kimi-k2.5:cloud

# 480B parameter Qwen3-Coder — largest available
ollama launch claude --model qwen3-coder:480b-cloud

Cloud models automatically get full context length. No need to configure num_ctx or worry about context truncation. Rate limits exist but are generous enough for normal development workflows.

Performance: Local vs Cloud

Real numbers from published benchmarks. The speed gap is the defining tradeoff of local inference.

Token Throughput

| Setup | Tokens/sec | Notes |
|---|---|---|
| Claude API (Sonnet 4) | 60-80 tok/s | Anthropic's infrastructure |
| Ollama cloud model | 30-60 tok/s | Varies by model and load |
| RTX 4070 Ti Super (32B Q4) | 15-25 tok/s | $489 GPU, 16GB VRAM |
| M1 Max 64GB (30B MoE) | 10-20 tok/s | Apple Silicon unified memory |
| RTX 3060 (GLM-4.7-Flash) | 8-15 tok/s | Budget GPU, 12GB VRAM |
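To make those rates concrete, here is the wall-clock cost of streaming a single 2,000-token response at a representative rate from each tier. This is back-of-envelope arithmetic, not a benchmark:

```shell
# Seconds to stream a 2,000-token response at representative throughputs
# (midpoints drawn from the table above).
for rate in 70 45 20 15 11; do
  awk -v r="$rate" 'BEGIN { printf "%3d tok/s -> %6.1f s\n", r, 2000 / r }'
done
# At 70 tok/s the response lands in about half a minute;
# at 11 tok/s the same response takes roughly three minutes.
```

An agent session is dozens of such responses back to back, which is why per-token speed differences compound into the hour-scale gaps measured below.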

Real-World Task Comparison

From a published benchmark tracing entity relationships across a .NET monorepo:

  • Cloud Claude (Sonnet): 1m 13s
  • GLM-4.7 local (M1 Max): 82 min
  • Speed difference: 68x

The output quality was near-identical for that task. Both identified the same architectural patterns and produced equivalent analysis. But the cloud version finished in the time it takes to make coffee. The local version took longer than most meetings.

Coding Quality Benchmarks

From a controlled comparison running 50 coding tasks across 5 categories on an RTX 4070 Ti Super:

| Task | Qwen2.5-Coder-32B | DeepSeek-Coder-V2 | Claude Sonnet 4 |
|---|---|---|---|
| Function generation | 4.1 | 3.7 | 4.4 |
| Bug detection | 3.8 | 3.4 | 4.6 |
| Refactoring | 4.0 | 3.5 | 4.3 |
| Multi-file context | 2.8 | 2.4 | 4.5 |
| Code explanation | 4.2 | 3.9 | 4.1 |

Local models score within 85-90% of Claude on single-file tasks like function generation and code explanation. The gap widens on multi-file reasoning, where Claude’s training and scale give it a clear advantage. Bug detection also favors Claude: catching subtle issues demands the deep, multi-step reasoning where frontier model quality matters most.

Cost Breakeven

| | Claude API | Local GPU | Ollama Cloud |
|---|---|---|---|
| Upfront cost | $0 | $489 (RTX 4070 Ti) | $0 |
| Monthly cost | $60-100 | $8-12 (electricity) | $0 |
| 6-month total | $360-600 | $537-561 | $0 |
| 12-month total | $720-1,200 | $585-633 | $0 |
| Quality | Best | 85-90% | 90-95% |
| Speed | Fastest | 3-5x slower | 1.5-2x slower |

A heavy API user breaks even on GPU hardware in 6-8 months. Ollama cloud models are free but come with rate limits and send your code to external servers. For cost-sensitive developers without strict privacy requirements, cloud models through Ollama are the strongest value.
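The breakeven claim is simple arithmetic. Taking a midpoint of ~$80/month of API spend replaced by ~$10/month of electricity (illustrative figures, consistent with the ranges in the table):

```shell
# Months until a $489 GPU pays for itself versus per-token API billing.
awk 'BEGIN {
  gpu = 489; api_monthly = 80; electricity_monthly = 10
  printf "Breakeven: %.1f months\n", gpu / (api_monthly - electricity_monthly)
}'
# Roughly 7 months, consistent with the 6-8 month range above.
```

Lighter API users ($30-40/month) push breakeven past a year, which is when Ollama's free cloud tier becomes the obvious alternative.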

Hardware Requirements

| Tier | Hardware | Best Model | Experience |
|---|---|---|---|
| Minimum viable | 16GB RAM (M1/M2) or RTX 3060 12GB | GLM-4.7-Flash (Q4) | Usable for single-file tasks. Slower on complex operations. |
| Recommended | 32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB | Qwen3-Coder 30B (Q4) | Solid for most coding workflows. Multi-file works but slower. |
| Ideal | 64GB+ RAM (M2/M3 Max) or RTX 4090 24GB | Qwen2.5-Coder-32B (Q6) | Best local experience. Higher-precision quantization, faster throughput. |

Apple Silicon advantage

Macs with unified memory handle large models surprisingly well. The M1 Max with 64GB can run 30B models with full 128K context, though at 10-20 tok/s. The M4 and future chips will narrow the speed gap with cloud inference further.

If you want local inference without buying hardware, Ollama’s cloud models skip this entire section. Run ollama launch claude --model qwen3.5:cloud and your only requirement is a machine that can run Ollama itself.

Tool Calling and Limitations

Claude Code is not just a chat interface. It reads files, runs shell commands, applies edits, and searches code. All of that depends on tool calling: the model sends structured function calls, Claude Code executes them, and the results feed back into the conversation.

Without tool calling, Claude Code degrades to a plain text generator. You ask it to read a file and it describes what it would do instead of doing it.

What Works

  • GLM-4.7-Flash has native tool calling support. 79.5% on agent benchmarks. Best local option.
  • Qwen3-Coder supports tool calling. Slightly less reliable than GLM on complex multi-step chains.
  • All Ollama cloud models (qwen3.5:cloud, glm-5:cloud, kimi-k2.5:cloud) have full tool calling.

What Doesn’t

  • Most 7B models have weak or no tool calling. They generate text, but Claude Code cannot use them for autonomous operations.
  • Older Ollama versions (pre-0.14) do not support the Anthropic API at all. Streaming tool calls specifically require 0.14.3-rc1+.
  • Quantized models at Q2/Q3 lose too much instruction-following ability for reliable tool use.

The Edit Accuracy Problem

Even models with tool calling struggle with edit accuracy. Claude Code’s edit system was optimized for Claude’s output format. Alternative models approximate the format but miss details: wrong line numbers, bad whitespace, mismatched context. Published measurements show raw model diffs land at 70-80% accuracy for non-Claude models vs 98% for Claude.

That 20-30% failure rate compounds. Over a 50-edit session, you spend more time fixing broken patches than writing code.

Privacy and Compliance

This is the strongest argument for local models. When inference happens on your machine, the compliance picture simplifies dramatically:

  • No cross-border data transfer. Code never leaves your jurisdiction. Relevant for GDPR, data residency laws, and government contracts.
  • No third-party processing. Eliminates the need for Data Processing Agreements with model providers.
  • Right to erasure. Your code is never encoded in external model weights. There is nothing to delete because nothing was stored.
  • Air-gapped environments. Defense contractors and healthcare organizations often cannot send code to any external API. Local models are the only option.

A practical pattern for mixed environments: use local models for sensitive codebases (PII handling, medical records, proprietary algorithms) and cloud models for open-source or public-facing work. Tools like claude-launcher make switching between -l (local) and -a (API) instant.
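If you would rather not install another tool, two small shell functions approximate the same toggle. These helpers are hypothetical — the function names and the pinned model are illustrative, not part of Claude Code, Ollama, or claude-launcher:

```shell
# Hypothetical helpers for ~/.zshrc or ~/.bashrc.
claude-local() {
  # Route Claude Code to the local Ollama endpoint for this invocation only.
  ANTHROPIC_AUTH_TOKEN=ollama \
  ANTHROPIC_BASE_URL=http://localhost:11434 \
  ANTHROPIC_API_KEY="" \
  claude --model glm-4.7-flash "$@"
}

claude-api() {
  # Strip the Ollama overrides so Claude Code falls back to Anthropic's API.
  env -u ANTHROPIC_AUTH_TOKEN -u ANTHROPIC_BASE_URL -u ANTHROPIC_API_KEY \
    claude "$@"
}
```

Because the variables are scoped to each invocation, sensitive repos opened with claude-local never touch Anthropic's API, while claude-api behaves as a stock install.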

Troubleshooting

Claude Code says “connection refused”

Fix: Verify Ollama is running

# Check if Ollama is serving
curl http://localhost:11434/api/version

# If nothing responds, start it
ollama serve

# Verify the base URL matches
echo $ANTHROPIC_BASE_URL
# Should print: http://localhost:11434

Model just talks instead of acting

You ask Claude Code to read a file. It responds with “I would read the file and then...” instead of actually reading it. This means tool calling is not working.

  • Verify your Ollama version supports streaming tool calls (0.14.3-rc1+).
  • Switch to a model with confirmed tool support: GLM-4.7-Flash or any cloud model.
  • Check that ANTHROPIC_AUTH_TOKEN is set to ollama, not an actual API key.

Role model requests fail

Claude Code tries to use “haiku” for background tasks and fails because there’s no haiku model in Ollama.

Fix: Remap all role models

# Point all role models to your local model
export ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.7-flash
export ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7-flash
export ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7-flash

# Or use claude-launcher for automatic remapping
npm install -g claude-launcher
claude-launcher -l

Context window too small

If the model truncates responses or loses track of files, it may be running with a smaller context than configured.

Fix: Set context length explicitly

# Create a Modelfile with explicit context
cat > Modelfile << 'EOF'
FROM glm-4.7-flash
PARAMETER num_ctx 65536
EOF

ollama create glm-4.7-flash-64k -f Modelfile
claude --model glm-4.7-flash-64k
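If you would rather not maintain a derived model, recent Ollama builds also accept a server-wide default via the OLLAMA_CONTEXT_LENGTH environment variable. Confirm your version supports it before relying on this:

```shell
# Alternative: raise the server's default context window globally
# instead of per-model. Supported in recent Ollama releases; check
# your version's documentation.
OLLAMA_CONTEXT_LENGTH=65536 ollama serve
```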

Slow generation

If tokens come out painfully slow (under 5/sec), you are likely running a model that exceeds your hardware.

  • Drop to a smaller quantization: Q4_K_M instead of Q6_K.
  • Reduce context length. 32K is faster than 128K even if the model supports both.
  • Switch to GLM-4.7-Flash if using a dense model. MoE architecture uses 3B active parameters from a 30B model.
  • Consider cloud models. qwen3.5:cloud runs at 30-60 tok/s with zero hardware load.
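Before tuning anything, confirm where the model is actually executing. ollama ps reports the CPU/GPU split for loaded models, and a partial offload is the usual culprit behind single-digit throughput:

```shell
# "100% GPU" is healthy; a split like "48%/52% CPU/GPU" means the model
# spilled out of VRAM and explains painfully slow generation.
ollama ps 2>/dev/null || echo "ollama is not running"
```

If you see a split, drop quantization or context until the model fits entirely in VRAM.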

Frequently Asked Questions

Can you use Ollama with Claude Code?

Yes. Ollama v0.14 and later support the Anthropic Messages API natively. You can run ollama launch claude to auto-configure everything, or manually set ANTHROPIC_AUTH_TOKEN=ollama and ANTHROPIC_BASE_URL=http://localhost:11434, then start Claude Code with --model pointing to your Ollama model.

Is Ollama + Claude Code free?

Running local models through Ollama is completely free. No API keys, no billing. Ollama also offers cloud models (like glm-5:cloud and qwen3.5:cloud) with generous free tiers. Your only cost for local models is hardware and electricity.

What is the best Ollama model for Claude Code?

For local inference, GLM-4.7-Flash is the top recommendation: 128K context, native tool calling, and it runs on 16GB RAM thanks to mixture-of-experts architecture. For cloud models through Ollama, Qwen 3.5 and GLM-5 offer frontier-level quality at no cost.

How much slower is local inference compared to Claude's API?

Significantly. Consumer hardware generates 15-25 tokens per second. Anthropic's infrastructure pushes 60-80. In real-world testing, a task that took cloud Claude 73 seconds took a local model 82 minutes. Short prompts feel responsive; long multi-file operations expose the gap.

Do tool calls work with Ollama models?

Yes, but not all models support tool calling. GLM-4.7-Flash, Qwen3-Coder, and cloud models like kimi-k2.5:cloud have reliable tool use. Models without tool calling support will still generate text responses but cannot execute file operations, run commands, or apply edits automatically.

What hardware do I need to run Claude Code with Ollama?

At minimum, 16GB RAM (Apple Silicon) or 16GB VRAM (GPU). This runs models like GLM-4.7-Flash comfortably. For 32B parameter models like Qwen2.5-Coder-32B, you need 24GB+ VRAM or 32GB+ unified memory. Cloud models through Ollama have no hardware requirements beyond running Ollama itself.

Can I switch between local and cloud models in the same session?

You can switch models between sessions using the --model flag or /model command. Switching mid-session requires restarting Claude Code with the new model configuration. Tools like claude-launcher make toggling between local and cloud instant.

The Bottleneck Is Not the Model

Whether you run Claude Code against Anthropic’s API, a local Ollama model, or a free cloud model through Ollama, the same two problems dominate the agent loop: edit accuracy and search efficiency.

Local models make the edit problem worse. Claude’s 98% edit accuracy drops to 70-80% with alternative models. That 20-30% failure rate means retry loops, wasted tokens, and manual patch-fixing. Search is the other bottleneck: every irrelevant file loaded into context dilutes the signal the model reasons over.

These problems are solvable at a different layer. Morph’s Fast Apply model intercepts edit operations and merges them with a purpose-built model trained specifically for code application, pushing accuracy back to 98% regardless of which model generated the diff. It streams at 10,500+ tokens per second, so even large file rewrites complete in under a second. WarpGrep handles code search with 8 parallel tool calls per turn, filtering results before they hit your context window.

Both work with Claude Code as an MCP server. The model backend, whether Anthropic, Ollama local, or Ollama cloud, does not matter. The apply and search operations route through Morph independently.

Fix the edit bottleneck, regardless of model backend

Fast Apply pushes edit accuracy to 98% and streams at 10,500+ tok/s. WarpGrep searches your codebase in sub-6 seconds with 8 parallel tool calls. Both plug into Claude Code via MCP.