Ollama v0.14 shipped native Anthropic API compatibility. Claude Code can now talk to local models directly, no proxy needed. You get the same CLI, the same tool-use workflow, the same CLAUDE.md files, but the inference happens on your hardware.
The search queries tell the story. “Ollama Claude Code” pulls 390 monthly searches at KD 4. “Claude Code Ollama” adds another 320 at KD 3. Combined CPC is $62, which means people searching this are ready to build, not just browse.
Why Run Claude Code with Local Models
Claude Code against Anthropic’s API is the default. It works well. But three scenarios push developers toward local inference:
Zero API Cost
No per-token billing. Heavy users spending $60-100/month on Claude API break even on a $500 GPU in 6-8 months. Ollama's cloud models are free with generous limits.
Data Never Leaves Your Machine
Code stays on localhost. No cross-border transfer, no third-party processing agreements, no audit trails. Critical for healthcare, defense, and regulated industries.
No Rate Limits
Anthropic caps requests per minute. During crunch time, you hit them. Local models run as fast as your hardware allows, with no throttling.
The tradeoff is speed. Local models on consumer hardware are significantly slower than cloud inference. The rest of this guide quantifies exactly how much slower, so you can decide whether the tradeoffs work for your situation.
Prerequisites
- Ollama v0.14+ (required for Anthropic API compatibility). Check with `ollama --version`.
- Claude Code CLI, installed via `curl -fsSL https://claude.ai/install.sh | bash` (macOS/Linux) or `irm https://claude.ai/install.ps1 | iex` (Windows).
- Hardware: 16GB+ RAM (Apple Silicon) or 16GB+ VRAM (GPU) for local models. Cloud models have no hardware floor.
Pre-release note
Streaming tool calls, which Claude Code depends on for its agentic workflow, require Ollama 0.14.3-rc1 or later. If tool calls fail silently on stable Ollama, install the pre-release:

```shell
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.14.3-rc1 sh
```
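That version floor can be checked in a script. A minimal sketch, assuming GNU `sort -V` is available; `VERSION` is a hard-coded example value here, and in practice you would extract it from `ollama --version`:

```shell
# Gate on the streaming tool-call floor (0.14.3). VERSION is an assumed
# example; in practice extract it, e.g.:
#   VERSION=$(ollama --version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1)
VERSION="0.14.1"
REQUIRED="0.14.3"

# sort -V orders version strings numerically; if REQUIRED sorts first,
# VERSION is at or above the floor
if [ "$(printf '%s\n' "$REQUIRED" "$VERSION" | sort -V | head -n1)" = "$REQUIRED" ]; then
  echo "ok: $VERSION supports streaming tool calls"
else
  echo "upgrade needed: $VERSION < $REQUIRED"
fi
```

Using `sort -V` avoids the classic string-comparison trap where "0.14.10" would compare as less than "0.14.3".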
Quick Start: ollama launch claude
The fastest path. One command, no environment variables, no config files. Ollama handles everything.
One-command setup
```shell
# Launch Claude Code with Ollama's default coding model
ollama launch claude

# Or specify a model
ollama launch claude --model glm-4.7-flash

# Cloud model (no local hardware needed)
ollama launch claude --model qwen3.5:cloud

# Headless mode for scripts and CI
ollama launch claude --model glm-4.7-flash --yes -- -p "explain this codebase"
```

That’s it. `ollama launch` sets `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_BASE_URL`, and `ANTHROPIC_API_KEY` automatically, then starts Claude Code pointed at your local Ollama instance.
Manual Configuration
If you prefer explicit control, or need to integrate with existing shell profiles, set the environment variables yourself.
Environment variables
```shell
# Set these in your terminal or add to ~/.zshrc / ~/.bashrc
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

# Pull a model first
ollama pull glm-4.7-flash

# Start Claude Code with that model
claude --model glm-4.7-flash
```

Or inline, without touching your shell profile:
Inline (single session)
```shell
ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://localhost:11434 \
ANTHROPIC_API_KEY="" \
claude --model glm-4.7-flash
```

Settings.json Approach
You can also configure this in Claude Code’s settings file for persistence across sessions:
~/.claude/settings.json
```json
{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_BASE_URL": "http://localhost:11434"
  },
  "model": "glm-4.7-flash"
}
```

The role model problem
Claude Code routes different request types to different model tiers: haiku for quick tasks, sonnet for standard work, opus for complex reasoning. If you only set ANTHROPIC_MODEL, the role-model requests still try to reach Anthropic’s servers.
Fix this by also setting ANTHROPIC_DEFAULT_HAIKU_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL to your local model name. Or use claude-launcher (npm install -g claude-launcher), which remaps all role models to your Ollama model automatically.
Which Models Work
Not every Ollama model is a good fit. Claude Code needs large context windows (64K+ tokens minimum) and tool calling support. Models without tool calling can generate text but cannot read files, run commands, or apply edits autonomously.
Local Models
| Model | Parameters | Context | Tool Calling | RAM/VRAM |
|---|---|---|---|---|
| glm-4.7-flash | 30B MoE (3B active) | 128K | Yes (79.5% benchmark) | ~6.5GB (Q4) |
| qwen3-coder | 30B | 128K | Yes | ~20GB (Q4) |
| gpt-oss:20b | 20B | 32K | Yes | ~12GB (Q4) |
| devstral-small-2 | 24B | 128K | Yes | ~16GB (Q4) |
| qwen2.5-coder:32b | 32B | 128K | Limited | ~24GB (Q4) |
Top pick: GLM-4.7-Flash. It uses mixture-of-experts architecture, so only 3B parameters activate per token despite being a 30B model. That means it runs on 16GB RAM or an RTX 3060. The 128K context window handles substantial codebases, and its tool calling scored 79.5% on agent benchmarks.
Cloud Models (via Ollama)
Ollama also proxies cloud models. These run on remote infrastructure but use the same Ollama interface. Free tier with generous rate limits.
| Model | Context | Tool Calling | Cost |
|---|---|---|---|
| qwen3.5:cloud | 128K+ | Yes | Free (rate limited) |
| glm-5:cloud | 128K+ | Yes | Free (rate limited) |
| kimi-k2.5:cloud | 128K+ | Yes | Free (rate limited) |
| minimax-m2.7:cloud | 128K+ | Yes | Free (rate limited) |
| qwen3-coder:480b-cloud | 128K+ | Yes | Free (rate limited) |
| gpt-oss:120b-cloud | 128K+ | Yes | Free (rate limited) |
Cloud models are the pragmatic middle ground
If your goal is free Claude Code usage without buying a GPU, cloud models through Ollama are the move. You get frontier-level quality (Qwen 3.5 and GLM-5 are competitive with proprietary models on coding benchmarks), full context windows, and the same ollama launch claude setup. The tradeoff is that you’re sending code to external servers, which defeats the privacy argument.
Cloud Models: Free Inference Through Ollama
This is the setup most people actually want. You get Claude Code’s full agentic workflow, powered by models that rival Claude Sonnet, for $0.
Using Ollama cloud models
```shell
# Qwen 3.5 — strong all-around coding model
ollama launch claude --model qwen3.5:cloud

# GLM-5 — competitive on code generation benchmarks
ollama launch claude --model glm-5:cloud

# Kimi K2.5 — good at multi-step reasoning
ollama launch claude --model kimi-k2.5:cloud

# 480B parameter Qwen3-Coder — largest available
ollama launch claude --model qwen3-coder:480b-cloud
```

Cloud models automatically get full context length. There is no need to configure num_ctx or worry about context truncation. Rate limits exist but are generous enough for normal development workflows.
Performance: Local vs Cloud
Real numbers from published benchmarks. The speed gap is the defining tradeoff of local inference.
Token Throughput
| Setup | Tokens/sec | Notes |
|---|---|---|
| Claude API (Sonnet 4) | 60-80 tok/s | Anthropic's infrastructure |
| Ollama cloud model | 30-60 tok/s | Varies by model and load |
| RTX 4070 Ti Super (32B Q4) | 15-25 tok/s | $489 GPU, 16GB VRAM |
| M1 Max 64GB (30B MoE) | 10-20 tok/s | Apple Silicon unified memory |
| RTX 3060 (GLM-4.7-Flash) | 8-15 tok/s | Budget GPU, 12GB VRAM |
Real-World Task Comparison
From a published benchmark tracing entity relationships across a .NET monorepo:
The output quality was near-identical for that task: both identified the same architectural patterns and produced equivalent analysis. But the cloud version finished in 73 seconds. The local run took 82 minutes.
Coding Quality Benchmarks
From a controlled comparison running 50 coding tasks across 5 categories on an RTX 4070 Ti Super:
| Task | Qwen2.5-Coder-32B | DeepSeek-Coder-V2 | Claude Sonnet 4 |
|---|---|---|---|
| Function generation | 4.1 | 3.7 | 4.4 |
| Bug detection | 3.8 | 3.4 | 4.6 |
| Refactoring | 4.0 | 3.5 | 4.3 |
| Multi-file context | 2.8 | 2.4 | 4.5 |
| Code explanation | 4.2 | 3.9 | 4.1 |
Local models score within 85-90% of Claude on single-file tasks like function generation and code explanation. The gap widens on multi-file reasoning, where Claude’s training and larger serving infrastructure give it a clear advantage. Bug detection also favors Claude: catching subtle issues requires the kind of deep reasoning that benefits from both model quality and inference speed.
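Those percentages fall straight out of the table. A quick check of the ratios, using the Qwen2.5-Coder-32B and Claude Sonnet 4 scores above:

```shell
# Local-to-Claude score ratios from the benchmark table
awk 'BEGIN {
  printf "function generation: %.0f%%\n", 4.1 / 4.4 * 100   # single-file task
  printf "multi-file context:  %.0f%%\n", 2.8 / 4.5 * 100   # where the gap widens
}'
```

Single-file work lands at roughly 93% of Claude's score; multi-file reasoning drops to about 62%, which is the gap the paragraph above describes.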
Cost Breakeven
| Claude API | Local GPU | Ollama Cloud | |
|---|---|---|---|
| Upfront cost | $0 | $489 (RTX 4070 Ti) | $0 |
| Monthly cost | $60-100 | $8-12 (electricity) | $0 |
| 6-month total | $360-600 | $537-561 | $0 |
| 12-month total | $720-1,200 | $585-633 | $0 |
| Quality | Best | 85-90% | 90-95% |
| Speed | Fastest | 3-5x slower | 1.5-2x slower |
A heavy API user breaks even on GPU hardware in 6-8 months. Ollama cloud models are free but come with rate limits and send your code to external servers. For cost-sensitive developers without strict privacy requirements, cloud models through Ollama are the strongest value.
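The 6-8 month figure follows from the table. A sketch of the arithmetic, using assumed midpoints ($80/month API spend, $10/month electricity) against the $489 GPU:

```shell
# Breakeven: GPU price divided by monthly savings (API cost minus electricity)
awk 'BEGIN {
  gpu = 489; api = 80; elec = 10    # assumed midpoints from the table above
  printf "breakeven: %.1f months\n", gpu / (api - elec)
}'
```

At the low end of API spend ($60/month) the same division gives closer to ten months, so the 6-8 month claim assumes fairly heavy usage.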
Hardware Requirements
| Tier | Hardware | Best Model | Experience |
|---|---|---|---|
| Minimum viable | 16GB RAM (M1/M2) or RTX 3060 12GB | GLM-4.7-Flash (Q4) | Usable for single-file tasks. Slower on complex operations. |
| Recommended | 32GB RAM (M1 Pro/Max) or RTX 4070 Ti 16GB | Qwen3-Coder 30B (Q4) | Solid for most coding workflows. Multi-file works but slower. |
| Ideal | 64GB+ RAM (M2/M3 Max) or RTX 4090 24GB | Qwen2.5-Coder-32B (Q6) | Best local experience. Higher quantization, faster throughput. |
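The RAM/VRAM figures in these tables roughly follow a rule of thumb, not an Ollama formula: model size in GB is parameters (in billions) times bits per weight divided by 8, plus overhead for the KV cache and runtime. A sketch with assumed numbers for a 32B model at Q4_K_M:

```shell
# Back-of-envelope VRAM estimate (rule of thumb, not an exact Ollama figure)
awk 'BEGIN {
  params = 32; bits = 4.5    # Q4_K_M averages roughly 4.5 bits/weight
  gb = params * bits / 8 * 1.2   # +20% assumed for KV cache and overhead
  printf "~%.0f GB\n", gb
}'
```

That lands near the ~24GB the local-models table lists for qwen2.5-coder:32b; longer context windows inflate the KV cache and push the real number higher.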
Apple Silicon advantage
Macs with unified memory handle large models surprisingly well. The M1 Max with 64GB can run 30B models with full 128K context, though at 10-20 tok/s. The M4 and future chips will narrow the speed gap with cloud inference further.
If you want local inference without buying hardware, Ollama’s cloud models skip this entire section. Run ollama launch claude --model qwen3.5:cloud and your only requirement is a machine that can run Ollama itself.
Tool Calling and Limitations
Claude Code is not just a chat interface. It reads files, runs shell commands, applies edits, and searches code. All of that depends on tool calling: the model sends structured function calls, Claude Code executes them, and the results feed back into the conversation.
Without tool calling, Claude Code degrades to a plain text generator. You ask it to read a file and it describes what it would do instead of doing it.
What Works
- GLM-4.7-Flash has native tool calling support. 79.5% on agent benchmarks. Best local option.
- Qwen3-Coder supports tool calling. Slightly less reliable than GLM on complex multi-step chains.
- All Ollama cloud models (qwen3.5:cloud, glm-5:cloud, kimi-k2.5:cloud) have full tool calling.
What Doesn’t
- Most 7B models have weak or no tool calling. They generate text, but Claude Code cannot use them for autonomous operations.
- Older Ollama versions (pre-0.14) do not support the Anthropic API at all. Streaming tool calls specifically require 0.14.3-rc1+.
- Quantized models at Q2/Q3 lose too much instruction-following ability for reliable tool use.
The Edit Accuracy Problem
Even models with tool calling struggle with edit accuracy. Claude Code’s edit system was optimized for Claude’s output format. Alternative models approximate the format but miss details: wrong line numbers, bad whitespace, mismatched context. Published measurements show raw model diffs land at 70-80% accuracy for non-Claude models vs 98% for Claude.
That 20-30% failure rate compounds. Over a 50-edit session, you spend more time fixing broken patches than writing code.
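The compounding is easy to put in concrete terms. Taking 75% as the midpoint of the 70-80% range, the expected number of broken patches over that 50-edit session:

```shell
# Expected broken patches in a 50-edit session, from the accuracy figures above
awk 'BEGIN {
  edits = 50
  printf "local model (75%% accuracy): %.1f broken patches\n", edits * 0.25
  printf "Claude      (98%% accuracy): %.1f broken patches\n", edits * 0.02
}'
```

Roughly twelve broken patches per session versus one: that is the difference between occasional fix-ups and a constant repair loop.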
Privacy and Compliance
This is the strongest argument for local models. When inference happens on your machine, the compliance picture simplifies dramatically:
- No cross-border data transfer. Code never leaves your jurisdiction. Relevant for GDPR, data residency laws, and government contracts.
- No third-party processing. Eliminates the need for Data Processing Agreements with model providers.
- Right to erasure. Your code is never encoded in external model weights. There is nothing to delete because nothing was stored.
- Air-gapped environments. Defense contractors and healthcare organizations often cannot send code to any external API. Local models are the only option.
A practical pattern for mixed environments: use local models for sensitive codebases (PII handling, medical records, proprietary algorithms) and cloud models for open-source or public-facing work. Tools like claude-launcher make switching between -l (local) and -a (API) instant.
Troubleshooting
Claude Code says “connection refused”
Fix: Verify Ollama is running
```shell
# Check if Ollama is serving
curl http://localhost:11434/api/version

# If nothing responds, start it
ollama serve

# Verify the base URL matches
echo $ANTHROPIC_BASE_URL
# Should print: http://localhost:11434
```

Model just talks instead of acting
You ask Claude Code to read a file. It responds with “I would read the file and then...” instead of actually reading it. This means tool calling is not working.
- Verify your Ollama version supports streaming tool calls (0.14.3-rc1+).
- Switch to a model with confirmed tool support: GLM-4.7-Flash or any cloud model.
- Check that `ANTHROPIC_AUTH_TOKEN` is set to `ollama`, not an actual API key.
Role model requests fail
Claude Code tries to use “haiku” for background tasks and fails because there’s no haiku model in Ollama.
Fix: Remap all role models
```shell
# Point all role models to your local model
export ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.7-flash
export ANTHROPIC_DEFAULT_SONNET_MODEL=glm-4.7-flash
export ANTHROPIC_DEFAULT_OPUS_MODEL=glm-4.7-flash

# Or use claude-launcher for automatic remapping
npm install -g claude-launcher
claude-launcher -l
```

Context window too small
If the model truncates responses or loses track of files, it may be running with a smaller context than configured.
Fix: Set context length explicitly
```shell
# Create a Modelfile with explicit context
cat > Modelfile << 'EOF'
FROM glm-4.7-flash
PARAMETER num_ctx 65536
EOF

ollama create glm-4.7-flash-64k -f Modelfile
claude --model glm-4.7-flash-64k
```

Slow generation
If tokens come out painfully slow (under 5/sec), you are likely running a model that exceeds your hardware.
- Drop to a smaller quantization: Q4_K_M instead of Q6_K.
- Reduce context length. 32K is faster than 128K even if the model supports both.
- Switch to GLM-4.7-Flash if using a dense model. MoE architecture uses 3B active parameters from a 30B model.
- Consider cloud models. `qwen3.5:cloud` runs at 30-60 tok/s with zero hardware load.
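Throughput differences translate directly into wait time. A sketch of the wall-clock cost of a 2,000-token response (an assumed size for a typical multi-file edit) at rates quoted earlier in this guide:

```shell
# Seconds to stream a 2,000-token response at the rates quoted above
awk 'BEGIN {
  n = 2000
  printf "cloud model, 45 tok/s: %.0f s\n", n / 45
  printf "RTX 3060,    10 tok/s: %.0f s\n", n / 10
}'
```

Under a minute versus over three minutes per long response: at the slow end, the tips above (smaller quantization, shorter context, MoE models) are worth applying before blaming the model itself.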
Frequently Asked Questions
Can you use Ollama with Claude Code?
Yes. Ollama v0.14 and later support the Anthropic Messages API natively. You can run ollama launch claude to auto-configure everything, or manually set ANTHROPIC_AUTH_TOKEN=ollama and ANTHROPIC_BASE_URL=http://localhost:11434, then start Claude Code with --model pointing to your Ollama model.
Is Ollama + Claude Code free?
Running local models through Ollama is completely free. No API keys, no billing. Ollama also offers cloud models (like glm-5:cloud and qwen3.5:cloud) with generous free tiers. Your only cost for local models is hardware and electricity.
What is the best Ollama model for Claude Code?
For local inference, GLM-4.7-Flash is the top recommendation: 128K context, native tool calling, and it runs on 16GB RAM thanks to mixture-of-experts architecture. For cloud models through Ollama, Qwen 3.5 and GLM-5 offer frontier-level quality at no cost.
How much slower is local inference compared to Claude's API?
Significantly. Consumer hardware generates 15-25 tokens per second. Anthropic's infrastructure pushes 60-80. In real-world testing, a task that took cloud Claude 73 seconds took a local model 82 minutes. Short prompts feel responsive; long multi-file operations expose the gap.
Do tool calls work with Ollama models?
Yes, but not all models support tool calling. GLM-4.7-Flash, Qwen3-Coder, and cloud models like kimi-k2.5:cloud have reliable tool use. Models without tool calling support will still generate text responses but cannot execute file operations, run commands, or apply edits automatically.
What hardware do I need to run Claude Code with Ollama?
At minimum, 16GB RAM (Apple Silicon) or 16GB VRAM (GPU). This runs models like GLM-4.7-Flash comfortably. For 32B parameter models like Qwen2.5-Coder-32B, you need 24GB+ VRAM or 32GB+ unified memory. Cloud models through Ollama have no hardware requirements beyond running Ollama itself.
Can I switch between local and cloud models in the same session?
You can switch models between sessions using the --model flag or /model command. Switching mid-session requires restarting Claude Code with the new model configuration. Tools like claude-launcher make toggling between local and cloud instant.
The Bottleneck Is Not the Model
Whether you run Claude Code against Anthropic’s API, a local Ollama model, or a free cloud model through Ollama, the same two problems dominate the agent loop: edit accuracy and search efficiency.
Local models make the edit problem worse. Claude’s 98% edit accuracy drops to 70-80% with alternative models. That 20-30% failure rate means retry loops, wasted tokens, and manual patch-fixing. Search is the other bottleneck: every irrelevant file loaded into context dilutes the signal the model reasons over.
These problems are solvable at a different layer. Morph’s Fast Apply model intercepts edit operations and merges them with a purpose-built model trained specifically for code application, pushing accuracy back to 98% regardless of which model generated the diff. It streams at 10,500+ tokens per second, so even large file rewrites complete in under a second. WarpGrep handles code search with 8 parallel tool calls per turn, filtering results before they hit your context window.
Both work with Claude Code as an MCP server. The model backend, whether Anthropic, Ollama local, or Ollama cloud, does not matter. The apply and search operations route through Morph independently.
Fix the edit bottleneck, regardless of model backend
Fast Apply pushes edit accuracy to 98% and streams at 10,500+ tok/s. WarpGrep searches your codebase in sub-6 seconds with 8 parallel tool calls. Both plug into Claude Code via MCP.