RAG lets an LLM answer questions about your documents. Most tutorials send your data to OpenAI or Cohere for embedding. If those documents contain proprietary code, legal contracts, medical records, or anything you would not paste into a chatbot, that is a problem. Ollama runs the entire pipeline on your machine. No API keys. No data leaving your network. This guide covers every step from model selection to production tuning.
Why Local RAG
Cloud RAG works. OpenAI embeddings are good. But every document you embed through a cloud API crosses a network boundary you do not control. For many teams, that is a non-starter.
Regulated industries (healthcare, finance, legal) often cannot send documents to third-party APIs without compliance review. Startups working on pre-launch products do not want their codebase indexed by an external service. Air-gapped environments have no external network at all.
Local RAG with Ollama solves all three. Your documents stay on your hardware. Your embeddings stay on your hardware. The LLM generating answers stays on your hardware. The trade-off is that you need to manage the infrastructure yourself, and local models are smaller than frontier cloud models. For document Q&A, that trade-off is often worth it.
Architecture Overview
Every RAG system has the same five stages, whether it runs locally or in the cloud. The only difference is where each stage executes.
1. Load
Read documents from disk. PDFs, Markdown, plain text, HTML, code files. Convert everything to raw text strings.
2. Split
Break documents into chunks small enough for the embedding model's context window. Typical: 500-1,000 characters per chunk with overlap.
3. Embed
Convert each chunk into a vector using Ollama's embedding endpoint. The vector captures the semantic meaning of the text.
4. Store
Write vectors and their source text to a vector database. ChromaDB for local, Qdrant or pgvector for production scale.
5. Retrieve
At query time, embed the user's question, find the most similar chunks by vector distance, and return the top-k results.
6. Generate
Pass the retrieved chunks plus the question to an Ollama chat model. The model answers grounded in the retrieved context.
Stages 1-4 happen once at ingestion time (or when documents change). Stages 5-6 happen on every query. This separation is why RAG scales: you pay the embedding cost once and amortize it across all queries.
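The retrieve stage reduces to nearest-neighbor search over vectors. A minimal sketch with toy 3-dimensional vectors in plain Python (illustrative only; real embeddings have hundreds of dimensions, and the chunk IDs here are made up):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk ids most similar to the query vector."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy index: chunk id -> embedding
index = {
    "refunds.md_0": [0.9, 0.1, 0.0],
    "shipping.md_0": [0.1, 0.9, 0.0],
    "returns.md_0": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))  # chunks closest to the query's direction
```

A vector database does exactly this, but with an approximate index (HNSW) so search stays fast at millions of vectors.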
Choosing an Embedding Model
The embedding model is the most important decision in a RAG pipeline. It determines retrieval quality, and retrieval quality determines answer quality. No amount of prompt engineering on the generation side compensates for retrieving the wrong chunks.
Ollama ships several embedding models. Three cover the practical range:
| Model | Dimensions | Context | Memory | MTEB Retrieval | Best For |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | 8,192 tokens | ~0.5 GB | 53.01 | Best all-around. Long docs, general RAG. |
| mxbai-embed-large | 1,024 | 512 tokens | ~1.2 GB | 64.68 | Maximum retrieval accuracy. Short chunks. |
| all-minilm | 384 | 512 tokens | ~0.1 GB | ~42 | Resource-constrained. Raspberry Pi, CI. |
Start with nomic-embed-text. It handles long documents well (8,192 token context), runs fast, and uses moderate memory. If your chunks are short (under 512 tokens) and you need the highest retrieval precision, switch to mxbai-embed-large. If you are running on a machine with 8 GB RAM total and no GPU, all-minilm works.
Embedding model lock-in
Once you pick an embedding model, every document in your vector store is encoded with it. Switching models means re-embedding your entire corpus. Choose carefully, or build your ingestion pipeline to make re-embedding cheap.
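One cheap guard is to record the embedding model in the collection's metadata at ingestion time and check it before every add or query. A sketch (the `embedding_model` key is a convention of this example, not a ChromaDB feature):

```python
def check_embedding_model(collection_metadata: dict, current_model: str) -> bool:
    """Return True if the store was built with the current embedding model.

    A mismatch means every stored vector must be regenerated: vectors from
    different models live in incompatible spaces and cannot be compared.
    """
    stored = collection_metadata.get("embedding_model")
    if stored is None:
        # Legacy collection with no record -- treat as needing re-embedding
        return False
    return stored == current_model

# Record the model at ingestion time...
metadata = {"embedding_model": "nomic-embed-text", "hnsw:space": "cosine"}

# ...and verify before adding or querying
assert check_embedding_model(metadata, "nomic-embed-text")
assert not check_embedding_model(metadata, "mxbai-embed-large")
```

Failing loudly on a mismatch is far better than silently mixing vector spaces, which degrades retrieval without any visible error.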
Setting Up Ollama
Install Ollama and pull two models: one for embedding, one for generation.
Install Ollama and pull models

```shell
# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the embedding model
ollama pull nomic-embed-text

# Pull a generation model
ollama pull llama3.1

# Verify both models are available
ollama list
```

Ollama runs a local server on port 11434 by default. The API is HTTP, so any language can call it. Two endpoints matter for RAG:
Ollama API endpoints for RAG
```
# Generate embeddings (batch)
POST http://localhost:11434/api/embed
{
  "model": "nomic-embed-text",
  "input": ["First document chunk", "Second document chunk"]
}
# Returns: { "embeddings": [[0.123, -0.456, ...], [...]] }

# Chat completion (generation)
POST http://localhost:11434/api/chat
{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "Answer based on the provided context."},
    {"role": "user", "content": "Context: ...\n\nQuestion: ..."}
  ],
  "stream": false
}
```

Use /api/embed, not /api/embeddings
Ollama has two embedding endpoints. /api/embeddings is the older one: it accepts a single string and returns a single vector. /api/embed accepts an array of strings and returns multiple vectors in one call. Use /api/embed for batch ingestion.
Document Ingestion Pipeline
The ingestion pipeline converts raw documents into embedded chunks stored in your vector database. This is the part that runs once per document (or on update). Quality here determines everything downstream.
Loading Documents
Start simple. Read files from a directory. If you need PDF support, add a parser. The key constraint: your loader must produce clean text. Garbage in, garbage out applies more to RAG than to most systems because the retriever cannot distinguish noise from signal.
Document loading (Python)
```python
from pathlib import Path

def load_documents(directory: str) -> list[dict]:
    """Load text files from a directory."""
    docs = []
    supported = {".txt", ".md", ".py", ".ts", ".js", ".html", ".csv"}
    for path in Path(directory).rglob("*"):
        if path.suffix in supported and path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            docs.append({
                "text": text,
                "metadata": {
                    "source": str(path),
                    "filename": path.name,
                    "extension": path.suffix,
                },
            })
    return docs

# For PDFs, add PyPDF:
# pip install pypdf
from pypdf import PdfReader

def load_pdf(path: str) -> list[dict]:
    reader = PdfReader(path)
    docs = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if text.strip():
            docs.append({
                "text": text,
                "metadata": {"source": path, "page": i + 1},
            })
    return docs
```

Chunking Strategy
Chunking determines what the retriever can find. Too large and chunks contain irrelevant noise that dilutes the answer. Too small and chunks lose context the model needs to answer correctly. The standard approach: 500-1,000 characters per chunk with 10-20% overlap so that sentences split across boundaries still appear in at least one chunk.
Text splitting with overlap
```python
def split_text(
    text: str,
    chunk_size: int = 800,
    chunk_overlap: int = 150,
    separators: list[str] | None = None,
) -> list[str]:
    """
    Recursively split text, preferring natural boundaries.
    Tries separators in order: paragraph > newline > sentence > space.
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]
    chunks = []
    current_sep = separators[0]

    # If text fits in one chunk, return it
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []

    # Split on current separator
    parts = text.split(current_sep)
    current_chunk = ""
    for part in parts:
        candidate = current_chunk + current_sep + part if current_chunk else part
        if len(candidate) <= chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # If a single part exceeds chunk_size, try the next separator
            if len(part) > chunk_size and len(separators) > 1:
                chunks.extend(
                    split_text(part, chunk_size, chunk_overlap, separators[1:])
                )
                current_chunk = ""
            else:
                current_chunk = part
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    # Add overlap: prepend the tail of the previous chunk to each chunk
    if chunk_overlap > 0 and len(chunks) > 1:
        overlapped = [chunks[0]]
        for i in range(1, len(chunks)):
            prev_tail = chunks[i - 1][-chunk_overlap:]
            overlapped.append(prev_tail + " " + chunks[i])
        chunks = overlapped

    return chunks
```

Chunk size depends on your embedding model
nomic-embed-text handles up to 8,192 tokens. You have headroom. mxbai-embed-large and all-minilm cap at 512 tokens (roughly 350-400 words). If you pick a 512-token model, keep chunks under 400 tokens or the embedding truncates silently, losing the tail of each chunk.
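Exact token counts are model-specific, but a rough chars/4 heuristic is enough to catch chunks that would truncate on a 512-token model. A sketch (the 4-characters-per-token ratio is an approximation for English text):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def flag_oversized(chunks: list[str], token_limit: int = 512) -> list[int]:
    """Return indices of chunks likely to be truncated by the embedding model."""
    return [
        i for i, chunk in enumerate(chunks)
        if estimate_tokens(chunk) > token_limit
    ]

chunks = ["short chunk", "x" * 4000]  # second chunk is roughly 1,000 tokens
print(flag_oversized(chunks))  # -> [1]
```

Run this after splitting and before embedding; an empty list means every chunk fits.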
Generating Embeddings
Embed chunks with Ollama
```python
import requests

OLLAMA_URL = "http://localhost:11434"

def embed_texts(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Embed a batch of texts using Ollama's /api/embed endpoint."""
    response = requests.post(
        f"{OLLAMA_URL}/api/embed",
        json={"model": model, "input": texts},
    )
    response.raise_for_status()
    return response.json()["embeddings"]

# Example: embed 100 chunks in one call
chunks = ["chunk 1 text...", "chunk 2 text...", ...]  # your split documents
vectors = embed_texts(chunks)
print(f"Embedded {len(vectors)} chunks, each {len(vectors[0])} dimensions")
```

On an M2 MacBook Pro, nomic-embed-text embeds roughly 50-80 chunks per second. On a machine with a dedicated GPU (RTX 3090, A100), expect 200-500 per second depending on chunk length. For a corpus of 10,000 chunks, that is 20 seconds to 3 minutes. You pay this cost once.
Vector Store with ChromaDB
ChromaDB is the default vector database for local RAG. It runs in-process (no separate server), persists to disk, and handles collections, metadata filtering, and similarity search out of the box.
ChromaDB setup and ingestion
```python
# pip install chromadb
import chromadb

# Create a persistent client (data survives restarts)
client = chromadb.PersistentClient(path="./chroma_db")

# Create or get a collection
collection = client.get_or_create_collection(
    name="my-documents",
    metadata={"hnsw:space": "cosine"},  # cosine similarity
)

def ingest_documents(docs: list[dict], collection):
    """Split, embed, and store documents in ChromaDB."""
    all_chunks = []
    all_metadatas = []
    all_ids = []
    for doc in docs:
        chunks = split_text(doc["text"])
        for i, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            all_metadatas.append({
                **doc["metadata"],
                "chunk_index": i,
            })
            all_ids.append(f"{doc['metadata']['filename']}_{i}")

    # Embed all chunks
    embeddings = embed_texts(all_chunks)

    # Store in ChromaDB (batch for large corpora)
    batch_size = 500
    for start in range(0, len(all_chunks), batch_size):
        end = start + batch_size
        collection.add(
            ids=all_ids[start:end],
            embeddings=embeddings[start:end],
            documents=all_chunks[start:end],
            metadatas=all_metadatas[start:end],
        )
    print(f"Ingested {len(all_chunks)} chunks from {len(docs)} documents")

# Run ingestion
docs = load_documents("./my_docs")
ingest_documents(docs, collection)
```

The hnsw:space parameter matters. Cosine similarity is standard for normalized embeddings. Ollama's /api/embed returns L2-normalized vectors, so cosine and inner product give the same rankings. Stick with cosine for clarity.
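For unit-length vectors this claim is easy to verify: cosine similarity equals the dot product, so both metrics produce identical rankings. A quick sanity check in plain Python (toy vectors, no Ollama required):

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = normalize([0.3, 0.7, 0.2])
docs = [normalize(v) for v in ([1, 0, 0], [0.2, 0.9, 0.1], [0, 0, 1])]

# On normalized vectors, cosine and inner product give identical scores...
for d in docs:
    assert abs(cosine(q, d) - dot(q, d)) < 1e-9

# ...so both metrics produce the same ranking
by_cosine = sorted(range(len(docs)), key=lambda i: cosine(q, docs[i]), reverse=True)
by_dot = sorted(range(len(docs)), key=lambda i: dot(q, docs[i]), reverse=True)
assert by_cosine == by_dot
```

This is also why inner-product indexes can skip the norm computation entirely when embeddings are pre-normalized.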
ChromaDB handles embedding internally too
ChromaDB can call Ollama directly if you configure an embedding function, skipping the manual embed_texts call. But managing embeddings yourself gives you control over batching, error handling, and the ability to swap embedding providers without changing your storage code.
Retrieval and Generation
This is the query path. User asks a question, you embed it, find similar chunks, and pass them to the LLM. The entire round trip stays local.
Complete query pipeline
```python
import requests

def query_rag(
    question: str,
    collection,
    n_results: int = 5,
    model: str = "llama3.1",
) -> str:
    """Full RAG query: embed question, retrieve, generate."""
    # Step 1: Embed the question
    q_embedding = embed_texts([question])[0]

    # Step 2: Retrieve similar chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=n_results,
    )

    # Step 3: Build context from retrieved chunks
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        source = meta.get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{doc}")
    context = "\n\n---\n\n".join(context_parts)

    # Step 4: Generate answer with Ollama
    prompt = f"""Use the following context to answer the question.
If the context does not contain enough information, say so.
Do not make up information.

Context:
{context}

Question: {question}"""

    response = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": "You answer questions based on provided context. Be precise and cite sources."},
                {"role": "user", "content": prompt},
            ],
            "stream": False,
            "options": {"num_ctx": 8192},
        },
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Usage
answer = query_rag("What is the refund policy?", collection)
print(answer)
```

The num_ctx parameter controls how many tokens the generation model considers. Ollama defaults to 2,048, which is too small for RAG. With 5 retrieved chunks of 800 characters each plus the question and system prompt, you need at least 4,096. Set it to 8,192 for safety. Higher values use more VRAM: roughly 1 GB per additional 4,096 tokens on most models.
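Rather than guessing, the context budget can be estimated up front with the same rough chars/4 heuristic. A sketch (the overhead constant covering the system prompt and answer is an assumption of this example):

```python
def required_num_ctx(
    n_results: int,
    chunk_chars: int = 800,
    question_chars: int = 200,
    overhead_tokens: int = 1024,  # system prompt + room for the answer (assumption)
) -> int:
    """Estimate the num_ctx needed for a RAG query, rounded up to a power of two."""
    context_tokens = (n_results * chunk_chars + question_chars) // 4
    needed = context_tokens + overhead_tokens
    num_ctx = 2048
    while num_ctx < needed:
        num_ctx *= 2
    return num_ctx

print(required_num_ctx(5))   # 5 chunks of 800 chars -> 4096
print(required_num_ctx(20))  # retrieving broadly before reranking -> 8192
```

This matches the numbers above: 5 chunks need at least 4,096 tokens, and over-retrieving for a reranker pushes you to 8,192.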
Performance Tuning
Local RAG performance depends on hardware. CPU-only works for prototyping. Production needs a GPU or Apple Silicon with sufficient unified memory.
Hardware Baselines
| Hardware | Embed Speed | Generate Speed | Memory |
|---|---|---|---|
| M1/M2 MacBook (16 GB) | ~50 chunks/s | ~20 tok/s | Fits 7B model + embeddings |
| M3 Pro/Max (36 GB) | ~100 chunks/s | ~40 tok/s | Fits 13B model comfortably |
| RTX 3090 (24 GB VRAM) | ~300 chunks/s | ~80 tok/s | Fits 13B model + large index |
| RTX 4090 (24 GB VRAM) | ~500 chunks/s | ~120 tok/s | Fits 13B with 32k context |
Key Tuning Parameters
Ollama performance tuning
```
# Increase context window (costs ~1 GB VRAM per 4k tokens)
# Set in the API call:
"options": {"num_ctx": 8192}

# Or create a Modelfile with persistent settings:
# Modelfile.rag
FROM llama3.1
PARAMETER num_ctx 8192
PARAMETER num_gpu 999      # offload all layers to GPU
PARAMETER num_thread 8     # CPU threads for non-GPU work

# Build the custom model
ollama create llama3.1-rag -f Modelfile.rag

# For Apple Silicon: Flash Attention is enabled automatically
# For CUDA: Flash Attention enabled by default since Ollama 0.3+

# Monitor VRAM usage during queries
# macOS: Activity Monitor > Memory (Wired + Compressed)
# Linux: nvidia-smi -l 1
```

Reducing Latency
Three things dominate RAG latency: embedding the question (~50ms on GPU), vector search (~5ms for 100k vectors in ChromaDB), and generation (1-3 seconds for a 200-token answer). Generation is the bottleneck. To speed it up:
- Use a smaller generation model. `llama3.1:8b` is several times faster than `llama3.1:70b` for most RAG tasks, and the quality difference on factual Q&A is smaller than you would expect.
- Keep the model loaded. Ollama unloads models after 5 minutes of inactivity by default. Set `OLLAMA_KEEP_ALIVE=-1` to keep models in memory permanently.
- Reduce `n_results`. Retrieving 3 chunks instead of 10 cuts generation time because the model processes less context. Only increase if answer quality suffers.
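Keep-alive can be set for the whole server or per request. A config sketch (both knobs are Ollama options; `-1` means never unload):

```shell
# Server-wide: keep every loaded model resident in memory
export OLLAMA_KEEP_ALIVE=-1
ollama serve

# Per-request alternative: add "keep_alive" to the /api/chat or /api/embed body:
# { "model": "llama3.1", "messages": [...], "keep_alive": -1 }
```

The per-request form is useful when one pipeline should pin its models without changing server defaults for everything else on the machine.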
Hybrid Search and Reranking
Pure vector search misses exact keyword matches. If someone asks about "error code E-4012" and that string exists in your docs, vector similarity might rank it below semantically similar but wrong chunks. Hybrid search fixes this by combining vector similarity with BM25 keyword matching.
Hybrid search: BM25 + vector similarity
```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, collection, documents: list[str]):
        self.collection = collection
        self.documents = documents
        # Build BM25 index from the same documents
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        n_results: int = 5,
        vector_weight: float = 0.7,
    ) -> list[tuple[str, float]]:
        """Return (doc_id, combined_score) pairs, best first."""
        # Vector search
        q_embedding = embed_texts([query])[0]
        vector_results = self.collection.query(
            query_embeddings=[q_embedding],
            n_results=n_results * 2,  # over-retrieve
        )

        # BM25 search
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_top = np.argsort(bm25_scores)[::-1][:n_results * 2]

        # Normalize and combine scores
        combined = {}
        for i, doc_id in enumerate(vector_results["ids"][0]):
            # Vector score: higher rank = higher score
            combined[doc_id] = vector_weight * (1.0 - i / len(vector_results["ids"][0]))
        for rank, idx in enumerate(bm25_top):
            doc_id = f"doc_{idx}"  # match your ID scheme
            kw_score = (1 - vector_weight) * (1.0 - rank / len(bm25_top))
            combined[doc_id] = combined.get(doc_id, 0) + kw_score

        # Sort by combined score and return top-n
        ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
        return ranked[:n_results]
```

Cross-Encoder Reranking
Retrieve broadly (top 20), then rerank precisely (top 5). A cross-encoder scores each query-document pair together, producing higher-quality relevance judgments than the bi-encoder embedding used for initial retrieval. This is the single highest-impact improvement you can add to a RAG pipeline after getting the basics working.
Cross-encoder reranking
```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Load a small cross-encoder (runs on CPU in ~50ms per pair)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Re-score documents with a cross-encoder and return top-k."""
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by score descending
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [doc for doc, score in ranked[:top_k]]

# Usage in the RAG pipeline:
# 1. Retrieve top-20 from ChromaDB
# 2. Rerank to top-5
# 3. Send top-5 to the LLM
initial_results = collection.query(query_embeddings=[q_emb], n_results=20)
reranked = rerank(question, initial_results["documents"][0], top_k=5)
```

Reranking is cheap
A small cross-encoder like ms-marco-MiniLM-L-6-v2 re-scores 20 documents in under 100ms on CPU. It runs locally, no API calls. The precision gain is significant: in production pipelines, reranking consistently improves answer relevance by 15-30% measured by human evaluation.
When Local RAG Beats Cloud RAG
Local and cloud RAG are not interchangeable. Each wins in different situations.
| Dimension | Local (Ollama) | Cloud (OpenAI/Cohere) |
|---|---|---|
| Privacy | Data never leaves your machine | Data crosses network to third-party servers |
| Cost at scale | Fixed hardware cost. 50k queries/day = $0 marginal | Per-token pricing. 50k queries/day = significant spend |
| Latency | No network hop. Embedding <100ms on GPU | Network round trip + queue time. 200-500ms typical |
| Model quality | 7B-8B generation models. Good for factual Q&A | GPT-4o, Claude. Better reasoning and synthesis |
| Embedding quality | nomic-embed-text competitive with ada-002 | text-embedding-3-large best in class |
| Setup complexity | Install Ollama, pull models, manage hardware | API key, one HTTP call |
| Offline capable | Yes. Air-gapped environments work | No. Requires internet |
The rule of thumb: if your documents are sensitive, your query volume is high, or you need offline operation, local RAG wins. If you need frontier-model reasoning quality (multi-hop synthesis, complex analysis) and your data is not sensitive, cloud RAG gives better answers per query.
The hybrid approach works well in practice. Embed locally with Ollama (privacy), store in a local vector database, retrieve locally, then send only the retrieved chunks (not the full corpus) to a cloud LLM for generation. This limits data exposure to the 5-10 chunks per query that are already relevant to the user's question.
Limitations
Local RAG with Ollama is not a drop-in replacement for every use case. Understanding the constraints avoids wasted effort.
Generation Quality Ceiling
Local 7B-8B models handle factual Q&A well but struggle with multi-step reasoning, synthesizing across many sources, and nuanced analysis. Cloud frontier models are measurably better at these tasks.
Hardware Requirements
Minimum 16 GB RAM. A GPU with 8+ GB VRAM for production speed. Running both embedding and generation models simultaneously is memory-intensive. Budget $500-2,000 for capable hardware.
Retrieval Fragility
Retrieval quality depends on chunking, embedding model, and query phrasing. Poorly chunked documents or ambiguous queries produce irrelevant results. No model compensates for bad retrieval.
No Built-in Evaluation
Unlike managed RAG platforms, local pipelines have no automatic evaluation. You need to build your own test suite to measure retrieval precision, answer accuracy, and hallucination rate.
Hallucination Persists
RAG reduces hallucination but does not eliminate it. Research shows poorly evaluated RAG systems hallucinate in up to 40% of responses even when the correct information was retrieved. Prompt design and retrieval quality both matter.
Corpus Scale Limits
ChromaDB handles millions of vectors but query latency grows. For corpora beyond ~10M chunks, you need a dedicated vector database (Qdrant, Milvus, pgvector) with proper indexing and sharding.
Code-Specific RAG: Where General Pipelines Break
Building RAG over a codebase sounds like the same problem as document RAG. It is not. Code has structure that general-purpose embedding models do not capture well: function boundaries, import graphs, type relationships, call hierarchies. Chunking code by character count splits functions mid-body. Embedding a function with nomic-embed-text captures its textual content but not its role in the system.
The common failure mode: you ask "where is the webhook handler?" and get back a test file that mentions "webhook" in a comment, not the actual handler in src/api/webhooks.ts. The embedding model cannot distinguish between code that implements a feature and code that references it.
For code-specific search, purpose-built tools outperform generic RAG. WarpGrep uses an RL-trained search agent that explores in its own isolated context window, iteratively searching, reading, filtering, and backtracking. It returns precise results like src/api/webhooks.ts, lines 47-89, not paragraph-level chunks. No vector database to maintain. No embedding model to choose. No chunking strategy to tune.
When to use generic RAG vs. specialized code search
Use Ollama RAG for: documentation, knowledge bases, legal documents, research papers, support tickets, anything that is primarily natural language text. Use a dedicated code search tool for: navigating codebases, finding implementations, understanding call paths, and answering "where is X defined?" questions.
Frequently Asked Questions
What is Ollama RAG?
Building a Retrieval-Augmented Generation pipeline using Ollama for both embeddings and generation, running entirely on your local machine. Documents are embedded into vectors, stored in a local vector database, and retrieved at query time to ground the LLM's responses in your private data.
Which Ollama embedding model should I use for RAG?
Start with nomic-embed-text. It has the best balance of quality, speed, and memory for most RAG workloads: 768 dimensions, 8,192 token context, ~0.5 GB. Switch to mxbai-embed-large if you need maximum retrieval precision on short chunks. Use all-minilm only on hardware with less than 8 GB RAM.
How much RAM do I need?
Minimum 16 GB. The embedding model needs 0.5-1.2 GB, the generation model needs 4-8 GB (for 7B-8B parameters), and the vector database needs memory proportional to your corpus size. 32 GB is comfortable. A GPU with 8+ GB VRAM makes both embedding and generation significantly faster.
Is local RAG as good as cloud RAG?
For retrieval quality, yes, if you choose a good embedding model and chunking strategy. For generation quality, local 7B-8B models are measurably worse than GPT-4o or Claude at complex reasoning. For factual Q&A over your own documents, the gap is small. Local RAG wins on privacy, cost at scale, and latency.
What vector database should I use?
ChromaDB for getting started: zero-config, runs in-process, persists to disk. Qdrant (Docker) for production scale. pgvector if you already run PostgreSQL. FAISS for maximum speed but no built-in persistence.
How do I improve retrieval quality?
Three high-impact changes: (1) semantic chunking that respects paragraph and section boundaries, (2) hybrid search combining BM25 keywords with vector similarity, (3) cross-encoder reranking to re-score your top-20 results before sending to the LLM. Each step compounds.
Can I use Ollama RAG for code search?
You can, but general-purpose embeddings miss code structure. For codebase search, purpose-built tools like WarpGrep provide more precise retrieval without building and maintaining a custom pipeline.
What is the difference between /api/embed and /api/embeddings?
/api/embed is the current endpoint. It accepts an array of strings and returns multiple embeddings in one call. /api/embeddings is the older endpoint: single string in, single vector out. Use /api/embed.
Code Search Without Building a Pipeline
Ollama RAG works for documents. For code, WarpGrep provides semantic codebase search using an RL-trained agent that reads, filters, and backtracks in its own context window. No vector database. No embedding model. No chunking. Just precise file and line references.