Ollama Embedding Models: Which One Actually Works for Code Search and RAG?

Every Ollama embedding model compared by MTEB retrieval score, VRAM usage, dimensions, and context length. Benchmark data for nomic-embed-text, mxbai-embed-large, bge-m3, snowflake-arctic-embed, qwen3-embedding, and all-minilm. When to use local embeddings vs cloud APIs.

April 5, 2026 · 5 min read

Ollama lets you run embedding models locally. No API keys, no per-token billing, no data leaving your machine. The hard part is choosing which of the 12+ available models to use. This guide compares every Ollama embedding model with benchmark numbers, VRAM measurements, and practical recommendations for RAG and code search.

70.58
MTEB score: qwen3-embedding 8B (highest)
274MB
Disk size: nomic-embed-text (most popular)
15-50ms
Local latency vs 200-800ms cloud
$0
Per-token cost with Ollama

Every Ollama Embedding Model Compared

The table below covers every embedding model in the Ollama library as of April 2026. MTEB scores are overall averages across tasks unless marked (R), which are retrieval-only (nDCG@10). Disk sizes are for the default quantization.

| Model | Params | Dims | Context | MTEB Overall | Disk Size |
| --- | --- | --- | --- | --- | --- |
| qwen3-embedding:8b | 8B | 32-4096 | 8192 | 70.58 | ~4.9GB (Q4) |
| qwen3-embedding:4b | 4B | 32-4096 | 8192 | ~67 | ~2.5GB (Q4) |
| mxbai-embed-large | 335M | 1024 | 512 | 64.68 | 670MB |
| bge-m3 | 568M | 1024 | 8192 | ~63.0 | 1.2GB |
| nomic-embed-text v1.5 | 137M | 768* | 8192 | 62.39 | 274MB |
| snowflake-arctic-embed-l | 335M | 1024 | 512 | 55.98 (R) | 670MB |
| snowflake-arctic-embed2 (568m) | 568M | 1024 | 8192 | ~58 (R) | ~1.1GB |
| granite-embedding (278m) | 278M | 768 | 512 | ~58 | ~560MB |
| nomic-embed-text-v2-moe | ~305M | 768 | 8192 | ~63 | ~610MB |
| qwen3-embedding:0.6b | 0.6B | 32-4096 | 8192 | ~60 | ~400MB |
| all-minilm (L6-v2) | 23M | 384 | 256 | ~56 | 46MB |
| granite-embedding (30m) | 30M | 384 | 512 | ~50 | ~60MB |

Reading the table

Dims = output vector dimensions. Higher dimensions capture more nuance but use more storage. Context = maximum input tokens. Text beyond this limit is silently truncated. MTEB Overall = average score across retrieval, classification, clustering, and STS tasks. Entries marked (R) show retrieval-only scores. * = nomic-embed-text v1.5 supports Matryoshka reduction to 512, 256, 128, or 64 dimensions.

nomic-embed-text: The Default Pick

nomic-embed-text is the most pulled embedding model on Ollama, and for good reason. At 137M parameters and 274MB on disk, it runs on a laptop CPU without touching the GPU. The 8192-token context window means you can embed entire functions, documentation pages, or long paragraphs without truncation.

Version 1.5 added Matryoshka Representation Learning, which lets you truncate embeddings to any dimension between 64 and 768. At 512 dimensions it still outperforms OpenAI text-embedding-ada-002 while producing vectors a third the size of ada-002's 1536 dimensions. At 256 dimensions it performs comparably to all-MiniLM-L6-v2 while storing only a third as much data per vector as the full 768-dimensional output.
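
As a rough illustration of how Matryoshka truncation works in practice (using the ollama Python client and NumPy; the 256-dimension cutoff is an arbitrary example, not a recommendation): embed at full size, keep the leading dimensions, and re-normalize before indexing.

import numpy as np
import ollama

# Embed at the full 768 dimensions, then keep only the leading 256
# Matryoshka dimensions and re-normalize so cosine similarity still works.
response = ollama.embed(model="nomic-embed-text", input="How does the payment flow work?")
full = np.array(response["embeddings"][0])          # 768-dim vector

truncated = full[:256]                              # keep the leading 256 dims
truncated = truncated / np.linalg.norm(truncated)   # re-normalize before storing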

137M
Parameters
768
Default dimensions (64-768)
8192
Context window (tokens)
62.39
MTEB overall score

Nomic also released nomic-embed-text-v2-moe, a Mixture-of-Experts variant with state-of-the-art multilingual performance for its size class. It supports ~100 languages and was trained on 1.6B text pairs. If you need multilingual support without the 1.2GB cost of bge-m3, the v2-moe is worth benchmarking against your data.

Pull and use nomic-embed-text with Ollama

# Pull the model
ollama pull nomic-embed-text

# Generate an embedding
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama runs embedding models locally"
}'

# Batch multiple inputs
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": [
    "First document chunk",
    "Second document chunk",
    "Third document chunk"
  ]
}'

When to pick nomic-embed-text

Choose nomic-embed-text when you need long-context embeddings (up to 8192 tokens), when running on limited hardware, or when you want Matryoshka dimension flexibility. It is the safest default for most RAG applications. Its main weakness is retrieval accuracy on short queries, where mxbai-embed-large scores 5+ points higher on MTEB Retrieval.

mxbai-embed-large: Best Retrieval Under 500M Parameters

mxbai-embed-large from Mixedbread AI uses a BERT-large backbone with 335M parameters and produces 1024-dimensional embeddings. It scores 64.68 on MTEB overall, with 54.39 on retrieval. That retrieval score beats both nomic-embed-text (49.01) and OpenAI text-embedding-3-large on the same benchmark, despite the model being trained with no overlap with the MTEB evaluation data.

The critical limitation is its 512-token context window. Any input longer than 512 tokens gets truncated. This makes chunking strategy essential. If you are embedding code functions, API documentation, or any content that regularly exceeds 512 tokens, nomic-embed-text or bge-m3 with their 8192-token windows are better choices despite lower retrieval scores.
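
If you stay with mxbai-embed-large, the chunker has to keep every chunk under 512 tokens. Here is a rough sketch; the 380-word budget and 40-word overlap are illustrative assumptions (word count is only a loose proxy for token count), not measured values.

# Split long text into overlapping chunks that should stay under the
# 512-token limit. Word count only approximates token count, so the
# budget below is deliberately conservative.
def chunk_words(text: str, max_words: int = 380, overlap: int = 40) -> list[str]:
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]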

335M
Parameters
1024
Dimensions
512
Context window (tokens)
54.39
MTEB retrieval score

A head-to-head test on real data found that nomic-embed-text outperformed mxbai-embed-large on short, direct questions (63.75% vs 57.5% retrieval accuracy), while mxbai-embed-large performed better on context-heavy, implied questions. If your user queries tend to be specific ("what does handleWebhook do?"), nomic wins. If they tend to be conceptual ("how does the payment flow work?"), mxbai wins.

bge-m3: Multilingual + Multi-Vector Retrieval

bge-m3 from BAAI stands for Multi-Functionality, Multi-Linguality, and Multi-Granularity. It is the only Ollama embedding model that supports dense retrieval, sparse (lexical) retrieval, and ColBERT-style multi-vector retrieval simultaneously. At 568M parameters and 1.2GB on disk, it handles 100+ languages with an 8192-token context window.

Dense Retrieval

Standard embedding vectors. 1024 dimensions, suitable for any vector database. Best for general semantic similarity.

Sparse Retrieval

BM25-like lexical matching within the same model. Captures exact keyword matches that dense embeddings miss.

Multi-Vector (ColBERT)

Token-level embeddings for fine-grained matching. Higher accuracy at the cost of more storage and compute.
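
Ollama's /api/embed endpoint returns only the dense vectors; the sparse and ColBERT outputs require running bge-m3 through its own tooling. One common way to approximate hybrid retrieval with just the dense head is reciprocal rank fusion, sketched below; the lexical ranking is assumed to come from a separate BM25 step, which Ollama does not provide.

# Reciprocal rank fusion (RRF): merge a dense ranking (from /api/embed plus
# a vector store) with a lexical ranking (e.g., BM25 from another library).
# Both inputs are lists of document IDs, best match first.
def rrf(dense_ids: list[str], lexical_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)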

The VRAM situation is nuanced. Loading bge-m3 for inference in F16 takes about 1.06GB. But with the default batch processing settings (batch_size=256, max_length=512), total VRAM usage climbs to ~5.7GB. On an 8GB consumer GPU you can run it, but batch sizes may need adjustment.

If your application serves a global audience, bge-m3 is the clear choice. Chinese, Arabic, and Hindi queries perform nearly as well as English. For English-only applications, nomic-embed-text gives you similar overall quality at a quarter of the disk and memory cost.

snowflake-arctic-embed: Size-Optimized Retrieval

Snowflake Arctic Embed offers four size variants tuned for different hardware constraints. The family spans 22M to 335M parameters, with the large model achieving the highest retrieval-specific MTEB score (55.98 nDCG@10) of any model under 500M parameters.

| Variant | Params | Dims | Architecture | MTEB Retrieval |
| --- | --- | --- | --- | --- |
| arctic-embed-xs | 22M | 384 | MiniLMv2 | ~47 |
| arctic-embed-s | 33M | 384 | MiniLMv2 | ~50 |
| arctic-embed-m | 110M | 768 | BERT-base | ~53 |
| arctic-embed-l | 335M | 1024 | BERT-large | 55.98 |

Arctic Embed 2.0 added multilingual support without sacrificing English performance. The 568M parameter model supports Matryoshka reduction to 256 dimensions and achieves strong scores on both English MTEB Retrieval and multilingual CLEF benchmarks. If you need multilingual retrieval with a smaller footprint than bge-m3, Arctic Embed 2.0 is worth evaluating.

qwen3-embedding: The New State of the Art

qwen3-embedding is the first embedding model family on Ollama that competes with commercial APIs across the board. The 8B model scores 70.58 on the MTEB multilingual leaderboard, ranking first as of June 2025. It supports 100+ languages including programming languages, with dimensions configurable from 32 to 4096.

70.58
MTEB multilingual score (8B)
3 sizes
0.6B, 4B, 8B parameters
32-4096
Configurable dimensions
100+
Supported languages

The 8B model needs 16GB+ VRAM at F16. With Q4_K_M quantization, it fits in ~5GB VRAM, making it runnable on an RTX 4060 Ti or M1 Pro with 16GB unified memory. The 4B variant scores ~67 on MTEB and needs roughly half the resources. The 0.6B variant scores ~60 and runs on almost anything.

A key feature is instruction support. Adding task-specific instructions (e.g., "Retrieve relevant code snippets for the following query") typically improves retrieval by 1-5% over using the model without instructions. Most other Ollama embedding models do not support instructions.
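
A minimal sketch of instruction-prefixed queries, assuming the ollama Python client and the "Instruct: ... / Query: ..." template from Qwen's model card; check the exact format against the version you pull. Documents are embedded without the prefix.

import ollama

# Hypothetical retrieval task; the instruction wording is an assumption.
task = "Retrieve relevant code snippets for the following query"
query = "where is the webhook signature verified?"

# Prefix the query only; documents are embedded as plain text.
query_vec = ollama.embed(
    model="qwen3-embedding:0.6b",
    input=f"Instruct: {task}\nQuery: {query}",
)["embeddings"][0]

doc_vec = ollama.embed(
    model="qwen3-embedding:0.6b",
    input="def verify_webhook_signature(payload, signature): ...",
)["embeddings"][0]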

Size vs quality tradeoff

qwen3-embedding 8B scores 70.58 but needs a dedicated GPU. nomic-embed-text scores 62.39 but runs on a CPU. That 8-point MTEB gap translates to noticeably better retrieval on complex, ambiguous queries. For simple factual lookups, the gap is smaller. If you have the hardware, qwen3-embedding 8B is the best model on Ollama. If you do not, nomic-embed-text v1.5 is the best model you can run everywhere.

all-minilm and granite-embedding: The Lightweight Options

all-minilm

all-minilm (L6-v2) has 23M parameters, takes 46MB on disk, and produces 384-dimensional embeddings. The 256-token context window is its biggest constraint. It was state of the art in 2022. In 2026, it is best suited for prototyping, resource-constrained edge deployments, or applications where embedding speed matters more than retrieval quality.

granite-embedding

IBM Granite Embedding models range from 30M to 278M parameters. The English-only variants produce 384 or 768 dimensions. The multilingual variants (107M and 278M) support 12 languages including German, Japanese, Arabic, and Chinese. Built with retrieval-oriented pretraining and knowledge distillation, they target enterprise use cases where IBM's support contracts and licensing matter.

| Scenario | Best Pick | Why |
| --- | --- | --- |
| Prototyping / testing | all-minilm | 46MB, runs instantly, good enough to validate a pipeline |
| Edge / IoT | granite-embedding 30m | 60MB, 384 dims, minimal compute |
| Enterprise multilingual | granite-embedding 278m | IBM support, 12 languages, 768 dims |
| Production RAG | nomic-embed-text | Better scores, same ease of use, 274MB |

Ollama vs Cloud Embedding APIs

Running embeddings locally versus calling a cloud API is a cost, latency, and quality decision. Here is how Ollama models stack up against the major cloud providers.

| Model | MTEB Overall | Cost / 1M Tokens | Latency | Context |
| --- | --- | --- | --- | --- |
| qwen3-embedding 8B (Ollama) | 70.58 | Hardware only | 15-50ms | 8192 |
| Voyage 4 Large | ~70 | $0.12 | 100-300ms | 32000 |
| mxbai-embed-large (Ollama) | 64.68 | Hardware only | 15-50ms | 512 |
| OpenAI text-embedding-3-large | 64.6 | $0.13 | 200-800ms | 8191 |
| nomic-embed-text (Ollama) | 62.39 | Hardware only | 15-50ms | 8192 |
| OpenAI text-embedding-3-small | 62.3 | $0.02 | 200-800ms | 8191 |
| Cohere embed-v4 | ~66 | $0.10 | 100-400ms | 4096 |
| Voyage 4 Lite | ~63 | $0.02 | 100-300ms | 32000 |

When Local Wins

Latency-sensitive applications. If your RAG pipeline makes multiple retrieval calls per user query, local embeddings save 2-8 seconds of pure network overhead. For interactive code completion, document Q&A, or real-time search, this is the difference between feeling instant and feeling slow.

Data sovereignty. Legal, healthcare, and government applications often cannot send data to third-party APIs. Local embeddings keep everything on-premises.

High volume. Above ~50 million tokens per month, the per-token savings from local embeddings outweigh the hardware cost, especially if you already have a GPU for other workloads.

When Cloud Wins

Low volume. OpenAI text-embedding-3-small at $0.02/M tokens means 10 million tokens costs $0.20/month. Keeping a machine running to save $0.20 makes no sense.

No GPU available. Cloud APIs return embeddings from optimized infrastructure. Running qwen3-embedding 8B on CPU takes seconds per embedding instead of milliseconds.

Long context needs. Voyage 4 supports 32,000-token context. The longest Ollama models top out at 8,192. If you embed entire documents without chunking, cloud APIs have more headroom.

Setup and API Usage

Ollama's embedding API is OpenAI-compatible, which means most RAG frameworks can swap in Ollama with minimal code changes.

Basic setup: install, pull, embed

# Install Ollama (macOS)
brew install ollama

# Start the Ollama server
ollama serve

# Pull an embedding model
ollama pull nomic-embed-text

# Generate a single embedding
curl http://localhost:11434/api/embed \
  -d '{"model": "nomic-embed-text", "input": "your text here"}'

# Response:
# {
#   "model": "nomic-embed-text",
#   "embeddings": [[0.0123, -0.0456, 0.0789, ...]],
#   "total_duration": 14000000
# }

Python: Ollama + ChromaDB for RAG

import ollama
import chromadb

# Initialize ChromaDB
client = chromadb.Client()
collection = client.create_collection("docs")

# Embed and store documents
docs = ["First document chunk", "Second chunk", "Third chunk"]
for i, doc in enumerate(docs):
    response = ollama.embed(model="nomic-embed-text", input=doc)
    collection.add(
        ids=[str(i)],
        embeddings=[response["embeddings"][0]],
        documents=[doc]
    )

# Query
query_response = ollama.embed(
    model="nomic-embed-text",
    input="search query here"
)
results = collection.query(
    query_embeddings=[query_response["embeddings"][0]],
    n_results=3
)

Using Ollama as an OpenAI-compatible embedding provider

# LangChain
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# LlamaIndex
from llama_index.embeddings.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Any OpenAI-compatible client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.embeddings.create(
    model="nomic-embed-text",
    input="your text here"
)

API migration note

Ollama 0.2.0 changed the embedding endpoint from /api/embeddings to /api/embed. The new endpoint supports batched inputs (array of strings), returns L2-normalized vectors, and supports the truncate parameter. If you are upgrading from an older Ollama version, update your API calls accordingly.
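
A before-and-after sketch using the ollama Python client, which mirrors both endpoints; the field names (prompt/embedding vs input/embeddings) follow the note above, but verify them against the client version you have installed.

import ollama

# Legacy endpoint (/api/embeddings): single string via "prompt",
# returns one vector under the singular "embedding" key.
old = ollama.embeddings(model="nomic-embed-text", prompt="some text")
vec = old["embedding"]

# Current endpoint (/api/embed): string or list of strings via "input",
# returns a list of L2-normalized vectors under "embeddings".
new = ollama.embed(model="nomic-embed-text", input=["some text", "more text"])
vecs = new["embeddings"]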

Quantization and VRAM Management

Quantization reduces model precision to save memory. For embedding models, the quality-memory tradeoff is more forgiving than for generative models, because embeddings are less sensitive to small weight perturbations.

| Format | Memory Reduction | Quality Impact | Best For |
| --- | --- | --- | --- |
| F16 | Baseline | None | Maximum quality, ample VRAM |
| Q8_0 | ~50% | Negligible | Models < 1B params (recommended default) |
| Q5_K_M | ~65% | Minimal | Large models (qwen3-embedding 8B) |
| Q4_K_M | ~75% | Moderate | Memory-constrained, acceptable quality loss |

Models that fit entirely in VRAM run 5-30x faster than models that spill to system RAM. If mxbai-embed-large in F16 (670MB) fits in your GPU but qwen3-embedding 8B in F16 (16GB) does not, mxbai will generate embeddings orders of magnitude faster despite being a smaller model. Always check that the quantized model fits in your available VRAM before choosing a larger model.

Check VRAM usage and pull a specific quantization

# Pull the default quantization
ollama pull qwen3-embedding

# Pull a specific quantization for reduced VRAM
ollama pull qwen3-embedding:8b    # default quantization
# Community quantizations available:
# dengcao/Qwen3-Embedding-8B:Q4_K_M  (~4.9GB)
# dengcao/Qwen3-Embedding-8B:Q8_0    (~8.5GB)

# Check which models are loaded and their memory usage
ollama ps

# Set context length to reduce VRAM (if not using full context)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "shorter context saves memory",
  "options": {"num_ctx": 2048}
}'

Frequently Asked Questions

What is the best Ollama embedding model for RAG?

For most RAG applications, nomic-embed-text v1.5 is the best starting point. It scores 62.39 on MTEB, supports 8192-token context, runs on minimal hardware (274MB), and supports Matryoshka dimensionality reduction. If retrieval accuracy is critical and your chunks are under 512 tokens, mxbai-embed-large scores 64.68. For maximum quality with sufficient hardware, qwen3-embedding 8B scores 70.58.

How much VRAM do Ollama embedding models need?

all-minilm: ~50MB. nomic-embed-text: ~300MB. mxbai-embed-large: ~700MB. bge-m3: ~1.2GB (up to 5.7GB with batch processing). qwen3-embedding 8B: 16GB+ at F16, or 4-6GB with Q4 quantization. Models that fit entirely in VRAM generate embeddings 5-30x faster than those that spill to system RAM.

How do Ollama embeddings compare to OpenAI?

nomic-embed-text (62.39 MTEB) roughly matches OpenAI text-embedding-3-small (62.3). mxbai-embed-large (64.68) matches text-embedding-3-large (64.6). qwen3-embedding 8B (70.58) outperforms all current OpenAI embedding models. Ollama models return vectors in 15-50ms on localhost versus 200-800ms for cloud APIs. Cloud APIs cost $0.02-$0.13 per million tokens. Ollama costs nothing per token but requires hardware.

Can I use Ollama embedding models for code search?

General-purpose models like nomic-embed-text work for basic code retrieval. Nomic also released nomic-embed-code (7B parameters) which outperforms Voyage Code 3 on CodeSearchNet. But code embeddings have fundamental limitations: they capture syntax, not behavior. For production code search in AI coding agents, purpose-built tools like WarpGrep use RL-trained search agents instead of embeddings.

What is the Ollama embedding API endpoint?

POST http://localhost:11434/api/embed with a JSON body containing model and input fields. The input field accepts a single string or an array of strings for batch processing. Returns L2-normalized vectors. The API is OpenAI-compatible, so frameworks like LangChain, LlamaIndex, and ChromaDB work with minimal configuration changes.

Does quantization affect embedding quality?

Q8_0 quantization cuts memory by ~50% with negligible quality loss and is recommended for models under 1B parameters. Q5_K_M is recommended for the qwen3-embedding 8B model. Q4_K_M saves ~75% memory with moderate quality reduction. In practice, Q8_0 embeddings are nearly indistinguishable from F16 on retrieval benchmarks.

Which Ollama embedding model supports the longest context?

nomic-embed-text v1.5, bge-m3, snowflake-arctic-embed2, and all qwen3-embedding variants support 8192-token context windows. mxbai-embed-large is limited to 512 tokens. all-minilm supports only 256 tokens. For RAG, ensure your chunk size does not exceed the model's context window, as exceeding it causes silent truncation.
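
To fail loudly instead of truncating silently, the truncate parameter from the migration note can be set to false. A small sketch, assuming the ollama Python client passes the flag through to /api/embed:

import ollama

# A deliberately oversized input for illustration.
very_long_chunk = "word " * 5000

try:
    ollama.embed(model="mxbai-embed-large", input=very_long_chunk, truncate=False)
except Exception as err:
    # With truncate=False, inputs beyond the 512-token window raise an error
    # instead of being cut off silently.
    print("chunk exceeds the context window:", err)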

Code Search Without Managing Embeddings

WarpGrep uses RL-trained search agents instead of vector embeddings. No index to maintain, no stale embeddings, no VRAM budgeting. 70% less context rot, 40% faster task completion.