Ollama Embedding Models: Which One Actually Works for Code Search and RAG?

Every Ollama embedding model compared by MTEB retrieval score, VRAM usage, dimensions, and context length. Benchmark data for nomic-embed-text, mxbai-embed-large, bge-m3, snowflake-arctic-embed, qwen3-embedding, and all-minilm. When to use local embeddings vs cloud APIs.

April 5, 2026 · 5 min read

Ollama lets you run embedding models locally. No API keys, no per-token billing, no data leaving your machine. The hard part is choosing which of the 12+ available models to use. This guide compares every Ollama embedding model with benchmark numbers, VRAM measurements, and practical recommendations for RAG and code search.

70.58
MTEB score: qwen3-embedding 8B (highest)
274MB
Disk size: nomic-embed-text (most popular)
15-50ms
Local latency vs 200-800ms cloud
$0
Per-token cost with Ollama

Every Ollama Embedding Model Compared

The table below covers every embedding model in the Ollama library as of April 2026. MTEB scores are overall averages across tasks unless marked (R), which are retrieval-only (nDCG@10). Disk sizes are for the default quantization.

| Model | Params | Dims | Context | MTEB Overall | Disk Size |
| --- | --- | --- | --- | --- | --- |
| qwen3-embedding:8b | 8B | 32-4096 | 8192 | 70.58 | ~4.9GB (Q4) |
| qwen3-embedding:4b | 4B | 32-4096 | 8192 | ~67 | ~2.5GB (Q4) |
| mxbai-embed-large | 335M | 1024 | 512 | 64.68 | 670MB |
| bge-m3 | 568M | 1024 | 8192 | ~63.0 | 1.2GB |
| nomic-embed-text v1.5 | 137M | 768* | 8192 | 62.39 | 274MB |
| snowflake-arctic-embed-l | 335M | 1024 | 512 | 55.98 (R) | 670MB |
| snowflake-arctic-embed2 (568m) | 568M | 1024 | 8192 | ~58 (R) | ~1.1GB |
| granite-embedding (278m) | 278M | 768 | 512 | ~58 | ~560MB |
| nomic-embed-text-v2-moe | ~305M | 768 | 8192 | ~63 | ~610MB |
| qwen3-embedding:0.6b | 0.6B | 32-4096 | 8192 | ~60 | ~400MB |
| all-minilm (L6-v2) | 23M | 384 | 256 | ~56 | 46MB |
| granite-embedding (30m) | 30M | 384 | 512 | ~50 | ~60MB |

Reading the table

Dims = output vector dimensions. Higher dimensions capture more nuance but use more storage. Context = maximum input tokens. Text beyond this limit is silently truncated. MTEB Overall = average score across retrieval, classification, clustering, and STS tasks. Entries marked (R) show retrieval-only scores. * = nomic-embed-text v1.5 supports Matryoshka reduction to 512, 256, 128, or 64 dimensions.

nomic-embed-text: The Default Pick

nomic-embed-text is the most pulled embedding model on Ollama, and for good reason. At 137M parameters and 274MB on disk, it runs on a laptop CPU without touching the GPU. The 8192-token context window means you can embed entire functions, documentation pages, or long paragraphs without truncation.

Version 1.5 added Matryoshka Representation Learning, which lets you truncate embeddings to any dimension between 64 and 768. At 512 dimensions it still outperforms OpenAI text-embedding-ada-002 while producing vectors a third the size of ada-002's 1536 dimensions. At 256 dimensions it performs comparably to all-MiniLM-L6-v2 while storing only a third as much data per vector as the full 768-dimensional output.
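
As a rough illustration of how Matryoshka truncation works in practice (using the ollama Python client and NumPy; the 256-dimension cutoff is an arbitrary example, not a recommendation): embed at full size, keep the leading dimensions, and re-normalize before indexing.

import numpy as np
import ollama

# Embed at the full 768 dimensions, then keep only the leading 256
# Matryoshka dimensions and re-normalize so cosine similarity still works.
response = ollama.embed(model="nomic-embed-text", input="How does the payment flow work?")
full = np.array(response["embeddings"][0])          # 768-dim vector

truncated = full[:256]                              # keep the leading 256 dims
truncated = truncated / np.linalg.norm(truncated)   # re-normalize before storing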

137M
Parameters
768
Default dimensions (64-768)
8192
Context window (tokens)
62.39
MTEB overall score

Nomic also released nomic-embed-text-v2-moe, a Mixture-of-Experts variant with state-of-the-art multilingual performance for its size class. It supports ~100 languages and was trained on 1.6B text pairs. If you need multilingual support without the 1.2GB cost of bge-m3, the v2-moe is worth benchmarking against your data.

Pull and use nomic-embed-text with Ollama

# Pull the model
ollama pull nomic-embed-text

# Generate an embedding
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama runs embedding models locally"
}'

# Batch multiple inputs
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": [
    "First document chunk",
    "Second document chunk",
    "Third document chunk"
  ]
}'

When to pick nomic-embed-text

Choose nomic-embed-text when you need long-context embeddings (up to 8192 tokens), when running on limited hardware, or when you want Matryoshka dimension flexibility. It is the safest default for most RAG applications. Its main weakness is retrieval accuracy on short queries, where mxbai-embed-large scores 5+ points higher on MTEB Retrieval.

mxbai-embed-large: Best Retrieval Under 500M Parameters

mxbai-embed-large from Mixedbread AI uses a BERT-large backbone with 335M parameters and produces 1024-dimensional embeddings. It scores 64.68 on MTEB overall, with 54.39 on retrieval. That retrieval score beats both nomic-embed-text (49.01) and OpenAI text-embedding-3-large on the same benchmark, despite the model being trained with no overlap with the MTEB evaluation data.

The critical limitation is its 512-token context window. Any input longer than 512 tokens gets truncated. This makes chunking strategy essential. If you are embedding code functions, API documentation, or any content that regularly exceeds 512 tokens, nomic-embed-text or bge-m3 with their 8192-token windows are better choices despite lower retrieval scores.
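
If you stay with mxbai-embed-large, the chunker has to keep every chunk under 512 tokens. Here is a rough sketch; the 380-word budget and 40-word overlap are illustrative assumptions (word count is only a loose proxy for token count), not measured values.

# Split long text into overlapping chunks that should stay under the
# 512-token limit. Word count only approximates token count, so the
# budget below is deliberately conservative.
def chunk_words(text: str, max_words: int = 380, overlap: int = 40) -> list[str]:
    words = text.split()
    step = max_words - overlap
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), step)
    ]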

335M
Parameters
1024
Dimensions
512
Context window (tokens)
54.39
MTEB retrieval score

A head-to-head test on real data found that nomic-embed-text outperformed mxbai-embed-large on short, direct questions (63.75% vs 57.5% retrieval accuracy), while mxbai-embed-large performed better on context-heavy, implied questions. If your user queries tend to be specific ("what does handleWebhook do?"), nomic wins. If they tend to be conceptual ("how does the payment flow work?"), mxbai wins.

bge-m3: Multilingual + Multi-Vector Retrieval

bge-m3 from BAAI stands for Multi-Functionality, Multi-Linguality, and Multi-Granularity. It is the only Ollama embedding model that supports dense retrieval, sparse (lexical) retrieval, and ColBERT-style multi-vector retrieval simultaneously. At 568M parameters and 1.2GB on disk, it handles 100+ languages with an 8192-token context window.

Dense Retrieval

Standard embedding vectors. 1024 dimensions, suitable for any vector database. Best for general semantic similarity.

Sparse Retrieval

BM25-like lexical matching within the same model. Captures exact keyword matches that dense embeddings miss.

Multi-Vector (ColBERT)

Token-level embeddings for fine-grained matching. Higher accuracy at the cost of more storage and compute.
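
Ollama's /api/embed endpoint returns only the dense vectors; the sparse and ColBERT outputs require running bge-m3 through its own tooling. One common way to approximate hybrid retrieval with just the dense head is reciprocal rank fusion, sketched below; the lexical ranking is assumed to come from a separate BM25 step, which Ollama does not provide.

# Reciprocal rank fusion (RRF): merge a dense ranking (from /api/embed plus
# a vector store) with a lexical ranking (e.g., BM25 from another library).
# Both inputs are lists of document IDs, best match first.
def rrf(dense_ids: list[str], lexical_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)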

The VRAM situation is nuanced. Loading bge-m3 for inference in F16 takes about 1.06GB. But with the default batch processing settings (batch_size=256, max_length=512), total VRAM usage climbs to ~5.7GB. On an 8GB consumer GPU you can run it, but batch sizes may need adjustment.

If your application serves a global audience, bge-m3 is the clear choice. Chinese, Arabic, and Hindi queries perform nearly as well as English. For English-only applications, nomic-embed-text gives you similar overall quality at a quarter of the disk and memory cost.

snowflake-arctic-embed: Size-Optimized Retrieval

Snowflake Arctic Embed offers four size variants tuned for different hardware constraints. The family spans 22M to 335M parameters, with the large model achieving the highest retrieval-specific MTEB score (55.98 nDCG@10) of any model under 500M parameters.

| Variant | Params | Dims | Architecture | MTEB Retrieval |
| --- | --- | --- | --- | --- |
| arctic-embed-xs | 22M | 384 | MiniLMv2 | ~47 |
| arctic-embed-s | 33M | 384 | MiniLMv2 | ~50 |
| arctic-embed-m | 110M | 768 | BERT-base | ~53 |
| arctic-embed-l | 335M | 1024 | BERT-large | 55.98 |

Arctic Embed 2.0 added multilingual support without sacrificing English performance. The 568M parameter model supports Matryoshka reduction to 256 dimensions and achieves strong scores on both English MTEB Retrieval and multilingual CLEF benchmarks. If you need multilingual retrieval with a smaller footprint than bge-m3, Arctic Embed 2.0 is worth evaluating.

qwen3-embedding: The New State of the Art

qwen3-embedding is the first embedding model family on Ollama that competes with commercial APIs across the board. The 8B model scores 70.58 on the MTEB multilingual leaderboard, ranking first as of June 2025. It supports 100+ languages including programming languages, with dimensions configurable from 32 to 4096.

70.58
MTEB multilingual score (8B)
3 sizes
0.6B, 4B, 8B parameters
32-4096
Configurable dimensions
100+
Supported languages

The 8B model needs 16GB+ VRAM at F16. With Q4_K_M quantization, it fits in ~5GB VRAM, making it runnable on an RTX 4060 Ti or M1 Pro with 16GB unified memory. The 4B variant scores ~67 on MTEB and needs roughly half the resources. The 0.6B variant scores ~60 and runs on almost anything.

A key feature is instruction support. Adding task-specific instructions (e.g., "Retrieve relevant code snippets for the following query") typically improves retrieval by 1-5% over using the model without instructions. Most other Ollama embedding models do not support instructions.
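
A minimal sketch of instruction-prefixed queries, assuming the ollama Python client and the "Instruct: ... / Query: ..." template from Qwen's model card; check the exact format against the version you pull. Documents are embedded without the prefix.

import ollama

# Hypothetical retrieval task; the instruction wording is an assumption.
task = "Retrieve relevant code snippets for the following query"
query = "where is the webhook signature verified?"

# Prefix the query only; documents are embedded as plain text.
query_vec = ollama.embed(
    model="qwen3-embedding:0.6b",
    input=f"Instruct: {task}\nQuery: {query}",
)["embeddings"][0]

doc_vec = ollama.embed(
    model="qwen3-embedding:0.6b",
    input="def verify_webhook_signature(payload, signature): ...",
)["embeddings"][0]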

Size vs quality tradeoff

qwen3-embedding 8B scores 70.58 but needs a dedicated GPU. nomic-embed-text scores 62.39 but runs on a CPU. That 8-point MTEB gap translates to noticeably better retrieval on complex, ambiguous queries. For simple factual lookups, the gap is smaller. If you have the hardware, qwen3-embedding 8B is the best model on Ollama. If you do not, nomic-embed-text v1.5 is the best model you can run everywhere.

all-minilm and granite-embedding: The Lightweight Options

all-minilm

all-minilm (L6-v2) has 23M parameters, takes 46MB on disk, and produces 384-dimensional embeddings. The 256-token context window is its biggest constraint. It was state of the art in 2022. In 2026, it is best suited for prototyping, resource-constrained edge deployments, or applications where embedding speed matters more than retrieval quality.

granite-embedding

IBM Granite Embedding models range from 30M to 278M parameters. The English-only variants produce 384 or 768 dimensions. The multilingual variants (107M and 278M) support 12 languages including German, Japanese, Arabic, and Chinese. Built with retrieval-oriented pretraining and knowledge distillation, they target enterprise use cases where IBM's support contracts and licensing matter.

| Scenario | Best Pick | Why |
| --- | --- | --- |
| Prototyping / testing | all-minilm | 46MB, runs instantly, good enough to validate a pipeline |
| Edge / IoT | granite-embedding 30m | 60MB, 384 dims, minimal compute |
| Enterprise multilingual | granite-embedding 278m | IBM support, 12 languages, 768 dims |
| Production RAG | nomic-embed-text | Better scores, same ease of use, 274MB |

Ollama vs Cloud Embedding APIs

Running embeddings locally versus calling a cloud API is a cost, latency, and quality decision. Here is how Ollama models stack up against the major cloud providers.

| Model | MTEB Overall | Cost / 1M Tokens | Latency | Context |
| --- | --- | --- | --- | --- |
| qwen3-embedding 8B (Ollama) | 70.58 | Hardware only | 15-50ms | 8192 |
| Voyage 4 Large | ~70 | $0.12 | 100-300ms | 32000 |
| mxbai-embed-large (Ollama) | 64.68 | Hardware only | 15-50ms | 512 |
| OpenAI text-embedding-3-large | 64.6 | $0.13 | 200-800ms | 8191 |
| nomic-embed-text (Ollama) | 62.39 | Hardware only | 15-50ms | 8192 |
| OpenAI text-embedding-3-small | 62.3 | $0.02 | 200-800ms | 8191 |
| Cohere embed-v4 | ~66 | $0.10 | 100-400ms | 4096 |
| Voyage 4 Lite | ~63 | $0.02 | 100-300ms | 32000 |

When Local Wins

Latency-sensitive applications. If your RAG pipeline makes multiple retrieval calls per user query, local embeddings save 2-8 seconds of pure network overhead. For interactive code completion, document Q&A, or real-time search, this is the difference between feeling instant and feeling slow.

Data sovereignty. Legal, healthcare, and government applications often cannot send data to third-party APIs. Local embeddings keep everything on-premises.

High volume. Above ~50 million tokens per month, the per-token savings from local embeddings outweigh the hardware cost, especially if you already have a GPU for other workloads.

When Cloud Wins

Low volume. OpenAI text-embedding-3-small at $0.02/M tokens means 10 million tokens costs $0.20/month. Keeping a machine running to save $0.20 makes no sense.

No GPU available. Cloud APIs return embeddings from optimized infrastructure. Running qwen3-embedding 8B on CPU takes seconds per embedding instead of milliseconds.

Long context needs. Voyage 4 supports 32,000-token context. The longest Ollama models top out at 8,192. If you embed entire documents without chunking, cloud APIs have more headroom.

Setup and API Usage

Ollama's embedding API is OpenAI-compatible, which means most RAG frameworks can swap in Ollama with minimal code changes.

Basic setup: install, pull, embed

# Install Ollama (macOS)
brew install ollama

# Start the Ollama server
ollama serve

# Pull an embedding model
ollama pull nomic-embed-text

# Generate a single embedding
curl http://localhost:11434/api/embed \
  -d '{"model": "nomic-embed-text", "input": "your text here"}'

# Response:
# {
#   "model": "nomic-embed-text",
#   "embeddings": [[0.0123, -0.0456, 0.0789, ...]],
#   "total_duration": 14000000
# }

Python: Ollama + ChromaDB for RAG

import ollama
import chromadb

# Initialize ChromaDB
client = chromadb.Client()
collection = client.create_collection("docs")

# Embed and store documents
docs = ["First document chunk", "Second chunk", "Third chunk"]
for i, doc in enumerate(docs):
    response = ollama.embed(model="nomic-embed-text", input=doc)
    collection.add(
        ids=[str(i)],
        embeddings=[response["embeddings"][0]],
        documents=[doc]
    )

# Query
query_response = ollama.embed(
    model="nomic-embed-text",
    input="search query here"
)
results = collection.query(
    query_embeddings=[query_response["embeddings"][0]],
    n_results=3
)

Using Ollama as an OpenAI-compatible embedding provider

# LangChain
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# LlamaIndex
from llama_index.embeddings.ollama import OllamaEmbedding
embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Any OpenAI-compatible client
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.embeddings.create(
    model="nomic-embed-text",
    input="your text here"
)

API migration note

Ollama 0.2.0 changed the embedding endpoint from /api/embeddings to /api/embed. The new endpoint supports batched inputs (array of strings), returns L2-normalized vectors, and supports the truncate parameter. If you are upgrading from an older Ollama version, update your API calls accordingly.
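
A before-and-after sketch using the ollama Python client, which mirrors both endpoints; the field names (prompt/embedding vs input/embeddings) follow the note above, but verify them against the client version you have installed.

import ollama

# Legacy endpoint (/api/embeddings): single string via "prompt",
# returns one vector under the singular "embedding" key.
old = ollama.embeddings(model="nomic-embed-text", prompt="some text")
vec = old["embedding"]

# Current endpoint (/api/embed): string or list of strings via "input",
# returns a list of L2-normalized vectors under "embeddings".
new = ollama.embed(model="nomic-embed-text", input=["some text", "more text"])
vecs = new["embeddings"]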

Quantization and VRAM Management

Quantization reduces model precision to save memory. For embedding models, the quality-memory tradeoff is more forgiving than for generative models, because embeddings are less sensitive to small weight perturbations.

| Format | Memory Reduction | Quality Impact | Best For |
| --- | --- | --- | --- |
| F16 | Baseline | None | Maximum quality, ample VRAM |
| Q8_0 | ~50% | Negligible | Models < 1B params (recommended default) |
| Q5_K_M | ~65% | Minimal | Large models (qwen3-embedding 8B) |
| Q4_K_M | ~75% | Moderate | Memory-constrained, acceptable quality loss |

Models that fit entirely in VRAM run 5-30x faster than models that spill to system RAM. If mxbai-embed-large in F16 (670MB) fits in your GPU but qwen3-embedding 8B in F16 (16GB) does not, mxbai will generate embeddings orders of magnitude faster despite being a smaller model. Always check that the quantized model fits in your available VRAM before choosing a larger model.

Check VRAM usage and pull a specific quantization

# Pull the default quantization
ollama pull qwen3-embedding

# Pull a specific quantization for reduced VRAM
ollama pull qwen3-embedding:8b    # default quantization
# Community quantizations available:
# dengcao/Qwen3-Embedding-8B:Q4_K_M  (~4.9GB)
# dengcao/Qwen3-Embedding-8B:Q8_0    (~8.5GB)

# Check which models are loaded and their memory usage
ollama ps

# Set context length to reduce VRAM (if not using full context)
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "shorter context saves memory",
  "options": {"num_ctx": 2048}
}'

Frequently Asked Questions

What is the best Ollama embedding model for RAG?

For most RAG applications, nomic-embed-text v1.5 is the best starting point. It scores 62.39 on MTEB, supports 8192-token context, runs on minimal hardware (274MB), and supports Matryoshka dimensionality reduction. If retrieval accuracy is critical and your chunks are under 512 tokens, mxbai-embed-large scores 64.68. For maximum quality with sufficient hardware, qwen3-embedding 8B scores 70.58.

How much VRAM do Ollama embedding models need?

all-minilm: ~50MB. nomic-embed-text: ~300MB. mxbai-embed-large: ~700MB. bge-m3: ~1.2GB (up to 5.7GB with batch processing). qwen3-embedding 8B: 16GB+ at F16, or 4-6GB with Q4 quantization. Models that fit entirely in VRAM generate embeddings 5-30x faster than those that spill to system RAM.

How do Ollama embeddings compare to OpenAI?

nomic-embed-text (62.39 MTEB) roughly matches OpenAI text-embedding-3-small (62.3). mxbai-embed-large (64.68) matches text-embedding-3-large (64.6). qwen3-embedding 8B (70.58) outperforms all current OpenAI embedding models. Ollama models return vectors in 15-50ms on localhost versus 200-800ms for cloud APIs. Cloud APIs cost $0.02-$0.13 per million tokens. Ollama costs nothing per token but requires hardware.

Can I use Ollama embedding models for code search?

General-purpose models like nomic-embed-text work for basic code retrieval. Nomic also released nomic-embed-code (7B parameters) which outperforms Voyage Code 3 on CodeSearchNet. But code embeddings have fundamental limitations: they capture syntax, not behavior. For production code search in AI coding agents, purpose-built tools like WarpGrep use RL-trained search agents instead of embeddings.

What is the Ollama embedding API endpoint?

POST http://localhost:11434/api/embed with a JSON body containing model and input fields. The input field accepts a single string or an array of strings for batch processing. Returns L2-normalized vectors. The API is OpenAI-compatible, so frameworks like LangChain, LlamaIndex, and ChromaDB work with minimal configuration changes.

Does quantization affect embedding quality?

Q8_0 quantization cuts memory by ~50% with negligible quality loss and is recommended for models under 1B parameters. Q5_K_M is recommended for the qwen3-embedding 8B model. Q4_K_M saves ~75% memory with moderate quality reduction. In practice, Q8_0 embeddings are nearly indistinguishable from F16 on retrieval benchmarks.

Which Ollama embedding model supports the longest context?

nomic-embed-text v1.5, bge-m3, snowflake-arctic-embed2, and all qwen3-embedding variants support 8192-token context windows. mxbai-embed-large is limited to 512 tokens. all-minilm supports only 256 tokens. For RAG, ensure your chunk size does not exceed the model's context window, as exceeding it causes silent truncation.
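
To fail loudly instead of truncating silently, the truncate parameter from the migration note can be set to false. A small sketch, assuming the ollama Python client passes the flag through to /api/embed:

import ollama

# A deliberately oversized input for illustration.
very_long_chunk = "word " * 5000

try:
    ollama.embed(model="mxbai-embed-large", input=very_long_chunk, truncate=False)
except Exception as err:
    # With truncate=False, inputs beyond the 512-token window raise an error
    # instead of being cut off silently.
    print("chunk exceeds the context window:", err)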

Code Search Without Managing Embeddings

WarpGrep uses RL-trained search agents instead of vector embeddings. No index to maintain, no stale embeddings, no VRAM budgeting. 70% less context rot, 40% faster task completion.