RAG lets an LLM answer questions about your documents. Most tutorials send your data to OpenAI or Cohere for embedding. If those documents contain proprietary code, legal contracts, medical records, or anything you would not paste into a chatbot, that is a problem. Ollama runs the entire pipeline on your machine. No API keys. No data leaving your network. This guide covers every step from model selection to production tuning.
Why Local RAG
Cloud RAG works. OpenAI embeddings are good. But every document you embed through a cloud API crosses a network boundary you do not control. For many teams, that is a non-starter.
Regulated industries (healthcare, finance, legal) often cannot send documents to third-party APIs without compliance review. Startups working on pre-launch products do not want their codebase indexed by an external service. Air-gapped environments have no external network at all.
Local RAG with Ollama solves all three. Your documents stay on your hardware. Your embeddings stay on your hardware. The LLM generating answers stays on your hardware. The trade-off is that you need to manage the infrastructure yourself, and local models are smaller than frontier cloud models. For document Q&A, that trade-off is often worth it.
Architecture Overview
Every RAG system has the same five stages, whether it runs locally or in the cloud. The only difference is where each stage executes.
1. Load
Read documents from disk. PDFs, Markdown, plain text, HTML, code files. Convert everything to raw text strings.
2. Split
Break documents into chunks small enough for the embedding model's context window. Typical: 500-1,000 characters per chunk with overlap.
3. Embed
Convert each chunk into a vector using Ollama's embedding endpoint. The vector captures the semantic meaning of the text.
4. Store
Write vectors and their source text to a vector database. ChromaDB for local, Qdrant or pgvector for production scale.
5. Retrieve
At query time, embed the user's question, find the most similar chunks by vector distance, and return the top-k results.
6. Generate
Pass the retrieved chunks plus the question to an Ollama chat model. The model answers grounded in the retrieved context.
Stages 1-4 happen once at ingestion time (or when documents change). Stages 5-6 happen on every query. This separation is why RAG scales: you pay the embedding cost once and amortize it across all queries.
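The retrieve stage reduces to nearest-neighbor search over vectors. A minimal sketch with toy 3-dimensional vectors in plain Python (illustrative only; real embeddings have hundreds of dimensions, and the chunk IDs here are made up):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k chunk ids most similar to the query vector."""
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy index: chunk id -> embedding
index = {
    "refunds.md_0": [0.9, 0.1, 0.0],
    "shipping.md_0": [0.1, 0.9, 0.0],
    "returns.md_0": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))  # chunks closest to the query's direction
```

A vector database does exactly this, but with an approximate index (HNSW) so search stays fast at millions of vectors.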
Choosing an Embedding Model
The embedding model is the most important decision in a RAG pipeline. It determines retrieval quality, and retrieval quality determines answer quality. No amount of prompt engineering on the generation side compensates for retrieving the wrong chunks.
Ollama ships several embedding models. Three cover the practical range:
| Model | Dimensions | Context | Memory | MTEB Retrieval | Best For |
|---|---|---|---|---|---|
| nomic-embed-text | 768 | 8,192 tokens | ~0.5 GB | 53.01 | Best all-around. Long docs, general RAG. |
| mxbai-embed-large | 1,024 | 512 tokens | ~1.2 GB | 64.68 | Maximum retrieval accuracy. Short chunks. |
| all-minilm | 384 | 512 tokens | ~0.1 GB | ~42 | Resource-constrained. Raspberry Pi, CI. |
Start with nomic-embed-text. It handles long documents well (8,192 token context), runs fast, and uses moderate memory. If your chunks are short (under 512 tokens) and you need the highest retrieval precision, switch to mxbai-embed-large. If you are running on a machine with 8 GB RAM total and no GPU, all-minilm works.
Embedding model lock-in
Once you pick an embedding model, every document in your vector store is encoded with it. Switching models means re-embedding your entire corpus. Choose carefully, or build your ingestion pipeline to make re-embedding cheap.
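One cheap guard is to record the embedding model in the collection's metadata at ingestion time and check it before every add or query. A sketch (the `embedding_model` key is a convention of this example, not a ChromaDB feature):

```python
def check_embedding_model(collection_metadata: dict, current_model: str) -> bool:
    """Return True if the store was built with the current embedding model.

    A mismatch means every stored vector must be regenerated: vectors from
    different models live in incompatible spaces and cannot be compared.
    """
    stored = collection_metadata.get("embedding_model")
    if stored is None:
        # Legacy collection with no record -- treat as needing re-embedding
        return False
    return stored == current_model

# Record the model at ingestion time...
metadata = {"embedding_model": "nomic-embed-text", "hnsw:space": "cosine"}

# ...and verify before adding or querying
assert check_embedding_model(metadata, "nomic-embed-text")
assert not check_embedding_model(metadata, "mxbai-embed-large")
```

Failing loudly on a mismatch is far better than silently mixing vector spaces, which degrades retrieval without any visible error.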
Setting Up Ollama
Install Ollama and pull two models: one for embedding, one for generation.
Install Ollama and pull models

```shell
# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the embedding model
ollama pull nomic-embed-text

# Pull a generation model
ollama pull llama3.1

# Verify both models are available
ollama list
```

Ollama runs a local server on port 11434 by default. The API is HTTP, so any language can call it. Two endpoints matter for RAG:
Ollama API endpoints for RAG
```
# Generate embeddings (batch)
POST http://localhost:11434/api/embed
{
  "model": "nomic-embed-text",
  "input": ["First document chunk", "Second document chunk"]
}
# Returns: { "embeddings": [[0.123, -0.456, ...], [...]] }

# Chat completion (generation)
POST http://localhost:11434/api/chat
{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "Answer based on the provided context."},
    {"role": "user", "content": "Context: ...\n\nQuestion: ..."}
  ],
  "stream": false
}
```

Use /api/embed, not /api/embeddings
Ollama has two embedding endpoints. /api/embeddings is the older one: it accepts a single string and returns a single vector. /api/embed accepts an array of strings and returns multiple vectors in one call. Use /api/embed for batch ingestion.
Document Ingestion Pipeline
The ingestion pipeline converts raw documents into embedded chunks stored in your vector database. This is the part that runs once per document (or on update). Quality here determines everything downstream.
Loading Documents
Start simple. Read files from a directory. If you need PDF support, add a parser. The key constraint: your loader must produce clean text. Garbage in, garbage out applies more to RAG than to most systems because the retriever cannot distinguish noise from signal.
Document loading (Python)
```python
from pathlib import Path

def load_documents(directory: str) -> list[dict]:
    """Load text files from a directory."""
    docs = []
    supported = {".txt", ".md", ".py", ".ts", ".js", ".html", ".csv"}
    for path in Path(directory).rglob("*"):
        if path.suffix in supported and path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            docs.append({
                "text": text,
                "metadata": {
                    "source": str(path),
                    "filename": path.name,
                    "extension": path.suffix,
                },
            })
    return docs

# For PDFs, add PyPDF:
# pip install pypdf
from pypdf import PdfReader

def load_pdf(path: str) -> list[dict]:
    reader = PdfReader(path)
    docs = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if text.strip():
            docs.append({
                "text": text,
                "metadata": {"source": path, "page": i + 1},
            })
    return docs
```

Chunking Strategy
Chunking determines what the retriever can find. Too large and chunks contain irrelevant noise that dilutes the answer. Too small and chunks lose context the model needs to answer correctly. The standard approach: 500-1,000 characters per chunk with 10-20% overlap so that sentences split across boundaries still appear in at least one chunk.
Text splitting with overlap
```python
def split_text(
    text: str,
    chunk_size: int = 800,
    chunk_overlap: int = 150,
    separators: list[str] | None = None,
) -> list[str]:
    """
    Recursively split text, preferring natural boundaries.
    Tries separators in order: paragraph > newline > sentence > space.
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]
    chunks = []
    current_sep = separators[0]

    # If text fits in one chunk, return it
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []

    # Split on current separator
    parts = text.split(current_sep)
    current_chunk = ""
    for part in parts:
        candidate = current_chunk + current_sep + part if current_chunk else part
        if len(candidate) <= chunk_size:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            # If a single part exceeds chunk_size, try the next separator
            if len(part) > chunk_size and len(separators) > 1:
                chunks.extend(
                    split_text(part, chunk_size, chunk_overlap, separators[1:])
                )
                current_chunk = ""
            else:
                current_chunk = part
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    # Add overlap: prepend the tail of the previous chunk to each chunk
    if chunk_overlap > 0 and len(chunks) > 1:
        overlapped = [chunks[0]]
        for i in range(1, len(chunks)):
            prev_tail = chunks[i - 1][-chunk_overlap:]
            overlapped.append(prev_tail + " " + chunks[i])
        chunks = overlapped

    return chunks
```

Chunk size depends on your embedding model
nomic-embed-text handles up to 8,192 tokens. You have headroom. mxbai-embed-large and all-minilm cap at 512 tokens (roughly 350-400 words). If you pick a 512-token model, keep chunks under 400 tokens or the embedding truncates silently, losing the tail of each chunk.
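Exact token counts are model-specific, but a rough chars/4 heuristic is enough to catch chunks that would truncate on a 512-token model. A sketch (the 4-characters-per-token ratio is an approximation for English text):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def flag_oversized(chunks: list[str], token_limit: int = 512) -> list[int]:
    """Return indices of chunks likely to be truncated by the embedding model."""
    return [
        i for i, chunk in enumerate(chunks)
        if estimate_tokens(chunk) > token_limit
    ]

chunks = ["short chunk", "x" * 4000]  # second chunk is roughly 1,000 tokens
print(flag_oversized(chunks))  # -> [1]
```

Run this after splitting and before embedding; an empty list means every chunk fits.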
Generating Embeddings
Embed chunks with Ollama
```python
import requests

OLLAMA_URL = "http://localhost:11434"

def embed_texts(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Embed a batch of texts using Ollama's /api/embed endpoint."""
    response = requests.post(
        f"{OLLAMA_URL}/api/embed",
        json={"model": model, "input": texts},
    )
    response.raise_for_status()
    return response.json()["embeddings"]

# Example: embed 100 chunks in one call
chunks = ["chunk 1 text...", "chunk 2 text...", ...]  # your split documents
vectors = embed_texts(chunks)
print(f"Embedded {len(vectors)} chunks, each {len(vectors[0])} dimensions")
```

On an M2 MacBook Pro, nomic-embed-text embeds roughly 50-80 chunks per second. On a machine with a dedicated GPU (RTX 3090, A100), expect 200-500 per second depending on chunk length. For a corpus of 10,000 chunks, that is 20 seconds to 3 minutes. You pay this cost once.
Vector Store with ChromaDB
ChromaDB is the default vector database for local RAG. It runs in-process (no separate server), persists to disk, and handles collections, metadata filtering, and similarity search out of the box.
ChromaDB setup and ingestion
```python
# pip install chromadb
import chromadb

# Create a persistent client (data survives restarts)
client = chromadb.PersistentClient(path="./chroma_db")

# Create or get a collection
collection = client.get_or_create_collection(
    name="my-documents",
    metadata={"hnsw:space": "cosine"},  # cosine similarity
)

def ingest_documents(docs: list[dict], collection):
    """Split, embed, and store documents in ChromaDB."""
    all_chunks = []
    all_metadatas = []
    all_ids = []
    for doc in docs:
        chunks = split_text(doc["text"])
        for i, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            all_metadatas.append({
                **doc["metadata"],
                "chunk_index": i,
            })
            all_ids.append(f"{doc['metadata']['filename']}_{i}")

    # Embed all chunks
    embeddings = embed_texts(all_chunks)

    # Store in ChromaDB (batch for large corpora)
    batch_size = 500
    for start in range(0, len(all_chunks), batch_size):
        end = start + batch_size
        collection.add(
            ids=all_ids[start:end],
            embeddings=embeddings[start:end],
            documents=all_chunks[start:end],
            metadatas=all_metadatas[start:end],
        )
    print(f"Ingested {len(all_chunks)} chunks from {len(docs)} documents")

# Run ingestion
docs = load_documents("./my_docs")
ingest_documents(docs, collection)
```

The hnsw:space parameter matters. Cosine similarity is standard for normalized embeddings. Ollama's /api/embed returns L2-normalized vectors, so cosine and inner product give the same rankings. Stick with cosine for clarity.
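For unit-length vectors this claim is easy to verify: cosine similarity equals the dot product, so both metrics produce identical rankings. A quick sanity check in plain Python (toy vectors, no Ollama required):

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = normalize([0.3, 0.7, 0.2])
docs = [normalize(v) for v in ([1, 0, 0], [0.2, 0.9, 0.1], [0, 0, 1])]

# On normalized vectors, cosine and inner product give identical scores...
for d in docs:
    assert abs(cosine(q, d) - dot(q, d)) < 1e-9

# ...so both metrics produce the same ranking
by_cosine = sorted(range(len(docs)), key=lambda i: cosine(q, docs[i]), reverse=True)
by_dot = sorted(range(len(docs)), key=lambda i: dot(q, docs[i]), reverse=True)
assert by_cosine == by_dot
```

This is also why inner-product indexes can skip the norm computation entirely when embeddings are pre-normalized.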
ChromaDB handles embedding internally too
ChromaDB can call Ollama directly if you configure an embedding function, skipping the manual embed_texts call. But managing embeddings yourself gives you control over batching, error handling, and the ability to swap embedding providers without changing your storage code.
Retrieval and Generation
This is the query path. User asks a question, you embed it, find similar chunks, and pass them to the LLM. The entire round trip stays local.
Complete query pipeline
```python
import requests

def query_rag(
    question: str,
    collection,
    n_results: int = 5,
    model: str = "llama3.1",
) -> str:
    """Full RAG query: embed question, retrieve, generate."""
    # Step 1: Embed the question
    q_embedding = embed_texts([question])[0]

    # Step 2: Retrieve similar chunks
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=n_results,
    )

    # Step 3: Build context from retrieved chunks
    context_parts = []
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        source = meta.get("source", "unknown")
        context_parts.append(f"[Source: {source}]\n{doc}")
    context = "\n\n---\n\n".join(context_parts)

    # Step 4: Generate answer with Ollama
    prompt = f"""Use the following context to answer the question.
If the context does not contain enough information, say so.
Do not make up information.

Context:
{context}

Question: {question}"""

    response = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": "You answer questions based on provided context. Be precise and cite sources."},
                {"role": "user", "content": prompt},
            ],
            "stream": False,
            "options": {"num_ctx": 8192},
        },
    )
    response.raise_for_status()
    return response.json()["message"]["content"]

# Usage
answer = query_rag("What is the refund policy?", collection)
print(answer)
```

The num_ctx parameter controls how many tokens the generation model considers. Ollama defaults to 2,048, which is too small for RAG. With 5 retrieved chunks of 800 characters each plus the question and system prompt, you need at least 4,096. Set it to 8,192 for safety. Higher values use more VRAM: roughly 1 GB per additional 4,096 tokens on most models.
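Rather than guessing, the context budget can be estimated up front with the same rough chars/4 heuristic. A sketch (the overhead constant covering the system prompt and answer is an assumption of this example):

```python
def required_num_ctx(
    n_results: int,
    chunk_chars: int = 800,
    question_chars: int = 200,
    overhead_tokens: int = 1024,  # system prompt + room for the answer (assumption)
) -> int:
    """Estimate the num_ctx needed for a RAG query, rounded up to a power of two."""
    context_tokens = (n_results * chunk_chars + question_chars) // 4
    needed = context_tokens + overhead_tokens
    num_ctx = 2048
    while num_ctx < needed:
        num_ctx *= 2
    return num_ctx

print(required_num_ctx(5))   # 5 chunks of 800 chars -> 4096
print(required_num_ctx(20))  # retrieving broadly before reranking -> 8192
```

This matches the numbers above: 5 chunks need at least 4,096 tokens, and over-retrieving for a reranker pushes you to 8,192.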
Performance Tuning
Local RAG performance depends on hardware. CPU-only works for prototyping. Production needs a GPU or Apple Silicon with sufficient unified memory.
Hardware Baselines
| Hardware | Embed Speed | Generate Speed | Memory |
|---|---|---|---|
| M1/M2 MacBook (16 GB) | ~50 chunks/s | ~20 tok/s | Fits 7B model + embeddings |
| M3 Pro/Max (36 GB) | ~100 chunks/s | ~40 tok/s | Fits 13B model comfortably |
| RTX 3090 (24 GB VRAM) | ~300 chunks/s | ~80 tok/s | Fits 13B model + large index |
| RTX 4090 (24 GB VRAM) | ~500 chunks/s | ~120 tok/s | Fits 13B with 32k context |
Key Tuning Parameters
Ollama performance tuning
```
# Increase context window (costs ~1 GB VRAM per 4k tokens)
# Set in the API call:
"options": {"num_ctx": 8192}

# Or create a Modelfile with persistent settings:
# Modelfile.rag
FROM llama3.1
PARAMETER num_ctx 8192
PARAMETER num_gpu 999      # offload all layers to GPU
PARAMETER num_thread 8     # CPU threads for non-GPU work

# Build the custom model
ollama create llama3.1-rag -f Modelfile.rag

# For Apple Silicon: Flash Attention is enabled automatically
# For CUDA: Flash Attention enabled by default since Ollama 0.3+

# Monitor VRAM usage during queries
# macOS: Activity Monitor > Memory (Wired + Compressed)
# Linux: nvidia-smi -l 1
```

Reducing Latency
Three things dominate RAG latency: embedding the question (~50ms on GPU), vector search (~5ms for 100k vectors in ChromaDB), and generation (1-3 seconds for a 200-token answer). Generation is the bottleneck. To speed it up:
- Use a smaller generation model. `llama3.1:8b` is several times faster than `llama3.1:70b` for most RAG tasks, and the quality difference on factual Q&A is smaller than you would expect.
- Keep the model loaded. Ollama unloads models after 5 minutes of inactivity by default. Set `OLLAMA_KEEP_ALIVE=-1` to keep models in memory permanently.
- Reduce `n_results`. Retrieving 3 chunks instead of 10 cuts generation time because the model processes less context. Only increase if answer quality suffers.
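Keep-alive can be set for the whole server or per request. A config sketch (both knobs are Ollama options; `-1` means never unload):

```shell
# Server-wide: keep every loaded model resident in memory
export OLLAMA_KEEP_ALIVE=-1
ollama serve

# Per-request alternative: add "keep_alive" to the /api/chat or /api/embed body:
# { "model": "llama3.1", "messages": [...], "keep_alive": -1 }
```

The per-request form is useful when one pipeline should pin its models without changing server defaults for everything else on the machine.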
Hybrid Search and Reranking
Pure vector search misses exact keyword matches. If someone asks about "error code E-4012" and that string exists in your docs, vector similarity might rank it below semantically similar but wrong chunks. Hybrid search fixes this by combining vector similarity with BM25 keyword matching.
Hybrid search: BM25 + vector similarity
```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, collection, documents: list[str]):
        self.collection = collection
        self.documents = documents
        # Build BM25 index from the same documents
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def search(
        self,
        query: str,
        n_results: int = 5,
        vector_weight: float = 0.7,
    ) -> list[tuple[str, float]]:
        """Return (doc_id, combined_score) pairs, best first."""
        # Vector search
        q_embedding = embed_texts([query])[0]
        vector_results = self.collection.query(
            query_embeddings=[q_embedding],
            n_results=n_results * 2,  # over-retrieve
        )

        # BM25 search
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_top = np.argsort(bm25_scores)[::-1][:n_results * 2]

        # Normalize and combine scores
        combined = {}
        for i, doc_id in enumerate(vector_results["ids"][0]):
            # Vector score: higher rank = higher score
            combined[doc_id] = vector_weight * (1.0 - i / len(vector_results["ids"][0]))
        for rank, idx in enumerate(bm25_top):
            doc_id = f"doc_{idx}"  # match your ID scheme
            kw_score = (1 - vector_weight) * (1.0 - rank / len(bm25_top))
            combined[doc_id] = combined.get(doc_id, 0) + kw_score

        # Sort by combined score and return top-n
        ranked = sorted(combined.items(), key=lambda x: x[1], reverse=True)
        return ranked[:n_results]
```

Cross-Encoder Reranking
Retrieve broadly (top 20), then rerank precisely (top 5). A cross-encoder scores each query-document pair together, producing higher-quality relevance judgments than the bi-encoder embedding used for initial retrieval. This is the single highest-impact improvement you can add to a RAG pipeline after getting the basics working.
Cross-encoder reranking
```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Load a small cross-encoder (runs on CPU in ~50ms per pair)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Re-score documents with a cross-encoder and return top-k."""
    pairs = [[query, doc] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by score descending
    ranked = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [doc for doc, score in ranked[:top_k]]

# Usage in the RAG pipeline:
# 1. Retrieve top-20 from ChromaDB
# 2. Rerank to top-5
# 3. Send top-5 to the LLM
initial_results = collection.query(query_embeddings=[q_emb], n_results=20)
reranked = rerank(question, initial_results["documents"][0], top_k=5)
```

Reranking is cheap
A small cross-encoder like ms-marco-MiniLM-L-6-v2 re-scores 20 documents in under 100ms on CPU. It runs locally, no API calls. The precision gain is significant: in production pipelines, reranking consistently improves answer relevance by 15-30% measured by human evaluation.
When Local RAG Beats Cloud RAG
Local and cloud RAG are not interchangeable. Each wins in different situations.
| Dimension | Local (Ollama) | Cloud (OpenAI/Cohere) |
|---|---|---|
| Privacy | Data never leaves your machine | Data crosses network to third-party servers |
| Cost at scale | Fixed hardware cost. 50k queries/day = $0 marginal | Per-token pricing. 50k queries/day = significant spend |
| Latency | No network hop. Embedding <100ms on GPU | Network round trip + queue time. 200-500ms typical |
| Model quality | 7B-8B generation models. Good for factual Q&A | GPT-4o, Claude. Better reasoning and synthesis |
| Embedding quality | nomic-embed-text competitive with ada-002 | text-embedding-3-large best in class |
| Setup complexity | Install Ollama, pull models, manage hardware | API key, one HTTP call |
| Offline capable | Yes. Air-gapped environments work | No. Requires internet |
The rule of thumb: if your documents are sensitive, your query volume is high, or you need offline operation, local RAG wins. If you need frontier-model reasoning quality (multi-hop synthesis, complex analysis) and your data is not sensitive, cloud RAG gives better answers per query.
The hybrid approach works well in practice. Embed locally with Ollama (privacy), store in a local vector database, retrieve locally, then send only the retrieved chunks (not the full corpus) to a cloud LLM for generation. This limits data exposure to the 5-10 chunks per query that are already relevant to the user's question.
Limitations
Local RAG with Ollama is not a drop-in replacement for every use case. Understanding the constraints avoids wasted effort.
Generation Quality Ceiling
Local 7B-8B models handle factual Q&A well but struggle with multi-step reasoning, synthesizing across many sources, and nuanced analysis. Cloud frontier models are measurably better at these tasks.
Hardware Requirements
Minimum 16 GB RAM. A GPU with 8+ GB VRAM for production speed. Running both embedding and generation models simultaneously is memory-intensive. Budget $500-2,000 for capable hardware.
Retrieval Fragility
Retrieval quality depends on chunking, embedding model, and query phrasing. Poorly chunked documents or ambiguous queries produce irrelevant results. No model compensates for bad retrieval.
No Built-in Evaluation
Unlike managed RAG platforms, local pipelines have no automatic evaluation. You need to build your own test suite to measure retrieval precision, answer accuracy, and hallucination rate.
Hallucination Persists
RAG reduces hallucination but does not eliminate it. Research shows poorly evaluated RAG systems hallucinate in up to 40% of responses even when the correct information was retrieved. Prompt design and retrieval quality both matter.
Corpus Scale Limits
ChromaDB handles millions of vectors but query latency grows. For corpora beyond ~10M chunks, you need a dedicated vector database (Qdrant, Milvus, pgvector) with proper indexing and sharding.
Code-Specific RAG: Where General Pipelines Break
Building RAG over a codebase sounds like the same problem as document RAG. It is not. Code has structure that general-purpose embedding models do not capture well: function boundaries, import graphs, type relationships, call hierarchies. Chunking code by character count splits functions mid-body. Embedding a function with nomic-embed-text captures its textual content but not its role in the system.
The common failure mode: you ask "where is the webhook handler?" and get back a test file that mentions "webhook" in a comment, not the actual handler in src/api/webhooks.ts. The embedding model cannot distinguish between code that implements a feature and code that references it.
For code-specific search, purpose-built tools outperform generic RAG. WarpGrep uses an RL-trained search agent that explores in its own isolated context window, iteratively searching, reading, filtering, and backtracking. It returns precise results like src/api/webhooks.ts, lines 47-89, not paragraph-level chunks. No vector database to maintain. No embedding model to choose. No chunking strategy to tune.
When to use generic RAG vs. specialized code search
Use Ollama RAG for: documentation, knowledge bases, legal documents, research papers, support tickets, anything that is primarily natural language text. Use a dedicated code search tool for: navigating codebases, finding implementations, understanding call paths, and answering "where is X defined?" questions.
Frequently Asked Questions
What is Ollama RAG?
Building a Retrieval-Augmented Generation pipeline using Ollama for both embeddings and generation, running entirely on your local machine. Documents are embedded into vectors, stored in a local vector database, and retrieved at query time to ground the LLM's responses in your private data.
Which Ollama embedding model should I use for RAG?
Start with nomic-embed-text. It has the best balance of quality, speed, and memory for most RAG workloads: 768 dimensions, 8,192 token context, ~0.5 GB. Switch to mxbai-embed-large if you need maximum retrieval precision on short chunks. Use all-minilm only on hardware with less than 8 GB RAM.
How much RAM do I need?
Minimum 16 GB. The embedding model needs 0.5-1.2 GB, the generation model needs 4-8 GB (for 7B-8B parameters), and the vector database needs memory proportional to your corpus size. 32 GB is comfortable. A GPU with 8+ GB VRAM makes both embedding and generation significantly faster.
Is local RAG as good as cloud RAG?
For retrieval quality, yes, if you choose a good embedding model and chunking strategy. For generation quality, local 7B-8B models are measurably worse than GPT-4o or Claude at complex reasoning. For factual Q&A over your own documents, the gap is small. Local RAG wins on privacy, cost at scale, and latency.
What vector database should I use?
ChromaDB for getting started: zero-config, runs in-process, persists to disk. Qdrant (Docker) for production scale. pgvector if you already run PostgreSQL. FAISS for maximum speed but no built-in persistence.
How do I improve retrieval quality?
Three high-impact changes: (1) semantic chunking that respects paragraph and section boundaries, (2) hybrid search combining BM25 keywords with vector similarity, (3) cross-encoder reranking to re-score your top-20 results before sending to the LLM. Each step compounds.
Can I use Ollama RAG for code search?
You can, but general-purpose embeddings miss code structure. For codebase search, purpose-built tools like WarpGrep provide more precise retrieval without building and maintaining a custom pipeline.
What is the difference between /api/embed and /api/embeddings?
/api/embed is the current endpoint. It accepts an array of strings and returns multiple embeddings in one call. /api/embeddings is the older endpoint: single string in, single vector out. Use /api/embed.
Code Search Without Building a Pipeline
Ollama RAG works for documents. For code, WarpGrep provides semantic codebase search using an RL-trained agent that reads, filters, and backtracks in its own context window. No vector database. No embedding model. No chunking. Just precise file and line references.