Qwen3-Reranker and Qwen3-Embedding: Sizes, MTEB Scores, and How to Use Them

Qwen3-Reranker and Qwen3-Embedding are the Qwen team's (Alibaba) open-weight retrieval models, released June 5, 2025 under Apache 2.0 in 0.6B, 4B, and 8B sizes. The embedding models output 1024-, 2560-, and 4096-dimensional vectors respectively; Qwen3-Embedding-8B ranked No. 1 on the MTEB multilingual leaderboard at 70.58 as of June 5, 2025. In a retrieve-then-rerank pipeline the embedding model finds candidates and the reranker reorders them by relevance. Morph ships the same stack as morph-embedding-v4 and morph-rerank-v4.

Reranker and embedding sizes: 0.6B / 4B / 8B

Qwen3-Embedding-8B MTEB multilingual: 70.58

Max embedding dimension (8B): 4096

License (commercial use): Apache 2.0

The Problem: Vector Search Returns the Wrong Order

An embedding model encodes a query and every document into the same vector space, then ranks documents by cosine similarity to the query. This scales to millions of documents because the document vectors are precomputed and the query vector is compared against an index. Per query, the cost is one encoder pass for the query plus an index lookup, which with an approximate nearest-neighbor index touches only a small fraction of the stored vectors.

The weakness is that the query and the document are encoded separately. The model never sees them together, so it compresses each into a fixed vector and hopes the geometry lines up. For a query like "how do I cancel a subscription mid-cycle," a document about "subscription billing cycles" can sit closer in vector space than the document that actually explains cancellation. The right answer is usually in the top 50 candidates, just not at position 1.
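The ranking step, and the near-miss it can produce, can be sketched in a few lines. This is a minimal illustration, assuming hand-made 3-dimensional vectors in place of real embeddings and a brute-force scan in place of an index; the names `cosineSimilarity`, `rankByCosine`, and the toy document ids are made up for the example.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedDoc {
  id: string;
  vector: number[];
}

// Brute-force top-k by cosine similarity (a real system would query an index).
function rankByCosine(
  query: number[],
  docs: EmbeddedDoc[],
  k: number,
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors, invented for illustration only.
const toyDocs: EmbeddedDoc[] = [
  { id: "billing-cycles", vector: [0.9, 0.1, 0.0] },
  { id: "cancel-subscription", vector: [0.7, 0.7, 0.0] },
  { id: "api-keys", vector: [0.0, 0.1, 0.9] },
];
const toyQuery = [1.0, 0.2, 0.0];

// With these made-up numbers the billing document outranks the cancellation
// document, even though both land in the candidate set.
const topTwo = rankByCosine(toyQuery, toyDocs, 2);
```

Both documents are retrieved; only their order is wrong, which is the gap a second pass is meant to close.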

Sending an unranked top-50 to an LLM wastes context and dilutes the answer. The fix is a second pass that reads the query and each candidate together and scores true relevance. That second pass is reranking.

Bi-encoder vs cross-encoder

Embedding models are bi-encoders: query and document are encoded independently, then compared. Rerankers are cross-encoders: query and document are fed through the model jointly, so attention runs across both. Cross-encoders are more accurate per pair but cannot be precomputed, which is why you run them only on a small candidate set.

What a Reranker Does

A reranker takes a query and a list of candidate documents and returns a relevance score for each query-document pair. Unlike an embedding model, it does not produce a reusable vector. It reads both texts at once and outputs a single number, so you call it per candidate at query time.

Qwen3-Reranker is instruction-aware. You can prepend a task description to steer relevance (for example, "rank passages that answer a how-to question" versus "rank passages that define a term"). It does not use Matryoshka Representation Learning, because there is no vector to truncate. Its output is the score itself.
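To make the instruction and the score concrete, here is a hedged sketch of how one query-document pair might be laid out and how a single score can fall out of the model. It assumes, based on the usage example published with the Qwen3-Reranker weights, that the pair is formatted with `<Instruct>`, `<Query>`, and `<Document>` fields and that the score is the probability of a "yes" token versus a "no" token. The surrounding system prompt and chat tokens are omitted, so check the model card for the exact template. The function names are illustrative.

```typescript
// Assumed text layout for one pair; verify against the model card before use.
function formatRerankInput(
  instruction: string,
  query: string,
  document: string,
): string {
  return `<Instruct>: ${instruction}\n<Query>: ${query}\n<Document>: ${document}`;
}

// Convert the "yes" and "no" logits into a relevance score in (0, 1):
// a two-way softmax, keeping the "yes" probability.
function relevanceFromLogits(yesLogit: number, noLogit: number): number {
  const max = Math.max(yesLogit, noLogit); // subtract the max for numerical stability
  const yes = Math.exp(yesLogit - max);
  const no = Math.exp(noLogit - max);
  return yes / (yes + no);
}

// Changing only the instruction changes what "relevant" means for the same pair.
const howToInput = formatRerankInput(
  "Rank passages that answer a how-to question",
  "how do I cancel a subscription mid-cycle",
  "To cancel, open Settings and choose Cancel plan.",
);
```

Equal logits give a score of 0.5; the further the "yes" logit exceeds the "no" logit, the closer the score gets to 1.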

Three properties define where it fits in a pipeline:

Joint encoding

The query and candidate are read together, so the model can match exact intent rather than approximate vector proximity. This is what lifts the right answer to the top.

Bounded cost

A cross-encoder cannot be precomputed, so you run it only on the top-k retrieved candidates (for example 100), not the whole corpus. Cost scales with k, not corpus size.

Single score output

The reranker emits one relevance score per pair, not an embedding. You sort by that score and keep the top results for the LLM.
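The three properties above can be sketched as one small function: score every candidate against the query, sort, keep the top results. This is a minimal sketch in which the scorer is pluggable; `wordOverlap` is a deliberately crude stand-in (not a cross-encoder) so the example runs without a model, and all names are hypothetical.

```typescript
// Any function that reads a query and a document together and returns one number.
type PairScorer = (query: string, document: string) => number;

interface Ranked {
  index: number; // position in the original candidate list
  text: string;
  score: number;
}

// One scorer call per candidate, so cost grows with candidates.length, not corpus size.
function rerank(
  query: string,
  candidates: string[],
  score: PairScorer,
  topN: number,
): Ranked[] {
  return candidates
    .map((text, index) => ({ index, text, score: score(query, text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}

// Stand-in scorer: fraction of query words present in the document.
const wordOverlap: PairScorer = (query, document) => {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return 0;
  const docWords = new Set(document.toLowerCase().split(/\W+/));
  return words.filter((w) => docWords.has(w)).length / words.length;
};

const reorderedDocs = rerank(
  "cancel subscription",
  [
    "Subscription billing cycles explained",
    "How to cancel a subscription",
    "Rotating API keys",
  ],
  wordOverlap,
  2,
);
```

Swapping `wordOverlap` for a call to a real reranker leaves the rest of the function unchanged, which is why the reranker slots into a pipeline as a single scoring step.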

Qwen3-Reranker Sizes

Qwen3-Reranker comes in three sizes. All share a 32K token context length and support over 100 languages, including programming languages. Smaller models are faster and cheaper to serve; larger models are more accurate. The 0.6B model has 28 layers; the 4B and 8B models each have 36.

Model	Parameters	Layers	Context	Latency note
Qwen3-Reranker-0.6B	0.6B	28	32K	Lowest latency; rerank larger candidate sets
Qwen3-Reranker-4B	4B	36	32K	Mid accuracy and cost
Qwen3-Reranker-8B	8B	36	32K	Highest accuracy; rerank smaller top-k

The size you pick is set by your top-k and latency budget. Reranking 100 candidates with the 8B model is heavier than reranking 20 with the 0.6B model. A common pattern is to retrieve 100 with embeddings, rerank with 0.6B for throughput, then optionally re-score the top 10 with 8B for the final order.
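The cascade pattern can be sketched as two sorting passes with different scorers. This is an illustration under stated assumptions: `fastStub` and `preciseStub` are toy functions standing in for a small and a large reranker, and the call counters exist only to show where the cost goes.

```typescript
type Scorer = (query: string, doc: string) => number;

// Score every document once and return them best-first.
function sortByScore(query: string, docs: string[], scorer: Scorer): string[] {
  return docs
    .map((doc) => ({ doc, score: scorer(query, doc) }))
    .sort((a, b) => b.score - a.score)
    .map((x) => x.doc);
}

// Stage 1: cheap scorer over all candidates. Stage 2: expensive scorer over the shortlist.
function cascadeRerank(
  query: string,
  candidates: string[],
  fastScorer: Scorer,
  preciseScorer: Scorer,
  keepAfterFast: number,
  finalTopN: number,
): string[] {
  const shortlist = sortByScore(query, candidates, fastScorer).slice(0, keepAfterFast);
  return sortByScore(query, shortlist, preciseScorer).slice(0, finalTopN);
}

// Toy scorers that count their calls (stand-ins, not real models).
let fastCalls = 0;
let preciseCalls = 0;
const fastStub: Scorer = (_q, doc) => {
  fastCalls++;
  return doc.length; // crude proxy: prefers longer documents
};
const preciseStub: Scorer = (q, doc) => {
  preciseCalls++;
  return doc.split(" ").includes(q) ? 1 : 0; // exact word match
};

const cascadeDocs = ["billing", "cancel a plan", "api", "cancellation policy details"];
const finalOrder = cascadeRerank("cancel", cascadeDocs, fastStub, preciseStub, 3, 1);
```

The expensive scorer runs only `keepAfterFast` times regardless of how many candidates retrieval returned, which is what makes pairing a small and a large reranker affordable.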

Qwen3-Embedding Sizes and Dimensions

Qwen3-Embedding ships in the same three sizes, with output dimensions that grow with parameter count. All sizes share a 32K token context length and support over 100 languages. The embedding model is instruction-aware: user-defined task instructions improve performance by 1% to 5%, per the Qwen3-Embedding-8B model card.

Model	Parameters	Embedding dim	MRL range	Context
Qwen3-Embedding-0.6B	0.6B	1024	32-1024	32K
Qwen3-Embedding-4B	4B	2560	32-2560	32K
Qwen3-Embedding-8B	8B	4096	32-4096	32K
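What "instruction-aware" means in practice for the embedding model is that the query text is prefixed with a task description before encoding. The sketch below assumes the `Instruct: ...` / `Query:` layout shown in the usage example published with the Qwen3-Embedding weights, and assumes documents are encoded without an instruction. Confirm the exact template and whitespace on the model card. The function name and task string are illustrative.

```typescript
// Assumed query template; documents are passed to the encoder as plain text.
function withInstruction(task: string, query: string): string {
  return `Instruct: ${task}\nQuery:${query}`;
}

const retrievalTask =
  "Given a support question, retrieve help-center passages that answer it";

// Only the query side carries the instruction.
const queryInput = withInstruction(
  retrievalTask,
  "how do I cancel a subscription mid-cycle",
);
const documentInput = "To cancel, open Settings and choose Cancel plan.";
```

Because the instruction changes the query vector, the same corpus vectors can serve several tasks, with a different instruction per task at query time.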

Matryoshka Representation Learning (MRL) lets you truncate the output vector to any dimension from 32 up to the model maximum. A 4096-dimensional vector truncated to 512 dimensions needs one-eighth of the storage and searches faster, at some accuracy cost. This is a knob the reranker does not have, since it outputs a score rather than a vector.
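Truncation and the storage arithmetic can be sketched as follows. The sketch assumes vectors are stored as 32-bit floats (4 bytes per value) and re-normalizes after truncation, which matters when the index uses dot product rather than cosine similarity. The function names are illustrative.

```typescript
// Keep the first `dims` values, then rescale to unit length.
function truncateAndNormalize(vector: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims < 1 || dims > vector.length) {
    throw new RangeError(`dims must be an integer between 1 and ${vector.length}`);
  }
  const head = vector.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Bytes needed to store one vector (float32 assumed by default).
function storageBytes(dims: number, bytesPerValue: number = 4): number {
  return dims * bytesPerValue;
}

const fullBytes = storageBytes(4096); // 4096 * 4 = 16384 bytes
const smallBytes = storageBytes(512); // 512 * 4 = 2048 bytes
const storageRatio = fullBytes / smallBytes; // 8

// Small worked example: [3, 4, 12] truncated to 2 dims is [3, 4], rescaled to [0.6, 0.8].
const shortened = truncateAndNormalize([3, 4, 12], 2);
```

At one million documents that is roughly 16.4 GB versus 2 GB of raw vector storage under the float32 assumption, before any index overhead.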

MTEB Scores and License

Qwen3-Embedding-8B ranked No. 1 on the MTEB multilingual leaderboard with a score of 70.58 as of June 5, 2025, per the Qwen3-Embedding-8B model card. MTEB (Massive Text Embedding Benchmark) aggregates retrieval, classification, clustering, and reranking tasks across many languages into one score, so it measures embedding quality across task types, not a single retrieval benchmark.

Both Qwen3-Reranker and Qwen3-Embedding are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. That is a more permissive license than the community licenses attached to some open-weight LLMs. The models were released June 5, 2025 (arXiv 2506.05176).

MTEB measures embeddings, not rerankers directly

The 70.58 MTEB multilingual score is for Qwen3-Embedding-8B, the bi-encoder. A reranker is evaluated on reranking-specific metrics (for example nDCG over a retrieved candidate set), not the full MTEB embedding leaderboard. Treat the embedding score and the reranker as separate measurements in the same pipeline.
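For readers who have not computed nDCG before, here is a minimal sketch of the metric over a ranked candidate list. It uses the exponential-gain form (2^rel - 1) with a log2 position discount, one of two common variants; the relevance labels in the example are invented.

```typescript
// Discounted cumulative gain over the first k positions.
// `relevances[i]` is the graded relevance label of the document at rank i + 1.
function dcgAtK(relevances: number[], k: number): number {
  return relevances
    .slice(0, k)
    .reduce((sum, rel, i) => sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2), 0);
}

// Normalize by the DCG of the best possible ordering of the same labels.
function ndcgAtK(relevances: number[], k: number): number {
  const ideal = [...relevances].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(relevances, k) / idealDcg;
}

// Made-up labels: the one relevant document sits at rank 3 before reranking
// and at rank 1 after.
const beforeRerank = ndcgAtK([0, 0, 1], 3); // (1 / log2(4)) / (1 / log2(2)) = 0.5
const afterRerank = ndcgAtK([1, 0, 0], 3); // 1
```

The same candidate set scores 0.5 or 1 depending only on order, which is why nDCG suits reranker evaluation: it rewards moving the right document up, not merely retrieving it.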

Retrieve-Then-Rerank in RAG

A retrieval-augmented generation (RAG) pipeline has two retrieval stages when you add a reranker. The first stage is recall: cast a wide net cheaply. The second stage is precision: reorder the net's contents accurately.

Stage	Model type	Operates on	Output
1. Retrieve	Embedding (bi-encoder)	Whole corpus via index	Top 100 candidates
2. Rerank	Reranker (cross-encoder)	Top 100 candidates only	Reordered top 5
3. Generate	LLM	Top 5 documents	Grounded answer

The split exists because the two model types have opposite cost profiles. The embedding model is cheap per document and precomputable, so it can scan a corpus, but it is approximate. The reranker is accurate but cannot be precomputed, so you run it only on the top 100 the embedding model surfaced. Each model does the job it is cheap at.

The win is concentrated where it matters: the documents that reach the LLM. Because the reranker only reorders the retrieved candidates, it never changes which documents are in the top 100, but it routinely changes which of them make the top 5, and the top 5 are what the model reads.

How to Use a Reranker

The flow is the same whether you self-host Qwen3 weights or call a hosted API: embed and retrieve candidates, rerank the candidates, keep the top results. Below is the two-stage pattern against an OpenAI-compatible API, which is the shape Morph exposes at api.morphllm.com/v1.

Retrieve-then-rerank with an OpenAI-compatible API

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.MORPH_API_KEY,
  baseURL: "https://api.morphllm.com/v1",
});

// 1. Embed the query (and your corpus, precomputed offline)
const queryEmbedding = await client.embeddings.create({
  model: "morph-embedding-v4", // 1536-dimensional vectors
  input: "how do I cancel a subscription mid-cycle",
});

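// 2. Vector search returns a wide candidate set (top 100).
//    `vectorIndex` is a placeholder for your vector store client; each hit is assumed to expose a `text` field.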
const candidates = await vectorIndex.search(queryEmbedding.data[0].embedding, 100);

// 3. Rerank the candidates by joint query-document relevance
const reranked = await fetch("https://api.morphllm.com/v1/rerank", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.MORPH_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "morph-rerank-v4",
    query: "how do I cancel a subscription mid-cycle",
    documents: candidates.map((c) => c.text),
    top_n: 5,
  }),
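}).then((r) => {
  // Surface HTTP errors instead of parsing an error body as results
  if (!r.ok) throw new Error(`Rerank request failed with status ${r.status}`);
  return r.json();
});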

// 4. Send only the top 5 reranked documents to the LLM
const context = reranked.results.map((r) => candidates[r.index].text);

For self-hosted Qwen3 weights, the pattern is identical: encode the corpus with Qwen3-Embedding (truncating dimensions via MRL if you want smaller vectors), retrieve top-k by cosine similarity, then score each query-document pair with Qwen3-Reranker and sort. The reranker call is the only part that runs per query-candidate pair.

Reranking for Code Search

Code search is a retrieval problem where reranking matters more than usual. A coding agent searching a repository for "where is the rate limit enforced" gets dozens of plausible files: the middleware, the config, the tests, the docs. Embedding similarity surfaces all of them. The reranker decides which one the agent reads first.

Qwen3-Embedding and Qwen3-Reranker both support programming languages directly, so they apply to code as well as prose. Morph runs the same two-stage stack as a managed service for coding agents: morph-embedding-v4 produces 1536-dimensional embeddings, morph-rerank-v4 reorders retrieved results, and morph-warp-grep searches code directly. Reranking is the precision layer that turns a noisy top-100 into a top-5 the agent can act on without re-reading the repository.

Retrieval quality compounds in an agent loop. An agent that reads the right file first saves a turn; one that reads three wrong files first burns context and latency on every step. See AI inference for how Morph serves these models, and the LLM router for routing the generation step to the right model tier.

Tradeoffs

Reranking is not free and not always worth it. Weigh these downsides before adding a second stage.

Latency per query

A cross-encoder runs per query-candidate pair at query time, unlike precomputed embeddings. Reranking 100 candidates with the 8B model adds real latency. Reduce k or use the 0.6B model when throughput dominates.
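A back-of-the-envelope estimate helps when choosing k. The sketch below assumes candidates are scored in fixed-size batches that run one after another, and that you have measured a per-batch latency on your own hardware. The 16-pair batch and 40 ms figure are placeholders, not benchmarks of any Qwen3 model.

```typescript
// Estimated added latency for reranking k candidates in sequential batches.
function estimateRerankMs(k: number, batchSize: number, msPerBatch: number): number {
  return Math.ceil(k / batchSize) * msPerBatch;
}

// Largest k that fits a latency budget under the same assumptions.
function maxCandidatesWithin(
  budgetMs: number,
  batchSize: number,
  msPerBatch: number,
): number {
  return Math.floor(budgetMs / msPerBatch) * batchSize;
}

// Placeholder numbers: 16 pairs per batch, 40 ms per batch.
const addedMs = estimateRerankMs(100, 16, 40); // ceil(100 / 16) = 7 batches, 280 ms
const affordableK = maxCandidatesWithin(200, 16, 40); // floor(200 / 40) = 5 batches, 80 pairs
```

Plugging in measured numbers for each model size turns the choice between "smaller k" and "smaller model" into a direct comparison.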

No precomputation

You cannot precompute reranker scores the way you precompute embeddings, because each score depends on the specific query. A cache helps only when the same query-document pair repeats; every new query pays the full reranking cost over its candidate set.

Diminishing returns at low recall

A reranker only reorders what retrieval already found. If the embedding stage misses the right document entirely (it is not in the top 100), reranking cannot recover it. Fix recall first.
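Checking recall before adding a reranker takes only a small labeled set. This is a minimal sketch, assuming each test query has exactly one known relevant document id; the `LabeledQuery` shape and the sample data are hypothetical.

```typescript
interface LabeledQuery {
  retrievedIds: string[]; // ids returned by the embedding stage, best-first
  relevantId: string; // the document that actually answers the query
}

// Fraction of queries whose relevant document appears in the top k retrieved.
function recallAtK(queries: LabeledQuery[], k: number): number {
  if (queries.length === 0) return 0;
  const hits = queries.filter((q) =>
    q.retrievedIds.slice(0, k).includes(q.relevantId),
  ).length;
  return hits / queries.length;
}

// Invented evaluation set of four queries.
const evalSet: LabeledQuery[] = [
  { retrievedIds: ["a", "b", "c"], relevantId: "a" },
  { retrievedIds: ["d", "e", "f"], relevantId: "f" },
  { retrievedIds: ["g", "h", "i"], relevantId: "h" },
  { retrievedIds: ["j", "k", "l"], relevantId: "z" }, // retrieval missed it entirely
];

const recallAt1 = recallAtK(evalSet, 1); // 1 of 4 = 0.25
const recallAt3 = recallAtK(evalSet, 3); // 3 of 4 = 0.75
```

The gap between recall@1 and recall@3 is the headroom a reranker can capture; the fourth query is out of its reach no matter how good the scorer is.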

Extra moving part

A two-stage pipeline is two models to serve, version, and monitor. For small corpora where the top-5 embedding results are already good, a reranker adds cost without changing the answer.

The rule of thumb: add a reranker when your top-k retrieval reliably contains the right document but in the wrong order, and when your corpus is large enough that the embedding stage has to return a wide net. For a 200-document knowledge base, embeddings alone are often enough. For a large repository or document store, reranking is what makes the top-5 trustworthy.

Frequently Asked Questions

What is a reranker and how is it different from an embedding model?

An embedding model maps a query and each document into a vector independently, then ranks by cosine similarity. It is fast and runs over a whole corpus, but it never sees the query and document together. A reranker reads the query and each candidate jointly and outputs a single relevance score, which is more accurate. The embedding model retrieves candidates; the reranker reorders them.

What sizes does Qwen3-Reranker come in?

Three sizes: 0.6B, 4B, and 8B parameters, all with a 32K token context length and support for over 100 languages including programming languages. The 0.6B model has 28 layers; the 4B and 8B have 36 layers. It was released June 5, 2025 under Apache 2.0.

What dimensions and MTEB score does Qwen3-Embedding have?

Qwen3-Embedding outputs 1024-dimensional vectors at 0.6B, 2560 at 4B, and 4096 at 8B, with Matryoshka Representation Learning supporting any dimension from 32 up to each model's maximum. Qwen3-Embedding-8B ranked No. 1 on the MTEB multilingual leaderboard at 70.58 as of June 5, 2025.

Is Qwen3-Reranker open source?

Yes. Both Qwen3-Reranker and Qwen3-Embedding are released by the Qwen team (Alibaba) under the Apache 2.0 license, which permits commercial use. The weights for all three sizes are available on Hugging Face.

How do you use a reranker in a RAG pipeline?

Retrieve a wide candidate set with an embedding model (for example top 100 by cosine similarity), then pass each query-document pair to the reranker. Keep the highest-scoring documents (for example top 5) for the LLM. You rerank a bounded candidate set, not the whole corpus, so the cost of the second stage scales with the candidate count rather than corpus size.

What is the best reranker model?

It depends on your latency and accuracy budget. Qwen3-Reranker-8B is the most accurate of the open-weight Qwen rerankers; Qwen3-Reranker-0.6B is faster for high-throughput retrieval. For coding agents on a managed API, Morph offers morph-rerank-v4 paired with morph-embedding-v4 (1536-dim) so retrieval and reranking run as one hosted stack.

Does Qwen3-Embedding support variable output dimensions?

Yes, via Matryoshka Representation Learning. You can truncate the output to any dimension from 32 up to the model's maximum (1024, 2560, or 4096 depending on size) to trade storage and search speed against accuracy. Qwen3-Reranker does not use MRL because it outputs a single relevance score per pair, not a vector.

Sharpen Retrieval for Your Coding Agent

Morph ships the two-stage retrieval stack as a managed API: morph-embedding-v4 (1536-dimensional embeddings), morph-rerank-v4 for precision reranking, and morph-warp-grep for code search. OpenAI-compatible at api.morphllm.com. Reranking turns a noisy top-100 into a clean top-5 your agent can act on.
